MapR-Streams Pipeline Visualization

Idea created by john.humphreys on Mar 28, 2018

    MapR Streams already has information about which processes are actively consuming from a topic/consumer group (cursor/assign/list).


    I assume they have similar data for who is producing messages (at least source IPs) even though it's probably not stored as conveniently as it is for consumers.


    It would be really cool and useful if there were an endpoint to display a DAG/graph of the streams in the cluster, so you could see how many applications are putting data into and taking data out of a streams pipeline (e.g. Java producer A writes to topic 1, which is consumed by Spark Streaming App B and Java App C, each of which writes to other streams, etc.). This would be especially useful for operational staff in understanding an application they are managing. Note that I'm not sure you'd be able to actually determine whether something was a Spark Streaming app or a normal Java app, so you might have to show less detail there.
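
    To make the shape of that graph concrete, here is a minimal sketch in Python. It is only an illustration: the (application, topic) records and the function name build_pipeline_graph are made up for this example and do not correspond to any existing MapR API.

```python
from collections import defaultdict

# Hypothetical input: (application, topic) pairs. These record shapes are
# assumptions standing in for whatever the cluster could actually report.
producers = [
    ("java-producer-A", "topic1"),
    ("spark-app-B", "topic2"),
    ("java-app-C", "topic3"),
]
consumers = [
    ("spark-app-B", "topic1"),
    ("java-app-C", "topic1"),
]


def build_pipeline_graph(producers, consumers):
    """Return a directed adjacency map: app -> topics written, topic -> apps reading."""
    edges = defaultdict(set)
    for app, topic in producers:
        edges[("app", app)].add(("topic", topic))
    for app, topic in consumers:
        edges[("topic", topic)].add(("app", app))
    return dict(edges)


graph = build_pipeline_graph(producers, consumers)
```

    Nodes are tagged as "app" or "topic" so an application and a topic with the same name cannot collide. Nothing here distinguishes a Spark Streaming app from a plain Java app; the names are just labels.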


    If any required information is lacking, an extra API method could be provided to allow producers and/or consumers to register themselves against the stream (regularly if necessary). That may require some manual effort by users, but I assume most would be happy to do it for this kind of benefit. The streams are based on MapR-DB, so it would be trivial to store this extra information with them once the idea is well thought out.


    The metadata for any given stream would be pretty light so it shouldn't be particularly challenging to aggregate the data for all the streams to find the connected graphs for usage of all the streams in a cluster.
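
    The aggregation step could be sketched as a plain connected-components pass over all producer/consumer edges, ignoring direction. This is a minimal Python illustration under the assumption that the edges have already been collected as (source, destination) pairs; the function name and sample data are made up.

```python
from collections import defaultdict


def connected_pipelines(edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from this node.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        components.append(component)
    return components


# Hypothetical edges gathered from all streams in a cluster.
edges = [
    ("producer-A", "topic1"), ("topic1", "app-B"), ("app-B", "topic2"),
    ("producer-X", "topic9"), ("topic9", "app-Y"),
]
pipelines = connected_pipelines(edges)
```

    Each resulting set is one independent pipeline. The work is linear in the number of nodes and edges, which fits the point that the per-stream metadata is light.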


    These kinds of features could eventually be enriched to display consumer lag/communication time on the DAG vertexes, along with similar information.
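
    For example, lag could be computed per consumer and attached to the graph. The Python sketch below assumes lag is measured as the latest produced offset minus the consumer's committed offset, summed over partitions; the offset values and names are invented, and the result is keyed by (topic, consumer) here, although it could just as well be drawn on the consumer's vertex.

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Sum per-partition lag; a partition with no committed offset counts from 0."""
    return sum(
        max(latest - committed_offsets.get(partition, 0), 0)
        for partition, latest in latest_offsets.items()
    )


# Hypothetical offsets for topic1: partition -> offset.
latest = {0: 1500, 1: 900}
committed = {
    "spark-app-B": {0: 1450, 1: 900},
    "java-app-C": {0: 1000},  # has not committed anything for partition 1
}

lag_labels = {
    ("topic1", app): consumer_lag(latest, offsets)
    for app, offsets in committed.items()
}
```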