Log compaction seems like a nice feature to have: it reduces the amount of raw log accumulation and keeps a near-latest snapshot of your data source available whenever you want.
Inviting Carol McDonald, MapR Streams expert, to share her knowledge.
I am not sure what you mean by log compaction; MapR Streams are not stored as logs. If you are talking about Kafka compacted topics, there are known bugs with that feature: Pulling the Thread on Kafka's Compacted Topics. You really don't have to worry about compaction with MapR Streams; the storage is different and more efficient. You can set a TTL so that older messages are deleted automatically.
A topic is just metadata in MapR Streams; it does not introduce overhead to normal operations. MapR Streams uses only one data structure for a stream, no matter how many topics it has, and the MapR storage system provides extremely fast and scalable storage for that data.
On the other hand, Kafka represents each topic by at least one directory and several files in a general-purpose file system. The more topics/partitions Kafka has, the more files it creates. This makes it harder to buffer disk operations and perform sequential I/O, and it increases the complexity of what ZooKeeper must manage.
MapR Streams splits partitions into smaller linked objects, called “partitionlets”, that are spread among the nodes in the cluster. As data is written to the cluster, the active partitionlets (those handling new data) are dynamically balanced according to load, minimizing hotspotting.
With MapR Streams messaging, it’s entirely reasonable to save message data for long periods of time for those use cases in which a long-term history is desirable.
How Apache Kafka and MapR Streams Handle Topic Partitions | MapR
Kafka vs. MapR Streams: Why MapR? | MapR
Thanks Carol McDonald for the low-level information on MapR topics. However, my question was more about storage relative to the number of messages per topic. If we have an application that emits millions of raw messages that could otherwise be easily compacted by primary key, eventually we are going to run out of space. It would be nice to have some kind of compaction on those messages based on a key within the message. Kafka log compaction may have implementation issues, but the proposal at least seems intriguing. Another benefit of log compaction is that you can get the latest view of your source data from the streaming platform without having to create/manage it yourself. This is particularly useful with CDC data sources. Just a thought; we are still scratching the surface with streaming.
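To make the idea concrete, here is a minimal sketch of the compaction semantics being asked for: given a stream of keyed records, retain only the latest value per key. This is illustrative only (plain Python, hypothetical record names), not any particular product's implementation.

```python
def compact(records):
    """Return the compacted view of a keyed record stream: last value per key wins."""
    latest = {}
    for key, value in records:
        latest[key] = value  # a newer record for the same key replaces the older one
    return latest

# Three raw records, two of which share the key "user:1".
raw = [("user:1", "alice"), ("user:2", "bob"), ("user:1", "alice-updated")]
compacted = compact(raw)
print(compacted)  # only the latest record per key survives
```

The compacted view is what a CDC consumer would want: the current state of each primary key, without replaying every historical update.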
Thanks, Carol. And Nirav, here is the MapR Streams product page, which contains everything you need to know about MapR Streams.
A really reliable way to store one message per key is to store the data in MapR-DB with either the HBase or JSON API. If you set the number of versions to keep to 1 (the default), then only one value per row key will be stored. You could read from MapR Streams and write to MapR-DB. With this architecture you could also have multiple views of the same data, which is exactly what Liaison does; you can read about that here: How Stream-First Architecture Patterns Are Revolutionizing Healthcare Platforms | MapR
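The stream-to-table pattern above can be sketched roughly as follows. Plain dicts stand in for MapR-DB tables here (emulating max versions = 1, i.e. one value per row key); in a real deployment you would consume from the stream with the Kafka API and write via the HBase or OJAI/JSON API. The event field names are hypothetical.

```python
def upsert_view(events, table, key_field):
    """Upsert each change event into a table keyed by key_field.
    Last write per key wins, mirroring a table that keeps only one version."""
    for event in events:
        table[event[key_field]] = event
    return table

# A small CDC-style event stream; the second "p1" event is an update.
events = [
    {"id": "p1", "price": 10, "region": "EU"},
    {"id": "p2", "price": 7,  "region": "US"},
    {"id": "p1", "price": 12, "region": "EU"},
]

# Two independent materialized views built from the same stream.
by_id = upsert_view(events, {}, "id")        # latest state per primary key
by_region = {}
for e in events:
    by_region.setdefault(e["region"], {})[e["id"]] = e  # latest state, grouped by region

print(by_id["p1"]["price"])  # reflects the most recent update for p1
```

The point of the design is that the stream remains the single source of truth, while each view is just a disposable, rebuildable projection of it.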