Hey all, I am reviewing the documentation around Kafka Connect for MapR 5.2 under MapR Streams Clients and Tools. I have some questions, suggestions, and concerns about this documentation, but first let me say that if there is better documentation that I am just missing, please let me know.
Reading this documentation as someone who lives in multiple worlds (not just a MapR-centric world), I find it confusing. I am not trying to be critical. I consider myself fairly savvy on MapR things, yet in my initial review I found myself asking "Does this mean X or Y?" on a number of subjects, and I would imagine it could get cloudier for others, perhaps newcomers, reviewing the documentation.
On Common Worker Configuration Options there is discussion of the config item bootstrap.servers. I "believe", based on my usage of librdkafka, that for MapR we would leave this blank and instead specify topics in the form /path/to/stream:topic. However, neither the bootstrap.servers description nor the HDFS Example: Publish to MapR-FS example mentions this. I don't see how someone who is trying to use Streams would read this documentation and be successful on the first attempt. Unless there is some magic in the MapR version of Kafka Connect that just accepts localhost:9092 (the default for bootstrap.servers) and ignores it, how would one approach this?
My suggestion here would be to review each of the configuration pages, outline how each setting works in Kafka, and then explain how someone would tweak it to be useful in MapR Streams. Take the bootstrap.servers setting: "If you are connecting to an Apache Kafka cluster, these are the hosts and ports of your brokers. In MapR Streams there are no brokers; instead, set this to '' and specify your topics using the MapR Streams format of /path/to/stream:topic." That would allow someone reading this documentation to understand both how Kafka Connect works and how MapR Streams works. (This is a simple example, but there are lots of settings under the Clients and Tools pages that need some tweaking to that effect.)
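To make the suggestion concrete, here is a rough Python sketch of the kind of side-by-side example I mean. It assumes my reading above is right: that the MapR client does not need bootstrap.servers and that topics are given as /path/to/stream:topic. The helper names and the sample broker addresses are made up for illustration and do not come from the MapR docs.

```python
def split_stream_topic(name):
    """Split a MapR Streams topic name '/path/to/stream:topic' into (stream, topic).

    Raises ValueError if the name does not look like that format.
    """
    stream, sep, topic = name.rpartition(":")
    if not sep or not stream.startswith("/") or not topic:
        raise ValueError("expected /path/to/stream:topic, got %r" % name)
    return stream, topic


def worker_settings(target, brokers=""):
    """Return illustrative worker settings for 'kafka' or 'mapr' (hypothetical helper)."""
    if target == "kafka":
        # Apache Kafka: hosts and ports of the brokers.
        return {"bootstrap.servers": brokers}
    if target == "mapr":
        # Assumption from this post: MapR Streams has no brokers, so leave this empty.
        return {"bootstrap.servers": ""}
    raise ValueError("unknown target: %r" % target)


if __name__ == "__main__":
    # Sample broker names are placeholders, not real hosts.
    print(worker_settings("kafka", "broker1:9092,broker2:9092"))
    print(worker_settings("mapr"))
    print(split_stream_topic("/path/to/stream:topic"))
```

Even a short pairing like this in the docs would tell a reader right away which value changes between a Kafka cluster and a MapR cluster.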
Given that this is documentation for MapR, I find some of the docs here a bit too focused on Apache Kafka and HDFS, which is very confusing. For example, under HDFS Configuration Options, I believe many of the settings work for both HDFS and MapR-FS export. For the settings where MapR differs, such as hdfs.url, the docs should also provide a MapR-FS example with the IP of the CLDB as the variable. That would help people who have MapR clusters (I am guessing a majority of the people who read the documentation) apply these tools to their MapR environment without having to reference another documentation page or look it up on Google.
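As an illustration of the paired example I have in mind for hdfs.url, here is a small Python sketch that prints an HDFS variant and a MapR-FS variant of the same connector settings. Everything specific in it is an assumption on my part: the host names are placeholders, the ports (8020 for a NameNode, 7222 for a CLDB) are just the commonly used defaults, and I have not confirmed that the MapR build accepts a maprfs:// URL with a CLDB host in hdfs.url. The helper functions are hypothetical.

```python
def hdfs_url(filesystem, host, port):
    """Build an illustrative hdfs.url value (hypothetical helper).

    filesystem: 'hdfs' for an Apache Hadoop NameNode, 'maprfs' for a MapR CLDB node.
    """
    if filesystem not in ("hdfs", "maprfs"):
        raise ValueError("unknown filesystem: %r" % filesystem)
    return "%s://%s:%d" % (filesystem, host, port)


def to_properties(settings):
    """Render a dict as key=value lines, sorted by key, like a .properties file."""
    return "\n".join("%s=%s" % (key, value) for key, value in sorted(settings.items()))


if __name__ == "__main__":
    # Apache Kafka + HDFS variant (placeholder host, assumed default port).
    print(to_properties({
        "hdfs.url": hdfs_url("hdfs", "namenode.example.com", 8020),
        "topics": "my-topic",
    }))
    print()
    # MapR Streams + MapR-FS variant (placeholder CLDB host, assumed default port).
    print(to_properties({
        "hdfs.url": hdfs_url("maprfs", "cldb.example.com", 7222),
        "topics": "/path/to/stream:topic",
    }))
```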
In addition to specific changes around MapR-related items, I think that in general, seeing examples of settings and what they do would be very useful to folks trying to use these tools. Take format.class, for example. The description is:
"The format class used when writing data to HDFS." Type: String, Default: io.confluent.connect.hdfs.avro.AvroFormat
OK, so what other options are there? Is there a Parquet output? There is a page that mentions it (Architecture of Kafka Connect), but how do I do that? What about the other options and the other examples? The same applies to many of the other configuration settings. As a user reviewing the documentation, I find it a bit dense when I don't know all the things that could be done here. The documentation should be the place that helps me (and users in general) understand that and see the power of both MapR Streams (specifically) and the tools that MapR has put time and effort into developing to work with Streams.