StreamSets is a general-purpose dataflow management system. Kafka Connect was designed specifically for Apache Kafka: one endpoint of every Kafka Connect connector is always Kafka, and the other endpoint is another data system.
Both Kafka Connect and StreamSets Data Collector are open source, Apache-licensed tools that can help you get event streams in and out of Apache Kafka/MapR Streams and build data pipelines.
Both Kafka Connect and StreamSets Data Collector have advantages and disadvantages. As I could not find a comparison of Kafka Connect versus StreamSets Data Collector anywhere, I decided to write a short one!
1. Below are a few advantages and disadvantages of Kafka Connect and its connectors, based not only on its documentation but also on my experience using a few of its connectors.
1.1 A few advantages of Kafka Connect:
- The Kafka Connect framework has been included in Apache Kafka since release 0.9 and is also included in MapR Streams. It is not an additional tool to install and manage.
- Kafka Connect is tightly integrated with Kafka as a full streaming data platform, not just as a messaging system. You can build end-to-end streaming data applications by combining Kafka Core to store event streams, Kafka Connect to import/export event streams and Kafka Streams (a lightweight Java library) to process your event streams.
- The Kafka Connect runtime offers automatic balancing of the work, auto-failover, dynamic scaling (up or down), fault tolerance, built-in offset management for auto-recovery, and more.
- It comes with many ready-to-use connectors to stream events in and out of Kafka/MapR Streams with no code, only simple configuration.
- When running Kafka Connect in distributed mode, you simply use a REST API to create, modify and destroy connectors.
- Kafka Connect integrates with the schema registry to capture schema information from sources if present.
- It abstracts away the serialization format used in Kafka, so there is no need to rewrite the same connector to support a different format.
- It also comes with SMTs (Single Message Transforms) since Kafka 0.10.2.
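As a rough illustration of the "no code, only configuration" and REST API points above, the sketch below builds the request that would create a connector on a Kafka Connect worker running in distributed mode. The worker address (`http://localhost:8083`), the connector name, the file path, the topic and the values added by the transforms are all assumptions made for this example. The FileStream source connector is used only because it ships with Kafka; it is not meant for production.

```python
import json
import urllib.request

# Assumed address of a Kafka Connect worker running in distributed mode.
CONNECT_URL = "http://localhost:8083"


def build_connector_payload(name, file_path, topic):
    """Return the JSON-serializable body used to create a connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": file_path,
            "topic": topic,
            # Two Single Message Transforms chained together:
            # wrap each line in a map, then add a static field to it.
            "transforms": "MakeMap,InsertSource",
            "transforms.MakeMap.type": "org.apache.kafka.connect.transforms.HoistField$Value",
            "transforms.MakeMap.field": "line",
            "transforms.InsertSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
            "transforms.InsertSource.static.field": "data_source",
            "transforms.InsertSource.static.value": "demo-file",
        },
    }


def build_create_request(base_url, payload):
    """Build (but do not send) the POST /connectors request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    payload = build_connector_payload("demo-file-source", "/tmp/demo.txt", "demo-topic")
    request = build_create_request(CONNECT_URL, payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
    # To submit it against a running worker, uncomment:
    # urllib.request.urlopen(request)
```

The request is only built, not sent, so the sketch can be read and run without a Kafka Connect worker available.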
1.2 A few disadvantages of Kafka Connect:
- It has no GUI for graphically building a data flow, unlike StreamSets or NiFi, although some UI support for configuration and monitoring is available from third parties: Kafka Connect UI from Landoop and Confluent Control Center from Confluent.
- You still need to build your own tool for managing and monitoring Kafka Connect connectors or rely on commercial tools for that purpose.
- Kafka Connect connectors for different applications or data systems are not maintained within the main Apache Kafka code base. Although many are available as open source and are certified, maintained and supported by a few vendors, many other connectors might lack features, might not be kept up to date or might not be ready for prime time.
- Some connectors have limitations. For example: the FileStream connector is not recommended for use in production. The JDBC connector does not handle deletes. The HDFS sink connector has limited JSON support and stops processing when it hits a message it cannot deal with, as it doesn't support a "Dead Letter Queue" as in other queuing systems.
- Kafka Connect currently feels more like a “bag of tools” than a packaged solution, at least without purchasing support from vendors.
- Potential users still need to know which Kafka Connect connectors have been proven in real-world applications. Only a few companies (The Hyve, Pandora Media and WePay) have publicly shared their experiences in blog posts. Confluent claims in their 2017 Kafka Report an increase in the popularity of Kafka Connect over the last year, especially for connecting to databases.
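To make the "Dead Letter Queue" idea mentioned above more concrete, here is a minimal sketch of the pattern in plain Python. It is not Kafka Connect code and does not use any connector API: the function name, the in-memory lists and the sample messages are hypothetical, chosen only to show how a bad message can be set aside instead of stopping the whole flow.

```python
import json


def process_with_dead_letter_queue(raw_messages, handler):
    """Apply handler to each message; park failures instead of stopping."""
    processed = []
    dead_letters = []
    for raw in raw_messages:
        try:
            processed.append(handler(raw))
        except Exception as error:  # deliberately broad for this sketch
            # Keep the original message and the reason it failed,
            # so it can be inspected or replayed later.
            dead_letters.append({"message": raw, "error": str(error)})
    return processed, dead_letters


if __name__ == "__main__":
    messages = ['{"id": 1}', "not json", '{"id": 2}']
    ok, failed = process_with_dead_letter_queue(messages, json.loads)
    print(len(ok), "processed;", len(failed), "sent to the dead letter queue")
```

In a real deployment the dead letters would typically go to a separate topic or queue rather than a list in memory.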
2. Below are a few advantages and disadvantages of StreamSets Data Collector, based on my reading of its documentation.
2.1 A few advantages of StreamSets Data Collector:
- It is GUI-based, and flows can be set up with no hand coding.
- Fine-grained data lineage and provenance for every data record
- A holistic view of your end-to-end data pipeline
- Out of the box connectors for many data systems
- Besides MapR Streams, which is Kafka 0.9 API compatible, StreamSets Data Collector integrates with other components of the MapR CDP (Converged Data Platform): MapR-FS and MapR-DB.
- It does not require Kafka: it can be used with other message queues such as Kinesis or RabbitMQ, or with no message queue at all, since it can load directly into MapR-FS or HDFS (and many other destinations as well).
- Integrates with Confluent Schema Registry (for schema registration and lookups, anywhere you can use Avro, not just with Kafka).
2.2 A few disadvantages of StreamSets Data Collector:
- StreamSets is an additional tool to install, learn, manage and monitor.
- To run in distributed mode for scalability, it requires additional tools (YARN and Spark Streaming).
- It is not a good fit for very low latency use cases, as it relies on Spark Streaming, which is a micro-batch streaming framework.
- It is not clear from its documentation what guarantees it provides for reliability, delivery semantics and fault-tolerance.
- It lacks a few connectors compared to Kafka Connect, such as: some Change Data Capture (CDC) connectors (IBM DB2, PostgreSQL); SAP; Splunk; mainframe; some IoT connectors (Azure, CoAP); in-memory database (IMDB) connectors (Ignite, Hazelcast); JMX; and a few more. Obviously, both StreamSets and Kafka Connect will keep adding new connectors.
- A separate commercial license is required for its management tool, Dataflow Performance Manager.
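The low-latency concern about micro-batching listed above can be illustrated with a little arithmetic. The sketch below uses a deliberately simplified model, which is my own assumption and not a description of how StreamSets or Spark Streaming work internally: an event is only handled once the fixed-length batch it arrived in closes, and scheduling and processing time are ignored. The function name and the 1000 ms interval are hypothetical.

```python
def micro_batch_wait_ms(arrival_ms, batch_interval_ms):
    """Time an event waits until its micro-batch closes (simplified model)."""
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    # The batch containing the event closes at the next interval boundary.
    batch_end_ms = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
    return batch_end_ms - arrival_ms


if __name__ == "__main__":
    for arrival in (0, 250, 999, 1250):
        print(arrival, micro_batch_wait_ms(arrival, 1000))
```

Under this model an event can wait up to one full batch interval before any processing starts, which is why the batch interval puts a floor on end-to-end latency.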
I invite you to comment below on anything you think is incorrect, add whatever advantages or disadvantages I might have missed for both StreamSets and Kafka Connect or share your real-world experience with both tools. Let's start a discussion!
Kafka Connect documentation
StreamSets Data Collector documentation