My use case is: as files are created or modified in an Amazon S3 bucket, I want to move them to a MapR Stream as soon as possible. Ideally I'd like to include the S3 keypath, so a consumer of the stream could be idempotent with respect to each keypath.
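To make the idempotency requirement concrete, here is a minimal sketch of the kind of message and consumer I have in mind. The field names (`key`, `etag`, `last_modified`) and the `IdempotentConsumer` class are hypothetical, made up for illustration; the only assumption is that each stream message carries the S3 key path plus something that identifies the object version, such as its ETag.

```python
import json


def make_message(key, etag, last_modified_iso):
    """Serialize one S3 change event (hypothetical payload layout)."""
    return json.dumps(
        {"key": key, "etag": etag, "last_modified": last_modified_iso},
        sort_keys=True,
    )


class IdempotentConsumer:
    """Applies each (key, etag) version at most once, even if redelivered."""

    def __init__(self):
        self._seen = {}    # key path -> ETag of the last version applied
        self.applied = []  # record of what was actually processed

    def handle(self, raw_message):
        msg = json.loads(raw_message)
        key, etag = msg["key"], msg["etag"]
        if self._seen.get(key) == etag:
            return False   # duplicate delivery of the same version: skip
        self._seen[key] = etag
        self.applied.append((key, etag))
        return True
```

With this shape, a producer that occasionally re-sends the same file does no harm, which is why I would like the key path in every message.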
StreamSets has an S3 origin, but I have found it to be unreliable. In my tests it stopped reading files, or delivered files late, without logging anything. Also, in StreamSets, errors that occur at the S3 origin stage are not sent to the pipeline's error record destination.
Another problem with StreamSets: the above test was done reading the S3 files in "Lexicographically Ascending Key Names" order. This could miss data if a file with a "lower" key name is updated after the reader has already moved past it. StreamSets has a mode that reads the files in LastModified order, but the performance is bad: for a large bucket, it read only 8000 files in 5 days.
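For reference, this is roughly the LastModified-based polling I am describing, as a sketch rather than a working solution. It assumes a boto3 S3 client is passed in and relies on the `Key` and `LastModified` fields that `list_objects_v2` returns; the function names are my own.

```python
from datetime import datetime, timezone


def iter_listing(s3_client, bucket, prefix=""):
    """Yield every object entry under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj


def changed_since(objects, checkpoint):
    """Return objects modified at or after the checkpoint, oldest first.

    The comparison is inclusive because LastModified has coarse
    resolution, so objects sharing the checkpoint timestamp are
    re-emitted rather than risk being skipped.
    """
    changed = [o for o in objects if o["LastModified"] >= checkpoint]
    changed.sort(key=lambda o: (o["LastModified"], o["Key"]))
    return changed


def next_checkpoint(changed, checkpoint):
    """Advance the checkpoint to the newest timestamp seen, if any."""
    return max((o["LastModified"] for o in changed), default=checkpoint)
```

Two caveats: the inclusive comparison produces duplicates, so it only works with a consumer that is idempotent per key path, and every poll still lists the whole bucket or prefix because the listing API cannot filter by modification time. That full listing is probably why this approach is slow on large buckets.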
I was looking at this MapR blog post: Data Systems that Integrate with MapR-ES via Kafka Connect, which lists S3 among the data systems supported through Kafka Connect on MapR. The article suggests searching GitHub, and this is what I found:
kafka-connect-s3 by Spredfast - An open source project which purports to support this use case, but unfortunately it requires Kafka 0.10+, and MapR 5.2 is based on Kafka 0.9.0.0.
kafka-connect-storage-cloud by Confluent - Appears to only support Kafka -> S3, not the other way around.
Has anyone encountered this use case and found a solution? Or more generally, is anyone aware of tools that efficiently poll S3 for recently updated files? Judging by this GitHub issue, it is not a use case that S3 supports well.