We're in the process of migrating a legacy batch-based system (Spark) to a streaming system (also Spark).
- The old system ingests very large data files into a columnar database using Sqoop.
- The new system should ingest the same data into OpenTSDB running on top of MapR-DB.
For business reasons, our first step must be to shut down the legacy database and replace it with OpenTSDB.
How can I ingest data from a batch-oriented Spark job into OpenTSDB?
- Sqoop doesn't support OpenTSDB.
- If I call the OpenTSDB REST API from a Spark batch job, the same values can be written multiple times from different executors (e.g. when a task is retried or speculative execution runs a duplicate attempt), and I don't think I can prevent that.
- Anything else I can think of (e.g. invoking an external app to ingest the results) would be failure-prone, and it would be hard to verify that it ran successfully or to reschedule it on failure.
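For context, the REST-based approach from the second bullet might look roughly like the sketch below: map each partition's rows to OpenTSDB `/api/put` JSON datapoints and POST them in batches from within `foreachPartition`. The endpoint URL, metric name, tag, and row field names are all hypothetical placeholders, not anything from our actual schema. One possibly mitigating detail: an OpenTSDB put is keyed by (metric, tags, timestamp), so a retried task that re-sends the same datapoints should overwrite the same cells rather than append new ones, which may make the duplicate-write problem less severe than it first appears.

```python
import json
from typing import Dict, Iterable, List
from urllib import request

# Hypothetical endpoint and batch size -- adjust for your cluster.
OPENTSDB_URL = "http://opentsdb:4242/api/put"
BATCH_SIZE = 50

def to_datapoint(row: Dict) -> Dict:
    """Map one record to the OpenTSDB /api/put JSON shape.

    Metric name, tags, and the row keys ("ts", "value", "source")
    are placeholders for whatever the real schema provides.
    """
    return {
        "metric": "legacy.table.value",   # hypothetical metric name
        "timestamp": int(row["ts"]),      # Unix epoch seconds (or millis)
        "value": float(row["value"]),
        "tags": {"source": str(row["source"])},
    }

def send_partition(rows: Iterable[Dict], url: str = OPENTSDB_URL) -> int:
    """Intended for use inside foreachPartition: POST datapoints in batches.

    An HTTP error raises, which fails the Spark task and triggers a retry;
    re-sent datapoints target the same (metric, tags, timestamp) cells.
    """
    sent = 0
    batch: List[Dict] = []

    def flush() -> None:
        nonlocal sent
        if not batch:
            return
        body = json.dumps(batch).encode("utf-8")
        req = request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        request.urlopen(req)
        sent += len(batch)
        batch.clear()

    for row in rows:
        batch.append(to_datapoint(row))
        if len(batch) >= BATCH_SIZE:
            flush()
    flush()
    return sent
```

In PySpark this would be wired up with something like `df.rdd.foreachPartition(lambda it: send_partition(r.asDict() for r in it))`, so each executor posts only its own partition. This is a sketch of the approach I'm worried about, not a solution to the retry question.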