Replicating Drifting Data - Going Beyond the Basics of Big Data Ingest - Atlanta Hadoop Users Group - GA - April 6, 2017

Document created by cwarman on Feb 13, 2017. Last modified by cwarman on Feb 13, 2017.
Version 2


Date: April 6, 2017
Time: 7:00 pm ET
Registration Link: 



Data drift, the gradual morphing of data structure and semantics, is a fact of life in enterprise IT. New requirements force schema changes, the meaning of database columns changes over time, and infrastructure upgrades add new fields to log files. Left unchecked, drift in data sources can cause applications and dataflows to fail, with costly downtime and, in the worst case, corruption in downstream data stores.


In this session, we'll start by looking at how we can deal with the problem of drift, focusing on the concrete example of replicating a relational database into Hive. We'll then examine some alternative approaches using open source tools such as Sqoop, NiFi and StreamSets Data Collector. Finally, we'll build a simple data pipeline to read the relational schema, create equivalent Hive tables, and then continuously ingest data from the relational database to Hive, altering the Hive schema as columns are added to the source tables.
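To make the final step above concrete: when a column appears in a source table, the pipeline must emit matching Hive DDL before new rows can land. Below is a minimal sketch of that schema-reconciliation step, in Python, using hypothetical column lists; a real pipeline (e.g. one built in StreamSets Data Collector) would read these from JDBC metadata and the Hive metastore rather than hard-coded lists.

```python
# Sketch: detect columns added to a source table and generate the
# Hive DDL needed to keep the target table in sync. Column lists and
# names here are illustrative, not a real API.

def reconcile_schema(table, source_cols, hive_cols):
    """Return an ALTER TABLE statement for columns present in the
    source but missing from Hive, or None if the schemas match."""
    existing = {name for name, _ in hive_cols}
    added = [(n, t) for n, t in source_cols if n not in existing]
    if not added:
        return None
    col_defs = ", ".join(f"{n} {t}" for n, t in added)
    return f"ALTER TABLE {table} ADD COLUMNS ({col_defs})"

# Example: the source gained a 'loyalty_tier' column since the last sync.
source = [("id", "INT"), ("name", "STRING"), ("loyalty_tier", "STRING")]
hive = [("id", "INT"), ("name", "STRING")]
print(reconcile_schema("customers", source, hive))
# -> ALTER TABLE customers ADD COLUMNS (loyalty_tier STRING)
```

Executing the generated DDL against Hive (Hive supports ALTER TABLE ... ADD COLUMNS) lets ingestion continue without manual intervention, which is the behavior the session will demonstrate end to end.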



Pat Patterson has been working with Internet technologies since 1997, building software and communities at Sun Microsystems, Huawei, Salesforce and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As a developer evangelist at Salesforce, Pat focused on identity, integration and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community.


Have a burning question? Ask Craig Warman