[MapR Talk] How Fake Data Can Solve Real Problems and Enhance Security

Document created by aalvarez on Dec 1, 2015. Last modified by aalvarez on Dec 7, 2015.
Version 3


In an effort to help grow organic communities interested in new technologies, MapR speakers around the world provide technical talks on numerous topics. Please browse our MapR Talks Directory to learn how to search for and request a talk in seconds.



Open source is great, but to work it has to be developed in the open. Privacy is great, but things have to be kept private.


So what happens when you find a bug in open source software that only happens when you run it with private data? How do you file the bug? Build a test case?


Or what happens when you have a killer machine learning system, but can't prove its worth to your potential customers because you aren't allowed to see their confidential data?


The best answer is often to make fake data that looks real enough to exhibit all of the bug-making or algorithm-confounding properties of the real data. That is, you have to make fake data that seems so real that it fools the bug. To do that, you may need some pretty elaborate sleight of math.


I will describe log-synth, an open-source program for generating realistic fake data. Log-synth can make up names and addresses, or sample from realistically perverse numerical distributions. You can build data sets that join cleanly but have long-tailed frequency distributions. You can build fairly realistic session histories. And if log-synth won't do what you need, it is very easy to extend. I will also describe physics-based approaches for emulating sensor data.
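To make the idea of "data sets that join cleanly but have long-tailed frequency distributions" concrete, here is a minimal sketch in plain Python. It does not use log-synth or its schema format. The function names, field names, table sizes, and the skew parameter are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

def make_dimension(n_keys, rng):
    """Build a small dimension table: key -> fake attributes (hypothetical fields)."""
    regions = ["north", "south", "east", "west"]
    return {k: {"id": k, "region": rng.choice(regions)} for k in range(n_keys)}

def make_facts(n_rows, n_keys, rng, skew=1.2):
    """Draw foreign keys with Zipf-like weights 1/rank**skew so a few keys dominate."""
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    keys = rng.choices(range(n_keys), weights=weights, k=n_rows)
    # Log-normal amounts give a second, numerical long tail.
    return [{"key": k, "amount": round(rng.lognormvariate(3.0, 1.0), 2)} for k in keys]

rng = random.Random(42)          # fixed seed so the fake data is reproducible
dimension = make_dimension(50, rng)
facts = make_facts(10_000, 50, rng)

# Every fact key exists in the dimension table, so the join never drops a row.
joined = [{**fact, **dimension[fact["key"]]} for fact in facts]
counts = Counter(fact["key"] for fact in facts)
```

Because every foreign key is drawn from the same key range as the dimension table, the join is clean by construction, while the weighted sampling concentrates most rows on a handful of keys.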


The first use of log-synth was to demonstrate a bug in Hive where joining 10 billion facts against 32 dimensions caused the query optimizer to fail. I will describe what happened and how we found and fixed the bug by generating fake (but realistic) data to take the place of the customer's highly confidential dataset.


Another use was the emulation of a merchant compromise scenario, which allowed open source development and testing of an algorithm that later worked without change on live data. That algorithm, by the way, found the bad guys.
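A merchant compromise scenario can be emulated along the following lines. This is only a toy sketch under stated assumptions, not the algorithm from the talk: the user and merchant counts, the fraud probabilities, and the simple per-merchant fraud-rate comparison are all hypothetical stand-ins chosen to keep the example short.

```python
import random
from collections import defaultdict

def simulate(n_users, n_merchants, visits_per_user, bad_merchant, rng,
             base_fraud=0.01, compromised_fraud=0.30):
    """Return (visits, fraud): visits[user] is a set of merchants, fraud is a set of users."""
    visits, fraud = {}, set()
    for user in range(n_users):
        seen = {rng.randrange(n_merchants) for _ in range(visits_per_user)}
        visits[user] = seen
        # Users who shopped at the compromised merchant are far more likely to see fraud.
        p = compromised_fraud if bad_merchant in seen else base_fraud
        if rng.random() < p:
            fraud.add(user)
    return visits, fraud

def fraud_rate_by_merchant(visits, fraud):
    """Fraction of each merchant's visitors who later reported fraud."""
    visitors = defaultdict(int)
    victims = defaultdict(int)
    for user, seen in visits.items():
        for merchant in seen:
            visitors[merchant] += 1
            if user in fraud:
                victims[merchant] += 1
    return {m: victims[m] / visitors[m] for m in visitors}

rng = random.Random(7)           # fixed seed for a reproducible scenario
visits, fraud = simulate(5000, 100, 5, bad_merchant=17, rng=rng)
rates = fraud_rate_by_merchant(visits, fraud)
suspect = max(rates, key=rates.get)   # merchant with the highest visitor fraud rate
```

Since the simulation knows which merchant was compromised, a detector can be developed and scored in the open without ever touching real transaction histories.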


KEYWORDS: Data Science, Algorithms, Machine Learning


Location Availability & Request Link

North America. Please refer to the MapR Talks Directory for specific countries.

You can request this talk here: Speaker Request


Related Resources

Find all Meetups and Events resources.

Find all MapR Talks: mapr talk

Learn more about MapR Talks and how to book a speaker: Meetup and Event Organizers Resources