The Trouble with Kappa Architecture

Blog Post created by MichaelSegel on Apr 4, 2017

On LinkedIn, I wrote the following:

Kappa Architecture is the next 'great thing' to be misunderstood and misapplied. No wonder 'Big Data' projects fail.

After talking with a couple of friends I thought I should explain why I said this...


Jay Kreps wrote an article questioning the Lambda Architecture and proposing an alternative. The article can be found here. At LinkedIn, Jay built infrastructure using Kafka and Samza so that you didn't need a separate persistence layer storing the data in some form of database.


The issue isn't with what Jay was saying. For certain applications, this architecture makes sense. In the MapR world, you can replace Kafka / Samza with MapR Streams. The issue is that the Kappa Architecture isn't a good solution for all, or even many, of the problems where one wants fast data ingestion.


As a solutions architect, you have to ask yourself a couple of questions prior to choosing a tool. In this case, 'Where is the data coming from (and what is the data)?' and 'How are we going to use the data?' While attending a couple of local meetups in Chicago, I noticed from some of the comments that there was a general acceptance that Kappa was the way to go, no questions asked. This is very dangerous groupthink, and it is one of the main reasons Big Data projects fail.

Why Not Kappa

Suppose you're building a Data Lake and, rather than perform batch updates from your legacy systems, you intend to use a tool like IBM's CDC or Oracle's GoldenGate software to perform log shipping. For each transaction performed on a table, you capture the log information, publish it to a Kafka topic, and ship it off to your lake.

Since this information contains the latest state of the data row, you can easily just persist the message via Samza and your ingestion is complete.


But here's the problem... when it comes time to use the data, you have to walk through the results. It's a sequential scan.
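To make the cost concrete, here is a minimal sketch in Python. The message layout (id, ts, op, row) and the current_state helper are assumptions made up for illustration; they are not the actual format produced by IBM's CDC or GoldenGate. The point is only that, with messages stored in arrival order, answering "what does row 1 look like now?" means reading the whole log.

```python
# Hypothetical CDC messages, stored in the order they arrived.
# Field names are illustrative, not a real CDC wire format.
cdc_log = [
    {"id": 1, "ts": 100, "op": "INSERT", "row": {"name": "Ann", "city": "Chicago"}},
    {"id": 2, "ts": 101, "op": "INSERT", "row": {"name": "Bob", "city": "Boston"}},
    {"id": 1, "ts": 105, "op": "UPDATE", "row": {"name": "Ann", "city": "Denver"}},
    {"id": 2, "ts": 110, "op": "DELETE", "row": None},
]


def current_state(log, record_id):
    """Return (latest row or None, number of messages examined).

    There is no index on record_id, so every message must be read:
    a later message for the same record could always override an earlier one.
    """
    latest = None
    examined = 0
    for msg in log:
        examined += 1
        if msg["id"] == record_id:
            latest = msg["row"]
    return latest, examined


row, examined = current_state(cdc_log, 1)
print(row, examined)
```

Every lookup touches all four messages here; on a real log the same lookup touches every message ever shipped for that table.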

At the same time, you're still treating Hadoop as a relational model.


In this use case, you will want to still persist the CDC record for the table as well as apply the changes to the table. This allows you to retain a snapshot that matches the underlying legacy table, as well as the historical changes to the row.
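As a rough sketch of that dual write, assuming the same made-up message layout (id, ts, op, row) and a hypothetical ingest function, each inbound message is both appended to a history and applied to a keyed snapshot. In-memory Python structures stand in for whatever storage you would actually use.

```python
def ingest(msg, history, snapshot):
    """Persist the raw CDC record and apply it to the current-state table.

    history  : list of every CDC message, kept for the change trail
    snapshot : dict keyed by record id, mirroring the legacy table
    """
    history.append(msg)
    if msg["op"] == "DELETE":
        snapshot.pop(msg["id"], None)
    else:
        # INSERT and UPDATE both carry the latest state of the row.
        snapshot[msg["id"]] = msg["row"]


# Hypothetical inbound messages; field names are illustrative only.
messages = [
    {"id": 1, "ts": 100, "op": "INSERT", "row": {"name": "Ann", "city": "Chicago"}},
    {"id": 2, "ts": 101, "op": "INSERT", "row": {"name": "Bob", "city": "Boston"}},
    {"id": 1, "ts": 105, "op": "UPDATE", "row": {"name": "Ann", "city": "Denver"}},
    {"id": 2, "ts": 110, "op": "DELETE", "row": None},
]

history, snapshot = [], {}
for msg in messages:
    ingest(msg, history, snapshot)

print(snapshot)      # current rows, looked up by key
print(len(history))  # every change is still available
```

The snapshot answers "what is the row now?" with a key lookup, while the history still answers "how did it get there?".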

However, you will still take a performance hit when you try to use the data. Hadoop isn't a relational platform. It's hierarchical. So to get the most from the data, you will need to transform the data from a relational model to a hierarchical model, then persist. (Think MapR-DB JSON tables.) In this use case, you're storing your data in multiple formats during the ingestion stage. Your initial storage of the transaction could follow Kappa; however, it's less than optimal. You would want to store the data as a record, co-locating the changes in time order per record ID, not in order of inbound messages.
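One way to picture that layout, as a sketch only: group the messages into one document per record ID, with the changes nested inside and sorted by timestamp. The field names (_id, changes, ts, op, row) and the collocate function are assumptions for illustration, not a MapR-DB schema or API.

```python
def collocate(messages):
    """Build one hierarchical document per record id.

    Each document nests that record's changes, ordered by timestamp
    rather than by the order in which the messages arrived.
    """
    docs = {}
    for msg in messages:
        doc = docs.setdefault(msg["id"], {"_id": msg["id"], "changes": []})
        doc["changes"].append({"ts": msg["ts"], "op": msg["op"], "row": msg["row"]})
    for doc in docs.values():
        doc["changes"].sort(key=lambda change: change["ts"])
    return docs


# Hypothetical messages, deliberately arriving out of time order.
inbound = [
    {"id": 1, "ts": 105, "op": "UPDATE", "row": {"city": "Denver"}},
    {"id": 2, "ts": 101, "op": "INSERT", "row": {"city": "Boston"}},
    {"id": 1, "ts": 100, "op": "INSERT", "row": {"city": "Chicago"}},
]

docs = collocate(inbound)
for doc in docs.values():
    print(doc["_id"], [change["ts"] for change in doc["changes"]])
```

Reading the full history of one record is then a single document fetch instead of a scan across the inbound stream.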


What does this mean?

Note that this is just one example of a use case where, if you attempt to implement Kappa, you will end up with a big fail. It's not one application or use case, but a class of use cases. While Kappa makes sense if you're processing logs, which is itself a class of use cases, it's not fit as a general solution.


The key here is to understand what each tool does and how to use it effectively, rather than blindly choose a tool because it's the next great thing. As always, Caveat Emptor. ;-)



Editor's note: The views, opinions and positions expressed by the authors and those providing comments on these blogs are theirs alone, and do not necessarily reflect the views, opinions or positions of MapR.