Can anyone help me understand complex structured file formats like Parquet? Where are they used? Any examples?
For OP: There are many use cases for "complex" data - depending on what that means to you. To me it means data that contains non-scalar datatypes as values, such as maps and arrays, and that (may) allow arbitrary nesting of these datatypes (an array of maps, or a map of arrays). You can probably imagine use cases for data that is shaped differently than a two-dimensional table of rows and columns ...
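To make that concrete, here is a tiny sketch in plain Python (the field names are made up for illustration) of an array of maps, a map of arrays, and a record that nests both - the kind of values that don't fit a flat row/column layout but that Parquet's MAP and ARRAY types can represent:

```python
import json

# An array of maps: each order line is a map, the whole value is an array.
order_lines = [
    {"sku": "A-100", "qty": 2},
    {"sku": "B-200", "qty": 1},
]

# A map of arrays: one key per day, each value a list of readings.
readings = {
    "mon": [20.1, 20.4],
    "tue": [19.8],
}

# Arbitrary nesting: a record whose fields are themselves non-scalar.
record = {"order_id": 42, "lines": order_lines, "temps": readings}

# Round-trip through JSON, which models exactly this kind of nesting.
decoded = json.loads(json.dumps(record))
print(decoded["lines"][1]["sku"])  # B-200
```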
With respect to parquet, this is a topic I have been curious about.
I am interested in the possibilities of flexible-ish schemas with Parquet. I see it supports both MAP and ARRAY types - the MAP type does not require map keys to be declared up front, if I understand it correctly.
I do need to try it myself, but haven't had time yet, and would like to know: can Drill deal with Parquet maps and arrays the same way it does with JSON maps and arrays?
Also, can anyone suggest any command-line tools similar to "avro-tools.jar" that can be used to convert JSON to Parquet or otherwise WRITE Parquet files? There is "parquet-tools" as part of the parquet-mr project on GitHub, but it only reads; it doesn't seem to write Parquet.
I know Drill can CTAS Parquet, but I want something lower-level and lighter than Drill, and it seems not to exist - I suspect there is a good reason no such tool is readily available, but I am not sure why.
Thanks! And sorry for hijacking this thread, if thats what I'm doing.
Structured files provide more flexibility and speed than "normal" files would. For example, instead of storing data in TSV (tab-separated values) or CSV (comma-separated values) files, formats like Parquet allow for compression and column-level data access. These files are also compressed in such a way that they can be "split" for parallel processing; a gzip file, by contrast, can't be split up and must be read by a single process instead of being broken into pieces and processed in parallel. The speed gain from this is significant.
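A toy illustration of the column-level access point, in plain Python (this is not real Parquet, just a sketch of why columnar layout helps - the column names and values are made up):

```python
# Toy row-oriented vs column-oriented layouts. In a columnar format, a query
# over one column never has to touch the bytes of the other columns.
rows = [
    {"name": "a", "qty": 3, "price": 1.5},
    {"name": "b", "qty": 7, "price": 2.0},
    {"name": "c", "qty": 2, "price": 9.9},
]

# Row-oriented (CSV/TSV-like): every record must be scanned even if we only
# want one field.
total_row_oriented = sum(r["qty"] for r in rows)

# Column-oriented (Parquet-like): each column is stored together, so summing
# "qty" reads only that column's contiguous values.
columns = {
    "name": [r["name"] for r in rows],
    "qty": [r["qty"] for r in rows],
    "price": [r["price"] for r in rows],
}
total_column_oriented = sum(columns["qty"])

print(total_row_oriented, total_column_oriented)  # 12 12
```

Storing similar values together is also what makes Parquet's per-column compression and encoding so effective.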
Could you elaborate more on it?
I guess I'd need to know what part you'd like more information on? Here is the documentation on the Parquet format itself: https://parquet.apache.org/documentation/latest/
This is great information.
You might want to check out the Kite SDK (Kite: A Data API for Hadoop) for a lightweight option for converting existing data sets to Parquet.