How to build and use parquet-tools to read parquet files

Document created by Hao Zhu on Feb 18, 2016

Author: Hao Zhu

Original Publication Date: February 23, 2015

Goal:

How to build and use parquet-tools to read parquet files.

Solution:

1. Download and install Maven.

Follow the link below: http://maven.apache.org/download.cgi
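The later steps call git, mvn and java, so it can help to confirm that all three are on the PATH before starting. A minimal sketch, assuming a POSIX shell; the function name require_tools is made up for this example:

```shell
# Hypothetical helper: report whether each named tool is on the PATH.
# Returns 0 if all tools are found, 1 if any is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: require_tools git mvn java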

2. Download the parquet source code

git clone https://github.com/Parquet/parquet-mr.git

3. Build the parquet-tools.

cd parquet-mr/parquet-tools/
mvn clean package -Plocal

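The resulting jar is written to target/parquet-tools-<version>.jar (the commands below use parquet-tools-1.6.1-SNAPSHOT.jar). Note that the build may fail with an error such as the following: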

Failure to find com.twitter:parquet-hadoop:jar:1.6.0rc3-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository

This happens because pom.xml points to version 1.6.0rc3-SNAPSHOT, but that version does not exist in https://oss.sonatype.org/content/repositories/snapshots/com/twitter/parquet-hadoop/ . The fix is to change the version in both parquet-mr/pom.xml and parquet-mr/parquet-tools/pom.xml to a valid one, for example: <version>1.6.1-SNAPSHOT</version>
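The version change can also be scripted instead of edited by hand. The sketch below assumes the old version appears literally as <version>1.6.0rc3-SNAPSHOT</version> in both pom files; the function name replace_version is made up for this example, so check the result with git diff before building.

```shell
# Hypothetical helper: replace one literal <version> value with another
# in each given pom file. Assumes the version strings contain no "|" or "&".
replace_version() {
  old="$1"; new="$2"; shift 2
  # Escape dots so they match literally in the sed pattern.
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  for pom in "$@"; do
    sed "s|<version>${old_re}</version>|<version>${new}</version>|" "$pom" > "$pom.tmp" \
      && mv "$pom.tmp" "$pom"
  done
}
```

Usage, from the directory that contains parquet-mr: replace_version 1.6.0rc3-SNAPSHOT 1.6.1-SNAPSHOT parquet-mr/pom.xml parquet-mr/parquet-tools/pom.xml

Because the error says the failure was cached in the local repository, re-running the build with mvn -U clean package -Plocal may be needed so that Maven checks the remote repository again.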

4. Show help manual

cd target 
java -jar parquet-tools-1.6.1-SNAPSHOT.jar --help
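Every command in the following steps repeats the same java -jar prefix. A small wrapper can shorten that; this is only a convenience sketch, and the names pq, PARQUET_TOOLS_JAR and JAVA_BIN are invented here, not part of parquet-tools.

```shell
# Hypothetical wrapper around the jar built in step 3.
# Override PARQUET_TOOLS_JAR if the jar lives elsewhere or has another version.
PARQUET_TOOLS_JAR="${PARQUET_TOOLS_JAR:-./parquet-tools-1.6.1-SNAPSHOT.jar}"
JAVA_BIN="${JAVA_BIN:-java}"

pq() {
  # Pass all arguments through unchanged, e.g. "pq schema /tmp/nation.parquet".
  "$JAVA_BIN" -jar "$PARQUET_TOOLS_JAR" "$@"
}
```

Usage: pq --help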

5. Dump the schema

Take the sample file nation.parquet as an example.

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet

message root {
  required int64 N_NATIONKEY;
  required binary N_NAME (UTF8);
  required int64 N_REGIONKEY;
  required binary N_COMMENT (UTF8);
}
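If only the column names are needed, the schema output can be filtered. A minimal sketch, assuming a flat schema like the one above where each field line has the form "repetition type name"; nested groups are not handled, and the function name schema_columns is made up for this example.

```shell
# Hypothetical filter: read "schema" output on stdin, print one column name per line.
schema_columns() {
  awk '
    $1 == "required" || $1 == "optional" || $1 == "repeated" {
      name = $3
      sub(/;$/, "", name)   # drop the trailing semicolon when there is no annotation
      print name
    }
  '
}
```

Usage: java -jar parquet-tools-1.6.1-SNAPSHOT.jar schema /tmp/nation.parquet | schema_columns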

6. Read the data

 

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar cat /tmp/nation.parquet

N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f

(... ...)

7. Read first n records

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet

N_NATIONKEY = 0
N_NAME = ALGERIA
N_REGIONKEY = 0
N_COMMENT = haggle. carefully f

N_NATIONKEY = 1
N_NAME = ARGENTINA
N_REGIONKEY = 1
N_COMMENT = al foxes promise sly

N_NATIONKEY = 2
N_NAME = BRAZIL
N_REGIONKEY = 1
N_COMMENT = y alongside of the p
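The "name = value" records printed by cat and head can be turned into one tab-separated row per record for use with other tools. A minimal sketch, assuming records are separated by a blank line and every line has the form "name = value"; the function name records_to_tsv is made up for this example.

```shell
# Hypothetical filter: read cat/head output on stdin, print one TSV row per record.
records_to_tsv() {
  awk '
    # A blank line ends the current record.
    /^[[:space:]]*$/ { if (n) print row; n = 0; row = ""; next }
    {
      # Split on the first " = " only, so values that contain "=" are kept whole.
      i = index($0, " = ")
      if (i == 0) next
      val = substr($0, i + 3)
      row = (n++ ? row "\t" : "") val
    }
    END { if (n) print row }
  '
}
```

Usage: java -jar parquet-tools-1.6.1-SNAPSHOT.jar head -n3 /tmp/nation.parquet | records_to_tsv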

8. Show meta info

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet

file: file:/tmp/nation.parquet
creator: parquet-mr

file schema: root
--------------------------------------------------------------------------------
N_NATIONKEY: REQUIRED INT64 R:0 D:0
N_NAME: REQUIRED BINARY O:UTF8 R:0 D:0
N_REGIONKEY: REQUIRED INT64 R:0 D:0
N_COMMENT: REQUIRED BINARY O:UTF8 R:0 D:0

row group 1: RC:25 TS:1352 OFFSET:4
--------------------------------------------------------------------------------
N_NATIONKEY: INT64 SNAPPY DO:0 FPO:4 SZ:130/219/1.68 VC:25 ENC:PLAIN,BIT_PACKED
N_NAME: BINARY SNAPPY DO:0 FPO:134 SZ:267/296/1.11 VC:25 ENC:PLAIN,BIT_PACKED
N_REGIONKEY: INT64 SNAPPY DO:0 FPO:401 SZ:79/218/2.76 VC:25 ENC:PLAIN,BIT_PACKED
N_COMMENT: BINARY SNAPPY DO:0 FPO:480 SZ:468/619/1.32 VC:25 ENC:PLAIN,BIT_PACKED
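In the sample output, the SZ field reads as compressed size / uncompressed size / ratio: for N_NATIONKEY, 219 / 130 is about 1.68, and the four uncompressed sizes (219 + 296 + 218 + 619) add up to the row group's TS:1352. Under that reading, the totals can be summed from the meta output. This is a sketch only, and the function name meta_sizes is made up for this example.

```shell
# Hypothetical filter: read "meta" output on stdin and sum the first two
# numbers of every SZ:a/b/c field (assumed to be compressed/uncompressed bytes).
meta_sizes() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^SZ:[0-9]+\/[0-9]+\//) {
          split(substr($i, 4), p, "/")
          comp += p[1]
          raw  += p[2]
        }
      }
    }
    END { printf "compressed=%d uncompressed=%d\n", comp, raw }
  '
}
```

Usage: java -jar parquet-tools-1.6.1-SNAPSHOT.jar meta /tmp/nation.parquet | meta_sizes

For the sample file this prints compressed=944 uncompressed=1352.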

9. Dump all data

Note: Values are in column format.

# java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet

INT64 N_NATIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1
value 3: R:0 D:0 V:2
(...)

BINARY N_NAME
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:ALGERIA
value 2: R:0 D:0 V:ARGENTINA
value 3: R:0 D:0 V:BRAZIL
(...)

INT64 N_REGIONKEY
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V:0
value 2: R:0 D:0 V:1
value 3: R:0 D:0 V:1
(...)

BINARY N_COMMENT
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 25 ***
value 1: R:0 D:0 V: haggle. carefully f
value 2: R:0 D:0 V:al foxes promise sly
value 3: R:0 D:0 V:y alongside of the p
(...)
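To pull the values of a single column out of the dump output, the section headers ("TYPE NAME") and the "value N: R:.. D:.. V:.." lines can be matched with awk. A minimal sketch, assuming the layout shown above; the function name column_values is made up for this example.

```shell
# Hypothetical filter: read "dump" output on stdin, print the values of one column.
# $1 is the column name as it appears in the section header.
column_values() {
  awk -v col="$1" '
    # A two-word header such as "BINARY N_NAME" starts a new column section.
    NF == 2 && $1 ~ /^[A-Z][A-Z0-9_]*$/ { current = $2; next }
    # Inside the wanted section, print everything after the first " V:".
    current == col && /^value [0-9]+:/ {
      i = index($0, " V:")
      if (i) print substr($0, i + 3)
    }
  '
}
```

Usage: java -jar parquet-tools-1.6.1-SNAPSHOT.jar dump --disable-meta /tmp/nation.parquet | column_values N_NAME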
