Spark fails to parse a json object with multiple lines

Document created by Hao Zhu Employee on Feb 7, 2016
Version 1Show Document
  • View in full screen mode

Author: Hao Zhu

 

Original Publication Date: June 12, 2015

 

Environment :

Spark 1.3.1

Symptom

Spark fails to parse a json object with multiple lines.

This issue can happen when either creating a DataFrame using:

val people = sqlContext.jsonFile(path)

or creating a temp table using Spark SQL:

CREATE TEMPORARY TABLE jsonTable2 
USING org.apache.spark.sql.json
OPTIONS (
   path "/xxx/test2.json"
);

Sample error message is:

java.lang.RuntimeException: Failed to parse record   "array" : [ {. 
Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects.

Root Cause

As mentioned in Spark Documentation:Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

Solution

Convert the json object from multiple lines to a single line.

 

For example, convert below json object:

 

{

  "array" : [ {

  "count" : "site1",

  "sitename" : "sitename1"

  }, {

  "count" : "site2",

  "sitename" : "sitename2"

  } ]

}

to below:

{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

After that, Spark can successfully parse it. For example:

 

spark-sql> CREATE TEMPORARY TABLE jsonTable

  > USING org.apache.spark.sql.json

  > OPTIONS (

  > path "/xxx/test_spark.json"

  > )

  > ;

Time taken: 3.738 seconds

 

spark-sql> select * from jsonTable ;

[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]

Time taken: 1.4 seconds, Fetched 1 row(s)

 

spark-sql> select array[0].count,array[0].sitename from jsonTable;

site1 sitename1

Time taken: 0.184 seconds, Fetched 1 row(s)

It is also possible to put multiple single-line json objects into one file.

For example:

 

# cat test_spark_multiple.json

{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }

The query result would be:

 

spark-sql> CREATE TEMPORARY TABLE jsonTable3

  > USING org.apache.spark.sql.json

  > OPTIONS (

  > path "/xxx/test_spark_multiple.json"

  > );

Time taken: 0.234 seconds

 

spark-sql> select * from jsonTable3 ;

[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]

[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]

[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]

[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]

[{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}]

Time taken: 0.153 seconds, Fetched 5 row(s)

Attachments

    Outcomes