Read multi-line JSON in Apache Spark

I tried to use the JSON file as a small DB. After creating the template table in the DataFrame, I requested it using SQL and got an exception. Here is my code:

    val df = sqlCtx.read.json("/path/to/user.json")
    df.registerTempTable("user_tt")
    val info = sqlCtx.sql("SELECT name FROM user_tt")
    info.show()

df.printSchema() result:

    root
     |-- _corrupt_record: string (nullable = true)

My JSON file:

 { "id": 1, "name": "Morty", "age": 21 } 

Exception:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record]; 

How can I fix this?

UPD

_corrupt_record is

    +--------------------+
    |     _corrupt_record|
    +--------------------+
    |                   {|
    |            "id": 1,|
    |    "name": "Morty",|
    |           "age": 21|
    |                   }|
    +--------------------+

UPD2

This is strange, but when I rewrite my JSON as a one-liner, everything works fine.

 {"id": 1, "name": "Morty", "age": 21} 

So the problem is the newlines.

UPD3

I found the following sentence in the docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

I cannot save my JSON in that format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?
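(For reference, one naive pre-processing sketch in plain Scala, assuming none of the string values contain embedded newlines; the output path user.jsonl is just an illustrative name:)

    import java.io.PrintWriter
    import scala.io.Source

    // Collapse a multi-line JSON document to a single line so that
    // Spark's line-oriented JSON reader can parse it. This breaks
    // if any string value itself contains a newline.
    val src = Source.fromFile("/path/to/user.json")
    val oneLine =
      try src.getLines().map(_.trim).mkString(" ")
      finally src.close()
    new PrintWriter("/path/to/user.jsonl") { write(oneLine); close() }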

+13
2 answers

Spark >= 2.2

Spark 2.2 introduced the multiLine option (initially called wholeFile), which can be used to load JSON files (as opposed to JSONL):

    spark.read
      .option("multiLine", true)
      .option("mode", "PERMISSIVE")
      .json("/path/to/user.json")
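With multiLine enabled, the file from the question should infer proper columns instead of _corrupt_record (Spark sorts inferred JSON fields alphabetically), roughly:

    val df = spark.read
      .option("multiLine", true)
      .json("/path/to/user.json")

    df.printSchema()
    // root
    //  |-- age: long (nullable = true)
    //  |-- id: long (nullable = true)
    //  |-- name: string (nullable = true)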

See:

  • SPARK-18352 - Parse regular multi-line JSON files (not just JSON lines).
  • SPARK-20980 - Rename the wholeFile option to multiLine for JSON and CSV.

Spark < 2.2

Well, using JSONL-formatted data may be inconvenient, but I will argue that this is not a problem with the API but with the format itself. JSON is simply not designed for parallel processing in distributed systems.

It provides no schema, and without making very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format to use in systems like Apache Spark. It is also quite tricky and usually impractical to write valid JSON from distributed systems.

However, if the individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:

    spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)
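Applied to the question's setup, a minimal end-to-end sketch (assuming Spark 2.0/2.1 and a SparkSession named spark; the RDD[String] overload of json was deprecated in 2.2 in favor of Dataset[String]):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("multiline-json").getOrCreate()
    val sc = spark.sparkContext

    // wholeTextFiles yields (path, content) pairs; each file's full
    // content is one string, so the multi-line document stays intact
    // instead of being split on newlines.
    val df = spark.read.json(sc.wholeTextFiles("/path/to/user.json").values)

    df.createOrReplaceTempView("user_tt")
    spark.sql("SELECT name FROM user_tt").show()
    // +-----+
    // | name|
    // +-----+
    // |Morty|
    // +-----+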
+31

Just to add to zero323's answer: the option in Spark 2.2+ for reading multi-line JSON was renamed to multiLine (see the Spark documentation).

Therefore, the correct syntax is now:

    spark.read
      .option("multiLine", true)
      .option("mode", "PERMISSIVE")
      .json("/path/to/user.json")

This change was made in SPARK-20980: https://issues.apache.org/jira/browse/SPARK-20980 .

+5
