SparkSQL has an excellent trick up its sleeve: it will read your Parquet data and correctly pick up the schema from the Parquet metadata. Better yet: if your data is partitioned using a key=value directory layout, SparkSQL automatically recurses through the directory structure, reading each value as a column named key. The documentation on this, along with a pretty clear example, is here.
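For context, here is a minimal sketch of the key=value case as I understand it; the path layout, SQLContext setup, and column names are only illustrative, assuming the Spark 1.4-style DataFrameReader API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical layout that SparkSQL's partition discovery expects:
//   /data/events/year=2015/month=04/day=29/part-00000.gz.parquet
object KeyValuePartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-discovery"))
    val sqlContext = new SQLContext(sc)

    // Point at the root: SparkSQL walks the key=value directories and
    // exposes year, month and day as extra columns alongside the data.
    val events = sqlContext.read.parquet("/data/events")
    events.printSchema()
    events.filter("year = 2015 AND month = 4 AND day = 29").show()
  }
}
```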
Unfortunately, my data is partitioned in a way that works well for Cascading but does not seem to match SparkSQL's expectations:
2015/
└── 04
├── 29
│ └── part-00000-00002-r-00000.gz.parquet
└── 30
└── part-00000-00001-r-00000.gz.parquet
In Cascading, I can set up a PartitionTap, declare that the first three path elements are year, month, and day, and I'm off to the races. But I can't figure out how to achieve a similar effect in SparkSQL. Can I do any of the following:
- Just ignore the partitioning: point at the Parquet data recursively and read everything that's found? (I know I could roll my own code using the Hadoop FileSystem API, but I'd rather not.) A rough sketch of what I have in mind is below, after this list.
- Specify a partial schema, e.g. "the columns are year (int), month (int) and day (int), plus infer the rest from the Parquet files"?
- Specify the whole schema?
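To make the first and last options concrete, here is roughly what I mean; this is only a sketch with hypothetical paths and field names, assuming a Spark 1.4-style DataFrameReader that accepts glob paths, and it does not recover year/month/day as columns:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

// Both helpers assume an existing SQLContext; the path and field names
// are illustrative only.
def readIgnoringPartitions(sqlContext: SQLContext) = {
  // Option 1: glob over the date directories and read every part file found.
  sqlContext.read.parquet("hdfs:///logs/2015/*/*")
}

def readWithFullSchema(sqlContext: SQLContext) = {
  // Option 3: spell out the entire schema rather than relying on inference.
  val schema = StructType(Seq(
    StructField("user_id", LongType),
    StructField("event", StringType)
  ))
  sqlContext.read.schema(schema).parquet("hdfs:///logs/2015/*/*")
}
```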