Spark and Parquet: reading partitioned data

SparkSQL has a nice trick up its sleeve: it will read your Parquet data and pick up the schema correctly from the Parquet metadata. Better still: if your data is partitioned using a key=value directory layout, SparkSQL automatically recurses through the directory structure and exposes each value as a column named key. The documentation on this, along with a fairly clear example, is here.
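For concreteness, here is roughly the behaviour I mean, as a minimal sketch (assuming Spark 1.4+ with the DataFrameReader API, an existing SparkContext sc, and a made-up hdfs:///events root laid out in key=value style):

// Layout SparkSQL understands out of the box (hypothetical example):
//   hdfs:///events/year=2015/month=04/day=29/part-00000.gz.parquet
//   hdfs:///events/year=2015/month=04/day=30/part-00000.gz.parquet
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext
val df = sqlContext.read.parquet("hdfs:///events")

df.printSchema()                      // year, month, day show up as columns
df.filter("year = 2015 AND month = 4 AND day = 30").show()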

Unfortunately, my data is partitioned in a way that works well for Cascading but doesn't seem to match SparkSQL's expectations:

2015/
└── 04
    ├── 29
    │   └── part-00000-00002-r-00000.gz.parquet
    └── 30
        └── part-00000-00001-r-00000.gz.parquet

In Cascading, I can point a PartitionTap at this directory, say that the first three path elements are year, month and day, and I'm off to the races. But I can't figure out how to achieve a similar effect in SparkSQL. Can I do any of the following:

  • Just ignore the partitioning; point SparkSQL at the Parquet data and read everything it finds (a rough sketch of what I have in mind follows after this list). (I know I could roll my own code using the Hadoop FileSystem API, but I'd rather not.)
  • Specify a partial schema - for example, "the columns year (int), month (int) and day (int), plus infer the rest from the data"?
  • Specify the full schema?
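
To make the first option concrete, here is a rough sketch of the sort of thing I'd settle for (again assuming Spark 1.4+ and reusing the sqlContext from above; hdfs:///data stands in for my real root directory), including the by-hand workaround of gluing the path components back on as literal columns:

import org.apache.spark.sql.functions.lit

// Option 1: glob over every leaf directory and simply lose year/month/day.
val everything = sqlContext.read.parquet("hdfs:///data/*/*/*")

// Manual workaround: read each day separately, re-attach the path
// components as literal columns, then union the pieces back together.
val d29 = sqlContext.read.parquet("hdfs:///data/2015/04/29")
  .withColumn("year", lit(2015)).withColumn("month", lit(4)).withColumn("day", lit(29))
val d30 = sqlContext.read.parquet("hdfs:///data/2015/04/30")
  .withColumn("year", lit(2015)).withColumn("month", lit(4)).withColumn("day", lit(30))
val events = d29.unionAll(d30)

Obviously that doesn't scale past a couple of days, which is why I'm hoping SparkSQL has something built in.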

