How to select all columns that start with a common label

I have a DataFrame in Spark 1.6 and want to select only a few columns from it. Column Names:

colA, colB, colC, colD, colE, colF-0, colF-1, colF-2 

I know that I can do this to select specific columns:

 df.select("colA", "colB", "colE") 

but how can I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this like in Pandas?

1 answer

First get the column names with df.columns , then filter them down to the names you want with .filter(_.startsWith("colF")) . This gives you an array of strings, but select takes select(String, String*) . Fortunately, there is also an overload select(Column*) , so convert the strings to Columns with .map(df(_)) and finally turn the Array of Columns into varargs with : _* .

 df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show 
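For reference, here is the same chain broken into intermediate steps (a sketch, assuming a DataFrame df with the columns from the question), which makes the type at each stage explicit:

 import org.apache.spark.sql.{Column, DataFrame}

 def selectColF(df: DataFrame): DataFrame = {
   val allNames: Array[String] = df.columns                             // all column names
   val fNames: Array[String]   = allNames.filter(_.startsWith("colF"))  // Array("colF-0", "colF-1", "colF-2")
   val fCols: Array[Column]    = fNames.map(df(_))                      // strings -> Column objects
   df.select(fCols: _*)                                                 // expand the array as varargs
 }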

The filter can be made more complex (just like in Pandas). It does, however, get rather ugly (IMO):

 df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 
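One way to keep that readable (a sketch, with illustrative names) is to pull the predicate out into a named function and pass it to filter:

 // `wanted` is an illustrative name; any String => Boolean predicate works here.
 val wanted: String => Boolean = name => name == "colA" || name.startsWith("colF")

 df.select(df.columns.filter(wanted).map(df(_)) : _*).show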

If the list of other columns is fixed, you can also combine a fixed array of column names with the filtered array.

 df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show 
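If you need this in several places, the same idea can be wrapped in a small helper (a sketch; selectFixedPlusPrefix, fixed and prefix are made-up names, not part of any Spark API):

 import org.apache.spark.sql.DataFrame

 // Select a fixed list of columns plus every column whose name starts with `prefix`.
 def selectFixedPlusPrefix(df: DataFrame, fixed: Seq[String], prefix: String): DataFrame =
   df.select((fixed ++ df.columns.filter(_.startsWith(prefix))).map(df(_)) : _*)

 // Usage: selectFixedPlusPrefix(df, Seq("colA", "colB"), "colF-").show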