How to select all columns that start with a common label

I have a DataFrame in Spark 1.6 and want to select only a few columns from it. Column Names:

colA, colB, colC, colD, colE, colF-0, colF-1, colF-2 

I know that I can do this to select specific columns:

 df.select("colA", "colB", "colE") 

but how can I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this like in Pandas?

1 answer

First get the column names with df.columns , then filter them down to the names you want with .filter(_.startsWith("colF")) . This gives you an array of strings, but select takes select(String, String*) . Fortunately, there is also an overload select(Column*) , so convert the strings to Columns with .map(df(_)) and finally turn the Array of Columns into varargs with : _* .

 df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show 
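For reference, here is the same chain broken into intermediate steps (a sketch, assuming a DataFrame df with the columns from the question), which makes the type at each stage explicit:

 import org.apache.spark.sql.{Column, DataFrame}

 def selectColF(df: DataFrame): DataFrame = {
   val allNames: Array[String] = df.columns                             // all column names
   val fNames: Array[String]   = allNames.filter(_.startsWith("colF"))  // Array("colF-0", "colF-1", "colF-2")
   val fCols: Array[Column]    = fNames.map(df(_))                      // strings -> Column objects
   df.select(fCols: _*)                                                 // expand the array as varargs
 }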

The filter can be made more complex (just like in Pandas). It does, however, get rather ugly (IMO):

 df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show 
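One way to keep that readable (a sketch, with illustrative names) is to pull the predicate out into a named function and pass it to filter:

 // `wanted` is an illustrative name; any String => Boolean predicate works here.
 val wanted: String => Boolean = name => name == "colA" || name.startsWith("colF")

 df.select(df.columns.filter(wanted).map(df(_)) : _*).show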

If the list of other columns is fixed, you can also combine a fixed array of column names with the filtered array.

 df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show 
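If you need this in several places, the same idea can be wrapped in a small helper (a sketch; selectFixedPlusPrefix, fixed and prefix are made-up names, not part of any Spark API):

 import org.apache.spark.sql.DataFrame

 // Select a fixed list of columns plus every column whose name starts with `prefix`.
 def selectFixedPlusPrefix(df: DataFrame, fixed: Seq[String], prefix: String): DataFrame =
   df.select((fixed ++ df.columns.filter(_.startsWith(prefix))).map(df(_)) : _*)

 // Usage: selectFixedPlusPrefix(df, Seq("colA", "colB"), "colF-").show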