I think this will work if you use where(array_contains(...)). Here is my result:
```
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame

scala> import org.apache.spark.sql.functions.array_contains
import org.apache.spark.sql.functions.array_contains

scala> def testData(sc: SparkContext): DataFrame = {
     |   val stringRDD = sc.parallelize(Seq(
     |     """{ "name": "ned", "tags": ["blue", "big", "private"] }""",
     |     """{ "name": "albert", "tags": ["private", "lumpy"] }""",
     |     """{ "name": "zed", "tags": ["big", "private", "square"] }""",
     |     """{ "name": "jed", "tags": ["green", "small", "round"] }""",
     |     """{ "name": "ed", "tags": ["red", "private"] }""",
     |     """{ "name": "fred", "tags": ["public", "blue"] }"""))
     |   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
     |   import sqlContext.implicits._
     |   sqlContext.read.json(stringRDD)
     | }
testData: (sc: org.apache.spark.SparkContext)org.apache.spark.sql.DataFrame

scala> val df = testData(sc)
df: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> val report = df.select("*").where(array_contains(df("tags"), "private"))
report: org.apache.spark.sql.DataFrame = [name: string, tags: array<string>]

scala> report.show
+------+--------------------+
|  name|                tags|
+------+--------------------+
|   ned|[blue, big, private]|
|albert|    [private, lumpy]|
|   zed|[big, private, sq...|
|    ed|      [red, private]|
+------+--------------------+
```
Note that it works if you write where(array_contains(df("tags"), "private")), but if you write where(df("tags").array_contains("private")) (more directly analogous to what you originally wrote), it fails with: array_contains is not a member of org.apache.spark.sql.Column. Looking at the source code for Column, I see there is code to handle contains (constructing a Contains instance for it), but no array_contains. Perhaps that is an oversight.
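To make the distinction concrete, here is a minimal self-contained sketch of the working function form. It assumes a current Spark build with SparkSession (the original transcript used SQLContext on an older Spark); the local master setting and the toDF-based sample data are my additions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array_contains

// Hypothetical local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-contains-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("ned",  Seq("blue", "big", "private")),
  ("fred", Seq("public", "blue"))
).toDF("name", "tags")

// Function form: array_contains lives in org.apache.spark.sql.functions
// and takes the Column plus the value to look for.
val report = df.where(array_contains(df("tags"), "private"))

// Method form would not compile: Column defines contains (substring match
// on strings) but no array_contains method.
// df.where(df("tags").array_contains("private"))  // does not compile

report.show()
```

The key point is simply that array_contains is a standalone function applied to a Column, not a method on Column itself.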
Robert Dodier