Spark SQL vs DataFrame queries

To work effectively with Spark, I am wondering whether it is better to run SQL queries via SQLContext, or to build queries using DataFrame functions such as df.select().

Any idea? :)

+7
performance apache-spark apache-spark-sql spark-dataframe
3 answers

There is no difference in performance. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it comes down to personal preference.

  • DataFrame queries are arguably much easier to construct programmatically and provide a minimal degree of type safety.

  • Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any change in every supported language. With a HiveContext, they can also expose functionality that may not be accessible in other ways (for example, UDFs without Spark wrappers).
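To make the equivalence concrete, here is a sketch assuming an existing 1.x-era `sqlContext` and a DataFrame `df` with `name` and `age` columns (the table name "people" is illustrative):

```scala
// Register the DataFrame so it is queryable by name.
df.registerTempTable("people")

// SQL style
val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 21")

// DataFrame style
val viaDf = df.filter(df("age") > 21).select("name")

// Both compile down to the same Catalyst plan, which explain() makes visible.
viaSql.explain()
viaDf.explain()
```

Printing the physical plans with `explain()` is an easy way to convince yourself that the two styles hit the same execution engine.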

+12

Ideally, the Spark Catalyst optimizer should compile both calls to the same execution plan, so performance should be the same; how you write the query is just a matter of style. That said, according to a Hortonworks report ( https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html ), there is a difference in practice: SQL outperformed DataFrames in their benchmark for a case that needed grouped records with their total counts, sorted by record name.
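A sketch of that kind of grouped-count query in both styles, assuming a `sqlContext` and a registered table "events" with a `name` column (both names are illustrative):

```scala
import org.apache.spark.sql.functions.count

// SQL style: grouped records with their total counts, ordered by name.
val sqlVersion = sqlContext.sql(
  "SELECT name, COUNT(*) AS total FROM events GROUP BY name ORDER BY name")

// DataFrame style: the same aggregation expressed with API calls.
val dfVersion = sqlContext.table("events")
  .groupBy("name")
  .agg(count("*").as("total"))
  .orderBy("name")
```

Comparing `sqlVersion.explain()` and `dfVersion.explain()` on your own data is the most reliable way to see whether the optimizer treats them identically in your Spark version.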

+2

Using the DataFrame API, you can split one large SQL query into several statements/queries, which helps with debugging, incremental improvements, and code maintenance.

Breaking complex SQL queries into simpler ones and assigning the results to DataFrames makes the logic easier to understand.

By decomposing the query into several DataFrames, the developer also gains the ability to use cache() and repartition() (to distribute data evenly across partitions using a unique key, or one close to unique).
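A sketch of this decomposition, where the table names (`orders`, `customers`) and columns are illustrative assumptions:

```scala
val orders = sqlContext.table("orders")
val customers = sqlContext.table("customers")

// Repartition on a (near-)unique key to spread data evenly, and cache
// this intermediate result because it is reused downstream.
val activeCustomers = customers
  .filter(customers("active") === true)
  .repartition(customers("customer_id"))
  .cache()

val bigOrders = orders.filter(orders("amount") > 1000)

// Each intermediate DataFrame can be inspected on its own
// (show(), count(), explain()) before the final join.
val result = bigOrders.join(activeCustomers, "customer_id")
```

In a single monolithic SQL string, none of these intermediate steps would have a name you could inspect, cache, or reuse.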

+1
