Spark JDBC df.limit ... what is it doing?

I am trying to understand what happens inside Spark, and here is my current confusion. I am trying to read the first 200 rows from an Oracle table in Spark:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "schema.table",
  "fetchSize" -> "5000",
  "partitionColumn" -> "my_row_id",
  "numPartitions" -> "16",
  "lowerBound" -> "0",
  "upperBound" -> "9999999"
  )).load()

jdbcDF.limit(200).count()

I expected this to be pretty fast: the same action on a table with 500K rows completes in a reasonable amount of time. In this particular case the table is much larger (hundreds of millions of rows), but shouldn't limit(200) make it fast anyway? How do I figure out where it spends its time?

1 answer

In fact, Spark is not yet capable of pushing a LIMIT predicate down to the JDBC source.

What it does instead is read the data from the table (subject to your partitioning options) and only then apply the limit on the Spark side. That is why it is slow on a very large table: the scan itself is not limited.
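You can see this directly from the physical plan, which also answers the "how do I figure out where the time goes" part of the question. A minimal sketch, using the jdbcDF from the question:

// Print the extended plan: the limit typically shows up as a Spark-side
// operator (CollectLimit, or LocalLimit/GlobalLimit, depending on the query),
// while the JDBCRelation scan underneath carries no limit, i.e. nothing is
// pushed into the SQL sent to Oracle. The SQL tab of the Spark UI shows the
// same plan together with per-stage timings.
jdbcDF.limit(200).explain(true)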

If you only need a couple of hundred rows, you can push the limit to the database yourself by passing a subquery as dbtable (note that Oracle does not accept LIMIT; use ROWNUM, or FETCH FIRST on 12c and later):

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> "(select * from schema.table where rownum <= 200) t",
    "fetchSize" -> "5000",
    "partitionColumn" -> "my_row_id",
    "numPartitions" -> "16",
    "lowerBound" -> "0",
    "upperBound" -> "9999999"
  )).load()

This way the 200-row cut is applied on the database side, and Spark only ever reads those rows.

You can also parameterize the number of rows:

val n : Int = ???

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> s"(select * from schema.table where rownum <= $n) t",
    "fetchSize" -> "5000",
    "partitionColumn" -> "my_row_id",
    "numPartitions" -> "16",
    "lowerBound" -> "0",
    "upperBound" -> "9999999"
  )).load()
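As a side note (my own suggestion, not part of the original answer): when all you want is a small sample, the partitioning options buy you nothing, so a single-partition read of the already-limited subquery is simpler. A minimal sketch, reusing the placeholder URL and table from the question:

// Read the limited subquery in a single partition: no partitionColumn,
// numPartitions, lowerBound or upperBound are needed for a 200-row sample.
val sampleDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> "(select * from schema.table where rownum <= 200) t",
    "fetchSize" -> "200"
  )).load()

sampleDF.count()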

There is a JIRA ticket (SPARK-10899) about resolving this, but it has not been merged yet.

EDIT: that JIRA was closed as a duplicate; the one to track now is SPARK-12126. I hope this answers your question.

