Dataframe for Oracle creates a case-sensitive table

Spark: 2.1.1

I save a DataFrame as an Oracle table, but the resulting Oracle table has case-sensitive column names.

    val properties = new java.util.Properties
    properties.setProperty("user", ora_username)
    properties.setProperty("password", ora_pwd)
    properties.setProperty("batchsize", "30000")
    properties.setProperty("driver", db_driver)

    spark.sql("select * from myTable")
      .repartition(50)
      .write
      .mode(SaveMode.Overwrite)
      .jdbc(url, "myTable_oracle", properties)

When I query the table in Oracle:

  • Select * from myTable_oracle; => works
  • Select col1 from myTable_oracle; => does not work
  • Select "col1" from myTable_oracle; => works, but having to quote every column is very annoying
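The annoyance comes from Oracle's identifier rules: quoted identifiers are stored exactly as written and must always be quoted, while unquoted ones are folded to upper case. A minimal sketch of the DDL that Spark 2.x effectively generates (the column names and types here are hypothetical; the real statement is assembled inside Spark's JdbcUtils):

```scala
// Sketch of how Spark 2.x assembles the CREATE TABLE column list:
// each column name goes through the dialect's quoteIdentifier, which
// wraps it in double quotes.
def quote(col: String): String = s""""$col""""

// Hypothetical schema for illustration
val columns = Seq("col1" -> "VARCHAR2(2000)", "col2" -> "NUMBER(10)")
val columnList = columns
  .map { case (name, dataType) => s"${quote(name)} $dataType" }
  .mkString(", ")

println(s"CREATE TABLE myTable_oracle ($columnList)")
// CREATE TABLE myTable_oracle ("col1" VARCHAR2(2000), "col2" NUMBER(10))
// Because "col1" is quoted, Oracle stores it lower-case and case-sensitive,
// so a plain `Select col1 ...` (which Oracle folds to COL1) no longer matches.
```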

Tried the setting below, but the problem remains (this flag only controls case sensitivity in Spark's own analyzer, not how column names are quoted in the JDBC DDL):

    spark.sqlContext.sql("set spark.sql.caseSensitive=false")

The same code on Spark 1.6.1 created an Oracle table with case-insensitive columns; the problem appeared only after moving to Spark 2.1.1.

1 answer

I found the cause and a solution: starting with Spark 2.x, every column name is wrapped in double quotes when the table is created, so the columns of the resulting Oracle table become case-sensitive and must be quoted when queried through SQL*Plus.

The quoting happens in JdbcUtils, which calls dialect.quoteIdentifier on every column name:

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L645

and the default quoteIdentifier wraps the name in double quotation marks [ " ]:

    def quoteIdentifier(colName: String): String = {
      s""""$colName""""
    }
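The quadruple quotes are easy to misread: it is a triple-quoted interpolated string whose first and last characters are literal double quotes, so the method returns the column name wrapped in quotes. A quick check:

```scala
// Same shape as the default JdbcDialect implementation: triple-quoted
// interpolation emits a literal ", then the name, then another literal "
def quoteIdentifier(colName: String): String = {
  s""""$colName""""
}

println(quoteIdentifier("col1"))  // prints "col1" (quotes included)
```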

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L90

Solution: unregister the existing OracleDialect and register a custom one that overrides quoteIdentifier, along with the other overrides needed to work with Oracle. After registering it, re-running the same write produces unquoted (case-insensitive) column names.

    import java.sql.Types
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}

    val url = "jdbc:oracle:thin:@HOST:1567/SID"

    // Unregister the built-in Oracle dialect that resolves for this URL
    JdbcDialects.unregisterDialect(JdbcDialects.get(url))

    val OracleDialect = new JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:oracle") || url.contains("oracle")

      override def getCatalystType(sqlType: Int, typeName: String, size: Int,
                                   md: MetadataBuilder): Option[DataType] = {
        // Handle NUMBER fields that have no precision/scale in a special way,
        // because JDBC ResultSetMetaData reports them as precision 0, scale -127
        if (sqlType == Types.NUMERIC && size == 0) {
          // Sub-optimal: we have to pick a precision/scale in advance, whereas
          // Oracle allows a different precision/scale for each value. This
          // conversion works in our domain for now, but we need a more durable
          // solution. Look into changing JDBCRDD (line 406):
          //   FROM: mutableRow.update(i, Decimal(decimalVal, p, s))
          //   TO:   mutableRow.update(i, Decimal(decimalVal))
          Some(DecimalType(DecimalType.MAX_PRECISION, 10))
        }
        // Handle TIMESTAMP WITH TIME ZONE (for now we would just convert it
        // to a string with the default format)
        // else if (sqlType == -101) {
        //   Some(StringType)
        // }
        else None
      }

      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case StringType    => Some(JdbcType("VARCHAR2(2000)", java.sql.Types.VARCHAR))
        case BooleanType   => Some(JdbcType("NUMBER(1)", java.sql.Types.NUMERIC))
        case IntegerType   => Some(JdbcType("NUMBER(10)", java.sql.Types.NUMERIC))
        case LongType      => Some(JdbcType("NUMBER(19)", java.sql.Types.NUMERIC))
        case DoubleType    => Some(JdbcType("NUMBER(19,4)", java.sql.Types.NUMERIC))
        case FloatType     => Some(JdbcType("NUMBER(19,4)", java.sql.Types.NUMERIC))
        case ShortType     => Some(JdbcType("NUMBER(5)", java.sql.Types.NUMERIC))
        case ByteType      => Some(JdbcType("NUMBER(3)", java.sql.Types.NUMERIC))
        case BinaryType    => Some(JdbcType("BLOB", java.sql.Types.BLOB))
        case TimestampType => Some(JdbcType("DATE", java.sql.Types.TIMESTAMP))
        case DateType      => Some(JdbcType("DATE", java.sql.Types.DATE))
        //case DecimalType.Fixed(precision, scale) =>
        //  Some(JdbcType(s"NUMBER($precision,$scale)", java.sql.Types.NUMERIC))
        //case DecimalType.Unlimited =>
        //  Some(JdbcType("NUMBER(38,4)", java.sql.Types.NUMERIC))
        case _ => None
      }

      // Important from Spark 2.0: without this override, the Oracle table
      // columns would be case-sensitive (the default wraps names in quotes)
      override def quoteIdentifier(colName: String): String = colName
    }

    JdbcDialects.registerDialect(OracleDialect)
