I am trying to connect to a Teradata server through PySpark.
The code I run in the PySpark shell is below,
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Teradata connect") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .options(url="jdbc:teradata://xy/",
             driver="com.teradata.jdbc.TeraDriver",
             dbtable="dbname.tablename",
             user="user1", password="***") \
    .load()
which gives the error,
py4j.protocol.Py4JJavaError: An error occurred while calling o159.load. : java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
To resolve this, I think I need to add the jars terajdbc4.jar and tdgssconfig.jar.
In Scala, to add a jar we can use
sc.addJar("<path>/jar-name.jar")
If I try the same in PySpark, I get the error,
AttributeError: 'SparkContext' object has no attribute 'addJar'
or
AttributeError: 'SparkSession' object has no attribute 'addJar'
How can I add the jars terajdbc4.jar and tdgssconfig.jar in PySpark?
You need to supply the jars when you launch PySpark, e.g.:

pyspark2 --jars /data/1/gcgeeapmxtldu/lib/tdgssconfig.jar,/data/1/gcgeeapmxtldu/lib/terajdbc4.jar

and point the session configuration at the same jars before reading through JDBC:

from pyspark.sql import SparkSession

# Note: the extraClassPath values are JVM classpaths, so the entries are
# ":"-separated on Linux, while spark.jars takes a ","-separated list.
spark = SparkSession.builder.appName("sparkanalysis") \
    .config("spark.driver.extraClassPath", "/local_path/terajdbc4.jar:/local_path/tdgssconfig.jar") \
    .config("spark.executor.extraClassPath", "/local_path/terajdbc4.jar:/local_path/tdgssconfig.jar") \
    .config("spark.jars", "/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar") \
    .config("spark.repl.local.jars", "/local_path/tdgssconfig.jar,/local_path/terajdbc4.jar") \
    .getOrCreate()

df = spark.read.format("jdbc") \
    .option("url", "jdbc:teradata://xyz") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("dbtable", "table") \
    .option("user", "USR1") \
    .option("password", "*****") \
    .load()
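If you submit a script instead of using the shell, the same --jars flag works with spark-submit. To double-check that the driver class is actually visible to the driver JVM, you can ask for it through py4j; note that _jvm is an internal handle, so this is only a debugging sketch, not a public API:

# Debugging sketch: try to load the Teradata driver class through py4j.
# spark.sparkContext._jvm is an internal py4j gateway handle, not public API.
# This raises a Py4JJavaError wrapping ClassNotFoundException if the jar
# is missing from the driver classpath.
spark.sparkContext._jvm.java.lang.Class.forName("com.teradata.jdbc.TeraDriver")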