10

The spark cluster setting is as follows:

from pyspark import SparkConf

conf['SparkConfiguration'] = SparkConf() \
    .setMaster('yarn-client') \
    .setAppName("test") \
    .set("spark.executor.memory", "20g") \
    .set("spark.driver.maxResultSize", "20g") \
    .set("spark.executor.instances", "20") \
    .set("spark.executor.cores", "3") \
    .set("spark.memory.fraction", "0.2") \
    .set("user", "test_user") \
    .set("spark.executor.extraClassPath", "/usr/share/java/postgresql-jdbc3.jar")

When I try to write the dataframe to the Postgres DB using the following code:

from pyspark.sql import DataFrameWriter
my_writer = DataFrameWriter(df)

url_connect = "jdbc:postgresql://198.123.43.24:1234"
table = "test_result"
mode = "overwrite"
properties = {"user":"postgres", "password":"password"}

my_writer.jdbc(url_connect, table, mode, properties)

I encounter the below error:

Py4JJavaError: An error occurred while calling o1120.jdbc.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:278)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:50)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:50)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:49)
    at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:278)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

Can anyone provide some suggestions on this? Thank you!

3 Answers

11

Try write.jdbc and pass the parameters individually, created outside the write.jdbc() call. Also check the port on which Postgres is listening for connections: mine is 5432 for Postgres 9.6 and 5433 for Postgres 8.4.

mode = "overwrite"
url = "jdbc:postgresql://198.123.43.24:5432/kockpit"
properties = {"user": "postgres", "password": "password", "driver": "org.postgresql.Driver"}
data.write.jdbc(url=url, table="test_result", mode=mode, properties=properties)
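One more thing worth checking: the url_connect in the question, jdbc:postgresql://198.123.43.24:1234, has no database name, while the working URL here ends in /kockpit. A Postgres JDBC URL generally has the shape jdbc:postgresql://host:port/database. A minimal sketch of a helper that builds such a URL (the function name and defaults are hypothetical, for illustration only):

```python
def postgres_jdbc_url(host, port=5432, database="postgres"):
    """Build a Postgres JDBC URL of the form jdbc:postgresql://host:port/database.

    Hypothetical helper for illustration; 5432 is the standard Postgres port.
    """
    return "jdbc:postgresql://{}:{}/{}".format(host, port, database)

url = postgres_jdbc_url("198.123.43.24", 5432, "kockpit")
print(url)  # jdbc:postgresql://198.123.43.24:5432/kockpit
```

The resulting string can then be passed straight to df.write.jdbc as the url argument.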


3

Have you downloaded the PostgreSQL JDBC Driver? Download it here: https://jdbc.postgresql.org/download.html.

For the pyspark shell, use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

For submitting a script via spark-submit, use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
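Note that --driver-class-path only puts the jar on the driver's classpath. If the write runs on executors too, you can also ship the jar to the whole cluster with --jars, or (depending on your Spark version; this is an assumption about your setup) have Spark fetch it from Maven Central with --packages. The paths and version below are examples, not your actual values:

```shell
$ spark-submit --jars /path/to/downloaded/jar script.py

$ spark-submit --packages org.postgresql:postgresql:42.2.5 script.py
```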


2

Maybe you can try passing the JDBC driver class explicitly (note that you may need to put the driver jar on the classpath of all Spark nodes):

df.write.option('driver', 'org.postgresql.Driver').jdbc(url_connect, table, mode, properties)

3 Comments

Thanks for the reply. It gives the below error message: TypeError: 'DataFrameWriter' object is not callable
@Yiliang, sorry, in pyspark write is a property, not a method: you should use df.write instead of df.write(). My mistake
Thanks Daniel. Now I encounter java.lang.NullPointerException for that line. Any idea what could go wrong?
