
I am trying to fetch data from PostgreSQL in my Spark application, but I am confused about how to install the PostgreSQL driver in my Docker image. I also tried installing postgresql with an apt-get install command, as shown in the Dockerfile below.

Dockerfile:

FROM python:3


ENV SPARK_VERSION 2.3.2
ENV SPARK_HADOOP_PROFILE 2.7
ENV SPARK_SRC_URL https://www.apache.org/dist/spark/spark-$SPARK_VERSION/spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
ENV SPARK_HOME=/opt/spark
ENV PATH $PATH:$SPARK_HOME/bin

RUN wget ${SPARK_SRC_URL}
RUN tar -xzf spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz

RUN mv spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE} /opt/spark
RUN rm -f spark-${SPARK_VERSION}-bin-hadoop${SPARK_HADOOP_PROFILE}.tgz
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk-headless \
        postgresql && \
    rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt


COPY my_script.py ./
CMD [ "python", "./my_script.py" ]

requirements.txt:

pyspark==2.3.2
numpy

my_script.py:

from pyspark import SparkContext
from pyspark import SparkConf

#spark conf
conf1 = SparkConf()
conf1.setMaster("local[*]")
conf1.setAppName('hamza')
print(conf1)
sc = SparkContext(conf = conf1)
print('hahahha')

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
print(sqlContext)

from pyspark.sql import DataFrameReader
url = 'postgresql://IP:PORT/INSTANCE'
properties = {'user': 'user', 'password': 'pass'}
query = '(SELECT 1) AS tmp'  # placeholder; the actual query was omitted from the question
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table=query, properties=properties
)

Getting this error:

Traceback (most recent call last):
  File "./my_script.py", line 26, in <module>
    , properties=properties
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 527, in jdbc
    return self._df(self._jreader.jdbc(url, table, jprop))
  File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o28.jdbc.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)

Kindly guide me on how to set up this driver. Thanks!

2 Comments
  • I think you still need to install the PostgreSQL JDBC driver. Commented Dec 25, 2018 at 12:07
  • Possible duplicate of Spark Unable to find JDBC Driver Commented Dec 25, 2018 at 13:19

3 Answers


Adding these lines to the Dockerfile solved the issue:

ENV POST_URL https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
RUN wget ${POST_URL}
RUN mv postgresql-42.2.5.jar /opt/spark/jars
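
As an aside (a sketch, not part of the original fix): if the jar is kept somewhere other than $SPARK_HOME/jars, it can instead be handed to Spark at runtime through the standard spark.jars property. The jar path below is an assumption; adjust it to wherever your Dockerfile placed the file.

from pyspark import SparkConf, SparkContext

# Sketch: point Spark at the PostgreSQL JDBC jar at runtime via spark.jars
# instead of copying it into $SPARK_HOME/jars. The jar path is an assumed
# location; match it to your Dockerfile.
conf = SparkConf()
conf.setMaster('local[*]')
conf.setAppName('hamza')
conf.set('spark.jars', '/opt/jars/postgresql-42.2.5.jar')
sc = SparkContext(conf=conf)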

Thanks everyone



This is not the Docker way of doing things. The Docker approach is not to have all services inside one container, but to split them into several, where each container has one main process: the database, your application, and so on.

Also, when using separate containers, you don't have to care about installing all the necessary stuff in your Dockerfile - you simply select ready-to-use images for the desired database types. By the way, if you are using the python:3 Docker image, how do you know the maintainers won't change the set of installed services, or even the OS type? They can do that easily, because they only promise to provide the 'Python' service; everything else is undefined.

So, what I recommend is:

  1. Split your project into different containers (Dockerfiles)
  2. Use the standard postgres image for your database - all services and drivers are already on board
  3. Use docker-compose (or similar) to launch both containers and link them together on one network (see the sketch below)
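
As a minimal sketch of what step 3 means for the application code (assuming the database service is named postgres in docker-compose and both containers share a network; the database name, table, and credentials are placeholders), the JDBC URL simply uses the compose service name as the hostname:

from pyspark.sql import SparkSession

# Sketch: inside the app container, the compose service name 'postgres'
# resolves as a hostname on the shared network. The database name, table,
# and credentials below are assumptions; match them to your compose file.
spark = SparkSession.builder.master('local[*]').appName('hamza').getOrCreate()
url = 'jdbc:postgresql://postgres:5432/mydb'
properties = {'user': 'user', 'password': 'pass', 'driver': 'org.postgresql.Driver'}
df = spark.read.jdbc(url=url, table='my_table', properties=properties)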

3 Comments

Okay, sounds good. But if I want to add the postgresql driver in this Docker image, how would I do so?
If you still want an all-in-one Docker image, add the postgres driver, not only the database itself. Here is the solution: stackoverflow.com/questions/34948296/…
My issue is that I want a Dockerfile that runs postgres on a different OS (amazonlinux).

Create the network like this:

sudo docker network create local_central_db_network

Create a file named docker-compose.yml:

version: '3.3'

services:
  postgres:
    image: postgres:16.4 # Using PostgreSQL version 16.4
    container_name: local-central-db-pgsql-container-16.4
    restart: always
    ports:
      - "5432:5432" # Expose port 5432 for external access
    environment:
      POSTGRES_USER: root # Replace with your username
      POSTGRES_PASSWORD: root # Replace with your password
    volumes:
      - ./dbdata:/var/lib/postgresql/data # Persist data between restarts
    networks:
      - local_central_db_network

networks:
  local_central_db_network:
    external: true

Docker setup: initialize by running these commands:

sudo docker-compose down
sudo docker-compose build && docker-compose up -d

Connect from the command line:

docker exec -it local-central-db-pgsql-container-16.4 bash
psql -U root
\q
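
As a quick sanity check from Python before wiring Spark up (a sketch assuming the psycopg2 package is installed; with the official postgres image, the default database name matches POSTGRES_USER when POSTGRES_DB is unset):

import psycopg2

# Sketch: connect to the compose container published on localhost:5432.
# Credentials match POSTGRES_USER/POSTGRES_PASSWORD above; the default
# database name equals the user when POSTGRES_DB is not set.
conn = psycopg2.connect(host='localhost', port=5432,
                        user='root', password='root', dbname='root')
with conn.cursor() as cur:
    cur.execute('SELECT version()')
    print(cur.fetchone())
conn.close()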

Note: in GUI database clients, connect using host localhost, port 5432, user root, password root (as configured in the compose file above).
