I am trying to execute a Python file in Spark using a MongoDB connector. The Python file runs a query to get some data from MongoDB and then processes that data with a map operation in Spark.
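For reference, this is roughly what the relevant part of the script looks like (a minimal sketch: the MongoDB URI, database/collection names, and the filter condition are placeholders; only the last line is the actual line 27 from the traceback below):

```python
from pyspark.sql import SparkSession

# Placeholder URI and database/collection names
spark = (SparkSession.builder
         .appName("review_analysis")
         .config("spark.mongodb.input.uri",
                 "mongodb://127.0.0.1/reviews_db.reviews")
         .getOrCreate())

# Load the MongoDB collection as a DataFrame via the MongoDB Spark connector
# and apply a query (the condition here is just an example)
reviews_1 = (spark.read
             .format("com.mongodb.spark.sql.DefaultSource")
             .load()
             .filter("stars < 2"))

# Line 27 of review_analysis.py: map over the RDD and collect to the driver
bad_reviews = reviews_1.rdd.map(lambda r: r.text).collect()
```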
The execution stops with the error message "socket.timeout: timed out" while the map operation is being executed. This is the output I get:
Traceback (most recent call last):
  File "/home/ana/computational_tools_for_big_data/project/review_analysis.py", line 27, in <module>
    bad_reviews = reviews_1.rdd.map(lambda r: r.text).collect()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 777, in collect
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in _load_from_socket
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in load_stream
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 543, in read_int
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
socket.timeout: timed out
I think I get this problem because the file I am querying is very big (2.3 GB). I tried the same with a 1 GB file and got the same problem, but it works with a smaller file of 400 MB.
Is it possible to change the timeout or something else to make it work? Is there any other way to process a large amount of data faster?