I am trying to execute a Python file in Spark using a MongoDB connector. The Python file runs a query to get some data from MongoDB and then processes that data with a map operation in Spark.
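For reference, this is roughly what the relevant part of the script looks like (a minimal sketch: the MongoDB URI, database/collection names, and the filter condition are placeholders; only the last line is the actual line 27 from the traceback below):

```python
from pyspark.sql import SparkSession

# Placeholder URI and database/collection names
spark = (SparkSession.builder
         .appName("review_analysis")
         .config("spark.mongodb.input.uri",
                 "mongodb://127.0.0.1/reviews_db.reviews")
         .getOrCreate())

# Load the MongoDB collection as a DataFrame via the MongoDB Spark connector
# and apply a query (the condition here is just an example)
reviews_1 = (spark.read
             .format("com.mongodb.spark.sql.DefaultSource")
             .load()
             .filter("stars < 2"))

# Line 27 of review_analysis.py: map over the RDD and collect to the driver
bad_reviews = reviews_1.rdd.map(lambda r: r.text).collect()
```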
The execution stops with the error message "socket.timeout: timed out" while the map operation is being executed. This is the output I get:
Traceback (most recent call last):
  File "/home/ana/computational_tools_for_big_data/project/review_analysis.py", line 27, in <module>
    bad_reviews = reviews_1.rdd.map(lambda r: r.text).collect()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 777, in collect
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 142, in _load_from_socket
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in load_stream
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 543, in read_int
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
socket.timeout: timed out
I think I get this problem because the file I am querying is very big (2.3 GB). I tried the same with a 1 GB file and got the same problem, but it works with a smaller file of 400 MB.
Is it possible to change the timeout or something else to make it work? Is there any other way to process a large amount of data faster?