I have 271 million records, one per line, in a text file that I need to add to MongoDB, and I'm using Python and PyMongo to do this.
I first split the single 271-million-line file into multiple files of 1 million lines each, and wrote the following code to add them to the database:
import os
import threading

from pymongo import MongoClient


class UserDb:
    """Thin wrapper holding the MongoDB connection and target collection."""
    def __init__(self):
        self.client = MongoClient('localhost', 27017)
        self.data = self.client.large_data.data


threadlist = []


def start_add(listname):
    # Each thread opens its own connection and inserts its file line by line.
    db = UserDb()
    with open(listname, "r") as r:
        for line in r:
            if line is None:
                return
            a = dict()
            a['no'] = line.strip()
            db.data.insert_one(a)
    print(listname, "Done!")


# Start one thread per chunk file (all chunk files start with 'li').
for x in os.listdir(os.getcwd()):
    if x.startswith('li'):
        t = threading.Thread(target=start_add, args=(x,))
        t.start()
        threadlist.append(t)

print("All threads started!")

for thread in threadlist:
    thread.join()
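For context, the 'li' chunk files were produced by splitting the big file beforehand. The exact split script probably doesn't matter, but it was roughly this (the li_0000, li_0001, ... names are just what it happens to produce):

def split_file(path, lines_per_chunk=1_000_000):
    # Write the big file out in 1-million-line chunks named li_0000, li_0001, ...
    chunk, count = None, 0
    with open(path, "r") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                if chunk:
                    chunk.close()
                chunk = open("li_%04d" % count, "w")
                count += 1
            chunk.write(line)
    if chunk:
        chunk.close()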
The insert code starts as many threads as there are chunk files, and each thread adds every line to the db one document at a time as it goes through its file. The problem is that after 3 hours it had only added 8,630,623 records.
What can I do to add the records faster?
There are 261 threads currently running.
Example of one row of data: 12345678 (8 digits)
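Would batching the inserts help? This is roughly what I have in mind, using insert_many with unordered writes instead of one insert_one call per line (an untested sketch; the batch size of 10,000 is an arbitrary guess):

def start_add_batched(listname, batch_size=10000):
    db = UserDb()
    batch = []
    with open(listname, "r") as r:
        for line in r:
            batch.append({'no': line.strip()})
            if len(batch) >= batch_size:
                # One round trip per 10,000 documents instead of one per document.
                db.data.insert_many(batch, ordered=False)
                batch = []
    if batch:
        db.data.insert_many(batch, ordered=False)
    print(listname, "Done!")

Is that the right direction, or is there a better approach?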