
I have the following code, which is incredibly slow when run in Python:

items = Item.objects.all()

new_list = []
for line in open(file, 'r').readlines():
    if line in items:
        new_list.append(line)

This might take about an hour to run. How would I write this for loop in C? Something like:

items = Item.objects.all() # 1M+ objects

new_list = c_function()
  • Have you tried using a set? Commented Sep 5, 2014 at 22:34
  • This is Django right? Commented Sep 5, 2014 at 22:35
  • First, are you sure you need to do it in C? If you write this as a list comprehension or a call to filter, a lot of the slow looping-in-Python and calling-list.append part is optimized for free; try new_list = [id for id in list_of_ids if id in items] or new_list = list(filter(items.__contains__, list_of_ids)). Commented Sep 5, 2014 at 22:35
  • Also, what type is items? If it's a list or something else with linear-time searching, then it doesn't really matter what language it's in; it's going to be way too slow, because you've got an O(NM) algorithm instead of an O(N) algorithm. Change it to a set (or possibly some kind of sorted, logarithmic collection) instead of a list. Commented Sep 5, 2014 at 22:36
  • Also, are you using any libraries that aren't compatible with PyPy? If not, have you tried just running in PyPy? (I suspect that using the right data structure will already get you the orders-of-magnitude speedup you want, switching to a comprehension or filter call will squeeze out another 10-50%, and PyPy's JIT will get you another 10-50% at best in this case… but you never know, and you might as well try.) Commented Sep 5, 2014 at 23:05

2 Answers

3

I can explain how to rewrite this in C, but first let me explain why you probably don't want to.


Your biggest problem most likely has nothing to do with Python being slow, but with using a quadratic algorithm instead of a linear one. If items is a list, or any other kind of unordered sequence, then id in items is going to have to do a full linear search through all of items for each id. That will take forever, no matter what language you do it in. If you use a set, or any other kind of hash table (or a logarithmic binary-searchable collection), then id in items will take very little time for each id, no matter how big items is, and it will be more than fast enough in Python.

Even if you can't change the code that returns items, just constructing a set out of it right before searching that set will give you a massive speedup in exchange for the tiny performance hit of an extra pass:

items = set(items)
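To get a feel for how big that difference is, here's a rough timeit sketch with made-up string ids rather than your actual data; the exact numbers will vary, but the gap between the list lookup and the set lookup is dramatic:

import timeit

ids = [str(i) for i in range(1000000)]   # a million made-up string ids
id_list = ids
id_set = set(ids)

# worst-case membership test: an element near the end
print(timeit.timeit(lambda: '999999' in id_list, number=100))  # linear scan each time
print(timeit.timeit(lambda: '999999' in id_set, number=100))   # hashed lookup each time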

Your second biggest problem also has nothing to do with Python being slow:

for line in open(file, 'r').readlines():

You're reading the entire file into memory, then splitting it into a giant list in memory, before you can even start working. There is literally no reason to do this except to slow your code down. Files are already iterables of lines of text. Just change this to:

for line in open(file, 'r'):

You should also use a with statement, rather than just calling open and leaking the file handle:

with open(file, 'r') as f:
    for line in f:

If that's still too slow, well, now we've reached something that is Python's fault. Python loops are slow, and calling list.append dynamically is also slow, and you're doing that right in the middle of your inner loop.

But you can eliminate both of these problems, from within Python, by using a comprehension, calling filter, or, simplest of all (if you don't need to preserve the order of the ids), calling set.intersection.

Again, since files are iterables, you can just use the file in any of these:

with open(file, 'r') as f:
    new_list = [line for line in f if line in items]

with open(file, 'r') as f:
    new_list = list(filter(items.__contains__, f))

with open(file, 'r') as f:
    new_set = items.intersection(f)

Whichever way you choose, you're moving the slow looping into C and replacing the dynamic call to list.append with a direct call to the underlying C function. So, you've effectively rewritten the slowest part of your code in C while still remaining in Python.
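Putting those pieces together, a minimal sketch of the whole fix might look like the following; depending on what items actually contains, you may also need to strip the trailing newline from each line before testing membership:

items = set(items)              # hashed membership tests instead of linear scans

with open(file, 'r') as f:
    # rstrip('\n') assumes the entries in items don't include a trailing newline
    new_list = [line for line in f if line.rstrip('\n') in items]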


If you haven't tried PyPy, and you aren't using any non-PyPy-compatible libraries, try replacing CPython with PyPy. If you've made the above fixes, there's probably not much more it can do with this code, but I've been surprised before… and its JIT optimizer could definitely radically speed up some surrounding code.


If you really want to write this in C, however, you can. I'd strongly consider using Cython first, which will compile almost any native Python code into a C extension, and which is very easy to hint with static types. But let's see how you'd write this in native C.

static PyObject *
mymodule_idfilter(PyObject *self, PyObject *args) {
    const char *pathname = NULL;
    FILE *f = NULL;
    PyObject *items = NULL;
    char line[1024];
    PyObject *line_obj = NULL;
    PyObject *result = NULL;
    int contains;

    if (!PyArg_ParseTuple(args, "sO", &pathname, &items))
        goto err;
    f = fopen(pathname, "r");
    if (!f) {
        /* set a Python exception so returning NULL is valid */
        PyErr_SetFromErrnoWithFilename(PyExc_IOError, pathname);
        goto err;
    }

    result = PyList_New(0);
    if (!result)
        goto err;

    while (!feof(f) && !ferror(f)) {
        /* NOTE: If your lines can be arbitrary length, this will
           do the wrong thing; presumably the first 1023 bytes of
           the line won't match anything in items, and neither will
           the remainder... if you want to handle that, that's
           simple C programming, nothing to do with Python */
        if (fgets(line, sizeof(line)-1, f)) {
            line_obj = PyString_FromString(line);
            if (!line_obj)
                goto err;
            contains = PySequence_Contains(items, line_obj);
            if (contains == -1)
                goto err;
            if (contains == 1)
                if (PyList_Append(result, line_obj) < 0)
                    goto err;
            Py_DECREF(line_obj);
            line_obj = NULL;
        }
    }
    }
    fclose(f);
    return result;
err:
    /* items is a borrowed reference from PyArg_ParseTuple, so it is not DECREF'd */
    if (f) fclose(f);
    Py_XDECREF(line_obj);
    Py_XDECREF(result);
    return NULL;
}

Of course the inner core of the loop is the call to PySequence_Contains(items, line_obj), which is effectively doing the same work as line in items (including dynamically looking up the __contains__ method and everything), so that part won't be any faster. The fact that we're using PyList_Append instead of list.append will speed things up, but not by much more than a list comprehension or filter would, because they effectively do the same thing. The fact that we're looping over the file with fgets instead of file.__next__ will speed things up, but since we end up constructing a Python string and looking it up in a set of Python strings, we're still doing a lot of the same work; at best I'd expect a 30% speedup here, and at the cost of flexibility: arbitrary-length strings, optional universal newlines, Unicode (if you were using Python 3.x), and so on.

If you actually want to make this significantly faster, you're going to have to replace that items with a wrapper around some C-implemented hash table of unboxed C strings; then you can avoid dealing with boxing, unboxing, and refcounting entirely within the inner loop. But that requires major changes outside the code you asked about.



0
items = Item.objects.all()

new_list = []
for line in open(file, 'r').readlines():
    if line in items:
        new_list.append(line)

There is some improvement possible here, particularly in the 'if line in items' test: convert items into a dictionary. Then the 'in' operator will hash the key and look it up directly, without having to scan the complete list each time.

To convert items to a dictionary, look at the dict.fromkeys() method.
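A minimal sketch of that suggestion, assuming the values returned by Item.objects.all() are the same strings that appear in the file, might look like:

items = dict.fromkeys(Item.objects.all())   # keys only; every value is None

new_list = []
with open(file, 'r') as f:
    for line in f:
        if line in items:                    # hashed key lookup instead of a linear scan
            new_list.append(line)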

1 Comment

Why would you use a dictionary instead of a set if you have no values for the keys?
