
I want to do 10-fold cross-validation for huge files (running into hundreds of thousands of lines each). I want to do a "wc -l" each time I start reading a file, then generate random numbers a fixed number of times, each time writing the line at that number into a separate file. I am using this:

import os
for i in files:
    os.system("wc -l <insert filename>")

How do I insert the file name there? It's a variable. I went through the documentation, but the examples mostly use ls commands, which don't have this problem.
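For context, the sampling step described above might look roughly like the sketch below (fold_size is a placeholder name, files is assumed to hold path strings, and the line count comes from wc -l via subprocess; the answers below cover alternatives):

import random
from subprocess import check_output

def sample_lines(filename, fold_filename, fold_size):
    # Count the lines with wc -l (see the answers below for alternatives).
    total = int(check_output(["wc", "-l", filename]).split()[0])
    # Draw fold_size distinct random line numbers.
    wanted = set(random.sample(range(total), fold_size))
    # Copy only the chosen lines into the fold file.
    with open(filename) as src, open(fold_filename, "w") as dst:
        for lineno, line in enumerate(src):
            if lineno in wanted:
                dst.write(line)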

1 Comment

FYI, google says 1 lakh == 100 000. Commented Jun 29, 2011 at 13:47

7 Answers


Let's compare:

from subprocess import check_output

def wc(filename):
    return int(check_output(["wc", "-l", filename]).split()[0])

def native(filename):
    c = 0
    with open(filename) as file:
        while True:
            chunk = file.read(10 ** 7)
            if chunk == "":
                return c
            c += chunk.count("\n")

def iterate(filename):
    with open(filename) as file:
        i = -1                      # handles an empty file
        for i, line in enumerate(file):
            pass
        return i + 1

Go go timeit function!

from timeit import timeit
from sys import argv

filename = argv[1]

def testwc():
    wc(filename)

def testnative():
    native(filename)

def testiterate():
    iterate(filename)

print "wc", timeit(testwc, number=10)
print "native", timeit(testnative, number=10)
print "iterate", timeit(testiterate, number=10)

Result:

wc 1.25185894966
native 2.47028398514
iterate 2.40715694427

So, wc is about twice as fast on a 150 MB compressed file with ~500 000 linebreaks, which is what I tested on. However, testing on a file generated with seq 3000000 > bigfile, I get these numbers:

wc 0.425990104675
native 0.400163888931
iterate 3.10369205475

Hey look, python FTW! However, using longer lines (~70 chars):

wc 1.60881590843
native 3.24313092232
iterate 4.92839002609

So, conclusion: it depends, but wc seems to be the best all-round bet.


Comments

import subprocess
for f in files:
    subprocess.call(['wc', '-l', f])

Also have a look at http://docs.python.org/library/subprocess.html#convenience-functions - for example, if you want to access the output as a string, you'll want to use subprocess.check_output() instead of subprocess.call().
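For example, the count can be pulled into a Python int like this (a small sketch, assuming files is the list of filename strings from the question):

import subprocess

for f in files:
    # wc -l prints "<count> <filename>"; take the first field.
    out = subprocess.check_output(['wc', '-l', f])
    count = int(out.decode("utf8").split()[0])
    print(count)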

5 Comments

And this also gives me an error:

Traceback (most recent call last):
  File "../../scripts/gps_scripts/cross-validation.py", line 10, in <module>
    print subprocess.call(['wc','-l',f])
  File "/usr/lib/python2.7/subprocess.py", line 486, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 672, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1213, in _execute_child
    raise child_exception
TypeError: execv() arg 2 must contain only strings
@crazyaboutliv You passed it a file object instead of a file name.
one-liner to get the file line count in Python: int(subprocess.check_output(["wc", "-l", fname]).decode("utf8").split()[0])
@sudo, your one-liner works great on my Windows 7 box, but does not work on Windows 10. I get FileNotFoundError: [WinError 2] The system cannot find the file specified. Yet I can see the file clearly exists.
sudo's answer should be included in the correct one.
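(Regarding the TypeError in the first comment above: the argument list must contain only strings, so if files holds open file objects rather than path strings, pass the path, e.g. the object's .name attribute. A minimal sketch:)

import subprocess

for f in files:                      # here files is assumed to hold open file objects
    subprocess.call(['wc', '-l', f.name])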

No need to use wc -l. Use the following Python function:

def file_len(fname):
    i = 0                          # handles an empty file
    with open(fname) as f:
        for i, l in enumerate(f, 1):
            pass
    return i

This is probably more efficient than calling an external utility that loops over the input in a similar fashion.

Update

Dead wrong, wc -l is a lot faster!

$ seq 10000000 > huge_file

$ time wc -l huge_file 
10000000 huge_file

real    0m0.267s
user    0m0.110s
sys 0m0.010s

$ time ./p.py 
10000000

real    0m1.583s
user    0m1.040s
sys 0m0.060s
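(p.py here is presumably just a small driver around the function above; a sketch of what it might look like:)

#!/usr/bin/env python
# p.py (reconstructed sketch): print the line count of huge_file
def file_len(fname):
    i = 0
    with open(fname) as f:
        for i, l in enumerate(f, 1):
            pass
    return i

print(file_len('huge_file'))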

3 Comments

Depending on the size of the file it might be faster to use wc since it's written in C.
@ThiefMaster true, it's all about knowing your input
Yes, my files are 30 lakh lines. I was thinking that counting this way would get slower.

os.system takes a string. Just build the string explicitly:

import os 
for i in files:
    os.system("wc -l " + i)

4 Comments

"Execute the command (a string) in a subshell." - I smell security holes if the file list comes from an untrusted source.
I agree, but os.system is a gaping security hole to begin with, for precisely that reason.
Guys, I need to keep this in deployment. This goes straight into production. Do you suggest sticking with enumerate then, even though it would take a tad longer?
This is giving me an error btw :( Here it is: wc: invalid option -- 'g' Try `wc --help' for more information. 256 (this repeats another 10-12 times; the code is the same as written above).
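(That invalid option error is what wc prints when the text after wc -l begins with a dash or gets mangled by the shell. A more defensive sketch quotes the filename and adds -- so nothing is parsed as an option; pipes.quote is the Python 2 spelling, shlex.quote on Python 3:)

import os
import pipes                     # on Python 3, use shlex.quote instead

for i in files:
    # Quote the filename and add "--" so it can never be read as an option.
    os.system("wc -l -- " + pipes.quote(i))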

Here is a Python approach I found to solve this problem:

count_of_lines_in_any_textFile = sum(1 for l in open('any_textFile.txt'))
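A variant of the same idea that closes the file deterministically (a minimal sketch):

def count_lines(path):
    # Same generator-expression trick, but the file is closed on exit.
    with open(path) as f:
        return sum(1 for _ in f)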

2 Comments

I don't close the file here, do you? Or will the Python garbage collector do that for you?
This causes a StopIteration error if you are using an enumeration method afterwards.

I found a much simpler way:

import os
linux_shell='more /etc/hosts|wc -l'
linux_shell_result=os.popen(linux_shell).read()
print(linux_shell_result)
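(Applied to a variable filename from the question, the same idea might look like the sketch below; the more is unnecessary, since wc -l < file prints just the count, which also makes it easy to parse:)

import os
import pipes                     # on Python 3, use shlex.quote instead

def popen_line_count(filename):
    # "wc -l < file" prints only the number, so int() can parse it directly.
    return int(os.popen('wc -l < ' + pipes.quote(filename)).read())

print(popen_line_count('/etc/hosts'))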

1 Comment

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.

My solution is very similar to the “native” function by lazyr:

import functools

def file_len2(fname):
    with open(fname, 'rb') as f:
        lines = 0
        last_wasnt_nl = False          # stays False for an empty file
        reader = functools.partial(f.read, 131072)
        for datum in iter(reader, ''):
            lines += datum.count('\n')
            last_wasnt_nl = datum[-1] != '\n'
        return lines + last_wasnt_nl

This, unlike wc, considers a final line not ending with '\n' as a separate line. If one wants the same functionality as wc, then it can be (quite unpythonically :) written as:

import functools as ft, itertools as it, operator as op

def file_len3(fname):
    with open(fname, 'rb') as f:
        reader = ft.partial(f.read, 131072)
        counter = op.methodcaller('count', '\n')
        return sum(it.imap(counter, iter(reader, '')))

with times comparable to wc on all the test files I produced.

NB: this applies to Windows and POSIX machines. Old Mac OS used '\r' as the line-end character.
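(A quick way to see the difference, using a hypothetical two-line test file whose last line has no trailing newline:)

# Build a small test file: two lines, no '\n' after the second.
with open('no_trailing_nl.txt', 'wb') as f:
    f.write('first\nsecond')

print(file_len2('no_trailing_nl.txt'))   # 2 - counts the unterminated last line
print(file_len3('no_trailing_nl.txt'))   # 1 - matches wc -l, which counts '\n' characters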

Comments
