I had a hunch that Python could do this much faster than sed, but I didn't have the time to check until now, so... Based on your comment on Arount's answer:
my real file is actually quite big, the command line is way faster than a python script
That's not necessarily true, and in fact, in your case I suspected that Python could do it many, many times faster than sed, because with Python you're not limited to iterating over your file through a line buffer, nor do you need a full-blown regex engine just to find the line separators.
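To illustrate the difference, here's a minimal sketch of both approaches (the chunk size is an arbitrary choice, and the file name just matches the test file below):

total = 0
with open("example.txt") as f:
    # line-oriented: the file is consumed one buffered line at a time, as sed does
    for line in f:
        total += len(line)

total = 0
with open("example.txt") as f:
    # chunk-oriented: the file is consumed in large fixed-size blocks,
    # requiring far fewer I/O operations regardless of the number of lines
    while True:
        chunk = f.read(64 * 1024)
        if not chunk:
            break
        total += len(chunk)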
I'm not sure how big your file is, but I generated my test example as:
with open("example.txt", "w") as f:
for i in range(10**8): # I would consider 100M lines as "big" enough for testing
print(i, file=f)
This essentially creates a 100M-line (888.9 MB) file with a different number on each line.
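For reference, the goal (per your sed command) is to turn those lines into a JSON array, so a three-line file containing 1, 2 and 3 should end up as:

[1,
2,
3]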
Now, timing your sed command alone, running at the highest priority (chrt -f 99) results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
User time (seconds): 56.89
System time (seconds): 1.74
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1044
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 313
Voluntary context switches: 7
Involuntary context switches: 29
Swaps: 0
File system inputs: 1140560
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The result would be even worse if you were actually to call it from Python, as it would also incur the subprocess spawning and STDOUT redirection overhead.¹
However, if we leave it to Python to do all the work instead of sed:
import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from the second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        last_chunk = ""  # keep track of the previous chunk so we can strip the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the previous chunk
                last_chunk = chunk.replace("\n", ",\n")  # add a comma to each line ending
            else:  # EOF
                break
        last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
        if last_chunk and last_chunk[-1] == ",":  # clear out the trailing comma (guarding against an empty input)
            last_chunk = last_chunk[:-1]
        f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array
without ever touching the shell results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
Command being timed: "python process_file.py example.txt output.txt"
User time (seconds): 1.75
System time (seconds): 0.72
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4716
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 14835
Voluntary context switches: 16
Involuntary context switches: 0
Swaps: 0
File system inputs: 3120
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And given the CPU utilization (93%), the bottleneck is actually I/O; left to its own devices (or working from very fast storage instead of a virtualized HDD, as on my testbed), Python could do it even faster.
So, it took sed 32.5 times longer (comparing user time; roughly 22x in wall-clock time) to do the same task that Python did. Even if you were to optimize your sed a bit, Python would still be faster, because sed is limited to a line buffer, so a lot of time is wasted on input I/O (compare the "File system inputs" figures in the benchmarks above: 1,140,560 vs. 3,120), and there's no (easy) way around that.
Conclusion: Python is way faster than sed for this particular task.
¹ If you were to use subprocess.call(), you wouldn't need to handle the STDOUT redirection (i.e. " > ") yourself; the subprocess module does that for you. Also, depending on how sed handles the STDOUT forwarding, you might need to add shell=True to invoke the command via your shell.
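For illustration, a minimal sketch of such a call (reusing the file names from the benchmark above):

import subprocess

# invoke the same sed program from Python; passing stdout= lets the
# subprocess module handle the redirection, no shell "> output.txt" needed
with open("output.txt", "w") as f_out:
    subprocess.call(
        ["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", "example.txt"],
        stdout=f_out,
    )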