I had a hunch that Python could do this much faster than sed, but I didn't have the time to check until now, so... Based on your comment on Arount's answer:
my real file is actually quite big, the command line is way faster than a python script
That's not necessarily true, and in fact, in your case I suspected that Python could do it many, many times faster than sed, because with Python you're not limited to iterating over your file through a line buffer, nor do you need a full-blown regex engine just to find the line separators.
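To illustrate the difference, here's a minimal sketch of both approaches (the chunk size is an arbitrary choice, and the file name just matches the test file below):

total = 0
with open("example.txt") as f:
    # line-oriented: the file is consumed one buffered line at a time, as sed does
    for line in f:
        total += len(line)

total = 0
with open("example.txt") as f:
    # chunk-oriented: the file is consumed in large fixed-size blocks,
    # requiring far fewer I/O operations regardless of the number of lines
    while True:
        chunk = f.read(64 * 1024)
        if not chunk:
            break
        total += len(chunk)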
I'm not sure how big your file is, but I generated my test example as:
with open("example.txt", "w") as f:
for i in range(10**8): # I would consider 100M lines as "big" enough for testing
print(i, file=f)
This essentially creates a 100M-line (888.9 MB) file with a different number on each line.
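For reference, the goal (per your sed command) is to turn those lines into a JSON array, so a three-line file containing 1, 2 and 3 should end up as:

[1,
2,
3]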
Now, timing your sed command alone, running at the highest priority (chrt -f 99) results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
User time (seconds): 56.89
System time (seconds): 1.74
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1044
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 313
Voluntary context switches: 7
Involuntary context switches: 29
Swaps: 0
File system inputs: 1140560
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The result would be even worse if you were actually to call it from Python, as it would also incur the subprocess spawning and STDOUT redirection overhead.¹
However, if we leave it to Python to do all the work instead of sed:
import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from the second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        last_chunk = ""  # keep track of the previous chunk so we can strip the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the previous chunk
                last_chunk = chunk.replace("\n", ",\n")  # add a comma to each line ending
            else:  # EOF
                break
        last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
        if last_chunk and last_chunk[-1] == ",":  # clear out the trailing comma (guarding against an empty input)
            last_chunk = last_chunk[:-1]
        f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array
without ever touching the shell results in:
[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
Command being timed: "python process_file.py example.txt output.txt"
User time (seconds): 1.75
System time (seconds): 0.72
Percent of CPU this job got: 93%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4716
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 14835
Voluntary context switches: 16
Involuntary context switches: 0
Swaps: 0
File system inputs: 3120
File system outputs: 1931424
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And given the CPU utilization (93%), the bottleneck is actually I/O; left to its own devices (or working from very fast storage instead of a virtualized HDD, as on my testbed), Python could do it even faster.
So, it took sed 32.5 times longer (comparing user time; roughly 22x in wall-clock time) to do the same task that Python did. Even if you were to optimize your sed a bit, Python would still be faster, because sed is limited to a line buffer, so a lot of time is wasted on input I/O (compare the "File system inputs" figures in the benchmarks above: 1,140,560 vs. 3,120), and there's no (easy) way around that.
Conclusion: Python is way faster than sed for this particular task.
¹ If you were to use subprocess.call(), you wouldn't need to handle the STDOUT redirection (i.e. " > ") yourself; the subprocess module does that for you. Also, depending on how sed handles the STDOUT forwarding, you might need to add shell=True to invoke the command via your shell.
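For illustration, a minimal sketch of such a call (reusing the file names from the benchmark above):

import subprocess

# invoke the same sed program from Python; passing stdout= lets the
# subprocess module handle the redirection, no shell "> output.txt" needed
with open("output.txt", "w") as f_out:
    subprocess.call(
        ["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", "example.txt"],
        stdout=f_out,
    )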