Revisions to Is there a faster way to remove a line (given a line number) from a file?

replaced http://unix.stackexchange.com/ with https://unix.stackexchange.com/

edited Apr 13, 2017 at 12:36

In the special case that the content of the lines to be deleted is unique in the file, another option might be to use grep -v with the content of the line rather than the line numbers. This applies, for instance, if only one unique line should be deleted (deleting a single line was asked about in this duplicate thread), or if many lines all share the same unique content.

Here is an example:

grep -v "content of lines to delete" input.txt > input.tmp
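A minimal sketch of how the example above could be turned into an in-place deletion. It assumes the content should be matched literally, hence the added -F option: without it, grep reads the string as a regular expression, so a "." matches any character. The function name delete_lines_containing and the demo file are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: delete every line of FILE that contains TEXT literally.
delete_lines_containing() {
    text=$1
    file=$2
    tmp="$file.tmp"
    status=0
    # -v inverts the match, -F treats TEXT as a fixed string, not a regex.
    # grep exits with 1 when it prints nothing (every line was deleted);
    # only an exit status above 1 signals a real error.
    grep -vF -e "$text" -- "$file" > "$tmp" || status=$?
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return "$status"
    fi
    # Replace the original only after grep has finished successfully.
    mv "$tmp" "$file"
}

# Demo on a throwaway file: the third line differs only where the
# first "." would act as a wildcard in a regular expression.
demo_dir=$(mktemp -d)
printf '%s\n' 'first' 'CDGA_00004.pdbqt.gz.tar' 'CDGA_00004Xpdbqt.gz.tar' 'last' \
    > "$demo_dir/input.txt"
delete_lines_containing 'CDGA_00004.pdbqt.gz.tar' "$demo_dir/input.txt"
cat "$demo_dir/input.txt"
```

If the whole line has to match rather than a part of it, adding -x to the grep call would be one way to express that. Note that the renamed temporary file may end up with different permissions than the original.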

I ran some benchmarks on a file containing 340 000 lines. The grep approach seems to be around 15 times faster than the sed method in this case (0.046 s versus 0.711 s of real time).

Here are the commands and the timings:

time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt

real    0m0.711s
user    0m0.179s
sys     0m0.530s

time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt

real    0m0.105s
user    0m0.088s
sys     0m0.016s

time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )

real    0m0.046s
user    0m0.014s
sys     0m0.019s

I have tried both with and without the setting LC_ALL=C; it does not change the timings. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.
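The timings above depend on the machine and the file, so before comparing speed it can be worth confirming that the three commands really produce the same result. A rough sketch, assuming GNU sed (for -i without a backup suffix) and a synthetic input file; the file names and the entry_N.tar line format are made up for illustration, and the perl variant is skipped when perl is not installed.

```shell
#!/bin/sh
# Sketch: check that the sed, perl and grep variants delete the same line.
# Assumes GNU sed (for -i without a backup suffix); all names are illustrative.
work=$(mktemp -d)
pattern='entry_170000\.tar'   # dot escaped so it only matches a literal "."

# Synthetic input: 340000 lines, exactly one of which matches the pattern.
awk 'BEGIN { for (i = 1; i <= 340000; i++) print "entry_" i ".tar" }' > "$work/orig.txt"

cp "$work/orig.txt" "$work/sed.txt"
sed -i "/$pattern/d" "$work/sed.txt"

# grep cannot edit in place, so write a temporary file and rename it.
cp "$work/orig.txt" "$work/grep.txt"
grep -v "$pattern" "$work/grep.txt" > "$work/grep.tmp" && mv "$work/grep.tmp" "$work/grep.txt"

if cmp -s "$work/sed.txt" "$work/grep.txt"; then
    echo "sed and grep agree"
fi

# The perl variant is only checked when perl is installed.
if command -v perl >/dev/null 2>&1; then
    cp "$work/orig.txt" "$work/perl.txt"
    perl -ni -e "print unless /$pattern/" "$work/perl.txt"
    if cmp -s "$work/sed.txt" "$work/perl.txt"; then
        echo "sed and perl agree"
    fi
fi

remaining=$(wc -l < "$work/grep.txt" | tr -d ' ')
echo "lines remaining: $remaining"
```

To time a variant, each command can be prefixed with time as in the listing above, restoring the file from orig.txt between runs so that every command sees the same input.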

quotes forgotten

edited Mar 19, 2017 at 12:46

Jadzia

In the special case that the content of the lines to be deleted is unique in the file, another option might be to use grep -v with the content of the line rather than the line numbers. This applies, for instance, if only one unique line should be deleted (deleting a single line was asked about in this duplicate thread), or if many lines all share the same unique content.

Here is an example:

grep -v "content of lines to delete" input.txt > input.tmp

I ran some benchmarks on a file containing 340 000 lines. The grep approach seems to be around 15 times faster than the sed method in this case (0.046 s versus 0.711 s of real time).

Here are the commands and the timings:

time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt

real    0m0.711s
user    0m0.179s
sys     0m0.530s

time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt

real    0m0.105s
user    0m0.088s
sys     0m0.016s

time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )

real    0m0.046s
user    0m0.014s
sys     0m0.019s

I have tried both with and without the setting LC_ALL=C; it does not change the timings. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.

answered Mar 19, 2017 at 12:29

Jadzia

In the special case that the content of the lines to be deleted is unique in the file, another option might be to use grep -v with the content of the line rather than the line numbers. This applies, for instance, if only one unique line should be deleted (deleting a single line was asked about in this duplicate thread), or if many lines all share the same unique content.

Here is an example:

grep -v "content of lines to delete" input.txt > input.tmp

I ran some benchmarks on a file containing 340 000 lines. The grep approach seems to be around 15 times faster than the sed method in this case (0.046 s versus 0.711 s of real time).

Here are the commands and the timings:

time sed -i /CDGA_00004.pdbqt.gz.tar/d /tmp/input.txt

real    0m0.711s
user    0m0.179s
sys     0m0.530s

time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt

real    0m0.105s
user    0m0.088s
sys     0m0.016s

time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )

real    0m0.046s
user    0m0.014s
sys     0m0.019s

I have tried both with and without the setting LC_ALL=C; it does not change the timings. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.
