Deleting lines from large file if strings in list are found in the first 12 characters of line?

Question

I have a matrix file of 184,000+ lines by 5,400+ columns that looks like this:

denovo1 someverylaaargenumbers and lotandlotsoftextuntil 5400.........
denovo10 someverylaaargenumbers and lotandlotsoftextuntil 5400........
denovo100 someverylaaargenumbers and lotandlotsoftextuntil 5400.......
denovo1000 someverylaaargenumbers and lotandlotsoftextuntil 5400......
denovo10000 someverylaaargenumbers and lotandlotsoftextuntil 5400.....
denovo100000 someverylaaargenumbers and lotandlotsoftextuntil 5400......
denovo184117 someverylaaargenumbers and lotandlotsoftextuntil 5400......

I have a list of identifiers in a second file that looks like this:

denovo1
denovo100
denovo1000
denovo100000

I want to delete every line of the matrix whose identifier is found in the second file, leaving:

denovo10 someverylaaargenumbers and lotandlotsoftextuntil 5400........
denovo10000 someverylaaargenumbers and lotandlotsoftextuntil 5400.....
denovo184117 someverylaaargenumbers and lotandlotsoftextuntil 5400......

I have this short shell loop that reads file 2 line by line and deletes the matching lines from the matrix:

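while read -r line
do
    echo "$line"
    # Double quotes so the shell expands $line inside the sed script.
    # sed -i '' is the BSD/macOS form; GNU sed takes -i without the empty argument.
    sed -i '' "/$line/d" /my/path/matrix1
done < /my/path/file2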

and it does work, but it takes forever because it reads all the lines to the end. Is there some way to make the machine read only the first 12 characters of each line?

Freddy · Accepted Answer · 2019-05-24 22:12:48Z

1

With grep:

grep -vwf file matrix > matrix.new
mv matrix.new matrix

option -f FILE: use FILE as the pattern input file
option -w: select only those lines containing matches that form whole words
option -v: select non-matching lines

Note that file must not contain any empty lines.
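
As a quick check of the grep approach, here is a minimal sketch on a cut-down copy of the sample data. The scratch directory, the three-column rows, and the file.clean name are assumptions made for illustration. The identifier list deliberately contains an empty line, which is stripped before the list is handed to grep.

```shell
# Work in a throwaway directory (assumption: mktemp is available)
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical, shortened stand-in for the real matrix
cat > matrix <<'EOF'
denovo1 aaa 1
denovo10 bbb 2
denovo100 ccc 3
denovo1000 ddd 4
denovo10000 eee 5
denovo100000 fff 6
denovo184117 ggg 7
EOF

# Identifier list with a stray empty line in the middle
printf '%s\n' denovo1 denovo100 '' denovo1000 denovo100000 > file

# Drop empty lines from the list before using it as a pattern file
grep -v '^$' file > file.clean

# -f: read patterns from file.clean, -w: whole-word matches only,
# -v: keep the lines that do NOT match
result=$(grep -vwf file.clean matrix)
printf '%s\n' "$result"
```

Note that -w looks for each identifier as a whole word anywhere on the line, not only in the first column, so this relies on the identifiers never appearing in the data columns.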

Alternatively, create the identifier file manually, with an anchor ^ to match the start of the line and a space character after each identifier to mark the end of the pattern:

printf '^%s \n' denovo{1,100,1000,100000} > file
grep -vf file matrix > matrix.new
mv matrix.new matrix
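
If the identifier list already exists as a plain file, the anchored patterns can probably be generated from it rather than typed by hand. The following is a sketch under a few assumptions: the names matrix1, file2 and patterns are hypothetical, the identifiers contain no regular-expression metacharacters, and the separator after the first column is a single space as in the sample.

```shell
# Work in a throwaway directory (assumption: mktemp is available)
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical, shortened stand-in for the real matrix
cat > matrix1 <<'EOF'
denovo1 aaa 1
denovo10 bbb 2
denovo100 ccc 3
denovo1000 ddd 4
denovo10000 eee 5
denovo100000 fff 6
denovo184117 ggg 7
EOF

# Plain identifier list, one per line, no empty lines
printf '%s\n' denovo1 denovo100 denovo1000 denovo100000 > file2

# Wrap each identifier as "^<id> ": anchored at line start,
# with a trailing space marking the end of the first column
sed 's/.*/^& /' file2 > patterns

# No -w needed here: the anchor and the trailing space delimit the match
result=$(grep -vf patterns matrix1)
printf '%s\n' "$result"
```

Because every pattern is anchored with ^, a match can only occur in the first column, which is closer to what the question asks for than a whole-word search over the full line.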

edited May 24, 2019 at 22:12

answered May 24, 2019 at 21:59

Freddy


John1024 · Accepted Answer · 2019-05-24 21:39:34Z

0

Try:

$ awk 'FNR==NR{ids[$1]; next} !($1 in ids)' ids file
denovo10 someverylaaargenumbers and lotandlotsoftextuntil 5400........
denovo10000 someverylaaargenumbers and lotandlotsoftextuntil 5400.....
denovo184117 someverylaaargenumbers and lotandlotsoftextuntil 5400......

How it works:

FNR==NR{ids[$1]; next}

While reading the first file, ids, this creates a key in associative array ids with the id. It then skips the rest of the commands and jumps to the next line.

!($1 in ids)

While reading the second file, this prints the line if the first field is not a key in associative array ids.
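
The same one-liner can be spread over several lines with comments, which may make the two rules easier to follow. This is only an illustrative sketch: the scratch directory and the shortened three-column rows are assumptions, while the file names ids and file follow the answer.

```shell
# Work in a throwaway directory (assumption: mktemp is available)
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '%s\n' denovo1 denovo100 denovo1000 denovo100000 > ids

# Hypothetical, shortened stand-in for the real matrix
cat > file <<'EOF'
denovo1 aaa 1
denovo10 bbb 2
denovo100 ccc 3
denovo1000 ddd 4
denovo10000 eee 5
denovo100000 fff 6
denovo184117 ggg 7
EOF

result=$(awk '
    # NR counts lines across all inputs; FNR restarts at 1 for each file.
    # FNR == NR therefore holds only while the first file (ids) is read.
    FNR == NR {
        ids[$1]   # referencing the element creates the key; no value is needed
        next      # skip the remaining rules for this line
    }
    # Reached only for the second file: a pattern without an action
    # prints the line when the pattern is true.
    !($1 in ids)
' ids file)
printf '%s\n' "$result"
```

One caveat: if the ids file is empty, FNR == NR stays true while the second file is read, so every line is treated as an identifier and nothing is printed.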

To update the original file

When you are satisfied that the code is working correctly, the file can be changed:

awk 'FNR==NR{ids[$1]; next} !($1 in ids)' ids file >tmp && mv tmp file
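
A slightly more defensive variant of the same replace step is sketched below. It is not part of the answer: the mktemp template, the line-count report, and the tiny sample files are all assumptions added for illustration.

```shell
# Work in a throwaway directory (assumption: mktemp is available)
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Tiny hypothetical sample
printf '%s\n' denovo1 denovo100 > ids
printf '%s\n' 'denovo1 a' 'denovo10 b' 'denovo100 c' > file

before=$(wc -l < file)

# Create the temporary file next to the target so that mv is a
# rename within the same filesystem
tmp=$(mktemp ./file.XXXXXX) || exit 1

# Replace the original only if awk exited successfully
if awk 'FNR==NR{ids[$1]; next} !($1 in ids)' ids file > "$tmp"; then
    mv "$tmp" file
else
    rm -f "$tmp"
    exit 1
fi

after=$(wc -l < file)
echo "removed $((before - after)) of $((before)) lines"
```

The replaced file inherits the permissions of the temporary file, so on a shared system it may be worth restoring the original mode afterwards.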

answered May 24, 2019 at 21:39

John1024


Sorry for being a moron in awk. I don't understand where to put the file names in your example. If the first file's name is matrix 1, and file2 contains the identifiers, then how should the command look?
– Christoffer Bugge Harder, May 24, 2019 at 22:04

awk 'FNR==NR{ids[matrix1]; next} !($1 in ids)' ids file2?
– Christoffer Bugge Harder, May 24, 2019 at 22:05

If file2 is the identifiers and the data is in matrix 1, then use: awk 'FNR==NR{ids[$1]; next} !($1 in ids)' file2 "matrix 1".
– John1024, May 24, 2019 at 22:46

FNR==NR Yuck!!! (btw, "Yuck!" is probably the proper way to pronounce awk.) LOL
– filbranden, May 25, 2019 at 13:32

You are free to look at awk that way if you want, @filbranden, but you'll be missing out on one of Unix's most useful scripting/glue tools.
– John1024, May 25, 2019 at 21:52
