Remove lines from a file based on patterns in another file which may partially match a particular column in first file

Question

I have searched for similar answers, but none solve the partial match problem. The patterns are in file2, and the lines to be removed are in file1.csv, which is a fairly big file with far more columns than those shown here.

I have the following fields in file1.csv:

UPDATE:

Linking page,Last crawled
https://start.me/discover/be/entertainment/betting?locale=ro,"Nov 17, 2018"
https://imgcop.com/img/Bwin-Mobile-App-77898390/,"Nov 17, 2018"
https://start.me/site/unibet.be?locale=fr,"Nov 17, 2018"
https://poker.partypoker402.com/en/blog/matt-savage-talks-wpt500.html,"Nov 17, 2018"

file2 contains:

https://roulette2.tk
paradisebingo.t
paradisebingo.tm
free-bwin.ro
sb288.co

OUTPUT
Linking page,Last crawled
Linking page,Last crawled
Linking page,Last crawled
Linking page,Last crawled
Linking page,Last crawled
https://start.me/discover/be/entertainment/betting?locale=ro,"Nov 17, 2018"
https://start.me/discover/be/entertainment/betting?locale=ro,"Nov 17, 2018"
https://start.me/discover/be/entertainment/betting?locale=ro,"Nov 17, 2018"
https://start.me/discover/be/entertainment/betting?locale=ro,"Nov 17, 2018"
https://start.me/discover/be/entertainment/betting?locale=ro,"Nov 17, 2018"
etc....

The output is being repeated. I am not sure what is wrong.

awk 'FNR == NR{ neg[$1]; next } { for ( i in neg ) if ( $1 !~ i) print }' file2.txt FPAT='([^,]*)|("[^"]+")' file1.csv > out.csv

I tried the awk command above but can't get it to work. For some strange reason grep fails:

grep -vwF -f file2 file1.csv > output.csv

Your grep command works for me. I see no reason why you have to use the -w and -F parameters.
– finswimmer, Commented Jan 28, 2019 at 5:15

Inian · Accepted Answer · 2019-01-28 05:21:39Z

What you have looks like a decent attempt, but the regex-match clause does not work the way you intended. In $2 !~ neg[$1], evaluated on file1, you are trying to look up the value of neg['156398439'], because $1 is taken from the second file on the command line (file1), not the first (file2). So your condition would never match.

You can do something like the following, moving the regex comparison into the action for file1 and looping over the patterns:

awk 'FNR == NR { neg[$1]; next }{ for ( i in neg ) if ( $2 !~ i) print  }' file2 FS="," file1

Also, I don't think FS can take that complex a regular expression to delimit CSV files. Remember that FS defines the delimiter to split on, not what the fields look like. Your expression describes how a field should look. GNU awk provides another variable, FPAT, to define fields with such a regular expression.

You can use

awk 'FNR == NR { neg[$1]; next }{ for ( i in neg ) if ( $2 !~ i) print  }' file2 FPAT='([^,]*)|("[^"]+")' file1

edited Jan 28, 2019 at 5:21

answered Jan 28, 2019 at 5:10


The second awk command takes forever, and after cancelling it I find that each line is printed multiple times in the output file (when redirected). I have made sure that the output file is named differently from file1. wc -l shows 636316 lines, whereas my input file has only 963 lines!
– Mallik Kumar, Commented Jan 28, 2019 at 5:30

@MallikKumar: I've tested the answer on your given input. It seems to work fine.
– Inian, Commented Jan 28, 2019 at 5:43

This works perfectly when file2 is only one line long. In general, however, it prints a line from file1 if any pattern from file2 doesn’t match it, and so prints each line from file1 n or n − 1 times. You should remove (i.e., suppress) a line if any pattern matches it, and print a line only if none of the patterns match it.
– G-Man Says 'Reinstate Monica', Commented Jan 28, 2019 at 6:52

I have updated my question with real data and the results of @G-Man's solutions as well.
– Mallik Kumar, Commented Jan 28, 2019 at 7:06


G-Man Says 'Reinstate Monica' · Accepted Answer · 2019-02-04 07:59:59Z

Inian’s answer works perfectly when file2 is only one line long, and is a good start on a more general answer. But I believe that

awk 'FNR == NR { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' file2 FS="," file1

will do what you want in general. Like your attempt, it starts by reading file2 and storing its contents (the patterns that you want to remove from file1) in an array. Like Inian’s answer, it then reads file1. For each line in file1, it loops through the patterns from file2. We assume that the line is OK; if it matches any pattern, then it’s not OK. If it is still OK after checking all the patterns, we print it.
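
To make the difference between the two loops concrete, here is a minimal sketch on made-up data. The file names patterns.txt and data.csv and their contents are assumptions for illustration, not the asker's real files.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'free-bwin.ro' 'sb288.co' 'roulette2.tk' > "$tmp/patterns.txt"
printf '%s\n' '1,"http://free-bwin.ro/a","x"' '2,"http://example.org/","y"' > "$tmp/data.csv"

# Print inside the loop: one copy of the line per pattern that does NOT match.
loop_count=$(awk 'FNR == NR { neg[$1]; next } { for (i in neg) if ($2 !~ i) print }' \
    "$tmp/patterns.txt" FS="," "$tmp/data.csv" | awk 'END { print NR }')

# Flag version: check every pattern first, then print at most once.
flag_out=$(awk 'FNR == NR { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' \
    "$tmp/patterns.txt" FS="," "$tmp/data.csv")

echo "loop version printed $loop_count lines"
echo "flag version printed: $flag_out"
rm -rf "$tmp"
```

With three patterns, the loop version prints the matching line twice and the non-matching line three times (five lines in total), while the flag version prints only the second line, once.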

But I put FS="," as an argument between file2 and file1 just because that’s the way Inian did it. It doesn’t matter what field separator we use when we read file2, as long as it doesn’t appear therein — and file2 contains no commas. So we could simplify the above a little by specifying the field separator the ‘normal’ way — with a -F option at the beginning of the command:

awk -F, 'FNR == NR { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' file2 file1

You can use -F"," if you prefer; they’re equivalent.

The test FNR == NR is so popular and pervasive that we use it without thinking. FNR is the line number (a.k.a. record number) within the current file, and NR is the line number across all input. So, for example,

$ cat cats
Felix
Garfield
Heathcliff

$ cat dogs
Lassie
Marmaduke
Snoopy

$ awk '{ print FNR, NR, $0 }' cats dogs
1 1 Felix
2 2 Garfield
3 3 Heathcliff
1 4 Lassie
2 5 Marmaduke
3 6 Snoopy

… and so FNR and NR are equal for each line of the first file to be processed, and not in subsequent file(s). And so we use FNR == NR to test whether we are processing the first file.

But this is actually a bad practice. What if the first file is empty?

$ cat unicorns

$ wc unicorns
      0       0       0 unicorns

$ awk '{ print FNR, NR, $0 }' unicorns dogs
1 1 Lassie
2 2 Marmaduke
3 3 Snoopy

FNR == NR is true for the first file that actually has data. If your file2 will never ever ever be empty, you may be able to get away with ignoring this issue. But, based on the definition of your problem, if file2 is empty, the output should be all of file1, because we aren’t removing anything. But, if you run the above command with an empty file2, you will get no output, because awk thinks it’s reading the first file (file2) when it’s actually reading the second file (file1).
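
The failure mode is easy to reproduce. The following is only a sketch; the zero-byte empty.txt and the two-line data.csv are invented for the demonstration.

```shell
tmp=$(mktemp -d)
: > "$tmp/empty.txt"    # zero-byte pattern file (hypothetical)
printf '%s\n' '1,"http://a.example/","x"' '2,"http://b.example/","y"' > "$tmp/data.csv"

# FNR == NR stays true while data.csv is read, so every line is stored
# as a "pattern" and skipped by next; nothing reaches the print.
fnr_lines=$(awk -F, 'FNR == NR { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' \
    "$tmp/empty.txt" "$tmp/data.csv" | awk 'END { print NR }')

echo "lines printed: $fnr_lines"
rm -rf "$tmp"
```

Both input lines should have survived, since there is nothing to remove, yet the count comes out as zero.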

A safer way to do this is to put an assignment between the file arguments:

awk -F, 'FILE != 2 { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' file2 FILE=2 file1
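
As a rough check of this variant, the sketch below feeds it an empty pattern file. FILE is an ordinary user-chosen variable here, not an awk built-in, and empty.txt and data.csv are again made-up names.

```shell
tmp=$(mktemp -d)
: > "$tmp/empty.txt"    # zero-byte pattern file (hypothetical)
printf '%s\n' '1,"http://a.example/","x"' '2,"http://b.example/","y"' > "$tmp/data.csv"

# FILE is unset while the first file is read; awk applies the FILE=2
# assignment when it reaches that argument, before opening data.csv.
file_lines=$(awk -F, 'FILE != 2 { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' \
    "$tmp/empty.txt" FILE=2 "$tmp/data.csv" | awk 'END { print NR }')

echo "lines printed: $file_lines"
rm -rf "$tmp"
```

Because the switch no longer depends on record counts, both lines of data.csv are printed even though the first file contributed no records.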

The question is a little ambiguous. What does “partial match” mean, exactly? Inian chose to interpret it in the sense that the question suggests — like grep. If any value from file2 matches the value from the second column of file1 as a regular expression, then remove that line of file1. But there are two problems with this.

1. The surprise factor. I took the files in the question and added a
```
154376352,"http://sb288eco.tm","example4"
```
line to file1, and ran my first command. That "example4" line was not output, because sb288.co (from file2), taken as a regular expression (in which . means “match any character”), matched sb288eco.

If that’s what you want and expect to happen, you might as well stop reading this now.

2. Regular expression processing is computationally expensive. Regular expressions have to be parsed and processed. This will likely take more time than simple string comparison.

We can solve both of the above issues by testing whether the string from file2 is present in the value from file1 with awk’s index function:

awk -F, 'FILE != 2 { neg[$1]; next } { ok=1; for (i in neg) if (index($2,i) > 0) ok=0; if (ok) print }' file2 FILE=2 file1

With the above, a . in file2 matches only a . in file1, and not any other character. I invite you to test the above on your data and see whether it is any faster.
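
The two behaviours can be compared side by side. This sketch uses the sb288.co pattern and the "example4" line mentioned above; the file names patterns.txt and data.csv are assumptions.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'sb288.co' > "$tmp/patterns.txt"
printf '%s\n' '154376352,"http://sb288eco.tm","example4"' > "$tmp/data.csv"

# Regex match: the . in sb288.co matches the e in sb288eco, so the line is dropped.
regex_out=$(awk -F, 'FILE != 2 { neg[$1]; next } { ok=1; for (i in neg) if ($2 ~ i) ok=0; if (ok) print }' \
    "$tmp/patterns.txt" FILE=2 "$tmp/data.csv")

# Literal substring test: sb288.co does not occur in the URL, so the line is kept.
index_out=$(awk -F, 'FILE != 2 { neg[$1]; next } { ok=1; for (i in neg) if (index($2,i) > 0) ok=0; if (ok) print }' \
    "$tmp/patterns.txt" FILE=2 "$tmp/data.csv")

echo "regex kept: [$regex_out]"
echo "index kept: [$index_out]"
rm -rf "$tmp"
```

The regex version prints nothing, while the index version keeps the line.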

P.S. I just noticed that you changed the file format since I posted my answer. Originally you wanted to test the values from file2 against values from the second column of file1. Now you seem to want to test against values from the first column of file1. To accommodate this change, you should take the part of any of the above answers that compares $2 to i, and change it to use $1 instead. Or, if you really want to test the entire line from file1, use $0.

So, bottom line, you might want to use

awk -F, 'FILE != 2 { neg[$1]; next } { ok=1; for (i in neg) if (index($1,i) > 0) ok=0; if (ok) print }' file2 FILE=2 file1

as your command. With line breaks for readability, that’s

awk -F, 'FILE != 2 { neg[$1]; next }
                   {
                     ok=1
                     for (i in neg)
                             if (index($1,i) > 0) ok=0
                     if (ok) print
                   }' \
        file2 FILE=2 file1
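
For a quick sanity check against the updated format, here is a sketch that uses two URL lines from the question. The pattern imgcop.com is made up so that something actually gets removed, and the files live in a scratch directory.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'free-bwin.ro' 'imgcop.com' > "$tmp/file2"    # imgcop.com is a hypothetical pattern
printf '%s\n' 'Linking page,Last crawled' \
    'https://start.me/site/unibet.be?locale=fr,"Nov 17, 2018"' \
    'https://imgcop.com/img/Bwin-Mobile-App-77898390/,"Nov 17, 2018"' > "$tmp/file1.csv"

# -F, also splits the quoted date at its comma, but only $1 is tested
# and print emits the whole original line, so that is harmless here.
kept=$(awk -F, 'FILE != 2 { neg[$1]; next }
                { ok=1; for (i in neg) if (index($1,i) > 0) ok=0; if (ok) print }' \
    "$tmp/file2" FILE=2 "$tmp/file1.csv")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

The header and the start.me line are kept, each exactly once, and the imgcop.com line is removed.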

The awk one-liner works... is there a way of speeding up the process?
– Mallik Kumar, Commented Jan 28, 2019 at 7:40
