I have 250 strings, and I need to count the number of times each one appears on every line of each of my 400 files (each up to 20,000 lines). Example of strings:
journal
moon pig
owls
Example of one file:
This text has journal and moon pig
This text has owls and owls
Example output:
1 0
1 0
0 2
EDIT: each row corresponds to one search string; column one holds the counts for the first line of the file, and column two the counts for the second line.
I have working code, but it's obviously very slow. I'm sure awk could speed it up, but I'm not good enough at awk to write it.
for file in folder/*
do
    name=$(basename "$file" .txt)
    linenum=1
    # IFS= and -r keep leading whitespace and backslashes intact
    while IFS= read -r line
    do
        while IFS= read -r searches
        do
            # count every time the string appears on this line and save
            count=$(printf '%s\n' "$line" | grep -oi "$searches" | wc -l)
            echo "$count" >> "out/${name}_${linenum}.txt"
        done < strings.txt
        linenum=$((linenum+1))
    done < "$file"
done
EDIT: I do 400 pastes like this, where x is the number of lines in the original file.
paste out/file1_{1..x}.txt > out/file1_all.txt
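One caveat with that paste: `{1..x}` only brace-expands when both ends are literal numbers, so it does not work if x is a shell variable holding the line count. A sketch of one workaround using `seq` to build the file list (the two-line sample data here is made up to mirror the question's out/ layout):

```shell
#!/bin/sh
# Hypothetical per-line count files mirroring the question's layout
mkdir -p folder out
printf '%s\n' 'line one' 'line two' > folder/file1.txt
printf '%s\n' 1 1 0 > out/file1_1.txt   # counts for line 1
printf '%s\n' 0 0 2 > out/file1_2.txt   # counts for line 2

# {1..x} does not expand when x is a variable, so build the list with seq
x=$(wc -l < folder/file1.txt)
paste $(seq -f 'out/file1_%g.txt' 1 "$x") > out/file1_all.txt
```

Note that paste joins columns with tabs, not spaces, by default.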
Does anyone know how to speed this up?
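One way this could be restructured in awk (a sketch, assuming the 250 strings are meant as literal text rather than regular expressions, which is how `index()` matches below): read strings.txt once, then make a single pass over each file, appending one count per line to one row per string. That replaces both the inner grep-per-string loop and the later paste step. The file layout (`strings.txt`, `folder/`, `out/`) follows the question; the sample data is the question's own example so the sketch runs as-is:

```shell
#!/bin/sh
# Sample layout from the question, so the sketch is runnable as-is:
mkdir -p folder out
printf '%s\n' 'journal' 'moon pig' 'owls' > strings.txt
printf '%s\n' 'This text has journal and moon pig' \
              'This text has owls and owls'        > folder/file1.txt

# One awk pass per file instead of one grep/wc fork per string per line.
for file in folder/*.txt; do
    name=$(basename "$file" .txt)
    awk '
        # first input file (strings.txt): remember each string, lowercased
        NR == FNR { pat[++n] = tolower($0); next }
        {
            line = tolower($0)              # case-insensitive, like grep -i
            for (i = 1; i <= n; i++) {
                c = 0; s = line
                # count non-overlapping occurrences of pat[i] in this line
                while ((pos = index(s, pat[i])) > 0) {
                    c++
                    s = substr(s, pos + length(pat[i]))
                }
                row[i] = (row[i] == "" ? "" : row[i] " ") c
            }
        }
        END { for (i = 1; i <= n; i++) print row[i] }   # one row per string
    ' strings.txt "$file" > "out/${name}_all.txt"
done

cat out/file1_all.txt
# prints:
# 1 0
# 1 0
# 0 2
```

The work is still O(lines x strings) per file, like the original, but it avoids forking grep and wc roughly 20,000 x 250 times per file, which is where most of the time goes.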
Comment: it's unclear to me what the columns in the example output are meant to correlate to. Are the two sample lines meant to come from two files, rather than one?