How to preserve order of appearance when counting words in a word list file

Question

I have two files:

file1 contains a list of unique words
file2 contains several sentences

I want to output a tab-separated file with the number of occurrences in file2 of each word listed in file1, while preserving the order in which the words are listed in file1.

For example:

file 1:
```
dog 
apple
cat
```

file 2:

```
the dog played with the cat and the cat was white.
the boy ate the apple.
```

Desired output:
```
dog 1
apple 1
cat 2
```

I tried existing answers in the community, but they all sort the output.

In your example you should include some non-trivial cases, such as dog-fish and pineapple, contiguous words like dog dog in file2, and at least one word in file1 that doesn't appear in file2, e.g. rabbit. If you only show trivial sunny-day cases in your example, you greatly increase your chances of getting an answer that only works for trivial sunny-day cases.
– Ed Morton, Commented Jun 16, 2022 at 23:35

@EdMorton I liked "sunny day". When I saw the comment asking about Arabic, I liked it all over again.
– Paul_Pedant, Commented Jun 17, 2022 at 7:49

@M.A.G Don't significantly change a question (e.g. by introducing non-Latin words) after you have received answers. It was also not clear whether "output" meant the output you want but don't get, or the output you get but don't want. I rolled the question back to how it was before you got answers; just ask a new follow-up question if you have other character sets to consider.
– Ed Morton, Commented Jun 17, 2022 at 11:24

Ed Morton · Accepted Answer · 2022-06-16 23:38:09Z

Using any POSIX awk in any shell on every Unix box:

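```
$ cat tst.awk
BEGIN { OFS="\t" }
# First file: store each word under its line number to preserve the order.
NR==FNR {
    words[NR] = $1
    next
}
# Second file: turn every run of non-letters into two blanks so that
# adjacent matches of " word " cannot overlap, then count the matches.
{
    $0 = " " $0 " "
    gsub(/[^[:alpha:]]+/,"  ")
    for ( i in words ) {
        word = words[i]
        cnts[word] += gsub(" "word" ","&")
    }
}
# Print the counts in the order the words were read from the first file.
END {
    for ( i=1; i in words; i++ ) {
        word = words[i]
        print word, cnts[word]+0
    }
}
```

```
$ awk -f tst.awk file1 file2
dog     1
apple   1
cat     2
```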

The above assumes that words consist only of alphabetic characters, that matching should be case-sensitive (or that the input is all lower case, as in your example), and that the words in file1 are unique, as in your example.
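To see how the script behaves on the kind of non-trivial input suggested in the comments, here is a minimal sketch that runs the same awk logic on made-up data. The word list and the two sentences are hypothetical and are not from the question; they cover a repeated word (`dog dog`), a hyphenated word (`dog-fish`), a longer word that merely contains a listed word (`pineapple`), and a listed word that never appears (`rabbit`).

```shell
#!/bin/sh
# Sketch only: the input below is hypothetical edge-case data.
tmp=$(mktemp -d) || exit 1

# Word list; "cat" and "rabbit" never appear in the sentences.
printf '%s\n' dog apple cat rabbit > "$tmp/file1"

# Sentences with a repeated word, a hyphenated word, and a longer
# word that contains a listed word.
printf '%s\n' 'the dog dog chased a dog-fish.' \
              'pineapple is not an apple.' > "$tmp/file2"

# Same logic as tst.awk, passed inline instead of via -f.
out=$(awk '
BEGIN { OFS = "\t" }
NR == FNR { words[NR] = $1; next }
{
    $0 = " " $0 " "
    gsub(/[^[:alpha:]]+/, "  ")
    for (i in words) {
        word = words[i]
        cnts[word] += gsub(" " word " ", "&")
    }
}
END {
    for (i = 1; i in words; i++) {
        word = words[i]
        print word, cnts[word] + 0
    }
}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

This should print `dog 3`, `apple 1`, `cat 0` and `rabbit 0`, tab-separated and in word-list order. Because the hyphen is a non-alphabetic character, `dog-fish` is split into `dog` and `fish` and so counts as an occurrence of `dog`, while `pineapple` does not count as `apple`. If hyphenated words should be treated as single words, the character class in the first `gsub()` would need to change.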

edited Jun 16, 2022 at 23:38

answered Jun 16, 2022 at 23:23


Thank you! I have edited my question, though: what if the text contains punctuation and non-Latin words (Arabic)?
– M.A.G, Commented Jun 17, 2022 at 7:35

Then ask a new follow-up question.
– Ed Morton, Commented Jun 17, 2022 at 11:21


jubilatious1 · Answer · 2022-06-23 00:55:18Z

Using Raku (formerly known as Perl_6)

Below is a general solution for matching lines of one file (saved as @a array) against lines of a second file (saved as @b array), counting occurrences (i.e. Bagging):

```
raku -e 'my @a = dir(test => "alphabet.txt").IO.lines.reverse; my @b = $*ARGFILES.lines;  \
         for @a -> $a {@b.grep(/<$a>/).Bag.pairs.say};'  alphabet.txt alphabet.txt
```

In constructing @a, Raku is given a dir() location and a test => "…" filename. In constructing @b, one-or-more files are entered on the command line, and read off via Raku's $*ARGFILES dynamic variable.

The general input is alphabet.txt, which has one letter per line and is reversed immediately upon reading into Raku, placing the array in "z".."a" order.

General Output (when two copies of "a".."z" alphabet.txt are entered on the command-line):

```
(z => 2)
(y => 2)
(x => 2)
(w => 2)
(v => 2)
(u => 2)
(t => 2)
(s => 2)
(r => 2)
(q => 2)
(p => 2)
(o => 2)
(n => 2)
(m => 2)
(l => 2)
(k => 2)
(j => 2)
(i => 2)
(h => 2)
(g => 2)
(f => 2)
(e => 2)
(d => 2)
(c => 2)
(b => 2)
(a => 2)
```

Note how the return stays in the same order as the @a array, and how Raku doesn't require a sort call to produce the output above.

Finally, to solve the OP's issue, all that has to change in the code above is to use `my @b = $*ARGFILES.lines.words` instead of `my @b = $*ARGFILES.lines`.

[To obtain tab-separated output use .put instead of .say in the code above. This drops the surrounding parens and the => arrow between the two columns].

Final Code:

```
~$ raku -e 'my @a = dir(test => "dog_apple_cat.txt").IO.lines.grep(*.chars);  \
            my @b = $*ARGFILES.lines.words; for @a -> $a {  \
            @b.grep(/<$a>/).Bag.pairs.put};' text.txt
dog 1
apple.  1
cat 2
```

https://docs.raku.org/type/Bag
https://raku.org
