1

I have two files:

  • file1 contains a list of unique words
  • file2 contains several sentences

I want to output a tab-separated file giving the number of times each word listed in file1 occurs in file2, while preserving the order in which the words are listed in file1.

For example:

  • file1:
    dog 
    apple
    cat
    
  • file2:
    the dog played with the cat and the cat was white.
    the boy ate the apple.
    
  • Desired output:
    dog 1
    apple 1
    cat 2
    

I tried existing answers in the community, but they all sort the output.
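
The question does not say which answers were tried. As an illustration only, here is a sketch of one common counting pattern that shows the behavior described; the scratch directory and the exact pipeline are assumptions, not something taken from the answers the asker refers to. Because `uniq -c` only collapses adjacent duplicates, the words have to be sorted first, and the order of file1 is lost.

```shell
#!/bin/sh
# Hypothetical scratch files mirroring the example above.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' dog apple cat > "$tmp/file1"
printf '%s\n' 'the dog played with the cat and the cat was white.' \
              'the boy ate the apple.' > "$tmp/file2"

# Split file2 into one word per line, keep only the words listed in file1,
# then count duplicates. uniq -c needs sorted input, so file1's order is lost.
tr -cs '[:alpha:]' '\n' < "$tmp/file2" |
    grep -Fx -f "$tmp/file1" |
    sort |
    uniq -c |
    awk -v OFS='\t' '{ print $2, $1 }'
```

On this input the pipeline prints apple, cat, dog (alphabetical) rather than dog, apple, cat, and a word from file1 that never occurs in file2 would be missing from the output altogether.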

3
  • 3
    In your example you should include some non-trivial cases like dog-fish and pineapple and contiguous words like dog dog in file2 and at least 1 case in file1 that doesn't appear in file2, e.g. rabbit. If you only show trivial sunny day cases in your example then you greatly increase your chances of getting an answer that only works for trivial sunny day cases. Commented Jun 16, 2022 at 23:35
  • 1
    @EdMorton I liked "sunny day". When I saw the comment asking about Arabic, I liked it all over again. Commented Jun 17, 2022 at 7:49
  • 1
    @M.A.G Don't significantly change any question (e.g. by introducing non-latin words) after you got answers and it's not clear if "output" is the output you want that you don't get or output you get that you don't want. I rolled this back as it was before you got answers, just ask a new followup question if you have other character sets to consider. Commented Jun 17, 2022 at 11:24

2 Answers

3

Using any POSIX awk in any shell on every Unix box:

$ cat tst.awk
BEGIN { OFS="\t" }
NR==FNR {
    words[NR] = $1
    next
}
{
    $0 = " " $0 " "
    gsub(/[^[:alpha:]]+/,"  ")
    for ( i in words ) {
        word = words[i]
        cnts[word] += gsub(" "word" ","&")
    }
}
END {
    for ( i=1; i in words; i++ ) {
        word = words[i]
        print word, cnts[word]+0
    }
}

$ awk -f tst.awk file1 file2
dog     1
apple   1
cat     2

The above assumes that every "word" consists only of alphabetic characters, that you want the matches to be case-sensitive (or that the input is all lower case, as in your example), and that the words in file1 are unique, as in your example.
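
If case-insensitive matching were wanted instead, one possible variant is sketched below. It is an adaptation, not part of the original script: the file name tst_ci.awk, the scratch directory, and the sample input (built from the edge cases suggested in the comments: dog-fish, pineapple, dog dog, and the absent word rabbit) are all hypothetical. The only change in logic is folding both the sentence and the word to lower case before matching.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample files, for illustration only.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' dog apple cat rabbit > "$tmp/file1"
printf '%s\n' 'The dog-fish saw a Dog dog eat pineapple.' \
              'The cat ate the apple; the CAT slept.' > "$tmp/file2"

cat > "$tmp/tst_ci.awk" <<'EOF'
BEGIN { OFS="\t" }
NR==FNR {                       # first file: remember the words in order
    words[NR] = $1
    next
}
{
    $0 = " " tolower($0) " "    # assumption: fold the sentence to lower case
    gsub(/[^[:alpha:]]+/,"  ")  # two blanks, so adjacent words don't share one
    for ( i in words ) {
        word = tolower(words[i])
        cnts[words[i]] += gsub(" "word" ","&")   # "&" leaves $0 unchanged
    }
}
END {
    for ( i=1; i in words; i++ ) {               # file1 order, zero if unseen
        print words[i], cnts[words[i]]+0
    }
}
EOF

awk -f "$tmp/tst_ci.awk" "$tmp/file1" "$tmp/file2"
```

This prints dog 3, apple 1, cat 2, rabbit 0, tab-separated and in file1 order. Note that dog-fish counts as an occurrence of dog, because every non-alphabetic character (the hyphen included) is treated as a word separator, while pineapple does not count as apple.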

2
  • Thank you! I have edited my question though, what if punctuation exists and non latin words (Arabic)? Commented Jun 17, 2022 at 7:35
  • 1
    Then ask a new followup question. Commented Jun 17, 2022 at 11:21
0

Using Raku (formerly known as Perl_6)

Below is a general solution for matching lines of one file (saved as @a array) against lines of a second file (saved as @b array), counting occurrences (i.e. Bagging):

raku -e 'my  @a =  dir(test => "alphabet.txt").IO.lines.reverse; my @b = $*ARGFILES.lines;  \
         for @a -> $a {@b.grep(/<$a>/).Bag.pairs.say};'  alphabet.txt alphabet.txt

In constructing @a, Raku is given a dir() location and a test => "…" filename. In constructing @b, one-or-more files are entered on the command line, and read off via Raku's $*ARGFILES dynamic variable.

The general input is alphabet.txt, one letter per line, reversed immediately upon reading into Raku to place the array in "z".."a" order.

General Output (when two copies of "a".."z" alphabet.txt are entered on the command-line):

(z => 2)
(y => 2)
(x => 2)
(w => 2)
(v => 2)
(u => 2)
(t => 2)
(s => 2)
(r => 2)
(q => 2)
(p => 2)
(o => 2)
(n => 2)
(m => 2)
(l => 2)
(k => 2)
(j => 2)
(i => 2)
(h => 2)
(g => 2)
(f => 2)
(e => 2)
(d => 2)
(c => 2)
(b => 2)
(a => 2)

Note how the return stays in the same order as the @a array, and how Raku doesn't require a sort call to produce the output above.

Finally, to solve the OP's problem, the only change needed to the code above is to use my @b = $*ARGFILES.lines.words instead of my @b = $*ARGFILES.lines.

[To obtain tab-separated output use .put instead of .say in the code above. This drops the surrounding parens and the => arrow between the two columns].

Final Code:

~$ raku -e 'my @a = dir(test => "dog_apple_cat.txt").IO.lines.grep(*.chars);  \
            my @b = $*ARGFILES.lines.words; for @a -> $a {  \
            @b.grep(/<$a>/).Bag.pairs.put};' text.txt
dog 1
apple.  1
cat 2

https://docs.raku.org/type/Bag
https://raku.org
