awk dynamic string matching

Question

I have two files: (a) a list file in which each line gives a name followed by the file in which that name occurs, and (b) the file itself, in which I want to find the name and extract the two words before and the two words after it.

Snapshot of first file

Ito 65482.txt
David Juno Ilrcwrry Hold 73586.txt
David Jones 73586.txt
Jacob FleUchbautr 73586.txt

The name is one or more space-separated words, as above; the last field on each line is the filename.

Snapshot of file 65482.txt (it contains garbled OCR text)

nose just brnukiiitt tip tinwallfin the golden 
path of Ito etmlmbimiiit tlmmgli the trees 
Butt It as tie not intra and plcturosiiiicness 
limit wo were of m that is not altogether We 
and hunting and llslilng In plenty anti lit lIly

Desired output in the format

Ito path of etmlmbimiiit tlmmgli

i.e. the name, followed by the two words before and the two words after the match.

#!/bin/bash
fPath='/Users/haimontidutta/Research/IIITD/Aayushee/Code/Source-Code/Thesis/src/ReputedPersonDetection/data/OutputofNERFinal_v1a.txt'
echo "Enter Script"

while IFS=' ' read -ra arr
do 
 fname="${arr[${#arr[@]}-1]}"
 #echo $fname
 name=""
 for((idx=0; idx<$[${#arr[@]}-1]; ++idx))
 do
  name=$name" ${arr[idx]}"
 done
 #echo $name 
 filepath='/Users/haimontidutta/Research/IIITD/Aayushee/Code/Source-Code/Thesis/src/ReputedPersonDetection/data/final/'$fname
 #echo $fName
 #echo $filepath

 #Extract window around name
 awk -v nm="$name" '{
     for(i=1;i<=NF;i++)
     {
       #print $i 
       if($i~$nm)
       {
        print nm OFS  $(i-2) OFS $(i-1) OFS $(i+1) OFS $(i+2); exit;
      }}}' $filepath
done < $fPath

I am able to extract the name and filepath, but in the awk statement the dynamic matching of the name is failing and the window cannot be obtained.

How do I do this?

The name match should be "David Juno" and not just "Juno".
– Haimonti Dutta, Commented May 13, 2021 at 20:07

Ed Morton · Accepted Answer · 2021-05-15 12:49:20Z

2

Using GNU awk for arrays of arrays:

$ cat tst.awk
NR==FNR {
    file = $NF
    name = $1 (NF>2 ? " " $2 : "")
    if ( !(file in file2names) && ((getline line < file) > 0) ) {
        close(file)
        ARGV[ARGC++] = file
    }
    file2names[file][name]
    next
}
{
    $0 = " " $0 " "
    for (name in file2names[FILENAME]) {
        if ( pos = index($0," "name" ") ) {
            n = split(substr($0,1,pos),bef)    # n = number of words before the name
            split(substr($0,pos+length(name)+1),aft)
            print name, bef[n-1], bef[n], aft[1], aft[2]
        }
    }
}

$ awk -f tst.awk file
Ito path of etmlmbimiiit tlmmgli

If you actually want all pre-filename strings from "file" to be part of the name instead of just the first 1 or 2 strings (see the comments below) then just change:

name = $1 (NF>2 ? " " $2 : "")

to this in gawk:

name = gensub(/\s+\S+$/,"",1)

or this in any awk:

name = $0
sub(/ +[^ ]+$/,"",name)

With any other awk you'd store the names for a file in a single string, joined by a separator that cannot occur in a name (a plain space would break multi-word names such as "David Jones", so use e.g. SUBSEP). Instead of file2names[file][name] you'd do file2names[file] = (file in file2names ? file2names[file] SUBSEP : "") name, and then split that string before looping: instead of for (name in file2names[FILENAME]) you'd do n = split(file2names[FILENAME],names,SUBSEP); for (i=1; i<=n; i++) { name = names[i]; ... }
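
A minimal sketch of that portable variant, under assumptions that are not part of the answer itself: names are joined with SUBSEP, everything before the last field is treated as the name, and the wrapper function name_windows and the array names are illustrative only. It prints the two words nearest the match on each side and leaves those fields empty when the line has fewer.

```shell
# name_windows LISTFILE
# Hypothetical wrapper: LISTFILE holds "name words... filename" lines.
name_windows() {
    awk '
    NR == FNR {                          # first input: the list of names
        if (NF < 2) next                 # skip blank or name-less lines
        file = $NF
        name = $0
        sub(/[ \t]+[^ \t]+[ \t]*$/, "", name)   # drop the filename field
        sub(/^[ \t]+/, "", name)
        if (!(file in names)) {
            # queue each readable data file once, to be read after the list
            if ((getline tmp < file) > 0)
                ARGV[ARGC++] = file
            close(file)
            names[file] = name
        } else {
            names[file] = names[file] SUBSEP name
        }
        next
    }
    {
        padded = " " $0 " "
        n = split(names[FILENAME], list, SUBSEP)
        for (k = 1; k <= n; k++) {
            pos = index(padded, " " list[k] " ")
            if (pos) {
                nb = split(substr(padded, 1, pos), bef, " ")
                split(substr(padded, pos + length(list[k]) + 1), aft, " ")
                print list[k], bef[nb-1], bef[nb], aft[1], aft[2]
            }
        }
    }' "$1"
}
```

Run from the directory that holds the data files, name_windows first.file would be expected to print one line per name found, in the same "name before before after after" layout as the gawk version.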

The input file above is just the first file in your example.

edited May 15, 2021 at 12:49

answered May 14, 2021 at 14:45


This will only match one- or two-word names (like "Ito" or "David Jones"), not longer names ("David Juno Ilrcwrry Hold"). Use something like file=$NF; NF--; name=$0 instead.
– cas, Commented May 15, 2021 at 4:52
@cas: right, but it remains unclear from the question whether everything preceding $NF in "file" should be considered the "name" or not.
– Cbhihe, Commented May 15, 2021 at 7:20
@Cbhihe it's very clear that the file contains a name (with one or more words in the name) and a filename, with the filename being the last field. There's nothing in the question to even suggest that only the first 1 or 2 words of the name are needed. That notion seems to have been introduced by some comments confusing the input file format with the two words (before & after) wanted in the output. Even the OP's sh code is extracting the name from all but the last field, which directly contradicts that notion.
– cas, Commented May 15, 2021 at 7:33
@cas: Reviewed OP's code: `for((idx=0; idx<$[${#arr[@]}-1]; ++idx)); do name=$name" ${arr[idx]}"; done` ... You were right.
– Cbhihe, Commented May 15, 2021 at 8:36
@cas Unlike NF++, NF-- is undefined behavior per POSIX. Some awks will delete that last field, others will ignore it, others still could do anything else. The OP specifically only wants 1 or 2 word names, see the comments under the question: when asked about the line containing "David Juno Ilrcwrry Hold" they said "David Juno" should be matched. Obviously it's a trivial tweak (sub(/[[:space:]]+[^[:space:]]+$/,"")) if I'm misreading that, but the OP has accepted my answer so I think I got it right.
– Ed Morton, Commented May 15, 2021 at 12:35

glenn jackman · Answer · 2021-05-13 21:22:42Z

Given input files:

$ cat first.file
Ito 65482.txt
David Juno Ilrcwrry Hold 73586.txt
David Jones 73586.txt
Jacob FleUchbautr 73586.txt

$ cat 65482.txt
nose just brnukiiitt tip tinwallfin the golden
path of Ito etmlmbimiiit tlmmgli the trees
Butt It as tie not intra and plcturosiiiicness
limit wo were of m that is not altogether We
and hunting and llslilng In plenty anti lit lIly

$ cat 73586.txt
Lorem ipsum David Jones dolor sit amet, consectetur adipiscing elit. Curabitur non ultrices tellus. Donec porttitor sodales mattis. Nulla eu ante eget libero dictum accumsan nec non odio. Nullam lobortis porttitor mauris a feugiat. Vestibulum ultrices ipsum at maximus consequat. Vivamus molestie Jacob FleUchbautr tortor ac felis varius gravida. Cras accumsan dolor at velit sodales auctor. Vestibulum sit amet scelerisque eros, quis porta orci. Donec eget erat dolor. Integer id vestibulum massa. Quisque lacus risus, venenatis nec euismod nec, ultrices sed mi. Proin tincidunt ipsum mattis lectus pulvinar interdum. Suspendisse convallis justo iaculis, semper nisl at, imperdiet ante.
# ..........^^^^^^^^^^^..................................................................................................................................................................................................................................................................................^^^^^^^^^^^^^^^^^

then:

mapfile -t files < <(awk '{print $NF}' first.file | sort -u)

word='[^[:blank:]]+'

for file in "${files[@]}"; do
    mapfile -t names < <(grep -wF "$file" first.file | sed -E 's/ [^ ]+$//')
    pattern="($word $word) ($(IFS='|'; echo "${names[*]}")) ($word $word)"
    declare -p file pattern
    grep -oE "$pattern" "$file" | sed -E "s/$pattern/\\2 \\1 \\3/"
done

outputs

declare -- file="65482.txt"
declare -- pattern="([^[:blank:]]+ [^[:blank:]]+) (Ito) ([^[:blank:]]+ [^[:blank:]]+)"
Ito path of etmlmbimiiit tlmmgli
declare -- file="73586.txt"
declare -- pattern="([^[:blank:]]+ [^[:blank:]]+) (David Juno Ilrcwrry Hold|David Jones|Jacob FleUchbautr) ([^[:blank:]]+ [^[:blank:]]+)"
David Jones Lorem ipsum dolor sit
Jacob FleUchbautr Vivamus molestie tortor ac

That regular expression requires two words to appear both before and after the name on the same line. If the name has fewer than two words on either side, for example at the start or end of a line, there is no match.
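
A quick way to see that limitation, as a sketch that reuses the word definition above with the pattern hard-coded to the name Ito. The helper count_matches and the shortened sample lines are made up for illustration; grep -c reports how many input lines match.

```shell
word='[^[:blank:]]+'
pattern="($word $word) (Ito) ($word $word)"

# count_matches: hypothetical helper, counts the stdin lines matching $pattern.
# "|| true" keeps the exit status zero when grep finds nothing.
count_matches() {
    grep -cE "$pattern" || true
}

# Two words on each side of the name: prints 1
printf '%s\n' 'path of Ito etmlmbimiiit tlmmgli' | count_matches

# Name at the start of the line: prints 0
printf '%s\n' 'Ito etmlmbimiiit tlmmgli the trees' | count_matches

# Only one word before the name: prints 0
printf '%s\n' 'of Ito etmlmbimiiit tlmmgli' | count_matches
```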

cas · Answer · 2021-05-15 07:38:37Z

This can be done in awk, but IMO is much easier to do in perl. And that's even before you consider that there are over 800 perl library modules for various natural language processing tasks in Lingua::*, which is what you seem to be doing.

The following perl script first builds up a commonly used perl data structure called a Hash-of-Arrays (HoA), using the filenames as the keys of an associative array (aka hash), with each key's value being an indexed array of names. See man perldsc for more info on HoA and other perl data structures.

The %files HoA would end up with data like:

{
  "65482.txt" => ["Ito"],
  "73586.txt" => ["David Juno Ilrcwrry Hold", "David Jones", "Jacob FleUchbautr"],
}

It also uses an array called @order to remember the order in which each filename was first seen, so that the files can be processed later in the same order. This is often useful because perl hashes, like their counterparts in many other languages, are inherently unordered. If you don't care about the order, you can just iterate over the keys of the hash instead.
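
For comparison, the same bookkeeping can be sketched in awk, where the iteration order of for (key in array) is likewise unspecified. This is an illustration only, not part of the perl script: the function name first_seen_order and the arrays seen and order are hypothetical.

```shell
# first_seen_order LISTFILE
# Hypothetical helper: print each filename (the last field) once,
# in the order it first appears in LISTFILE.
first_seen_order() {
    awk '
    NF && !($NF in seen) {
        seen[$NF]             # mark this filename as already recorded
        order[++n] = $NF      # remember its position
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i]
    }' "$1"
}
```

With the first file from the question as input, this would list 65482.txt and then 73586.txt.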

If a filename does not exist, it prints a warning message to STDERR and skips to the next line of the "first" file. The print STDERR ... line can be deleted or commented out if you don't want the warnings, or just redirect stderr to /dev/null when you run it.

Once it has finished building the %files HoA, it opens each file for read, creates and pre-compiles a regular expression matching any of the names wanted for that particular file, and prints every line matching the RE.

The regular expressions it builds would end up with values like:

(((\w+\s+){2})(David Juno Ilrcwrry Hold|David Jones|Jacob FleUchbautr)((\s+\w+){2}))

The reason for doing it this way is so that each filename only has to be processed once, and each line of each file only has to be examined once to see if it matches one of the names. If you have many files and/or if they are very large, this results in an enormous performance boost over the naive approach of reading and matching every line of every file repeatedly, once for every name listed in the "first" file - e.g. if you had 1000 files with 1000 lines each, and a total of 50 names to match, the naive method would have to read and match a line 50 million times (files * lines * names) instead of just 1 million times (files * lines)
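
The single-pass idea is not tied to perl. As a rough sketch in sh and awk, assuming that no name contains regex metacharacters and using a hypothetical helper called scan_once, all names for a file can be joined into one alternation so that each line is tested once rather than once per name:

```shell
# scan_once FILE NAME...
# Hypothetical helper: build one alternation from all NAMEs and test each
# line of FILE against it once, instead of once per name.
scan_once() {
    file=$1
    shift
    # Join the names with "|"; assumes no name contains regex metacharacters.
    alternation=$(IFS='|'; printf '%s' "$*")
    awk -v re="(^|[ \t])($alternation)([ \t]|\$)" '
    match($0, re) {
        hit = substr($0, RSTART, RLENGTH)
        gsub(/^[ \t]+|[ \t]+$/, "", hit)     # trim the boundary blanks
        nb = split(substr($0, 1, RSTART - 1), bef, " ")
        split(substr($0, RSTART + RLENGTH), aft, " ")
        print hit, bef[nb-1], bef[nb], aft[1], aft[2]   # first hit per line only
    }' "$file"
}
```

Unlike the perl script, this sketch reports at most one name per line and leaves the before/after fields empty when fewer than two words are available.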

The script is set up to make it easy to choose how to match the words before and after a matched name. Un-comment only one of the two my $count= lines in the script. The first strictly requires exactly two words before AND after each name - this is already un-commented. The second is relaxed about how many words can exist before or after a name (from 0 to 2).

#!/usr/bin/perl -l

use strict;
my %files = ();
my @order = ();

# Un-comment only one of the following two lines.
my $count=2;
#my $count='0,2';

# First, build up a HoA where the key is the filename and
# the value is an array of names to match in that file.
while(<>) {
  s/^\s+|\s+$//g;  # strip leading and trailing spaces (needs /g to do both)
  next if (m/^$/); # skip empty lines
  my ($name,$filename) = m/^(.*)\s+(.*)$/; # extract name and filename

  # warn about and skip filenames that don't exist
  if (! -e $filename) {
    print STDERR "Error found on $ARGV, line $.: '$filename' does not exist.";
    next;
  };

  # remember the order we first see each filename.
  push @order, $filename unless ( defined($files{$filename}) );

  # Add the name to the %files HoA
  push @{ $files{$filename} }, $name;
};

# Process each file once only, in order.
foreach my $filename (@order) {
  open(my $fh,"<",$filename) || die "Error opening $filename for read: $!\n";

  my $re = "(((\\w+\\s+){$count})(" .           # two words
           join('|',@{ $files{$filename} }) .   # the names
           ")((\\s+\\w+){$count}))";            # and two words again

  $re = qr/$re/;  # add an 'i' after '/' for case-insensitive

  while(<$fh>) {
    if (m/$re/) {
      my $found = join(" ",$4,$2,$5);
      $found =~ s/\s\s+/ /g;
      print $found
    };
  };
}

Save as, e.g. match.pl and make executable with chmod +x match.pl, and run like:

$ ./match.pl first.txt 
Error found on first.txt, line 2: '73586.txt' does not exist.
Error found on first.txt, line 3: '73586.txt' does not exist.
Error found on first.txt, line 4: '73586.txt' does not exist.
Ito path of etmlmbimiiit tlmmgli

BTW, it's not what you asked for, but I'd recommend printing the matching name separated from the found words with a colon (:) or anything other than a space. A tab is good too. This will make it much easier to parse the output file with other programs. i.e.

Ito:path of etmlmbimiiit tlmmgli

You can do this by changing the my $found = line to:

my $found = "$4:" . join(" ",$2,$5);

or

my $found = "$4\t" . join(" ",$2,$5);
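
To illustrate why a non-space separator helps, here is a small sketch of a downstream consumer of the tab-separated form. The helper name split_output and the sample line are hypothetical. Because multi-word names contain spaces, splitting on the tab is what keeps the name and the context words apart.

```shell
# split_output: hypothetical helper; reads "name<TAB>context" lines on stdin
# and shows the two parts separately.
split_output() {
    awk -F '\t' '{ printf "name=[%s] context=[%s]\n", $1, $2 }'
}

printf 'David Jones\tLorem ipsum dolor sit\n' | split_output
# prints: name=[David Jones] context=[Lorem ipsum dolor sit]
```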

With respect to "This can be done in awk, but IMO is much easier to do in perl": you keep posting about things being hard to do in awk when they aren't. Please consider the possibility that you are just more familiar with perl and so may not be aware of the simple ways to do some tasks in awk. I find it vastly easier to do anything in awk than perl, but I know that's because of my unfamiliarity with perl, not because of the tool/language.
– Ed Morton, Commented May 14, 2021 at 15:02
@EdMorton The reason I keep saying that is because it's true. Many things are easier in perl, much easier, because perl has built-in functions (like join and map and splice) that make string and array handling easier: no need to write your own join function, or rewrite the same tedious for (i=1;i<=NF;i++) loops all the time. And then there's the huge library of CPAN modules. How many lines of awk code would it take to write my own CSV or JSON or XML or HTML parser, for example? A lot more than just use Text::CSV, use JSON, etc. Nothing like CPAN exists for awk.
– cas, Commented May 14, 2021 at 15:46
Please consider that YOU are just unfamiliar with perl. I'm very familiar with both awk (~30 years) and perl (25+ years), and know the strengths and limitations of both. There really are many things that are trivially easy to do in perl that require significantly more effort in awk: copy/pasting and editing, or just retyping, the same boiler-plate code that you've used in the last few hundred awk scripts. I still write a lot more awk than perl, but mostly for simple stuff of a dozen lines or so. Anything more complex I'll do in perl because I can do it in a fraction of the time.
– cas, Commented May 14, 2021 at 15:51
BTW, I didn't say "hard to do in awk". I said "much easier to do in perl". Those two sentences have different meanings. What I generally think about it is "some things are a PITA to do in awk, but not in perl". And I tell people "use perl" when it's appropriate for exactly the same reasons I tell people "use awk, not a shell while read loop" when that is appropriate. Use the right/best tool for the job.
– cas, Commented May 14, 2021 at 15:54
@HaimontiDutta cool, teaching useful stuff was the reason I posted my answer. It's why I put in the effort to explain what the script does and how it does it and why. That's more important than just the code.
– cas, Commented May 15, 2021 at 4:17
