Using awk, I need to check the number in a specific column of each line in one variable against two specific columns of every line in another variable, keeping the lines from the first variable that meet the criteria.

My attempts so far to do this in a single awk command have failed. I could obviously do it with an external loop, but that would be very slow because I have hundreds of thousands of lines to check. I appreciate any and all help with solving this. I am always looking to improve my use of awk, so an explanation alongside any solution would be great, so that I can learn from it.

Here is an example:

  • Let's say I want to print only the lines from ${ListToCheckFrom} whose column 2 value is > column 2 and < column 3 of any line in ${ListToCheckAgainst}.

  • Input example:

    ListToCheckFrom="C,2
    C,22
    C,12
    hr,15"

    ListToCheckAgainst="C1,25,50
    hr1,22,30
    r,12,18
    C,15,44"
    
  • Expected output:

    C,22
    hr,15
    

2 Answers

Since you have tagged the question with bash, you can make use of process substitution to read the shell variables like input files. The following script snippet should do:

#!/bin/bash

ListToCheckFrom="C,2
C,22
C,12
hr,15"

ListToCheckAgainst="C1,25,50
hr1,22,30
r,12,18
C,15,44"

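awk -F',' '
    # First input: remember the lower and upper bound of every range.
    list=="constr" { n++; low[n]=$2; high[n]=$3; next }
    # Second input: print the line if column 2 lies strictly inside any range.
    { for (i=1; i<=n; i++) if ($2>low[i] && $2<high[i]) { print; next } }
' list=constr <(echo "$ListToCheckAgainst") \
  list=chk <(echo "$ListToCheckFrom")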

This passes the echoed content of $ListToCheckAgainst as the first input file and the echoed content of $ListToCheckFrom as the second. The awk variable list is set to either constr or chk before each file is "opened", so that awk can tell internally which of the "files" is currently being processed.

  • When processing the "constraints" from $ListToCheckAgainst, it simply stores the "lower" and "upper" bound, as specified in columns 2 and 3, in arrays low and high, respectively. Apart from that, it skips processing immediately to the next input line.
  • When processing the list to check from $ListToCheckFrom, it scans all ranges registered previously, and if it finds that column 2 falls within any one of them, prints it (and immediately skips processing to the next input line).

If your data is stored in "physical" files rather than shell variables, you can simply use the filenames instead of the process substitutions as command-line arguments.
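
As a minimal sketch of that file-based variant: the file names against.csv and from.csv and the temporary directory are placeholders made up for this example, and the +0 is an addition (not part of the command above) that forces numeric comparison even if a field carries stray whitespace.

```shell
#!/bin/sh
# Hypothetical demo files; the names are placeholders for this sketch.
tmpdir=$(mktemp -d)
printf '%s\n' 'C1,25,50' 'hr1,22,30' 'r,12,18' 'C,15,44' > "$tmpdir/against.csv"
printf '%s\n' 'C,2' 'C,22' 'C,12' 'hr,15' > "$tmpdir/from.csv"

# Same program as above, reading plain files instead of process substitutions.
result=$(awk -F',' '
    # Constraints file: store each range as numbers.
    list=="constr" { n++; low[n]=$2+0; high[n]=$3+0; next }
    # File to check: print the line if column 2 is strictly inside any range.
    { v = $2+0; for (i=1; i<=n; i++) if (v>low[i] && v<high[i]) { print; next } }
' list=constr "$tmpdir/against.csv" list=chk "$tmpdir/from.csv")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the sample data from the question this should print C,22 and hr,15.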

  • @AdminBee if the first file was empty and the script used NR==FNR it wouldn't fail, they'd just get no output same as if they were setting flags between files. Commented May 11, 2022 at 15:34
  • @EdMorton You are right in that in this case, there would effectively be no difference. But in general it is more robust, because while with the variable-setting "in between files" you simply would read no constraints (which is reasonable if there aren't any for a particular run), using the NR==FNR test with an empty first file would make your program mistake the second file for the first, which has repercussions if the goal is not simply to filter lines. So I think the habit of using these temporary variables does have its merits. Commented May 11, 2022 at 15:41
  • I understand where you're coming from and you're right it's not a BAD idea but personally I'd only consider using a flag if/when I need it, which is very rarely, and even then I usually wouldn't use one. If I needed to care about a first file being empty and it's a case where I'm not worried about portability then I'd use ARGIND==1 since I use gawk, and if I am then I'd usually use FILENAME==ARGV[1]. But using a flag is fine too. Commented May 11, 2022 at 16:09
  • I've gone back through the output and it would be even better if I could add column 1 into the command, where column 1 in both variables has to be identical. I tried adapting the code here for that and failed. I'm clearly misunderstanding something. Do I need to make a new question, or edit this question? I tried: awk -F',' 'list=="constraints"{n++; low[n]=$2;high[n]=$3;c[n]=$1;next} {for (i=1;i<=n;i++) {if (($2>=low[i]&&$2<=high[i])||($3>=low[i]&&$3<=high[i])&&($1==c[i])) {print;next};}}' list=constraints <(echo "$ListToCheckAgainst") list=check <(echo "$ListToCheckFrom") Commented May 11, 2022 at 16:47
  • @AdminBee Sorry I should also mention that I adapted to handle an additional column in the variable list to check against, that bit did work. I would really appreciate it if you can explain where I'm going wrong Commented May 11, 2022 at 16:51
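
One likely cause of the failed attempt in the comment above is operator precedence. In awk, && binds more tightly than ||, so a condition of the form A || B && C is evaluated as A || (B && C), and the column 1 test is skipped whenever A is true. Wrapping the range tests in one group, as in (A || B) && C, applies the column 1 test to both. The sketch below is an assumption-laden illustration rather than the asker's final command: it uses the original two-column sample data with strict inequalities, made-up file names, and a hypothetical key array holding column 1 of each constraint.

```shell
#!/bin/sh
# Hypothetical demo files; the names are placeholders for this sketch.
tmpdir=$(mktemp -d)
printf '%s\n' 'C1,25,50' 'hr1,22,30' 'r,12,18' 'C,15,44' > "$tmpdir/against.csv"
printf '%s\n' 'C,2' 'C,22' 'C,12' 'hr,15' > "$tmpdir/from.csv"

result=$(awk -F',' '
    # Constraints: remember column 1 next to each numeric range.
    list=="constr" { n++; key[n]=$1; low[n]=$2+0; high[n]=$3+0; next }
    {
        v = $2+0
        for (i=1; i<=n; i++)
            # Column 1 must match AND the value must be inside the range.
            # The parentheses around the range test keep the grouping explicit.
            if ($1==key[i] && (v>low[i] && v<high[i])) { print; next }
    }
' list=constr "$tmpdir/against.csv" list=chk "$tmpdir/from.csv")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the question's sample data only C,22 should remain, because hr has no identically named entry in the constraints list.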
$ cat tst.sh
#!/usr/bin/env bash

ListToCheckFrom='C,2
C,22
C,12
hr,15'

ListToCheckAgainst='C1,25,50
hr1,22,30
r,12,18
C,15,44'

awk '
    BEGIN { FS="," }
    NR==FNR {
        begs2ends[$2] = $3
        next
    }
    {
        for ( key in begs2ends ) {
            # Keep the loop index untouched; a numeric copy may not map back to the same key.
            beg = key+0
            end = begs2ends[key]+0
            if ( (beg < $2) && ($2 < end) ) {
                print
                next
            }
        }
    }
' <(printf '%s\n' "$ListToCheckAgainst") <(printf '%s\n' "$ListToCheckFrom")

$ ./tst.sh
C,22
hr,15
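
The NR==FNR test used above is true for every line of the first input, but it is also true for the second input when the first one is empty, which is the case discussed in the comments on the other answer. The following sketch illustrates the difference with an inverse filter (print lines that fall in no range), comparing NR==FNR against the portable FILENAME==ARGV[1] test mentioned there. The file names and the inverse-filter task are assumptions made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical demo files; the constraints file is deliberately empty.
tmpdir=$(mktemp -d)
: > "$tmpdir/against.csv"
printf '%s\n' 'C,2' 'C,22' > "$tmpdir/from.csv"

# NR==FNR: with an empty first file, the second file is mistaken for the first.
by_nr=$(awk -F',' '
    NR==FNR { n++; low[n]=$2+0; high[n]=$3+0; next }
    { v = $2+0; for (i=1; i<=n; i++) if (v>low[i] && v<high[i]) next; print }
' "$tmpdir/against.csv" "$tmpdir/from.csv")

# FILENAME==ARGV[1]: only lines actually read from the first argument match.
by_name=$(awk -F',' '
    FILENAME==ARGV[1] { n++; low[n]=$2+0; high[n]=$3+0; next }
    { v = $2+0; for (i=1; i<=n; i++) if (v>low[i] && v<high[i]) next; print }
' "$tmpdir/against.csv" "$tmpdir/from.csv")

printf 'NR==FNR:           [%s]\n' "$by_nr"
printf 'FILENAME==ARGV[1]: [%s]\n' "$by_name"
rm -rf "$tmpdir"
```

With no constraints, every line should pass the inverse filter. The NR==FNR version prints nothing because it consumes the second file as constraints, while the FILENAME version prints both lines. For the plain filter in the question both approaches give the same (empty) output, so this only matters for tasks like the one sketched here.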
