4

I have two lists of item numbers and want to mark the difference between these lists by writing the numbers which aren't in both files to a new file.

Both files have the item number in column 2 and the part ID in column 3. First I want to check whether an item number from file 1 exists in file 2; if so, the part IDs have to be compared, and if they match, move on to the next item number. If either condition fails, the difference should be written to a newly created file. If an item number exists in only one of the two files, the program should write "Artikel[x] cannot be found".

EXAMPLE

FILE 1

Artikel[ 456]= 1,2
Artikel[ 877]= 3
Artikel[ 278]= 4
Artikel[ 453]= 13

FILE 2

Artikel[ 456]= 2, 1 
Artikel[ 877]= 3, 5 
Artikel[ 387]= 4, 9, 4 
Artikel[ 947]= 10

OUTPUT

Artikel[ 877]= 3 != Artikel[ 877]= C3, C5
Artikel[ 278]= 4 != Artikel[ 278 ]= 4, 9, 4
Artikel[ 453]= 13 cannot be found in File 2!
Artikel[ 947]= 10 cannot be found in File 1!

I thought I could do this by writing the item numbers from File 1 into an array and then checking every line of File 2 against it, but somehow I am struggling to manage that. I would appreciate any help.
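For illustration, that array approach can be sketched in awk with the usual two-file NR == FNR idiom. This is a minimal sketch only: the file names and line format are assumed from the examples above, value order is ignored, and items present only in File 1 are not reported.

```shell
# Minimal sketch: load File 1 into an array keyed by item number,
# then look each File 2 line up in it. Sample files assumed below.
cat > file1 <<'EOF'
Artikel[ 456]= 1,2
Artikel[ 877]= 3
EOF
cat > file2 <<'EOF'
Artikel[ 456]= 1,2
Artikel[ 877]= 3,5
Artikel[ 947]= 10
EOF
awk -F'=' '
    { key = $1; val = $2; gsub(/ /, "", key); gsub(/ /, "", val) }
    NR == FNR      { f1[key] = val; next }   # pass 1: remember File 1
    !(key in f1)   { print $0 " cannot be found in File 1!"; next }
    f1[key] != val { print $0 " != " key "= " f1[key] }
' file1 file2
# Artikel[ 877]= 3,5 != Artikel[877]= 3
# Artikel[ 947]= 10 cannot be found in File 1!
```

The string comparison `f1[key] != val` treats `1,2` and `2,1` as different, so a real solution still needs order-insensitive value handling.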

Thanks

3
  • 1
    Would C3, C5 be the same as C5, C3 or different? Does the order of entries matter? Commented Dec 18, 2023 at 10:40
  • @terdon Yes, the order doesn't matter. Commented Dec 18, 2023 at 10:43
  • FILE2 has values that are comma-space separated, while FILE1 has values that are comma separated. Which is correct? Commented Dec 19, 2023 at 0:04

4 Answers

4

Using any POSIX awk:

$ cat tst.awk
BEGIN {
    FS = "[]=[]+"                       # split fields on "[", "]" and "="
    f1 = ARGV[1]
    f2 = ARGV[2]
}
{
    gsub(/[[:space:]]+/,"")             # strip all whitespace
    gsub(/,/,"& ")                      # normalize separators to ", "
    key = $1 "[ " $2 " ]="              # e.g. "Artikel[ 456 ]="
    keys[key]
    vals = substr($0,index($0,"=")+1)   # everything after the first "="
}
FILENAME == f1 {
    f1KeyVals[key] = vals
}
FILENAME == f2 {
    f2KeyVals[key] = vals
}
END {
    for ( key in keys ) {
        if ( (key in f1KeyVals) && (key in f2KeyVals) ) {
            if ( f1KeyVals[key] != f2KeyVals[key] ) {
                areDifferent = 0

                delete f1vals
                split(f1KeyVals[key],tmp,/, */)
                for ( i in tmp ) { f1vals[tmp[i]] }

                delete f2vals
                split(f2KeyVals[key],tmp,/, */)
                for ( i in tmp ) { f2vals[tmp[i]] }

                for ( val in f1vals ) {
                    if ( val in f2vals ) {
                        delete f2vals[val]
                    }
                    else {
                        areDifferent = 1
                        break
                    }
                }

                for ( val in f2vals ) {
                    areDifferent = 1
                    break
                }

                if ( areDifferent ) {
                    printf "%s %s != %s %s\n", key, f1KeyVals[key], key, f2KeyVals[key]
                }
            }
        }
        else if ( key in f2KeyVals ) {
            printf "%s cannot be found in %s!\n", key, f1
        }
        else {
            printf "%s cannot be found in %s!\n", key, f2
        }
    }
}
$ awk -f tst.awk file1 file2
Artikel[ 5129720100 ]= cannot be found in file2!
Artikel[ 5089100000 ]= C3 != Artikel[ 5089100000 ]= C3, C5
Artikel[ 4005530901 ]= cannot be found in file1!
Artikel[ 5091270000 ]= C4 != Artikel[ 5091270000 ]= C4, C19, C34

The above assumes that if values are ever duplicated, e.g. C1, C1, C2 they should be treated the same way as if they weren't, i.e. C1, C2.
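To see what the `FS = "[]=[]+"` separator does, here is a standalone demonstration on one of the question's sample lines (demo only; the bracket expression means "one or more of `]`, `=` or `[`", and modifying $0 with gsub() re-splits the fields):

```shell
# Demonstration only: how FS = "[]=[]+" carves up a line
# once the whitespace has been stripped ("Artikel[456]=1,2").
echo 'Artikel[ 456]= 1,2' |
awk 'BEGIN { FS = "[]=[]+" }
     {
         gsub(/[[:space:]]+/,"")
         for (i = 1; i <= NF; i++) printf "$%d = <%s>\n", i, $i
     }'
# $1 = <Artikel>
# $2 = <456>
# $3 = <1,2>
```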

2
  • 2
    Thank you very much, it works exactly like what I was looking for. Now I'm going to try to understand how the code works. Commented Dec 18, 2023 at 12:26
  • I'm curious - why unaccept my answer after a month? Commented Jan 19, 2024 at 14:34
1

In TXR Lisp:

(defun read-file (path)
  (let ((h (hash)))
    (with-stream (s (open-file path))
      (whilet ((line (get-line s)))
        (if-match `Artikel[ @item ]= @list` line
          (let ((idh (flow list
                       (tok #/[^ ,]+/)
                       hash-list)))
            (set [h item] idh))))
      h)))

(defun out-one (h file)
  (dohash (item ids h)
    (put-line `Artikel[ @item ] = @{(hash-values ids) ", "} cannot be found in @file!`)))

(defun out-both (h)
  (dohash (item id-pair h)
    (tree-bind (left-ids . right-ids) id-pair
      (unless (equal left-ids right-ids)
         (put-line `Artikel[ @item ] = @{(hash-values left-ids) ", "} !=\ \
                    Artikel[ @item ] = @{(hash-values right-ids) ", "}`)))))

(let* ((h0 (read-file "file1"))
       (h1 (read-file "file2")))
  (out-one (hash-diff h0 h1) "File 2")
  (out-one (hash-diff h1 h0) "File 1")
  (out-both [hash-isec h0 h1 cons]))

Output:

$ txr diff.tl
Artikel[ 5129720100 ] = C13 cannot be found in File 2!
Artikel[ 4005530901 ] = C10 cannot be found in File 1!
Artikel[ 5091270000 ] = C4 != Artikel[ 5091270000 ] = C34, C19, C4
Artikel[ 5089100000 ] = C3 != Artikel[ 5089100000 ] = C3, C5
0

The following is not a complete solution, but it is useful to be aware of the Unix command join.

▷  join -v 1 -t= <(sort -k 1b,1 FILE1) <(sort -k 1b,1 FILE2) | tr '=' '\t'
Artikel[ 5129720100 ]    C13
▷  join -v 2 -t= <(sort -k 1b,1 FILE1) <(sort -k 1b,1 FILE2) | tr '=' '\t'
Artikel[ 4005530901 ]    C10
▷  join -t= <(sort -k 1b,1 FILE1) <(sort -k 1b,1 FILE2) | tr '=' '\t'
Artikel[ 4003526101 ]    C1,C2    C2,C1
Artikel[ 5089100000 ]    C3    C3,C5
Artikel[ 5091270000 ]    C4    C4,C19,C34

join requires sorted input, with leading blanks in the key ignored, hence the sort options shown. With -v <i>, the unpairable lines from file <i> are output. From these outputs it becomes a lot easier to compute what you need.
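For instance, the unpairable lines from join -v can be turned directly into the question's "cannot be found" messages with a little sed post-processing. A sketch only; the FILE1/FILE2 contents below are made up to match the question's format:

```shell
# Sketch: derive the "cannot be found" messages from join -v output.
cat > FILE1 <<'EOF'
Artikel[ 453]= 13
Artikel[ 456]= 1,2
EOF
cat > FILE2 <<'EOF'
Artikel[ 456]= 2,1
Artikel[ 947]= 10
EOF
sort -k 1b,1 FILE1 > f1.sorted
sort -k 1b,1 FILE2 > f2.sorted
join -v 1 -t= f1.sorted f2.sorted | sed 's/$/ cannot be found in File 2!/'
join -v 2 -t= f1.sorted f2.sorted | sed 's/$/ cannot be found in File 1!/'
# Artikel[ 453]= 13 cannot be found in File 2!
# Artikel[ 947]= 10 cannot be found in File 1!
```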

0

Using Raku (formerly known as Perl_6)

~$ raku -e 'my %hash1; for "path/to/file1.txt".IO.lines() {  
               .split("= ") andthen %hash1.append: .[0] => .[1].split(",") };  
            my %hash2; for "path/to/file2.txt".IO.lines() {  
               .split("= ") andthen %hash2.append: .[0] => .[1].split(",") };  
            for (%hash1.keys ∩ %hash2.keys).map(*.key) -> $i {  
                unless %hash1{$i} == %hash2{$i} {  
                    put ($i ~ "= " ~ %hash1{$i}.join(",") ~ " != " ~ $i ~ "= " ~ %hash2{$i}.join(","))  // next} };  
            my ($k2,$v2) = %hash2{(%hash2.keys (-) %hash1.keys)}:kv;   
            my ($k1,$v1) = %hash1{(%hash1.keys (-) %hash2.keys)}:kv;  
            put $k2 ~ "= " ~ $v2.join(",") ~ " cannot be found in File 1!" // next;  
            put $k1 ~ "= " ~ $v1.join(",") ~ " cannot be found in File 2!" // next;'

Sample Output:

Artikel[ 5091270000 ]= C4 != Artikel[ 5091270000 ]= C4,C19,C34
Artikel[ 5089100000 ]= C3 != Artikel[ 5089100000 ]= C3,C5
Artikel[ 4005530901 ]= C10 cannot be found in File 1!
Artikel[ 5129720100 ]= C13 cannot be found in File 2!

Above is an answer written in Raku, the newest member of the Perl family of programming languages. Raku features built-in high-level support for Unicode, as well as an advanced regex engine. This answer takes advantage of Raku's %-sigiled hash (key/value) data structure (a feature of Perl-family languages).

  • Briefly, each file is read in line-by-line into a %hash: each line is split on "= " to give two parts; the first part (.[0]) becomes the key, while the comma-split second part (.[1]) becomes the value.

  • Raku has Set functions built-in, so you can get the intersection of hash keys simply by writing %hash1.keys ∩ %hash2.keys (using either the Unicode infix character ∩ or the 3-character ASCII infix (&)).

  • From the Intersection result, the code %hash{$k} is a basic key lookup to return the associated value(s). With this knowledge you can build an output string (~ tilde is used to concatenate strings together). Hash keys for which values are equal will not be output because of the unless %hash1{$i} == %hash2{$i} clause (unless is the same as if not).

  • Raku also has Set Difference functions, here represented by the 3-character ASCII infix (-). The ASCII form is used because the actual Unicode "SET MINUS" symbol (U+2216) is easily confused with other characters. Both hash-key differences are computed, an output string is constructed for each, and each is output.


Note 1: The code above makes no assumptions about the uniqueness of values per key, so if you have duplicate values in one file (but not the other), they will show up in the output as a difference. To make values unique per key, add unique to each hash constructor, e.g. %hash1.append: .[0] => .[1].split(",").unique.

Note 2: The above code doesn't try to simplify the "Artikel" key, but you'd probably do better to simplify each .[0] key down to only digits using a Regex, like so: .[0].match(/ \d+ /).Str.

Note 3: In this example the input paths are hard-coded, but you could hard-code one (a proof file) and take a test file off the command line with $*ARGFILES.IO.lines() {...};, or even $*IN.IO.lines() {...}; (making sure to redirect STDIN appropriately with <). See the second link below for more CLI options (e.g. using Raku's @*ARGS command-line array, etc.).


https://docs.raku.org/language/setbagmix#Sets,_bags,_and_mixes
https://docs.raku.org/language/create-cli
https://raku.org
