0

I have to compare 2 files file a and file b with the same columns but are randomly placed in 2 files like compare column values account, code, date, type, pc, vol and bs and if a match is found for any row then replace the column 1 account in file a with account1 from file b and display all the non matched records from file a . the output file c should look like below

I don't have any background in linux, after going through the online forums I got a slight idea that this can be done via AWK but I am not that familiar with awk. please help. The comparison need to be done in linux environment

by randomly placed columns I mean is that in the command I should be able give the columns that I can use to match file a with file b (the order of the column will not be the same in both the files). It can be 8 or 10 or 15 columns to match in both the files.

similiar to that account and account1 must not be the first and last column , the command should have ability to pick what column needs to be updated as well

duplicates processing -->if file b has many duplicates for a 1 record in file a then it should update the file b first matched record so that it cannot be used again in the matching --> if file a has many duplicates and file b has only 1 matched record then the final file c should have only 1 account value get updated rather than updating all the records of matched values of file a

file a

account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub
6576,WEQR,TYRE,BS,54122022,OBCD,K,BAT,5000,F,SCSC
1234,GFHD,ASDF,BS,21122022,STOP,C,CAT,1000,S,MATH
7654,GHAD,LOPI,CV,9089022,KGAD,G,BSEE,5908,J,IOYU

file b

account,code,date,type,inst,insttype,pc,str,vol,bs,name,xdate,account1
1234,ASDF,21122022,BS,GOLDY,RUB,C,123.1,1000,S,RON,90891234,CCCCC
2761,LOPCS,10122022,BSFD,SLV,STR,C,123.9,1001,B,RON,99999988,DDDDD
0980,RTDF,28822025,JUFG,BRNZ,HIY,C,123.8,2000,S,RON,88881234,EEEEE

file c

account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub
6576,WEQR,TYRE,BS,54122022,OBCD,K,BAT,5000,F,SCSC
CCCCC,GFHD,ASDF,BS,21122022,STOP,C,CAT,1000,S,MATH
7654,GHAD,LOPI,CV,9089022,KGAD,G,BSEE,5908,J,IOYU
2
  • 1
    Please edit your question and explain in more detail what you consider a match? What exactly means "same columns but are randomly placed"? Is account always the first column? Is account1 always the last column? Can there be duplicate values? Commented Nov 22, 2022 at 17:27
  • @Bodo i have updated question based on your query Commented Nov 23, 2022 at 8:55

2 Answers 2

2

This answer starts with a solution using Miller, continues with a solution using Miller in conjunction with csvsql from csvkit, and then finishes off with a solution that only uses csvsql.


Using Miller (mlr) to first (left-)join the data from fileA with the data in fileB on the following named fields:

account,code,date,type,pc,vol,bs

... and then rename the account1 field account (for the records that have an account1 field, which will only be the ones that were joined).

We then reorder the fields and remove the ones we don't want in the output.

mlr --csv \
    join -f fileA -j account,code,date,type,pc,vol,bs --ul then \
    rename account1,account then \
    cut -o -f account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub fileB

The output, given the data in the question:

account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub
CCCCC,GFHD,ASDF,BS,21122022,STOP,C,CAT,1000,S,MATH
6576,WEQR,TYRE,BS,54122022,OBCD,K,BAT,5000,F,SCSC
7654,GHAD,LOPI,CV,9089022,KGAD,G,BSEE,5908,J,IOYU

Note that the order of the fields in the two input files is irrelevant.

If you don't know what fields you may use to join on, you may calculate the common field names separately (unfortunately, Miller can't do a "natural join" operation but must be given an explicit list of field names to join on):

mlr --csv put -q '
    if (NR == 1) {
        for (k in $*) { @f[k] = 1 }
    } else {
       for (k in @f) {
           is_null($[k]) { unset @f[k] }
       }
    }
   end {
       common_fieldnames = joink(@f,",");
       emit common_fieldnames
   }' fileA fileB

For the given data, this outputs the following CSV data set

common_fieldnames
"account,code,type,date,pc,vol,bs"

To only get the comma-delimited list, use options that would generate header-less unquoted CSV output, e.g. --csv in combination with --headerless-csv-output and --quote-none.


A totally different approach is to use csvsql from csvkit to perform a natural left join, then use mlr for post-processing the output:

csvsql --query 'SELECT * FROM "fileA" NATURAL LEFT JOIN "fileB"' fileA fileB |
mlr --csv \
    put 'is_not_null($account1) { $account = $account1 }' then \
    cut -o -f account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub

This way, you don't have to care about what fields are common between the two files.

You could even do it all in SQL if you wish:

csvsql --query '
    CREATE TEMPORARY TABLE tmp AS SELECT * FROM "fileA" NATURAL LEFT JOIN "fileB";
    UPDATE tmp SET account = account1 WHERE account1 IS NOT NULL;
    SELECT account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub FROM tmp;' fileA fileB
2
  • sorry cannot use Miller in my company Commented Nov 23, 2022 at 4:46
  • @Shetty That's a problem that I can not solve for you, sorry. Commented Nov 23, 2022 at 7:21
-1
#!/bin/bash

IFS=, hdrs_to_check="$*"
col_to_change="account"
col_with_value="account1"

awk -F, -v hdrs_to_check="$hdrs_to_check" \
        -v col_to_change="$col_to_change" \
        -v col_with_value="$col_with_value" '

function make_compound_key(hdrs_2_idx_map) {
    key = ""
    for (i = 1; i <= hdrs_to_check_arr_size; i++) {
        hdr_name = hdrs_to_check_arr[i]
        hdr_idx = hdrs_2_idx_map[hdr_name]
        key = key$hdr_idx","
    }
    return key
}

BEGIN {
    hdrs_to_check_arr_size = split(hdrs_to_check, hdrs_to_check_arr, ",")
}

#Both file headers processing
FNR == 1 {
    #Map file_a and file_b headers to their indexes.
    for (i = 1; i <= NF; i++) {
        if (NR == 1) {
            f_b_hdr_2_idx_map[$i] = i
        } else {
            f_a_hdr_2_idx_map[$i] = i
        }
    }
    next
}

#file_b processing
FNR == NR {
    key = make_compound_key(f_b_hdr_2_idx_map)

    # Duplicate keys are not changing the "account1" value, it is fixed
    # from the first key occurence.
    if (!(key in file_b_dct)) {
        file_b_dct[key] = $f_b_hdr_2_idx_map[col_with_value]
    }

    next
}

#file_a processing
FNR != NR {
    key = make_compound_key(f_a_hdr_2_idx_map)

    # "file_a_updated_dct" is needed to prevent following lines
    # with same "key" be changed. These lines are left unchanged.
    if ((key in file_b_dct) && !(key in file_a_updated_dct)) {
        $f_a_hdr_2_idx_map[col_to_change] = file_b_dct[key]         
        file_a_updated_dct[key]
    }

    print
}
' file_b.txt OFS=, file_a.txt
#' file_b.txt OFS=, file_a.txt | column -s, -t

Usage:

$ ./my_script.sh account code date type pc vol bs
6
  • sorry how do i run this on linux environment Commented Nov 23, 2022 at 4:45
  • @Shetty paste to a command line and press Enter. Or put this text into shell script, make it executable and run this script. Commented Nov 23, 2022 at 6:45
  • sorry for my ignorance , will you be able to tell how does it compare the 7 column values Commented Nov 23, 2022 at 6:58
  • @Shetty It compares only file_a and file_b account columns. I assumed this basing on your file_c example and only account column changing, now I reread your question and see that some other columns should be checked too, but only account column's value should be replaced by the account1 value. Commented Nov 23, 2022 at 7:19
  • will you be able to check and provide solution for that? Commented Nov 23, 2022 at 8:53

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.