Revisions to compare 2 csv files on multiple columns and replace one column with column from another file

added 317 characters in body

edited Nov 23, 2022 at 8:36

This answer starts with a solution using Miller, continues with a solution using Miller in conjunction with csvsql from csvkit, and then finishes off with a solution that only uses csvsql.

You could even do it all in SQL if you wish:

csvsql --query '
    CREATE TEMPORARY TABLE tmp AS SELECT * FROM "fileA" NATURAL LEFT JOIN "fileB";
    UPDATE tmp SET account = account1 WHERE account1 IS NOT NULL;
    SELECT account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub FROM tmp;' fileA fileB

added 456 characters in body

edited Nov 23, 2022 at 7:54

Kusalananda ♦

A totally different approach is to use csvsql from csvkit to perform a natural left join, then use mlr for post-processing the output:

csvsql --query 'SELECT * FROM "fileA" NATURAL LEFT JOIN "fileB"' fileA fileB |
mlr --csv \
    put 'is_not_null($account1) { $account = $account1 }' then \
    cut -o -f account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub

This way, you don't need to know in advance which fields the two files have in common.

added 675 characters in body

edited Nov 23, 2022 at 7:27

Kusalananda ♦

Using Miller (mlr) to first (left-)join the data from fileA with the data in fileB on the following named fields:

account,code,date,type,pc,vol,bs

... and then rename the account1 field to account (this only affects records that have an account1 field, which are the ones that were joined).

We then reorder the fields and remove the ones we don't want in the output.

mlr --csv \
    join -f fileA -j account,code,date,type,pc,vol,bs --ul then \
    rename account1,account then \
    cut -o -f account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub fileB

The output, given the data in the question:

account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub
CCCCC,GFHD,ASDF,BS,21122022,STOP,C,CAT,1000,S,MATH
6576,WEQR,TYRE,BS,54122022,OBCD,K,BAT,5000,F,SCSC
7654,GHAD,LOPI,CV,9089022,KGAD,G,BSEE,5908,J,IOYU

Note that the order of the fields in the two input files is irrelevant.
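To make the join-and-replace step more concrete, here is a rough sketch of the same idea in plain awk instead of Miller. It is not the method used above, only an illustration: `replace_account` is a hypothetical helper name, it looks columns up by header name (so field order does not matter here either), and it assumes simple CSV with no quoted or embedded commas. Unlike the Miller pipeline, it keeps the column order of the first file and does no reordering or cutting.

```shell
# Hypothetical helper: replace_account FILE_A FILE_B KEYS
# For each row of FILE_A whose key columns (comma-separated names in KEYS)
# match a row of FILE_B, replace "account" with FILE_B's "account1".
# Assumes simple CSV: no quoted fields, no embedded commas.
replace_account() {
    awk -F, -v OFS=, -v keys="$3" '
        # Build a lookup key from the named key columns of the current row
        function key(   i, k) {
            k = ""
            for (i = 1; i <= nk; i++) k = k SUBSEP $(col[kname[i]])
            return k
        }
        BEGIN { nk = split(keys, kname, ",") }
        FNR == 1 {
            # Header line: remember the column number of each field name
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            if (NR != FNR) print    # only print the header of FILE_A
            next
        }
        # First file read is FILE_B: store the replacement value per key
        NR == FNR { new[key()] = $(col["account1"]); next }
        {
            # FILE_A rows: swap in the new account where a match exists
            k = key()
            if (k in new) $(col["account"]) = new[k]
            print
        }
    ' "$2" "$1"
}

# Usage, assuming the file names from the question:
#   replace_account fileA fileB account,code,date,type,pc,vol,bs
```

Rows of the first file without a match are printed unchanged, which corresponds to the unpaired records that `--ul` keeps in the Miller command.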

If you don't know in advance which fields to join on, you can compute the common field names separately (unfortunately, Miller has no "natural join" operation and must be given an explicit list of field names to join on):

mlr --csv put -q '
    if (NR == 1) {
        for (k in $*) { @f[k] = 1 }
    } else {
        for (k in @f) {
            is_null($[k]) { unset @f[k] }
        }
    }
    end {
        common_fieldnames = joink(@f, ",");
        emit common_fieldnames
    }' fileA fileB

For the given data, this outputs the following CSV data set:

common_fieldnames
"account,code,type,date,pc,vol,bs"

To only get the comma-delimited list, use options that would generate header-less unquoted CSV output, e.g. --csv in combination with --headerless-csv-output and --quote-none.
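If all you need is the list of shared header names to hand to `join -j`, a small shell function that compares just the two header lines may be enough. The sketch below is only that: `common_fields` is a hypothetical helper, not part of Miller or csvkit, and it assumes plain headers (no quoted or embedded commas) and Unix line endings.

```shell
# Hypothetical helper: common_fields FILE1 FILE2
# Prints the header fields present in both files as a comma-separated
# list, in the order they appear in FILE1.
# Assumes plain headers: no quoting, no embedded commas, no CRLF.
common_fields() {
    { head -n 1 "$1"; head -n 1 "$2"; } | awk -F, '
        NR == 1 { n = NF; for (i = 1; i <= NF; i++) name[i] = $i }
        NR == 2 { for (i = 1; i <= NF; i++) other[$i] = 1 }
        END {
            out = ""
            for (i = 1; i <= n; i++)
                if (name[i] in other)
                    out = (out == "" ? "" : out ",") name[i]
            print out
        }'
}

# Usage with the join from earlier in this answer:
#   mlr --csv \
#       join -f fileA -j "$(common_fields fileA fileB)" --ul then \
#       rename account1,account then \
#       cut -o -f account,temp1,code,type,date,subtask,pc,toy,vol,bs,sub fileB
```

This only looks at the header lines, so it does not check whether the shared columns actually contain matching values.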

added 675 characters in body

edited Nov 23, 2022 at 7:21

Kusalananda ♦

added 73 characters in body

edited Nov 22, 2022 at 20:24

Kusalananda ♦

answered Nov 22, 2022 at 20:19

Kusalananda ♦
