I have a CSV file and want to run a command for each line, using the fields of the file as separate arguments.

For example, given the following file:

foo,42,red
bar,13,blue
baz,27,green

I want to run the following commands one after another (note that the order of arguments in the command may differ from the order in the input file):

my_cmd --arg1 42 --arg2 foo --arg3 red
my_cmd --arg1 13 --arg2 bar --arg3 blue
my_cmd --arg1 27 --arg2 baz --arg3 green

What is the easiest way to achieve this? It seems like it might be possible with xargs, but I couldn't figure out exactly how.

  • Can we assume that the data might contain properly quoted fields, fields with spaces, and fields with globbing characters, as in mute swan,"1,23",* * green * *? Commented Feb 25 at 13:46
  • @Kusalananda In my current use-case none of these occur, but for a more general answer it would of course be great if such input were handled properly. Commented Feb 25 at 13:59
  • If the file is huge, beware that launching a command is quite "heavy": the shell has to fork itself, etc., for each command launched (so for each line of the file). Depending on your processing, and if possible (i.e., if the processing does not itself require a fork), it could be orders of magnitude (10x? 100x? 1000x?) more efficient to just parse the file and perform the actions directly within the parsing program (awk? perl?); see the sketch after these comments. Otherwise, another optimisation could be: if there are fewer than N background processes, launch a new one (so there would be up to N running in parallel, and their duration would have less impact). Commented Feb 25 at 15:04
  • @EdMorton Yes, this is intentional. It should be possible to use arguments in a different order (potentially using some of them multiple times). Commented Feb 26 at 7:57
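
As a sketch of that last idea, with a made-up per-line action (just formatting the fields) and assuming the sample rows are in file.csv: everything happens inside a single awk process, so no external command is forked per line:

$ awk -F, '{ printf "name=%s count=%s colour=%s\n", $1, $2, $3 }' file.csv
name=foo count=42 colour=red
name=bar count=13 colour=blue
name=baz count=27 colour=green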

5 Answers


GNU parallel can read CSV directly, and has item replacement built in.

More or less directly taken from man parallel:

parallel --csv 'my_cmd --arg1 {2} --arg2 {1} --arg3 {3}' :::: file.csv

Add -j1 (before the 'my_cmd ...' string) to have these invocations executed one after the other. Or don't, and have them run in parallel.
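
For example, the sequential variant would look like this (a sketch, assuming the sample data is in file.csv):

parallel -j1 --csv 'my_cmd --arg1 {2} --arg2 {1} --arg3 {3}' :::: file.csv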

(On Debian and Fedora, it's in the package called parallel, not in moreutils or moreutils-parallel.)

Thank you, Ole Tange!

  • It even handles CSV with "tricky" fields neatly. It uses the Text::CSV Perl module for the CSV parsing. Look out, though, because an unrelated parallel utility seems to exist, written by one Tollef Fog Heen (found in the moreutils package on Debian-based systems). Commented Feb 25 at 14:43
  • @Kusalananda Good hint; I ran into that trouble before, because I had just run sudo dnf install /usr/bin/parallel, which gave me moreutils-parallel, which was wrong. Commented Feb 25 at 14:46
  • @Luuk You need to install the Perl module separately. On Debian-based systems, this should be part of the libtext-csv-perl package (or perl-Text-CSV on other systems). Whoever packaged it probably thought it a less useful dependency to include by default... Commented Feb 25 at 14:53
  • Note that this will execute the commands in parallel, which can be very nice but might be undesired in some cases. If so, simply set -j 1 to reduce the number of jobs to 1, resulting in sequential execution. Commented Feb 25 at 14:56
  • Writing it like that is a bit misleading, as it implies parallel would run my_cmd, while in fact it asks a shell (which shell is determined at run time with heuristics) to interpret code made up of the concatenation of those arguments, but where those {1}... are expanded to the arguments, quoted (hopefully) in the right syntax for that shell. PARALLEL_SHELL=sh parallel --csv 'my_cmd --arg1 {2} --arg2 {1} --arg3 {3}' :::: file.csv would make it clearer that it's shell code being interpreted, as opposed to a command being executed, and would ensure it's parsed by sh. Commented Feb 25 at 19:34

I find awk a little easier than fussing with xargs, so I tend to assemble the arguments using awk, then pass them to xargs:

$ awk -F ',' '{ print "--arg1", $2, "--arg2", $1, "--arg3", $3 }' csv.txt | xargs -L1 echo
--arg1 42 --arg2 foo --arg3 red
--arg1 13 --arg2 bar --arg3 blue
--arg1 27 --arg2 baz --arg3 green

Here -L1 says "run one command per line of input."
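
Once the output looks right, replace echo with the real command (a sketch, assuming my_cmd is in your PATH):

$ awk -F ',' '{ print "--arg1", $2, "--arg2", $1, "--arg3", $3 }' csv.txt | xargs -L1 my_cmd

Note that xargs does its own word splitting and treats quote characters and backslashes specially, so fields containing spaces or quotes would need extra care with this approach.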


The following first uses Miller (mlr) to convert the headerless CSV input to JSONL output (lines of single JSON objects). The jq JSON processor then reads these objects and outputs their parts as arguments to a command. The result is shell code that can be eval-ed.

$ cat file
foo,42,red
bar,13,blue
baz,27,green
"""mute"" swan","1,23",* * green * *
$ mlr --c2l -N cat file
{"1": "foo", "2": 42, "3": "red"}
{"1": "bar", "2": 13, "3": "blue"}
{"1": "baz", "2": 27, "3": "green"}
{"1": "\"mute\" swan", "2": "1,23", "3": "* * green * *"}
$ mlr --c2l -N cat file | jq -r '["my_cmd", "--arg1", ."1", "--arg2", ."2", "--arg3", ."3"] | @sh'
'my_cmd' '--arg1' 'foo' '--arg2' 42 '--arg3' 'red'
'my_cmd' '--arg1' 'bar' '--arg2' 13 '--arg3' 'blue'
'my_cmd' '--arg1' 'baz' '--arg2' 27 '--arg3' 'green'
'my_cmd' '--arg1' '"mute" swan' '--arg2' '1,23' '--arg3' '* * green * *'

The @sh output operator attempts to quote the given data in a way that would be appropriate for a shell. It is not foolproof, but it tends to do a good job most of the time.
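
For instance, a field containing spaces and globbing characters comes out as a single, safely quoted word (a minimal illustration with made-up input):

$ echo '{"f": "a b * c"}' | jq -r '[.f] | @sh'
'a b * c'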

$ eval "$(mlr --c2l -N cat file | jq -r '["my_cmd", "--arg1", ."1", "--arg2", ."2", "--arg3", ."3"] | @sh')"
zsh: command not found: my_cmd
zsh: command not found: my_cmd
zsh: command not found: my_cmd
zsh: command not found: my_cmd
$ my_cmd () { echo ""; printf 'arg: %s\n' "$@"; }
$ eval "$(mlr --c2l -N cat file | jq -r '["my_cmd", "--arg1", ."1", "--arg2", ."2", "--arg3", ."3"] | @sh')"

arg: --arg1
arg: foo
arg: --arg2
arg: 42
arg: --arg3
arg: red

arg: --arg1
arg: bar
arg: --arg2
arg: 13
arg: --arg3
arg: blue

arg: --arg1
arg: baz
arg: --arg2
arg: 27
arg: --arg3
arg: green

arg: --arg1
arg: "mute" swan
arg: --arg2
arg: 1,23
arg: --arg3
arg: * * green * *

You can also run things directly from Miller, but I don't know how well its exec() function deals with values that need to be quoted in the shell (or if that is even an issue). I may come back and revise this if I get around to testing it.


Using Raku (formerly known as Perl_6)

...with Raku's Text::CSV module:

~$  raku -MText::CSV -e '

    my $fh = open "luator.txt", :r;
    my $parser = Text::CSV.new;

    until $fh.eof {
          $_ = $parser.getline($fh); 
          run "echo", .[0], .[1], .[2] given $_;
    }

    $fh.close;'

Raku is a programming language in the Perl family that features some nice functions for invoking external commands. The two options are calling shell or calling run. According to the Docs, calling run is safer.

Above, when you declare the $parser object, you can set various parameters, such as accepting a non-comma separator (example: my $parser = Text::CSV.new(sep => "|");). Then the file is read/parsed linewise with getline(). A simple output is shown above using echo.

Sample Input:

~$ cat luator.txt
foo,42,red
bar,13,blue
baz,27,green

Sample Output (with echo):

foo 42 red
bar 13 blue
baz 27 green

Below, using run "printf", "%s\t", .[0].uc, .[1], .[2].uc given $_; run "printf", "\n"; separates the column output with \t tabs. Note that .uc is added here to uppercase the first and third columns, to show that you can still clean up text if you need to (before invoking your my_cmd):

Sample Output (with printf):

FOO 42  RED
BAR 13  BLUE
BAZ 27  GREEN

Finally, you can take input files off the command line using Raku's $*ARGFILES dynamic variable. Obviously, you'll substitute your my_cmd in place of printf below:

~$ raku -MText::CSV -e '
         my $parser = Text::CSV.new;    
         until $*ARGFILES.eof {
               $_ = $parser.getline($*ARGFILES); 
         run "printf", "%s ", "--arg1", .[0], "--arg2", .[1], "--arg3", .[2] given $_;
         run "printf", "\n";
   };'   luator.txt
--arg1 foo --arg2 42 --arg3 red
--arg1 bar --arg2 13 --arg3 blue
--arg1 baz --arg2 27 --arg3 green

Otherwise, see the first link below for how to save output to a Raku "Proc" (process) object, or the second link below for using "Proc::Async" (asynchronous process interface).

https://docs.raku.org/type/Proc
https://docs.raku.org/type/Proc/Async
https://raku.org


This separates the field selection/ordering (f=... below) from the addition of the --argN flags (the loop), so that it's easy to modify the fields and/or their order, and even to use the same field multiple times, which the OP said in a comment is required. It uses any awk and a POSIX xargs:

$ awk -F, -v f='2,1,3' '
    {
        n = split(f, flds)
        for (i = 1; i <= n; i++) {
            printf " --arg%d \"%s\"", i, $(flds[i])
        }
        print ""
    }
' file | xargs -L1 echo my_cmd
my_cmd --arg1 42 --arg2 foo --arg3 red
my_cmd --arg1 13 --arg2 bar --arg3 blue
my_cmd --arg1 27 --arg2 baz --arg3 green

Remove the echo when done testing.

Given that, changing the order and duplicating fields is as simple as changing f='...':

$ awk -F, -v f='3,1,3,2,1' '
    {
        n = split(f, flds)
        for (i = 1; i <= n; i++) {
            printf " --arg%d \"%s\"", i, $(flds[i])
        }
        print ""
    }
' file | xargs -L1 echo my_cmd
my_cmd --arg1 red --arg2 foo --arg3 red --arg4 42 --arg5 foo
my_cmd --arg1 blue --arg2 bar --arg3 blue --arg4 13 --arg5 bar
my_cmd --arg1 green --arg2 baz --arg3 green --arg4 27 --arg5 baz

The \"s around %s are to ensure xargs would handle fields that contain spaces correctly, otherwise a field like a b would be split into 2 separate arguments for my_cmd.
