Add list of words at the end of the 1st line in a loop

Question

I have a list with the names of several files:

file1
file2

I would like to append each file's name to the end of its 1st line, e.g.:

The first row of file1 will be

> ATCGCCfile1

The first row of file2 will be

> ATTTCCfile2

My idea is to store the file names in a variable a:

a="file1 file2"

then loop over the files and append the name by referencing the loop variable:

for i in $a; do cat $i | awk '{print $0"$i"}' > $i.extended_word.txt; done

However, the awk '{print $0"$i"}' part does not work: the output contains a literal $i instead of the name of the file, e.g. for the 1st file, >ATCGCC$i.

What am I doing wrong? I also tried braces (awk '{print $0"${i}"}'), but without success.

Do all your files have just one sequence or can you have more lines starting with >? Do you always have a space after the >? That isn't standard, or required by the fasta format. More importantly, how many files do you have? Do you even need a loop? If you do, can we just do for f in * and get all the files of interest, or should that be for f in *fa or something else? Can you show us the output of ls in the target directory and tell us what files should be modified? — terdon
– terdon ♦, Commented May 24, 2023 at 16:01
All files have two rows, and yes, they are fasta files. The first line starts with > and the second does not. I would like to modify only the first line. — fusion.slope
– fusion.slope, Commented May 24, 2023 at 16:15
OK, and how many files? You really don't want to use a variable, so please show us the output of ls in the directory with your files and indicate which ones should be changed. Can we just do for file in *fa; do... to get all files? And where does the ATTTCC come from? Do you want us to add something from the sequence line or the ID line? Can you show us an example file? — terdon
– terdon ♦, Commented May 24, 2023 at 16:39
I have corrected the post: the first row of file2 is supposed to be ATTTCCfile2 — fusion.slope
– fusion.slope, Commented May 25, 2023 at 8:40

Stéphane Chazelas · Accepted Answer · 2023-05-24 17:37:18Z

3

To store a list, you want an array:

a=(
  file1
  file2
  'other file with spaces'
  $'even with\nnewlines'
)
awk -- '
  FNR == 1 {
    close(out)
    $0 = $0 FILENAME
    out = FILENAME".extended_word.txt"
  }
  {print > out}' "${a[@]}"

Beware that if file names contain = characters, you'd need to make sure that what's left of the = is not a valid awk variable. For instance, if you have a file=1 file, make it ./file=1.
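The caveat above can be reproduced with a small sketch. The scratch directory, the sample content and the name-stripping step are assumptions made for illustration, not part of the answer; note that with the ./ prefix FILENAME also carries the ./, so the sketch removes it before appending.

```shell
# Scratch directory and sample file are hypothetical, for illustration only.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '>ATCGCC\nACGT\n' > 'file=1'

# Append the file name (minus any leading ./) to the first line of each input.
prog='FNR == 1 {name = FILENAME; sub(/^\.\//, "", name); $0 = $0 name} {print}'

# Bare operand: awk reads file=1 as the assignment file="1", finds no input
# file, and falls back to stdin (redirected from /dev/null so it cannot hang).
bare=$(awk "$prog" 'file=1' < /dev/null)

# Prefixed operand: ./file is not a valid variable name, so it is opened as a file.
prefixed=$(awk "$prog" './file=1')

printf 'bare: [%s]\nprefixed: [%s]\n' "$bare" "$prefixed"
```

In this sketch the bare form produces no output at all, which is easy to miss in a script, while the prefixed form processes the file as intended.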

With GNU awk, a workaround is to write it as:

gawk -e '
  FNR == 1 {$0 = $0 FILENAME}
  {print > FILENAME ".extended_word.txt"}
  ' -E /dev/null "${a[@]}"

This works even for files with = in their path, but unfortunately still not for a file called -.
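For readers wondering why the loop in the question printed a literal $i: the awk program sits inside single quotes, so the shell never expands $i, and inside awk "$i" is just a string constant. That loop also has no condition, so it would append to every line rather than only the first. A minimal sketch that keeps the question's loop and hands the name to awk through the environment follows; the scratch directory, the sample sequences and the variable name `name` are assumptions for illustration.

```shell
# Scratch directory and sample files are hypothetical, for illustration only.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '>ATCGCC\nACGT\n' > file1
printf '>ATTTCC\nTTGA\n' > file2

for i in file1 file2; do
  # name=$i is exported to this awk invocation only; ENVIRON passes the value
  # through unchanged, whereas -v would interpret backslash escapes in it.
  # Reading via < "$i" means awk never parses the file name as an operand.
  name=$i awk 'NR == 1 {$0 = $0 ENVIRON["name"]} {print}' \
    < "$i" > "$i.extended_word.txt"
done

cat file1.extended_word.txt file2.extended_word.txt
```

Because the input arrives on stdin, names containing = or starting with - need no special handling here, at the cost of one awk process per file.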

edited May 24, 2023 at 17:37

answered May 24, 2023 at 15:20


All files have two rows and each file is a fasta file. The first row starts with > and the second does not. I would like to modify only the first row. This approach also modifies the second row, but it works.

– fusion.slope, Commented May 24, 2023 at 16:16

@fusion.slope see edit. I had missed that part of the requirement.

– Stéphane Chazelas, Commented May 24, 2023 at 16:45
print > FILENAME ".extended_word.txt" is undefined behavior per POSIX: any expression on the right side of an input/output redirection needs to be inside parentheses, so it has to be print > (FILENAME ".extended_word.txt") for the syntax to be portable. You also need to close the output files as you go (e.g. at the start of every FNR==1 section, close the previous output file) to avoid a "too many open files" error if you cross some threshold. With those changes it could be written more concisely as FNR == 1 { close(out); out=FILENAME ".extended_word.txt"; $0 = $0 FILENAME} {print > out}

– Ed Morton, Commented May 24, 2023 at 17:25
gawk does work around both of the issues I mentioned, but it slows down as the number of open output files increases, so it's still worth closing them as you go when that is easy to do, as in this case.

– Ed Morton, Commented May 24, 2023 at 17:28

@EdMorton, I know that some awk implementations don't like it (OpenBSD's IIRC?) but I can't find the POSIX text that makes it undefined. The grammar specification at least would make that valid. In any case, I've included your improvements, thanks.

– Stéphane Chazelas, Commented May 24, 2023 at 17:34


schrodingerscatcuriosity · Accepted Answer · 2023-05-24 15:41:31Z

1

Assuming all the files are in the same directory and the list is in a file:

files.txt

file1
file2

Using mapfile and sed:

mapfile -t files < files.txt

for f in "${files[@]}"; do
  if [ ! -f "$f" ]; then
    echo "File '$f' does not exist."
    continue
  fi
  
  sed "1s;.*;&$f;" "$f"
  
  # With the '-i' option, the file is edited in place
  # sed -i "1s;.*;&$f;" "$f"
done

answered May 24, 2023 at 15:41


Worth noting that the file paths cannot contain &, ;, newline or backslash characters and, with GNU or busybox sed at least, cannot start with - (with possibly nasty consequences if they do).

– Stéphane Chazelas, Commented May 24, 2023 at 16:49
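One possible way to soften the restriction on &, ; and backslash is to escape those characters before interpolating the name into the sed script. This is only a sketch under assumptions: the scratch directory, the sample file name a&b and the variable esc are made up for illustration, and names containing newlines are still not handled.

```shell
# Scratch directory and sample file are hypothetical, for illustration only.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
f='a&b'
printf '>ATCGCC\nACGT\n' > "$f"

# Backslash-escape the characters that are special in the replacement part of
# s;...;...; here: & (the matched text), ; (the delimiter) and \ itself.
esc=$(printf '%s\n' "$f" | sed 's/[&;\\]/\\&/g')

# The ./ prefix keeps a name starting with - from being taken as an option.
sed "1s;.*;&$esc;" "./$f" > "$f.extended_word.txt"

cat "$f.extended_word.txt"
```

Without the escaping step, the & in the name would be replaced by the matched first line instead of being appended literally.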
