duplicated entries of an array in awk

Question

I have a file with 4 columns. When I put these 4 columns into an array using NR as the index, the entries are duplicated somehow. See below for an elaboration of the issue.

The first 5 lines of the file look like this

-bash-4.2$ cat -ve file | head -n 5
chr start end p$
13 59341171 59343427 1.86642E-18$
10 72886545 72888679 1.13636E-09$
16 81900987 81902805 6.79697E-09$
1 46797890 46800143 2.24436E-08$

I assigned each line to an array entry indexed by NR. The printout of the array then looks like this (using the first 5 lines as an example):

-bash-4.2$ awk 'NR<6 {a[NR]=$0; 
>                     for(x in a)
>                     print x, a[x]}' file
1 chr start end p
1 chr start end p
2 13 59341171 59343427 1.86642E-18
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09
4 16 81900987 81902805 6.79697E-09
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09
4 16 81900987 81902805 6.79697E-09
5 1 46797890 46800143 2.24436E-08
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09

I can see that the 5 lines of the file are there, but entries are duplicated a few times. I wonder what the problem is and how to fix it. Thanks in advance.

Looking at your data, you might also be interested in our sister site: Bioinformatics.
– terdon ♦, Commented Apr 3, 2023 at 13:11
Aside from the problem you asked about, for(x in a) can shuffle the order of your output lines; use for(x=1; x in a; x++) instead if you want to retain the input order. See gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
– Ed Morton, Commented Apr 3, 2023 at 13:34
I assume your real code does need to look at each later line for something else in the same awk program? Otherwise you should just filter on the fly, without building an array at all, and exit after 5 records. A bit like cat -n | head -5 but with different formatting for the line numbers.
– Peter Cordes, Commented Apr 3, 2023 at 23:03
If you were going to do this with awk, you should use '{print} NR==5{exit}' or {a[NR]=$0} NR==5{exit} END{ ... } instead of NR<6 {a[NR]=$0} END { ... } for better efficiency.
– Ed Morton, Commented Apr 3, 2023 at 23:33
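
The "filter on the fly" suggestion from the comments above can be sketched as follows. This is only an illustration under assumptions: the temporary file and the variable names data and first_five are hypothetical, and the sample rows copy the question's first five lines plus one made-up extra record that should never be printed.

```shell
# Hypothetical sample data: the question's first five lines plus one
# made-up extra record that the program should never reach.
data=$(mktemp)
printf '%s\n' \
  'chr start end p' \
  '13 59341171 59343427 1.86642E-18' \
  '10 72886545 72888679 1.13636E-09' \
  '16 81900987 81902805 6.79697E-09' \
  '1 46797890 46800143 2.24436E-08' \
  '7 100 200 0.5' > "$data"

# Print each record with its record number, then stop reading after the fifth.
# No array is built, so nothing can be printed twice.
first_five=$(awk '{ print NR, $0 } NR == 5 { exit }' "$data")
printf '%s\n' "$first_five"
rm -f "$data"
```

If the array really is needed later, the second form from the comment works the same way, because exit in a main rule still runs the END block before awk terminates.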

cas · Accepted Answer · 2023-04-03 12:07:22Z

You're telling it to print the entire array for every line where NR < 6.

If you only want to print the array once, do it after the NR < 6 {} block, in an END block.

For example:

awk 'NR<6 { a[NR] = $0 }; 
     END  { for(x in a) print x, a[x] }' file
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09
4 16 81900987 81902805 6.79697E-09
5 1 46797890 46800143 2.24436E-08
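
To spell out the arithmetic: with the print loop inside the NR<6 block, record n prints all n entries stored so far, so five records produce 1+2+3+4+5 = 15 lines, which is the number of lines in the question's output. The sketch below combines the END fix with the index-ordered loop mentioned in the comments, since the traversal order of for(x in a) is not guaranteed. The temporary file and the names sample and ordered are hypothetical, and the last row is made up so that the NR<6 filter has something to exclude.

```shell
# Hypothetical sample data: the question's first five lines plus one
# made-up extra record that NR<6 should filter out.
sample=$(mktemp)
printf '%s\n' \
  'chr start end p' \
  '13 59341171 59343427 1.86642E-18' \
  '10 72886545 72888679 1.13636E-09' \
  '16 81900987 81902805 6.79697E-09' \
  '1 46797890 46800143 2.24436E-08' \
  '7 100 200 0.5' > "$sample"

# Store the first five records, then print the array exactly once in END.
# Counting x upward while (x in a) holds visits the indices in input order.
ordered=$(awk 'NR < 6 { a[NR] = $0 }
               END    { for (x = 1; (x in a); x++) print x, a[x] }' "$sample")
printf '%s\n' "$ordered"
rm -f "$sample"
```

The counting loop relies on the indices being the consecutive integers 1, 2, 3, ..., which holds here because the array is indexed by NR.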

Or just filter on the fly, without building an array at all, and exit after 5 records. A bit like cat -n | head -5 but with different formatting for the line numbers. Perhaps their real use-case is more complicated and they do need to look at each later line for something else, and this is just a minimal reproducible example of this part of their problem. But it's worth at least mentioning.
– Peter Cordes, Commented Apr 3, 2023 at 23:02

Yeah, there are lots of ways to just number the input lines (btw, head -5 | cat -n is better: it doesn't waste time reading and numbering lines it's not going to print). I assumed the Q was some kind of minimal working example for something else.
– cas, Commented Apr 4, 2023 at 1:39
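
The two pipeline orders from this exchange can be compared with a small sketch. The eight-line input file and the variable names are hypothetical, head -n 5 is used as the current spelling of head -5, and cat -n is assumed to be available (it is in GNU and BSD cat).

```shell
# Hypothetical sample input: eight short lines standing in for the real file.
input=$(mktemp)
printf 'line%s\n' 1 2 3 4 5 6 7 8 > "$input"

# Trim first, then number: cat -n only ever sees five lines.
trim_then_number=$(head -n 5 "$input" | cat -n)

# Number first, then trim: same visible result, but cat -n starts numbering
# the whole stream and only stops once head has exited.
number_then_trim=$(cat -n "$input" | head -n 5)

printf '%s\n' "$trim_then_number"
rm -f "$input"
```

Both variables should hold the same five numbered lines; the difference is only in how much input the numbering step has to process.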
