I have a file with 4 columns. When I store each line in an array indexed by NR and print the array, the entries come out duplicated. Details below.

The first 5 lines of the file look like this:

-bash-4.2$ cat -ve file | head -n 5
chr start end p$
13 59341171 59343427 1.86642E-18$
10 72886545 72888679 1.13636E-09$
16 81900987 81902805 6.79697E-09$
1 46797890 46800143 2.24436E-08$

I assigned each line to an array element indexed by NR. Printing the array then gives this (using the first 5 lines as an example):

-bash-4.2$ awk 'NR<6 {a[NR]=$0; 
>                     for(x in a)
>                     print x, a[x]}' file
1 chr start end p
1 chr start end p
2 13 59341171 59343427 1.86642E-18
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09
4 16 81900987 81902805 6.79697E-09
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09
4 16 81900987 81902805 6.79697E-09
5 1 46797890 46800143 2.24436E-08
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09

I can see that the 5 lines of the file are there, but entries are duplicated a few times. I wonder what the problem is and how to fix it. Thanks in advance.

  • Looking at your data, you might also be interested in our sister site: Bioinformatics. Commented Apr 3, 2023 at 13:11
  • Aside from the problem you asked about, for(x in a) can shuffle the order of your output lines. Use for(x=1; x in a; x++) instead if you want to retain the input order. See gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array. Commented Apr 3, 2023 at 13:34
  • @EdMorton Thanks! I was just wondering that too. Cheers. Commented Apr 3, 2023 at 14:06
  • I assume your real code does need to look at each later line for something else in the same awk program? Otherwise you should just filter on the fly, without building an array at all, and exit after 5 records. A bit like cat -n | head -5 but with different formatting for the line numbers. Commented Apr 3, 2023 at 23:03
  • If you were going to do this with awk you should use '{print} NR==5{exit}' or '{a[NR]=$0} NR==5{exit} END{ ... }' instead of 'NR<6 {a[NR]=$0} END { ... }' for better efficiency. Commented Apr 3, 2023 at 23:33

1 Answer

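The for loop sits inside the NR<6 action, so you're telling awk to print the entire array (as built so far) for every line where NR < 6. That gives 1 + 2 + 3 + 4 + 5 = 15 output lines instead of 5.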

If you only want to print the array once, do it after the NR < 6 {} block, in an END block.

For example:

awk 'NR<6 { a[NR] = $0 }; 
     END  { for(x in a) print x, a[x] }' file
1 chr start end p
2 13 59341171 59343427 1.86642E-18
3 10 72886545 72888679 1.13636E-09
4 16 81900987 81902805 6.79697E-09
5 1 46797890 46800143 2.24436E-08
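
As the comment on the question points out, for(x in a) does not guarantee any particular order, so depending on the awk implementation the output above may come out shuffled. Here is a minimal sketch of the same END-block fix with an index-ordered loop. The scratch file created with mktemp is an assumption for the demo and stands in for the question's file.

```shell
# Recreate the first five lines of the question's file in a scratch file
# (the temporary path is an assumption made for this sketch).
tmp=$(mktemp)
printf '%s\n' \
  'chr start end p' \
  '13 59341171 59343427 1.86642E-18' \
  '10 72886545 72888679 1.13636E-09' \
  '16 81900987 81902805 6.79697E-09' \
  '1 46797890 46800143 2.24436E-08' > "$tmp"

# Store the first five records, then print them once, walking the
# indices 1, 2, 3, ... until one is missing from the array.
out=$(awk 'NR<6 { a[NR] = $0 }
           END  { for (x = 1; x in a; x++) print x, a[x] }' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

With the five sample lines this prints records 1 through 5 in input order, each exactly once.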
  • Or just filter on the fly, without building an array at all, and exit after 5 records. A bit like cat -n | head -5 but with different formatting for the line numbers. Perhaps their real use case is more complicated and they do need to look at each later line for something else, and this is just a minimal reproducible example of this part of their problem. But it's worth at least mentioning this. Commented Apr 3, 2023 at 23:02
  • Yeah, there are lots of ways to just number the input lines (btw, head -5 | cat -n is better: it doesn't waste time reading and numbering lines it's not going to print). I assumed the Q was some kind of minimal working example for something else. Commented Apr 4, 2023 at 1:39
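
The "filter on the fly" idea from the comments could look roughly like the sketch below: no array, each record is printed with its number as it is read, and awk stops after the fifth. The scratch file and the two extra rows at the end are made up for the demo, only to show that later lines are never printed.

```shell
# Scratch copy of the sample data plus two hypothetical extra rows
# (the last two lines are invented for this sketch).
tmp=$(mktemp)
printf '%s\n' \
  'chr start end p' \
  '13 59341171 59343427 1.86642E-18' \
  '10 72886545 72888679 1.13636E-09' \
  '16 81900987 81902805 6.79697E-09' \
  '1 46797890 46800143 2.24436E-08' \
  '7 100 200 0.5' \
  '8 300 400 0.9' > "$tmp"

# Print each record prefixed with its number; quit once record 5 is done.
out=$(awk '{ print NR, $0 } NR == 5 { exit }' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Because of the exit, awk stops processing after the fifth record, which matters on large files. If the records are needed later in the same program, the array version with an END block is still the way to go.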