I'm reformatting a big file with sample metadata. I have a file (let's call it File2) listing the group each sample belongs to, with one id and pop per line. My idea was to loop over that file with while read and use sed -i to update each sample's info. The issue is that sed is not updating the file.

The input file is a .fam file from plink, in this fashion:

pop id 0 0 0 -9
pop id 0 0 0 -9
pop id 0 0 0 -9
pop id 0 0 0 -9

Right now pop and id are the same, so I want to update the file with File2, but the sed code I normally use for this doesn't seem to work:

while read -r id pop; do sed -i 's/^$id/$pop/' File1.fam; done < File2.txt

I have tried only the sed command without iteration and it works fine. But I have 700 samples and I would dread having to do this one by one.

Why is it not working?

  • How does your File2.txt look? Is it a list of lines where each line lists two items: a "pop" and an "id"? Is your goal to replace the "pop" and "id" literals with the matching values from the same line number of File2.txt? Commented Jul 23 at 10:34
  • Yes, the format of File2.txt is just id pop, and the idea was to iterate over that file and change the first appearance of $id for its $pop. Commented Jul 23 at 12:00
  • single quotes != double quotes. See mywiki.wooledge.org/Quotes. Commented Jul 29 at 12:27
  • Please edit your question to show a minimal reproducible example with examples of both input files, File1.fam and File2.txt, plus the expected output given that input so we can best help you. You say "I have 700 samples" - also clarify if that means 700 lines in one of the files or 700 instances of one or the other of the files or something else. Commented Jul 29 at 12:32

2 Answers

Assuming that your files are formatted as follows:

$ cat file1.fam
pop id1 0 0 0 -9
pop id2 0 0 0 -9
pop id3 0 0 0 -9

$ cat file2.txt
id3   POP003
id2   POP002
id1   POP001

If your goal is to replace the 1st column in file1.fam with the values from the 2nd column from file2.txt using the id* values for matching, you can:

  1. Read file2.txt into a map: map[id] = pop.
  2. Iterate file1.fam and replace the 1st field with map[id] where id is taken from the 2nd field.

E.g.,

awk 'NR==FNR { map[$1]=$2; next } { if ($2 in map) $1 = map[$2]; print }' \
    file2.txt OFS=' ' file1.fam

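In the command above, awk reads the two files sequentially: file2.txt, then file1.fam. NR is the record number counted across all input, while FNR restarts at 1 for each file, so NR equals FNR only while awk is reading the first file named on the command line (file2.txt in the command above). The following example, which uses the opposite file order purely for illustration, shows how the two counters behave:

$ awk '{print FNR, NR, $0}' file1.fam file2.txt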
1 1 pop id1 0 0 0 -9
2 2 pop id2 0 0 0 -9
3 3 pop id3 0 0 0 -9
1 4 id3   POP003
2 5 id2   POP002
3 6 id1   POP001

The NR==FNR block fills the map with keys from the first column (IDs) and values from the second column (pop values). For the remaining lines, which come from file1.fam, the first column (pop) is replaced with the matching value from the map, if any.

The result is printed to the standard output. You can redirect it to a file if you wish:

awk ... > output.txt

Note that by default awk splits fields on runs of whitespace (spaces and tabs). If the values in your files may contain spaces, you might need to adjust the field separator (FS) or consider using other tools (e.g., Perl). But the idea will remain the same.
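
As a minimal sketch of adjusting the separator, assuming tab-separated files whose values may contain spaces (the file names and sample values below are hypothetical, not taken from the question), the same map-and-replace idea could look like this:

```shell
# Hypothetical tab-separated inputs, created in a scratch directory for illustration.
tmpdir=$(mktemp -d)
printf 'id 1\tPOP 001\nid 2\tPOP 002\n' > "$tmpdir/file2.tsv"
printf 'pop\tid 1\t0\t0\t0\t-9\npop\tid 2\t0\t0\t0\t-9\n' > "$tmpdir/file1.tsv"

# BEGIN sets tab as both the input (FS) and output (OFS) separator,
# so a value such as "POP 001" stays inside a single field.
result=$(awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { map[$1] = $2; next }   # first file: remember id -> pop
     $2 in map { $1 = map[$2] }         # second file: replace field 1
     { print }' "$tmpdir/file2.tsv" "$tmpdir/file1.tsv")
printf '%s\n' "$result"
```

Setting OFS matters here because awk rebuilds the line with OFS whenever a field is assigned; without it the output would be joined with single spaces.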

3 Comments

Thanks for the explanation. This solution works. In the end I wrote a basic Python script to do it, but I was wondering why sed -i didn't work when iterating but did when I entered the data by hand.
@PedroMorell, your sed expression 's/^$id/$pop/' didn't work because 1) it's single-quoted, and the shell doesn't expand variables in single-quoted strings; 2) ^$id tries to match the ID at the beginning of the line in File1.fam, but the ID is in the second column. You could try something like sed -r -i .bak "s/^(.+)( +${id})/${pop}\2/", but I don't recommend that because the ID and pop values might contain special characters that sed would interpret as part of the regular expression. Therefore, a more sophisticated parser (awk/Perl/Python) is much safer.
You don't need OFS=' ' as a blank char is the default for OFS. Also the more idiomatic way to write { if ($2 in map) $1 = map[$2]; print } would be $2 in map { $1 = map[$2] } 1.
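
The quoting point made in the comments can be checked on a single line without touching any file. This is only a sketch: the variable values and the sample line are made up, and the double-quoted pattern is one possible way to anchor on the second field, assuming the IDs contain no regex metacharacters.

```shell
id='id1'; pop='POP001'
line='pop id1 0 0 0 -9'

# Single quotes: the shell passes $id and $pop to sed untouched, so the
# pattern only matches the literal text "$id" and the line is unchanged.
single=$(printf '%s\n' "$line" | sed 's/^$id/$pop/')

# Double quotes: the shell expands the variables before sed runs.
# The pattern skips the first field and matches the ID in the second one;
# the trailing space keeps id1 from also matching id10.
double=$(printf '%s\n' "$line" | sed -E "s/^[^ ]+( +${id} )/${pop}\1/")

printf '%s\n%s\n' "$single" "$double"
```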

This might work for you (GNU parallel and sed):

sed -E 's#(.*) (.*)#s/^\1/\2/#' file2 | parallel sed -f - -i.bak {1} ::: file?.fam

This turns file2 into a sed script, then uses parallel to run the generated script over all the sample files.
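
To see what the generated script looks like, here is a small sketch of the same two steps without parallel. The file names and contents are hypothetical, and it assumes, as the question states, that the first column currently holds the ID, so each generated command anchors on the start of the line.

```shell
# Hypothetical inputs in a scratch directory.
tmpdir=$(mktemp -d)
printf 'id1 POP001\nid2 POP002\n' > "$tmpdir/file2"
printf 'id1 id1 0 0 0 -9\nid2 id2 0 0 0 -9\n' > "$tmpdir/sample.fam"

# Step 1: turn each "id pop" line into one sed substitute command.
sed -E 's#(.*) (.*)#s/^\1/\2/#' "$tmpdir/file2" > "$tmpdir/script.sed"
script=$(cat "$tmpdir/script.sed")
printf '%s\n' "$script"

# Step 2: apply the generated script to a sample file (printed here
# rather than edited in place, so the result is easy to inspect).
result=$(sed -f "$tmpdir/script.sed" "$tmpdir/sample.fam")
printf '%s\n' "$result"
```

Because the generated patterns are plain prefixes, an ID such as id1 would also match a line starting with id10; appending a space to both the pattern and the replacement is one way to guard against that.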
