2

After some processing I ended up with a file like this one:

ALA 251
VAL 252
TYR 253
LYS 254
SER 255
ALA 256
ALA 257
MET 258
LEU 259
ASP 260
MET 261
THR 262
GLY 263
ALA 264
GLY 265
TYR 266
VAL 267
TRP 268

Let's call the first column "res" and the second "num". Please note that "res" is always composed of 3 letters, and "num" going from 1 to 4 numbers.

I am looking for a way to extract the position (value of column "num") corresponding to the first "res" of the exact column pattern composed of four successive "res" like this one:

TYR
LYS
SER
ALA

In this case, according to the file and the indicated pattern, the output should then be:

253

I made several attemps with awk. It seems it should be doable but my skills aren't sufficient for the moment. I would be very gratefull if any brilliant user has a proposition for this.

2
  • 2
    Not an answer, but a suggestion: this format is hard to read. I bet that we could give you much better answers if you asked on our sister site, Bioinformatics but ask about your actual input data, before you format it in this way. Presumably, you have a protein fasta file at some point and it would almost certainly be much simpler to find the pattern there. Commented Sep 19, 2019 at 15:07
  • @terdon Thanks for the advise. You're right it is protein, but extracted from structural pdb file. Commented Sep 20, 2019 at 7:19

5 Answers 5

2

Notwithstanding terdon’s excellent suggestion, the following AWK script does the job:

$1 == "TYR" { seq = $1; start = $2; next }
($1 == "LYS" && seq == "TYR") || ($1 == "SER" && seq == "LYS") { seq = $1; next }
$1 == "ALA" && seq == "SER" { print start }
{ seq = "" }

This looks for TYR, and remembers the start position; it also matches TYR, LYS, SER in the correct sequence, noting the previous item in the sequence in seq at every stage. Non-matching lines cause the sequence to be cleared.

1

Sliding window with sed:

parse.sed

# Establish the sliding window
1N
2N

# Maintain the sliding window
N

# Match the desired pattern to the current window
/^TYR \(.*\)\nLYS .*\nSER .*\nALA .*$/ { 
  h;                           # Save the window in hold space
  s//\1/p;                     # Extract desired output
  x;                           # Re-establish window
}

# Maintain the sliding window
D

Run it like this:

sed -nf parse.sed infile

Output:

253
1
  • Wow, Thanks, it is working great. I definitely should have a better look of how sed is working. Commented Sep 20, 2019 at 7:56
0

Same approach as in Stephen Kitt's answer, but without additional seq variable. Instead the sequential "number" is used to determine if the current line belongs to the set we're looking for.

awk '{
  if ($1=="TYR") {
    i=$2 # remember index
  }
  else if (i!=0) { 
    if ($2==i+1 && $1=="LYS" || $2==i+2 && $1=="SER" || $2==i+3 && $1=="ALA") {
      if ($2==i+3) { # are we there yet?
        print i; exit
      }
    }
    else {
      i=0 # nope, reset index
    }
  }
}' file

(Unneeded curly braces and indentation left for readability)

0
0

Transpose your data and pattern and the task becomes easier, e.g.:

# Determine offset to pattern
offset=$(cut -d' ' -f1 infile | paste -s | grep -ob "$(paste -s patternfile)" | cut -d: -f1)

# Adjust for field length
reslength=3
(( offset = offset % reslength + 1 ))

# Extract desired number
awk -v offset=$offset 'NR==offset { print $2 }' infile

Output:

253

Note that this solution assumes that there is only one match.

1
  • Launching the script I got : "Illegal variable name." I am working on it. Thanks a lot for the proposition and comment. I calculated that using a 4 res window renders highly improbable than two matches appear in the any "infiles" I am working with. Best :-) Commented Sep 20, 2019 at 8:03
0

You could do this with a sliding window:

parse.awk

# Split the pattern into the `p` array and remember how many there are in `n`
BEGIN { n = split(pat, p, "\n") }

# Collect n lines into the `A` array
NR <= n { A[NR] = $0; next }

# Maintain the sliding window after n lines
NR  > n {
  for(i=2; i<=n; i++)
    A[i-1] = A[i]
  A[n] = $0
}

# Test if the current window contains the pattern
{
  hit = 1
  for(i=1; i<=n; i++) {
    split(A[i], x)
    if(x[1] != p[i]) {
      hit = 0
      break
    }
  }

  # If the window matches print the second column
  if(hit) {
    split(A[1], x)
    print x[2]
  }
}

Run it like this:

awk -v pat="$(< patternfile)" -f parse.awk infile

Output:

253
3
  • Wow... very elegant. I never saw the usage of : " awk -v pat="$(< variable)" Commented Sep 20, 2019 at 9:14
  • In particular < . I had troubles to make the script work to begin with, getting "Illegal variable name." But it was simply solved switching to bash. Many thanks! Commented Sep 20, 2019 at 9:22
  • @Siondubois: Sorry, that is a bashism. In most shells you can get away with `"$(cat patternfile)" Commented Sep 24, 2019 at 7:05

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.