Find position of specific column pattern

Question

After some processing I ended up with a file like this one:

ALA 251
VAL 252
TYR 253
LYS 254
SER 255
ALA 256
ALA 257
MET 258
LEU 259
ASP 260
MET 261
THR 262
GLY 263
ALA 264
GLY 265
TYR 266
VAL 267
TRP 268

Let's call the first column "res" and the second "num". Please note that "res" is always composed of 3 letters, and "num" going from 1 to 4 numbers.

I am looking for a way to extract the position (value of column "num") corresponding to the first "res" of the exact column pattern composed of four successive "res" like this one:

TYR
LYS
SER
ALA

In this case, according to the file and the indicated pattern, the output should then be:

I made several attemps with awk. It seems it should be doable but my skills aren't sufficient for the moment. I would be very gratefull if any brilliant user has a proposition for this.

Not an answer, but a suggestion: this format is hard to read. I bet that we could give you much better answers if you asked on our sister site, Bioinformatics but ask about your actual input data, before you format it in this way. Presumably, you have a protein fasta file at some point and it would almost certainly be much simpler to find the pattern there. — terdon
– terdon ♦, Commented Sep 19, 2019 at 15:07
@terdon Thanks for the advise. You're right it is protein, but extracted from structural pdb file. — Sion du bois
– Sion du bois, Commented Sep 20, 2019 at 7:19

Stephen Kitt · Accepted Answer · 2019-09-19 15:11:39Z

2

Notwithstanding terdon’s excellent suggestion, the following AWK script does the job:

$1 == "TYR" { seq = $1; start = $2; next }
($1 == "LYS" && seq == "TYR") || ($1 == "SER" && seq == "LYS") { seq = $1; next }
$1 == "ALA" && seq == "SER" { print start }
{ seq = "" }

This looks for TYR, and remembers the start position; it also matches TYR, LYS, SER in the correct sequence, noting the previous item in the sequence in seq at every stage. Non-matching lines cause the sequence to be cleared.

answered Sep 19, 2019 at 15:11

Stephen Kitt

481k60 gold badges1.2k silver badges1.4k bronze badges

Add a comment |

Thor · Accepted Answer · 2019-09-24 07:10:44Z

1

Sliding window with sed:

parse.sed

# Establish the sliding window
1N
2N

# Maintain the sliding window
N

# Match the desired pattern to the current window
/^TYR \(.*\)\nLYS .*\nSER .*\nALA .*$/ { 
  h;                           # Save the window in hold space
  s//\1/p;                     # Extract desired output
  x;                           # Re-establish window
}

# Maintain the sliding window
D

Run it like this:

sed -nf parse.sed infile

Output:

edited Sep 24, 2019 at 7:10

answered Sep 19, 2019 at 17:49

Thor

17.5k3 gold badges55 silver badges71 bronze badges

Wow, Thanks, it is working great. I definitely should have a better look of how sed is working.

Sion du bois
– Sion du bois

2019-09-20 07:56:39 +00:00
Commented Sep 20, 2019 at 7:56

Add a comment |

Freddy · Accepted Answer · 2019-09-19 16:21:07Z

Same approach as in Stephen Kitt's answer, but without additional seq variable. Instead the sequential "number" is used to determine if the current line belongs to the set we're looking for.

awk '{
  if ($1=="TYR") {
    i=$2 # remember index
  }
  else if (i!=0) { 
    if ($2==i+1 && $1=="LYS" || $2==i+2 && $1=="SER" || $2==i+3 && $1=="ALA") {
      if ($2==i+3) { # are we there yet?
        print i; exit
      }
    }
    else {
      i=0 # nope, reset index
    }
  }
}' file

(Unneeded curly braces and indentation left for readability)

Thor · Accepted Answer · 2019-09-19 16:50:59Z

0

Transpose your data and pattern and the task becomes easier, e.g.:

# Determine offset to pattern
offset=$(cut -d' ' -f1 infile | paste -s | grep -ob "$(paste -s patternfile)" | cut -d: -f1)

# Adjust for field length
reslength=3
(( offset = offset % reslength + 1 ))

# Extract desired number
awk -v offset=$offset 'NR==offset { print $2 }' infile

Output:

Note that this solution assumes that there is only one match.

edited Sep 19, 2019 at 16:50

answered Sep 19, 2019 at 16:37

Thor

17.5k3 gold badges55 silver badges71 bronze badges

Launching the script I got : "Illegal variable name." I am working on it. Thanks a lot for the proposition and comment. I calculated that using a 4 res window renders highly improbable than two matches appear in the any "infiles" I am working with. Best :-)

Sion du bois
– Sion du bois

2019-09-20 08:03:57 +00:00
Commented Sep 20, 2019 at 8:03

Add a comment |

Thor · Accepted Answer · 2019-09-24 07:08:29Z

0

You could do this with a sliding window:

parse.awk

# Split the pattern into the `p` array and remember how many there are in `n`
BEGIN { n = split(pat, p, "\n") }

# Collect n lines into the `A` array
NR <= n { A[NR] = $0; next }

# Maintain the sliding window after n lines
NR  > n {
  for(i=2; i<=n; i++)
    A[i-1] = A[i]
  A[n] = $0
}

# Test if the current window contains the pattern
{
  hit = 1
  for(i=1; i<=n; i++) {
    split(A[i], x)
    if(x[1] != p[i]) {
      hit = 0
      break
    }
  }

  # If the window matches print the second column
  if(hit) {
    split(A[1], x)
    print x[2]
  }
}

Run it like this:

awk -v pat="$(< patternfile)" -f parse.awk infile

Output:

edited Sep 24, 2019 at 7:08

answered Sep 19, 2019 at 17:39

Thor

17.5k3 gold badges55 silver badges71 bronze badges

Wow... very elegant. I never saw the usage of : " awk -v pat="$(< variable)"

Sion du bois
– Sion du bois

2019-09-20 09:14:24 +00:00
Commented Sep 20, 2019 at 9:14
In particular < . I had troubles to make the script work to begin with, getting "Illegal variable name." But it was simply solved switching to bash. Many thanks!

Sion du bois
– Sion du bois

2019-09-20 09:22:45 +00:00
Commented Sep 20, 2019 at 9:22
@Siondubois: Sorry, that is a bashism. In most shells you can get away with `"$(cat patternfile)"

Thor
– Thor

2019-09-24 07:05:38 +00:00
Commented Sep 24, 2019 at 7:05

Add a comment |

Stack Exchange Network

Find position of specific column pattern

5 Answers 5

You must log in to answer this question.

Hot Network Questions

Find position of specific column pattern

5 Answers 5

You must log in to answer this question.

Related

Hot Network Questions