1

In a comparison function I am bascially looking for a pattern (e.g. "AAA") inside a long binary object (for an example, aaaAAAbbbBBB)

I'm working backwards through the file (I know the match will be closer to the end than beginning), an adding 1 byte to the variable that is being checked for the match:

1. aaaAAAbbbBB[B]
2. aaaAAAbbbB[BB]
3. aaaAAAbbb[BBB]
4. aaaAAAbb[bBBB]
5. ... 
n. aaa[AAAbbbBBB] 

match found, offset = -n

Given that I know my pattern is 3 elements long, I wondered if I can simply window the search variable rather than incrementing it - it gets very slow when the match is +1,000,000 elements deep in the list - windowed view of the same data would be:

1. aaaAAAbbb[BBB]
2. aaaAAAbb[bBB]B
3. aaaAAAb[bbB]BB
4. aaaAAA[bbb]BBB
5. ...
n. aaa[AAA]bbbBBB

match found, offset = -n

My current search looks like:

if marker in f_data[-counter:]:
    offset = (len(f_data)-counter)+len(marker)
    return offset

In MATLAB I would have used the array addressing to move through the array,(e.g. calling window = a[5:8], window = a[4:7] etc) but I don't think that's possible in Python (2.7)

I can see a few suggestions for using a sliding window, ( Rolling or sliding window iterator in Python - this looks like a close match) but I can't see how to implement it or they reference libs that I don't know how to use.

Is there a built in function for doing this?

1
  • Can you hold the whole file in memory? Commented Sep 26, 2012 at 1:31

3 Answers 3

5

Why not just use rfind() or rindex()?

haystack = "aaaAAAbbbBBB"
needle   = "AAA"

pos = haystack.rfind(needle)

if pos >= 0:
    print "found at", pos - len(haystack)
else:
    print "not found"
Sign up to request clarification or add additional context in comments.

2 Comments

Oh, that's useful, thank you. There might be n needles in the haystack, but I only need the last one - how would you cope with multiple matches?
@JayGattuso rfind searches from the right. If you need the leftmost match, use find.
0

Two things:

(1) the standard string type holds bytes, and you can use regexes with this. I could suggest that you slurp the object into a string, and perform a regex search.

(2) If you do want to do it the hard way, there's http://docs.python.org/library/itertools.html#itertools.groupby

3 Comments

Thank you, I did think about RegEx but I have multiple match cases, and I only need the last one, at the time (before I know how deep the match would be) my approach seemed reasonable. I suspect I am going about things the wrong way!..
@JayGattuso I don't see why needing only the last one precludes a regex match, or indeed rfind.
the preclusion is more me not knowing how to handle a multiple match scenario rather than an issue with the tools themselves... :)
0

I think this makes use of the window() iterator function you mentioned.

>>> l = "ABCABACAAASSD"
>>> from itertools import islice
>>>
>>> def window(seq, n=2):
...     "Returns a sliding window (of width n) over data from the iterable"
...     "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
...     it = iter(seq)
...     result = tuple(islice(it, n))
...     if len(result) == n:
...         yield result
...     for elem in it:
...         result = result[1:] + (elem,)
...         yield result
...
>>>
>>> data = [c for c in l] # get each byte/charactor as separate item in list
>>> data
['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A', 'A', 'A', 'S', 'S', 'D']
>>> for idx, elements in enumerate(window(reversed(data), n=3)):
...     section = "".join(elements)
...     if section == "AAA":
...         print "found at {}!".format(idx)
...
found at 3!
>>>

To explain:

  • reversed() takes in a list and returns an iterator with the elements in reverse order
  • window() takes in an iterable object (list, tuple, iterator) and returns n number of elements, shifting the index 1 element at a time.
  • enumerate() takes an iterable and simply attaches a counter, so it will return the counter/position and the given element item.

1 Comment

Thank you, one day I hope to meet code like that and understand it!... I suspect at the root I am going about this process the wrong way.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.