0

I've got a question regarding Linear Searching in Python. Say I've got the base code of

for l in lines:
  for f in search_data:
     if my_search_function(l[1],[f[0],f[2]]):
        print "Found it!"
        break

in which we want to determine where in search_data exists the value stored in l[1]. Say my_search_function() looks like this:

def my_search_function(search_key, search_values):
   for s in search_values:
      if search_key in s:
         return True
    return False

Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.

for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
    found = False

    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index +1
        negative_index = negative_index -1

However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing speed in half as I'm working with large amounts of CSV and the processing time for one file is > 00:15 which is unacceptable as I'm processing batches of 30+ files. Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!

4
  • Do you have sample data and sample search request/response? Commented Oct 20, 2015 at 13:25
  • See edit above Pieter21 Commented Oct 20, 2015 at 13:27
  • Are you looking for prefix matches (match at the start of string), like your example, or substring matches? Commented Oct 20, 2015 at 14:59
  • Prefix matches like the example Commented Oct 20, 2015 at 17:15

3 Answers 3

1

Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.

Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.

Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)

Code to illustrate the second approach:

lines = ['AS12', 'AS123', 'AS123J', 'AS123JK','AS124']
search_keys = ['AS123', 'AS125']

try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration: # all keys processed
    pass
Sign up to request clarification or add additional context in comments.

Comments

0

Depends on problem detail.

For instance if you search for complete words, you could create a hashtable on searchable elements, and the final search would be a simple lookup.

Filling the hashtable is pseudo-linear.

3 Comments

Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario.
@AndrewSmiley: A trie is the standard data structure here.
@Charles You're right. I didn't catch that before. However, I know the Complexity of a Trie is O(L*W), but I would like better in my code.
0

Ultimately, I was broke down and implemented Binary Search on my multidimensional lists by sorting using the sorted() function with a lambda as a key argument.Here is the first pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were

def binary_search(master_row, source_data,master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2

        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:

            if search([source_data[middle_pos][source_search_index]],[master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
                break
            lower_bound = middle_pos + 1

        elif source_data[middle_pos][source_search_index] > master_row[master_search_index] :

            if search([master_row[master_search_index]],[source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
                break
            upper_bound = middle_pos - 1
        else:

            if len(source_data[middle_pos][source_search_index]) > 5:

                return {"result": True, "index": middle_pos}
            else:
                break

and then where we actually make the Binary Search call

#where master_copy is the first multidimensional list, data_copy is the second
#the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy,m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count +1
                break
        if found:
            break

Here's the info for sorting a multidimensional list Python Sort Multidimensional Array Based on 2nd Element of Subarray

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.