Skip to main content
edited tags
Link
Jon Seigel
  • 12.5k
  • 8
  • 61
  • 93
Added final coded solution
Source Link
moinudin
  • 139.3k
  • 45
  • 195
  • 218

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

Coded binary search + seek solution (thanks kigurai!):

entries = 24935502 # number of entries
width   = 18       # fixed width of an entry in the file padded with spaces
                   # at the end of each line
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, entries-1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid * width)
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

Coded binary search + seek solution (thanks kigurai!):

entries = 24935502 # number of entries
width   = 18       # fixed width of an entry in the file padded with spaces
                   # at the end of each line
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, entries-1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid * width)
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search
Clarified code
Source Link
moinudin
  • 139.3k
  • 45
  • 195
  • 218

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# addlist contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# add sentinel at the end of list
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel
Source Link
moinudin
  • 139.3k
  • 45
  • 195
  • 218
Loading