Feel free to skip my long-winded explanation if looking at the source code is easier!
So I've written a function to tokenize strings of text. In the simplest case, it takes a string like It's a beautiful morning and returns a list of tokens. For the preceding example, the output would be ['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'].
This is achieved with the first two lines of the function:
separators = dict.fromkeys(whitespace + punctuation, True)
tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]
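For clarity, the dict is only there so that separators.get can serve as the groupby key: it returns True for whitespace and punctuation characters and None for everything else, so consecutive characters with the same key end up in the same token. A quick interactive session (Python 2) to illustrate:

>>> from itertools import groupby
>>> from string import punctuation, whitespace
>>> separators = dict.fromkeys(whitespace + punctuation, True)
>>> separators.get('t'), separators.get("'"), separators.get(' ')
(None, True, True)
>>> [''.join(g) for _, g in groupby("It's a beautiful morning", separators.get)]
['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning']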
The thing to notice here is that It's gets split into ["It", "'", "s"]. In most cases this is not a problem, but sometimes it is. For that reason, I added the stop_words kwarg, which takes a set of strings that are to be "un-tokenized". For example:
>>> tokenize("It's a beautiful morning", stop_words=set(["It's"]))
["It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning']
This "un-tokenization" works by means of a sliding-window that moves across the list of tokens. Consider the schema below. The window is depicted as []
Iteration 1: ['It', "'",] 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 2: 'It', ["'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 3: 'It', "'", ['s', ' ',] 'a', ' ', 'beautiful', ' ', 'morning'
At each iteration, the strings contained in the window are joined and checked against the contents of stop_words. If the window reaches the end of the token list and no match is found, then the window's size increases by 1. Thus:
Iteration 9: ['It', "'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Here we have a match, so the entire window is replaced with a single element: its contents, joined. Thus, at the end of iteration 9, we obtain:
"It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Now, we have to start all over again in case this new token, when combined with its neighbors, forms a stop word. The algorithm sets the window size back to 2 and continues on. The entire process stops at the end of the iteration in which the window size is equal to the length of the token list.
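For concreteness, the windowing itself is done with itertools.tee and izip. Pulled out of the full function further down, it looks roughly like this (sliding_windows is just an illustrative name I'm using here, and Python 2 is assumed):

from itertools import tee, izip

def sliding_windows(tokens, size):
    # One copy of the iterator per window slot, each advanced past
    # `offset` items, then zipped back together so that each tuple
    # is a window of `size` consecutive tokens.
    iters = tee(tokens, size)
    for it, offset in izip(iters, xrange(size)):
        for _ in xrange(offset):
            next(it, None)
    return izip(*iters)

so that, for example, sliding_windows(['It', "'", 's', ' '], 2) yields ('It', "'"), ("'", 's') and ('s', ' ').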
This constant restarting is the source of my algorithm's inefficiency. For short strings with few un-tokenizations it runs very quickly, but the computation time seems to grow exponentially with the number of un-tokenizations and the overall length of the original string.
Here is the full source code for the function:
from itertools import groupby, tee, izip
from string import punctuation, whitespace


def tokenize(phrase, stop_words=None):
    separators = dict.fromkeys(whitespace + punctuation, True)
    tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]

    if stop_words:
        assert isinstance(stop_words, set), 'stop_words must be a set'
        window = 2  # Iterating over single tokens is useless

        while window <= len(tokens):
            # "Sliding window" over the token list: one iterator per
            # window slot, each advanced past its offset.
            iters = tee(tokens, window)
            for i, offset in izip(iters, xrange(window)):
                for _ in xrange(offset):
                    next(i, None)

            # Join each window and check if it's in `stop_words`
            for offset, tkgrp in enumerate(izip(*iters)):
                tk = ''.join(tkgrp)
                if tk in stop_words:
                    # The window spans tokens[offset:offset + window]
                    pre = tokens[:offset]
                    post = tokens[offset + window:]
                    tokens = pre + [tk] + post
                    window = 1  # will be incremented after breaking from the loop
                    break

            window += 1

    return tokens
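To put rough numbers on the growth I described above, a loop like the following (standard-library timeit; the multipliers are arbitrary, and strn/ignore are the same as in the profile below) should show how quickly the runtime climbs as the input gets longer:

>>> from timeit import timeit
>>> strn = "it's a beautiful morning."
>>> ignore = set(["they're", "we'll", "she'll", "it's", "we're", "i'm"])
>>> for n in (25, 50, 100, 200):
...     print n, timeit("tokenize(strn * %d, stop_words=ignore)" % n,
...                     setup="from __main__ import tokenize, strn, ignore",
...                     number=1)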
And here are some hard numbers to work with (the best I could do, in any case).
>>> import cProfile
>>> strn = "it's a beautiful morning."
>>> ignore = set(["they're", "we'll", "she'll", "it's", "we're", "i'm"])
>>> cProfile.run('tokenize(strn * 100, stop_words=ignore)')
57534203 function calls in 15.737 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 10.405 10.405 15.737 15.737 <ipython-input-140-6ef74347708e>:1(tokenize)
1 0.000 0.000 15.737 15.737 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {built-in method fromkeys}
899 0.037 0.000 0.037 0.000 {itertools.tee}
900 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
365450 1.459 0.000 1.459 0.000 {method 'join' of 'str' objects}
57166950 3.836 0.000 3.836 0.000 {next}
From this I gathered that the majority of the execution time is being spent inside my function's own scope rather than in the built-ins it calls. As stated above, I suspect that the incessant resetting of window is responsible for the inefficiency, but I'm not sure how to diagnose this any further.
My questions are as follows:
- How can I further profile this function to ascertain whether it is, indeed, the resetting of window that is responsible for the long execution time?
- What can I do to improve performance?
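For completeness, the same profile can also be dumped to a file and re-sorted with pstats (standard library; the file name below is arbitrary), in case that output is more useful to anyone:

>>> import cProfile, pstats
>>> cProfile.run('tokenize(strn * 100, stop_words=ignore)', 'tokenize.prof')
>>> pstats.Stats('tokenize.prof').sort_stats('cumulative').print_stats(10)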
Thanks very much in advance!