
Feel free to skip my long-winded explanation if looking at the source code is easier!

So I've written a function to tokenize strings of text. In the simplest case, it takes a string like It's a beautiful morning and returns a list of tokens. For the preceding example, the output would be ['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'].

This is achieved with the first two lines of the function:

separators = dict.fromkeys(whitespace + punctuation, True)
tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]
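
For example, running just those two lines on the sample string reproduces the output above (groupby starts a new group whenever separators.get flips between True, for a separator character, and None, for an ordinary character):

>>> from itertools import groupby
>>> from string import punctuation, whitespace
>>> separators = dict.fromkeys(whitespace + punctuation, True)
>>> [''.join(g) for _, g in groupby("It's a beautiful morning", separators.get)]
['It', "'", 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning']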

The thing to notice here is that It's gets split into ["It", "'", "s"]. In most cases, this is not a problem, but sometimes it is. For this reason, I added the stop_words kwarg, which takes a set of strings that are to be "un-tokenized". For example:

>>> tokenize("It's a beautiful morning", stop_words=set(["It's"]))
["It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning']

This "un-tokenization" works by means of a sliding-window that moves across the list of tokens. Consider the schema below. The window is depicted as []

Iteration 1:  ['It', "'",] 's', ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 2:  'It', ["'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'
Iteration 3:  'It', "'", ['s', ' ',] 'a', ' ', 'beautiful', ' ', 'morning'

At each iteration, the strings contained in the window are joined and checked against the contents of stop_words. If the window reaches the end of the token list and no match is found, then the window's size increases by 1. Thus:

Iteration 9:  ['It', "'", 's',] ' ', 'a', ' ', 'beautiful', ' ', 'morning'

Here we have a match, so the entire window is replaced with a single element: its contents, joined. Thus, at the end of iteration 9, we obtain:

"It's", ' ', 'a', ' ', 'beautiful', ' ', 'morning'

Now, we have to start all over again in case this new token, when combined with its neighbors, forms a stop word. The algorithm sets the window size back to 2 and continues on. The entire process stops at the end of the iteration in which the window size is equal to the length of the token list.
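
In plain index-based terms, the procedure looks roughly like this (a simplified sketch of the same idea; the actual implementation below uses tee/izip instead of slicing):

def untokenize(tokens, stop_words):
    window = 2
    while window <= len(tokens):
        for offset in xrange(len(tokens) - window + 1):
            tk = ''.join(tokens[offset:offset + window])
            if tk in stop_words:
                # Replace the whole window with its joined contents,
                # then restart with pairs on the next pass.
                tokens = tokens[:offset] + [tk] + tokens[offset + window:]
                window = 1  # incremented back to 2 below
                break
        window += 1
    return tokens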

This restarting is the source of my algorithm's inefficiency. For small strings with few un-tokenizations, it works very quickly. However, the computational time seems to grow exponentially with the number of un-tokenizations and the overall length of the original string.

Here is the full source code for the function:

from itertools import groupby, tee, izip
from string import punctuation, whitespace

def tokenize(phrase, stop_words=None):
    separators = dict.fromkeys(whitespace + punctuation, True)
    tokens = [''.join(g) for _, g in groupby(phrase, separators.get)]

    if stop_words:
        assert isinstance(stop_words, set), 'stop_words must be a set'
        window = 2  # Iterating over single tokens is useless
        while window <= len(tokens):
            # "sliding window" over token list
            iters = tee(tokens, window)
            for i, offset in izip(iters, xrange(window)):
                for _ in xrange(offset):
                    next(i, None)

            # Join each window and check if it's in `stop_words`
            for offset, tkgrp in enumerate(izip(*iters)):
                tk = ''.join(tkgrp)
                if tk in stop_words:
                    pre = tokens[0: offset]
                    post = tokens[offset + window:]
                    tokens = pre + [tk] + post
                    window = 1  # will be incremented after breaking from loop
                    break

            window += 1

    return tokens

And here are some hard numbers to work with (the best I could do, in any case).

>>> import cProfile
>>> strn = "it's a beautiful morning."
>>> ignore = set(["they're", "we'll", "she'll", "it's", "we're", "i'm"])
>>> cProfile.run('tokenize(strn * 100, stop_words=ignore)')
         57534203 function calls in 15.737 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.405   10.405   15.737   15.737 <ipython-input-140-6ef74347708e>:1(tokenize)
        1    0.000    0.000   15.737   15.737 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method fromkeys}
      899    0.037    0.000    0.037    0.000 {itertools.tee}
      900    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   365450    1.459    0.000    1.459    0.000 {method 'join' of 'str' objects}
 57166950    3.836    0.000    3.836    0.000 {next}

From this I gathered that the majority of execution time was taking place in my function's scope. As stated above, I suspect that the incessant resetting of window is responsible for the inefficiency, but I'm not sure how to diagnose this any further.

My questions are as follows:

  1. How can I further profile this function to ascertain whether it is, indeed, the resetting of window that is responsible for the long execution time?
  2. What can I do to improve performance?

Thanks very much in advance!

4 Comments
  • Is the tokenizing really a bottleneck in your actual program? Commented Mar 13, 2013 at 3:51
  • @Kay, I wouldn't be so concerned with optimizing this function if it wasn't! =) Commented Mar 13, 2013 at 3:52
  • You can change the tokens comprehension to a regex like re.findall('\w+|\W+', phrase) Commented Mar 13, 2013 at 4:06
  • @JBernardo, are you talking about the second line of the function? That's not where the bottleneck is (AFAIK), and the groupby operation works just fine. Commented Mar 13, 2013 at 4:10

2 Answers


I might have misunderstood the problem, but it seems like just searching for the ignored words before splitting will solve the issue:

import re

def tokenize(phrase, stop_words=()):
    stop_words = '|'.join(re.escape(x) + r'\b' for x in stop_words)
    other = r'\s+|\w+|[^\s\w]+'
    regex = stop_words + '|' + other if stop_words else other
    return re.findall(regex, phrase)

As pointed out by Michael Anderson, you should add \b to avoid matching parts of words.

Edit: the new regex will separate whitespace from punctuation.
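
For instance, with the stop-word set from the question, I'd expect:

>>> ignore = set(["they're", "we'll", "she'll", "it's", "we're", "i'm"])
>>> tokenize("it's a beautiful morning", stop_words=ignore)
["it's", ' ', 'a', ' ', 'beautiful', ' ', 'morning']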


7 Comments

You need a "\b" at either end of the escaped stop word I think.
@JBernardo, this seems promising! Right now, I'm getting a TypeError when stop_words is None. Could this be modified such that spaces and quotations are not lumped together as tokens? When I run your code with the stop_words set defined in my original question, tokenize("it's a 'beautiful' morning", stop_words=ignore) returns ["it's", ' ', 'a', " '", 'beautiful', "' ", 'morning']. Ideally, it would return ["it's", ' ', 'a', ' ', "'", 'beautiful', "'", ' ', 'morning']
@JBernardo, like this? '|\w*?'+ stop_words + '|\w+|\W+' That returns a bunch of empty strings.
@JBernardo, We're getting there! =) That works great when a set of stopwords is passed to the function, but it still returns a bunch of empty strings when an empty set or empty tuple is passed!
The next thing to do is to construct the regex outside the function, compile it and pass the regex in rather than recompile it each call. This should give a significant further speed-up if your stop_words are the same for each call.
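
A minimal sketch of that last suggestion (the build_tokenizer name is just illustrative, not from the thread):

import re

def build_tokenizer(stop_words=()):
    # Build and compile the pattern once, up front
    stop_words = '|'.join(re.escape(x) + r'\b' for x in stop_words)
    other = r'\s+|\w+|[^\s\w]+'
    regex = re.compile(stop_words + '|' + other if stop_words else other)
    return regex.findall  # call the returned function on each phrase

tokenize = build_tokenizer(set(["it's", "we're", "i'm"]))
tokens = tokenize("it's a beautiful morning")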

I vote for regular expressions!

If you don't care about excluding punctuation from your tokens list, you can do

import re
text = '''It's a beautiful morning''' 
tokens = re.split(' ', text)

gives you

["It's", 'a', 'beautiful', 'morning']

If you want to nuke all punctuation, you can

tokens = re.split(r'\W+', text)

to get back

['It', 's', 'a', 'beautiful', 'morning']

If you want to keep contractions like It's together as single tokens (note this pattern drops the whitespace), you can use findall instead:

tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text)

1 Comment

The behavior of my original function is exactly as desired. It's the timing that's problematic. Your solution doesn't exhibit the same tokenization behavior.
