150

What is the best way to split a string like "HELLO there HOW are YOU" by upper-case words?

So I'd end up with an array like such: results = ['HELLO there', 'HOW are', 'YOU']

I have tried:

p = re.compile("\b[A-Z]{2,}\b")
print p.split(page_text)

It doesn't seem to work, though.

When you say something doesn't work, you should explain why. Do you get an exception? (If so, post the whole exception.) Do you get the wrong output? Commented Nov 3, 2012 at 12:44

3 Answers

170

I suggest

l = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)

Check this demo.
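To make the behaviour concrete, here is a minimal sketch of that pattern applied to the sample string from the question. The variable names and the second test sentence are assumptions added for illustration, not part of the original answer.

```python
import re

# Split on whitespace that is not at the start of the string, is followed by
# an upper-case letter, and where that letter is not a one-letter word
# (the final lookahead rejects "any character, then whitespace").
pattern = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)")

parts = pattern.split("HELLO there HOW are YOU")
print(parts)  # ['HELLO there', 'HOW are', 'YOU']

# A lone upper-case letter such as "I" does not trigger a split.
with_pronoun = pattern.split("HELLO there I am HOW are YOU")
print(with_pronoun)  # ['HELLO there I am', 'HOW are', 'YOU']
```

Note that this pattern only checks the first letter, so it would also split in front of a capitalized word such as Hello.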


2 Comments

What happens when you don't use compile?
Per the re docs, "most regular expression operations are available as module-level functions and RegexObject methods. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters." You can use re.split(pattern, string, maxsplit=0, flags=0), as mentioned in the previously cited docs.
70

You could use a lookahead:

re.split(r'[ ](?=[A-Z]+\b)', input)

This will split at every space that is followed by a run of upper-case letters ending at a word boundary.

Note that the square brackets are only for readability and could as well be omitted.
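A short sketch of the lookahead split, assuming the sample string from the question (the names text, parts and mixed are illustrative only):

```python
import re

text = "HELLO there HOW are YOU"

# The space is consumed by the split; the lookahead only peeks at the
# all-upper-case word that follows and leaves it in the result.
parts = re.split(r"[ ](?=[A-Z]+\b)", text)
print(parts)  # ['HELLO there', 'HOW are', 'YOU']

# A capitalized word like "Hello" is not upper-case all the way to its
# word boundary, so there is no split in front of it.
mixed = re.split(r"[ ](?=[A-Z]+\b)", "HELLO there Hello HOW are")
print(mixed)  # ['HELLO there Hello', 'HOW are']
```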

If it is enough that the first letter of a word is upper case (that is, if you also want to split in front of Hello), it gets even simpler:

re.split(r'[ ](?=[A-Z])', input)

Now this splits at every space followed by any upper-case letter.
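For comparison, a small sketch of the first-letter variant; the test sentence is an assumption chosen to include a capitalized word:

```python
import re

text = "HELLO there Hello world HOW are YOU"

# Any upper-case letter after the space is enough to trigger a split,
# so "Hello" now starts a new chunk as well.
parts = re.split(r"[ ](?=[A-Z])", text)
print(parts)  # ['HELLO there', 'Hello world', 'HOW are', 'YOU']
```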

4 Comments

How would I change re.split(r'[ ](?=[A-Z]+\b)', input) so that it didn't match single upper-case letters? E.g. it wouldn't match "A"? I tried re.split(r'[ ](?=[A-Z]{2,}+\b)', input). Thanks!
@JamesEggers You mean that you want to require at least two upper-case letters, so that you do not split at words like I? re.split(r'[ ](?=[A-Z]{2,}\b)', input) should do it.
I'd suggest at least [ ]+ or maybe even \W+ to catch slightly more cases. Still, a good answer.
I tried the same approach. However, having a [ ] did not work for me. Instead, I used \s. The complete regexp that worked for me was re.split("\s(?=[A-Z]+\s)", string)
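The suggestions in these comments can be combined. The following is only a sketch under that assumption: it uses \s+ instead of a single literal space and {2,} to require at least two upper-case letters. The test sentence, with a tab and a double space, is made up for illustration.

```python
import re

text = "HELLO there I am\tHOW  are YOU"

# \s+ accepts any run of whitespace (spaces, tabs, newlines);
# {2,} requires at least two upper-case letters, so "I" is skipped.
parts = re.split(r"\s+(?=[A-Z]{2,}\b)", text)
print(parts)  # ['HELLO there I am', 'HOW  are', 'YOU']
```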
1

Your question contains the string literal "\b[A-Z]{2,}\b", but in a normal string literal \b means the backspace character, because there is no r prefix to make it a raw string.

Try: r"\b[A-Z]{2,}\b".
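A quick way to see the difference, sketched with the sample string from the question (the names broken and fixed are illustrative):

```python
import re

# Without the r prefix, "\b" is a single backspace character (\x08);
# with it, the regex engine receives a backslash followed by "b".
print("\b" == "\x08")  # True
print(len(r"\b"))      # 2

page_text = "HELLO there HOW are YOU"

broken = re.compile("\b[A-Z]{2,}\b")   # looks for literal backspaces
fixed = re.compile(r"\b[A-Z]{2,}\b")   # word boundaries, as intended

broken_parts = broken.split(page_text)
fixed_parts = fixed.split(page_text)

print(broken_parts)  # ['HELLO there HOW are YOU']
print(fixed_parts)   # ['', ' there ', ' are ', '']
```

With the raw string the pattern does match, but split then removes the upper-case words themselves, so on its own this fix still does not produce the array asked for in the question.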
