Unicode Regex with regex not working in Python

Question

I have the following Regex (see it in action in PCRE)

.*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$

However, Python's re module doesn't support Unicode regex with the \p{} syntax. To solve this, I read that I could use the regex module (not the default re), but this doesn't seem to work either, not even with the u flag.

Example:

sentence = "valt nog zoveel zal kunnen zeggen, "

print(re.sub(".*?\P{L}*?(\p{L}+-?(\p{L}+)?)\P{L}*$","\1",sentence))

Output: < blank >
Expected output: zeggen

This doesn't work with Python 3.4.3.

It works fine with the regex module if you use raw string notation. If that doesn't do it, place (?iV1) at the beginning of your pattern as well.
– hwnd, Commented Aug 16, 2015 at 20:21
Are you using raw string notation? regex.sub(r'...', r'\1', sentence)
– hwnd, Commented Aug 16, 2015 at 20:33
@hwnd I wasn't. Now I am and it is working. Thank you! I don't understand why, though. What is raw string notation, and should I always use it?
– Bram Vanroy, Commented Aug 16, 2015 at 20:38
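
A minimal sketch of why the r prefix matters in this exchange. It uses the standard re module, with [^\W\d_] standing in for \p{L} and [\W\d_] for \P{L}. That substitution is an approximation, chosen only so the snippet runs without the third-party regex module. In a normal string literal "\1" is an octal escape for the single character U+0001, so the replacement never refers to group 1.

```python
import re

sentence = "valt nog zoveel zal kunnen zeggen, "

# Stand-ins (assumption): [^\W\d_] for \p{L}, [\W\d_] for \P{L}.
pattern = r".*?[\W\d_]*?([^\W\d_]+-?([^\W\d_]+)?)[\W\d_]*$"

# No r prefix: Python turns "\1" into the control character U+0001
# before the regex engine ever sees it.
not_raw = re.sub(pattern, "\1", sentence)

# With the r prefix the backslash survives, and \1 means "group 1".
raw = re.sub(pattern, r"\1", sentence)

print(repr(not_raw))  # '\x01'
print(repr(raw))      # 'zeggen'
```

The control character is invisible in most terminals, which would look like blank output. Using raw strings for both the pattern and the replacement avoids this class of problem.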

Casimir et Hippolyte · Accepted Answer · 2015-08-18 21:58:58Z

As you can see, Unicode character classes like \p{L} are not available in the re module. However, that doesn't mean you can't do it with the re module, since \p{L} can be replaced with [^\W\d_] together with the UNICODE flag (even though there are small differences between these two character classes; see the link in the comments).
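
A small sketch of that replacement, assuming Python 3, where str patterns are Unicode-aware by default and the flag is redundant but harmless. The sample text is made up for illustration.

```python
import re

# [^\W\d_] reads as: a word character (\w) that is neither a digit nor "_".
letter = re.compile(r"[^\W\d_]", re.UNICODE)

# Escapes are used so the sample does not depend on the file encoding;
# the text is "lélijk_42 Ünï".
text = "l\u00e9lijk_42 \u00dcn\u00ef"

letters = "".join(letter.findall(text))
print(letters)  # lélijkÜnï
```

The accented letters are kept, while the underscore, the digits and the space are dropped.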

Second, your approach is not a good one (if I understand correctly, you are trying to extract the last word of each line), because you have chosen to remove everything that is not the last word (except the newline) with a replacement. ~52000 steps to extract 10 words from 10 lines of text is not acceptable (and it will crash with more characters). A more efficient way is to find all the last words directly; see this example:

import re

s = '''Ik heb nog nooit een kat gezien zo lélijk!
Het is een minder lelijk dan uw hond.'''

p = re.compile(r'^.*\b(?<!-)(\w+(?:-\w+)*)', re.M | re.U) 

words = p.findall(s)

print('\n'.join(words))

Notes:

To obtain the same result with Python 2.7 you only need to add a u before the quotes of the string: s = u'''...
If you absolutely want to limit the results to letters, avoiding digits and underscores, replace \w with [^\W\d_] in the pattern.
If you use the regex module, the character class \p{IsLatin} may be more appropriate for your use. Whichever module you choose, you can also use a more explicit class with only the needed characters, something like: [A-Za-záéóú...
You can achieve the same with the regex module using this pattern:
p = regex.compile(r'^.*\m(?<!-)(\pL+(?:-\pL+)*)', regex.M | regex.U)
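
To illustrate the note about limiting results to letters, here is a sketch that puts the \w pattern next to a variant with [^\W\d_] substituted in. The names any_word and letters_only and the sample text are made up for this illustration; only the standard re module is used.

```python
import re

# The \w-based pattern, and the same pattern with \w replaced by [^\W\d_].
any_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)", re.M | re.U)
letters_only = re.compile(r"^.*\b(?<!-)([^\W\d_]+(?:-[^\W\d_]+)*)", re.M | re.U)

# Two lines: the first ends in a number, the second in "lélijk!".
s = "de bus nummer 42\nzo l\u00e9lijk!"

print(any_word.findall(s))      # ['42', 'lélijk']
print(letters_only.findall(s))  # ['nummer', 'lélijk']
```

With \w the number counts as the last word of the first line; with [^\W\d_] the match falls back to the last run of letters.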

Other ways:

By line with the re module:

p = re.compile(r'[^\w-]+', re.U)
for line in s.split('\n'):
    print(p.split(line+' ')[-2])
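
A short sketch of why the line is padded with a space and why index -2 is used. The helper name last_word_by_split is hypothetical. Padding guarantees that the line ends in a separator, so the split always ends with an empty string and the last word sits just before it.

```python
import re

p = re.compile(r"[^\w-]+", re.U)

def last_word_by_split(line):
    # The trailing space forces a final separator, so the last element
    # of the split is always '' and the last word is at index -2.
    return p.split(line + " ")[-2]

print(p.split("uw hond" + " "))         # ['uw', 'hond', '']
print(p.split("uw hond"))               # ['uw', 'hond']
print(last_word_by_split("uw hond"))    # hond
print(last_word_by_split("zo lelijk!")) # lelijk
```

Without the padding, a line that ends in a word would return the second-to-last word instead.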

With the regex module you can take advantage of the reversed search:

p = regex.compile(r'(?r)\w+(?:-\w+)*\M', regex.U)
for line in s.split('\n'):
    print(p.search(line).group(0))

I think you misunderstood. I don't have a single file consisting of sentences separated by newlines. In fact, I apply the regex to a single sentence at a time. In other words, I have already split the text into lines, and then I apply the regex line by line.
@BramVanroy: in this case you can use search instead of findall, without the M modifier (which is then useless), or one of the two other ways, replacing s.split('\n') with your list of lines. ~5700 steps for the first sentence of your example is still too high.
Okay, thanks. I'm going to try this. Could you explain the regex in the very last example? As I said, I'm not familiar with Python or its regex methods. Especially (?r), \M and regex.U would be useful to know. About the second-to-last example: isn't split rather slow? I always thought it would be slower because you first have to find all the individual matches and then split into an array.
@BramVanroy: (?r) is a feature specific to the regex module. It makes the search start from the end of the string. \M is specific to the regex module too; it is a word boundary that matches only at the end of a word (\m matches at the beginning). regex.U and re.U are the UNICODE flag, which extends character classes like \w, \s, \d to Unicode characters. About the number of steps, see the first pattern: regex101.com/r/tU2dO4/1
Oh, I didn't know you could see the number of steps on regex101.com! That's handy! So now I understand everything that's going on, but when trying this in my code I get the following error: AttributeError: 'NoneType' object has no attribute 'group'. I googled a bit and it probably has to do with the lines containing non-Latin characters. This answer tells me to use the unicode flag. But I thought we were already using that?
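
A hedged sketch for the one-sentence-at-a-time case discussed in these comments, using search without the M flag. search returns None when nothing matches, and calling .group on None raises exactly that AttributeError, so one possible cause is a line that contains no word at all. The helper name extract_last_word is hypothetical, and the pattern is the \w-based one from the answer, used here with the standard re module.

```python
import re

last_word = re.compile(r"^.*\b(?<!-)(\w+(?:-\w+)*)", re.U)

def extract_last_word(sentence):
    """Return the last word of one sentence, or None if it has no word."""
    match = last_word.search(sentence)
    # Guard against "no match" instead of calling .group on None.
    return match.group(1) if match is not None else None

print(extract_last_word("valt nog zoveel zal kunnen zeggen, "))  # zeggen
print(extract_last_word("zo kat-achtig!"))                       # kat-achtig
print(extract_last_word("..."))                                  # None
```

Returning None leaves it to the caller to decide what to do with lines that contain no word, such as empty lines or lines of punctuation only.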

Community · Accepted Answer · 2017-05-23 11:53:27Z

-1

This post explains how to use unicode properties in python:

Python regex matching Unicode properties

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

answered Aug 16, 2015 at 19:49 by melwil
edited May 23, 2017 at 11:53 by Community Bot

2 Comments

jfs Over a year ago

regex module should be used instead.

tchrist Over a year ago

@J.F.Sebastian Yes, you always want to use Matt’s regex module in doing any Unicode regular expressions in Python for any number of reasons, one of those being that he closely follows the published standard for these matters, Unicode Technical Standard #18 Unicode Regular Expressions. See my comments on the other answer about just how tricky things can be when you don’t even have Level 1 Conformance: Basic Unicode Support, because without the actual properties you simply cannot get at what you need.
