python, regex, unicode and brokenness

(This post originally included a complaint about the handling of unicode codepoints above 0xffff in python, complete with a literal such character. That character broke WordPress, which ate the remainder of the post after it… and I am too lazy to retype it, so for now, no unicode.)

I love python, I really do, but some things are … slightly irregular.

One of those things is the handling of unmatched regular expression groups when replacing. When matching, python returns None for such a group, which is fine. But when replacing, an unmatched group raises an error rather than simply inserting the empty string. For example:

>>> import re
>>> re.sub('(ab)|(a)', r'\1\2', 'abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/", line 275, in filter
    return sre_parse.expand_template(template, match)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group
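
For contrast, here is a small sketch of the matching side, using the same pattern and input. The variable names are only for illustration; the point is that the group that took no part in the match simply comes back as None instead of raising:

```python
import re

# Same pattern and input as in the failing re.sub call.
match = re.match('(ab)|(a)', 'abc')

# The first alternative matched, so group 1 holds the text ...
first = match.group(1)
# ... and group 2, which did not take part in the match, is None.
second = match.group(2)

print(repr(first), repr(second))
```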

There are plenty of people with this problem on the interwebs, and even a python bug report. Most “solutions” involve rewriting your expression so that the unmatched group matches the empty string. Unfortunately, my input expression comes from the SPARQL 1.1 compliance tests, and as much as I’d like to, I’m not really free to change it. So, it gets ugly:

And it works, at least in my python 2.7.3 …
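
One way to get this behaviour without touching the pattern, sketched here as an illustration and not necessarily the code this post originally used, is to pass re.sub a function instead of a template string and expand the backreferences by hand, treating unmatched groups as empty. The helper name `sub_unmatched_empty` is made up for this example, and it only understands numeric backreferences like \1 (no \g<name>, no escaped backslashes):

```python
import re

def sub_unmatched_empty(pattern, template, string, count=0, flags=0):
    """Like re.sub, but unmatched groups in the template expand to ''.

    Hypothetical helper: only handles numeric backreferences (\\1 .. \\99).
    """
    def expand(match):
        def group_or_empty(ref):
            # ref matches one backreference in the template, e.g. \2
            value = match.group(int(ref.group(1)))
            return value if value is not None else ''
        # A function replacement is inserted literally, so the expanded
        # text is not re-interpreted as a template.
        return re.sub(r'\\(\d{1,2})', group_or_empty, template)

    return re.sub(pattern, expand, string, count=count, flags=flags)

print(sub_unmatched_empty('(ab)|(a)', r'\1\2', 'abc'))
print(sub_unmatched_empty('(ab)|(a)', r'[\1|\2]', 'abca'))
```

With the pattern from above, the first call gives back 'abc' and the second '[ab|]c[|a]'. Note that since python 3.5, re.sub itself replaces unmatched groups with the empty string, so something like this is only needed on older versions such as 2.7.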