Saying Goodbye to boost::regex.

Yesterday I finally decided to say goodbye to boost::regex (a.k.a. tr1::regex) and use PCRE instead in Untropy. The reason is that boost::regex doesn’t support UTF-8 properly. Yes, there is boost::u32regex together with a few related types and functions, but as the name says, they work with UTF-32, not UTF-8.

This wouldn’t be a real problem if ICU’s UnicodeString were a proper C++ string type, like Glib::ustring for example. Unfortunately it isn’t: it lacks the usual typedefs and iterators, it doesn’t even define ostream operators, and it seems to have problems with -D_GLIBCXX_DEBUG. In practice this meant that working with boost::u32regex introduced quite a few explicit string conversions, which made the API rather inconvenient. And no, I’m not going to use UnicodeString as my default string type, because that would make almost every other string-related API harder to use.

As I strongly prefer Perl regular expressions over other flavours, PCRE was the only reasonable choice. After all, this library is widely used, feature-rich, and deals with UTF-8 properly. It also has two C++ wrappers, pcrecpp and PCRE++. Unfortunately, both are unusable:

  • pcrecpp exposes one of the most crippled C++ APIs I’ve ever seen and is poorly documented, even though it ships with PCRE and was contributed by Google.
  • PCRE++ looks much saner, but fails to properly separate compiled patterns from match results, although the underlying C API does exactly that.

Being accustomed to the very well designed java.util.regex API, I was left with no other choice than to write a C++ wrapper myself. Luckily the PCRE C API isn’t bad at all and is well documented (although you should really spend at least an hour reading the man page carefully while experimenting with small examples, as the API is definitely neither self-explanatory nor foolproof). In the end it took me about 8 hours to design, implement and test this C++ API, which is roughly modelled after java.util.regex, and to migrate existing code to it. I’m seriously considering putting an extended and reworked version of this API into a standalone library as soon as I have time.
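To illustrate what separating compiled patterns from match results means in a java.util.regex-style design, here is a minimal sketch. It is not the code from Untropy: the names Pattern, Matcher, find, group, start and end are hypothetical and simply borrowed from the Java API, and std::regex is used purely as a stand-in backend so that the example compiles without PCRE. A real version would hold a compiled PCRE pattern instead, and this stand-in is byte-oriented, so it does not address UTF-8 at all.

```cpp
#include <cstddef>
#include <memory>
#include <regex>
#include <string>
#include <utility>
#include <vector>

class Matcher;

// Immutable compiled pattern: compile once, share between many matchers.
// (Hypothetical class; std::regex stands in for a compiled PCRE pattern.)
class Pattern {
public:
    explicit Pattern(const std::string& expr)
        : compiled_(std::make_shared<const std::regex>(expr)) {}

    // Each call returns independent match state for one subject string.
    Matcher matcher(std::string subject) const;

private:
    std::shared_ptr<const std::regex> compiled_;
};

// Mutable match state for one subject: search position and captured groups.
class Matcher {
public:
    // Advances to the next match; returns false once none is left.
    bool find() {
        groups_.clear();
        if (pos_ > subject_.size()) return false;

        std::smatch m;
        // After the first match, tell the engine that text precedes pos_.
        const auto flags = pos_ > 0 ? std::regex_constants::match_prev_avail
                                    : std::regex_constants::match_default;
        const auto from = subject_.cbegin() + static_cast<std::ptrdiff_t>(pos_);
        if (!std::regex_search(from, subject_.cend(), m, *compiled_, flags)) {
            pos_ = subject_.size() + 1;  // mark this matcher as exhausted
            return false;
        }
        start_ = pos_ + static_cast<std::size_t>(m.position(0));
        end_ = start_ + static_cast<std::size_t>(m.length(0));
        for (std::size_t i = 0; i < m.size(); ++i) groups_.push_back(m.str(i));
        // Step past an empty match so that find() always makes progress.
        pos_ = (end_ > start_) ? end_ : end_ + 1;
        return true;
    }

    // Group 0 is the whole match; throws std::out_of_range without a match.
    const std::string& group(std::size_t n) const { return groups_.at(n); }
    std::size_t start() const { return start_; }
    std::size_t end() const { return end_; }

private:
    friend class Pattern;
    Matcher(std::shared_ptr<const std::regex> compiled, std::string subject)
        : compiled_(std::move(compiled)), subject_(std::move(subject)) {}

    std::shared_ptr<const std::regex> compiled_;
    std::string subject_;
    std::size_t pos_ = 0;
    std::size_t start_ = 0;
    std::size_t end_ = 0;
    std::vector<std::string> groups_;
};

inline Matcher Pattern::matcher(std::string subject) const {
    return Matcher(compiled_, std::move(subject));
}
```

With this split, one Pattern can serve several subjects at once: each call to matcher() yields its own Matcher, and calling find() on one of them never disturbs the results held by another.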

  1. aaaa
    May 25, 2010 at 21:40

    typedef u32regex_iterator<const char*> utf8regex_iterator;
    typedef u32regex_iterator<const UChar*> utf16regex_iterator;
    typedef u32regex_iterator<const UChar32*> utf32regex_iterator;

    UTF-32 is practically UTF-8 in a different format, and there are lossless converters between the two. The main difference is that UTF-32 always needs 4 bytes per code point, while UTF-8 stores a code point in 1, 2, 3 or 4 bytes.

    See here:

    A: UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-32 is used by various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.

    • May 28, 2010 at 00:15

      The reason I stopped using boost::regex is not that I don’t know the difference between UTF-8/16/32, nor that I don’t know how to convert from one encoding to another. I switched to a home-grown PCRE wrapper because I prefer an API that integrates nicely into the rest of the program over doing explicit string conversions every time I use a regular expression. I could of course have implemented http://untropy.svn.sourceforge.net/viewvc/untropy/trunk/src/regex.hh?view=markup on top of boost::regex instead, but PCRE seems to be superior for other reasons too.

  2. aaaa
    May 25, 2010 at 21:45

    By the way, CBuilder 2009/2010 has a “UnicodeString” type. Too bad you use MS 🙂

    • May 27, 2010 at 23:53

      I’m using gcc-4.4 on top of Gentoo Linux ;-).

  3. Andre Caron
    February 27, 2012 at 01:00

    I ended up writing my own C++ wrappers for the same reasons you did. I went through the process of making it a standalone library. Check out the PCREXX project on GitHub at: https://github.com/AndreLouisCaron/pcrexx
