Saying Goodbye to boost::regex.

April 5, 2010

Yesterday I finally decided to say goodbye to boost::regex (a.k.a. tr1::regex) and use PCRE in Untropy instead. The reason is that boost::regex doesn’t support UTF-8 properly. Yes, there is boost::u32regex together with a few related types and functions, but as the name says, they work with UTF-32, not UTF-8. This wouldn’t be a real problem if ICU’s UnicodeString were a proper C++ string type, like Glib::ustring for example. Unfortunately it isn’t: it lacks the usual typedefs and iterators, doesn’t even define ostream operators, and seems to have problems with -D_GLIBCXX_DEBUG. In practice, working with boost::u32regex introduced quite a few explicit string conversions, which made the API rather inconvenient. And no, I’m not going to use UnicodeString as my default string type, because that would make almost every other string-related API harder to use.

As I strongly prefer Perl regular expressions over other flavours, PCRE was the only reasonable choice. After all, the library is widely used, feature-rich, and deals with UTF-8 properly. It also has two C++ wrappers, pcrecpp and PCRE++. Unfortunately they are both unusable:

  • pcrecpp exposes one of the most crippled C++ APIs I’ve ever seen and is poorly documented, although it is shipped with PCRE and has been provided by Google.
  • PCRE++ looks much saner, but fails to properly separate compiled patterns from match results, although the underlying C API does exactly that.
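
To make the UTF-8 complaint from the top of the post concrete, here is a minimal sketch of what a byte-oriented regex engine does with a multi-byte character. It uses std::regex on std::string as a stand-in for boost::regex (an assumption, made so the example needs nothing beyond the standard library), and full_match is a hypothetical helper, not part of any library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if the whole input matches the
// pattern. Both arguments are treated as plain char sequences; no
// UTF-8 decoding takes place anywhere in this function.
bool full_match(const std::string& input, const std::string& pattern) {
    return std::regex_match(input, std::regex(pattern));
}
```

With this helper, the UTF-8 encoding of “ä” (the two bytes "\xC3\xA4") should fail to match the pattern "." but match "..", because the engine counts bytes rather than characters. PCRE compiled with its UTF-8 option treats the same input as a single character.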

Being accustomed to the very well designed java.util.regex API, I was left with no choice but to write a C++ wrapper myself. Luckily the PCRE C API isn’t bad at all and is well documented (although you should really spend at least an hour reading the man page carefully while experimenting with small examples, as the API is neither self-explanatory nor foolproof). In the end it took me about 8 hours to design, implement and test this C++ API, which is roughly modelled after java.util.regex, and to migrate the existing code to it. I’m seriously considering putting an extended and reworked version of this API into a standalone library as soon as I have time.
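
The split between compiled patterns and match results that java.util.regex makes could be sketched roughly as follows. This is not the wrapper described in this post: the class and member names are hypothetical, and std::regex stands in for the PCRE handle so that the sketch compiles without external libraries. A PCRE-backed Pattern would own the compiled pattern instead.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// All names here are hypothetical; this is not the author's actual API.

// Immutable compiled pattern that can be shared between searches.
// std::regex is only a stand-in for a compiled PCRE pattern.
class Pattern {
public:
    explicit Pattern(const std::string& expr) : re_(expr) {}
    const std::regex& impl() const { return re_; }
private:
    std::regex re_;
};

// Mutable state of one search over one input string.
// The Pattern must outlive every Matcher created from it.
class Matcher {
public:
    Matcher(const Pattern& pattern, std::string input)
        : pattern_(&pattern), input_(std::move(input)), pos_(0) {}

    // Moves to the next match, in the spirit of Matcher.find() in Java.
    bool find() {
        groups_.clear();
        if (pos_ > input_.size()) return false;  // input exhausted
        namespace rc = std::regex_constants;
        // Tell the engine that text precedes the start position, so that
        // anchors such as ^ do not match in the middle of the input.
        const rc::match_flag_type flags =
            pos_ > 0 ? rc::match_prev_avail : rc::match_default;
        std::smatch m;
        if (!std::regex_search(input_.cbegin() + pos_, input_.cend(), m,
                               pattern_->impl(), flags)) {
            pos_ = input_.size() + 1;
            return false;
        }
        for (const auto& sub : m) groups_.push_back(sub.str());
        // Step past the match; advance by one on an empty match so that
        // repeated calls always make progress.
        const std::size_t len = static_cast<std::size_t>(m.length(0));
        pos_ += static_cast<std::size_t>(m.position(0)) + (len > 0 ? len : 1);
        return true;
    }

    // Group 0 is the whole match; throws std::out_of_range if there is
    // no current match or no such group.
    const std::string& group(std::size_t i = 0) const { return groups_.at(i); }

private:
    const Pattern* pattern_;
    std::string input_;
    std::size_t pos_;                  // offset where the next search starts
    std::vector<std::string> groups_;  // copies of the last match's groups
};
```

Usage would follow the Java idiom: construct a Pattern once, create one Matcher per input string, and call find() in a loop, reading group(1), group(2) and so on after each hit. Because the Matcher stores copies of the groups rather than iterators, the Pattern itself stays free of per-search state and can be reused for any number of inputs.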