Just an addition to both the already excellent, if contradicting, answers.
The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values", which is somewhat vague and yet very precise.
It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables. That function builds the tables in order of char value (tolower(i)/toupper(i)).
In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that puts it outside the a-z range in all the common character encodings used for German (ISO-8859-x, Unicode encodings etc.). In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than any actual locale-defined sort order.
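To make that concrete, here is a minimal sketch of a range test done purely on code values. This is not PCRE's code; in_code_range and LATIN1_O_UMLAUT are names I made up for illustration, and the byte value assumes the input is ISO-8859-1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: membership in a range such as [a-z], decided purely
   by comparing byte values, with no reference to locale collation. */
static bool in_code_range(unsigned char c, unsigned char lo, unsigned char hi)
{
    assert(lo <= hi);   /* a reversed range is a caller error in this sketch */
    return c >= lo && c <= hi;
}

/* Byte value of 'ö' in ISO-8859-1 (an assumption about the input encoding). */
enum { LATIN1_O_UMLAUT = 0xF6 };
```

With a = 0x61 and z = 0x7A, in_code_range(0x6F, 0x61, 0x7A) accepts o, while in_code_range(LATIN1_O_UMLAUT, 0x61, 0x7A) rejects ö, even though a German dictionary would file the two together.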
PHP's docs mostly copy PCRE's documentation verbatim. However, they've actually gone to the trouble of changing the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.
I'm not quite sure that's true, however.
Well, not in all cases, at least.
The one call PHP makes to pcre_maketables... From the PHP source:
    #if HAVE_SETLOCALE
        if (strcmp(locale, "C"))
            tables = pcre_maketables();
    #endif
In other words, if the environment for which PHP is compiled has setlocale and the (LC_CTYPE) locale isn't the POSIX/C locale, the tables are built from the runtime environment's current locale, in that locale's character order. Otherwise, the default PCRE tables are used, which are generated (by pcre_maketables) when PCRE is compiled, based on the compiler's locale:
This function builds a set of character tables for character values less than 256. These can be passed to pcre_compile() to override PCRE's internal, built-in tables (which were made by pcre_maketables() when PCRE was compiled). You might want to do this if you are using a non-standard locale. The function yields a pointer to the tables.
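The decision and the table building can be sketched with nothing but the C standard library. This is only an illustration of the idea, not the PHP or PCRE implementation: struct case_tables, build_case_tables and needs_locale_tables are hypothetical names, and the real PCRE tables hold more than case mappings.

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the case-mapping part of a character table:
   one entry per byte value, filled in index order from the current locale. */
struct case_tables {
    unsigned char lower[256];
    unsigned char upper[256];
};

static void build_case_tables(struct case_tables *t)
{
    for (int i = 0; i < 256; i++) {
        t->lower[i] = (unsigned char)tolower(i);
        t->upper[i] = (unsigned char)toupper(i);
    }
}

/* Mirrors the strcmp(locale, "C") test: true when the LC_CTYPE locale is
   anything other than "C", i.e. when fresh tables would be built. */
static bool needs_locale_tables(void)
{
    const char *locale = setlocale(LC_CTYPE, NULL);  /* query only, no change */
    return locale != NULL && strcmp(locale, "C") != 0;
}
```

A C program starts in the "C" locale, so needs_locale_tables() only becomes true after something has called setlocale with another locale. Either way, the tables are indexed by byte value; the locale changes what each entry maps to, not the order of the entries.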
While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.
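A quick sketch of why that happens, assuming EBCDIC code page 037, where the lowercase letters sit in three separate blocks. The function names are mine, and this says nothing about what PCRE actually does on EBCDIC platforms.

```c
#include <stdbool.h>

/* Code points of the lowercase Latin letters in EBCDIC (code page 037
   assumed): a-i = 0x81..0x89, j-r = 0x91..0x99, s-z = 0xA2..0xA9. */
static bool ebcdic_is_lower(unsigned char c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

/* Count byte values that a code-order range from 'a' (0x81) to 'z' (0xA9)
   would accept even though they are not lowercase letters. */
static int ebcdic_extras_in_a_to_z(void)
{
    int extras = 0;
    for (unsigned c = 0x81; c <= 0xA9; c++)
        if (!ebcdic_is_lower((unsigned char)c))
            extras++;
    return extras;
}
```

The span 0x81..0xA9 covers 41 code points but only 26 letters, so a naive code-order [a-z] would let 15 extra values through, among them 0x8F (±) and 0xA1 (~).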
Unless PCRE does some magic when using EBCDIC (and it might), you might, in that case, include other unintended characters, even though it's highly unlikely you'd be including umlauts in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition). And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.
ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern:
Another issue is with character classes ranges. You would think that [a-k] and [x-z] are well defined for latin scripts but that's not the case.
They are certainly well defined, being equivalent to [\x61-\x6b] and [\x78-\x7a], that is, related to code order, not cultural sorting order.