Question

This is a really long-standing issue in my work, and I realize I still don't have a good solution to it...

C naively defined all of its character test functions for an int:

int isspace(int ch);

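But char is often signed, and a full character often doesn't fit in an int, or in any single storage unit that is used for strings**.

And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.

So if you write isspace(*pchar), you can end up with sign-extension problems. They're hard to see, and hence, in my experience, hard to guard against.

Similarly, because isspace() and its ilk all take ints, and because the actual width of a character is often unknown without string analysis, any modern character library should essentially never be passing around chars or wchar_ts, but only pointers/iterators, since only by analyzing the character stream can you know how much of it makes up a single logical character. So I am at a bit of a loss as to how best to approach these issues.

I keep expecting a genuinely robust library built around abstracting away the size of any character and working only with strings (providing such things as isspace, etc.), but either I've missed it, or there's another, simpler solution staring me in the face that all of you (who know what you're doing) use...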

** These issues don't come up for fixed-size character encodings that can wholly contain a full character. UTF-32 apparently is about the only option that has these characteristics (or specialized environments that restrict themselves to ASCII or some such).

So, my question is:

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign expansion, and
2) variable-width character issues

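After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII can have the simple sign-extension problem if the compiler treats char as a signed 8-bit unit.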

Please note:

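No matter what size your char_type is, it's wrong for most character encoding schemes.

This problem is in the standard C library as well as in the C++ standard library, which still try to pass around char and wchar_t, rather than string iterators, in the various isspace, isprint, etc. implementations.

Actually, it's precisely these kinds of functions that break the genericity of std::string. If it only worked in storage units, and didn't try to pretend to understand the meaning of the storage units as logical characters (such as isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions.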

Thank You

Everyone who participated. Between this discussion and WChars, Encodings, Standards and Portability I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.

Answer 1

How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:
1) Sign expansion
2) variable-width character issues
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...

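Obviously, you have to use a Unicode-aware library, since you've demonstrated (correctly) that the C++03 standard library is not one. The C++11 library is improved, but still not good enough for most uses. Yes, some OSes have a 32-bit wchar_t, which lets them handle UTF-32 correctly, but that's an implementation detail, is not guaranteed by C++, and is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters).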

IBM ICU
Libiconv
microUTF-8
UTF-8 CPP, version 1.0
utfproc
and many more at http://unicode.org/resources/libraries.html.

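If the question is less about specific character testing and more about coding practices in general: do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything internally in UTF-8. If you're coding for Windows, keep everything internally in UTF-16. If you need to mess with code points, keep everything internally in UTF-32. Otherwise (for portable, generic code), do whatever you want, since no matter what you choose, you will have to translate for some OS or other anyway.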

Answer 2

I think you are confounding a whole host of unrelated concepts.

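First off, char is simply a data type. Its first and foremost meaning is "the system's basic storage unit", i.e. "one byte". Its signedness is intentionally left up to the implementation so that each implementation can pick the most appropriate (i.e. hardware-supported) version. Its name, suggesting "character", is quite possibly the single worst decision in the design of the C programming language.

The next concept is that of a text string. At the foundation, text is a sequence of units, which are often called "characters", but it can be more involved than that. To that end, the Unicode standard coins the term "code point" to designate the most basic unit of text. For now, and for us programmers, "text" is a sequence of code points.

The problem is that there are more code points than possible byte values. This problem can be overcome in two different ways: 1) use a multi-byte encoding to represent code point sequences as byte sequences; or 2) use a different basic data type. C and C++ actually offer both solutions: the native host interfaces (command-line arguments, file contents, environment variables) are provided as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", as well as translation functions between the two (mbstowcs/wcstombs).

Unfortunately, nothing specific is said about "the system's character set" and "the system's multibyte encoding", so you, like so many SO users before you, are left puzzling over what to do with those mysterious wide characters. What people want nowadays is a definite encoding that they can share across platforms. The one and only useful encoding we have for this purpose is Unicode, which assigns a textual meaning to a large number of code points (up to 2^21 at the moment). Along with the text encoding comes a family of byte-string encodings: UTF-8, UTF-16 and UTF-32.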

The first step to examining the content of a given text string is thus to transform it from whatever input you have into a string of definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just a sequence of raw code points (typically UTF-32, since we don't have a useful 21-bit data type).

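Performing this transformation is already outside the scope of the C++ standard (even the new one), so we need a library to do it. Since we don't know anything about our "system's character set", we also need the library to handle that.

One popular library of choice is iconv(); the typical sequence goes from the multibyte input, via mbstowcs(), to a wchar_t* wide string, and then, via iconv()'s WCHAR_T-to-UTF32 conversion, to a sequence of raw Unicode code points.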
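The first hop of that sequence (multibyte input to a wide string) can be sketched with the standard mbstowcs() alone. This is only a minimal sketch: the helper name widen is hypothetical, the result depends entirely on the current C locale (set with setlocale), and the second hop through iconv is deliberately left out.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a string in the current locale's multibyte
// encoding to a wide string. The meaning of the resulting wchar_t values is
// still system-specific; a later step would have to map them to Unicode.
inline std::wstring widen(const std::string& mb)
{
    if (mb.empty())
        return std::wstring();

    // A multibyte string of k bytes never yields more than k wide characters.
    std::wstring wide(mb.size(), L'\0');
    std::size_t n = std::mbstowcs(&wide[0], mb.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");

    wide.resize(n);  // keep only the characters actually written
    return wide;
}
```

With the default "C" locale this only does something useful for plain ASCII input; a real program would typically call setlocale(LC_ALL, "") first.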



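At this point our journey ends. We can now either examine the text code point by code point (which might be enough to tell whether something is a space), or we can invoke a heavier text-processing library to perform intricate textual operations on our Unicode code point stream (such as normalization, canonicalization, presentational transformation, etc.). This is far beyond the scope of a general-purpose programmer, and is the realm of text-processing specialists.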

Answer 3


It is in any case invalid to pass a negative value other than EOF to isspace and the other character macros. If you have a char c, and you want to test whether it is a space or not, do isspace((unsigned char)c). This deals with the extension (by zero-extending). isspace(*pchar) is flat wrong -- don't write it, don't let it stand when you see it. If you train yourself to panic when you do see it, then it's less hard to see.

fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there is no sign-extension issue for values that come from it.
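One way to make the cast hard to forget is to hide it in a small wrapper and ban direct calls on char in code review. The following is a minimal sketch; the name is_space_byte is made up for illustration and is not part of any library.

```cpp
#include <cctype>

// Hypothetical wrapper: classifies a single byte using the current C locale.
// The cast to unsigned char keeps the argument in the range that
// std::isspace accepts, whether or not plain char is signed.
inline bool is_space_byte(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

Note that this only addresses sign extension. A byte such as 0xE9 is classified on its own, so the wrapper says nothing useful about multi-byte characters.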



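The variable-width issue is almost beside the point, since the standard character macros don't cover Unicode or multi-byte encodings at all. If you want to handle Unicode properly, then you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, other than that C++11 has std::u32string, which sounds promising. Until then, the answer is to use something implementation-specific or third-party. (Un)fortunately, there are many libraries to choose from.

It may be (I speculate) that a "complete" Unicode classification database is so large, and so subject to change, that it would be impractical for the C++ standard to mandate "full" support anyway. It depends to some extent on which operations should be supported, but you can't get away from the problem that Unicode has been through 6 major versions in 20 years (since the first standard version), while C++ has had 2 major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly moving target, so it is always going to be implementation-defined which code points the system knows about.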

In general, there are three correct ways to handle Unicode text:


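1) At all I/O (including system calls that return or accept strings), convert everything between an externally used character encoding and an internal fixed-width encoding. You can think of this as "deserialization" on input and "serialization" on output. If you had some object type with functions to convert it to and from a byte stream, you wouldn't mix up the byte stream with the objects, or examine sections of the byte stream for snippets of serialized data that you think you recognize. It needn't be any different for this internal Unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on the implementation. Just pretend the standard library doesn't provide strings, if that helps, or use a std::basic_string of something big as the container and a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and the like, since even in a fixed-width Unicode encoding there may be more than one code point per glyph.
2) Mess about with some ad hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It's like (1), but usually harder, so although it is potentially correct, in practice it can just as easily come out wrong.
3) (Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation marks and concatenate strings for output. Basically, it works for programs that don't need to understand anything with the top bit set, and just pass it on unchanged. It doesn't work so well if you need to actually render text, or otherwise do things to it that a human would consider "obvious" but that are actually complex. Like collation.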
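As an illustration of approach (1), the "deserialization" step for UTF-8 input might look like the sketch below. It is a minimal decoder written for this example, not a library API: the name decode_utf8 is hypothetical, and it checks only lead bytes, truncation and continuation bytes. It does not reject overlong forms, surrogates or values above U+10FFFF, and it performs no normalization, so a real program should prefer one of the Unicode libraries.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns a UTF-8 byte string into a sequence of code
// points (one char32_t per code point), throwing on malformed input.
inline std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Even after decoding, one element is one code point, not one user-visible character: the three bytes "m\xCC\x82" decode to two code points (U+006D followed by the combining mark U+0302).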

Answer 4

One comment up front: the old C functions like isspace took int for a reason: they support EOF as input as well, so they need to be able to support one more value than will fit in a char. The “naïve” decision was allowing char to be signed—but making it unsigned would have had severe performance implications on a PDP-11.

Now for your questions:

1) Sign extension

The C++ functions don't have this problem. In C++, the "correct" way of testing things like whether a character is a space is to grab the std::ctype facet from whatever locale you want, and to use it. Of course, the C++ localization, in <locale>, has been carefully designed to make it as hard as possible to use, but if you're doing any significant text processing, you'll soon come up with your own convenience wrappers: a functional object which takes a locale and a mask specifying which characteristic you want to test isn't hard. Making it a template on the mask, and giving its locale argument a default to the global locale, isn't rocket science either. Throw in a few typedefs, and you can pass things like IsSpace() to std::find_if. The only subtlety is managing the lifetime of the std::ctype object you're dealing with. Something like the following should work, however:

#include <locale>

template<std::ctype_base::mask mask>
class Is  //  Must find a better name.
{
    std::locale myLocale;
            //< Needed to ensure no premature destruction of facet
    std::ctype<char> const* myCType;
public:
    Is( std::locale const& l = std::locale() )
        : myLocale( l )
            //  use_facet returns a reference, so store its address.
        , myCType( &std::use_facet<std::ctype<char> >( l ) )
    {
    }
    bool operator()( char ch ) const
    {
        return myCType->is( mask, ch );
    }
};

typedef Is<std::ctype_base::space> IsSpace;
//  ...

(Given the influence of the STL, it's somewhat surprising that the standard didn't define something like the above as standard.)
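For illustration, here is how such a predicate might be used with a standard algorithm. This is a self-contained sketch: it repeats the class so that it compiles on its own, and the helper first_space is a hypothetical name introduced only for this example.

```cpp
#include <algorithm>
#include <locale>
#include <string>

template <std::ctype_base::mask mask>
class Is
{
    std::locale myLocale;              // keeps the facet alive
    std::ctype<char> const* myCType;   // points into myLocale
public:
    explicit Is(std::locale const& l = std::locale())
        : myLocale(l)
        , myCType(&std::use_facet<std::ctype<char> >(myLocale))
    {
    }
    bool operator()(char ch) const
    {
        return myCType->is(mask, ch);
    }
};

typedef Is<std::ctype_base::space> IsSpace;

// Hypothetical helper: index of the first whitespace char, or npos if none.
inline std::string::size_type first_space(std::string const& s)
{
    std::string::const_iterator it = std::find_if(s.begin(), s.end(), IsSpace());
    return it == s.end()
        ? std::string::npos
        : static_cast<std::string::size_type>(it - s.begin());
}
```

The predicate still classifies one char at a time, so it solves the sign-extension problem but not the variable-width one.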

2) Variable-width character issues

There is no real answer. It all depends on what you need. For some applications, just looking for a few specific single-byte characters is sufficient, and keeping everything in UTF-8, and ignoring the multi-byte issues, is a viable (and simple) solution. Beyond that, it's often useful to convert to UTF-32 (or, depending on the type of text you're dealing with, UTF-16), and use each element as a single code point. For full text handling, on the other hand, you have to deal with multi-code-point characters even if you're using UTF-32: the sequence \u006D\u0302 is a single character (a small m with a circumflex over it).

Answer 5

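I haven't tested the internationalization capabilities of the Qt library, but from what I know, QString is fully Unicode-aware and uses QChars, which are Unicode characters. I don't know how these are implemented internally, but I would expect this to mean that a QChar is not a fixed-size character.

It would be odd, though, to tie yourself to a framework as large as Qt just to use strings.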

Answer 6


You seem to be confusing a function defined on 7-bit ASCII with a universal space-recognition function. Character functions in standard C use int not to deal with different encodings, but to allow EOF to be an out-of-band indicator. There are no issues with sign extension, because the numbers these functions are defined on have no 8th bit. Providing a byte with this possibility is a mistake on your part.

Plan 9 attempts to solve this with a UTF library and the assumption that all input data is UTF-8. This allows some measure of backward compatibility with ASCII, so non-compliant programs don't all die, while allowing new programs to be written correctly.

The common notion in C, even now, is that a char* represents an array of letters. It should instead be seen as a block of input data. To get the letters from this stream, you use chartorune(). Each Rune is a representation of a letter (/symbol/code point), so one can finally define a function isspacerune(), which tells you which letters are spaces.

You work with arrays of Rune as you would with char arrays to do string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write them out.

Answer 7

The sign extension issue is easy to deal with. You can either use:

isspace((unsigned char) ch)
isspace(ch & 0xFF)
the compiler option that makes char an unsigned type

As for the variable-length character issue (I'm assuming UTF-8), it depends on your needs.

If you just need to deal with the ASCII whitespace characters ' \t\n\v\f\r', then isspace will work fine; the non-ASCII UTF-8 code units will simply be treated as non-spaces.

But if you need to recognize the extra Unicode space characters \x85\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000, it's a bit more work. You could write a function along the lines of

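// decode_char and is_unicode_space are placeholders for routines you supply.
bool isspace_utf8(const char* pChar)
{
    // Pass the pointer, not *pChar: one code point may span several bytes.
    uint32_t codePoint = decode_char(pChar);
    return is_unicode_space(codePoint);
}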
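The is_unicode_space placeholder could be filled in along the following lines. This is a sketch under a stated assumption: it uses exactly the ASCII whitespace characters plus the extra code points listed in this answer, which may not match the White_Space property of whatever Unicode version you target, so check the list against the Unicode data files before relying on it.

```cpp
// Sketch of the is_unicode_space placeholder. The set of code points is the
// one listed in this answer (an assumption, not an authoritative table).
inline bool is_unicode_space(char32_t cp)
{
    switch (cp) {
        // ASCII whitespace: \t \n \v \f \r and space
        case 0x09: case 0x0A: case 0x0B: case 0x0C: case 0x0D: case 0x20:
        // Extra code points from the list above
        case 0x85: case 0xA0: case 0x1680: case 0x180E:
        case 0x2028: case 0x2029: case 0x202F: case 0x205F: case 0x3000:
            return true;
        default:
            // U+2000 through U+200A form one contiguous run in that list.
            return cp >= 0x2000 && cp <= 0x200A;
    }
}
```

The function works on decoded code points, so it has neither the sign-extension problem nor the variable-width problem; those are confined to the decoding step.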

most strings in practice use a multibyte encoding such as UTF-7, UTF-8, UTF-16, SHIFT-JIS, etc.

No programmer should use UTF-7 or Shift-JIS as an internal representation unless they enjoy pain. Stick with UTF-8, UTF-16, or UTF-32, and convert only as needed.

Answer 8

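Your preamble argument is somewhat inaccurate, and arguably unfair: it is simply not in the library's design to support Unicode encodings, and certainly not multiple Unicode encodings.

The development of the C and C++ languages, and of much of their libraries, predates the development of Unicode. Also, as systems-level languages, they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately, perhaps, the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. History has perhaps shown this to be flawed, but changing the language definition, and indeed the library, would break a large amount of legacy code, so such things are left to newer languages such as C#, which has an 8-bit byte type and a distinct char type.

Moreover, the variable-width encoding of Unicode representations makes it unsuited to a built-in data type as such. You are obviously aware of this, since you suggest that Unicode character operations should be performed on strings rather than on machine word types. This would require library support, and as you point out, the standard library does not provide it. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics. The library intrinsically does not address anything that is not universally supported by all target platforms, from the deeply embedded to the supercomputer. All such things must be provided by either system or third-party libraries.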

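Support for multiple character encodings is about system/environment interoperability, and the library is not intended to support that either. Data exchange between incompatible encoding systems is an application issue, not a system issue.

"How do you test for whitespace, isprintable, etc., in a way that doesn't suffer from two issues:

1) Sign expansion, and

2) variable-width character issues

isspace() considers only the lower 8 bits. Its definition explicitly states that if you pass an argument that is not representable as an unsigned char and is not equal to the value of the macro EOF, the result is undefined. The problem does not arise if it is used as intended. The problem is that it is inappropriate for the purpose you appear to be applying it to.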

After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS

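isspace() is not defined for Unicode. You will need a library designed to work with whatever specific encoding you are using. This question may be relevant: What is the best Unicode library for C? (http://stackoverflow.com/questions/114611/what-is-the-best-unicode-library-for-c)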
