Questions while updating some scanner code to use ICU
  • Date: 2011-05-29 03:57:38
  • Tags: c, utf-8, icu

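I'm working on a basic hand-coded lexical scanner and want to support UTF-8 input (it's not 1970 anymore!). Input characters are read from stdin or a file, one at a time, and pushed into a buffer until whitespace is seen, and so on. I considered writing my own wrapper for fgetc() that would return a char[] of the bytes making up a UTF-8 character and then process the result as a string... That would be easy enough, but it would become a slippery slope. I'd rather not waste time reinventing the wheel and would instead use an existing, tested library such as ICU. So I now have code without UTF-8 support that works with fgetc(), isspace(), strcmp(), etc., and I'm trying to update it to use ICU. This is my first foray into ICU. I've been reading the documentation and looking for usage examples through Google Code Search, but a few points of confusion remain that I hope someone can clarify.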

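The u_fgetc() function returns a UChar and u_fgetcx() returns a UChar32... The documentation recommends u_fgetcx() for reading code points, so that's where I'm starting. I've kept the same approach as above, but I push UChar32 values into the buffer instead of chars.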

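  • What is the proper way to compare a character against a known value? Originally I could write if (c == '+') to check whether a plus sign had been pulled from the input. GCC doesn't complain when c is a UChar32 (it's a comparison between a UChar32 and a char), but is this really proper?

  • I was able to use strcmp() to compare the buffered characters against a known value, e.g. if (strcmp(buf, "else") == 0). ICU provides u_strcmp(), and I think I may need the U_STRING_DECL and U_STRING_INIT macros to specify the known literal, but I'm not sure. The documentation shows that they produce a UChar[], whereas I think I need a UChar32[].

  • After reading a sequence of numeric characters, I've been converting them with strtol() so that I can use them. Since I'm now converting a UChar32[], does ICU provide a similar function?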

Best answer

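UChar is meant to hold a code unit, whereas UChar32 is meant to hold a code point. If your input stays within the Basic Multilingual Plane (BMP), UChar suffices, and in fact most ICU functions operate on UChar[].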
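To make the code unit versus code point distinction concrete, here is a minimal sketch in plain C. It deliberately uses `<stdint.h>` stand-ins (`CodeUnit`, `CodePoint`) rather than ICU's own types so that it builds without ICU; the type and function names are hypothetical. ICU ships its own macros for this job (for example `U16_LENGTH` and `U16_APPEND` in `unicode/utf16.h`), which would be the natural choice in real code.

```c
#include <stdint.h>

/* Stand-ins for ICU's UChar (one UTF-16 code unit) and UChar32 (one code point). */
typedef uint16_t CodeUnit;
typedef int32_t  CodePoint;

/* Number of UTF-16 code units needed for one code point:
 * 1 inside the BMP, 2 (a surrogate pair) above it. */
int utf16_length(CodePoint c)
{
    return (c <= 0xFFFF) ? 1 : 2;
}

/* Encode c into out[] and return the number of code units written.
 * Assumes c is a valid scalar value (0..0x10FFFF, not itself a surrogate). */
int utf16_encode(CodePoint c, CodeUnit out[2])
{
    if (c <= 0xFFFF) {
        out[0] = (CodeUnit)c;                    /* BMP: one unit, same value */
        return 1;
    }
    c -= 0x10000;
    out[0] = (CodeUnit)(0xD800 + (c >> 10));     /* lead surrogate  */
    out[1] = (CodeUnit)(0xDC00 + (c & 0x3FF));   /* trail surrogate */
    return 2;
}
```

A scanner whose input never leaves the BMP only ever sees the one-unit case, which is why a UChar buffer is enough there.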

Reading the ICU User Guide, which explains most of the internals and best practices, is strongly recommended.

  • What is the proper way to compare a Unicode character variable against a known value? A character (whether char, UChar, or UChar32) is just another integer type with a certain width and signedness, and can be compared with other integer types subject to the usual caveats and restrictions. To define a character value, C99 (section 6.4.3) provides universal character names: \u followed by four hex digits, or \U followed by eight hex digits, specifying the ISO/IEC 10646 "short identifier". Values below 0x00A0 cannot be written this way, with the exceptions of 0x0024 ($), 0x0040 (@), and 0x0060 (backtick), but they can be represented by casting a simple character constant to UChar. The range 0xD800 through 0xDFFF, which UTF-16 uses for surrogates, is excluded as well.

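  • How do you define a Unicode string literal? U_STRING_DECL and U_STRING_INIT are indeed what you want. (As mentioned above, ICU mostly operates on UChar[].) If you use C++ instead of C, UNICODE_STRING_SIMPLE (optionally followed by getTerminatedBuffer() to yield a UChar[] again) offers a more comfortable way to define Unicode string literals.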

  • How do you convert a Unicode string representing a number to that number's value? unum_parse() and its siblings in unum.h will help you there.

Other answers
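  1. The Unicode value of PLUS SIGN is U+002B, and the normal (Latin-1) value of '+' is also 0x2B (octal 053, decimal 43). What you wrote is safe enough if the codeset is based on ASCII or ISO-8859-x. The C99 standard provides universal character names of the form \u0123 and \U00102345 (with 4 and 8 hex digits), but stipulates that you cannot specify values smaller than \u00A0 (such as \u002B). So I think what you have written is correct.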

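    However, you can save yourself future angst by using an enum such as

     enum { PLUS_SIGN = '+' };

    defined in a suitable header and used wherever you need a literal plus sign. That way, if your assumptions (and mine) are wrong, you have exactly one place to edit: the header.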

    I note that ICU's documentation on strings indicates that it is unusual to use UTF-32 in applications.

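  2. In pure C, you would probably use wcscmp(buf, L"else"), assuming that wchar_t on your system is equivalent to uint32_t. In C++ with ICU, it appears you could create a UnicodeString with UNICODE_STRING("…") and then use toUTF32() to produce a UTF-32 string. There may also be neater ways to do it.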

  3. There are formatting classes that handle both formatting and parsing. You would probably use something from the NumberFormat class.
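The plain-C suggestions from points 1 and 2 can be sketched without ICU as shown here. The function names are hypothetical, `int32_t` stands in for UChar32, and the code assumes an ASCII-compatible execution character set. The wide-character comparison only covers all of Unicode when wchar_t is 32 bits wide, which is typical on Linux but not on Windows.

```c
#include <stdint.h>
#include <wchar.h>

/* Single place to define the literals the scanner compares against.
 * Assumes an ASCII-compatible execution character set, so '+' == 0x2B. */
enum { PLUS_SIGN = '+' };

/* c stands in for a UChar32 code point pulled from the input. */
int is_plus_sign(int32_t c)
{
    return c == PLUS_SIGN;   /* plain integer comparison, no conversion needed */
}

/* Keyword check on a NUL-terminated wide-character buffer. */
int wide_is_else(const wchar_t *buf)
{
    return wcscmp(buf, L"else") == 0;
}
```

If the codeset assumption ever turns out to be wrong, only the enum definition has to change; the comparison sites stay as they are.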




