How can I get the UTF-8 character number from binary in C++?

For example, I have: 11100011 10000010 10100010. It is the UTF-8 binary of the character ア; its number (code point) is 12450.

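How can I get this number from the binary?

Best answer

The byte sequence you show is the UTF-8 encoded version of the character.

You need to decode the UTF-8 to get the Unicode code point.

For this exact byte sequence, the following bits make up the code point: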

11100011 10000010 10100010
    ****   ******   ******

So, concatenating the starred bits, we get the number 0011000010100010, which equals 0x30a2, or 12450 in decimal.
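The concatenation of the starred bits can be written with masks and shifts. This is only a minimal sketch for the three-byte case; `decode_three_byte` is a hypothetical name, and the function assumes the input really is a well-formed three-byte sequence (it only asserts the prefix bits).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decodes exactly one 3-byte UTF-8 sequence.
// Assumes well-formed input; the asserts only check the prefix bits.
std::uint32_t decode_three_byte(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    assert((b0 & 0xF0) == 0xE0);  // lead byte must look like 1110xxxx
    assert((b1 & 0xC0) == 0x80);  // continuation bytes must look like 10xxxxxx
    assert((b2 & 0xC0) == 0x80);

    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)  // 4 starred bits of byte 1
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)   // 6 starred bits of byte 2
         |  static_cast<std::uint32_t>(b2 & 0x3F);        // 6 starred bits of byte 3
}
```

Calling `decode_three_byte(0xE3, 0x82, 0xA2)` with the bytes from the question returns 0x30A2, that is 12450.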

See the Wikipedia description of UTF-8 for details on how to interpret the encoding.

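In a nutshell: if bit 7 is set in the first byte, the number of adjacent bits that are also set (call it m; here m = 2) gives the number of trailing bytes for this code point. The first byte contributes (8-1-1-m) bits, and each trailing byte contributes 6 bits. So here we get (8-1-1-2) = 4 bits from the first byte plus 2*6 = 12 bits from the trailing bytes, 16 bits in total.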
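The counting rule can be sketched in code as follows. The helper names (`leading_ones`, `trailing_bytes`, `payload_bits`) are made up for illustration, and the sketch assumes the modern limit of at most three trailing bytes per code point.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of consecutive 1 bits starting at bit 7.
int leading_ones(std::uint8_t byte) {
    int count = 0;
    for (int bit = 7; bit >= 0 && ((byte >> bit) & 1); --bit) ++count;
    return count;
}

// Number of trailing bytes (m in the text) announced by a lead byte.
// Returns 0 for a single-byte (ASCII) character and -1 if the byte
// cannot start a sequence (10xxxxxx, or more than four leading ones).
int trailing_bytes(std::uint8_t lead) {
    const int ones = leading_ones(lead);
    if (ones == 0) return 0;
    if (ones == 1 || ones > 4) return -1;
    return ones - 1;
}

// Total number of code point bits carried by the whole sequence.
int payload_bits(std::uint8_t lead) {
    const int m = trailing_bytes(lead);
    assert(m >= 0);                  // caller must pass a valid lead byte
    if (m == 0) return 7;            // 0xxxxxxx carries 7 bits
    return (8 - 1 - 1 - m) + 6 * m;  // bits in the lead byte + 6 per trailing byte
}
```

For the lead byte 0xE3 from the question, `trailing_bytes` gives 2 and `payload_bits` gives 16, matching the count above.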

As pointed out in the comments, there are plenty of libraries that do this, so you may not need to implement it yourself.

Other answers

Based on the Wikipedia page, I came up with this:

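#include <stdexcept>

// Decodes the UTF-8 sequence starting at ptr and returns its code point.
// std::runtime_error stands in for the undefined unicode_error type.
unsigned utf8_to_codepoint(const char* ptr) {
    // Work on unsigned bytes: plain char may be signed, which breaks the comparisons.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(ptr);
    if (*p < 0x80) return *p;  // single-byte (ASCII) character
    if (*p < 0xC0) throw std::runtime_error("invalid utf8 lead byte");
    unsigned result = 0;
    int shift = 0;  // number of continuation bytes still to read
    if (*p < 0xE0)      { result = *p & 0x1F; shift = 1; }
    else if (*p < 0xF0) { result = *p & 0x0F; shift = 2; }
    else if (*p < 0xF8) { result = *p & 0x07; shift = 3; }
    else throw std::runtime_error("invalid utf8 lead byte");
    for (; shift > 0; --shift) {
        ++p;
        if (*p < 0x80 || *p >= 0xC0)  // continuation bytes are 10xxxxxx
            throw std::runtime_error("invalid utf8 continuation byte");
        result <<= 6;
        result |= *p & 0x3F;  // keep the low 6 bits
    }
    return result;
}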

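Note that this is a very poor implementation (I highly doubt it even compiles), and it parses many invalid values that it probably shouldn't. I put this up only to show that it is much harder than you'd think, and that you should use a good Unicode library.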
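To give an idea of the invalid values a decoder should reject, here is a rough sketch of the post-decoding checks I am aware of: overlong encodings, surrogate code points, and values above U+10FFFF. The function name `is_acceptable` and its interface are hypothetical, and the list is not claimed to be complete.

```cpp
#include <cstdint>

// Hypothetical check: given a decoded value and the number of bytes the
// sequence used (1 to 4), decide whether the result should be accepted.
bool is_acceptable(std::uint32_t cp, int length) {
    // Smallest code point that legitimately needs `length` bytes (index 0 unused).
    static const std::uint32_t min_for_length[] = {0, 0x0, 0x80, 0x800, 0x10000};

    if (length < 1 || length > 4) return false;
    if (cp < min_for_length[length]) return false;   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate range
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    return true;
}
```

For instance, the value 0x2F decoded from the two bytes C0 AF is rejected as overlong, while 0x30A2 decoded from three bytes is accepted.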




