English 中文(简体)
如何计算对面奶粉?
原标题:How are surrogate pairs calculated?

如果单编码代码点使用17个比值或更多,那么对面奶的计算如何?

最佳回答

统法协会编码点为零值,从0x000到0x10FFFF。 因此,它们是21个轨道分类,而不是17个轨道。

乳房是UTF-16的一种机制。 这表明21倍的微量值为1或2个16倍代码单位。

  • Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
  • Scalar values from 0x00D800 to 0x00DFFF are not characters in Unicode, and so they will never occur in a Unicode character string.
  • Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. The first code unit encodes the upper 11 bits of the scalar value, as a code unit ranging from 0xD800-0xDBFF. There s a bit of trickiness to encode values from 0x01-0x10 in four bits. The second code unit encodes the lower 10 bits of the scalar value, as a code unit ranging from 0xDC00-0xDFFF.

详见统法协会联合会联合会联合体FQ中的样本代码,UTF-8,UTF-16, UTF-32 & BOM。 这部法律参考了统法协会编码标准部分,该部分更为详细。

问题回答

If it is code you are after, here is how a single codepoint is encoded in UTF-16 and UTF-8 respectively.

“UTF-16编码units:

if (cp < 0x10000u)
{
   *out++ = static_cast<uint16_t>(cp);
}
else
{
   *out++ = static_cast<uint16_t>(0xd800u + (((cp - 0x10000u) >> 10) & 0x3ffu));
   *out++ = static_cast<uint16_t>(0xdc00u + ((cp - 0x10000u) & 0x3ffu));
}

A single codepoint to UTF-8 codeunits:

if (cp < 0x80u)
{
   *out++ = static_cast<uint8_t>(cp);
}
else if (cp < 0x800u)
{
   *out++ = static_cast<uint8_t>((cp >> 6) & 0x1fu | 0xc0u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else if (cp < 0x10000u)
{
   *out++ = static_cast<uint8_t>((cp >> 12) & 0x0fu | 0xe0u);
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else
{
   *out++ = static_cast<uint8_t>((cp >> 18) & 0x07u | 0xf0u);
   *out++ = static_cast<uint8_t>(((cp >> 12) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}

这里有希望的是,更加方便开端。

代号编码点在0xD800-0xDF00之间。 该空间的上半部分用于顶层,后半部分用于低层。

因此,为了编码U+10000,你将其余的0x10000以上分成两半,将其分成现有的位置。

D8 00 DC 00

同样,为了编码U+10FFFF,你也得到了帮助。

DB FF DF FF

换言之,从D800到DBFF的数值中,其D800部分被掩盖,其余部分被用于我们想要编码的头十个参数。 同样,从DC00到DFFF的数值掩盖了DDC00的面值,其余值则用于编码值的低十比值。

根据定义,所有这些代码点的基数为0x10000,因此不必加以明确编码,而只是这一基数所抵消。

U+00010000 = base 0x00010000 + 0x00000
= 0000 0000 0000 0000 0000
  mmnn nnnn nnpp qqqq qqqq

U+0010FFFF = base 0x00010000 + 0xFFFFF
= 1111 1111 1111 1111 1111
  mmnn nnnn nnpp qqqq qqqq

......海产中吨数为xxx 和ppqqqqqqqqqqqq为 y

1101 10mm  nnnn nnnn   D8+xxx   1110 11pp  qqqq qqqq   DC+yyy
-----------------------------   -----------------------------
1101 1000  0000 0000   D800     1110 1100  0000 0000   DC00
1101 1011  1111 1111   DBFF     1110 1111  1111 1111   DFFF




相关问题
Why are there duplicate characters in Unicode?

I can see some duplicate characters in Unicode. For example, the character C can be represented by the code points U+0043 and U+0421. Why is this so?

how to extract characters from a Korean string in VBA

Need to extract the initial character from a Korean word in MS-Excel and MS-Access. When I use Left("한글",1) it will return the first syllable i.e 한, what I need is the initial character i.e ㅎ . Is ...

File open error by using codec utf-8 in python

I execute following code on windows xp and python 2.6.4 But it show IOError. How to open file whose name has utf-8 codec. >>> open( unicode( 한글.txt , euc-kr ).encode( utf-8 ) ) Traceback ...

UnicodeEncodeError on MySQL insert in Python

I used lxml to parse some web page as below: >>> doc = lxml.html.fromstring(htmldata) >>> element in doc.cssselect(sometag)[0] >>> text = element.text_content() >>>...

Fast way to filter illegal xml unicode chars in python?

The XML specification lists a bunch of Unicode characters that are either illegal or "discouraged". Given a string, how can I remove all illegal characters from it? I came up with the following ...

热门标签