Question

java.nio.charset.Charset.forName("utf8").decode decodes the byte sequence

 ED A0 80 ED B0 80

into the Unicode code point:

 U+10000

java.nio.charset.Charset.forName("utf8").decode also decodes the byte sequence

 F0 90 80 80

into the Unicode code point:

 U+10000

This is verified by the code below.

Now this seems to tell me that the UTF-8 encoding scheme decodes both ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point.

However, if I visit https://www.google.com/search?query=%ED%A0%80%ED%B0%80,

I can see that it is clearly different from the page https://www.google.com/search?query=%F0%90%80%80.

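Since Google search uses the UTF-8 encoding scheme (correct me if I'm wrong about that, too),

this suggests that UTF-8 does not decode ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point(s).

So basically I was wondering: by the official standard, should UTF-8 decode the byte sequence ED A0 80 ED B0 80 into the Unicode code point U+10000?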

Code:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    public class Test {
        public static void main(String[] args) {
            // Six bytes: each half of the surrogate pair D800 DC00 as a 3-byte sequence.
            ByteBuffer bb = ByteBuffer.wrap(new byte[] {
                (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
            CharBuffer cb = Charset.forName("utf8").decode(bb);
            // Print each decoded UTF-16 code unit (char) in hex.
            for (int x = 0, xx = cb.limit(); x < xx; ++x) {
                System.out.println(Integer.toHexString(cb.get(x)));
            }
            System.out.println();

            // Four bytes: the standard UTF-8 encoding of U+10000.
            bb = ByteBuffer.wrap(new byte[] {
                (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
            cb = Charset.forName("utf8").decode(bb);
            for (int x = 0, xx = cb.limit(); x < xx; ++x) {
                System.out.println(Integer.toHexString(cb.get(x)));
            }
        }
    }

Answer 1

ED A0 80 ED B0 80 is the UTF-8 encoding of the UTF-16 surrogate pair D800 DC00. This is not allowed in UTF-8, per RFC 2279 (http://www.ietf.org/rfc/rfc2279.txt):

However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance)...need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.

However, such an encoding is used in CESU-8 and in Java's "Modified UTF-8".
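To make the byte arithmetic concrete, here is a minimal sketch that applies the 3-byte UTF-8 bit pattern to each surrogate separately (the CESU-8 style described above) and the 4-byte pattern to the code point itself. The class and method names (SurrogateBytes, threeBytePattern, fourBytePattern, hex) are made up for this illustration and are not part of any library.

```java
public class SurrogateBytes {
    // Apply the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx to one 16-bit value.
    // Hypothetical helper: it does no validation and is only meant for illustration.
    static byte[] threeBytePattern(int unit) {
        return new byte[] {
            (byte) (0xE0 | (unit >> 12)),
            (byte) (0x80 | ((unit >> 6) & 0x3F)),
            (byte) (0x80 | (unit & 0x3F))
        };
    }

    // Apply the 4-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    // to a supplementary code point (U+10000 and above).
    static byte[] fourBytePattern(int codePoint) {
        return new byte[] {
            (byte) (0xF0 | (codePoint >> 18)),
            (byte) (0x80 | ((codePoint >> 12) & 0x3F)),
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
            (byte) (0x80 | (codePoint & 0x3F))
        };
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x10000;
        char high = Character.highSurrogate(codePoint); // 0xD800
        char low = Character.lowSurrogate(codePoint);   // 0xDC00

        // Surrogates encoded one by one: the six-byte form from the question.
        System.out.println(hex(threeBytePattern(high)) + " " + hex(threeBytePattern(low)));
        // The code point encoded directly: the four-byte form.
        System.out.println(hex(fourBytePattern(codePoint)));
    }
}
```

The first line printed is the six-byte sequence and the second is the four-byte sequence, which shows that both byte strings carry the same 21 bits of information and differ only in whether the UTF-16 transformation was undone first.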

"Since Google search uses the UTF-8 encoding scheme (correct me if I'm wrong about that, too),"

Judging by the search box, Google appears to use some kind of encoding auto-detection. If you pass it F0 90 80 80, which is valid UTF-8, it interprets it as UTF-8 (U+10000). If you pass it ED A0 80 ED B0 80, which is invalid UTF-8, it interprets it as windows-1252 (í €í°€, where the second character is a no-break space).
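One way to check the valid/invalid distinction from Java is to ask the decoder to report malformed input instead of silently replacing or accepting it. The following is a sketch only: the class and method names (StrictUtf8Check, isStrictlyValidUtf8) are hypothetical, and the result reflects whatever the UTF-8 decoder of the running JDK does, which has evidently not been the same across all Java versions given the output reported in the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Returns true if this JDK's UTF-8 decoder accepts the bytes when
    // configured to report errors rather than substitute U+FFFD.
    static boolean isStrictlyValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] fourByteForm = {
            (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        byte[] surrogateForm = {
            (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 };

        System.out.println("F0 90 80 80       accepted: " + isStrictlyValidUtf8(fourByteForm));
        System.out.println("ED A0 80 ED B0 80 accepted: " + isStrictlyValidUtf8(surrogateForm));
    }
}
```

On current JDKs I would expect the four-byte form to be accepted and the six-byte surrogate form to be rejected, which matches the reading of the standard given above, but the output depends on the runtime.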

Answer 2

Java's UTF8 is actually a CESU-8 variant. The first case is a surrogate pair encoded using the UTF-8 "template".

Answer 3

F0 90 80 80

decodes as U+10000, or LINEAR B SYLLABLE B008 A.

ED A0 80 ED B0 80

decodes as U+D800 U+DC00.
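This distinction is easy to miss in the question's test program, because that program prints UTF-16 code units (Java char values), not code points. A small sketch, with a made-up class name (CodeUnitsVsCodePoints) and helper (decode), shows that the single code point U+10000 decoded from F0 90 80 80 is stored in a Java String as the two chars D800 DC00:

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsVsCodePoints {
    // Hypothetical helper: decode bytes as UTF-8 into a Java String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode(new byte[] {
            (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });

        // Code units: what a loop over chars (as in the question) would print.
        for (int i = 0; i < s.length(); i++) {
            System.out.println("char " + i + ": " + Integer.toHexString(s.charAt(i)));
        }

        // Code points: what the byte sequence actually encodes.
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("first code point: U+"
            + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```

Seeing d800 and dc00 in the per-char output therefore does not by itself tell you whether the input was one supplementary code point or two separately encoded surrogates; comparing code points, or decoding strictly, is needed for that.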
