Is ED A0 80 ED B0 80 a valid UTF-8 byte sequence?

java.nio.charset.Charset.forName("utf8").decode decodes the byte sequence

 ED A0 80 ED B0 80

into the Unicode code point:

 U+10000

java.nio.charset.Charset.forName("utf8").decode also decodes the byte sequence

 F0 90 80 80

into the Unicode code point:

 U+10000

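This is verified by the code below.

This seems to tell me that the UTF-8 encoding scheme decodes ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point.

However, if I visit https://www.google.com/search?query=%ED%A0%80%ED%B0%80

I can see that the result is clearly different from https://www.google.com/search?query=%F0%90%80%80

Since Google search uses the UTF-8 encoding scheme (correct me if I'm wrong),

this suggests that UTF-8 does not decode ED A0 80 ED B0 80 and F0 90 80 80 into the same Unicode code point(s).

So basically I am wondering: by the official standard, should UTF-8 decode the byte sequence ED A0 80 ED B0 80 into the Unicode code point U+10000?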

Code:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class Test {

    public static void main(String[] args) {
        // Surrogate pair D800 DC00, each half written as a 3-byte sequence.
        ByteBuffer bb = ByteBuffer.wrap(new byte[] { (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80 });
        CharBuffer cb = Charset.forName("utf8").decode(bb);
        // Print every decoded UTF-16 code unit in hex.
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
        System.out.println();
        // The 4-byte UTF-8 form of U+10000.
        bb = ByteBuffer.wrap(new byte[] { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 });
        cb = Charset.forName("utf8").decode(bb);
        for (int x = 0, xx = cb.limit(); x < xx; ++x) {
            System.out.println(Integer.toHexString(cb.get(x)));
        }
    }
}

Best answer

ED A0 80 ED B0 80 is the UTF-8 encoding of the UTF-16 surrogate pair D800 DC00. This is not allowed in UTF-8 (see http://www.ietf.org/rfc/rfc2279.txt):

However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance)...need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.
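
The "undo the UTF-16 transformation" step in that passage can be worked through by hand. The following is only a sketch: the class and method names (SurrogateUndo, decode3, combine, toUtf8Hex) are made up for this illustration, and decode3 does no validation at all. It unpacks the two 3-byte groups into 16-bit values, combines them into one code point, and re-encodes that code point as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateUndo {

    // Unpack one 3-byte group (1110xxxx 10xxxxxx 10xxxxxx) into a 16-bit value.
    // Illustration only: the lead and continuation bits are not checked.
    public static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    // Undo the UTF-16 transformation: high + low surrogate -> one code point.
    public static int combine(int high, int low) {
        return Character.toCodePoint((char) high, (char) low);
    }

    // Encode the combined code point as UTF-8 and format the bytes as hex.
    public static String toUtf8Hex(int high, int low) {
        String s = new String(Character.toChars(combine(high, low)));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int high = decode3(0xED, 0xA0, 0x80); // 0xD800
        int low = decode3(0xED, 0xB0, 0x80);  // 0xDC00
        System.out.println(Integer.toHexString(high) + " " + Integer.toHexString(low));
        System.out.println("U+" + Integer.toHexString(combine(high, low)).toUpperCase());
        System.out.println(toUtf8Hex(high, low)); // F0 90 80 80
    }
}
```

In other words, a conforming UTF-8 encoder that starts from the pair D800 DC00 should end up emitting F0 90 80 80, not two 3-byte groups.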

However, such an encoding is used in CESU-8 and Java's "Modified UTF-8".
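
One place where Modified UTF-8 is easy to observe is java.io.DataOutputStream.writeUTF, which is documented to write a 2-byte length followed by the string in modified UTF-8. The sketch below (class and helper names are invented for the example) compares that output with the standard UTF-8 bytes for U+10000.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Encode with writeUTF (modified UTF-8) and drop the 2-byte length prefix.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = out.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Format bytes as space-separated uppercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10000)); // the chars D800 DC00
        System.out.println(hex(modifiedUtf8(s)));                    // ED A0 80 ED B0 80
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // F0 90 80 80
    }
}
```

Modified UTF-8 encodes each UTF-16 code unit separately, which is why the surrogate pair comes out as two 3-byte groups.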

"Since Google search uses the UTF-8 encoding scheme (correct me if I'm wrong),"

It appears, based on the search box, that Google is using some kind of encoding auto-detection. If you pass it F0 90 80 80, which is valid UTF-8, it interprets it as UTF-8 (the character U+10000). If you pass it ED A0 80 ED B0 80, which is invalid UTF-8, it interprets it as windows-1252 (í €í°€, where the second character is a no-break space).
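
What Google actually does is not known here; the following only sketches the general "try strict UTF-8, otherwise fall back to a legacy code page" idea. The class name DetectSketch and the choice of windows-1252 as the fallback are assumptions for illustration. It also assumes a JDK whose UTF-8 decoder rejects encoded surrogates when malformed input is set to REPORT, which is the case in current JDKs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DetectSketch {

    // Decode as strict UTF-8; if the input is malformed, reinterpret it as windows-1252.
    public static String decodeWithFallback(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Hypothetical fallback; a real detector would weigh several candidates.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    // Print the code points of a string as U+XXXX values.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] valid = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        byte[] surrogates = { (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                              (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        System.out.println(codePoints(decodeWithFallback(valid)));
        System.out.println(codePoints(decodeWithFallback(surrogates)));
    }
}
```

With the fallback taken, the six bytes come back as six separate windows-1252 characters rather than as one supplementary character, which matches the difference seen between the two search pages.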

Other answers

Java's UTF8 is actually a CESU-8 variant. The first case is a surrogate pair encoded "UTF-8 style".

F0 90 80 80

decodes as U+10000, or LINEAR B SYLLABLE B008 A.

ED A0 80 ED B0 80

decodes as U+D800 U+DC00.




