UTF-8 incorrectly displayed in Lua/Corona

In Lua, for an iPad Corona project, I'm requesting a UTF-8 text file (containing Chinese characters) from a server using network.request, but the result shows as "garbage" when displayed in the console or in the app. Google Chrome, for instance, displays the same UTF-8 page fine, as the server (using PHP) sends the HTTP header Content-Type: text/plain; charset=utf-8 (and there's no BOM, or byte order mark, either). The "garbage" I'm seeing in Lua looks similar to what I get when I "force" Chrome to render the page as ISO-8859-1 using the options menu.

Does anyone have any help or pointers? If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?

Thanks for any help!

Answers

Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO-8859-1. The most likely cause is that it's doing something really stupid and naive, like treating each byte of the string as a Unicode code point.
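
If the string really has been converted, that is, each original byte was read as an ISO-8859-1 character and then re-encoded as UTF-8, a minimal sketch of undoing that in plain Lua might look like the following. The helper name `undoLatin1DoubleEncoding` is made up for illustration, and the sketch assumes a true ISO-8859-1 mapping (not Windows-1252) and that the whole string was garbled the same way. If the downloaded bytes are already valid UTF-8 and only the display is wrong, no conversion is needed.

```lua
-- Hypothetical helper: undo one round of "bytes read as ISO-8859-1,
-- then re-encoded as UTF-8". Code points U+0080..U+00FF are encoded in
-- UTF-8 as a lead byte 0xC2 or 0xC3 followed by one continuation byte,
-- so each such pair is collapsed back into the single original byte.
local function undoLatin1DoubleEncoding(s)
    local repaired = s:gsub("([\194\195])([\128-\191])", function(lead, cont)
        return string.char((lead:byte() - 192) * 64 + (cont:byte() - 128))
    end)
    return repaired
end

-- UTF-8 bytes of the Chinese character U+4E2D: E4 B8 AD
local original = "\228\184\173"
-- The same three bytes after being treated as ISO-8859-1 and re-encoded
local garbled = "\195\164\194\184\194\173"

print(undoLatin1DoubleEncoding(garbled) == original)
```

Plain ASCII bytes pass through unchanged, so the function can be applied to mixed English and Chinese text that was garbled this way.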

I'm afraid I don't know Corona, so I can't provide any specific solutions, but I'd suggest looking to see what functions it has that involve encodings. There may be a specific function to render a string with a particular encoding, for example.

Can you show the code for your network.request() call?

If you're downloading an HTML page, you should use network.download().

I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. That means that if you pass a UTF-8 string to display.newText(...), it should display properly. If you output to the console, however, it will print the raw bytes of the string, and if you print the length of the string, you get the number of bytes.

So, in summary, Lua treats all strings as an array of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
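
To illustrate the difference between bytes and characters, here is a small sketch in plain Lua. The helper `utf8Length` is a made-up name, not a Corona or Lua function, and it assumes the input is valid UTF-8. It counts every byte that is not a continuation byte (0x80 to 0xBF), which gives one count per character.

```lua
-- Hypothetical helper: count UTF-8 characters by counting the bytes
-- that are NOT continuation bytes (0x80..0xBF). Assumes valid UTF-8.
local function utf8Length(s)
    local _, count = s:gsub("[^\128-\191]", "")
    return count
end

-- UTF-8 bytes of the three Japanese characters U+65E5 U+672C U+8A9E
local text = "\230\151\165\230\156\172\232\170\158"

print(#text)             -- 9, the number of bytes
print(utf8Length(text))  -- 3, the number of characters
```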

I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters; it was still all UTF-8, though). I have a hunch that each character in the string must be the same length in bytes for Corona to display it properly. Try printing out one character at a time to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.
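
For the one-character-at-a-time test, something like the sketch below could be used. The `utf8Chars` helper is a made-up name, and it assumes the string is valid UTF-8: it splits on each non-continuation byte followed by any continuation bytes. Note that in UTF-8 an ASCII character takes one byte while a Japanese character typically takes three, so a mixed string always contains characters of different byte lengths.

```lua
-- Hypothetical helper: iterate over a UTF-8 string one character at a
-- time. Each match is a lead (non-continuation) byte followed by its
-- continuation bytes (0x80..0xBF). Assumes the input is valid UTF-8.
local function utf8Chars(s)
    return s:gmatch("[^\128-\191][\128-\191]*")
end

-- "Hi" followed by the UTF-8 bytes of the Japanese character U+65E5
local mixed = "Hi\230\151\165"

local sizes = {}
for ch in utf8Chars(mixed) do
    sizes[#sizes + 1] = #ch
    print(ch, #ch)  -- the character and its length in bytes
end
```

In a Corona project, each `ch` could be passed to display.newText(...) instead of print to check which characters render correctly.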




