How do you think Google is handling this encoding issue?

I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like Firefox's default character encoding for URLs is NOT UTF-8, whereas most other browsers do use UTF-8. It also looks like Firefox tries to make an intelligent decision about which character encoding to use, based on the content of the URL.

For example, if you enter a URL with a q parameter directly into the address bar (I'm using Firefox 3.5.5), you will get the following results.

For the given query string parameter, this is how it's actually encoded in the HTTP request:
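1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded; strictly speaking, 0x9E maps to ž in windows-1252, which browsers commonly use in place of iso-8859-1)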
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded, which is odd, because the first part of the value is the same as in 1, which was iso-8859-1 encoded.)

So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in that box, the q parameter (as in the example above, the parameter that holds the query string value) is sent with the request UTF-8 encoded, and all is well and good.

However, the URL that then appears in the address bar is shown in decoded form, so the q parameter looks like "q=Književni". As I mentioned before, if the user then presses ENTER to submit what is in the address bar, the "q=Književni" parameter is re-encoded as iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem is that we always expect a UTF-8 encoded URL, so when we receive this parameter our application does not know how to interpret it, and that can cause some strange results.
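To make the "strange results" concrete, here is a minimal Python sketch of what a server that assumes UTF-8 would see for that request. It is only an illustration using the standard library's urllib.parse, not the code of the application described above; the variable names are made up for the example.

```python
from urllib.parse import unquote, unquote_to_bytes

# The value as it arrives after pressing ENTER in the address bar.
raw = "Knji%9Eevni"

# The percent-escapes decode to a single 0x9E byte, not the two-byte
# UTF-8 sequence C5 BE that the server expects for "ž".
raw_bytes = unquote_to_bytes(raw)

# Strict UTF-8 decoding rejects the lone 0x9E byte outright.
try:
    unquote(raw, encoding="utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The default error handler ("replace") does not fail; it silently
# substitutes U+FFFD, which is one way a search term ends up mangled.
lenient = unquote(raw)
```

Whether a real application fails loudly or substitutes a replacement character depends on how its framework decodes query strings.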

As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:

http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni

So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?

Best answer

It looks like Firefox uses latin-1 unless some character in the URL can't be represented in that encoding, in which case it uses UTF-8 for the whole value.
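A rough Python sketch of that guessed rule follows. This models only the behaviour observed in the question, not Firefox's actual implementation, and the function name address_bar_quote is made up for illustration. One assumption worth flagging: the sketch uses cp1252 (windows-1252) rather than strict iso-8859-1, because the byte 0x9E seen in the question maps to "ž" only in windows-1252.

```python
from urllib.parse import quote

def address_bar_quote(value: str) -> str:
    """Hypothetical model: single-byte encoding if it fits, else UTF-8."""
    try:
        # If every character fits in the single-byte encoding, use it.
        value.encode("cp1252")
        encoding = "cp1252"
    except UnicodeEncodeError:
        # One unrepresentable character switches the whole value to UTF-8.
        encoding = "utf-8"
    return quote(value, safe="", encoding=encoding)

case1 = address_bar_quote("Književni")
case2 = address_bar_quote("漢字")
case3 = address_bar_quote("Književni漢字")
```

With these inputs the sketch yields the same three encoded values listed in the question, which is consistent with the guess but does not prove it.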

If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).

Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
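The validate-then-fall-back approach could look something like this in Python. It is a sketch, not a drop-in fix: decode_query_value is a made-up name, and the fallback is assumed to be cp1252 (the windows-1252 superset of latin-1), since that is the encoding in which the 0x9E byte from the question means "ž".

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw: str, fallback: str = "cp1252") -> str:
    """Decode a percent-encoded query value, preferring UTF-8."""
    # In query strings "+" stands for a space; undo that before unescaping.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        # Strict decoding doubles as the UTF-8 validation step.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the single-byte fallback encoding.
        # "replace" covers the few byte values cp1252 leaves undefined.
        return data.decode(fallback, errors="replace")

from_utf8 = decode_query_value("Knji%C5%BEevni")
from_single_byte = decode_query_value("Knji%9Eevni")
ambiguous = decode_query_value("%C3%A9")
```

Both forms of the search term decode to "Književni". The last call shows the residual ambiguity: the bytes C3 A9 are valid UTF-8 for "é" but could also have been meant as the two latin-1 characters "Ã©", and the UTF-8 reading always wins.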

Still, the possibility exists, and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise, for example for compatibility with servers that wouldn't know UTF-8 if they stepped in it.

Other answers

There are several parts in a URL. The domain name is encoded according to the IDN (Internationalized Domain Name) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).

The part that you care about (usually) comes from a form, and the encoding of the source page determines the encoding used before the % escaping is applied. The form element in HTML can also take an accept-charset attribute, which overrides the page setting.

So it is not the fault of Firefox: the encoding of the referring page or form is the determining factor, and that is the standard behavior.
