What is the foolproof way to convert some string (UTF-8 or else) to a simple ASCII string in Python?

Inside my Python script, I get some string back from a function which I didn't write. The encoding of it varies. I need to convert it to ASCII format. Is there some foolproof way of doing this? I don't mind replacing the non-ASCII chars with blanks or something else...

Best answer

If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:

Don't muck about with encode/decode; use the repr() function (Python 2.x) or the ascii() function (Python 3.x).
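As a rough illustration of that advice, here is a small Python 3 sketch. The sample string is made up for the example; `ascii()` and `ast.literal_eval()` are standard.

```python
import ast

# Hypothetical input: "café" followed by a snowman character.
sample = "caf\u00e9 \u2603"

# ascii() returns a quoted, escaped representation that is pure ASCII.
escaped = ascii(sample)
print(escaped)

# Every character of the representation is in the 7-bit range...
assert all(ord(ch) < 128 for ch in escaped)

# ...and no information is lost: the original can be rebuilt from it.
restored = ast.literal_eval(escaped)
assert restored == sample
```

Note that the result includes the surrounding quotes and backslash escapes, so it is a representation of the string rather than a cleaned-up version of it.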

Answers

You say "the encoding of it varies". I guess that by "it" you mean a Python 2.x "string", which is really a sequence of bytes.

Answer part one: if you do not know the encoding of that encoded string, then no, there is no way at all to do anything meaningful with it*. If you do know the encoding, then step one is to convert your str into a unicode:

encoded_string = i_have_no_control()
the_encoding = 'utf-8'  # for the sake of example
text = unicode(encoded_string, the_encoding)

Then you can re-encode your unicode object as ASCII, if you like.

ascii_garbage = text.encode('ascii', 'replace')
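For readers on Python 3, where the byte string is a `bytes` object and `unicode` is simply `str`, the same two steps might look like the sketch below. The helper name `to_ascii` is an invention for this example, and it assumes the encoding really is known.

```python
def to_ascii(encoded: bytes, encoding: str) -> str:
    """Decode bytes with a known encoding, then force the text into ASCII."""
    # Step one: bytes -> text. This raises UnicodeDecodeError if the
    # stated encoding does not match the data.
    text = encoded.decode(encoding)
    # Step two: text -> ASCII, with '?' standing in for anything else.
    return text.encode("ascii", "replace").decode("ascii")


# Hypothetical input: "naïve café" encoded as UTF-8.
raw = "na\u00efve caf\u00e9".encode("utf-8")
print(to_ascii(raw, "utf-8"))
```

Each non-ASCII character becomes a single `?` here because the replacement happens after decoding, on characters rather than on bytes.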

* There are heuristic methods for guessing encodings, but they are slow and unreliable. Here's one excellent attempt in Python.
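To give a feel for what such guessing involves, here is a deliberately naive Python 3 sketch that simply tries a few candidate encodings in order. It is not the detector referred to above, the candidate list is an arbitrary assumption, and it can easily guess wrong: a successful decode only proves the bytes are valid in that encoding, not that the encoding is the right one.

```python
def guess_decode(raw: bytes, candidates=("ascii", "utf-8", "cp1252")):
    """Return (text, encoding_name) for the first candidate that decodes.

    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value to a character, so it cannot fail;
    # it is a last resort, not evidence that the data is Latin-1.
    return raw.decode("latin-1"), "latin-1"


print(guess_decode(b"abc"))
print(guess_decode("\u00e9".encode("utf-8")))
print(guess_decode(b"\xe9"))
```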

I'd try to normalize the string, then encode it. What about:

import unicodedata
s = u"éèêàùçÇ"
print unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

This works only if you have unicode as input. Therefore, you must know what kind of encoding the function outputs and decode it. If you don't, there are encoding detection heuristics, but on short strings they are not reliable.
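The same normalize-then-encode idea in Python 3 could be sketched as follows. The function name `strip_accents` is made up for this example; it assumes the input is already a decoded `str`.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters, then keep only the ASCII parts."""
    # NFKD splits e.g. 'é' into 'e' plus a combining accent character.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so 'ignore' drops them.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# The sample string from above, written with escapes: éèêàùçÇ
sample = "\u00e9\u00e8\u00ea\u00e0\u00f9\u00e7\u00c7"
print(strip_accents(sample))
```

Characters that have no decomposition into an ASCII base letter, such as `ø`, are dropped entirely rather than approximated.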

Of course, you could get lucky: the function's outputs may rely on various unknown encodings that all use ASCII as a base, and therefore assign the same values to the bytes from 0 to 127 (as UTF-8 does).

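In that case, you can just get rid of the unwanted chars by filtering them using OrderedSets:

import string  # string.printable holds the printable ASCII chars
# OrderedSet is not in the standard library; it has to come from a recipe or package.
# Note that a set intersection also drops repeated characters.
print "".join(OrderedSet(string.printable) & OrderedSet(s))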

Or if you want blanks instead:

print("".join(char if char in string.printable else " " for char in s))

The translate() string method can help you do the same.
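One possible way to do this with `str.translate()` in Python 3 is sketched below. The `BlankOutOthers` class is a hypothetical helper: it relies on `translate()` looking up each code point in the table, and on `dict.__missing__` supplying a blank for anything that is not listed.

```python
import string


class BlankOutOthers(dict):
    """Translation table that maps every unlisted code point to a blank."""

    def __missing__(self, codepoint):
        return " "


# Printable ASCII characters map to themselves; everything else is missing.
table = BlankOutOthers({ord(ch): ch for ch in string.printable})

# Hypothetical input: "héllo wörld"
sample = "h\u00e9llo w\u00f6rld"
print(sample.translate(table))
```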

The only way to know if you are this lucky is to try it out... Sometimes, a big fat lucky day is what any dev needs :-)

What's meant by "foolproof" is that the function does not fail with even the most obscure, impossible input -- meaning, you could feed the function random binary data and IT WOULD NEVER FAIL, NO MATTER WHAT. That's what "foolproof" means.

The function should then proceed to do its best to convert to the destination encoding. If it has to throw away all the trash it does not understand, then that is perfectly fine and is in fact the most desirable result. Why try to salvage all the junk? Just discard the junk. Tell the user he's not merely a moron for using Microsoft anything, but a non-standard moron for using non-standard Microsoft anything... or for attempting to send in binary data!
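In Python 3 terms, a function meeting that definition might look like the sketch below. `force_ascii` is a made-up name, and the sketch assumes the input is always `bytes`, `bytearray`, or `str`; it simply discards whatever is not ASCII, and a crude loop over random data checks that it never raises.

```python
import os


def force_ascii(value):
    """Return an ASCII-only str, discarding everything else."""
    if isinstance(value, (bytes, bytearray)):
        # Bytes above 0x7F are silently dropped.
        return bytes(value).decode("ascii", errors="ignore")
    # Already text: drop every non-ASCII character.
    return value.encode("ascii", errors="ignore").decode("ascii")


# Crude fuzz check: random binary junk must never raise an exception,
# and whatever comes back must be pure ASCII.
for _ in range(1000):
    junk = os.urandom(64)
    cleaned = force_ascii(junk)
    assert all(ord(ch) < 128 for ch in cleaned)
```

The price of never failing is that the function cannot tell meaningful text in another encoding from junk; both are thrown away.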

I have precisely this same need (though my need is in PHP), and I also have users who are at least as moronic as I am, sometimes more so; however, they are definitely nicer and no doubt more patient.

The best, bottom-line thing I've found so far is (in PHP 5.3):

$fixed_string = iconv('ISO-8859-1', 'UTF-8//IGNORE//TRANSLATE', $in_string);

This attempts to translate whatever it can and simply throws away all the junk, resulting in a legal UTF-8 string output. I've also not been able to break it or cause it to fail or reject any incoming text or data, even by feeding it gobs of binary junk data.

Finding iconv() and getting it to work is easy; what's so maddening and wasteful is reading through all the total garbage and bend-over-backwards idiocy that so many programmers seem to espouse when dealing with this encoding fiasco. What's become of the enviable (and respectable) "Flail and Burn The Idiots" mentality of old school programming? Let's get back to basics. Use iconv() and throw away their garbage, and don't be bashful when telling them you threw away their garbage -- in short, don't fail to flail the morons who feed you garbage. And you can tell them I told you so.

If all you want to do is preserve ASCII-compatible characters and throw away the rest, then in most encodings that boils down to removing all characters that have the high bit set -- i.e., characters with value over 127. This works because nearly all character sets are extensions of 7-bit ASCII.
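A minimal Python 3 sketch of that byte-level filter is shown below. The function name is made up, and the three encodings in the loop are just examples of ASCII-compatible character sets. Encodings that are not extensions of ASCII, such as UTF-16, would not survive this treatment.

```python
def strip_high_bit_bytes(raw: bytes) -> bytes:
    """Keep only bytes in the 7-bit ASCII range (0 to 127)."""
    return bytes(b for b in raw if b < 0x80)


# Hypothetical input: "1ä2äö3öü4ü", written with escapes.
text = "1\u00e42\u00e4\u00f63\u00f6\u00fc4\u00fc"

# The same text in different ASCII-compatible encodings yields different
# bytes, but the ASCII part that survives the filter is identical.
for encoding in ("utf-8", "iso-8859-1", "cp1252"):
    stripped = strip_high_bit_bytes(text.encode(encoding))
    print(encoding, stripped)
```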

If it's a normal string (i.e., not unicode), you need to decode it in an arbitrary character set (such as iso-8859-1, because it accepts any byte values) and then encode in ascii, using the ignore or replace option for errors:

>>> orig = '1ä2äö3öü4ü'
>>> orig.decode('iso-8859-1').encode('ascii', 'ignore')
'1234'
>>> orig.decode('iso-8859-1').encode('ascii', 'replace')
'1??2????3????4??'

The decode step is necessary because you need a unicode string in order to use encode. If you already have a Unicode string, it's simpler:

>>> orig = u'1ä2äö3öü4ü'
>>> orig.encode('ascii', 'ignore')
'1234'
>>> orig.encode('ascii', 'replace')
'1?2??3??4?'



