encoding of file shell script

How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1.

Thanks

Best answer

I'd just use

file -bi myfile.txt

to determine the character encoding of a particular file.

This solution has an external dependency, but I suspect file is available on all semi-modern distros nowadays.

EDIT:

In response to Laurence Gonsalves' comment: -b is the option to be brief (not include the filename) and -i is the shorthand equivalent of --mime, so the most portable way (including Mac OS X) is probably:

file --mime myfile.txt 
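
If the result is needed inside a program rather than on the terminal, the output of file --mime can be parsed. The following is only a sketch: the helper names parse_charset and file_charset are made up for illustration, and it assumes the file command is installed and prints a "charset=..." field as shown above.

```python
import subprocess


def parse_charset(mime_line):
    """Extract the charset from a line like 'text/plain; charset=utf-8'.

    Also accepts the non-brief form 'myfile.txt: text/plain; charset=utf-8'.
    Returns None when no charset field is present.
    """
    for part in mime_line.split(";"):
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip().lower()
    return None


def file_charset(path):
    """Run `file -b --mime` on path and return the reported charset, or None.

    Assumption: the `file` command is on PATH.
    """
    result = subprocess.run(
        ["file", "-b", "--mime", path],
        capture_output=True, text=True, check=True,
    )
    return parse_charset(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to check the parsing without having file installed.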
Other answers

There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction try to decode the file as UTF-8 (as that's the stricter encoding) and, if that fails, fall back to ISO-8859-1. You can do this with iconv "by hand", or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text
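
The decode-and-fall-back approach described above can also be done by hand. Here is a minimal Python sketch; the function name guess_encoding is hypothetical, and "iso-8859-1" is returned purely as a fallback label, since any byte sequence decodes as ISO-8859-1.

```python
def guess_encoding(data: bytes) -> str:
    """Classify bytes as 'ascii', 'utf-8', or 'iso-8859-1' (fallback)."""
    # Pure ASCII is valid in both UTF-8 and ISO-8859-1, so report it separately.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # UTF-8 is strict: most non-UTF-8 byte sequences fail to decode.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 accepts every byte, so this is a guess, not a proof.
        return "iso-8859-1"
```

For a file, the bytes would come from something like open(path, "rb").read().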

Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".

You can use the file command: file --mime myfile.text

The file command is not 100% reliable. A simple test:
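
To illustrate why the ISO-8859 variants cannot be told apart by decoding alone, here is a small Python sketch (decode_both is a made-up helper name). The same byte decodes without error under both encodings and simply yields different characters.

```python
def decode_both(raw: bytes):
    """Decode the same bytes as ISO-8859-1 and as ISO-8859-2."""
    return raw.decode("iso-8859-1"), raw.decode("iso-8859-2")


# 0xB1 is a valid byte in both encodings, but maps to a different character.
latin1_text, latin2_text = decode_both(bytes([0xB1]))
print(repr(latin1_text), repr(latin2_text))
```

Neither decode raises an error, so only knowledge of the expected language can decide which reading is the intended one.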

#!/bin/bash

echo "a" > /tmp/foo

for i in {1..1000000}
do
  echo "asdas" >> /tmp/foo
done

echo "üöäÄÜÖß " >> /tmp/foo

file -b --mime-encoding /tmp/foo

This outputs:

us-ascii

ASCII has no German umlauts, so this answer is wrong. Presumably file only inspects the beginning of the file and never sees the last line.

A file is just a sequence of bytes. Without trusting metadata (a BOM, which is only recommended for UTF-16 and UTF-32; a MIME type; a header in the data) you can't really detect the encoding. A sequence of bytes can be interpreted as UTF-8 or ISO-8859-1/2 or anything you want; whether that works depends on whether the particular sequence has a mapping in the encoding you pick. What you want to do is decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for this sequence of bytes.

In a shell script you could use Python, Perl or, as Laurence Gonsalves says, iconv. For text files I use this in Python:

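import codecs

# path is the file to check; with errors='strict', reading from f raises
# UnicodeDecodeError as soon as invalid UTF-8 is encountered.
f = codecs.open(path, encoding='utf-8', errors='strict')


def valid_string(data):
  # data must be a byte string (bytes in Python 3, str in Python 2)
  try:
    data.decode('utf-8')
    return True
  except UnicodeDecodeError:
    return False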
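
The snippets above check a string that is already in memory. For a large file like the one in the test script, the same strict decoding can be done chunk by chunk. This is a sketch only; file_is_valid and its chunk_size parameter are names chosen here for illustration.

```python
import codecs


def file_is_valid(path, encoding="utf-8", chunk_size=65536):
    """Return True if every byte of the file decodes under `encoding`."""
    # An incremental decoder copes with multi-byte sequences that are
    # split across two chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                decoder.decode(chunk)
        # Flush: raises if the file ends in the middle of a sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Unlike a check that samples only the start of the file, this reads every byte, so a stray non-UTF-8 line at the very end is still caught.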

How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (in which case the file is in a UTF encoding).
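
A BOM check along those lines might look like the following Python sketch. The function name bom_encoding and the returned labels are choices made for this example; the BOM constants come from the standard codecs module.

```python
import codecs

# Longer BOMs are listed first: the UTF-32-LE BOM begins with the same
# two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def bom_encoding(head: bytes):
    """Return an encoding name if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Passing the first four bytes of the file, for example open(path, "rb").read(4), is enough. A result of None says nothing about the encoding; it only means no BOM was found.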




