ASCII-8BIT is an alias for BINARY
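A quick way to check this in irb. This is a minimal sketch using only core Ruby; the variable names are illustrative.

```ruby
# Both constants refer to the very same Encoding object.
binary_is_alias = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

# String#b returns a copy of the string tagged as binary (ASCII-8BIT).
raw = "caf\u00E9".b
raw_is_binary = (raw.encoding == Encoding::BINARY)
```

So a body reported as ASCII-8BIT is just raw bytes with no character encoding attached yet.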
open-uri does a funny thing: if the file is smaller than 10 KB (or something like that), it returns a StringIO, and if it's bigger it returns a Tempfile. That can be confusing if you're trying to deal with encoding issues.
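One way to sidestep the type difference is to normalize whatever comes back into a plain String before touching encodings. The helper below is a hypothetical sketch (read_body is not part of open-uri); it assumes the object is either a String or something IO-like that responds to read.

```ruby
require 'stringio'

# Hypothetical helper: return the content as a String, whether we were handed
# a String or an IO-like object (StringIO, Tempfile, File).
def read_body(source)
  return source unless source.respond_to?(:read)

  source.rewind if source.respond_to?(:rewind) # start from the beginning
  source.read
end

from_io     = read_body(StringIO.new("hello"))
from_string = read_body("hello")
```

Either way you end up with a String whose encoding you can inspect and fix in one place.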
If you have a large number of files, I suggest downloading them manually:
require 'uri'
require 'net/http'
require 'net/https'
uri = URI.parse(url_to_file)
http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see ssl errors (this disables certificate verification)
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
body = http.start { |session| session.get(uri.request_uri) }.body
Then you can use the https://rubygems.org/gems/ensure-encoding gem:
require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)
I have been pretty happy with ensure-encoding; we use it in production at http://data.brighterplanet.com
Note that you can also say :invalid_characters => :ignore instead of :transcode.
Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
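If you would rather not add a gem, something roughly similar can be sketched with core String methods. This is not how ensure-encoding is implemented and it does no sniffing; to_utf8 is a hypothetical helper, and falling back to "assume UTF-8 and drop invalid bytes" is an assumption made here for illustration.

```ruby
# Hypothetical helper, not part of ensure-encoding.
def to_utf8(bytes, known_encoding = nil)
  if known_encoding
    # Reinterpret the raw bytes in the known encoding, then transcode to UTF-8,
    # replacing anything that cannot be converted.
    bytes.dup.force_encoding(known_encoding)
         .encode('UTF-8', invalid: :replace, undef: :replace)
  else
    # Encoding unknown: assume UTF-8 and drop any invalid byte sequences.
    bytes.dup.force_encoding('UTF-8').scrub('')
  end
end

latin1_body = "caf\xE9".b                 # raw ISO-8859-1 bytes
converted   = to_utf8(latin1_body, 'ISO-8859-1')
cleaned     = to_utf8("abc\xFFdef".b)     # \xFF is never valid in UTF-8
```

The first branch corresponds loosely to passing a known :external_encoding, the second to ignoring invalid characters.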