Why might a PDF document not be searchable? [closed]
Closed. This question is off-topic. It is not currently accepting answers.


Closed 13 years ago.

I have a PDF document with content in Arabic, and when I try to search inside the document for a specific word, Adobe Reader returns no results.

It seems to be a format problem. How can I fix that? Thanks.

Best answer

There are at least four different ways to get text into a PDF document (in order of likelihood):

  1. Place the text with standard text operators and standard fonts
  2. Place the text with standard text operators with non-standard fonts
  3. Draw one or more images that represent the text
  4. Place the text by manually drawing the glyphs with various PDF graphics commands

Case 1 is typically searchable. Case 2 is searchable if the font and encoding are sane; if they're not (which is likely for non-Latin fonts), there is probably no reliable way to map the encoded glyphs back to Unicode (and, by the way, PDF is fairly Unicode-hostile). Case 3 is totally unsearchable without knowing more about how the PDF was generated. Case 4 is totally unsearchable.
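
One rough way to tell these cases apart is to look at what a text extractor returns for a page. The sketch below is only a heuristic and rests on assumptions that are not from the answer: `text` is a string you already obtained from some PDF text extractor (not shown here), unmapped glyphs tend to show up as private-use, unassigned, control, or U+FFFD characters, and the 30% threshold is arbitrary. The function name and labels are made up for illustration.

```python
import unicodedata


def classify_extracted_text(text: str) -> str:
    """Guess why a page's extracted text may not be searchable.

    `text` is whatever a PDF text extractor returned for one page;
    the extractor itself is outside this sketch.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Nothing extractable: consistent with text drawn as images
        # or as vector paths (cases 3 and 4).
        return "no-text-layer"

    # Private-use (Co), unassigned (Cn), and control (Cc) characters,
    # plus U+FFFD, often indicate glyph codes that could not be mapped
    # back to Unicode (case 2 with a broken font or encoding).
    unmapped = sum(
        1
        for ch in visible
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cc")
    )
    if unmapped / len(visible) > 0.3:  # arbitrary threshold (assumption)
        return "unmapped-glyphs"
    return "looks-searchable"
```

Note that a broken encoding can also produce plausible-looking but wrong letters, which this check will not catch, so "looks-searchable" is a hint rather than a guarantee.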

That said, all of these cases can be read with an OCR engine that understands Arabic. I understand that the Iris engine does Arabic.

Answers

It might not actually be text, or it might be in a container that Reader doesn't pay attention to. It's especially common to expand text objects into vector shapes when you're dealing with fonts that most people aren't going to have installed on their system. It looks the same on the screen, but it's not searchable.
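
To check whether a page really contains text objects rather than shapes, you can count the operators in its content stream. This is a minimal sketch under several assumptions: `stream` is an already decompressed page content stream (they are usually Flate-compressed in the file), the tokenizer is deliberately naive (it ignores nested parentheses in strings, comments, and inline image data), and the function name and labels are hypothetical. The operator names themselves (`Tj`, `TJ`, `'`, `"` for showing text; `m`, `l`, `c`, `v`, `y`, `re`, `h` for building paths; `Do` and `BI` for XObjects and inline images) are the standard PDF ones.

```python
import re
from collections import Counter

TEXT_OPS = {"Tj", "TJ", "'", '"'}                 # text-showing operators
PATH_OPS = {"m", "l", "c", "v", "y", "re", "h"}   # path construction operators
XOBJECT_OPS = {"Do", "BI"}                        # XObjects (often images), inline images

# Literal strings "(...)" with backslash escapes, and hex strings "<...>".
# Simplification: balanced nested parentheses inside strings are not handled.
_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>", re.DOTALL)


def classify_content_stream(stream: bytes) -> str:
    """Label a decoded content stream by the kind of operators it uses."""
    # Drop string operands so their contents are not mistaken for operators.
    stripped = _STRING.sub(b" ", stream)
    # Array brackets are delimiters and may touch operators, e.g. "[...]TJ".
    stripped = re.sub(rb"[\[\]]", b" ", stripped)
    ops = Counter(tok.decode("latin-1") for tok in stripped.split())

    text = sum(ops[o] for o in TEXT_OPS)
    paths = sum(ops[o] for o in PATH_OPS)
    xobjects = sum(ops[o] for o in XOBJECT_OPS)

    if text:
        return "text-operators"
    if paths > xobjects:
        return "vector-paths"
    if xobjects:
        return "xobjects"
    return "empty"
```

For example, `classify_content_stream(b"BT /F1 12 Tf (Hi) Tj ET")` reports text operators, while a stream made only of `m`, `l`, and `c` commands reports vector paths. A "text-operators" result only says text objects exist; whether they are searchable still depends on the font and encoding. `Do` can also paint a form XObject that itself contains text, so "xobjects" does not prove the page is a scanned image.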




