Can I prevent ABCpdf from mashing words together (e.g. mashingwordstogether) when converting PDF to text?
  • Date: 2011-10-19 19:34:09
  • Tags:
  • abcpdf

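I'm using ABCpdf to extract the text content of some PDF documents, specifically by calling Doc.GetText("Text"). (You call it in a loop, once per page.) This usually works well, but for some PDF files the resulting text contains words with no space characters between them, for example:

ThisSentenceDoesn'tHaveAnySpacesBetweenWords.

Interestingly, if I extract the text from the exact same PDFs with Apache Tika (which is powered by PDFBox), I tend to get all the spaces I expect between words. That is, Tika would render the sentence above as:

This sentence doesn't have any spaces between words.

In general, the two tools seem to be afraid of making different mistakes: ABCpdf acts as if the worst thing in the world would be to insert a space where one doesn't belong, while Tika acts as if the worst thing would be to fail to insert a space where one does belong.

Is there any setting that makes ABCpdf behave more like Tika in this regard?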

Best answer

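Short Answer: You can get the XML via Doc.GetText("SVG"), step through the TEXT and TSPAN elements, and determine whether there are layout gaps that should be treated as actual spaces. The behavior you are seeing from PDFBox is probably its attempt to make that assumption. Also, even Adobe Acrobat can return the spaced text via the clipboard, as PDFBox does.

Long Answer: Inferring spaces this way may cause more problems, as it may not reflect the original intent of the text in the PDF.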

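ABCpdf is doing the right thing here, because the PDF only describes where things should be placed in the output. It is possible to construct a PDF that ABCpdf interprets in both styles, even though the original sentences look nearly identical.

To demonstrate this, here is a snapshot of an Adobe InDesign document showing text that matches the two cases from your sample sentences.

[Snapshot of the InDesign document]

Note that the first line was not built with actual spaces. Instead, the words were placed by hand in individual text areas so that the line roughly looks like a properly spaced sentence. The second line is a single sentence in a single text area, with actual text spaces between the words.

When this is exported to PDF and then read by ABCpdf, Doc.GetText("TEXT") will return: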

ThisSentenceDoesn’tHaveAnySpacesBetweenWords.
This Sentence Doesn’t Have Any Spaces Between Words.

Thus, if you wish to detect layout spaces, you must use SVG output and step through the tokens of text manually. Doc.GetText("SVG") returns text and other drawing entities as ABCpdf sees them on the page, and you can decide how you want to handle the case of layout-based spacing.

You'll receive output similar to this:

<?xml version="1.0" standalone="no"?>
<svg width="612" height="792" x="0" y="0" version="1.1" baseProfile="full" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<text xml:space="preserve" x="36" y="46.1924" font-size="14" font-family="ArialMT" textLength="26.446" transform="translate(36, 46.1924) translate(-36, -46.1924)">This</text>
<text xml:space="preserve" x="66.002" y="46.1924" font-size="14" font-family="ArialMT" textLength="59.15" transform="translate(66.002, 46.1924) translate(-66.002, -46.1924)">Sentence</text>
<text xml:space="preserve" x="129.604" y="46.1924" font-size="14" font-family="ArialMT" textLength="47.46" transform="translate(129.604, 46.1924) translate(-129.604, -46.1924)">Doesn&#8217;t</text>
<text xml:space="preserve" x="181.208" y="46.1924" font-size="14" font-family="ArialMT" textLength="32.676" transform="translate(181.208, 46.1924) translate(-181.208, -46.1924)">Have</text>
<text xml:space="preserve" x="219.61" y="46.1924" font-size="14" font-family="ArialMT" textLength="24.122" transform="translate(219.61, 46.1924) translate(-219.61, -46.1924)">Any</text>
<text xml:space="preserve" x="249.612" y="46.1924" font-size="14" font-family="ArialMT" textLength="46.69" transform="translate(249.612, 46.1924) translate(-249.612, -46.1924)">Spaces</text>
<text xml:space="preserve" x="301.216" y="46.1924" font-size="14" font-family="ArialMT" textLength="54.474" transform="translate(301.216, 46.1924) translate(-301.216, -46.1924)">Between</text>
<text xml:space="preserve" x="360.016" y="46.1924" font-size="14" font-family="ArialMT" transform="translate(360.016, 46.1924) translate(-360.016, -46.1924)"><tspan textLength="13.216">W</tspan><tspan dx="-0.252" textLength="31.122">ords.</tspan></text>
<text xml:space="preserve" x="36.014" y="141.9944" font-size="14" font-family="ArialMT" transform="translate(36.014, 141.9944) translate(-36.014, -141.9944)">
<tspan textLength="181.3">This Sentence Doesn&#8217;t Have </tspan><tspan dx="-0.756" textLength="150.178">Any Spaces Between W</tspan><tspan dx="-0.252" textLength="31.122">ords.</tspan></text>
</svg>

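Note also that the basic structure reveals the original intent that is causing your problem. (xml:space and the other attributes have been removed, and whitespace altered, for the sake of the example.)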

<?xml version="1.0" standalone="no"?>
<svg>
  <text>This</text>
  <text>Sentence</text>
  <text>Doesn&#8217;t</text>
  <text>Have</text>
  <text>Any</text>
  <text>Spaces</text>
  <text>Between</text>
  <text><tspan>W</tspan><tspan>ords.</tspan></text>
  <text>
    <tspan>This Sentence Doesn&#8217;t Have </tspan>
    <tspan>Any Spaces Between W</tspan>
    <tspan>ords.</tspan>
  </text>
</svg>
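
As a rough illustration of stepping through the tokens manually, here is a minimal sketch in Python (not ABCpdf's own .NET API). It assumes you already have the SVG string returned by Doc.GetText("SVG"); the names svg_to_lines, run_width and gap_ratio are made up for this example, and the rule "insert a space when the horizontal gap exceeds 0.15 × font-size" is only a guessed heuristic that you would need to tune for your documents. The sample data is the SVG above with the attributes the sketch does not read trimmed away.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the SVG shown above (only the attributes used here).
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
<text x="36" y="46.1924" font-size="14" textLength="26.446">This</text>
<text x="66.002" y="46.1924" font-size="14" textLength="59.15">Sentence</text>
<text x="129.604" y="46.1924" font-size="14" textLength="47.46">Doesn&#8217;t</text>
<text x="181.208" y="46.1924" font-size="14" textLength="32.676">Have</text>
<text x="219.61" y="46.1924" font-size="14" textLength="24.122">Any</text>
<text x="249.612" y="46.1924" font-size="14" textLength="46.69">Spaces</text>
<text x="301.216" y="46.1924" font-size="14" textLength="54.474">Between</text>
<text x="360.016" y="46.1924" font-size="14"><tspan textLength="13.216">W</tspan><tspan dx="-0.252" textLength="31.122">ords.</tspan></text>
<text x="36.014" y="141.9944" font-size="14"><tspan textLength="181.3">This Sentence Doesn&#8217;t Have </tspan><tspan dx="-0.756" textLength="150.178">Any Spaces Between W</tspan><tspan dx="-0.252" textLength="31.122">ords.</tspan></text>
</svg>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def run_width(text_el):
    """Width of a <text> run: its own textLength, else the sum over its tspans."""
    if "textLength" in text_el.attrib:
        return float(text_el.get("textLength"))
    width = 0.0
    for child in text_el:
        if local_name(child.tag) == "tspan":
            width += float(child.get("textLength", 0)) + float(child.get("dx", 0))
    return width


def svg_to_lines(svg, gap_ratio=0.15):
    """Rebuild text lines, adding a space wherever two runs on the same
    baseline are separated by more than gap_ratio * font-size (assumed rule)."""
    root = ET.fromstring(svg)
    lines = {}
    for el in root.iter():
        if local_name(el.tag) != "text":
            continue
        text = "".join(el.itertext()).replace("\n", "")
        x = float(el.get("x", 0))
        y = float(el.get("y", 0))
        size = float(el.get("font-size", 0))
        # Simplification: runs share a line only if y matches exactly.
        lines.setdefault(y, []).append((x, run_width(el), size, text))

    result = []
    for y in sorted(lines):
        pieces, prev_end = [], None
        for x, width, size, text in sorted(lines[y]):
            if prev_end is not None and x - prev_end > gap_ratio * size:
                pieces.append(" ")  # layout gap treated as a real space
            pieces.append(text)
            prev_end = x + width
        result.append("".join(pieces))
    return result


for line in svg_to_lines(SAMPLE):
    print(line)
```

With the default threshold both lines come out spaced; a very large gap_ratio reproduces the unspaced result for the first line. A real page would also need tolerance when matching baselines, and handling for rotated or transformed text, which this sketch ignores.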
Answers

This question and answer are based on an older version of ABCpdf.

ABCpdf FCCC/SBI/2008/INF.1。

I work on the ABCpdf .NET software component, so my replies may feature concepts based on ABCpdf. It's just what I know.




