Question

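My question is a special case.

First of all, it is about <p> tags only, not any other tags, so you do not need to worry about anything else.

I have an HTML file that was generated by a piece of software, but it contains some errors, such as unclosed <p> tags.

E.g., I have read the whole document into a string.

My file looks like this: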

    <html>
    ....
    ....
      <head>
      </head>
    ....
    ....
       <body>

    ...
    ...
    <p>                 // tag is to be removed as no closing tag

    <p align="left">   AAA   </p>
    <p class="style6">   BBB    </P>
    <p class="style1" align="center">    CCC    </P>

    <p align="left">  DDD               // tag is to be removed as no closing tag
    <p class="style6">   EEE              // tag is to be removed as no closing tag
    <p class="style1" align="center">    FFF             // tag is to be removed as no closing tag

    <p class="style15"><strong>xxyyzz</strong><br/></p>

    <p>                // tag is to be removed as no closing tag

    <p> stack Overflow </P>

       </body>
      </html>

The <p> tags in front of DDD, EEE, and FFF, and the other unclosed <p> tags, are to be removed. As you can see, it should work for every unclosed <p> tag, whether or not it has attributes such as class or align.

I would also like to mention that there is never a <p> tag within another <p> tag, i.e.

    <p>
        <p>
        </p>

        <p>
        </p>

    </p>

This situation will never occur.

I tried using Regex and IndexOf, but could not get a perfect result.

Many thanks in advance to everyone who helps.

Regards

Answer 1

I really appreciate the help from all of you, especially Jim and Alex. I tried it and it is working nicely. Thanks a lot.

    // Requires "using System;" for StringComparison.
    public static string CleanUpXHTML(string xhtml)
    {
        // Walk the <p ...> opening tags from left to right. An opening tag is
        // kept only if a </p> appears before the next <p ...> opening tag;
        // otherwise it is unclosed and gets removed.
        int pOpen = IndexOfOpenP(xhtml, 0);
        while (pOpen > -1)
        {
            int pClose = xhtml.IndexOf(">", pOpen);
            if (pClose < 0)
            {
                break; // truncated tag such as "<p class=..." with no '>'
            }

            // Case-insensitive search, because the input mixes </p> and </P>.
            int pSlash = xhtml.IndexOf("</p>", pClose, StringComparison.OrdinalIgnoreCase);
            int pNext = IndexOfOpenP(xhtml, pClose);

            bool closed = pSlash > -1 && (pNext < 0 || pSlash < pNext);
            if (closed)
            {
                // Well-formed paragraph: move on to the next opening tag.
                pOpen = pNext;
            }
            else
            {
                // Unclosed tag: remove it and shift the next position left
                // by the number of characters removed.
                int length = pClose - pOpen + 1;
                xhtml = xhtml.Remove(pOpen, length);
                pOpen = pNext < 0 ? -1 : pNext - length;
            }
        }

        return xhtml;
    }

    // Finds the next "<p" that is followed by '>' or whitespace, so that tags
    // such as <pre> or <param> are not mistaken for paragraphs.
    private static int IndexOfOpenP(string text, int start)
    {
        int i = text.IndexOf("<p", start, StringComparison.OrdinalIgnoreCase);
        while (i > -1)
        {
            int after = i + 2;
            if (after < text.Length && (text[after] == '>' || char.IsWhiteSpace(text[after])))
            {
                return i;
            }
            i = text.IndexOf("<p", after, StringComparison.OrdinalIgnoreCase);
        }
        return -1;
    }

Answer 2

Html Agility Pack (http://htmlagilitypack.codeplex.com/):

It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML.

Just load the document into the DOM, and you can then query and filter its contents almost as if you were manipulating valid XML.

Answer 3

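Disclaimer: Please note that I am not advocating trying to parse arbitrary HTML with regular expressions or simple string searching. The solution below is for this specific problem, whose scope appears limited enough to be handled with simple methods. In general, I agree with the consensus: process HTML with an HTML parser.

That said:

Given that nested <p> tags aren't allowed, and assuming that there aren't any HTML comments, it is relatively easy to find and remove all <p> tags that have no corresponding </p>: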

    string inputText = GetHtmlText();
    int startTag = inputText.IndexOf("<p>");
    while (startTag != -1)
    {
        // Resume scanning just past the current "<p>".
        int scanPos = startTag + 3;
        // Now look for a closing tag or another open tag
        int closeTag = inputText.IndexOf("</p>", scanPos);
        int nextStartTag = inputText.IndexOf("<p>", scanPos);
        if (closeTag == -1 || (nextStartTag != -1 && nextStartTag < closeTag))
        {
            // Error at position startTag.  No closing tag.
        }
        else
        {
            // You have a full paragraph between startTag and (closeTag+4).
        }
        startTag = nextStartTag;
    }

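This code assumes that <p> and </p> cannot appear in the text except as actual paragraph opening and closing tags. If you can make that guarantee, then the above (or something very similar) should do the job well.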
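As a rough illustration, the scan above could be turned into something that actually strips the unclosed tags. This is only a sketch under the same assumptions (bare <p> tags only, no nesting, no comments); the class name PlainParagraphCleaner and the method RemoveUnclosed are made up for this example, and the case-insensitive comparison is an addition to cope with the </P> tags in the question.

```csharp
using System;
using System.Text;

public static class PlainParagraphCleaner
{
    // Hypothetical helper: removes every bare "<p>" that has no "</p>"
    // before the next "<p>". Tags with attributes are not handled here.
    public static string RemoveUnclosed(string html)
    {
        const string open = "<p>";
        const string close = "</p>";
        var result = new StringBuilder(html.Length);
        int copiedUpTo = 0; // everything before this index is already in result

        int startTag = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        while (startTag != -1)
        {
            int scanPos = startTag + open.Length;
            int closeTag = html.IndexOf(close, scanPos, StringComparison.OrdinalIgnoreCase);
            int nextStartTag = html.IndexOf(open, scanPos, StringComparison.OrdinalIgnoreCase);

            bool unclosed = closeTag == -1
                || (nextStartTag != -1 && nextStartTag < closeTag);
            if (unclosed)
            {
                // Copy the text before the tag, then skip the tag itself.
                result.Append(html, copiedUpTo, startTag - copiedUpTo);
                copiedUpTo = scanPos;
            }
            startTag = nextStartTag;
        }

        // Copy whatever follows the last removed tag.
        result.Append(html, copiedUpTo, html.Length - copiedUpTo);
        return result.ToString();
    }
}
```

Calling PlainParagraphCleaner.RemoveUnclosed on the whole document string returns a new string; the input is left untouched. Building the output in a StringBuilder avoids recalculating positions after each removal.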

ADDED:

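Handling things like <p class="classname"> takes only a little more work. If you can guarantee that no '>' character appears between the opening "<p" and the '>' that closes the tag, then you can modify the code above to search for "<p" and, once it is found, look for the closing '>'. It is a bit more involved, but not particularly difficult.

Nevertheless, given the caveats I noted earlier, I do not recommend this approach for processing arbitrary HTML: it won't handle comments, and it makes assumptions about how the HTML is formatted that may not hold in general. It also won't handle some alternative ways of writing the tags that are perfectly valid (and that I have run into in the wild).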
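A minimal sketch of that modification follows, assuming no '>' ever appears inside an attribute value, no nested paragraphs, and no HTML comments. The names AttributeAwareParagraphCleaner, FindOpenTag, and RemoveUnclosed are hypothetical. Requiring '>' or whitespace after "<p" is an extra guard so that tags such as <pre> are not treated as paragraphs.

```csharp
using System;
using System.Text;

public static class AttributeAwareParagraphCleaner
{
    // Returns the index of the next "<p" that is followed by '>' or
    // whitespace (so <pre>, <param>, etc. are skipped), or -1 if none.
    private static int FindOpenTag(string html, int from)
    {
        int i = html.IndexOf("<p", from, StringComparison.OrdinalIgnoreCase);
        while (i != -1)
        {
            int after = i + 2;
            if (after < html.Length
                && (html[after] == '>' || char.IsWhiteSpace(html[after])))
            {
                return i;
            }
            i = html.IndexOf("<p", after, StringComparison.OrdinalIgnoreCase);
        }
        return -1;
    }

    // Removes every <p ...> opening tag that has no </p> before the next
    // <p ...> opening tag. Everything else is copied through unchanged.
    public static string RemoveUnclosed(string html)
    {
        var result = new StringBuilder(html.Length);
        int copiedUpTo = 0; // everything before this index is already in result

        int startTag = FindOpenTag(html, 0);
        while (startTag != -1)
        {
            int tagEnd = html.IndexOf('>', startTag);
            if (tagEnd == -1)
            {
                break; // truncated tag at end of input; leave it untouched
            }

            int scanPos = tagEnd + 1;
            int closeTag = html.IndexOf("</p>", scanPos, StringComparison.OrdinalIgnoreCase);
            int nextStartTag = FindOpenTag(html, scanPos);

            bool unclosed = closeTag == -1
                || (nextStartTag != -1 && nextStartTag < closeTag);
            if (unclosed)
            {
                // Copy the text before the tag, then skip the whole tag.
                result.Append(html, copiedUpTo, startTag - copiedUpTo);
                copiedUpTo = scanPos;
            }
            startTag = nextStartTag;
        }

        result.Append(html, copiedUpTo, html.Length - copiedUpTo);
        return result.ToString();
    }
}
```

For the sample in the question, this would drop the opening tags in front of DDD, EEE, and FFF along with the bare unclosed <p> tags, while leaving the AAA, BBB, and CCC paragraphs as they are.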

Answer 4

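First, please see this. If that doesn't put you off using regular expressions on HTML (and since I understand this is a very specific case, a full-blown parser may be overkill here, although it is absolutely the recommended approach), I posted an answer to a similar question here; you can easily adapt it to your situation. Please understand, though, that if you decide to use it, it is not recommended and many things can go wrong (as noted above).

If what I have pointed you to seems too complicated, or you have trouble understanding or simplifying it, leave a comment and I will clarify further.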
