Question

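Are there any recommendations for a C/C++ library that can be used to parse, iterate over, and manipulate HTML streams/files as easily as possible, assuming that some of the input may be malformed (tags not closed, and so on)?

Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)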

Answer 1

HTMLparser (http://www.xmlsoft.org/html/libxml-HTMLparser.html) from Libxml is easy to use (a simple tutorial follows) and works very well even on malformed HTML.

Proofread by: Portnoy

Parsing (X)HTML in C is often seen as a difficult task. It's true that C isn't the easiest language to use to develop a parser. Fortunately, libxml2's HTMLParser module comes to the rescue. So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.

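First, you need to create a parser context. There are many functions for doing that, depending on how you want to feed data to the parser. I'll use htmlCreatePushParserCtxt(), since it works with memory buffers.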
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
Then, you can set a number of options on that parser context.
htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
We are now ready to parse an (X)HTML document.
// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);
Once you have pushed all the data, you can call that function again with a NULL buffer and 1 as the last argument. This ensures that the parser has processed everything.

Finally, how do you get at the data you parsed? That's easier than it seems. You simply have to walk the XML tree that was created.
void walkTree(xmlNode * a_node)
{ 
    xmlNode *cur_node = NULL;
    xmlAttr *cur_attr = NULL;
    for (cur_node = a_node; cur_node; cur_node = cur_node->next)
    {
        // do something with that node information, like... printing the tag's name and attributes
        printf("Got tag : %s\n", cur_node->name);
        for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next)
        {
            printf("  -> with attribute : %s\n", cur_attr->name);
        }
        walkTree(cur_node->children);
    }
}
walkTree(xmlDocGetRootElement(parser->myDoc));
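And that's it! Isn't that simple enough? From there, you can do any kind of thing, like finding all referenced images (by looking at img tags) and fetching them, or anything else you can think of.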

Also, you should know that you can walk the XML tree at any time, even if you haven't parsed the whole (X)HTML document yet.

If you have to parse (X)HTML in C, you should use libxml2's HTMLParser. It will save you a lot of time.

Answer 2

You can use Google's gumbo-parser (https://github.com/google/gumbo-parser).

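Gumbo is an implementation of the HTML5 parsing algorithm, written as a pure C99 library with no outside dependencies. It is designed to serve as a building block for other tools and libraries such as linters, validators, templating languages, and refactoring and analysis tools.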

#include "gumbo.h"

int main() {
  GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
  // Do stuff with output->root
  gumbo_destroy_output(&kGumboDefaultOptions, output);
}

There is also a C++ binding for this library, gumbo-query (https://github.com/lazytiger/gumbo-query):

A C++ library that provides jQuery-like selectors for Google's gumbo-parser.

#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"

int main(int argc, char * argv[])
{
  std::string page("<h1><a>some link</a></h1>");
  CDocument doc;
  doc.parse(page.c_str());

  CSelection c = doc.find("h1 a");
  std::cout << c.nodeAt(0).text() << std::endl; // some link
  return 0;
}

Answer 3

I have only used libCurl C++ (http://curl.hax.se/libcurl/cplus/) for this kind of thing, but found it to be quite good and useful. I don't know how it would cope with broken HTML, though.
