C/CPP version of BeautifulSoup especially at handling malformed HTML

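Are there any recommendations for a C/C++ lib that can be used to parse / iterate / manipulate HTML streams/files as easily as possible, assuming that some of them might be malformed, i.e. tags not closed, etc.?

Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)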

Best answer

HTMLparser (http://www.xmlsoft.org/html/libxml-HTMLparser.html) from libxml is easy to use (simple tutorial below) and works great even on malformed HTML.

Proofread by: Portnoy

Parsing (X)HTML in C is often seen as a difficult task. It's true that C isn't the easiest language to use to develop a parser. Fortunately, libxml2's HTMLParser module comes to the rescue. So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.

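First, you need to create a parser context. There are several functions for doing this, depending on how you want to feed data to the parser. I will use htmlCreatePushParserCtxt(), since it works with memory buffers.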

htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);

Then, you can set a number of options on that parser context.

htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);

We are now ready to parse an (X)HTML document.

// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);

Once you have pushed all the data, call the function again with a NULL buffer and 1 as the last argument. This ensures that the parser has processed everything.

Finally, how do you get at the data you parsed? It is easier than it seems: you just walk the XML tree that was created.

#include <stdio.h>
#include <libxml/HTMLparser.h>

void walkTree(xmlNode * a_node)
{
    xmlNode *cur_node = NULL;
    xmlAttr *cur_attr = NULL;
    for (cur_node = a_node; cur_node; cur_node = cur_node->next)
    {
        // do something with that node information, like... printing the tag's name and attributes
        printf("Got tag : %s\n", cur_node->name);
        for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next)
        {
            printf("  -> with attribute : %s\n", cur_attr->name);
        }
        // recurse into the children of the current node
        walkTree(cur_node->children);
    }
}
walkTree(xmlDocGetRootElement(parser->myDoc));

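That's it! Simple enough, isn't it? From there, you can do all kinds of things, such as finding all referenced images (by looking at img tags) and fetching them, or anything else you can think of.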

Also, you should know that you can walk the XML tree at any time, even if you haven't parsed the whole (X)HTML document yet.

If you need to parse (X)HTML in C, you should use libxml2's HTMLParser. It will save you a lot of time.

Other answers

You can use Google's gumbo-parser (https://github.com/google/gumbo-parser).

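Gumbo is an implementation of the HTML5 parsing algorithm implemented as a pure C99 library with no outside dependencies. It's designed to serve as a building block for other tools and libraries such as linters, validators, templating languages, and refactoring and analysis tools.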

#include "gumbo.h"

int main() {
  GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
  // Do stuff with output->root
  gumbo_destroy_output(&kGumboDefaultOptions, output);
}

There's also a C++ binding for this library, gumbo-query (https://github.com/lazytiger/gumbo-query):

A C++ library that provides jQuery-like selectors for Google's gumbo-parser.

#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"

int main(int argc, char * argv[])
{
  std::string page("<h1><a>some link</a></h1>");
  CDocument doc;
  doc.parse(page.c_str());

  CSelection c = doc.find("h1 a");
  std::cout << c.nodeAt(0).text() << std::endl; // some link
  return 0;
}



