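Are there any suggestions for a C/C++ library that can be used as easily as possible to parse, iterate over, and manipulate HTML streams or files, assuming some of the input may be malformed (tags not closed, and so on)?
Something along the lines of Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/).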
HTMLparser (http://www.xmlsoft.org/html/libxml-HTMLparser.html) from Libxml is easy to use (simple tutorial below) and works very well even on malformed HTML.
Proofread by: Portnoy
Parsing (X)HTML in C is often seen as a difficult task. It's true that C isn't the easiest language to use to develop a parser. Fortunately, libxml2's HTMLParser module comes to the rescue. So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.
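First, you need to create a parser context. There are many functions for doing that, depending on how you want to feed data to the parser. I'll use htmlCreatePushParserCtxt(), since it works with memory buffers.
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);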
Then, you can set a number of options on that parser context.
htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
We are now ready to parse an (X)HTML document.
// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);
Once you have pushed all the data, call the function again with a NULL buffer and 1 as the last argument. This ensures that the parser has processed everything.
Finally, how do you get at the data you parsed? It's easier than it seems. You simply walk the XML tree that was created.
void walkTree(xmlNode *a_node)
{
    xmlNode *cur_node = NULL;
    xmlAttr *cur_attr = NULL;
    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        // do something with that node information, like printing the tag's name and attributes
        printf("Got tag : %s\n", cur_node->name);
        for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next) {
            printf("  -> with attribute : %s\n", cur_attr->name);
        }
        // recurse into the children of the current node
        walkTree(cur_node->children);
    }
}

walkTree(xmlDocGetRootElement(parser->myDoc));
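That's it! Simple enough, isn't it? From there, you can do any kind of thing, such as finding all referenced images (by looking at img tags) and fetching them, or anything else you can think of.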
Also, you should know that you can walk the XML tree at any time, even if you haven't parsed the whole (X)HTML document yet.
If you have to parse (X)HTML in C, you should use libxml2's HTMLParser. It will save you a lot of time.
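You can use Google's gumbo-parser (https://github.com/google/gumbo-parser).
Gumbo is an implementation of the HTML5 parsing algorithm, written as a pure C99 library with no outside dependencies. It is designed to serve as a building block for other tools and libraries such as linters, validators, templating languages, and refactoring and analysis tools.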
#include "gumbo.h"
int main() {
    GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
    // Do stuff with output->root
    gumbo_destroy_output(&kGumboDefaultOptions, output);
}
There's also a C++ binding for this library: gumbo-query (https://github.com/lazytiger/gumbo-query).
A C++ library that provides jQuery-like selectors for Google's Gumbo-Parser.
#include <iostream>
#include <string>
#include "Document.h"
#include "Node.h"
int main(int argc, char * argv[])
{
    std::string page("<h1><a>some link</a></h1>");
    CDocument doc;
    doc.parse(page.c_str());
    CSelection c = doc.find("h1 a");
    std::cout << c.nodeAt(0).text() << std::endl; // some link
    return 0;
}