Perform Web Data Extraction

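I have installed HtmlAgilityPack, but I can't work out how, once the document has been fetched, to extract the table row whose first td element contains today's date in the format dd-mmm-yy.

Can anyone point me in the right direction with a code snippet?

So far I have: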

HtmlDocument doc = new HtmlDocument();
doc.Load("http://lbma.org.uk/pages/printerFriendly.cfm?thisURL=index.cfm&title=gold_fixings&page_id=53&show=2012&type=daily");
foreach(HtmlNode tr in doc.DocumentNode.SelectNodes("tr"))
{
            
}
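Answers

Funnily enough, the HTML on that page is pretty malformed, so I can see your problem. Even so, I wouldn't touch XPath with a 10-foot pole. LINQ makes life so much easier.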

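// Needs: using System; using System.Collections.Generic; using System.Globalization; using System.Linq; using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://lbma.org.uk/pages/printerFriendly.cfm?thisURL=index.cfm&title=gold_fixings&page_id=53&show=2012&type=daily");

// Format today's date with the invariant culture so the month abbreviation is always English.
string today = DateTime.Today.ToString("dd-MMM-yy", CultureInfo.InvariantCulture);

// Take the first row whose text starts with today's date; null if today has no row yet.
HtmlNode todaysRow = doc.DocumentNode.Descendants("tr")
    .FirstOrDefault(n => n.InnerText.StartsWith(today, StringComparison.OrdinalIgnoreCase));
if (todaysRow != null)
{
    List<HtmlNode> cells = todaysRow.Descendants("td").ToList();
    decimal usd = decimal.Parse(cells[1].FirstChild.InnerText, CultureInfo.InvariantCulture);
    decimal gbp = decimal.Parse(cells[2].FirstChild.InnerText, CultureInfo.InvariantCulture);
    // ... etc
}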

You need to read up on XPath. I'm still learning it myself, so there may well be a better path expression than this one, but you need to do something like:

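// "//tr" searches the whole document; a bare "tr" would only match direct children of the root node.
// Note that SelectNodes returns null when nothing matches.
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//tr[td[1] = '03-Jan-12']"))
{

}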
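
To avoid hard-coding the date, the XPath string can be built from a DateTime. The following is only a sketch: the class and method names (RowFinder, BuildRowXPath, CellsForDate) are made up for illustration, and it runs the expression against System.Xml.Linq on well-formed markup so it can be tried without HtmlAgilityPack. The same string should be usable with SelectNodes, but that is an assumption I have not tested against the real (malformed) page. normalize-space is used so that whitespace around the date in the cell does not break the match.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class RowFinder
{
    // Builds an XPath that selects every row whose first cell, ignoring
    // surrounding whitespace, equals the date formatted as dd-MMM-yy.
    public static string BuildRowXPath(DateTime date)
    {
        string text = date.ToString("dd-MMM-yy", CultureInfo.InvariantCulture);
        return "//tr[normalize-space(td[1]) = '" + text + "']";
    }

    // Returns the trimmed cell texts of the first matching row,
    // or an empty list when no row matches the date.
    public static List<string> CellsForDate(string wellFormedMarkup, DateTime date)
    {
        XDocument doc = XDocument.Parse(wellFormedMarkup);
        XElement row = doc.XPathSelectElement(BuildRowXPath(date));
        if (row == null)
        {
            return new List<string>();
        }
        return row.Elements("td").Select(td => td.Value.Trim()).ToList();
    }
}
```

For example, RowFinder.BuildRowXPath(new DateTime(2012, 1, 3)) gives //tr[normalize-space(td[1]) = '03-Jan-12'].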

Try this:

Dictionary<string, string> values = new Dictionary<string, string>();
string key = null, date;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // assumes html is a string holding the page markup
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//table[@class='pricing detail']");
// this will pull out only the dates, and store each one in the variable date
foreach (HtmlNode child in node.SelectNodes(".//tr[@class='left']"))
{
    date = child.InnerText;
}
// this will pull out the dates and the prices, and put them into a mapped data structure for easy (and quick!) referencing
foreach (HtmlNode child in node.SelectNodes(".//tr"))
{
    if (child.Attributes.Contains("class"))
    {
        // a row with a class attribute is treated as a date row
        key = child.InnerText;
    }
    else if (key != null)
    {
        // Add throws if the same date key is used for a second row
        values.Add(key, child.InnerText);
    }
}

Then it's just a matter of putting the text into an array or a dictionary of strings.

Explanation: Basically, the second foreach() loop visits every <tr> inside the table selected by <table class="pricing detail">. For each row it checks whether the row carries a class attribute, which marks it as a date row. If so, the row's inner text is used as the dictionary key (i.e., the date). If not, the row's inner text is added to the dictionary under the current date key, and this continues until the next date row changes the key.
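
The grouping logic can also be looked at on its own, without any HTML involved. This is a rough sketch under my own assumptions: RowGrouping and GroupByDate are invented names, each input item is a hypothetical (IsDateRow, Text) pair standing in for a parsed row, and the values are collected in a List<string> per date so that several rows under one date do not collide on the same key.

```csharp
using System;
using System.Collections.Generic;

public static class RowGrouping
{
    // Walks the rows in document order. A date row starts a new group;
    // every other row is appended to the group of the most recent date row.
    // Rows that appear before the first date row are skipped.
    public static Dictionary<string, List<string>> GroupByDate(
        IEnumerable<(bool IsDateRow, string Text)> rows)
    {
        var values = new Dictionary<string, List<string>>();
        string key = null;
        foreach (var row in rows)
        {
            if (row.IsDateRow)
            {
                key = row.Text.Trim();
                if (!values.ContainsKey(key))
                {
                    values[key] = new List<string>();
                }
            }
            else if (key != null)
            {
                values[key].Add(row.Text.Trim());
            }
        }
        return values;
    }
}
```

With HtmlAgilityPack, the pairs could be produced from each tr as (child.Attributes.Contains("class"), child.InnerText), following the same test the loop above uses.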

As for moving the values from the dictionary to your output, I'm sure you can manage that easily.

As for the date format, see
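
A minimal sketch of the .NET custom format string involved, assuming the page prints dates like 03-Jan-12 (the value used in the XPath answer). The FixingDate class and its method names are hypothetical. The point to watch is that in .NET format strings MMM is the abbreviated month name, while lowercase mm means minutes, so a literal "dd-mm-yy" would not produce a month at all.

```csharp
using System;
using System.Globalization;

public static class FixingDate
{
    // dd  = two-digit day, MMM = abbreviated month name, yy = two-digit year.
    private const string Format = "dd-MMM-yy";

    // Formats a date the way the table is assumed to print it.
    // The invariant culture keeps the month abbreviation in English.
    public static string ToPageText(DateTime date)
    {
        return date.ToString(Format, CultureInfo.InvariantCulture);
    }

    // Parses a cell's text back into a DateTime; returns false when the
    // text is not a date in the expected format.
    public static bool TryFromPageText(string text, out DateTime date)
    {
        return DateTime.TryParseExact(
            text.Trim(), Format, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out date);
    }
}
```

FixingDate.ToPageText(DateTime.Today) gives the string to compare against the first cell, and TryFromPageText lets you compare dates instead of strings if you prefer.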




