Logic for Implementing a Dynamic Web Scraper in C#

I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows:

  1. Get the URL from the user.
  2. Load the web page in the IE control (embedded browser) in WinForms.
  3. Allow the user to select a short, contiguous piece of text (not exceeding 50 characters) from the loaded web page.
  4. When the user wishes to persist the location (the HTML DOM location), save it to the DB, so that the user can use that location to fetch the data there on subsequent visits.

Assume that the loaded website is a price-listing site and the quoted rate keeps changing. The idea is to persist the DOM hierarchy so that I can traverse it next time.

I would be able to do this if all the HTML elements had id attributes. When the id is null, I am not able to accomplish this.

Could someone suggest a workable approach (with a bare-minimum code snippet, if possible)?

It would also be helpful if you could share some online resources.

thanks,

vijay

Best answer

One approach is to build a stack of tags/styles/ids down to the element you want to select.

From the element you want, traverse up to the nearest element that has an id. This lets you skip most of the page above it (the top header etc.). Then build the sequence to look for, from that element down to your target.

Example:

<html>
  <body>
    <!-- lots of html -->
    <div id="main">
       <div>
          <span>
             <div class="pricearea">
                <table> <!-- with price data -->

For the example, you would store in your DB a sequence such as [id=main],div,span,div,table or perhaps div[class=pricearea],table.

Styles/classes can also be used to create your path. It's your choice whether to look for a tag, an attribute of a tag, or a combination. You want the path to be as accurate as possible with as few elements as possible, to make it robust.

If the layout seldom changes, this would let you navigate to the same location each time.
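
The idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names DomPathSketch, BuildPath and Resolve are hypothetical, the path format (#id followed by tag[position] steps) is just one possible encoding, and System.Xml.Linq stands in for a real HTML parser, so it only works on well-formed markup. With real pages, an HTML parser such as the one suggested below would take its place.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DomPathSketch
{
    // Climb from the target to the nearest ancestor that has an id, recording
    // each tag name and its position among same-named preceding siblings.
    public static string BuildPath(XElement target)
    {
        var steps = new List<string>();
        for (XElement node = target; node != null; node = node.Parent)
        {
            string id = (string)node.Attribute("id");
            if (id != null)
            {
                steps.Add("#" + id); // anchor found, stop climbing
                break;
            }
            int position = node.ElementsBeforeSelf(node.Name).Count();
            steps.Add(node.Name.LocalName + "[" + position + "]");
        }
        steps.Reverse();
        return string.Join("/", steps);
    }

    // Follow a stored path back down; returns null if the layout has changed.
    public static XElement Resolve(XDocument doc, string path)
    {
        string[] steps = path.Split('/');
        // If no id anchor was found when building, the first step is the root element.
        XElement current = steps[0].StartsWith("#", StringComparison.Ordinal)
            ? doc.Descendants().FirstOrDefault(e => (string)e.Attribute("id") == steps[0].Substring(1))
            : doc.Root;

        foreach (string step in steps.Skip(1))
        {
            if (current == null) return null;
            int open = step.IndexOf('[');
            string tag = step.Substring(0, open);
            int position = int.Parse(step.Substring(open + 1, step.Length - open - 2));
            current = current.Elements(tag).ElementAtOrDefault(position);
        }
        return current;
    }

    public static void Main()
    {
        // Well-formed stand-in for the example page above (hypothetical price value).
        const string page =
            "<html><body><div id='main'><div><span>" +
            "<div class='pricearea'><table><tr><td>42.50</td></tr></table></div>" +
            "</span></div></div></body></html>";

        XElement table = XDocument.Parse(page).Descendants("table").First();
        string path = BuildPath(table); // this string is what would go into the DB
        Console.WriteLine(path);        // #main/div[0]/span[0]/div[0]/table[0]

        // On a later visit: parse the page again and follow the stored path.
        XElement found = Resolve(XDocument.Parse(page), path);
        Console.WriteLine(found != null ? found.Value : "(not found)"); // 42.50
    }
}
```

Storing positions makes the path precise but more sensitive to layout changes; a shorter path keyed on a class, as in div[class=pricearea],table, trades some precision for robustness.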

I would also suggest you perhaps use HTML Agility Pack or something similar for the DOM parsing, as the IE control is slow.

Screen scraping is fun, but it's difficult to get it 100% right for all pages. Good luck!

Answers

After a bit of googling, I found a fairly simple solution. A sample snippet is below.

if (webBrowser.Document != null)
{
    // Get the underlying MSHTML document (needs a reference to Microsoft.mshtml and "using mshtml;").
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    IHTMLSelectionObject selection = htmlDoc.selection; // the user's current selection
    // createRange() yields a text range only for text selections, hence the "as" cast.
    IHTMLTxtRange range = selection.createRange() as IHTMLTxtRange;
    if (range != null)
    {
        IHTMLElement parentElement = range.parentElement(); // element containing the selected text
        targetSourceIndex = parentElement.sourceIndex;      // the value to persist
        MessageBox.Show(range.text);
    }
}

I used an embedded WebBrowser control in a WinForms application, which loads the HTML DOM of the current web page.

The IHTMLElement instance exposes a property named sourceIndex, which gives each HTML element its ordinal position in the document's source order, so it is unique within a page. Because it is position-based, it only stays valid while the page structure stays the same.

One can store this sourceIndex in the DB and later query for the content at that location using the following code.

if (webBrowser.Document != null)
{
    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document.DomDocument;
    // Read the persisted index once (here from a node of the XML file it was saved to).
    int storedIndex = int.Parse(node.InnerText);
    IHTMLElement targetElement = null;
    foreach (IHTMLElement domElement in htmlDoc.all)
    {
        if (domElement.sourceIndex == storedIndex)
        {
            targetElement = domElement;
            break;
        }
    }

    // The element may be missing if the page has changed since the index was stored.
    if (targetElement != null)
    {
        MessageBox.Show(targetElement.innerText);
    }
}



