Web crawler time out

I am working on a simple web crawler that takes a URL, crawls the first-level links on the site, and extracts emails from all pages using RegEx...

I know this code is wrong and is only a start, but I always get a "time out" after about two minutes of running.

    private void button1_Click(object sender, System.EventArgs e)
    {
        string url = textBox1.Text;

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        StreamReader sr = new StreamReader(response.GetResponseStream());
        string code = sr.ReadToEnd();
        sr.Close();
        response.Close(); // release the connection; unclosed responses make later requests time out

        // the double quotes inside the pattern must be escaped
        string re = "href=\"(.*?)\"";
        MatchCollection href = Regex.Matches(code, re, RegexOptions.Singleline);
        foreach (Match h in href)
        {
            string link = h.Groups[1].Value;
            if (!link.Contains("http://"))
            {
                HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url + link);
                HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
                // read from response2/sr2, not the already-consumed response/sr
                StreamReader sr2 = new StreamReader(response2.GetResponseStream());
                string innerlink = sr2.ReadToEnd();
                sr2.Close();
                response2.Close();

                // search the page just downloaded (innerlink), not the start page (code)
                MatchCollection m2 = Regex.Matches(innerlink, @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)", RegexOptions.Singleline);

                foreach (Match m in m2)
                {
                    string email = m.Groups[1].Value;
                    if (!listBox1.Items.Contains(email))
                    {
                        listBox1.Items.Add(email);
                    }
                }
            }
        }
    }
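The time out itself most likely comes from the responses never being closed: .NET limits concurrent connections per host (`ServicePointManager.DefaultConnectionLimit`, 2 by default), so once the first responses are left open, later `GetResponse()` calls block until they time out. A minimal sketch of the download step with `using` blocks (the `Downloader`/`Download` names are my own, not from the question):

```csharp
using System.IO;
using System.Net;

class Downloader
{
    // Downloads a page and guarantees the response and reader are disposed,
    // even if reading throws, so the pooled connection is released.
    public static string Download(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```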
Best answer

Don't parse HTML using Regex. Use the Html Agility Pack for this.

What is the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

More Information
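For reference, a minimal sketch of the same first-level link extraction done with the Html Agility Pack instead of Regex (the URL, class, and variable names are placeholders, not from the question):

```csharp
// Requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

class CrawlerSketch
{
    static void Main()
    {
        string url = "http://example.com/";
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(url);

        // SelectNodes returns null when nothing matches, so guard against it.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (HtmlNode a in anchors)
        {
            string link = a.GetAttributeValue("href", string.Empty);
            // Same relative-link filter as the question's code.
            if (!link.Contains("http://"))
            {
                Console.WriteLine(url + link);
            }
        }
    }
}
```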

Answers

Oded's comment is correct; we need to know what specific help you need. However, I can at least point you to the Html Agility Pack, since it will handle most of your web scraping.

Good Luck!
