Question

I need to download a lot of pages through proxies. What is best practice for building a multi-threaded web crawler?

Is Parallel.ForEach good enough, or is it better suited to heavy CPU-bound tasks?

What do you think of the following code?

var multyProxy = new MultyProxy();
multyProxy.LoadProxyList();

Task[] taskArray = new Task[1000];

for (int i = 0; i < taskArray.Length; i++)
{
    taskArray[i] = new Task(
        obj =>
        {
            multyProxy.GetPage((string)obj);
        },
        "http://google.com");
    taskArray[i].Start();
}

Task.WaitAll(taskArray);

It works, but very slowly, and I don't know why.

This code also performs badly:

System.Threading.Tasks.Parallel.For(
    0, 1000,
    new System.Threading.Tasks.ParallelOptions { MaxDegreeOfParallelism = 30 },
    loop =>
    {
        multyProxy.GetPage("http://google.com");
    });

I think I'm doing something wrong.

When I start my script, it uses only 2%-4% of the network.

Answer 1

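You are basically using up CPU-bound threads for IO-bound tasks - i.e. even though you are parallelizing your operations, each one still ties up a ThreadPool thread, and the ThreadPool is mainly intended for CPU-bound work. A thread that is blocked waiting for a response does nothing useful, which is why so little of your network is used.

Basically, you need to use an async pattern for downloading the data so that it goes through IO completion ports instead - if you are using WebRequest, that means the BeginGetResponse() and EndGetResponse() methods.

I would suggest looking at Reactive Extensions to do this, e.g.: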

IEnumerable<string> urls = ... get your urls here...;
var results = from url in urls.ToObservable()
             let req = WebRequest.Create(url)
             from rsp in Observable.FromAsyncPattern<WebResponse>(
                  req.BeginGetResponse, req.EndGetResponse)()
             select ExtractResponse(rsp);

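ExtractResponse will probably just use StreamReader.ReadToEnd to get the string result, if that is what you are after.

You can also look at using the .Retry operator, which makes it easy to retry a few times if you hit connection issues etc.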
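If Rx is not an option, the same idea (asynchronous IO with a cap on requests in flight, plus a few retries) can be written with async/await. This is only a minimal sketch: ThrottledDownloader, DownloadAllAsync and the fetch delegate are hypothetical names, not part of the question's MultyProxy class, and the defaults of 30 concurrent requests and 3 attempts are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDownloader
{
    // Hypothetical helper: calls fetch(url) for every URL, keeping at most
    // maxConcurrency requests in flight and trying each URL up to maxAttempts times.
    // Results come back in the same order as the input URLs.
    public static async Task<string[]> DownloadAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxConcurrency = 30,
        int maxAttempts = 3)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = urls
                .Select(url => FetchOneAsync(url, fetch, gate, maxAttempts))
                .ToList();
            return await Task.WhenAll(tasks);
        }
    }

    private static async Task<string> FetchOneAsync(
        string url, Func<string, Task<string>> fetch, SemaphoreSlim gate, int maxAttempts)
    {
        await gate.WaitAsync(); // waits for a free slot without blocking a thread
        try
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return await fetch(url);
                }
                catch (Exception) when (attempt < maxAttempts)
                {
                    // Swallow and retry; the final failure propagates to the caller.
                }
            }
        }
        finally
        {
            gate.Release();
        }
    }
}
```

With HttpClient the delegate could be as simple as url => client.GetStringAsync(url). No thread is held while a request is pending, so the number of simultaneous downloads is bounded by the semaphore and the connection limit rather than by the ThreadPool.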

Answer 2

Add this at the beginning of your main method:

System.Net.ServicePointManager.DefaultConnectionLimit = 100;

That way you won't be limited to a small number of simultaneous connections.
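Note that this setting is for HttpWebRequest-based code on .NET Framework. As far as I know, HttpClient on .NET Core and later does not read ServicePointManager, and the limit is set on the handler instead. A minimal sketch, assuming a hypothetical ProxyClientFactory helper and a made-up proxy address:

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class ProxyClientFactory
{
    // Hypothetical helper: builds a handler that routes requests through one proxy
    // and allows up to maxConnections simultaneous connections per server.
    public static HttpClientHandler CreateHandler(string proxyAddress, int maxConnections)
    {
        return new HttpClientHandler
        {
            Proxy = new WebProxy(proxyAddress),
            UseProxy = true,
            MaxConnectionsPerServer = maxConnections
        };
    }
}
```

Usage would look like new HttpClient(ProxyClientFactory.CreateHandler("http://127.0.0.1:8080", 100)), with one client per proxy that is reused for all requests going through that proxy.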

Answer 3

When you use many connections, this will help (add it to app.config or web.config):

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="50"/>
    </connectionManagement>
  </system.net>
</configuration>

Replace 50 with the number of simultaneous connections you need.

More about it at http://msdn.microsoft.com/en-us/library/fb6y0fyc.aspx.
