Best practices for parallelizing a web crawler in .NET 4.0

I need to download a lot of pages through proxies. What is the best practice for building a multi-threaded web crawler?

Is Parallel.ForEach good enough, or is it better suited to CPU-heavy tasks?

What do you think about the following code?

var multyProxy = new MultyProxy();
multyProxy.LoadProxyList();

Task[] taskArray = new Task[1000];

for (int i = 0; i < taskArray.Length; i++)
{
    taskArray[i] = new Task(
        (obj) =>
        {
            multyProxy.GetPage((string)obj);
        },
        (object)"http://google.com");
    taskArray[i].Start();
}

Task.WaitAll(taskArray);

It works, but very slowly, and I don't know why.

This code also performs badly:

System.Threading.Tasks.Parallel.For(
    0, 1000,
    new System.Threading.Tasks.ParallelOptions() { MaxDegreeOfParallelism = 30 },
    loop =>
    {
        multyProxy.GetPage("http://google.com");
    });

I think I am doing something wrong.

When I start my script, it uses only 2%-4% of the network.

Best answer

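You are basically using up CPU-bound threads for IO-bound tasks. That is, even though you are parallelizing your operations, each one still occupies a ThreadPool thread, which is mainly intended for CPU-bound operations.

Basically, you need to download the data with an asynchronous pattern so that it uses IO completion ports. If you are using WebRequest, that means the BeginGetResponse() and EndGetResponse() methods.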
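Without any extra library, one way to apply this in .NET 4.0 is to wrap the Begin/End pair in a task with Task.Factory.FromAsync. The following is only a minimal sketch under stated assumptions: the class name AsyncDownloader and the method DownloadAsync are hypothetical (they are not part of the asker's MultyProxy class), and proxy selection, throttling, and error handling are left out.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

// Hypothetical helper; the names are illustrative and not from the original post.
public static class AsyncDownloader
{
    // Starts an asynchronous request. No thread is blocked while the
    // request is waiting on the network for the response headers.
    public static Task<string> DownloadAsync(string url)
    {
        WebRequest request = WebRequest.Create(url);

        return Task<WebResponse>.Factory
            .FromAsync(request.BeginGetResponse, request.EndGetResponse, null)
            .ContinueWith(t =>
            {
                // t.Result rethrows (wrapped in AggregateException) if the request failed.
                // The body is read synchronously here to keep the sketch short.
                using (WebResponse response = t.Result)
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            });
    }
}
```

A caller could start one such task per URL, collect them in an array, and pass that array to Task.WaitAll, much like the taskArray in the question, except that the waiting is now done by the IO layer instead of by blocked pool threads.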

I would suggest looking at Reactive Extensions to do this, for example:

IEnumerable<string> urls = ... get your urls here...;
var results = from url in urls.ToObservable()
             let req = WebRequest.Create(url)
             from rsp in Observable.FromAsyncPattern<WebResponse>(
                  req.BeginGetResponse, req.EndGetResponse)()
             select ExtractResponse(rsp);

ExtractResponse would probably just use StreamReader.ReadToEnd to get the string result, if that is what you are after.
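The answer does not show ExtractResponse, so the following is only a guess at what it might look like. The class name ResponseReader and the ReadAll helper are hypothetical, and the code assumes the body is UTF-8 text; a real crawler would probably inspect the response's declared character set instead.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper class; only the ExtractResponse name comes from the query above.
public static class ResponseReader
{
    // Reads an entire stream as text. Assumes UTF-8 (an assumption, see lead-in).
    public static string ReadAll(Stream body)
    {
        using (var reader = new StreamReader(body, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Reads the response body as a string and disposes the response afterwards,
    // which releases the underlying connection for reuse.
    public static string ExtractResponse(WebResponse response)
    {
        using (response)
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body);
        }
    }
}
```

Disposing the response matters in a crawler: a response that is never closed keeps its connection occupied, and with a small per-host connection limit that alone can stall later requests.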

You can also look at using the .Retry operator, which makes it easy to retry a few times if you run into connection issues and the like.
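A rough sketch of how Retry might be combined with the Begin/End pattern follows. It assumes an Rx version that exposes Observable.FromAsyncPattern in the System.Reactive.Linq namespace (the namespace differed in early Rx builds, and the method is marked obsolete in later ones). The class name RetryingFetch and the method GetWithRetry are hypothetical. The request is created inside Observable.Defer because a WebRequest instance cannot be reused once it has been started, so each attempt needs a fresh one.

```csharp
using System;
using System.Net;
using System.Reactive.Linq;

// Hypothetical helper; the names are illustrative and not from the original answer.
public static class RetryingFetch
{
    // Returns an observable that yields the response for the given URL.
    // On error, Retry resubscribes, up to 'attempts' subscriptions in total.
    public static IObservable<WebResponse> GetWithRetry(string url, int attempts)
    {
        return Observable.Defer(() =>
        {
            // Runs once per subscription, so every attempt gets a new request.
            WebRequest request = WebRequest.Create(url);
            return Observable.FromAsyncPattern<WebResponse>(
                request.BeginGetResponse, request.EndGetResponse)();
        }).Retry(attempts);
    }
}
```

If every attempt fails, the error from the last attempt is passed to the subscriber's OnError handler, so the caller still has to decide what to do with URLs that never succeed.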

Other answers

Add this at the start of your main method:

System.Net.ServicePointManager.DefaultConnectionLimit = 100;

That way you will not be limited to a small number of simultaneous connections.
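To check that the setting is in place before any request is sent, the limit can be read back through ServicePointManager. This is a small sketch only: the class name ConnectionSetup and its methods are hypothetical, and the host passed to LimitFor is whatever site you plan to crawl.

```csharp
using System;
using System.Net;

// Hypothetical helper; the names are illustrative and not from the original answer.
public static class ConnectionSetup
{
    // Raises the default per-host connection limit. Call this before the
    // first request, since service points that already exist may keep their old limit.
    public static void Apply(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
    }

    // Returns the connection limit that applies to the given address.
    public static int LimitFor(string url)
    {
        ServicePoint point = ServicePointManager.FindServicePoint(new Uri(url));
        return point.ConnectionLimit;
    }
}
```

Calling ConnectionSetup.Apply(100) first and then printing ConnectionSetup.LimitFor("http://google.com") is a quick way to see which limit your requests to that host will actually get.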

This will help when you use many connections (add it to app.config or web.config):

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.net>
    <connectionManagement>
      <add address="*" maxconnection="50"/>
    </connectionManagement>
  </system.net>
</configuration>

Replace 50 with the number of simultaneous connections you need.

More about it at http://msdn.microsoft.com/en-us/library/fb6y0fyc.aspx.




