Programming languages comparison for web data mining task

I need some help comparing different programming languages, such as C++, Java, Python, Ruby, and PHP, for a task related to web data mining (developing a web crawler, string manipulation, etc.). I have a bit of experience with PHP, and I think its advantages for this particular task are simple syntax, in-depth string parsing capabilities, networking functions, and portability, but I don't know much about the other languages and their advantages and disadvantages for this task.

Answers

The specific language will not matter nearly as much as your familiarity with it. These days, all high-level languages come with the basics. Unless you need it to be super-fast (you're probably going to be limited by download speed, not by the speed at which you parse the HTML) or have other constraints not listed, the language won't matter that much.

Just make sure that you use the available libraries. In particular, use an HTML parsing library that copes well with invalid markup (not an XML parser), and regular expressions where appropriate.
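As a rough illustration of that advice, here is a minimal Python sketch that uses the standard library's `html.parser.HTMLParser`, which does not require well-formed markup, to pull links out of sloppy HTML, and a regular expression for one small, well-defined pattern. The names `LinkExtractor` and `extract`, the sample markup, and the simplified email pattern are all assumptions made for this example; a dedicated parsing library would likely cope with more kinds of broken markup.

```python
import re
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately invalid markup: unclosed <p>, <b>, and <a> tags, an unquoted
# attribute value, and no closing </html>. An XML parser would reject this.
SAMPLE_HTML = """
<html><body>
<p>Contact: info@example.com
<a href="/about">About <b>us</a>
<a href=/products>Products
</body>
"""

# Simplified email pattern (an assumption; real addresses can be more complex).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract(html):
    """Return (links, emails) found in a possibly malformed HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, EMAIL_RE.findall(html)


links, emails = extract(SAMPLE_HTML)
print(links)
print(emails)
```

The parser handles the document structure, while the regular expression is kept for a flat text pattern where it is a reasonable fit.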

As a previous answer implies, being familiar with the language makes a big difference. I would also say look at what the language was originally designed to do; it gives a good idea of what it's best at.

PHP - Designed for server-side scripting, not really ideal for this use.

Perl - Designed to pull text apart (a good start) and has excellent libraries: look at LWP and the modules under HTML, such as HTML::TreeBuilder. A good choice, with an unrivalled selection of modules to plug in.

Python - A good choice; look at BeautifulSoup and urllib.

Ruby - Also a good choice; look at Hpricot. It is a lot less mature than Perl or Python in terms of modules available.

I have written quite a bit of web spider/data mining software and have always used Perl. If I were starting from scratch today, I might choose Python.
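To give a feel for the Python option, here is a minimal sketch of a crawler's core loop built only on the standard library (`urllib` for downloading, `html.parser` for link extraction). The function names `fetch` and `crawl`, the `max_pages` limit, and the `example.test` pages are assumptions for illustration. The downloader is passed in as a parameter so the loop can be exercised against an in-memory fake site without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class _Links(HTMLParser):
    """Gathers raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Download a page with urllib and decode it leniently (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl; returns the URLs fetched, in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = _Links()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory site standing in for real downloads.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="mailto:x@example.test">mail</a>',
    "http://example.test/b": "<p>no links",
}
result = crawl("http://example.test/", fetch_page=FAKE_SITE.__getitem__)
print(result)
```

A real crawler would also need to respect robots.txt, throttle its requests, and restrict which hosts it follows; those parts are left out of this sketch.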

Google's first crawler was written in Python 1.5.

I'm no expert on other languages, but I would go with Python and html5lib or BeautifulSoup.




