Building an index of URLs: what features to include?

I am working on building an index of URLs. The objective is to build and store a data structure whose key is a domain (e.g. www.nytimes.com) and whose value is a set of features associated with that domain. I am looking for your suggestions for this set of features. For example, I would like to store www.nytimes.com as follows:

[www.nytimes.com: [lang: en, alexa_rank: 96, content_type: news, spam_probability: 0.0001, etc.]]
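
One way to hold such a record in code is a plain dictionary keyed by the normalized host name. The sketch below is only an illustration under assumptions: the class name `DomainFeatures` and the helper `domain_key` are hypothetical, and the four fields are simply the ones from the example above.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import urlsplit


@dataclass
class DomainFeatures:
    # Hypothetical record type; the fields come from the example in the question.
    lang: Optional[str] = None
    alexa_rank: Optional[int] = None
    content_type: Optional[str] = None
    spam_probability: Optional[float] = None


def domain_key(url: str) -> str:
    """Reduce a full URL or a bare host name to a lowercase host for use as a key."""
    # urlsplit only recognizes the host when a scheme or a leading // is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    return (parts.hostname or "").lower()


# The index itself: host name -> feature record.
index: Dict[str, DomainFeatures] = {}
index[domain_key("http://www.nytimes.com/pages/world/")] = DomainFeatures(
    lang="en",
    alexa_rank=96,
    content_type="news",
    spam_probability=0.0001,
)
```

Leaving every field optional means a domain can be stored as soon as one feature is known and filled in later as more sources become available.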

Why am I building this? The ultimate goal is to do some interesting things with this index; for example, I may run clustering on it to find interesting groups. I have a large amount of text that was generated by a large number of URLs over a long period of time :) So data is not a problem.

Suggestions of any kind are very welcome.

Answers

Make it work first with what you've already suggested. Then start adding the features suggested by everybody else.

ideas are worth nothing unless executed.

-- http://www.codinghorror.com/blog/2010/01/cultivate-teams-not-ideas.html

I would maybe start here: Google white papers on IR (information retrieval).

Then also search Google for other white papers on IR, maybe?

Also a few things to add to your index:

  1. Subdomains associated with the domain
  2. IP addresses associated with the domain
  3. Average page speed
  4. Links pointing at the domain according to Yahoo, e.g. a link:nytimes.com search on Yahoo
  5. Number of pages on the domain, e.g. a site:nytimes.com search on Google
  6. Traffic numbers from compete.com or Google Trends
  7. WHOIS info, e.g. age of the domain, length of time it is registered for, etc.
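
As a rough sketch of how these extra features might sit alongside the original ones, the record below has one field per suggestion. All names here (`ExtendedFeatures`, `resolve_ips`, `domain_age_days`, and every field) are hypothetical. Only the IP lookup is shown, using the standard library resolver; the search-engine, traffic and WHOIS values are assumed to come from whatever source you end up using.

```python
import socket
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class ExtendedFeatures:
    # Hypothetical field names, one per suggestion in the list above.
    subdomains: Set[str] = field(default_factory=set)
    ip_addresses: Set[str] = field(default_factory=set)
    avg_page_speed_ms: Optional[float] = None
    inbound_links: Optional[int] = None
    page_count: Optional[int] = None
    traffic_estimate: Optional[int] = None
    whois_created: Optional[date] = None
    whois_expires: Optional[date] = None


def resolve_ips(domain: str) -> Set[str]:
    """Return the IP addresses the local resolver reports for a domain."""
    try:
        infos = socket.getaddrinfo(domain, None)
    except socket.gaierror:
        # Unresolvable domain: store nothing rather than fail the whole run.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return {info[4][0] for info in infos}


def domain_age_days(created: date, today: date) -> int:
    """Age of the domain in days, given its WHOIS creation date."""
    return (today - created).days


# Made-up values, purely to show the shape of a record.
record = ExtendedFeatures(
    subdomains={"www", "mobile"},
    whois_created=date(2000, 1, 1),
)
age = domain_age_days(record.whois_created, date(2010, 1, 1))
```

Storing the raw WHOIS dates rather than a precomputed age keeps the record valid over time; the age can be derived whenever the index is read.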

Some other places to research: http://www.majesticseo.com/, http://www.opensearch.org/Home and http://www.seomoz.org. They all have their own indexes.

I'm sure there's plenty more, but hopefully the IR stuff will get the cogs whirring :)




