Extracting useful data from arbitrary HTML pages?

Is there a library for Ruby or PHP that can parse HTML pages and extract unique data by comparing each page with other similar pages? It should use some sort of text mining to identify which text is likely to be noise and repetitive, and which text is more unique and useful.

Answers

I'm a PHP guy and have no idea about Ruby, but I think what you want is trivial to achieve:

  • Use something like Simple HTML DOM to parse the pages.
  • Compare the DOM elements of each page with the elements at the same position in the other pages.
  • Get the paths of all elements whose content differs between pages; those will be your signal elements.
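
The steps above could be sketched roughly as follows. This is only an illustration of the idea, not Simple HTML DOM itself: it uses Python's standard-library `html.parser` to stay dependency-free, and the names `PathTextCollector`, `extract_texts`, and `signal_paths` are made up for this example. It assumes the pages share the same template and that the markup is reasonably well nested; real pages with unclosed tags would need a more forgiving parser.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Maps each element path (e.g. html[0]/body[0]/div[1]) to its direct text."""

    def __init__(self):
        super().__init__()
        self.stack = []                          # path segments of open elements
        self.child_counts = [defaultdict(int)]   # per-depth sibling counters
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        self.stack.append(f"{tag}[{counts[tag]}]")
        counts[tag] += 1
        self.child_counts.append(defaultdict(int))

    def handle_endtag(self, tag):
        # Simplification: only close the innermost element if it matches.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = " ".join(data.split())            # normalise whitespace
        if text:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def extract_texts(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def signal_paths(pages):
    """Return {path: [text per page]} for paths whose text differs between pages."""
    per_page = [extract_texts(page) for page in pages]
    all_paths = set().union(*per_page)
    signal = {}
    for path in sorted(all_paths):
        values = [texts.get(path) for texts in per_page]
        if len(set(values)) > 1:                 # not identical everywhere
            signal[path] = values
    return signal


# Hypothetical sample pages sharing one template.
page_a = ("<html><body><div>Site menu</div>"
          "<div><p>Article about cats</p></div><div>Footer</div></body></html>")
page_b = ("<html><body><div>Site menu</div>"
          "<div><p>Article about dogs</p></div><div>Footer</div></body></html>")

for path, values in signal_paths([page_a, page_b]).items():
    print(path, values)
```

Paths whose text is identical on every page (the menu and the footer in the sample) are treated as noise and dropped; only the paths whose content changes are reported. Exact string equality is the crudest possible comparison, so a real implementation would probably want something fuzzier.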



