English 中文(简体)
Discovering synonyms from set of documents using LSA transform in Ruby
原标题:
  • 时间:2011-03-29 02:17:48
  •  标签:
  • ruby
  • lsa

After applying the LSA transform to a document array, how can this be used to generate synonyms? For instance, I have the following sample documents:

D1 = Mobilization
D2 = Reflective Pavement
D3 = Maintenance of Traffic
D4 = Special Detour
D5 = Commercial Materials for Driveway

            D1    D2    D3    D4    D5    
commerci[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +1.00 ]  
 special[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +1.00 +0.00 ]  
 mainten[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +1.00 +0.00 +0.00 ]  
 reflect[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +1.00 +0.00 +0.00 +0.00 ]  
  mobil [ +1.00 +0.00 +0.00 +0.00 +0.00 ]  

Applying TFIDF transform

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.54 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  

Applying LSA transform

            D1    D2    D3    D4    D5  
commerci[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
  materi[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
drivewai[ +0.00 +0.00 +0.00 +0.00 +0.00 ]  
 special[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
  detour[ +0.00 +0.00 +0.00 +0.80 +0.00 ]  
 mainten[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 traffic[ +0.00 +0.00 +0.80 +0.00 +0.00 ]  
 reflect[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
pavement[ +0.00 +0.80 +0.00 +0.00 +0.00 ]  
  mobil [ +1.61 +0.00 +0.00 +0.00 +0.00 ]  
问题回答

Firstly, this example won t work. The principle behind it is that the more frequently words occur in similar contexts, the more related they are in meaning. Therefore there needs to be some overlap between the input documents. Paragraph length documents are ideal (since they have a reasonable number of words and there tends to be a single topic per paragraph).

To understand how LSA is useful for synonym recognition, you need to first understand how a vector space representation (the first matrix you ve got there) of words occurrences is useful for synonym recognition in the first place. This is because you can calculate the distance between two items in this high dimensionality vector space as a measure of their similarity (given that it is a measure of how often they occur together). The Magic of LSA is that it reshuffles the dimensions of the vector space, so that items that don t occur together but occur in similar contexts are brought together by a collapsing of similar dimensions into each other.

The idea of the TFIDF weighting function is to highlight the differences between documents, by giving higher weightings to words that appear more in a smaller subset of of the corpus, and lower weightings to words that are used everywhere. A more thorough explanation.

The "LSA" transformation is actually a singular-value decomposition (SVD) – conventionally Latent Semantic Analysis or Latent Semantic Indexing refers to the combination TFIDF with SVD – and it serves to reduce the dimensionally of the vector space, or in other words, it reduces the number of columns into a smaller, more concise description (as described above).

So to get the the nub of your question: you can tell how similar to words are by applying a distance function to the two corresponding vectors (rows). There are several distance functions to choose from by the most commonly used is the cosine distance (which measures the angle between the two vectors).

Hope this makes things clearer.





相关问题
Ruby parser in Java

The project I m doing is written in Java and parsers source code files. (Java src up to now). Now I d like to enable parsing Ruby code as well. Therefore I am looking for a parser in Java that parses ...

rails collection_select vs. select

collection_select and select Rails helpers: Which one should I use? I can t see a difference in both ways. Both helpers take a collection and generates options tags inside a select tag. Is there a ...

RubyCAS-Client question: Rails

I ve installed RubyCAS-Client version 2.1.0 as a plugin within a rails app. It s working, but I d like to remove the ?ticket= in the url. Is this possible?

Ordering a hash to xml: Rails

I m building an xml document from a hash. The xml attributes need to be in order. How can this be accomplished? hash.to_xml

multiple ruby extension modules under one directory

Can sources for discrete ruby extension modules live in the same directory, controlled by the same extconf.rb script? Background: I ve a project with two extension modules, foo.so and bar.so which ...

Text Editor for Ruby-on-Rails

guys which text editor is good for Rubyonrails? i m using Windows and i was using E-Texteditor but its not free n its expired now can anyone plese tell me any free texteditor? n which one is best an ...

热门标签