English 中文(简体)
How to find common phrases in a large body of text
原标题:

I m working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example say we have three sentences like the following:

  • The dog jumped over the woman.
  • The dog jumped into the car.
  • The dog jumped up the stairs.

From the above example I would want to extract "the dog jumped" as it is the most common phrase in the text. At first I thought, "oh lets use a directed graph [with repeated nodes]":

directed graph http://img.skitch.com/20091218-81ii2femnfgfipd9jtdg32m74f.png

EDIT: Apologies, I made a mistake while making this diagram "over", "into" and "up" should all link back to "the".

I was going to maintain a count of how many times a word occurred in each node object ("the" would be 6; "dog" and "jumped", 3; etc.) but despite many other problems the main one came up when we add a few more examples like (please ignore the bad grammar :-)):

  • Dog jumped up and down.
  • Dog jumped like no dog had ever jumped before.
  • Dog jumped happily.

We now have a problem since "dog" would start a new root node (at the same level as "the") and we would not identify "dog jumped" as now being the most common phrase. So now I am thinking maybe I could use an undirected graph to map the relationships between all the words and eventually pick out the common phrases but I m not sure how this is going to work either, as you lose the important relationship of order between the words.

So does anyone have any general ideas on how to identify common phrases in a large body of text and what data structure I would use.

Thanks, Ben

最佳回答

Check out this related question: What techniques/tools are there for discovering common phrases in chunks of text? Also related to the longest common substring problem.

I ve posted this before, but I use R for all of my data-mining tasks and it s well suited to this kind of analysis. In particular, look at the tm package. Here are some relevant links:

More generally, there are a large number of text mining packages on the Natural Language Processing view on CRAN.

问题回答

暂无回答




相关问题
The Fastest DataStructure to Filter with in C#

Currently we are filtering and sorting data with a datatable. /// <summary> /// Filters the data table and returns a new data table with only the filtered rows. /// </summary>...

Efficient queue in Haskell

How can I efficiently implement a list data structure where I can have 2 views to the head and end of the list, that always point to a head a tail of a list without expensive calls to reverse. i.e: ...

Java large datastructure for storing a matrix

I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. ...

Holding onto items after a postback

I have an ASP.NET web application and I want to be able to take items from a master list and store them temporarliy into one of four other lists. The other lists need to survive post backs so that ...

negative number in the stack

I am a new student in the compilers world ^_^ and I want to know is legal represent negative number in the stack. For example: infix: 1-5=-4 postfix: 15- The statements are: push(1) push(5) x=...

What type of struct/container would you use in this instance?

I am trying to figure out what type of structure or container I should use for a quick project. I need to have an unknown number of sets that will be entered from the GUI (each one will have a name, ...

热门标签