English 中文(简体)
Text mining on large database (data mining)
原标题:

I have a large database of resumes (CV), and a certain table skills grouping all users skills.

inside that table there s a field skill_text that describes the skill in full text.

I m looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standarized skills..

Here are some examples skills extracted from the DB :

  • Sectoral and competitive analysis
  • Business Development (incl. in international settings)
  • Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
  • Creative work (Photoshop, In-Design, Illustrator)
  • checking and reporting back on campaign progress
  • organising and attending events and exhibitions
  • Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
  • Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.

The output shoud be something like :

  • Sectoral and competitive analysis
  • Business Development
  • Specific structure and road design software -
  • Macao
  • AutoCAD
  • Photoshop
  • In-Design
  • Illustrator
  • organising events
  • Development
  • Aptana Studio
  • PHP
  • HTML
  • CSS
  • JavaScript
  • SQL
  • AJAX
  • Mix marketing
  • Viral Marketing
  • Social network marketing
  • emailing
  • SEO
  • One to one marketing

As you see only skills remains no other representation text.

I know this is possible using text mining technics but how to do it ? the database is realy large.. it s a good thing because we can calculate text frequency and decide if it s a real skill or just meaningless text... The big problem is .. how to determin that "blablabla" is a skill ?

Edit : please don t tell me to use standard things like a text tokinzer, or regex .. because users input skills in a very arbitrary way !!

thanks

最佳回答

If I was doing this programmatically I would:

Extract all punctuation delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just skill) so Creative work (Photoshop, In-Design, Illustrator) becomes

 Skill            
 -------------
 Creative work    
 Photoshop        
 In-Design        
 Illustrator      

Then, after you ve proceed all CVs, query for the most common skills (this is MySQL)

SELECT skill, COUNT(1) cnt FROM newTable GROUP BY skill ORDER BY cnt DESC;

Which may look like this contrived example

 Skill            Cnt
 ---------------------
 Photoshop        3293
 Illustrator      2134
 Creative work     932
 In-Design         123

Then you decide, from the top X skills, which you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example) and which to discard, then script the process using a data map.

Use the data map to write a new word frequency table (this time skill_id, skill, frequency) and the second time when parsing the data also write to a lookup table (cv_id,skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can query for the most popular skills, CVs matching certain criteria etc.

问题回答

Many databases will do this for you via their full-text search functionality. I know that PostgreSQL s full-text search would be able to do this easily with the aid of a custom dictionary.

Alternatively, you can use PHP s strtok or equivalent to index your text. Once indexed you can compare to dictionary, or simply use occurrences to create a sheet for yourself. Word clouds are made in a similar fashion.

Doing this well requires knowledge; otherwise what s to tell "organising events" is a skill while "creative work" isn t? But a stupid program can take a first cut at it by analyzing statistics of collocations: see the answers to How to extract common / significant phrases from a series of text entries and Algorithms to detect phrases and keywords from text.





相关问题
what is wrong with this mysql code

$db_user="root"; $db_host="localhost"; $db_password="root"; $db_name = "fayer"; $conn = mysqli_connect($db_host,$db_user,$db_password,$db_name) or die ("couldn t connect to server"); // perform query ...

Users asking for denormalized database

I am in the early stages of developing a database-driven system and the largest part of the system revolves around an inheritance type of relationship. There is a parent entity with about 10 columns ...

Easiest way to deal with sample data in Java web apps?

I m writing a Java web app in my free time to learn more about development. I m using the Stripes framework and eventually intend to use hibernate and MySQL For the moment, whilst creating the pages ...

join across databases with nhibernate

I am trying to join two tables that reside in two different databases. Every time, I try to join I get the following error: An association from the table xxx refers to an unmapped class. If the ...

How can I know if such value exists in database? (ADO.NET)

For example, I have a table, and there is a column named Tags . I want to know if value programming exists in this column. How can I do this in ADO.NET? I did this: OleDbCommand cmd = new ...

Convert date to string upon saving a doctrine record

I m trying to migrate one of my PHP projects to Doctrine. I ve never used it before so there are a few things I don t understand. In my current code, I have a class similar to this: class ...

热门标签