I have a large database of resumes (CVs) with a table, skills, that groups all users' skills.
Inside that table there is a field, skill_text, that describes the skill in full text.
I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table of standardized skills.
Here are some example skills extracted from the DB:
- Sectoral and competitive analysis
- Business Development (incl. in international settings)
- Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
- Creative work (Photoshop, In-Design, Illustrator)
- checking and reporting back on campaign progress
- organising and attending events and exhibitions
- Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
- Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.
The output should be something like:
- Sectoral and competitive analysis
- Business Development
- Specific structure and road design software
- Macao
- AutoCAD
- Photoshop
- In-Design
- Illustrator
- organising events
- Development
- Aptana Studio
- PHP
- HTML
- CSS
- JavaScript
- SQL
- AJAX
- Mix marketing
- Viral Marketing
- Social network marketing
- emailing
- SEO
- One to one marketing
As you can see, only the skills remain, with none of the surrounding descriptive text.
I know this is possible using text mining techniques, but how do I do it? The database is really large, which is a good thing, because we can calculate term frequencies and decide whether a phrase is a real skill or just meaningless text. The big problem is: how do I determine that "blablabla" is a skill?
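To make the frequency idea concrete, here is a rough Python sketch of the counting step only. The function names, the separator list and the min_count threshold are all assumptions made up for illustration, and the naive splitting is just a placeholder: it will not cope with truly arbitrary input, which is exactly the part I am asking about.

```python
from collections import Counter

# Hypothetical separators; " - " is spaced so "In-Design" stays whole.
SEPARATORS = ["(", ")", ":", ";", "&", " - "]

def candidate_phrases(skill_text):
    """Naively split one skill_text value into lowercase candidate phrases."""
    for sep in SEPARATORS:
        skill_text = skill_text.replace(sep, ",")
    parts = (p.strip(" .").lower() for p in skill_text.split(","))
    # A set, so each phrase counts at most once per row.
    return {p for p in parts if p}

def frequent_skills(rows, min_count=2):
    """Count in how many rows each candidate appears; keep the frequent ones."""
    counts = Counter()
    for text in rows:
        counts.update(candidate_phrases(text))
    return {phrase: n for phrase, n in counts.items() if n >= min_count}
```

With enough rows, phrases such as "php" or "photoshop" should recur across many users, while one-off text such as "basic knowledge" might be filtered out by the threshold, although very common filler phrases would still need some other test.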
Edit: please don't tell me to use standard things like a text tokenizer or regex, because users input skills in a very arbitrary way!
Thanks.