Grouping to extract common values in semi-structured data

I've got a somewhat ugly field in a database which holds the names of locations. For instance, Madison Square Garden has also been entered as "The Madison Square Gardens", and so on.

I'm trying to extract the data so that I can get an accurate list of all the locations. To do this, I've written a SQL query that joins the events for each location, groups by the location name, and keeps only location groups with more than 10 entries (which filters out the less reliable entries). I still end up with some very different spellings and entries, resulting in duplicate properties/locations.

My SQL query looks like this

SELECT location, COUNT(*) FROM locations
JOIN events ON locations.lid = events.lid
WHERE `long`
BETWEEN -74.419382608696
AND -73.549817391304
AND lat
BETWEEN 40.314017391304
AND 41.183582608696
GROUP BY location
HAVING COUNT(*) > 10

Running this query returns three different entries: "Madison Square Garden", "Madison Square Gardens", "The Madison Square Garden". Of course, this is only for the Madison Square Garden entry. Most entries have multiple slightly different spellings.

I restrict my searches by lat/long so I don't get locations with the same name in different cities grouped together.

Is there a way with regular expressions or something in the GROUP BY clause to have these grouped consistently? Even just removing the trailing "s" and the leading "The" before grouping would probably be a big benefit.

I was going to take each result and then do a regular expression match against all the locations within the lat/long range.

Fortunately, I have enough events linked to locations that I am somewhat able to recognize the major locations.

Any other suggestions for extracting locations from semi-structured data? The data is scraped from a variety of sources, so I don't have control over the input.

Answers

Here are some suggestions for you.

(1) Create a normalized venue-name column in your database. Run each name through some simple transformations:

  - Strip the leading article: turn "The Madison Square Garden" and "The Washington Monument" into "Madison Square Garden" and "Washington Monument".
  - Turn plural nouns into singular the easy way: strip "es", then "s", from each word in the name.
  - Downcase everything.
  - Eliminate any remaining short words: "a", "it", "the", "and", "&". You get the idea.
  - Sort the words into alphabetical order, getting you "garden madison square".

Store the resulting string in a new column in your table. Match on it, while still displaying your original string.
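The transformations in (1) could be sketched in Python roughly as follows. This is only an illustration of the idea, not a tested production routine: the names `normalize_venue`, `singularize` and `STOP_WORDS`, the contents of the stop-word list, and the minimum-length guard are all assumptions made for the example.

```python
import re

# Short filler words to drop. The exact list is an assumption for illustration.
STOP_WORDS = {"a", "an", "it", "the", "and", "of"}

def singularize(word):
    """Crude plural stripping: try the suffix "es" first, then "s"."""
    for suffix in ("es", "s"):
        # Length guard (an assumption) so short words like "yes" stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize_venue(name):
    """Build a matching key for a venue name; keep the original for display."""
    # Lowercase, then keep only runs of word characters, which also
    # discards "&" and other punctuation.
    words = re.findall(r"\w+", name.lower())
    # Drop filler words, singularize the rest, then sort alphabetically.
    words = [singularize(w) for w in words if w not in STOP_WORDS]
    return " ".join(sorted(words))

for raw in ("Madison Square Garden", "Madison Square Gardens",
            "The Madison Square Garden"):
    print(raw, "->", normalize_venue(raw))
```

All three spellings from the question produce the key "garden madison square", so grouping on the stored key would merge them. The suffix rule is deliberately crude: "theatres" becomes "theatr" while "theatre" is left alone, so some variants will still need the lookup table from (2).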

(2) Create a lookup table with variant spellings of venues. This works well for venue names like "Boston Garden" / "Fleet Center" / "TD Banknorth Garden" / "North Station" and junk like that. Same place, different spelling. ("Penn Station" for your example).
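A lookup table like the one in (2) might be wired into the grouping query as below. This is a sketch using Python's built-in `sqlite3` with an in-memory database so it can run anywhere; the table `venue_alias`, its columns `variant` and `canonical`, and the helper `count_by_canonical_venue` are hypothetical names, and the `locations` table here is reduced to a single column.

```python
import sqlite3

def count_by_canonical_venue(rows, aliases):
    """Count rows per venue after mapping known variants to one canonical name."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE locations (location TEXT)")
    # Hypothetical lookup table: one row per known variant name.
    db.execute("CREATE TABLE venue_alias (variant TEXT PRIMARY KEY, canonical TEXT)")
    db.executemany("INSERT INTO locations VALUES (?)", [(r,) for r in rows])
    db.executemany("INSERT INTO venue_alias VALUES (?, ?)", list(aliases.items()))
    # Names without an alias fall back to themselves via COALESCE.
    result = db.execute(
        """
        SELECT COALESCE(a.canonical, l.location) AS venue, COUNT(*)
        FROM locations AS l
        LEFT JOIN venue_alias AS a ON a.variant = l.location
        GROUP BY venue
        ORDER BY venue
        """
    ).fetchall()
    db.close()
    return result

rows = ["Fleet Center", "Boston Garden", "TD Banknorth Garden", "Penn Station"]
aliases = {"Fleet Center": "TD Banknorth Garden",
           "Boston Garden": "TD Banknorth Garden"}
print(count_by_canonical_venue(rows, aliases))
```

Which name counts as canonical is an arbitrary choice in this example. The same LEFT JOIN plus COALESCE pattern should carry over to MySQL, though that has not been checked against the schema in the question.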

(3) You could use the Yahoo or Google Maps geocoding services, which will take incomplete names and addresses and standardize them.

Soundex is going to get you quite a few false positive matches. It's designed as a fallback and requires human disambiguation.

If your issue is treating "similar" strings the same, you may want to check out the SOUNDEX algorithm. I'm not sure if it will work for all of your different scenarios, but it's a start.

It's discussed in this thread: How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?
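To get a feel for what Soundex does and does not merge, here is a small Python sketch of the classic four-character American Soundex. It is written from the textbook rules for illustration only; MySQL has its own built-in SOUNDEX() function, whose output may differ in details such as code length, so treat the values below as indicative rather than as what the database would return.

```python
# Consonant groups of classic American Soundex.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}

def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # "h" and "w" do not separate equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is emitted again
    # Pad with zeros and cut to the standard length of four.
    return (result + "000")[:4]

for w in ("Garden", "Gardens", "Gordon"):
    print(w, soundex(w))
```

"Garden" and "Gardens" get the same code, which is the desired merge, but so does "Gordon", which is the kind of false positive mentioned above.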




