I've got a somewhat ugly field in a database which holds the names of locations. For instance, "Madison Square Garden" has also been entered as "The Madison Square Garden", "Madison Square Gardens", etc.
I'm trying to extract the data so that I can get an accurate list of all the locations. To do this, I've written a SQL query that joins the events for each location, groups by location name, and keeps only the groups with more than 10 entries (which filters out the less reliable entries). I still end up with some very different spellings and entries, resulting in duplicate locations.
My SQL query looks like this:
SELECT location, COUNT(*) FROM locations JOIN events ON locations.lid = events.lid WHERE `long` BETWEEN -74.419382608696 AND -73.549817391304 AND lat BETWEEN 40.314017391304 AND 41.183582608696 GROUP BY location HAVING COUNT(*) > 10
Running this query returns three different entries: "Madison Square Garden", "Madison Square Gardens" and "The Madison Square Garden". Of course, that is only Madison Square Garden. Most locations have multiple slightly different spellings.
I restrict my searches by lat/long so I don't get locations with the same name in different cities grouped together.
Is there a way, with regular expressions or something similar in the GROUP BY clause, to have these grouped consistently? Even just removing the trailing "s" and the leading "The" before grouping would probably be a big benefit.
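To make the idea concrete, here is a rough Python sketch of the normalization I have in mind, done outside SQL. The names normalize_location and group_counts and the sample counts are made up for illustration, and the rules (ignore case, drop a leading "The", drop a single trailing "s") are only the ones mentioned above, so they would almost certainly need refining.

```python
import re
from collections import Counter

def normalize_location(name):
    """Build a grouping key: lowercase, single spaces, no leading 'the', no trailing 's'."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    key = re.sub(r"^the ", "", key)        # drop a leading article
    key = re.sub(r"(?<!s)s$", "", key)     # drop one trailing 's', but leave 'ss' alone
    return key

def group_counts(rows):
    """Sum (name, count) pairs under their normalized key."""
    totals = Counter()
    for name, count in rows:
        totals[normalize_location(name)] += count
    return totals

# Made-up counts, standing in for the rows returned by the query.
rows = [
    ("Madison Square Garden", 40),
    ("Madison Square Gardens", 15),
    ("The Madison Square Garden", 12),
]
print(group_counts(rows))  # Counter({'madison square garden': 67})
```

This collapses the three spellings into one key, but it is naive: a location whose real name ends in "s" would lose that letter in its key.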
Would it make sense to take each result and then run a regular expression match against all the locations within the lat/long range?
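A rough sketch of that second pass in Python might look like the following. Note that it swaps the regular expression for the similarity ratio from the standard library's difflib module, which is my own substitution. The function name similar_names and the 0.85 threshold are arbitrary assumptions, not something I have tuned.

```python
from difflib import SequenceMatcher

def similar_names(canonical, candidates, threshold=0.85):
    """Return the candidates whose case-insensitive similarity to canonical meets the threshold."""
    target = canonical.lower()
    matches = []
    for name in candidates:
        # ratio() is 2 * matched characters / total characters, between 0 and 1.
        ratio = SequenceMatcher(None, target, name.lower()).ratio()
        if ratio >= threshold:
            matches.append(name)
    return matches
```

Here canonical would be one of the names that survives the HAVING filter, and candidates would be every location name inside the same lat/long box.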
Fortunately, I have enough events linked to locations that I can recognize the major locations reasonably well.
Any other suggestions for extracting locations from semi-structured data? The data is scraped from a variety of sources, so I don't have control over the input.