I am working on an application where I would like to retrieve a list of the day's top news stories from some source (such as the BBC) and parse these for keywords that I can match against my own tag data. There are obviously lots of web services and APIs out there, but what would you suggest as good routes to take?
One thing I was considering is periodically downloading the RSS feed of BBC News and parsing the content using the Yahoo term extractor. This seems like a good solution to me, but the term extractor is for non-commercial use only and my application is commercial.
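To make the RSS half of that idea concrete, here is a minimal sketch in Python using only the standard library. The function names, the sample feed, and the tag set are all made up for illustration, and matching titles against tags I already know is only a naive stand-in for a real term extractor, not a replacement for it:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url):
    """Download a feed and return its raw bytes (not called in the demo below)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def titles_from_rss(rss_xml):
    """Return the text of every <item><title> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [t.text.strip() for t in root.iterfind("./channel/item/title") if t.text]


def match_tags(titles, tags):
    """Return the known tags that occur as whole words or phrases in any title."""
    found = set()
    for title in titles:
        lowered = title.lower()
        for tag in tags:
            if re.search(r"\b" + re.escape(tag.lower()) + r"\b", lowered):
                found.add(tag)
    return found


# Hypothetical feed content, standing in for a real download.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Sample feed</title>
<item><title>Yacht hijack off the coast</title></item>
<item><title>Flu cases rise</title></item>
</channel></rss>"""

if __name__ == "__main__":
    titles = titles_from_rss(SAMPLE_RSS)
    print(sorted(match_tags(titles, {"yacht", "flu", "moon"})))
```

In practice the XML would come from `fetch()` run on a schedule rather than from the inline sample. This only finds tags I already have, so it does not discover new terms the way a term extraction service would.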
YQL looks promising but I'm not sure how easy it will be to condense the data down to keywords.
All suggestions welcome, both for the news source and the keyword/tag extraction, and for both commercial and non-commercial uses.
Update:
Building on the suggestion of an answer, here's the YQL for grabbing the keywords from the top UK news stories on the BBC:
select content
from search.termextract
where context in (
select title
from rss
where url='http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml'
)
which returns something like:
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="46" yahoo:created="2009-11-13T11:49:05Z" yahoo:lang="en-US" yahoo:updated="2009-11-13T11:49:05Z" yahoo:uri="http://query.yahooapis.com/v1/yql?q=select+content+from+search.termextract+where+context+in+%28select+title+from+rss+where+url%3D%27http%3A%2F%2Fnewsrss.bbc.co.uk%2Frss%2Fnewsonline_uk_edition%2Ffront_page%2Frss.xml%27+%29">
<results>
<Result xmlns="urn:yahoo:cate">new york</Result>
<Result xmlns="urn:yahoo:cate">bolt gun</Result>
<Result xmlns="urn:yahoo:cate">stalker</Result>
<Result xmlns="urn:yahoo:cate">russia</Result>
<Result xmlns="urn:yahoo:cate">moon</Result>
<Result xmlns="urn:yahoo:cate">hijack</Result>
<Result xmlns="urn:yahoo:cate">yacht</Result>
<Result xmlns="urn:yahoo:cate">balloon</Result>
<Result xmlns="urn:yahoo:cate">parents</Result>
<Result xmlns="urn:yahoo:cate">bruce forsyth</Result>
<Result xmlns="urn:yahoo:cate">flu</Result>
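For what it's worth, pulling the terms out of a response shaped like this is straightforward. The following is a rough Python sketch using the standard library; the function name and the shortened sample response are my own for illustration, and the only things taken from the real output are the `Result` element and its `urn:yahoo:cate` namespace:

```python
import xml.etree.ElementTree as ET

# Namespace carried by each <Result> element in the response above.
CATE_NS = "urn:yahoo:cate"


def terms_from_yql(response_xml):
    """Return the extracted terms in document order, with duplicates dropped."""
    root = ET.fromstring(response_xml)
    terms = []
    for node in root.iter("{%s}Result" % CATE_NS):
        term = (node.text or "").strip()
        if term and term not in terms:
            terms.append(term)
    return terms


# Shortened, hypothetical response in the same shape as the real one.
SAMPLE_RESPONSE = """<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="3">
<results>
<Result xmlns="urn:yahoo:cate">new york</Result>
<Result xmlns="urn:yahoo:cate">flu</Result>
<Result xmlns="urn:yahoo:cate">flu</Result>
</results>
</query>"""

if __name__ == "__main__":
    print(terms_from_yql(SAMPLE_RESPONSE))
```

The resulting list could then be compared directly against my own tag data.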
Ultimately, though, I don't think I can use this within a commercial app, due to the restrictions on the term extraction service.