This is a first pass at providing a dataset for people to play with.

The idea was to take 2.5 years of public BBC data, combine it with Lonclass (http://en.wikipedia.org/wiki/Lonclass) and output the result as RDF and other formats.

For various reasons I've simplified the dataset:

* from July 2007 to December 2009 inclusive
* channels bbcone, bbctwo, bbcthree, bbcfour only (no radio, children's channels or news24)
* no repeat broadcasts (programmes not versions)
* only items with at least one Lonclass category
* about 10% of programmes are lost at the pid-matching phase
* it's not guaranteed to be correct: the matching procedure occasionally introduces inaccuracies
* not all pids have series

It's not in RDF yet either!

LICENSE
=======

These are derived from parts of the BBC catalogue, and also use /programmes data. BBC programmes data is licensed under the BBC backstage license (http://backstage.bbc.co.uk/archives/2005/05/api_licence.html). This data is licensed in a similar way, i.e. for non-commercial use. Let us know if you have any queries about that.

These are the files:

BBC DATASETS
============

1. /programmes pids mapped to lonclass terms with lonclass term names
---------------------------------------------------------------------

pid_term_name_all.txt.gz

[select distinct pid,pid_term.term,term_name from pid_term,concepts where pid_term.term=concepts.term into outfile 'pid_term_name_all.txt';]

[pid]          [term]          [term name]
[varchar(17)]  [varchar(100)]  [varchar(100)]
b00744q5       725.716         PUBLIC HOUSES
b00745bp       343.91          CRIMINALS (CRIMINOLOGY)
b00745bp       (412)           EIRE (EUROPE)

2. /programmes pids matched to series crids
-------------------------------------------

pids_series_all.txt.gz

[select distinct pid, series_crid from series into outfile 'pids_series_all.txt';]

[pid]          [series crid]
[varchar(17)]  [varchar(72)]
b00744nc       crid://fp.bbc.co.uk/KKXUYW
b00744q5       crid://fp.bbc.co.uk/KKXUYW

3. /programmes pids detail
--------------------------

pids_detail_all.txt.gz

[select distinct pid,crid,description,dt_scheduled,dt_actual,channel,core_title,series_title from pids into outfile 'pids_detail_all.txt';]

Each row has eight columns. They are listed here in file order, one column per line, with the type and the value from one example row:

[column]              [type]         [example]
[pid]                 [varchar(17)]  b007481d
[series crid]         [varchar(72)]  crid://fp.bbc.co.uk/4D9FYN
[description]         [text]         When Gaz finds Donna's toothbrush in his flat, he realises he has to stop her 'merging'.
[scheduled datetime]  [datetime]     2007-08-14 23:10:00
[actual datetime]     [datetime]     2007-08-15 00:12:00
[channel]             [varchar(20)]  bbcthree
[core title]          [char(144)]    BONE WITH THE WIND
[series title]        [char(72)]     TWO PINTS OF LAGER AND A PACKET OF CRISPS

TANIMOTO DATASET
================

4. /programmes, series_crid, connecting terms
---------------------------------------------

pid_series_term_all.txt.gz

[select pid_term.pid, series_crid, GROUP_CONCAT(distinct pid_term.term SEPARATOR '\t') as `terms` from pid_term, series where pid_term.pid=series.pid group by pid_term.pid into outfile 'pid_series_term_all.txt';]

[pid]          [series crid]               [tab-separated list of terms]
[varchar(17)]  [varchar(72)]               [text]
b00rhfdn       crid://fp.bbc.co.uk/KCIIXH  159.953 .009.04 159.953.009.04 625.2:910.21
                                           301.153.5:159.921.118.002.694 616.899 371.912MAKATON
                                           656.211 159.921.118.002.694:301.153.5 301.153.5 372.4
                                           8.081 910.21 625.2 910.21:625.2 625.23
                                           159.921.118 .002.694
b00rhff6       crid://fp.bbc.co.uk/KCIIXH  (411BELFAST) 381.14 083.3:64.033 64.033 083.3
                                           371.912MAKATON

ADD-ON DATASETS
===============

These have been created by matching aspects of the previous datasets with external vocabularies.

5. pid_dbpedia_lonclass.txt.gz
------------------------------

*DISCLAIMER* these are matches made automatically by a piece of code (https://github.com/notube/Code_snippets/tree/master/lonclass_to_dbpedia) and have not been checked for accuracy at this stage. More information is here: http://notube.tv/2011/02/15/linking-wikipedia-and-bbc-programmes/

The file consists of Lonclass terms matched to DBpedia using a basic text match. The format is tab-separated, one match per line:

[term]           [wikipedia url segment]  [term text matched]  [exact / plural]
[varchar(100)]   [text]                   [text]               [varchar(6)]
(=1.569.5)       Jordanians               jordanians           exact
611.781.002.112  Hairstyle                hairstyles           plural
623.442.47       Machine_Gun              machine guns         plural

The equivalent Wikipedia / DBpedia / DBpedialite.org URL can be found by putting the URL segment at the end of the corresponding base URL, e.g.

* http://en.wikipedia.org/wiki/Jordanians
* http://dbpedia.org/data/Machine_gun.n3 (view the head of http://dbpedia.org/page/Machine_gun for more)
* http://dbpedialite.org/titles/Air_forces

6. dbpedialite_titles.txt.gz
----------------------------

*DISCLAIMER* these are matches made automatically by a piece of code (https://github.com/notube/Code_snippets/tree/master/titles_to_dbpedialite) and have not been checked for accuracy at this stage. It uses http://dbpedialite.org for the lookups.

The format is tab-separated, one match per line:

[pid]          [wikipedia url segment]
[varchar(17)]  [text]
b00jlz3y       Holy_Sepulchre_Cemetery_%28Cheltenham_Township%2C_Pennsylvania%29
b00hcgx3       Working_Lunch
b00fh69t       Watchdog
b0079y0f       Alfred_Wainwright
b00fzbmc       Bargain_Hunt

The equivalent Wikipedia / DBpedia / DBpedialite.org URL can be found by putting the URL segment at the end of the corresponding base URL, e.g.

* http://en.wikipedia.org/wiki/History_of_the_Conservative_Party
* http://dbpedia.org/data/Machine_gun.n3 (view the head of http://dbpedia.org/page/Machine_gun for more)
* http://dbpedialite.org/titles/Air_forces
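
As a rough illustration of how the files described above might be used, here is a minimal Python sketch. It is not part of the dataset or of the notube code, and the function names (parse_term_row, tanimoto, entity_urls) are made up for this example. It assumes the files have been gunzipped and are tab-separated as described. It also assumes that the "Tanimoto dataset" is meant for the usual set-based Tanimoto coefficient (shared terms divided by all distinct terms), which the notes above do not spell out. Because the files were written with MySQL's INTO OUTFILE, tabs embedded in the terms column may be preceded by a backslash, so the sketch strips a trailing backslash from each term as a precaution. The URL patterns are copied from the examples given for datasets 5 and 6.

```python
def parse_term_row(line):
    """Split one row of pid_series_term_all.txt into (pid, series_crid, terms)."""
    fields = line.rstrip("\n").split("\t")
    pid, series_crid = fields[0], fields[1]
    # Everything after the first two columns is a Lonclass term.
    # Assumption: MySQL may have escaped embedded tabs as backslash + tab,
    # which would leave a trailing backslash on each term after splitting.
    terms = {t.rstrip("\\") for t in fields[2:]}
    terms.discard("")
    return pid, series_crid, terms


def tanimoto(a, b):
    """Set-based Tanimoto coefficient: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def entity_urls(segment):
    """Build the three URLs for a [wikipedia url segment] value."""
    return {
        "wikipedia": "http://en.wikipedia.org/wiki/" + segment,
        "dbpedia": "http://dbpedia.org/data/" + segment + ".n3",
        "dbpedialite": "http://dbpedialite.org/titles/" + segment,
    }


if __name__ == "__main__":
    # Two rows abbreviated from the dataset 4 example (only a few terms kept).
    row_a = "b00rhfdn\tcrid://fp.bbc.co.uk/KCIIXH\t616.899\t371.912MAKATON\t656.211\n"
    row_b = "b00rhff6\tcrid://fp.bbc.co.uk/KCIIXH\t381.14\t64.033\t371.912MAKATON\n"
    pid_a, _, terms_a = parse_term_row(row_a)
    pid_b, _, terms_b = parse_term_row(row_b)
    print(pid_a, pid_b, tanimoto(terms_a, terms_b))
    print(entity_urls("Working_Lunch")["wikipedia"])
```

To compare every pair of programmes, the same functions could be applied to each line of the gunzipped pid_series_term_all.txt, although for a file of this size it may be worth comparing only programmes that share at least one term.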