Use Cases and Data Sets
Data Sets
Patent retrieval requires consulting many different information sources.
In practice, however, the relevant information is often siloed across many
heterogeneous and distributed sources. Because retrieving data from these
sources is crucial, we use the following data streams within the bioRegnet
framework.
TREC 2006-2007 Genomics Dataset: The full collection contains
162,259 documents from 49 scientific journals. The files are available in
WinZip format on the protected portion of the TREC Genomics track Web site.
For use in bioRegnet, we processed the text before building indexes.
There are 59 .zip files, one for each journal, with the exception of the large
Journal of Biological Chemistry, whose data is split into 10 files (one for
each year). The 59 .zip files total about 3 GB in size, and the full collection
is about 12.3 GB when uncompressed.
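
Since the collection is distributed as per-journal .zip archives, one way to feed documents into a preprocessing step without first unpacking all 12.3 GB to disk is to stream them directly from the archives. The following is a minimal sketch only, not the bioRegnet implementation: the function name iter_archive_documents, the directory layout, the ".html" suffix filter, and the UTF-8 decoding are all assumptions made for illustration.

```python
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_archive_documents(
    archive_dir: str, suffix: str = ".html"
) -> Iterator[Tuple[str, str, str]]:
    """Yield (archive_name, member_name, text) for each matching file.

    Hypothetical helper: assumes every .zip in `archive_dir` holds one
    journal (or one journal-year) and that documents end with `suffix`.
    """
    for archive_path in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            for member in archive.namelist():
                # Skip directory entries and files that are not documents.
                if member.endswith("/") or not member.lower().endswith(suffix):
                    continue
                raw = archive.read(member)
                # Assumed encoding; undecodable bytes are replaced, not fatal.
                yield archive_path.name, member, raw.decode("utf-8", errors="replace")
```

Each yielded tuple keeps the archive name alongside the document, so a later indexing step could record which journal a document came from. Reading one member at a time keeps memory use bounded by the largest single document rather than by the archive size.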