the ultimate federated search test collection

Search results

The main folder search_data is split into the folders fedweb13 and fedweb14. Each of these contains the following subfolders for the samples, topics and additional topics (xtopics), with one compressed file per search engine:

  • FWxx-...-searchpages : the HTML search pages retrieved from the search engines.
  • FWxx-...-search : the search snippets extracted from the search pages, available in XML (see below). The sample snippets are provided in a single XML file per search engine (e.g. FW13-sample-search/e023/e023.xml); the topic snippets are provided in a separate file per topic (e.g. FW13-topics-search/e127/7029.xml).
  • FWxx-...-docs : the crawled documents (e.g. FW13-sample-docs/e121/6752_10.html, where 6752 is the query id and 10 is the rank of the document). Thumbnails provided by the search engine are sometimes available as well (e.g. FW13-sample-docs/e121/6412_10_thumb.jpg). The file extension indicates the format (e.g. html, doc or pdf).
  • FWxx-topics-screenshots (only available for the official topics): contains page screenshots (e.g. FW13-topics-screenshots/e109/7439_01_screenshot.jpg for the screenshot of the result for query id 7439 from engine e109 at rank 1).
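
The file naming described in the list above can be decoded with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling: the function name parse_cache_path is hypothetical, and the pattern (query id, underscore, rank, optional _thumb or _screenshot suffix) is inferred only from the example paths given above, so other file names in the collection may not match it.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern, inferred from the example paths in this document:
#   <query id>_<rank>[_thumb|_screenshot].<extension>
_NAME_RE = re.compile(r"(?P<query>\d+)_(?P<rank>\d+)(?:_(?P<kind>thumb|screenshot))?")


def parse_cache_path(path):
    """Split a docs/screenshots path into its parts (hypothetical helper)."""
    p = PurePosixPath(path)
    match = _NAME_RE.fullmatch(p.stem)
    if match is None:
        # E.g. snippet files such as FW13-topics-search/e127/7029.xml
        raise ValueError(f"unexpected file name: {p.name}")
    return {
        "collection": p.parts[-3] if len(p.parts) >= 3 else None,
        "engine": p.parent.name,
        "query_id": match.group("query"),
        "rank": int(match.group("rank")),
        "kind": match.group("kind") or "document",
        "extension": p.suffix.lstrip("."),
    }


doc = parse_cache_path("FW13-sample-docs/e121/6752_10.html")
shot = parse_cache_path("FW13-topics-screenshots/e109/7439_01_screenshot.jpg")
```

The query id is kept as a string so that it can be compared directly with the id attribute in the snippet files, while the rank is converted to an integer (01 becomes 1).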

Size

The total size of the search results is 784 GB uncompressed and 290 GB compressed. Further statistics are given in the README.txt file.

Snippet format

Below is an example of the search snippets for one query, given as a <search_results> element. In the sample snippet files (e.g. FW13-sample-search/e001/e001.xml), the <search_results> elements are listed under a <samples> root element.

<search_results>
 <query id="5000">the</query>
 <engine status="OK" timestamp="2013-04-04 11:23:15"
  name="arXiv.org" id="FW13-e001"/>
 <snippets>
  <snippet id="FW13-e001-5000-01">
   <link cache="FW13-sample-docs/e001/5000_01.html">
    http://arxiv.org/abs/adap-org/9912005</link>
   <title>...</title>
   <description>...</description>
   <thumb cache="FW13-sample-docs/e001/5000_01_thumb.jpg">thumbnail_url
   </thumb>
  </snippet>
  <snippet>...</snippet>
    ...
 </snippets>
</search_results>

The cache attributes refer to the files in the dataset; if the attribute is not set, we failed to retrieve the corresponding document. The status attribute of the engine element indicates whether we succeeded in retrieving a search result page from the server. If the status is "OK" and no snippet tags are present, no search results were found for that query.
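
The rules above (a missing cache attribute means the document was not retrieved; status "OK" without snippets means an empty result list) can be checked with Python's standard xml.etree.ElementTree module. This is a minimal sketch under stated assumptions: the helper read_search_results and the dictionary keys are hypothetical names, the first <search_results> element is adapted from the example above, and the second one (query id 5001) is made up purely to illustrate the empty-result case.

```python
import xml.etree.ElementTree as ET

# First element adapted from the example above; the second is invented
# for illustration only and does not come from the dataset.
EXAMPLE = """<samples>
 <search_results>
  <query id="5000">the</query>
  <engine status="OK" timestamp="2013-04-04 11:23:15"
   name="arXiv.org" id="FW13-e001"/>
  <snippets>
   <snippet id="FW13-e001-5000-01">
    <link cache="FW13-sample-docs/e001/5000_01.html">
     http://arxiv.org/abs/adap-org/9912005</link>
    <title>...</title>
    <description>...</description>
   </snippet>
  </snippets>
 </search_results>
 <search_results>
  <query id="5001">made-up query</query>
  <engine status="OK" timestamp="2013-04-04 11:23:20"
   name="arXiv.org" id="FW13-e001"/>
  <snippets/>
 </search_results>
</samples>"""


def read_search_results(elem):
    """Summarise one <search_results> element (hypothetical helper)."""
    query = elem.find("query")
    engine = elem.find("engine")
    snippets = []
    for snip in elem.iter("snippet"):
        link = snip.find("link")
        thumb = snip.find("thumb")
        snippets.append({
            "id": snip.get("id"),
            "url": (link.text or "").strip() if link is not None else None,
            # None means the attribute is not set: retrieval failed.
            "doc_cache": link.get("cache") if link is not None else None,
            "thumb_cache": thumb.get("cache") if thumb is not None else None,
        })
    status = engine.get("status") if engine is not None else None
    return {
        "query_id": query.get("id") if query is not None else None,
        "status": status,
        # Status "OK" with no snippets: the engine returned no results.
        "no_results": status == "OK" and not snippets,
        "snippets": snippets,
    }


root = ET.fromstring(EXAMPLE)
# iter() also yields the root itself if it is a <search_results> element,
# so the same loop works with or without a <samples> wrapper.
results = [read_search_results(e) for e in root.iter("search_results")]
```

For the real files, ET.parse(path).getroot() would take the place of ET.fromstring(EXAMPLE) once the per-engine archives have been unpacked.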