the ultimate federated search test collection

General information

The FedWeb Greatest Hits collection can be used for research in Federated Web Search. It consists of a 2013 part and a 2014 part, both used in the official TREC FedWeb tracks, as well as additional data (search results for extra test topics, judgments, evaluation scripts, and more).

The collection contains search results and corresponding web pages crawled in April and May 2013/2014 from 157/149 existing web search engines. The engine with engineID e200 (BigWeb) was artificially created: its query results were randomly obtained from 5 large general web search engines that were not explicitly included in the collection.

Both the 2013 and 2014 parts have:

  • a set of samples: 2000/4000 (for 2013 and 2014, respectively) random single-word queries issued to each of the search engines. The first 1000/2000 queries are the same across all engines; the second 1000/2000 are engine-specific. These samples can be used to build engine resource descriptions.
  • a set of topics: 200/275 information needs and corresponding keyword queries. 50/60 topics have been fully judged. All 200 official 2013 topics also belong to the official 2014 topics (along with 75 others). For the judged pages, jpeg screenshots (max. height of 3000 pixels) are available.
  • a set of extra topics: 306/231 information needs and keyword queries which were not used in the official FedWeb track. The 200 official plus 306 extra topics for 2013 and the 275 official plus 231 extra topics for 2014 form the same pool of 506 topics (with descriptions and narratives in FW-topics.xml).
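
The counts listed above can be cross-checked with a small sketch. The names used here (YearPart, is_shared_sample, FW13, FW14) are hypothetical and not part of the collection, and counting sample queries from 1 in the order they were issued is an assumption rather than something documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearPart:
    """Counts for one year of the collection, as stated in the text above."""
    year: int
    engines: int          # number of crawled search engines, as stated above
    samples: int          # sample queries issued to each engine
    official_topics: int  # topics used in the official TREC FedWeb track
    judged_topics: int    # topics that have been fully judged
    extra_topics: int     # topics not used in the official track

    @property
    def shared_samples(self) -> int:
        # The first half of the sample queries is the same across all engines.
        return self.samples // 2

    @property
    def topic_pool(self) -> int:
        # Official plus extra topics together form the full topic pool.
        return self.official_topics + self.extra_topics

    def is_shared_sample(self, position: int) -> bool:
        """Return True if the sample query at this position is shared by all engines.

        Assumption: positions are 1-based and follow the order of issue.
        """
        if not 1 <= position <= self.samples:
            raise ValueError(f"position must be in 1..{self.samples}")
        return position <= self.shared_samples


FW13 = YearPart(year=2013, engines=157, samples=2000,
                official_topics=200, judged_topics=50, extra_topics=306)
FW14 = YearPart(year=2014, engines=149, samples=4000,
                official_topics=275, judged_topics=60, extra_topics=231)
```

Under these assumptions, FW13.topic_pool and FW14.topic_pool both evaluate to 506, matching the shared pool described above.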

For each of these sets the following data is available:

  • the search result page in html
  • up to 10 snippets (title, description, url, icon) extracted from the search result page in xml
  • for each snippet, the downloaded document (in various formats such as html, pdf, jpg, etc.)

More information about the TREC Federated Web Search track and how the data was gathered can be found in the 2013 overview paper and the 2014 overview paper.

Organization of the data

The collection consists of two main folders: search_data and meta_data.

The search_data folder is split up into a fedweb13 and fedweb14 folder. Each of these folders contains compressed search results per search engine, for samples, topics, and additional topics (called 'xtopics'). More detailed information is given here.

The meta_data folder contains 4 subfolders:

  • engines : information on the search engines included in FW13 and FW14
  • judgments : raw relevance judgments for the FW13 and FW14 evaluation topics
  • topics : topics with descriptions in xml, and lists of topic IDs for different official/evaluation parts of the collection
  • TREC_evaluation : evaluation scripts, example runs, qrels files, and various other files used for the TREC evaluation
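
The folder layout described above can be checked after unpacking with a short sketch. The folder names come from the description; the function names (expected_folders, missing_folders) and the root path are hypothetical, and the sketch assumes that search_data and meta_data sit directly under a single root directory.

```python
from pathlib import Path

# Folder names as given in the description above.
SEARCH_PARTS = ("fedweb13", "fedweb14")
META_SUBFOLDERS = ("engines", "judgments", "topics", "TREC_evaluation")


def expected_folders(root):
    """List the folders the description says the collection contains.

    Assumption: search_data and meta_data are direct children of `root`.
    """
    root = Path(root)
    folders = [root / "search_data" / part for part in SEARCH_PARTS]
    folders += [root / "meta_data" / sub for sub in META_SUBFOLDERS]
    return folders


def missing_folders(root):
    """Return the expected folders that do not exist under `root`."""
    return [path for path in expected_folders(root) if not path.is_dir()]
```

For example, missing_folders("path/to/collection") returns an empty list when all six folders are present under that (hypothetical) path.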