The ultimate federated search test collection

Contents:

  • TREC evaluation
  • Vertical Selection evaluation
  • Resource Selection evaluation
  • Results Merging evaluation

TREC evaluation

The collection folder meta_data/TREC_EVALUATION contains the following subfolders:

  • eval_scripts
    • FW-eval-VS.pl: Perl script for vertical selection (VS) evaluation
    • FW-eval-RS.py: Python script for resource selection (RS) evaluation
    • FW-eval-RM.py: Python script for results merging (RM) evaluation
  • example_runs
    • FW13-X-example-run (X = RS, RM)
    • FW13-baseline-rs.txt: baseline used for FedWeb13 resource selection evaluation
    • FW14-X-example-run (X = VS, RS, RM)
  • qrels_files
    • FW13-QRELS-X.txt (X = RS, RM)
    • FW14-QRELS-X.txt (X = VS, RS, RM)
  • allsorts
    • FW13-annotation-times.txt: FedWeb13 annotation times for pages and snippets
    • FWxx-duplicates-checked.txt: verified duplicates
    • FW14-intent-probs.txt: intent probabilities per topic ID, for all verticals.

Notes:

The file FW13-annotation-times.txt is not used for the TREC evaluation, but may be of interest to researchers. The annotation times are based on the FedWeb13 annotation logs, using the time difference between subsequent annotations as the annotation time estimate. For outliers (i.e., breaks during the annotation process), we used the average time estimate. This was done separately for snippets and pages, and within the pages a distinction was made between video and non-video material.
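
As a rough illustration of this procedure, the sketch below estimates per-item annotation times from one annotator's chronologically sorted timestamps and replaces outliers by the average of the remaining estimates. The 600-second break threshold and the input format are assumptions for illustration only, not the settings of the original FedWeb13 scripts.

    # Hedged sketch (not the official script): estimate per-item annotation times
    # from one annotator's chronologically sorted annotation timestamps, replacing
    # outliers (long breaks) by the average of the remaining estimates.
    # The 600-second break threshold is an illustrative assumption only.
    from datetime import datetime

    def estimate_annotation_times(timestamps, break_threshold=600.0):
        deltas = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
        regular = [d for d in deltas if d <= break_threshold]
        avg = sum(regular) / len(regular) if regular else 0.0
        return [d if d <= break_threshold else avg for d in deltas]

    # e.g. estimate_annotation_times([datetime(2013, 5, 1, 10, 0, 0),
    #                                 datetime(2013, 5, 1, 10, 0, 40),
    #                                 datetime(2013, 5, 1, 11, 0, 0)]) -> [40.0, 40.0]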

The files FWxx-duplicates-checked.txt (xx = 13, 14) have the format

"score  ID1  ID2  ID3 ..."

in which:
ID1, ID2, ID3, ...: page IDs on a single line form a set of duplicate results.
score = 0: same URL
score = 1: the MD5 hash of the retrieved document is the same and the document is not empty
score = 2: the URLs are both long enough (in number of words), have a similar size (max. 1 percent difference), come from different engines, and have a similar simhash. All sets with score = 2 were verified manually.
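
The following minimal sketch parses this format, assuming whitespace-separated fields as described above:

    # Hedged sketch: parse FWxx-duplicates-checked.txt into (score, set of page IDs)
    # pairs, assuming whitespace-separated fields in the format documented above.
    def read_duplicate_sets(path):
        duplicate_sets = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue
                duplicate_sets.append((int(fields[0]), set(fields[1:])))
        return duplicate_sets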

The file FW14-engines.txt contains lines with the following tab-separated information:

engine  ID  name  URL  Vertical  VerticalID
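
A minimal reading sketch, assuming one tab-separated record per line in the column order shown above and no header line (check the actual file before relying on this):

    # Hedged sketch: read FW14-engines.txt as a list of tab-separated records,
    # assuming the column order listed above and no header line.
    import csv

    def read_engines(path):
        with open(path, newline='') as f:
            return [row for row in csv.reader(f, delimiter='\t') if row]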

Evaluation of Vertical Selection (FedWeb14)

  • Evaluation script: TREC_evaluation/eval_scripts/FW-eval-VS.pl
  • Example run file: TREC_evaluation/example_runs/FW14-VS-example-run
  • Qrels file: TREC_evaluation/qrels_files/FW14-QRELS-VS.txt

To run the Perl script FW-eval-VS.pl, the following input arguments are required:

- RUN: location of the run file to be evaluated
- QRELS: location of the qrels file (FW14-QRELS-VS.txt)

The output is given per topic: precision (P), recall (R), and F1-measure of the selected vertical(s).
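
For reference, the sketch below shows these per-topic measures under the assumption of a plain set comparison between selected and relevant verticals; the official Perl script may handle edge cases (e.g., topics without relevant verticals) differently.

    # Hedged sketch of the per-topic measures reported by FW-eval-VS.pl, assuming
    # a plain set comparison of the selected vs. the relevant verticals.
    def vertical_selection_scores(selected, relevant):
        selected, relevant = set(selected), set(relevant)
        hits = len(selected & relevant)
        p = hits / len(selected) if selected else 0.0
        r = hits / len(relevant) if relevant else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f1

    # e.g. vertical_selection_scores({"News", "Video"}, {"News"}) -> (0.5, 1.0, 0.666...)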

For details on how we obtained the vertical relevance, please refer to the (GMR + II) approach described in the following paper:

K. Zhou, T. Demeester, D. Nguyen, D. Hiemstra, and D. Trieschnigg. Aligning Vertical Collection Relevance with User Intent. In ACM International Conference on Information and Knowledge Management (CIKM 2014), Shanghai, China, 2014.

Evaluation of Resource Selection (FedWeb13, FedWeb14)

  • Evaluation script: TREC_evaluation/eval_scripts/FW-eval-RS.py
  • External software needed: trec_eval (tested with trec_eval.9.0)
  • Example run files:
    FW13: TREC_evaluation/example_runs/FW13-RS-example-run
    FW14: TREC_evaluation/example_runs/FW14-RS-example-run
  • Qrels files:
    FW13: TREC_evaluation/qrels_files/FW13-QRELS-RS.txt
    FW14: TREC_evaluation/qrels_files/FW14-QRELS-RS.txt

To run the Python script FW-eval-RS.py, the following input arguments are required (in this order):

- RUN: location of the run file to be evaluated
- QRELS: location of the appropriate qrels file (FW13-QRELS-RS.txt or FW14-QRELS-RS.txt)
- TRECEVAL: location of the folder containing the trec_eval executable (e.g., /usr/local/lib/trec_eval.9.0)
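
A minimal invocation sketch with the documented argument order, using the example run and qrels paths listed above; the interpreter name and all paths are placeholders to adapt to your setup:

    # Hedged sketch: call FW-eval-RS.py with the documented argument order
    # (RUN, QRELS, TRECEVAL); "python" and the paths are placeholders and may
    # need adapting (e.g. to an older Python interpreter).
    import subprocess

    subprocess.run([
        "python", "TREC_evaluation/eval_scripts/FW-eval-RS.py",
        "TREC_evaluation/example_runs/FW14-RS-example-run",  # RUN
        "TREC_evaluation/qrels_files/FW14-QRELS-RS.txt",     # QRELS
        "/usr/local/lib/trec_eval.9.0",                      # TRECEVAL
    ], check=True)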

Important note:

Only the tools used for the FedWeb14 evaluation are provided; these calculate nDCG@k and nP@k (normalized precision at k).

The evaluation script can be run on both the 2013 and 2014 runs and qrels files. However, it does not allow the official FedWeb 2013 results to be reproduced exactly, because of two major differences with respect to the FedWeb 2013 evaluation:

  1. For nDCG we used trec_eval (in the form used in FW-eval-RS.py), whereas for FedWeb13 we used nDCG as defined in the ICML 2005 paper by Burges et al., computed with the TREC Web Track's gdeval.pl script.
  2. For the normalized precision nP@k, in FedWeb14 we explicitly ordered (query, engine) pairs with the same score (in both the predicted and the ideal ranking) by engine ID, which leads to larger nP@k values than in FedWeb13, where ties were ordered randomly.
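
To illustrate the tie-breaking in point 2, the sketch below computes nP@k under the assumption that it is the ratio of the graded-precision sum of the top-k selected engines to the ideal top-k sum; the exact definition is given in the FedWeb overview papers and implemented in FW-eval-RS.py.

    # Hedged sketch of nP@k with the FedWeb14 tie-breaking described in point 2:
    # engines with equal scores are ordered by engine ID, in both the predicted
    # and the ideal ranking. Defining nP@k as the ratio of graded-precision sums
    # is an assumption here; FW-eval-RS.py contains the official computation.
    def np_at_k(predicted_scores, graded_precision, k):
        """Both arguments are dicts mapping engine ID -> score for one topic."""
        def ranked(scores):
            return sorted(scores, key=lambda e: (-scores[e], e))  # score desc, ID asc
        top_pred = ranked(predicted_scores)[:k]
        top_ideal = ranked(graded_precision)[:k]
        ideal = sum(graded_precision[e] for e in top_ideal)
        got = sum(graded_precision.get(e, 0.0) for e in top_pred)
        return got / ideal if ideal else 0.0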

Important note:

Also note the difference in reference scores (graded precision values per search engine) between the qrels files: for 2013 the scores are given out of 100, i.e., round(graded_precision * 100), whereas for 2014 they are given out of 1000.

The two years are also based on different weights for the relevance levels (see the respective TREC FedWeb overview papers).

If you would like to perform uniform tests over both years, with an arbitrary set of relevance weights, new qrels files can be created from the original judgment files (FW1X-single-page-judgments.txt).
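
As a starting point, the sketch below recomputes per-(topic, engine) graded precision from a single-page judgments file with custom relevance weights. The column layout and the label names are assumptions to adapt to the actual FW1X-single-page-judgments.txt files and to the weights from the overview papers.

    # Hedged sketch: recompute per-(topic, engine) graded precision from
    # FW1X-single-page-judgments.txt with custom relevance-level weights, scaled
    # as in the qrels files. The column layout (topic, engine, page ID, relevance
    # label) and the label names below are assumptions, not the documented format.
    from collections import defaultdict

    EXAMPLE_WEIGHTS = {"Non": 0.0, "Rel": 0.25, "HRel": 0.5, "Key": 0.75, "Nav": 1.0}

    def graded_precision_qrels(judgments_path, weights=EXAMPLE_WEIGHTS, scale=1000):
        per_pair = defaultdict(list)
        with open(judgments_path) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 4:
                    continue
                topic, engine, page_id, label = fields[:4]
                per_pair[(topic, engine)].append(weights.get(label, 0.0))
        # e.g. round(graded_precision * 1000), as in the FedWeb14 qrels
        return {pair: round(scale * sum(w) / len(w)) for pair, w in per_pair.items()}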

Evaluation of Results Merging (FedWeb13, FedWeb14)

  • Evaluation script: TREC_evaluation/eval_scripts/FW-eval-RM.py
  • External software needed: trec_eval (tested with trec_eval.9.0)
  • Example run files:
    FW13: TREC_evaluation/example_runs/FW13-RM-example-run
    FW14: TREC_evaluation/example_runs/FW14-RM-example-run
  • Qrels files:
    FW13: TREC_evaluation/qrels_files/FW13-QRELS-RM.txt
    FW14: TREC_evaluation/qrels_files/FW14-QRELS-RM.txt
  • Other files needed for the evaluation:
    TREC_evaluation/allsorts/FWxx-duplicates-checked.txt (xx = 13, 14)
    TREC_evaluation/allsorts/FW14-intent-probs.txt
    engines/FW14-engines.txt

To run the Python script FW-eval-RM.py, the following input arguments are required (in this order):

- RUN: location of the run file to be evaluated
- QRELS: location of the appropriate qrels file (FW13-QRELS-RM.txt or FW14-QRELS-RM.txt)
- TRECEVAL: location of the folder containing the trec_eval executable (e.g., /usr/local/lib/trec_eval.9.0)
- DUPLICATES: location of the file with checked duplicates (FW13-duplicates-checked.txt or FW14-duplicates-checked.txt)
- TMP: location of a folder with write access, for temporary files (to store the run- and/or vertical-specific filtered qrels file as input for trec_eval)
- RESOURCES: location of the resource description file (FW14-engines.txt)
- INTENTS: location of the file with intent probabilities for the 50 FedWeb14 evaluation topics (FW14-intent-probs.txt)
The last two arguments can be omitted, in which case only nDCG@k is calculated. If they are present, the intent-aware nDCG-IA@k is calculated as well.
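
For reference, the sketch below shows the assumed intent-aware combination: nDCG-IA@k as the intent-probability-weighted sum of per-intent (per-vertical) nDCG@k values, with the per-intent values coming from trec_eval runs on the vertical-filtered qrels files that the script writes to the TMP folder.

    # Hedged sketch of the intent-aware combination: nDCG-IA@k is assumed to be
    # the intent-probability-weighted sum of per-intent (per-vertical) nDCG@k
    # values; see FW-eval-RM.py for the official computation.
    def ndcg_ia_at_k(ndcg_per_intent, intent_probs):
        """Both arguments are dicts mapping a vertical/intent ID -> value for one topic."""
        return sum(p * ndcg_per_intent.get(intent, 0.0)
                   for intent, p in intent_probs.items())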

Important notes:

  • The flag "set_duplicates_nonrelevant = True" on line 30 of the FW-eval-RM.py script means that subsequent duplicates are set to non-relevant (see the sketch at the end of this section). If you want duplicates to be assigned their actual relevance weights, set it to False.
  • The evaluation script was composed of the FedWeb14 evaluation scripts, with a few simplifications:
    • no explicit check that the run contains input from at most 20 resources (which was already the case for all submitted runs)
    • no explicit check of the link with an actual resource selection file (which was implemented only for FedWeb14, and was already the case for the submitted runs)
  • The FedWeb 2013 and 2014 evaluation results can be reconstructed with these files. However, we discovered a bug in the 2013 duplicate removal script, such that for 2013, only the results without explicit duplicate removal can be reconstructed.
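
The sketch below illustrates the duplicate handling referred to in the first note above, under the assumption that only the first occurrence of each duplicate set in a ranked result list keeps its relevance; it is an illustration of the idea, not the code of FW-eval-RM.py.

    # Hedged sketch of the behaviour behind set_duplicates_nonrelevant = True:
    # within a ranked result list, only the first member of each duplicate set
    # keeps its relevance, and later occurrences are treated as non-relevant.
    def subsequent_duplicates(ranked_page_ids, duplicate_sets):
        """duplicate_sets: iterable of sets of page IDs (cf. FWxx-duplicates-checked.txt)."""
        group_of = {pid: frozenset(s) for s in duplicate_sets for pid in s}
        seen, to_discount = set(), set()
        for pid in ranked_page_ids:
            group = group_of.get(pid, frozenset([pid]))
            if group in seen:
                to_discount.add(pid)  # a duplicate of this page appeared earlier
            seen.add(group)
        return to_discount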