--- WC_TREC_DS ---

http://www.ccc.ipt.pt/~ricardo/datasets/WC_TREC_DS.html

http://www.ccc.ipt.pt/~ricardo/datasets/WC_TREC_DS.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

----

 

SUMMARY

The WC_TREC_DS is a dataset designed to evaluate the relationship between queries and dates (q, di).

It consists of 25 implicit time-sensitive queries selected from the TREC-ts-{2013, 2014} (Guo et al., 2013) collections; 489 web documents containing years, obtained by querying the Bing search engine for each of the 25 queries through the Bing Search API; and 443 distinct (q, di) pairs, where q is the query and di is the candidate year.

Three participants were recruited to evaluate the relevance of the 443 (q, di) pairs using two levels of relevance:

(0) for non-relevant (the candidate date is not relevant for the query or is incorrect);

(1) for relevant (the candidate date is relevant for the query).

 

The assessments were performed in November 2016 and did not involve any payment. Each worker evaluated all 443 (q, di) pairs, resulting in 1329 (q, di) assessments in total. On average, each worker took three hours to complete the task.

To become familiar with the topic, workers were given a very short description of each query. The decision on whether a candidate date is relevant should take into account not only this short description, but also the web texts containing the candidate date. Thus, annotators were asked to determine the relevance not only of the obvious dates, but also of those candidate dates which, despite being less evident, may still be related to the query.

The ground-truth collection comes as a result of a majority voting approach. As such, each candidate date is considered to be relevant if it gets more relevant votes from the workers than non-relevant ones. The resulting ground-truth consists of 443 candidate dates, of which 194 were deemed relevant to the query and 249 non-relevant.
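The majority-voting rule described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name majority_label and the example votes are hypothetical.

```python
from typing import Sequence


def majority_label(votes: Sequence[int]) -> int:
    """Return 1 if relevant votes (1) outnumber non-relevant votes (0), else 0."""
    relevant = sum(1 for v in votes if v == 1)
    non_relevant = sum(1 for v in votes if v == 0)
    return 1 if relevant > non_relevant else 0


# Hypothetical votes from three workers for one (q, di) pair.
print(majority_label([1, 0, 1]))  # prints 1
print(majority_label([0, 0, 1]))  # prints 0
```

With three workers and a two-level scale, a tie between relevant and non-relevant votes cannot occur.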

 

The WC_TREC_DS dataset is an Excel file consisting of the two worksheets described below:

GroundTruth: Table with the workers' relevance decisions for the set of 25 queries.

Column A has the query name.

Column B has the date.

Column C has the relevance decision (2-level scale) obtained by majority voting over the workers' assessments.

Worker's Assessement Summary: Table that gathers the relevance decisions of the 3 workers for the set of 25 queries.

Column A has the query name.

Column B has the date.

Column C has the summary of the workers' relevance decisions when the date is considered to be relevant (1).

Column D has the summary of the workers' relevance decisions when the date is considered to be non-relevant (0).

Columns F - H have the relevance decisions of workers 1 - 3.
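As an illustration of how the GroundTruth layout (query, date, label) might be consumed, the sketch below parses rows in that column order using only the Python standard library. It assumes the worksheet has first been exported to CSV with no header row; that export step, the function name read_ground_truth, and the sample rows are assumptions made for illustration and are not part of the dataset.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, Tuple


def read_ground_truth(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Parse CSV rows of the form: query, date, label (0 or 1)."""
    rows = []
    for record in csv.reader(lines):
        if not record:  # skip blank lines
            continue
        query, date, label = record[0].strip(), record[1].strip(), int(record[2])
        rows.append((query, date, label))
    return rows


# Hypothetical sample rows; the real sheet holds 443 (q, di) pairs.
sample = io.StringIO("example query,2010,1\nexample query,2012,0\n")
pairs = read_ground_truth(sample)
label_counts = Counter(label for _, _, label in pairs)
print(label_counts[1], label_counts[0])  # prints: 1 1
```

On a full export of the sheet, the two counts would be expected to match the 194 relevant and 249 non-relevant candidate dates reported in the summary.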

 

Other datafiles (folders):

Texts: Gathers all 489 web documents containing years, obtained by querying the Bing search engine for each of the 25 queries through the Bing Search API.

JSONFiles: Gathers the relevance for each of the 25 queries, together with the title, text and URL where they appear.
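The files in the JSONFiles folder can be loaded with a few lines of standard-library Python. The sketch below makes no assumption about the internal schema of each file; the function name load_json_folder and the path in the usage comment are hypothetical, and a .json file extension is assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_json_folder(folder: str) -> Dict[str, Any]:
    """Map each *.json file name in `folder` to its parsed content."""
    loaded = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.name] = json.load(handle)
    return loaded


# Usage (hypothetical path to the unzipped folder):
# data = load_json_folder("WC_TREC_DS/JSONFiles")
# print(len(data))
```

Inspecting the keys of one loaded file is a reasonable first step before relying on any particular field name.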

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/datasets/WC_TREC_DS.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).