--- WC_DS ---

http://www.ccc.ipt.pt/~ricardo/datasets/WC_DS.html

http://www.ccc.ipt.pt/~ricardo/datasets/WC_DS.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

Campos, R., Dias, G., Jorge, A. and Nunes, C. (2012). GTE: A Distributional Second-Order Co-Occurrence Approach to Improve the Identification of Top Relevant Dates in Web Snippets. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) Maui, Hawaii, October 29 - November 02, ISBN 978-1-4503-1156-4, pp 2035 - 2039. ACM Press

 

SUMMARY

The WC_DS is a dataset designed for evaluating the relatedness between (queries, dates) and (snippets, dates).

It consists of 42 text queries selected from the 27 categories of Google Insights for Search 2010 and 2011 Webpage trends (closed on September 27, 2012), after removing duplicates, atemporal queries and queries with multiple meanings.

Each query was issued to the Bing search engine in December 2011 through the Bing Web search API, parameterized with the en-US market language parameter, and the top 50 web results were collected. Of the 2100 web snippets retrieved, only those annotated with at least one candidate year term were selected.

The final set consists of 582 distinct web snippets having candidate year dates.
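
The selection step above can be illustrated with a minimal sketch. It assumes that a "candidate year term" is any standalone four-digit number; this is an assumption made for illustration and is not necessarily the annotation procedure used to build the dataset. The names candidate_years and snippets_with_candidates are hypothetical.

```python
import re

# Assumption: a candidate year term is any standalone four-digit number.
# This may over-generate (e.g. "1500 photos"), which is why candidates
# still need a relevance judgment afterwards.
YEAR_CANDIDATE = re.compile(r"\b\d{4}\b")


def candidate_years(text):
    """Return the distinct four-digit tokens in text, in order of first appearance."""
    seen = []
    for token in YEAR_CANDIDATE.findall(text):
        if token not in seen:
            seen.append(token)
    return seen


def snippets_with_candidates(snippets):
    """Keep only the snippets that contain at least one candidate year term."""
    return [s for s in snippets if candidate_years(s)]
```

Under this assumption, a snippet with no four-digit number is discarded, while a snippet such as "The 1st one occurred in 1564" is kept with the single candidate "1564".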

The ground truth was then obtained for this dataset through two human relevance judgments:

  (1) Relatedness between each of the snippets and its respective dates (snippet, date);

  (2) Relatedness between each of the queries and its respective dates (query, date);

 

(1) Relatedness between each of the snippets and its respective dates (snippet, date);

The first judgment was performed on 656 distinct (s, d) pairs, where s is one of the 582 web snippets having dates and d is one of the dates appearing in that snippet.

Each (s, d) pair was assigned a relevance label on a 2-level scale (score 0 or score 1).

An example of this task is given below:

Title: 2011 Haiti Earthquake Anniversary

Snippet: As of 2010 (see 1500 photos), the following major earthquakes have been recorded in Haiti. The 1st one occurred in 1564. 2010 has been a tragic date, however in 2012 Haiti will organize the Carnival…

While there are a few year candidates, only “1564” and “2010” are relevant to the query. “2012” is not query-related, “1500” is not even a date and “2011” may be considered not very relevant. As the task did not prove prone to divergent judgments, we did not apply a multi-annotator scheme.

The final list of judgments consists of 119 (s, d) pairs labeled with score 0, and 537 (s, d) pairs labeled with score 1.

 

(2) Relatedness between each of the queries and its respective dates (query, date);

The second human judgment consists of 235 distinct (q,d) pairs, where q is the query and d is one of the distinct candidate dates extracted from the set of 582 web snippets s. Relevance labels were assigned based on the number of corresponding relevant and irrelevant (s,d) pairs.

An example of this task is given for the pair (avatar movie, 2009), where “avatar movie” is the query and “2009” is a candidate date. In this example we assume that “2009” was found within seven web snippets and that six of the seven (s,d) pairs were classified by the human evaluator as relevant. As such, given that the number of (s,d) pairs classified as relevant is higher than the number of (s,d) pairs classified as irrelevant, (avatar movie, 2009) would be classified as a relevant association.
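
The aggregation rule in this example can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name and the input layout (one (query, date, label) tuple per judged (s,d) pair) are hypothetical, and ties are labeled 0 here as an assumption, since the text only states what happens when relevant pairs outnumber irrelevant ones.

```python
from collections import defaultdict


def aggregate_query_date_labels(judgments):
    """Derive (q,d) labels from (s,d) labels by majority.

    judgments: iterable of (query, date, label) tuples, one per judged
    (s,d) pair, where label is 0 (irrelevant) or 1 (relevant).
    """
    # counts[(query, date)] = [number of 0 labels, number of 1 labels]
    counts = defaultdict(lambda: [0, 0])
    for query, date, label in judgments:
        counts[(query, date)][label] += 1

    # A (q,d) pair is relevant only if relevant (s,d) pairs outnumber
    # irrelevant ones. Assumption: a tie is labeled 0.
    return {pair: 1 if c[1] > c[0] else 0 for pair, c in counts.items()}


# The (avatar movie, 2009) example: seven snippets, six judged relevant.
example = [("avatar movie", "2009", 1)] * 6 + [("avatar movie", "2009", 0)]
labels = aggregate_query_date_labels(example)
```

With six relevant and one irrelevant (s,d) pair, labels maps ("avatar movie", "2009") to 1.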

The final list of judgments consists of 86 (q,d) pairs labeled with score 0, and 149 (q,d) with score 1.

 

The WC_DS dataset is distributed as an Excel file consisting of forty-three spreadsheets.

 

OTHER REFERENCES

More details on this dataset can be found in the following paper:

Campos, R., Dias, G., Jorge, A. and Nunes, C. (2012). Enriching Temporal Query Understanding through Date Identification: How to Tag Implicit Temporal Queries? In Proceedings of the 2nd International Temporal Web Analytics Workshop (TWAW 2012) associated with the 21st International World Wide Web Conference (WWW2012) Lyon, France, 17 April. ISBN 978-1-4503-1188-5, pp 41 - 48. ACM - DL

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/datasets/WC_DS.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).