--- GISQC_DS ---

http://www.ccc.ipt.pt/~ricardo/datasets/GISQC_DS.html

http://www.ccc.ipt.pt/~ricardo/datasets/GISQC_DS.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

Campos, R., Dias, G. & Jorge, A. (2011). What is the Temporal Value of Web Snippets? In Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW2011), associated with the 20th International World Wide Web Conference (WWW2011), pp. 9–16, Hyderabad, India, 28 March. ISSN 1613-0073.

 

SUMMARY

GISQC_DS is a dataset designed for classifying queries in terms of concept and temporal ambiguity. It is also used to analyse the temporal features retrieved for a set of queries.

The dataset consists of 465 queries manually extracted from Google Insights for Search (closed on September 27, 2012), which registered the hottest and rising searches performed worldwide in a given period of time. The collected queries belong to the period from January 2010 to October 2010.

Each query was issued in December 2010 to our meta-search engine VipAccess, parameterized to run over the Yahoo and Bing search engines. We are particularly interested in studying the existence of temporal information in web documents, specifically within web snippets. We therefore extracted dates, in particular year dates in the range [1000 - 2090], from each retrieved snippet, title and URL.
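
The year-date extraction described above can be sketched in Python as follows. This is only an illustration, not the code used to build the dataset: the regular expression, the function name extract_years and the sample search result are assumptions, and the sketch treats any standalone four-digit number between 1000 and 2090 as a year date.

```python
import re

# Assumption: a "year date" is a standalone four-digit number in [1000, 2090].
# 1\d{3} covers 1000-1999, 20[0-8]\d covers 2000-2089, and 2090 is the upper bound.
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20[0-8]\d|2090)\b")


def extract_years(text):
    """Return the candidate years found in text, in order of appearance."""
    return [int(year) for year in YEAR_PATTERN.findall(text)]


# Hypothetical search result, made up for illustration.
result = {
    "title": "BP oil spill timeline",
    "snippet": "The Deepwater Horizon spill began in April 2010.",
    "url": "http://example.com/news/2010/04/oil-spill",
}

# Extract the candidate years separately for the title, snippet and URL.
years_by_field = {field: extract_years(text) for field, text in result.items()}
print(years_by_field)
```

Note that such a simple pattern also accepts four-digit numbers that are not years (prices, identifiers), so a real pipeline would likely need additional filtering.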

Each query is classified under one of 29 pre-defined categories and is also classified in terms of concept ambiguity. For this purpose we rely on the work of Song, Luo, Nie, Yu, & Hon (2009), who propose three possible classes: Ambiguous (queries having more than one meaning, e.g., scorpions), Broad (queries covering a variety of sub-topics, e.g., quotes, which may refer to love quotes, historical quotes, etc.) and Clear (queries with a specific meaning covering a narrow topic, e.g., Bank of America).

Moreover, we have also classified queries in terms of temporal ambiguity, on the assumption that only clear concept queries can be classified. For this purpose we rely on the work of Jones & Diaz (2007), who propose three possible classes, namely ATemporal (queries not sensitive to time, e.g., scorpions animal), Temporal Ambiguous (queries with multiple instances over time, e.g., oil spill) and Temporal Unambiguous (queries that take place in a very concrete time period, e.g., bp oil spill). Each query is temporally classified based on a combination of three basic measures: TSnippets(q), TTitle(q) and TUrl(q), where q is the query. Specifically, TSnippets(q) is the ratio of the number of snippets returned with dates to the total number of snippets returned. TTitle(q) and TUrl(q) are computed similarly over titles and URLs.
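
The three measures can be illustrated with a minimal Python sketch. The function names, the year pattern and the sample results below are assumptions made for illustration; in particular, the sketch assumes that an item "has a date" when it contains a standalone year in [1000, 2090], and that TTitle and TUrl divide by the total number of titles and URLs returned.

```python
import re

# Assumption: an item "has a date" if it contains a standalone year in [1000, 2090].
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20[0-8]\d|2090)\b")


def ratio_with_dates(texts):
    """Fraction of texts containing at least one year; 0.0 for an empty list."""
    if not texts:
        return 0.0
    dated = sum(1 for text in texts if YEAR_PATTERN.search(text))
    return dated / len(texts)


def temporal_measures(results):
    """Compute TSnippets(q), TTitle(q) and TUrl(q) for the results of one query."""
    return {
        "TSnippets": ratio_with_dates([r["snippet"] for r in results]),
        "TTitle": ratio_with_dates([r["title"] for r in results]),
        "TUrl": ratio_with_dates([r["url"] for r in results]),
    }


# Hypothetical results for a single query, made up for illustration.
results = [
    {"title": "BP oil spill 2010", "snippet": "The spill began in April 2010.",
     "url": "http://example.com/news/2010/spill"},
    {"title": "Oil spill facts", "snippet": "Cleanup continued through 2010.",
     "url": "http://example.com/facts"},
    {"title": "Oil spill", "snippet": "An oil spill is the release of petroleum.",
     "url": "http://example.com/wiki/oil-spill"},
    {"title": "Spill response", "snippet": "Response teams were deployed.",
     "url": "http://example.com/response"},
]

measures = temporal_measures(results)
print(measures)
```

How the three values are combined into the final temporal class is not specified here, so the sketch stops at the ratios; see the papers listed under OTHER REFERENCES for the classification itself.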

The GISQC_DS dataset is distributed as an Excel file consisting of four worksheets.

 

OTHER REFERENCES

More details on this dataset can be found in the following papers:

Campos, R., Dias, G. & Jorge, A. (2011). What is the Temporal Value of Web Snippets? In Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW2011), associated with the 20th International World Wide Web Conference (WWW2011), pp. 9–16, Hyderabad, India, 28 March. ISSN 1613-0073.

Campos, R., Jorge, A. & Dias, G. (2011). Using Web Snippets and Query-logs to Measure Implicit Temporal Intents in Queries. In Proceedings of the Query Representation and Understanding Workshop (QRU 2011), associated with the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2011), pp. 13–16, Beijing, China, 28 July.

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/datasets/GISQC_DS.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).