--- GISFD_DS ---

http://www.ccc.ipt.pt/~ricardo/datasets/GISFD_DS.html

http://www.ccc.ipt.pt/~ricardo/datasets/GISFD_DS.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

Campos, R., Dias, G. & Jorge, A. (2011). An Exploratory Study on the Impact of Temporal Features on the Classification and Clustering of Future-Related Web Documents. In L. Antunes and H.S. Pinto (Eds.), Lecture Notes in Artificial Intelligence - Progress in Artificial Intelligence, 15th Portuguese Conference on Artificial Intelligence (EPIA2011), associated with APPIA: Portuguese Association for Artificial Intelligence, Lisbon, Portugal, 10-13 October (Vol. 7026-2011, pp. 581-596). ISBN: 978-3-642-24768-2. DBLP. Springer. Thomson ISI Web of Knowledge. ACM Press.

 

SUMMARY

GISFD_DS is a dataset designed for evaluating the relatedness between texts and future dates, that is, (text, future date) pairs.

To obtain both texts and future dates, we rely on a set of 450 queries manually extracted from Google Insights for Search (closed on September 27, 2012), which registered the hottest and rising searches performed worldwide in a given period of time. We collected queries from the period Jan 2010 – Oct 2010.

Each query was issued in December 2010 to our meta-search engine VipAccess, parameterized to run over the Yahoo and Bing search engines (set to retrieve 100 results per query). We are particularly interested in studying the existence of temporal information in web documents, specifically within web snippets. We therefore extracted dates, specifically years in the range [1000 - 2090], from each retrieved snippet, title and URL.
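
The year extraction step can be illustrated with a minimal sketch. It is not the extraction code used to build the dataset: the regular expression and the function name extract_years are assumptions, and a "year date" is taken here to mean any standalone four-digit number in [1000, 2090].

```python
import re

# Assumption: a year is a four-digit number in [1000, 2090] that is not
# embedded in a longer run of digits (e.g. a phone number or an ID).
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20[0-8]\d|2090)(?!\d)")


def extract_years(text):
    """Return the candidate years found in text, in order of appearance."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


if __name__ == "__main__":
    # Hypothetical snippet and URL, used only for illustration.
    print(extract_years("The 2014 World Cup follows the 2010 edition."))
    print(extract_years("http://example.com/events/2012/olympics"))
```

The same function can be applied to a snippet, a title or a URL, since all three are handled as plain strings.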

Of the 62,842 web snippets retrieved, we kept only those texts containing future dates. Our final collection therefore consists of 508 web snippets, 419 titles and 195 URLs, which were manually classified into three future temporal classes: {informative, scheduled, rumors}. Each of these texts was also tagged as belonging to the near or far future, depending on the dates found.
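
The filtering and near/far tagging described above can be sketched as follows. This is only an illustration under stated assumptions: a year counts as future when it is later than 2010 (the year the queries were issued), the near/far cutoff NEAR_HORIZON_YEARS is a hypothetical value that this description does not specify, and a text with several future years is tagged by the closest one. The thematic classes (informative, scheduled, rumors) were assigned manually and are not reproduced here.

```python
REFERENCE_YEAR = 2010    # queries were issued in December 2010
NEAR_HORIZON_YEARS = 2   # hypothetical cutoff, not taken from the dataset


def future_years(years, reference_year=REFERENCE_YEAR):
    """Keep only the years strictly later than the reference year."""
    return [year for year in years if year > reference_year]


def tag_horizon(years, reference_year=REFERENCE_YEAR,
                near_horizon=NEAR_HORIZON_YEARS):
    """Return 'near', 'far', or None when the text has no future year."""
    future = future_years(years, reference_year)
    if not future:
        return None  # such a text would be discarded from the collection
    # Assumption: the closest future year decides the tag.
    nearest = min(future)
    return "near" if nearest - reference_year <= near_horizon else "far"


if __name__ == "__main__":
    print(tag_horizon([1999, 2011]))  # one past year, one future year
    print(tag_horizon([2050]))        # a distant future year
    print(tag_horizon([2005]))        # no future year at all
```

Changing near_horizon shows how sensitive the near/far split is to the chosen cutoff.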

 

The GISFD_DS dataset is distributed as an Excel file consisting of three spreadsheets.

 

OTHER REFERENCES

More details on this dataset can be found in the following papers:

Dias, G., Campos, R. & Jorge, A. (2011). Future Retrieval: What Does the Future Talk About? Proceedings of the Enriching Information Retrieval Workshop (ENIR2011), associated with the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2011), Beijing, China, July 28.

Campos, R., Dias, G. & Jorge, A. (2011). An Exploratory Study on the Impact of Temporal Features on the Classification and Clustering of Future-Related Web Documents. In L. Antunes and H.S. Pinto (Eds.), Lecture Notes in Artificial Intelligence - Progress in Artificial Intelligence, 15th Portuguese Conference on Artificial Intelligence (EPIA2011), associated with APPIA: Portuguese Association for Artificial Intelligence, Lisbon, Portugal, 10-13 October (Vol. 7026-2011, pp. 581-596). ISBN: 978-3-642-24768-2. DBLP. Springer. Thomson ISI Web of Knowledge. ACM Press.

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/datasets/GISFD_DS.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).