--- Analysis of Future Dates ---

http://www.ccc.ipt.pt/~ricardo/experiments/AnalysisOfFutureDates.html

http://www.ccc.ipt.pt/~ricardo/experiments/AnalysisOfFutureDates.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

Campos, R., Dias, G. & Jorge, A. (2011). An Exploratory Study on the impact of Temporal Features on the Classification and Clustering of Future-Related Web Documents. In L. Antunes and H.S. Pinto (Eds.), Lecture Notes in Artificial Intelligence - Progress in Artificial Intelligence, - 15th Portuguese Conference on Artificial Intelligence (EPIA2011) associated to APPIA: Portuguese Association for Artificial Intelligence Lisbon, Portugal, 10 - 13 October. (Vol. 7026-2011, pp. 581 - 596). ISBN: 978-3-642-24768-2. DBLP. Springer. Thomson ISI Web of Knowledge. ACM Press.

 

SUMMARY

In this research we are particularly interested in mining web resources for future temporal references related to implicit user queries.

We rely on a dataset of 450 queries manually extracted from Google Insights for Search, which registers the hottest and rising searches performed worldwide in a given period of time. The collected queries cover the period from January 2010 to October 2010.

Each query was issued in December 2010 through our meta-search engine VipAccess, configured to run over the Yahoo and Bing search engines (set to retrieve 100 results per query). We are particularly interested in studying the existence of temporal information in web documents, specifically within web snippets. We therefore extracted dates, in particular years in the range [1000 - 2090], from each retrieved snippet, title and URL.
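The year extraction step described above can be sketched as follows. This is only a minimal illustration, not the code used to build the dataset: the regular expression, the function name extract_years and the sample texts are assumptions made for this example, and it treats any standalone four-digit number in [1000 - 2090] as a candidate year.

```python
import re

# Assumption: a "year date" is a standalone 4-digit number in [1000, 2090].
# The lookarounds reject digits that are part of a longer number (e.g. 12345).
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20[0-8]\d|2090)(?!\d)")


def extract_years(text):
    """Return every candidate year in [1000, 2090] found in text, in order."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


# Illustrative inputs only; these are not taken from the dataset.
snippet = "The 2014 World Cup will be held in Brazil; the 2010 edition was in South Africa."
url = "http://example.com/2012/olympics"

print(extract_years(snippet))
print(extract_years(url))
```

The same function can be applied unchanged to snippets, titles and URLs, since it only looks at the surrounding characters of each four-digit run.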

Of the total set of 62,842 retrieved web snippets, we kept only those texts containing future dates. Our final collection (GISFD_DS) therefore consists of 508 web snippets, 419 titles and 195 URLs, which were manually classified into three future temporal classes: informative, scheduled or rumors. Each of these texts was also tagged as belonging to the near or far future, depending on the dates found.
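A rough sketch of the filtering step, keeping only texts that mention a future date, is given below. It is an illustration under stated assumptions rather than the procedure actually used: "future" is taken here to mean any year strictly greater than 2010 (the year the queries were issued), and the names REFERENCE_YEAR, future_years and keep_future_texts, as well as the sample texts, are hypothetical.

```python
import re

# Assumption: candidate years are standalone 4-digit numbers in [1000, 2090].
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20[0-8]\d|2090)(?!\d)")

# Assumption: the queries were issued in December 2010, so a year counts as
# "future" when it is strictly greater than 2010.
REFERENCE_YEAR = 2010


def future_years(text, reference_year=REFERENCE_YEAR):
    """Return the years mentioned in text that lie after reference_year."""
    years = (int(match) for match in YEAR_PATTERN.findall(text))
    return [year for year in years if year > reference_year]


def keep_future_texts(texts, reference_year=REFERENCE_YEAR):
    """Keep only the texts that mention at least one future year."""
    return [text for text in texts if future_years(text, reference_year)]


# Illustrative inputs only; these are not taken from the dataset.
texts = [
    "Olympic Games London 2012 tickets go on sale",
    "Review of the 2009 season",
    "Founded in 1998, the company plans an expansion by 2015",
]
kept = keep_future_texts(texts)
print(kept)
```

The manual classification into informative, scheduled or rumors, and the near/far tagging, are not reproduced here because their criteria are not specified in this file.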

In detail, we studied the distribution of future dates in order to understand how these temporal features impact the classification and clustering of future-related content. Experiments were performed with the Weka software, using Naive Bayes, Multinomial Naive Bayes, K-NN, Weighted K-NN and Multi-Class SVM for the classification task and K-Means for the clustering task.

The GISFD_Experiment is an Excel file consisting of thirteen spreadsheets built on the GISFD_DS dataset. Each one is described below:

 

OTHER REFERENCES

More details on this dataset can be found in the following papers:

Dias, G, Campos, R. & Jorge, A. (2011). Future Retrieval: What Does the Future Talk About? Proceedings of the Enriching Information Retrieval Workshop (ENIR2011) associated to the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2011), Beijing, China, July 28.

Campos, R., Dias, G. & Jorge, A. (2011). An Exploratory Study on the impact of Temporal Features on the Classification and Clustering of Future-Related Web Documents. In L. Antunes and H.S. Pinto (Eds.), Lecture Notes in Artificial Intelligence - Progress in Artificial Intelligence, - 15th Portuguese Conference on Artificial Intelligence (EPIA2011) associated to APPIA: Portuguese Association for Artificial Intelligence Lisbon, Portugal, 10 - 13 October. (Vol. 7026-2011, pp. 581 - 596). ISBN: 978-3-642-24768-2. DBLP. Springer. Thomson ISI Web of Knowledge. ACM Press.

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/experiments/AnalysisOfFutureDates.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).