--- AOL_DS ---

http://www.ccc.ipt.pt/~ricardo/experiments/AOL_DS.html

http://www.ccc.ipt.pt/~ricardo/experiments/AOL_DS.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

Campos, Ricardo (2011). Analysis of Temporal Data in Explicit Temporal Queries. AOL dataset (AOL_DS). http://www.ccc.ipt.pt/~ricardo/experiments/AOL_DS.html

 

SUMMARY

In this research we are particularly interested in studying the use of dates in the formulation of queries.

To this end, we use a dataset of 21,011,240 queries collected from 650,000 users over three months (1 March 2006 to 31 May 2006) of activity on the AOL search engine.

Of the initial collection of 21,011,240 queries, 10,154,742 remained after removing duplicate entries. Over this collection, we ran a rule-based model to detect only those queries containing year dates, specifically years in the range [1000 - 2090].
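The rule-based model itself is not reproduced here. As a rough illustration only, the following Python sketch shows one way such a filter could look. The regular expression, the function names and the duplicate criterion (identical text after trimming and lowercasing) are assumptions made for this example, not the rules actually used to build AOL_DS.

```python
import re
from typing import Iterable, List

# Assumed rule: a standalone four-digit number between 1000 and 2090.
# The lookarounds reject digits embedded in longer numbers (e.g. ZIP codes).
YEAR_RE = re.compile(r"(?<!\d)(1\d{3}|20[0-8]\d|2090)(?!\d)")


def find_years(query: str) -> List[int]:
    """Return every candidate year in [1000, 2090] found in the query."""
    return [int(match) for match in YEAR_RE.findall(query)]


def explicit_temporal_queries(queries: Iterable[str]) -> List[str]:
    """Drop duplicate queries, then keep those with at least one candidate year.

    Duplicates are assumed to be queries with identical text after
    trimming whitespace and lowercasing.
    """
    seen = set()
    kept = []
    for query in queries:
        key = query.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        if find_years(key):
            kept.append(query)
    return kept
```

A purely numeric rule of this kind also accepts inputs such as "form 1040" or "1600 pennsylvania avenue", which is the sort of false positive a manual check of the results would have to catch.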

We ended up with 143,590 explicit temporal queries, i.e., 1.41% of the deduplicated collection, very close to the 1.5% reported by (Nunes, Ribeiro, & David, 2008). Since our purpose is to classify the queries under 29 pre-defined categories, we selected a representative sample to make the classification feasible. Based on (Barbetta, Reis, & Bornia, 2004), we considered a sample of n=601 queries with a maximum tolerated sampling error of E=4% at a confidence level of 95%.
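The figures above can be checked with a short calculation. The sketch below assumes the common sample-size formula for a proportion, n = z^2 * p * (1 - p) / E^2, with z = 1.96 for 95% confidence and the conservative choice p = 0.5, and no finite-population correction. These assumptions happen to reproduce n = 601, but the exact formula taken from the cited reference is not stated in this description.

```python
import math


def sample_size(z: float, p: float, error: float) -> int:
    """Sample size for estimating a proportion, rounded up to a whole query."""
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)


# Assumed inputs: z = 1.96 (95% confidence), p = 0.5 (worst case), E = 4%.
n = sample_size(z=1.96, p=0.5, error=0.04)

# Share of explicit temporal queries among the deduplicated queries.
temporal_share = 143_590 / 10_154_742 * 100

print(n)                          # 601
print(round(temporal_share, 2))   # 1.41
```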

Of the 601 queries, 87 (i.e., 14.14% of the sample) were wrongly marked as dates by our rule-based model. This happens mostly in the categories Computer & Electronics, URLs, Travel & Maps (e.g., streets with numbers) and Finance & Insurance (form numbers).

We can also observe that, when it comes to explicit temporal queries, users are mostly interested in the categories Automotive, Entertainment, Sports, Business & Economics and News & Events.

The AOL_DS dataset is an Excel file consisting of six spreadsheets.

 

OTHER REFERENCES

More details on this dataset can be found in the following paper:

Campos, R., Dias, G. & Jorge, A. (2011). What is the Temporal Value of Web Snippets? In Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW2011), associated with the 20th International World Wide Web Conference (WWW2011), pp. 9-16, Hyderabad, India, 28 March, ISSN 1613-0073.

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/experiments/AOL_DS.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).