--- GTE ---

http://www.ccc.ipt.pt/~ricardo/experiments/GTE.html

http://www.ccc.ipt.pt/~ricardo/experiments/GTE.zip (for downloading data)


DATASET REFERENCE

This dataset may be used for any research purpose, provided that the following reference is cited:

Campos, R., Dias, G., Jorge, A. and Nunes, C. (2012). GTE: A Distributional Second-Order Co-Occurrence Approach to Improve the Identification of Top Relevant Dates in Web Snippets. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012), Maui, Hawaii, October 29 - November 02, ISBN 978-1-4503-1156-4, pp 2035 - 2039. ACM Press.

 

SUMMARY

In this experiment, we compare a set of association measures for assessing the similarity between a query and a date. We rely on the WC_DS and WC_TREC_DS datasets.

WC_DS consists of 42 text queries and 235 distinct (q,d) pairs, where q is the query and d the date, while WC_TREC_DS consists of 25 queries and 443 distinct (q,d) pairs. Each (q,d) pair was manually annotated on a 2-level scale:

More details on the datasets can be found on the WC_DS and WC_TREC_DS web pages.

The following sections detail the experiments over the WC_DS and WC_TREC_DS datasets.

 

Experiments over the WC_DS dataset:

In this experiment, we aim to compare GTE against a set of association measures on top of a collection made of web snippets. The results can be found in three different files: GTE_Max_Min_WC_DS; GTE_Mean_WC_DS; GTE_Median_WC_DS. The best results occur in GTE_Median_WC_DS, so the following description is based on this file.

Although the computation of GTE is direct for the first-order metrics (SCP, PMI, DICE, etc.), it requires certain configurations for the InfoSimba second-order association measure. Namely:

(1) the first order association measure to use with InfoSimba

(2) the five possible context vector representations for the (q,d) pair
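
As an illustration of the first-order metrics mentioned above, the following is a minimal sketch of SCP, PMI and DICE computed from co-occurrence counts. It uses the common textbook definitions, which are assumed here and may differ in detail from the variants used in the experiments; the function name and count parameters are hypothetical, and all counts other than n_xy are assumed to be positive.

```python
import math


def first_order_measures(n_xy, n_x, n_y, n_total):
    """Return SCP, PMI and DICE for two items, given co-occurrence counts.

    n_xy:    number of texts containing both items
    n_x:     number of texts containing the first item
    n_y:     number of texts containing the second item
    n_total: total number of texts
    (Hypothetical parameter names; textbook definitions are assumed.)
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total

    # Symmetric Conditional Probability: P(x,y)^2 / (P(x) * P(y))
    scp = (p_xy ** 2) / (p_x * p_y)
    # Pointwise Mutual Information; -inf marks items that never co-occur
    pmi = math.log2(p_xy / (p_x * p_y)) if n_xy > 0 else float("-inf")
    # Dice coefficient: 2 * f(x,y) / (f(x) + f(y))
    dice = 2 * n_xy / (n_x + n_y)

    return {"SCP": scp, "PMI": pmi, "DICE": dice}


if __name__ == "__main__":
    # Example: a query seen in 20 snippets, a date in 40, both together in 10
    print(first_order_measures(n_xy=10, n_x=20, n_y=40, n_total=100))
```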

Moreover, it is also important to define the selection criterion used to choose the set of words and/or dates that should be part of the contextual vector representation. For this purpose, two inter-related factors should be considered:

(3) the size of the contextual vector, denoted N

(4) the threshold, T, which decides whether the contextual vector should take as input all the terms or just those having a similarity value higher than T.
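
The interplay of N and T can be sketched as follows. This is only an illustration under stated assumptions: candidates are ranked by decreasing similarity before the size limit is applied, the threshold comparison is strict, and None stands for the unbounded size; the function and parameter names are hypothetical and are not taken from the experiments.

```python
def select_context(similarities, n=None, t=0.0):
    """Pick the words and/or dates that form a contextual vector.

    similarities: dict mapping a word or date to its similarity value
    n: maximum vector size; None stands for the unbounded setting
    t: threshold; only items with similarity strictly above t are kept
    (Ranking by similarity and the strict comparison are assumptions.)
    """
    # Apply the threshold T first
    kept = [(item, s) for item, s in similarities.items() if s > t]
    # Rank the surviving items from most to least similar
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # Apply the size limit N, if any
    return kept if n is None else kept[:n]


if __name__ == "__main__":
    sims = {"a": 0.9, "b": 0.2, "c": 0.5, "2010": 0.7}
    print(select_context(sims, n=2, t=0.3))
```

Note that with a strict comparison, T = 0 drops items whose similarity is exactly zero; whether that matches the "all the terms" setting is an assumption of this sketch.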

Based on this, we performed a set of experiments with different sizes N and different threshold values T in order to find the optimal combination of N and T. For this purpose, we limited the parameters to the ranges 5 <= N <= ∞ and 0 <= T <= 0.9 and combined them as follows: {T0.0N5, T0.0N10, T0.0N20, T0.0N+∞,..., T0.9N5, T0.9N10, T0.9N20, T0.9N+∞}, giving rise to forty-four spreadsheets arranged as follows:

More details on each of these five columns can be found on the WC_DS dataset web page.

From column E onwards, we can find each one of the different association measures (and corresponding values). Columns marked in green contain values for measures proposed by us, whereas columns marked in red contain baseline measures.

 

The best result occurs in the spreadsheet named "_Thresold_005N100", in column AQ, for "IS_(WD;WD)_DICE_Median": its point-biserial correlation coefficient of 0.80 shows the highest agreement with the human annotators of all the association measures. IS_(WD;WD)_DICE_Median is known as the BGTE (Best GenTempEval).
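
For readers unfamiliar with the point-biserial correlation coefficient, the following minimal sketch shows how it relates a binary annotation to a continuous similarity value. It assumes the two annotation levels are encoded as 0 and 1 and that both levels occur in the data; the function name and the toy values are hypothetical and are not taken from the spreadsheets.

```python
import math


def point_biserial(labels, scores):
    """Point-biserial correlation between binary labels and real scores.

    labels: sequence of 0/1 annotations (assumed encoding)
    scores: sequence of similarity values, same length as labels
    """
    n = len(labels)
    group1 = [s for l, s in zip(labels, scores) if l == 1]
    group0 = [s for l, s in zip(labels, scores) if l == 0]
    n1, n0 = len(group1), len(group0)

    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(scores) / n
    # Population standard deviation of all scores
    std_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)

    # r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2)
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / (n * n))


if __name__ == "__main__":
    # Toy example: two pairs annotated 1 and two annotated 0
    print(point_biserial([1, 1, 0, 0], [0.9, 0.7, 0.3, 0.1]))
```

The value is numerically the same as the Pearson correlation computed with the 0/1 labels as one of the two variables, so a coefficient close to 1 means that pairs annotated at the higher level tend to receive higher similarity values.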

A summary of all the results can be found in SummaryThresholds.

The GenTempEval_Median Excel_WC_DS file consists of three additional spreadsheets described below:

 

Experiments over the WC_TREC_DS dataset:

In this experiment, we aim to compare GTE against a set of association measures on top of a collection made of web documents. The results can be found in GTE_Median_WC_TREC_DS.

The GTE_Median_WC_TREC_DS file consists of the following datasheets:

 

OTHER REFERENCES

More details on this dataset can be found in the following papers:

Campos, R., Dias, G., Jorge, A. and Nunes, C. (2012). Enriching Temporal Query Understanding through Date Identification: How to Tag Implicit Temporal Queries? In Proceedings of the 2nd International Temporal Web Analytics Workshop (TWAW 2012) associated with the 21st International World Wide Web Conference (WWW2012), Lyon, France, 17 April. ISBN 978-1-4503-1188-5, pp 41 - 48. ACM Press.

Campos, R., Dias, G., Jorge, A. and Nunes, C. (2012). GTE: A Distributional Second-Order Co-Occurrence Approach to Improve the Identification of Top Relevant Dates in Web Snippets. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012), Maui, Hawaii, October 29 - November 02, ISBN 978-1-4503-1156-4, pp 2035 - 2039. ACM Press.

 

DOWNLOAD

http://www.ccc.ipt.pt/~ricardo/experiments/GTE.zip

 

MORE INFO

If you have any further questions, please contact Ricardo Campos (ricardo.campos@ipt.pt).