
Recommending Citations with Translation Model

Yang Lu, Jing He, Dongdong Shan, Hongfei Yan

Department of Computer Science and Technology, Peking University, China

{luyang,hj,sdd,yhf}@net.pku.

ABSTRACT

Citation recommendation helps an author find the papers or books that can support the material she is writing about. It is a challenging problem since the vocabulary used in the content of papers and in the citation contexts is usually quite different. To address this problem, we propose to use a translation model, which can bridge the gap between two heterogeneous languages. We conduct an experiment and find that the translation model provides much better citation candidates than the state-of-the-art methods.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval

General Terms

Algorithms, Experimentation

Keywords

Citation Recommendation, Translation Model

1. INTRODUCTION

A citation recommendation system is very useful for an author who is writing a paper or a book. It is common for an author to forget the specific paper containing some material she is writing about. The material may be a method, some statistical data, or a conclusion from other researchers. If the author misses the detailed citation for such material, the reviewers or editors may ask her to add it. Figure 1 shows a Wikipedia example, in which an editor has labeled "citation needed" at the location where a citation is required (such a location is usually called a citation placeholder). For authors and editors, it would be easier if a citation recommendation system could provide some candidate papers containing the required information about the material.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'11, October 24–28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM 978-1-4503-0717-8/11/10 ...$10.00.

Figure 1: A Citation Needed Example on Wikipedia

A quick solution is to take the context of the citation placeholder as a query and search the papers' content with some state-of-the-art IR methods. Unfortunately, this kind of method does not work very well. The main reason is that the vocabulary used in the citation context is not the same as that used in the paper's content. For example, the abstract usually contains only very brief information about a paper, while the full text may contain too much noise. In other cases, the jargon may differ across research fields and/or time periods.

Instead of the paper's own content, some researchers leverage the paper's cited contexts to represent the paper [3]. The assumption of this method is that the cited contexts of one paper are similar to each other. However, one difficulty with this method is that the cited contexts of a paper are usually sparse due to the long-tail phenomenon. Some recent papers may even have no citations yet. Therefore, it is very challenging to retrieve papers with very few citations.

In this paper, we address the citation recommendation problem with the translation model. The translation model was originally used to translate text from one language into another, and it has also been applied to retrieving documents whose content is heterogeneous to the queries (e.g., cross-language retrieval). In the citation recommendation problem, the citation contexts and the content of papers are quite different in nature, and the translation model can bridge the gap between them. We conduct an experiment on a paper collection and find that the translation model performs better than the state-of-the-art methods.

The contributions of this paper are as follows:

1. We propose the translation model for the citation recommendation problem.

2. We conduct an experiment to verify that the translation model works better than two baseline methods.

3. We compare the performance of translation models on different fields of the paper content, finding that the abstract-based model performs better.

2. RELATED WORK

In this section, we introduce the related work on citation recommendation and translation models.

2.1 Citation Recommendation

Citation recommendation is the research problem of recommending citation candidates for a manuscript or for a citation placeholder in a manuscript. Shaparenko et al. [8] proposed to employ a language model to compare the citation context and the source paper content, but this usually leads to low recall due to mismatched vocabulary. He et al. [3] employed other citation contexts to represent a paper. In addition, they also considered the other citation contexts in the same manuscript as the query citation placeholder. Some researchers [10, 6] proposed to use topic models to predict whether there should be a link (citation relation) between two documents.

Besides the text information, some researchers also make use of the citation information for this task. These methods generally require that the manuscript already contains some other citations. McNee et al. [5] employed the collaborative filtering technique to make use of the existing citation information. They assumed that documents frequently co-cited with the existing citations are likely to be cited by the current manuscript. Zhou et al. [13] proposed a semi-supervised learning method to spread the citation relation over multiple graphs (e.g., the paper-paper citation graph, the paper-author writing graph, etc.). Some researchers [9, 11] combined the text information and partial citation information to recommend citations. In this paper, we focus on modeling the text information for citation recommendation, and our method can potentially be combined with the existing citation-based methods.

2.2 Translation Model

The translation model was introduced for information retrieval by Berger et al. [1]. The main idea is that it can translate the words in the documents into the query terms, so it can bridge the vocabulary gap between the query and the document. Because of this translation mechanism, it can naturally be applied to cross-language retrieval [7, 4] and other applications where queries and documents use different vocabularies. Xue et al. [12] employed it to find the relevant QA pair for a question in natural language. Gao et al. [2] used it to bridge the vocabulary gap between the Web search query and the page title. There is also a vocabulary gap between the paper content and the citation context, so it is appropriate to use the translation model for the citation recommendation problem.

3. TRANSLATION MODEL FOR CITATION RECOMMENDATION

In this section, we first introduce the method for estimating the translation model, and then apply the translation model for ranking the papers.

3.1 Translation Model Estimation

The translation model defines the probability of translating a word in one language into a word in another language. For citation recommendation, we assume the languages used in the citation contexts and in the papers' content (document) are different, so we need to bridge these two languages by translating one word in the document (wd) into one word in the citation context (wc).

To estimate the translation model for the citation recommendation problem, we need training data consisting of a set of citation context and document pairs T = {(c, d)}, in which the citation context c references the document d. The translation model can be estimated by maximizing the likelihood of the citation contexts given their corresponding documents:

t* = arg max_t ∏_{(c,d)∈T} P(c|d, t)    (1)

The translation model in this formula can be estimated by the EM algorithm. In practice, the translation model is usually heuristically approximated by a simpler version, which can be computed much more efficiently [2]:

P(wc|wd) = count(wc, wd) / count(wd)    (2)

where count(wc, wd) is the number of (c, d) pairs in the training data T in which the document d contains the word wd and the context c contains the word wc, and count(wd) is the number of (c, d) pairs in which the document d contains the word wd.
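The counts in Eq. 2 can be collected in a single pass over the training pairs. The sketch below is a minimal illustration (all function and variable names are our own, not from the paper), counting pair-level co-occurrence exactly as defined above:

```python
from collections import defaultdict

def estimate_translation_probs(pairs):
    """Approximate P(w_c | w_d) per Eq. 2.

    pairs: iterable of (context_words, doc_words) token lists.
    Returns a nested dict: probs[wd][wc] = count(wc, wd) / count(wd).
    """
    co_count = defaultdict(lambda: defaultdict(int))  # count(wc, wd)
    d_count = defaultdict(int)                        # count(wd)
    for context, document in pairs:
        ctx_vocab, doc_vocab = set(context), set(document)
        for wd in doc_vocab:
            d_count[wd] += 1              # pair whose document contains wd
            for wc in ctx_vocab:
                co_count[wd][wc] += 1     # pair with wd in doc and wc in context
    return {wd: {wc: n / d_count[wd] for wc, n in targets.items()}
            for wd, targets in co_count.items()}
```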

The translation model usually does not work very well due to the small self-translation probability P(w|w). In the citation recommendation problem, a small self-translation probability may underestimate the score of a document containing the words in the citation context. Xue et al. [12] proposed a method to boost the self-translation probability and showed improvement in retrieval performance:

Pself(wc|wd) = α · 1(wc = wd) + (1 - α) · P(wc|wd)    (3)

where 1(wc = wd) is an indicator function that outputs 1 when the word wc is the same as the word wd, and P(wc|wd) is the pure translation probability calculated by Eq. 2. The model without self-translation boosting is a special case of Eq. 3 with α = 0.
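The boosted probability of Eq. 3 is a simple interpolation. A minimal sketch (`alpha` stands in for the boosting weight; the function name is our own):

```python
def self_boosted_prob(wc, wd, trans_probs, alpha=0.1):
    """Self-translation boosting per Eq. 3.

    trans_probs is the nested dict of pure translation probabilities
    from Eq. 2; alpha is the boosting weight (default is illustrative).
    """
    p = trans_probs.get(wd, {}).get(wc, 0.0)
    indicator = 1.0 if wc == wd else 0.0
    return alpha * indicator + (1.0 - alpha) * p
```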

One concern with the translation model is that it can be very expensive, both in the storage of the word-to-word translation probability matrix and in online retrieval processing. One heuristic is to store only the top K translated words for each word, so that the storage and processing complexity is drastically reduced. We test the sensitivity to the parameter K in our experiment.
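The top-K pruning heuristic can be sketched as follows (a hypothetical helper, assuming the translation matrix is stored as a nested dict of probabilities as in the earlier sketches):

```python
import heapq

def prune_top_k(trans_probs, k):
    """Keep only the K most probable translations per source word,
    shrinking both storage and per-query processing cost."""
    return {wd: dict(heapq.nlargest(k, targets.items(), key=lambda kv: kv[1]))
            for wd, targets in trans_probs.items()}
```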

3.2 Documents Ranking

In this section, we utilize the translation model derived in the previous section for ranking the documents. Denote the citation context containing m words as c = {c1, . . . , cm}, and the source content containing n words as d = {d1, . . . , dn}. Taking the citation context as the query, a document d can be scored by the query likelihood model:

P(c|d) = ∏_{ci∈c} P(ci|d)    (4)

where P(ci|d) is the likelihood of word ci under the document language model of d.

In practice, the query likelihood is generally calculated by smoothing the maximum likelihood estimate of the document with the collection language model:

P(ci|d) = λ · Pml(ci|C) + (1 - λ) · Pml(ci|d)    (5)


where Pml(w|C) and Pml(w|d) are the maximum likelihood estimates of a word in the collection and in a document d, respectively.
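The smoothed query likelihood of Eqs. 4 and 5 can be sketched in a few lines (names and the default mixture weight are illustrative assumptions; the score is computed in log space to avoid underflow from the product in Eq. 4):

```python
import math
from collections import Counter

def ql_score(context, doc, coll_counts, coll_total, lam=0.1):
    """Smoothed query-likelihood log-score per Eqs. 4-5.

    context, doc: token lists; coll_counts/coll_total: collection
    term counts and total length; lam: collection mixture weight.
    """
    doc_counts, doc_len = Counter(doc), len(doc)
    score = 0.0
    for ci in context:
        p_coll = coll_counts.get(ci, 0) / coll_total  # Pml(ci|C)
        p_doc = doc_counts[ci] / doc_len              # Pml(ci|d)
        score += math.log(lam * p_coll + (1 - lam) * p_doc)
    return score
```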

Since the vocabularies used in the citation contexts and the documents are usually different, deriving the word ci in the context directly from the document d's language model may not perform very well; the translation model can bridge the gap as follows:

P(ci|d) = λ · Pml(ci|C) + (1 - λ) · Σ_{dj∈d} P(ci|dj) Pml(dj|d)    (6)

where P(ci|dj) is the translation probability from word dj to word ci calculated by Eq. 3. This model takes into account both the document generative probability and the translation probability.

4. EXPERIMENTS AND RESULTS

In this section, we describe our experiment setup and report the results.

4.1 Experiment Setup

Dataset

The dataset is a collection of 5,183 papers from 1988 to 2010, mainly in the information retrieval and text mining areas. For each paper, we extract all citation placeholders and the corresponding citation contexts (three surrounding sentences). In the collection, 1,499 papers have been cited by other papers, and 6,166 citation contexts reference papers in the collection. In the experiment, all papers in the collection compose the document collection, and a randomly selected set of 200 citation contexts is used as queries.

Evaluation Metric

For each citation placeholder, we search for the papers that may be referenced at that placeholder. Each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use Mean Average Precision (MAP) as the evaluation metric; for a single query, the average precision of the ranked list is

AP(d1, . . . , dN) = ( Σ_i R(d_i) · (1/i) Σ_{j≤i} R(d_j) ) / Σ_i R(d_i)    (7)

where R(di) is a binary function indicating whether document di is relevant or not, and MAP is the mean of AP over all test queries. For our problem, the papers actually cited at the citation placeholder are judged as the relevant documents.
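Average precision as written in Eq. 7 can be computed as follows (a sketch with our own naming; the denominator counts the relevant documents in the ranked list, per the formula, and MAP is then the mean of this value over all queries):

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list per Eq. 7.

    ranked: list of document ids in rank order;
    relevant: set of relevant document ids (R(d_i) = membership).
    """
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i  # precision at rank i
    return precision_sum / hits if hits else 0.0
```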

Compared Methods

We use two baselines: the query likelihood language model and the context-aware relevance model [3]. Both the query likelihood language model and our translation model can consider either the abstract or the full text as the document content, so we have two versions of each of these two models. For the translation model, we use the self-translation boosting method described in Section 3.1, and we also test the effect of the self-translation boosting in our experiment. The compared methods are described in Table 1.

4.2 Results

Table 2 shows the best result for each compared method on our dataset. The parameter tuning results would be presented in Section 4.3. The first row shows the methods,

Table 1: Citation Recommendation Methods

Name  Description
LMa   query likelihood language model on abstract
LMf   query likelihood language model on full text
CRM   context-aware relevance model by He et al. [3]
TMa   translation model on abstract
TMf   translation model on full text
TMsa  translation model with self boosting on abstract
TMsf  translation model with self boosting on full text

Table 2: Performance of the Citation Recommendation Methods

LMa    LMf    CRM    TMa    TMf    TMsa   TMsf
0.122  0.211  0.238  0.519  0.494  0.571  0.535

and the second row shows the average MAP scores across the test queries. The difference between each pair of methods is significant with p-value < 0.05 by a paired t-test.

There are some interesting findings from the results. First, the translation models perform much better than the language model and the context-aware relevance model. This indicates the effectiveness of using translation models for the citation recommendation problem.

Second, the translation models on both the abstract and the full text are better than the corresponding query likelihood language models. This confirms that the vocabularies of the citation context and the source content are different, so such a "translation" is needed.

Third, the translation model (with or without self-translation boosting) performs better on the abstract than on the full text. On the contrary, the language model performs better on the full text than on the abstract. The abstract may miss some information in a paper, and the citation context may discuss that missing information. The language model cannot generate the missing information from the abstract, but the translation model can still generate it through translation. The full text contains all the information that can help generate the citation context, so it benefits the language model, but it may harm the translation model due to the noise introduced into the translation matrix.

Finally, we validate that the context-aware relevance model outperforms the language model, but it is still not as good as the translation model.

4.3 Parameter Tuning

In this section, we report the results of parameter tuning. There are three parameters in our translation model: λ ∈ (0, 1] controls the mixture weight of the global collection model, α ∈ [0, 1] controls the strength of the self-translation boosting, and the parameter K controls the number of translated words stored for each word. We sweep the ranges of these three parameters (for K, we sweep over (0, 1000]), and we report the results for each parameter while the other parameters are set to their optimal values.

In the experiment, we find that performance increases as λ becomes smaller and stays stable once λ is quite small, so we simply set λ to a very small value (10^-5 in our experiment) to prevent zero probabilities in some cases. This indicates that collection model smoothing is almost useless in this scenario. One possible explanation is that the word probability has been implicitly smoothed by the translation model.

We tune α in the range from 0 to 1, finding that the best performance is obtained when α is between 0 and 0.2. Figure 2 presents the results for α in this range. From the results, we find that a larger self-translation boost is needed for the model on the abstract. The words in the abstract are generally more important than those in the full text, so abstract words should be translated to themselves more often. On the contrary, the full-text words contain more noise, so they are better represented by their translated forms.

Figure 2: Parameter Tuning on Self Translation Probability Boosting

Figure 3 presents the results for different numbers of translated words. Note that the number of translated words K also affects the efficiency of the model: a larger K makes processing slower. The optimal K is 400 for the full-text translation model and 800 for the abstract translation model. The abstract of a paper is usually short and brief, so it needs to be translated to more words that may appear in the citation context. From the results, we find that the performance is quite robust to the choice of K.

Figure 3: Parameter Tuning on Translated Word Number

5. CONCLUSION AND FUTURE WORK

We propose to use the translation retrieval model for the citation recommendation problem. The problem is challenging due to the vocabulary gap between the citation context and the paper's content, and the translation model can bridge this gap well. In the future, we can extend the approach to consider more features, such as the other citations in the paper and the paper's global content, for better citation recommendation.

Acknowledgement

This work has been partially supported by HGJ 2010 Grant 2011ZX01042-001-001 and NSFC with Grant No.61073082, 60933004.

6. REFERENCES

[1] Berger, A., and Lafferty, J. Information retrieval as statistical translation. In Proceedings of SIGIR '99 (1999), pp. 222–229.

[2] Gao, J., He, X., and Nie, J.-Y. Clickthrough-based translation models for web search: from word models to phrase models. In Proceedings of CIKM '10 (2010), pp. 1139–1148.

[3] He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, L. Context-aware citation recommendation. In Proceedings of WWW '10 (2010), pp. 421–430.

[4] Lavrenko, V., Choquette, M., and Croft, W. B. Cross-lingual relevance models. In Proceedings of SIGIR '02 (2002), pp. 175–182.

[5] McNee, S. M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S. K., Rashid, A. M., Konstan, J. A., and Riedl, J. On the recommending of citations for research papers. In Proceedings of CSCW '02 (2002), pp. 116–125.

[6] Nallapati, R. M., Ahmed, A., Xing, E. P., and Cohen, W. W. Joint latent topic models for text and citations. In Proceedings of KDD '08 (2008), pp. 542–550.

[7] Nie, J.-Y., Simard, M., Isabelle, P., and Durand, R. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In Proceedings of SIGIR '99 (1999), pp. 74–81.

[8] Shaparenko, B., and Joachims, T. Identifying the original contribution of a document via language modeling. In Proceedings of SIGIR '09 (2009), pp. 696–697.

[9] Strohman, T., and Croft, W. B. Efficient document retrieval in main memory. In Proceedings of SIGIR '07 (2007), pp. 175–182.

[10] Tang, J., and Zhang, J. A discriminative approach to topic-based citation recommendation. In Proceedings of PAKDD '09 (2009), pp. 572–579.

[11] Torres, R., McNee, S. M., Abel, M., Konstan, J. A., and Riedl, J. Enhancing digital libraries with TechLens+. In Proceedings of JCDL '04 (2004), pp. 228–236.

[12] Xue, X., Jeon, J., and Croft, W. B. Retrieval models for question and answer archives. In Proceedings of SIGIR '08 (2008), pp. 475–482.

[13] Zhou, D., Zhu, S., Yu, K., Song, X., Tseng, B. L., Zha, H., and Giles, C. L. Learning multiple graphs for document recommendations. In Proceedings of WWW '08 (2008), pp. 141–150.

