Discovering Attribute and Entity Synonyms for Knowledge Integration and Semantic Web Search

Hamid Mousavi, Shi Gao, Carlo Zaniolo
Computer Science Department, UCLA
Los Angeles, USA
hmousavi@cs.ucla.edu, gaoshi@cs.ucla.edu, zaniolo@cs.ucla.edu

Technical Report #130013

Abstract-- There is a growing interest in supporting semantic search on knowledge bases such as DBpedia, YaGo, FreeBase, and other similar systems, which play a key role in many semantic web applications. Although the standard RDF format ⟨subject, attribute, value⟩ is often used by these systems, the sharing of their knowledge is hampered by the fact that various synonyms are frequently used to denote the same entity or attribute-- actually, even an individual system may use alternative synonyms in different contexts, and polynyms also represent a frequent problem. Recognizing such synonyms and polynyms is critical for improving the precision and recall of semantic search. Most previous efforts in this area have focused on entity synonym recognition, whereas attribute synonyms were neglected, and so was the use of context to select the appropriate synonym. For instance, the attribute `birthdate' can be a synonym for `born' when it is used with a value of type `date'; but if `born' comes with values which indicate places, then `birthplace' should be considered its synonym. Thus, context is critical for finding more specific and accurate synonyms.

In this paper, we propose new techniques to generate context-aware synonyms for the entities and attributes that we are using to reconcile knowledge extracted from various sources. To this end, we propose the Context-aware Synonym Suggestion System (CS3), which learns synonyms from text by using our NLP-based text mining framework, called SemScape, and also from existing evidence in the current knowledge bases. Using CS3 and our previously proposed knowledge extraction system IBminer, we integrate some of the publicly available knowledge bases into one of superior quality and coverage, called IKBstore.

I. INTRODUCTION

The importance of knowledge bases in semantic-web applications has motivated the endeavors of several important projects that have created the public-domain knowledge bases shown in Table I. The project described in this paper seeks to integrate and extend these knowledge bases into a more complete and consistent repository named Integrated Knowledge Base Store (IKBstore). IKBstore will provide much better support for advanced web applications, and in particular for user-friendly search systems that support Faceted Search [5] and By-Example Structured Queries [6]. Our approach to achieving this ambitious goal involves four main tasks:

A. Integrating existing knowledge bases by converting them into a common internal representation and storing them in IKBstore.

B. Completing the integrated knowledge base by extracting more facts from free text.

C. Generating a large corpus of context-aware synonyms that can be used to resolve inconsistencies in IKBstore and to improve the robustness of query answering systems.

D. Resolving incompleteness in IKBstore by using the synonyms generated in Task C.

At the time of this writing, Tasks A and B are complete, and the other tasks are well on their way. The tools developed to perform Tasks B, C, and D, and the initial results they produced, will be demonstrated at the VLDB conference in Riva del Garda [20]. The rest of this section and the next section introduce the various intertwined aspects of these tasks, while the remaining sections provide an in-depth coverage of the techniques used for synonym generation and the very promising experimental results they produced.

Task A was greatly simplified by the fact that many projects, including DBpedia [7] and YaGo [18], represent the information derived from the structured summaries of Wikipedia (a.k.a. InfoBoxes) as RDF triples of the form ⟨subject, attribute, value⟩, which specify the value of an attribute (property) of a subject. This common representation facilitates the use of these knowledge bases by a roster of semantic-web applications, including queries expressed in SPARQL, and user-friendly search interfaces [5], [6]. However, the coverage and consistency provided by each individual system remain limited.

TABLE I
SOME OF THE PUBLICLY AVAILABLE KNOWLEDGE BASES

Name              Size (MB)   Entities (x10^6)   Triples (x10^6)
ConceptNet [27]        3075         0.30                1.6
DBpedia [7]           43895         3.77              400
FreeBase [8]          85035       ~25                 585
Geonames [2]           2270         8.3                90
MusicBrainz [3]       17665        18.3              ~131
NELL [10]              1369         4.34               50
OpenCyc [4]             240         0.24                2.1
YaGo2 [18]            19859         2.64              124

To overcome these problems, this project is merging, completing, and integrating these knowledge bases at the semantic level.

Task B mainly completes the initial knowledge base using our knowledge extraction system called IBminer [22]. IBminer employs an NLP-based text mining framework, called SemScape, to extract initial triples from the text. Then, using a large body of categorical information, and by learning from matches between the initial triples and existing InfoBox items in the current knowledge base, IBminer translates the initial triples into more standard InfoBox triples.

The integrated knowledge base so obtained will represent a big step forward, since it will (i) improve the coverage, quality, and consistency of the knowledge available to semantic web applications and (ii) provide a common ground for different contributors to improve the knowledge bases in a more standard and effective way. However, a serious obstacle to achieving such a desirable goal is that different systems do not adhere to a standard terminology to represent their knowledge, and instead use a plethora of synonyms and polynyms.

Thus, we need to resolve synonyms and polynyms for entity names as well as attribute names. For example, by knowing that `Johann Sebastian Bach' and `J.S. Bach' are synonyms, the knowledge base can merge their triples and associate them with one single name. As for polynyms, the problem is even more complex: most of the time one must decide, based on context (or popularity), the correct referent of an ambiguous term such as `JSB', which may refer to `Johann Sebastian Bach', `Japanese School of Beijing', etc. Several efforts to find entity synonyms have been reported in recent years [11], [12], [13], [15], [25]. However, the synonym problem for attribute names has received much less attention, although attribute synonyms can play a critical role in query answering. For instance, the attribute `birthdate' can be represented with terms such as `date of birth', `wasbornindate', `born', and `DoB' in different knowledge bases, or even in the same one when used in different contexts. Unless these synonyms are known, a search for musicians born, say, in 1685 is likely to produce a dismal recall.

To address the aforementioned issues, we propose our Context-aware Synonym Suggestion System (CS3 for short). CS3 mainly performs Tasks C and D by first extracting context-aware attribute and entity synonyms, and then using them to improve the consistency of IKBstore. CS3 learns attribute synonyms by matching morphological information in free text against the existing structured information. Similar to IBminer, CS3 takes advantage of a large body of categorical information available in Wikipedia, which serves as the contextual information. Then, CS3 improves the attribute synonyms so discovered by using triples with matching subjects and values but different attribute names. After unifying the attribute names in different knowledge bases, CS3 finds subjects with similar attributes and values, as well as similar categorical information, to suggest more entity synonyms. Through this process, CS3 uses several heuristics and takes advantage of currently existing interlinks, such as DBpedia's alias, redirect, externalLink, or sameAs links, as well as the


interlinks provided by other knowledge bases.

In this paper, we describe the following contributions:

• The Context-aware Synonym Suggestion System (CS3), which generates synonyms for both entities and attributes in existing knowledge bases. CS3 intuitively uses free text and existing structured data to learn patterns for suggesting attribute synonyms. It also uses several heuristics to improve existing entity synonyms.

• Novel techniques are introduced to integrate several public knowledge bases and convert them into a general knowledge base. To this end, we initially collect the knowledge bases and integrate them by exploiting the subject interlinks they provide. Then, IBminer is used to find more structured information from free text to extend the coverage of the initial knowledge base. At this point, we use CS3 to resolve attribute synonyms in the integrated knowledge base, and to suggest more entity synonyms based on their context similarity and other evidence. This improves the performance of semantic search over our knowledge base, since more standard and specific terms are used for both entities and attributes.

• We implemented our system and performed preliminary experiments on public knowledge bases, namely DBpedia and YaGo, and on text from Wikipedia pages. The initial results so obtained are very promising and show that CS3 improves the quality and coverage of the existing knowledge bases by applying synonyms in knowledge integration. The evaluation results also indicate that IKBstore can reach up to 97% accuracy.

The rest of the paper is organized as follows: in the next

section, we explain the high-level tasks in IKBstore. Then, in Section III, we briefly discuss IBminer's techniques for generating structured information from text. In Section IV, we propose CS3 to learn context-aware synonyms. In Section V, we discuss how these subsystems are used to integrate our knowledge bases. The preliminary results of our approach are presented in Section VI. We discuss related work in Section VII and conclude the paper in Section VIII.

II. THE BIG PICTURE

As already mentioned, the goal of IKBstore is to integrate the public knowledge bases and create a more consistent and complete knowledge base. IKBstore performs four tasks to achieve this goal. Here we elaborate on these four tasks in more detail:

Task A: Collecting publicly available knowledge bases, unifying their knowledge representation formats, and integrating the knowledge bases using existing interlinks and structured information. Creating the initial knowledge base is actually a straightforward task (Subsection V-B), since many of the existing knowledge bases represent their knowledge in RDF format. Moreover, they usually provide information to interlink a considerable portion of their subjects to those in DBpedia. Thus, we use such information to create the initial integrated knowledge base. To simplify the discussion, we refer to the initial integrated knowledge base as the initial knowledge base. Although


this naive integration may improve the coverage of the initial knowledge base, it still needs considerable improvement in consistency and quality.

Task B: Completing the initial knowledge base using accompanying text. In order to do so, we employ the IBminer system [22] to generate structured data from the free text available in Wikipedia or similar resources. IBminer first generates semantic links between entity names in the text using the recently proposed text mining framework SemScape. Then, IBminer learns common patterns, called Potential Matches (PMs), by matching the current triples in the initial knowledge base to the semantic links derived from free text. It then employs the PMs to extract more InfoBox triples from text. These newly found triples are then merged with the initial knowledge base to improve its coverage. Section III provides more information about this process.

Task C: Generating a large corpus of context-aware synonyms. Since IBminer learns by matching structured data to the morphological structure in the text, it may find more than one acceptable matching attribute name for a given link name from the text. This in fact implies possible attribute synonyms, and it is the main intuition that CS3 uses to learn attribute synonyms. Based on PM, CS3 creates the Potential Attribute Synonyms (PAS) structure, which is similar in nature to PM. However, instead of mapping link names into attribute names, PAS provides mappings between different attribute names based on the categorical information of the subject and the value. As in IBminer, the categorical information serves as the contextual information and improves the quality of the generated attribute synonyms. As described in Section IV-A, CS3 improves the PAS so discovered by learning from triples with matching subjects and values but different attribute names in the current knowledge base. CS3 also recognizes context-aware entity synonyms by considering the categorical information and InfoBoxes of the entities (Subsection IV-B).

Task D: Realigning attribute and entity names to construct the final IKBstore. This is indeed the most important step in preparing the knowledge bases for structured queries. Here, we first use PAS to resolve attribute synonyms in the current knowledge base. Then, we use the entity synonyms suggested by CS3 to integrate entity synonyms and their InfoBoxes. This step is covered in Section V.

Applications: IKBstore can benefit a wide variety of applications, since it covers a large number of structured summaries represented with a standard terminology. Knowledge extraction and population systems such as IBminer [22] and OntoMiner [23], knowledge browsing tools such as DBpedia Live [1] and the InfoBox Knowledge-Base Browser (IBKB) [20], and semantic web search such as Faceted Search [5] and By-Example Structured Queries [6] are three prominent examples of such applications. In particular, for semantic web search, IKBstore improves the coverage and accuracy of structured queries due to its superior quality and coverage with respect to existing knowledge bases. Moreover, IKBstore can serve as


a common ground for different contributors to improve the knowledge bases in a more standard and effective way. Using multiple knowledge bases in IKBstore can also be a good means of verifying the correctness of the current structured summaries, as well as those generated from text.

III. FROM TEXT TO STRUCTURED DATA

To perform the nontrivial task of generating structured data from text, we use our IBminer system [22]. Although IBminer's process is quite complex, we can divide it into three high-level steps, which are elaborated in this section. The first step is to parse the sentences in the text and convert them into a more machine-friendly structure called TextGraphs, which contain grammatical and semantic links between entities mentioned in the text. As discussed in Subsection III-A, this step is performed by the NLP-based text mining framework SemScape [21], [22]. The second step is to learn a structure called Potential Matches (PM). As explained in Subsection III-B, PM contains context-aware potential matches between semantic links in the TextGraphs and existing InfoBox items. In the third step, PM is used to suggest the final structured summaries (InfoBoxes) from the semantic links in the TextGraphs. This phase is described in Subsection III-C.

A. From Text to TextGraphs

To generate TextGraphs from text, we employ the SemScape system which uses morphological information in the text to capture the categorical, semantic, and grammatical relations between words and terms in the text. To understand the general idea, consider the following sentence:

Motivating Sentence: "Johann Sebastian Bach (31 March 1685 - 28 July 1750) was a German composer, organist, harpsichordist, violist, and violinist of the Baroque Period."

There are several entity names in this sentence (e.g. `Johann Sebastian Bach', `31 March 1685', and `German composer'). The first step in SemScape is to recognize these entity names. Thus, SemScape parses the sentence with the Stanford parser [19], which is a probabilistic parser. Using around 150 tree-based patterns (rules), SemScape finds such entity names and annotates nodes in the parse trees with the possible entity names they contain. These annotations are called MainParts (MPs), and the annotated parse tree is referred to as an MP Tree. With the nodes annotated with their MainParts, other rules do not need to know the underlying structure of the parse trees at each node. As a result, one can provide simpler, fewer, and more general rules to mine the annotated parse trees.

Next, SemScape uses another set of tree-based patterns to find grammatical connections between words and entity names in the parse trees, and combines them into the TextGraph. One such TextGraph for our motivating sentence is shown in Figure 1. Currently, SemScape contains more than 290 manually created rules to perform this step. Each link in the TextGraph is also assigned a confidence value indicating SemScape's confidence in the correctness of the link.



Fig. 2. Part a) shows the graph pattern for Rule1, and part b) depicts one of the possible matches for this pattern.

Fig. 1. Part of the TextGraph for our motivating sentence.

We should point out that TextGraphs support nodes consisting of multiple nodes (and links) through their hyper-links, which mainly differentiates TextGraphs from similar structures such as dependency trees. For more details on the TextGraph generation phase, readers are referred to [22].

Although useful for some applications, the grammatical connections at this stage of the TextGraphs are not enough for IBminer to generate structured data. The reason is that IBminer needs connections between entity names, which we refer to as semantic links. Semantic links simply specify any relation between two entities¹. With this definition, most of the grammatical links in the TextGraphs are not semantic links. To generate more semantic links, IBminer uses a set of manually created graph-based patterns (rules) over the TextGraphs. One such rule is provided below:

-------------------- Rule 1 --------------------
SELECT ( ?1 ?3 ?2 ) WHERE {
  ?1 "subj of" ?3.
  ?2 "obj of" ?3.
  NOT("not" "prop of" ?3).
  NOT("no" "det of" ?1).
  NOT("no" "det of" ?2).
}
------------------------------------------------

As depicted in part a) of Figure 2, the pattern graph (the WHERE clause of Rule 1) specifies two nodes with (variable) names ?1 and ?2, which are connected to a third node (?3) with subj of and obj of links, respectively. One possible match for this pattern in the TextGraph of our running example is depicted in part b) of Figure 2. It is worth mentioning that, due to the structure of the TextGraphs, matching multi-word entities to the variable names in the patterns is an easy task for IBminer. This is actually a challenging issue in works such as [30], which are based on dependency parse trees. Using the SELECT clause of Rule 1, the rule returns several triples for our running example, such as:

• ⟨johann sebastian bach, was, composer⟩
• ⟨sebastian bach, was, composer⟩
• ⟨bach, was, composer⟩, etc.

The above triples are referred to as the initial triples. In IBminer, we have created 98 graph-based rules to capture

¹SemScape treats values as entities.

semantic links (initial triples) from TextGraphs and add them to the same TextGraph. Some of the semantic links generated for our motivating sentence are shown in Figure 3. For the sake of simplicity, we do not depict grammatical links in this graph. We should restate that all the patterns discussed in this subsection are manually created; the reader should not confuse them with the automatically generated patterns that we discuss in the next subsections.
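To make the pattern-matching step concrete, the following toy matcher applies Rule 1 to a TextGraph represented as a flat set of (source, label, target) links. This is a Python sketch of ours: real TextGraphs carry hyper-links and per-link confidences, which are ignored here, and IBminer uses a general rule engine rather than a hand-coded function per rule.

def apply_rule1(edges):
    """Apply Rule 1 to a TextGraph given as a set of (source, label,
    target) links.  Returns initial triples (?1, ?3, ?2) for which
    '?1 subj of ?3' and '?2 obj of ?3' hold and no NOT condition fires."""
    def holds(src, label, dst):
        return (src, label, dst) in edges

    results = []
    for (s, l1, p) in edges:
        if l1 != "subj of":
            continue
        for (o, l2, p2) in edges:
            if l2 != "obj of" or p2 != p:
                continue
            if holds("not", "prop of", p):                 # negated verb
                continue
            if holds("no", "det of", s) or holds("no", "det of", o):
                continue
            results.append((s, p, o))                      # ( ?1 ?3 ?2 )
    return results

# A toy fragment of the TextGraph for the motivating sentence:
g = {("johann sebastian bach", "subj of", "was"),
     ("composer", "obj of", "was")}
print(apply_rule1(g))   # [('johann sebastian bach', 'was', 'composer')]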

B. Generating Potential Matches (PM)

To generate the final structured data (new InfoBox triples), IBminer learns an intermediate data structure called Potential Matches (PM) from the mapping between initial triples (semantic links in TextGraphs) and current InfoBox triples. For instance, consider the initial triple ⟨Bach, was, composer⟩ in Figure 3. Also assume that the current InfoBoxes include ⟨Bach, Occupation, composer⟩. This implies a simple pattern saying that the link name `was' may be translated to the attribute name `Occupation', depending on the context of the subject (`Bach' in this case) and the value (`composer'). We refer to these matches as potential matches, and the goal here is to automatically generate and aggregate all potential matches.

To understand why we need to consider the context of the subject and the value, this time consider the two initial triples ⟨Bach, was, composer⟩ and ⟨Bach, was, German⟩ in the TextGraph in Figure 3. Obviously, the link name `was' should be interpreted differently in these two cases, since the former connects a `person' to an `occupation', while the latter connects a `person' to a `nationality'. Now, consider two existing InfoBox triples ⟨Bach, occupation, composer⟩ and ⟨Bach, nationality, German⟩, which respectively match the aforementioned initial triples. These two matches imply a simple pattern: the link name `was' connecting `Bach' and `composer' should be interpreted as `occupation', while it should be interpreted as `nationality' if it is used between `Bach' and `German'. To generalize these implications, instead of using the subject and value names, we use the categories they belong to in our patterns. For instance, knowing that `Bach' is in category `Cat:Person', and that `composer' and `German' are respectively in categories `Cat:Occupation in Music' and `Cat:European Nationality', we learn the following two patterns:

• ⟨Cat:Person, was, Cat:Occupation in Music⟩ : occupation
• ⟨Cat:Person, was, Cat:European Nationality⟩ : nationality


Fig. 3. Some Semantic links for our motivating sentence.

Here the pattern ⟨c1, l, c2⟩ : a indicates that the link named l, connecting a subject in category c1 to an entity or value in category c2, may be interpreted as the attribute name a. Note that for each triple with a matching InfoBox triple, we create several patterns, since the subject and the value usually belong to more than one (direct or indirect) category.

More formally, let ⟨s, l, v⟩ be an initial triple in the TextGraph which matches InfoBox triple ⟨s, a, v⟩ in the initial knowledge base. Also, let s and v respectively belong to category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...} according to the categorical information in Wikipedia. Later in this subsection, we discuss a simple yet effective technique for selecting a small set of related categories for a given subject or value. For each cs ∈ Cs and cv ∈ Cv, IBminer creates the following tuple and adds it to PM:

⟨cs, l, cv⟩ : a

Each tuple in PM is also associated with a confidence value c (initialized with the confidence of the TextGraph's semantic link) and an evidence frequency e (initialized to 1). More matches for the same categories and the same link will increase the confidence and evidence count of the above potential match. The above potential match basically means that, for an initial triple ⟨s1, l, v1⟩ in which s1 belongs to category cs and v1 belongs to cv, a may be a match for l with confidence c and evidence e.
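As a concrete illustration of this bookkeeping, the following Python sketch (all names are ours, not IBminer's) accumulates PM tuples keyed by (cs, l, cv, a), each carrying a confidence and an evidence count. The rule used here for updating confidence on repeated matches (keeping the maximum) is an assumption; the text only states that further matches increase both values.

from itertools import product

# PM maps (cs, l, cv, a) -> [confidence, evidence]; names are illustrative.
PM = {}

def add_potential_match(Cs, l, Cv, a, link_confidence):
    """Record one matched (initial triple, InfoBox triple) pair in PM.

    Cs, Cv : selected categories of the subject and the value
    l      : link name from the TextGraph
    a      : attribute name of the matching InfoBox triple
    """
    for cs, cv in product(Cs, Cv):
        key = (cs, l, cv, a)
        if key not in PM:
            PM[key] = [link_confidence, 1]
        else:
            entry = PM[key]
            # Assumption: keep the max confidence seen so far.
            entry[0] = max(entry[0], link_confidence)
            entry[1] += 1

# Example from the running text:
add_potential_match({"Cat:Person"}, "was",
                    {"Cat:Occupation in Music"}, "occupation", 0.9)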

Later, in Section IV, we show how CS3 uses the PM structure to build up a Potential Attribute Synonym structure and generate high-quality synonyms.

Selecting Best Categories: A very important issue in generating potential matches is the quality and quantity of the categories for the subjects and values. The direct categories provided for most subjects are too specific, and there are only a few subjects in each of them. As shown in Section VI, generating the potential matches over direct categories is not very helpful for generalizing the matches to new subjects. On the other hand, exhaustively adding all the indirect (or ancestor) categories will result in too many inaccurate potential matches. For instance, considering only four levels of categories in Wikipedia's taxonomy, the subject `Johann Sebastian Bach' belongs to 422 categories. In this list, there are some useful indirect categories such as `German Musicians' and `German Entertainers', as well as several categories which are either too


general or inaccurate (e.g. `People by Historical Ethnicity' and `Centuries in Germany'). Considering the same issue for the value part, hundreds of thousands of potential matches may be generated for a single subject. This issue not only wastes our resources, but also impacts the accuracy of the final results.

To address this issue, we use a flow-driven technique to rank all the categories to which subject s belongs, and then select the best NC categories. The main intuition is to propagate flows (weights) through different paths from s to each category; the categories receiving more weight are considered more related to s. With L being the number of allowed ancestor levels, we create the categorical structure for s up to L levels. Starting with node s as the root of this structure and assigning it weight 1.0, we iteratively select the closest node to s which has not been processed yet, propagate its weight to its parent categories, and mark it as processed. To propagate the weight of node ci with k ancestors, we increase the current weight of each of the k ancestors by wi/k, where wi is the current weight of node ci. Although wi may change even after ci is processed, we do not re-process ci after any further updates to its weight. After propagating the weights through all the nodes, we select the top NC categories for generating potential matches and attribute synonyms.
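The following sketch implements this flow-driven ranking in Python. The parents map and the closest-first ordering via a heap are our assumptions about representation; the propagation rule itself follows the description above: nodes are processed once, closest to s first, and each node splits its current weight evenly among its parent categories.

import heapq
from collections import defaultdict

def select_best_categories(s, parents, L, NC):
    """Rank the ancestor categories of subject s by propagated weight.

    parents : dict mapping a node to its direct parent categories
    L       : number of allowed ancestor levels
    NC      : number of top categories to return
    """
    weight = defaultdict(float)
    weight[s] = 1.0
    queue = [(0, s)]          # process nodes closest to s first
    processed = set()
    while queue:
        depth, node = heapq.heappop(queue)
        if node in processed or depth >= L:
            continue
        processed.add(node)   # a node is never re-processed
        ancestors = parents.get(node, [])
        if not ancestors:
            continue
        share = weight[node] / len(ancestors)   # wi / k
        for p in ancestors:
            weight[p] += share
            heapq.heappush(queue, (depth + 1, p))
    weight.pop(s, None)       # rank only the categories, not s itself
    return [c for c, _ in sorted(weight.items(), key=lambda kv: -kv[1])[:NC]]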

C. Generating New Structured Summaries

To extract new structured summaries (InfoBox triples), IBminer uses PM to translate the link names of the initial triples into the attribute names of InfoBoxes. Let t = ⟨s, l, v⟩ be the initial triple whose link (l) needs to be translated, and let s and v be listed in category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...} respectively, as generated by our category selection algorithm. The key idea in translating l is to take a consensus among all pairs of categories in Cs and Cv, and all possible attributes, to decide which attribute name is a possible match.

To this end, for each cs ∈ Cs and cv ∈ Cv, IBminer finds all potential matches of the form ⟨cs, l, cv⟩ : ai. The resulting set of potential matches is then grouped by the attribute names ai, and for each group we compute the average confidence value and the aggregate evidence frequency of the matches. IBminer uses two thresholds at this point to discard low-confidence (θc) or low-frequency (θe) potential matches, which are discussed in Section VI. Next, IBminer filters the remaining results with a very effective type-checking technique². At this point, if one or more matches remain in this list, we pick the one with the largest evidence value, say pm_max, as the only attribute map, and report the new InfoBox triple ⟨s, pm_max.a, v⟩ with confidence t.c ∧ pm_max.c and evidence t.e. Secondary possible matches are considered in the next section as attribute synonyms.
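A minimal sketch of this consensus step, building on the PM dictionary sketched earlier; the type-checking filter of [22] is omitted, reading the confidence conjunction as a minimum is our assumption, and the evidence bookkeeping is simplified to the aggregated PM evidence.

def translate_link(t, t_conf, Cs, Cv, PM, theta_c, theta_e):
    """Translate the link of initial triple t = (s, l, v) into an
    InfoBox attribute name by consensus over all category pairs."""
    s, l, v = t
    groups = {}                            # attribute a -> list of (c, e)
    for (cs, pl, cv, a), (c, e) in PM.items():
        if pl == l and cs in Cs and cv in Cv:
            groups.setdefault(a, []).append((c, e))
    best = None
    for a, matches in groups.items():
        avg_c = sum(c for c, _ in matches) / len(matches)
        tot_e = sum(e for _, e in matches)
        if avg_c < theta_c or tot_e < theta_e:
            continue                       # drop low-confidence/low-frequency
        if best is None or tot_e > best[2]:
            best = (a, avg_c, tot_e)       # keep the largest evidence
    if best is None:
        return None
    a, c, e = best
    # Combine the triple's own confidence with the match confidence.
    return (s, a, v), min(t_conf, c), e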

IV. CONTEXT-AWARE SYNONYMS

Synonyms are terms describing the same concept, which can be used interchangeably. According to this definition, no

²See [22] for details on the type-checking mechanism used by IBminer.


matter what context is used, the synonym for a term is fixed (e.g. `birthdate' and `date of birth' are always synonyms). However, the meaning or semantics of a term usually depends on the context in which the term is used, and the synonym also varies as the context changes. For instance, in an article describing IT companies, the synonym of the attribute name `wasCreatedOnDate' is most probably `founded date'. In this case, knowing that the attribute is used for the name of a company is contextual information that helps us find an appropriate synonym for `wasCreatedOnDate'. However, if this attribute is used for something else, such as an invention, one cannot use the same synonym for it.

Being aware of the context is even more useful for resolving polynymous phrases, which are in fact much more prevalent than exact synonyms in the knowledge bases. For example, consider the entity/subject name `Johann Sebastian Bach'. Due to its popularity, the general understanding is that the entity describes the famous German classical musician. However, what if we know that the birthplace for this specific entity is `Berlin'? This simple piece of contextual information leads us to the conclusion that the entity refers to the painter, who was actually the grandson of the famous musician Johann Sebastian Bach. A very similar issue exists for attribute synonyms. For instance, considering the attribute `born', `birthdate' can be a synonym for `born' when it is used with a value of type `date'; but if `born' is used with values which indicate places, then `birthplace' should be considered its synonym.

CS3 constructs a structure called Potential Attribute Synonyms (PAS) to extract attribute synonyms. In the generation of PAS, CS3 essentially counts the number of times each pair of attributes is used between the same subject and value, and with the same corresponding semantic link in the TextGraphs. The context in this case is the categorical information of the subject and the value. These counts are then used to compute the probability that any given two attributes are synonyms. The next subsection describes the process of generating PAS. Later, in Subsection IV-B, we discuss our approach to suggesting entity synonyms and improving existing ones.

A. Generating Attribute Synonyms

Intuitively, if two attributes (say `birthdate' and `dateOfBirth') are synonyms in a specific context, they should be represented with the same (or very similar) semantic links in the TextGraphs (e.g. with semantic links such as `was born on', `born on', or `birthdate is'). In simpler words, we use text as the witness for our attribute synonyms. Moreover, the context, which is defined as the categories for the subjects (and for the values), should be very similar for synonymous attributes.

More formally, let attributes ai and aj be two matches for link l in initial triple ⟨s, l, v⟩. Let Ni,j (= Nj,i) be the total number of times both ai and aj are the interpretation of the same link (in the initial triples) between category sets Cs and Cv. Also, let Nx be the total number of times ax is used between Cs and Cv. Thus, the probability that ai (aj) is a


synonym for aj (ai) can be computed as Ni,j/Nj (Ni,j/Ni). Obviously, this is not always a symmetric relationship (e.g. the `born' attribute is always a synonym for `birthdate', but not the other way around, since `born' may also refer to `birthplace' or `birthname'). In other words, having Ni and Ni,j computed, we can resolve both synonyms and polynyms for any given context (Cs and Cv).
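A tiny worked example with invented counts for one (Cs, Cv) context illustrates the asymmetry:

# Hypothetical counts in one (Cs, Cv) context -- numbers are invented:
N = {"born": 100, "birthdate": 60}   # N_x: times each attribute is used
N_ij = 55                            # N_ij: co-interpretation count

# P(birthdate is a synonym for born)  = N_ij / N["born"]      = 0.55
# P(born is a synonym for birthdate)  = N_ij / N["birthdate"] ~ 0.92
print(N_ij / N["born"], N_ij / N["birthdate"])
# The asymmetry reflects that 'born' also covers birthplace/birthname.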

With the above intuition in mind, the goal in PAS is to compute Ni and Ni,j. Next, we explain how CS3 constructs PAS with a one-pass algorithm, which is essential for scaling up our system. For every two records in PM, such as ⟨cs, l, cv⟩ : ai and ⟨cs, l, cv⟩ : aj, respectively with evidence frequencies ei and ej (ei ≤ ej), we add the following two records to PAS:

⟨cs, ai, cv⟩ : aj
⟨cs, aj, cv⟩ : ai

Both records are inserted with the same evidence frequency ei. Note that if the records are already in the current PAS, we increase their evidence frequency by ei. At the same time, we also count the number of times each attribute is used between a pair of categories. This is necessary for estimating Ni and computing the final weights of the attribute synonyms. That is, for the case above, we add the following two PAS records as well:

⟨cs, ai, cv⟩ : `' (with evidence ei)
⟨cs, aj, cv⟩ : `' (with evidence ej)
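Putting the two kinds of records together, the one-pass construction can be sketched as follows; PAS is held as a dictionary from (cs, ai, cv, aj) to evidence, with (cs, ai, cv, '') as the usage counter, and the min() choice follows the ei ≤ ej reading above.

from collections import defaultdict

def build_pas(PM):
    """One-pass construction of PAS from the PM dictionary sketched
    earlier, which maps (cs, l, cv, a) -> (confidence, evidence)."""
    PAS = defaultdict(int)
    # Group PM records that share the same context and link (cs, l, cv).
    by_context = defaultdict(list)
    for (cs, l, cv, a), (_, e) in PM.items():
        by_context[(cs, l, cv)].append((a, e))
    for (cs, l, cv), attrs in by_context.items():
        for ai, ei in attrs:
            PAS[(cs, ai, cv, '')] += ei       # usage counter for N_i
        for x in range(len(attrs)):
            for y in range(x + 1, len(attrs)):
                ai, ei = attrs[x]
                aj, ej = attrs[y]
                e = min(ei, ej)               # both directions get the
                PAS[(cs, ai, cv, aj)] += e    # smaller evidence
                PAS[(cs, aj, cv, ai)] += e
    return PAS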

Improving PAS with Matching InfoBox Items: Potential attribute synonyms can also be derived from different knowledge bases which contain the same piece of knowledge under different attribute names. For instance, let ⟨J.S.Bach, birthdate, 1685⟩ and ⟨J.S.Bach, wasBornOnDate, 1685⟩ be two InfoBox triples indicating Bach's birthdate. Since the subject and value parts of the two triples match, one may say birthdate and wasBornOnDate are synonyms. To add these types of synonyms to the PAS structure, we follow the exact same idea explained earlier in this section. That is, consider two triples ⟨s, ai, v⟩ and ⟨s, aj, v⟩ in which ai and aj may be synonyms. Also, let s and v respectively belong to category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...}. Thus, for all cs ∈ Cs and cv ∈ Cv, we add the following records to PAS:

⟨cs, ai, cv⟩ : aj (with evidence 1)
⟨cs, aj, cv⟩ : ai (with evidence 1)

This intuitively means that, in the context of categories cs and cv, attributes ai and aj may be synonyms. Again, more examples for these categories and attributes increase the evidence, which in turn improves the quality of the final attribute synonyms. Much in the same way as when learning from initial triples, we count the number of times an attribute is used between any possible pair of categories (cs and cv) to estimate Ni.

Generating Final Attribute Synonyms: Once the PAS structure is built, it is easy to compute attribute synonyms as described earlier. Assume we want to find the best synonyms for attribute ai in InfoBox triple t = ⟨s, ai, v⟩. Using PAS, for


all possible aj, all cs ∈ Cs, and all cv ∈ Cv, we aggregate the evidence frequencies (e) of records such as ⟨cs, ai, cv⟩ : aj in PAS to compute Ni,j. Similarly, we compute Ni by aggregating the evidence frequencies (e) of all records of the form ⟨cs, ai, cv⟩ : `'. Finally, we accept attribute aj as a synonym of ai only if Ni,j/Ni and Ni,j are respectively above the predefined thresholds θsc and θse. We study the effect of these thresholds in Section VI.
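Over the PAS dictionary sketched above, the final acceptance test can be written as follows (the threshold names mirror the reconstructed θsc and θse):

def suggest_synonyms(ai, Cs, Cv, PAS, theta_sc, theta_se):
    """Return attributes accepted as synonyms of ai in context (Cs, Cv)."""
    n_i = 0
    n_ij = {}   # candidate aj -> aggregated co-occurrence evidence
    for (cs, a, cv, other), e in PAS.items():
        if a != ai or cs not in Cs or cv not in Cv:
            continue
        if other == '':
            n_i += e                       # usage counter contributes to N_i
        else:
            n_ij[other] = n_ij.get(other, 0) + e
    if n_i == 0:
        return []
    return [aj for aj, e in n_ij.items()
            if e / n_i >= theta_sc and e >= theta_se]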

B. Generating Entity Synonyms

There are several techniques for finding entity synonyms: approaches based on string similarity matching [24], manually created synonym dictionaries [28], synonyms automatically generated from click logs [14], [15], and synonyms generated by other data/text mining approaches [29], [23] are only a few examples. Although they perform very well at suggesting context-independent synonyms, they do not explicitly consider the contextual information needed to suggest more appropriate synonyms and to resolve polynyms.

Very much as with context-aware attribute synonyms, where the context of the subject and value used with an attribute plays a crucial role in the synonyms for that attribute, we can define context-aware entity synonyms. For each entity name, CS3 uses the categorical information of the entity, as well as all the InfoBox triples of the entity, as the contextual information for that entity. Thus, to complement the existing entity synonym suggestion techniques, for any suggested pair of synonymous entities we compute the entities' context similarity to verify the correctness of the suggested synonym.

It is important to understand that this approach should be used as a complementary technique over the existing ones, for two main reasons. First, context similarity of two entities does not always imply that they are synonyms, especially when many pieces of knowledge are missing for most entities in the current knowledge bases. Second, it is not feasible to compute the context similarity of all possible pairs of entities, due to the large number of existing entities. In this work, we use the OntoMiner system [23], in addition to simple string matching techniques (e.g. exact string matching, common words, and edit distance), to suggest initial candidate synonyms.

Let `Johann Sebastian Bach' and `J.S. Bach' be two synonyms that two different knowledge bases use to denote the famous musician. A simple string matching would offer these two entities as synonyms. We then compare their contextual information and realize that they have many common attributes with similar values (e.g. the same values for the attributes occupation, birthdate, birthplace, etc.). They also belong to many common categories (e.g. Cat:German musician, Cat:Composer, Cat:People, etc.). Thus, we suggest them as entity synonyms with high confidence. However, consider the entities `Johann Sebastian Bach (painter)' and `J.S. Bach'. Although the initial synonym suggestion technique may suggest them as synonyms, their contextual information is quite different (i.e. they have different values for the common attributes occupation, birthplace, birthdate, deathplace, etc.), so our system does not accept them as synonyms.
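The verification step can be sketched as follows. The particular scoring formula below (agreement ratio on shared attributes, averaged with category overlap) is an illustrative choice of ours; the paper does not fix a formula.

def context_similarity(infobox1, cats1, infobox2, cats2):
    """Score how similar the contexts of two candidate synonym entities
    are; infobox* map attribute -> value, cats* are category sets."""
    shared = set(infobox1) & set(infobox2)
    if shared:
        agree = sum(1 for a in shared if infobox1[a] == infobox2[a])
        attr_score = agree / len(shared)
    else:
        attr_score = 0.0
    cat_union = cats1 | cats2
    cat_score = len(cats1 & cats2) / len(cat_union) if cat_union else 0.0
    return (attr_score + cat_score) / 2

# 'J.S. Bach' vs. the painter: shared attributes such as birthdate and
# occupation disagree, so the pair is rejected despite the string match.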


V. COMBINING KNOWLEDGE BASES

The IBminer and CS3 systems allow us to integrate the existing knowledge bases into one of superior quality and coverage. In this section, we elaborate on the steps of IKBstore as introduced previously in Section II.

A. Data Gathering

We are currently in the process of integrating the knowledge bases listed in Table I, which include some domain-specific knowledge bases (e.g. MusicBrainz [3], Geonames [2], etc.) and some domain-independent ones (e.g. DBpedia [7], YaGo2 [18], etc.). Although most knowledge bases, such as DBpedia, YaGo, and FreeBase, already provide their knowledge in RDF, some of them use other representations. Thus, for all knowledge bases, we convert their knowledge into ⟨Subject, Attribute, Value⟩ triples and store them in IKBstore³. IKBstore is currently implemented over Apache Cassandra, which is designed for handling very large amounts of data. IKBstore recognizes three main types of information:

• InfoBox triples: These triples provide information on a known subject in the ⟨subject, attribute, value⟩ format, e.g. ⟨J.S. Bach, PlaceofBirth, Eisenach⟩, which indicates that the birthplace of the subject J.S.Bach is Eisenach. We refer to these triples as InfoBox triples.

• Subject/Category triples: These provide the categories a subject belongs to, in the form ⟨subject, link, category⟩, where link represents a taxonomical relation, e.g. ⟨J.S.Bach, is in, Cat:German Composers⟩, which indicates that the subject J.S.Bach belongs to the category Cat:German Composers.

• Category/Category triples: These represent taxonomical links between categories, e.g. ⟨Cat:German Composers, is in, Cat:German Musicians⟩, which indicates that the category Cat:German Composers is a sub-category of Cat:German Musicians.

Currently, we have converted all the knowledge bases listed in Table I into the above triple formats.

IKBstore also preserves the provenance of each piece of knowledge. In other words, for every fact in the integrated knowledge base, we can track its data source. For facts derived from text, we record the article ID as the provenance. Provenance has several important applications, such as search restricted to specific sources, tracking erroneous knowledge pieces back to their source, and better ranking techniques based on the reliability of the knowledge in each source. In fact, we also annotate each fact with accuracy confidence and frequency values, based on the provenance of the fact. To do so, each knowledge base is assigned a confidence value. To compute the frequency and confidence of individual facts, we respectively count the number of knowledge bases

³For instance, MusicBrainz uses a relational representation, for which we consider every column name, say a, as an attribute connecting the main entity (s) of a row in the table to the value (v) of that row for column a, and create the triple ⟨s, a, v⟩.


including the fact, and combine their confidence values as explained in [22].

B. Initial Knowledge Integration

The aim of this phase is to find the initial interlinks between subjects, attributes, and categories from the various knowledge sources, in order to eliminate duplication, align attributes, and reduce inconsistency using only the existing interlinks. At the end of this phase, we have an initial knowledge base which is not quite ready for structured queries, but which provides a better starting point for IBminer to generate more structured data and for CS3 to resolve attribute and entity synonyms.

• Interlinking Subjects: Fortunately, many subjects in different knowledge bases have the same name. Moreover, DBpedia is interlinked with many existing knowledge bases, such as YaGo2 and FreeBase, and can serve as a source of subject interlinks. For the knowledge bases which do not provide such interlinks (e.g. NELL), in addition to exact matching, we parse the structured part of the knowledge base to derive candidate interlinks for existing entities, such as redirect and sameAs links in Wikipedia.

• Interlinking Attributes: As mentioned previously, attribute interlinks are completely ignored in current studies. In this phase, we use only exact matching for interlinking attributes.

• Interlinking Categories: In addition to exact matching, we compute the similarity of the categories in different knowledge bases based on their common instances. Consider two categories c1 and c2, and let S(c) be the set of subjects in category c. The similarity function for category interlinks is defined as Sim(c1, c2) = |S(c1) ∩ S(c2)| / |S(c1) ∪ S(c2)|. If Sim(c1, c2) is greater than a certain threshold, we consider c1 and c2 aliases of each other; that is, if the instance sets of two categories overlap heavily, they may represent the same category (see the sketch below).
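A direct transcription of this similarity function; the 0.8 threshold in the comment is illustrative, since the paper leaves the threshold unspecified here:

def category_similarity(s1, s2):
    """Jaccard similarity of two categories, given their subject sets:
    Sim(c1, c2) = |S(c1) & S(c2)| / |S(c1) | S(c2)|."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# Categories are treated as aliases when the score exceeds a chosen
# threshold, e.g. category_similarity(S_c1, S_c2) > 0.8 (illustrative).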

After retrieving these interlinks, we merge similar entities, categories, and triples based on the retrieved interlinks. The provenance information is generated and stored along with the triples.

C. Final Knowledge Integration

Once the initial knowledge base is ready, we first employ IBminer to extract more structured data from the accompanying text, and then utilize CS3 to resolve synonymous information and create the final knowledge base. More specifically, we perform the following steps to complete and integrate the final knowledge base:

• Improving knowledge base coverage: Web documents contain innumerable facts which are ignored by existing knowledge bases. Thus, we first enrich the knowledge base by employing our knowledge extraction system IBminer. As described in Section III, IBminer learns PM from free text and the initial knowledge base to derive more triples, which greatly improves the


coverage of existing knowledge bases. These new triples are then added to IKBstore. For each generated triple, we also update the confidence and evidence frequency in IKBstore; that is, if the triple is already in IKBstore, we only increase and update its confidence and evidence frequency.

• Realigning attribute names: Next, we employ CS3 to learn PAS, generate synonyms for attribute names, and expand the initial knowledge base with more common and standard attribute names.

• Matching entity synonyms: This step merges entities based on the entity synonyms suggested by CS3 (see the sketch after this list). For a suggested pair of synonymous entities, say s1 and s2, we aggregate their triples under one common entity name, say s1. The other subject (s2) is kept as a possible alias for s1, which can be represented by the RDF triple ⟨s1, alias, s2⟩.

• Integrating categorical information: Since we have merged subjects based on entity synonyms, the similarity scores of the categories may change, and thus we need to rerun the category integration described in Subsection V-B.
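As an illustration of the entity-matching step above, here is a minimal sketch that merges one entity into another and records the alias triple; modeling the knowledge base as a set of triples is a simplification of the actual Cassandra-backed store, and provenance is ignored.

def merge_entities(kb, s1, s2):
    """Merge entity s2 into s1 once CS3 suggests them as synonyms.

    kb is a set of (subject, attribute, value) triples."""
    moved = {(s1, a, v) for (s, a, v) in kb if s == s2}
    kb = {(s, a, v) for (s, a, v) in kb if s != s2} | moved
    kb.add((s1, "alias", s2))   # keep s2 as a possible alias of s1
    return kb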

Currently, the knowledge base integration is partially completed for DBpedia and YaGo2, as described in Section VI.

VI. EXPERIMENTAL RESULTS

In this section, we test and evaluate the different steps of creating IKBstore in terms of precision and recall. To this end, we create an initial knowledge base using the subjects listed in Wikipedia for three specific domains (Musicians, Actors, and Institutes). For these subjects, we add the related structured data from DBpedia and YaGo2 to our initial knowledge base. Then, to learn the PM and PAS structures, we use Wikipedia's long abstracts, as provided by DBpedia, for the mentioned subjects. We should note that IBminer only uses free text, and thus can take advantage of any other source of textual information.

All the experiments were performed on a single machine with 16 cores at 2.27 GHz and 16 GB of main memory, running Ubuntu 12. On average, SemScape spends 3.07 seconds generating the initial semantic links for each sentence on a single CPU; that is, using 16 cores, we were able to generate initial semantic links for 5.2 sentences per second.

A. Initial Knowledge Base

To evaluate our system, we have created a simple knowledge base consisting of three data sets for the domains Musicians, Actors, and Institutes (e.g. universities and colleges), drawn from the subjects listed in Wikipedia. These data sets do not share any subjects, and in total they cover around 7.9% of Wikipedia's subjects. To build each data set, we start from a general category (e.g. Category:Musicians for the Musicians data set) and find all the Wikipedia subjects in this category or any of its descendant categories, up to four levels. Table II provides more details on each data set.
