
Indexing semi-structured XML databases

A study of techniques.

Section I based on

« Compact labeling schemes for ancestor queries », SODA ’01, Abiteboul, Kaplan, Milo

Section II based on

« Index Structures for Path Expressions », ICDT ’99, Milo, Suciu

Yann THIERRY-MIEG

y.mieg@free.fr

DEA-SIR

GR-BDA

March 27, 2001

Abstract:

Indexing databases to improve query response time is a universal, well-known, and highly optimized technique. The XML standard for web documents raises new problems for database indexing: XML documents do not necessarily conform to a traditional object-oriented or relational data model; they may contain heterogeneous collections or missing attributes, and may be physically stored in a distributed way. Furthermore, these semi-structured data models are mostly queried using ancestor queries, so the index must include this structural information.

I present in this paper two techniques that address this problem.

The first is a compact labeling scheme for data trees that embeds ancestor information in the labels, so that it captures both the data and its structure; it is designed for large-scale XML data warehousing. I include a critical analysis of the results obtained, which shows that this approach does not in fact reduce the maximum total size of the index tables.

The second is an index structure called the T-index, which builds a specific index for a class of queries. It exploits the symmetries of the data with respect to the language given by a template T, and can be used to optimize queries on heterogeneous data.

Introduction

I review here two articles that propose solutions to the problem of indexing XML semi-structured data. With the rise of XML as a standard for web exchange, efficient indexing schemes must be found to exploit the structural information that constitutes the richness of this language.

We consider here data structured as a rooted tree, where the arcs bear labels that are the names of the attributes and the nodes hold the data itself. An XML document constitutes just such a tree. The data is queried with ancestor or path expressions, and these queries must be answerable without fetching the actual objects from disk or, worse, from the network. At the same time, index size is a critical factor for huge web indexing engines, as the index table must fit in RAM.

In section 2, I present a labeling scheme that preserves ancestor information with a label size close to (3/2)·log(n). It is based on the rather straightforward “parenthesis scheme”, which produces labels of at most 2·log(n) bits, but the authors show that a clever partitioning of the tree can reduce the label size further. I argue that, though the maximal label size is indeed reduced to roughly (3/2)·log(n), the average label size can be greater than that obtained with the basic parenthesis scheme, thus shedding some doubt on the usefulness of this algorithm.

In section 3, I present a different approach, aimed at optimizing response time for certain classes of queries: those conforming to a given template. The authors use standard reduction techniques on finite state machines representing a language to construct a specific index for a type of query. This is achieved by computing a bisimulation of the tree representing the DB and merging nodes that fall in the same equivalence class with respect to the given language. Queries are then evaluated on this reduced symbolic graph, yielding a performance increase.

Compact labeling schemes for ancestor queries.

2.1 Context

We consider in this section the following problem:

Let T be a rooted tree and let n be the number of its nodes. Given T, what are the most compact labels one can attach to the nodes such that, given the labels of any two nodes u and v, we can determine in constant time whether u is an ancestor of v?

The parenthesis scheme (described in the footnote at the end of this document) constructs labels of at most 2·log(f) bits, where f is the number of leaves, necessarily smaller than the total number of nodes. Answering the question “is u an ancestor of v?” then amounts to checking interval containment. Although this labeling problem has an obvious lower bound of log(n) bits, needed simply to distinguish individual nodes, the goal here is to reduce, even by a constant factor, this 2·log(f) bound for labels that also carry structural information.
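As an illustration, here is a minimal sketch of the parenthesis scheme and of the constant-time ancestor test it allows. The tree representation (an adjacency dictionary named children) and the function names are my own assumptions for this sketch, not notation from the article.

    def parenthesis_labels(children, root):
        # Leaves get consecutive identifiers in DFS order; each inner node gets
        # the interval [m..M] of the leaf identifiers occurring below it.
        labels, counter = {}, [0]

        def visit(u):
            kids = children.get(u, [])
            if not kids:                      # leaf: next identifier
                counter[0] += 1
                labels[u] = (counter[0], counter[0])
            else:                             # inner node: span of its leaves
                spans = [visit(v) for v in kids]
                labels[u] = (spans[0][0], spans[-1][1])
            return labels[u]

        visit(root)
        return labels

    def is_ancestor(labels, u, v):
        # u is an ancestor of v (or v itself) iff v's interval is contained in u's
        (mu, Mu), (mv, Mv) = labels[u], labels[v]
        return mu <= mv and Mv <= Mu

For instance, on the tree {'r': ['a', 'b'], 'a': ['x', 'y']}, is_ancestor returns True for ('r', 'x') and False for ('b', 'x').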

The technique presented here basically partitions the tree following an algorithm presented in the next section, then indexes the resulting set of trees with the parenthesis scheme, and finally reconstructs an index as the union of the smaller indexes in different subtrees.

2.2 The 3-phase pruning algorithm

To focus this discussion I will present the algorithm through an example that exhibits the pruning procedure. I refer the reader to the actual article for a more formal presentation, including the proofs of the statements I will not justify here.

We begin with a rooted tree T such that any internal node has at least two children, and there exists a root node from which any node of this acyclic directed graph can be reached.

Let n be the number of nodes in T, and let x be a formal parameter that conditions the size of the different subtrees in the partition we are creating. I will return to the choice of this parameter in the section on the recursive scheme; for now x will be instantiated as 1/3.

PHASE 1: PRUNING

Let l(u) be the number of leaves in the subtree rooted at node u.

The algorithm is as follows (a code sketch is given after the list):

• For any node u such that l(u) > 2·n^x

• For any son v of u

• If l(v) ≤ n^x

• Cut the edge (u,v)

• Else if l(v) ≤ 2·n^x

• Cut all the edges (v,w) where w is a son of v

Let T1 be the resulting tree.
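A possible reading of phase 1 in code, assuming the tree is stored as an adjacency dictionary children and that threshold stands for n^x (these names are mine, not the paper's):

    def leaf_count(children, u):
        # l(u): number of leaves in the subtree rooted at u
        kids = children.get(u, [])
        return 1 if not kids else sum(leaf_count(children, v) for v in kids)

    def phase1_cut_edges(children, threshold):
        # Returns the set of edges cut in phase 1; the pieces that fall off are
        # the pruned subtrees, and what remains attached to the root is T1.
        cuts = set()
        for u, kids in children.items():
            if leaf_count(children, u) <= 2 * threshold:
                continue                              # only "big" nodes trigger pruning
            for v in kids:
                lv = leaf_count(children, v)
                if lv <= threshold:
                    cuts.add((u, v))                  # prune v together with its subtree
                elif lv <= 2 * threshold:
                    cuts.update((v, w) for w in children.get(v, ()))  # prune below v
        return cuts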

PHASE 2: GROUPING

We first define a heavy node of T1 as any node v of T1 such that the total weight of v's children pruned during phase 1 exceeds n^x. More formally:

v is heavy iff Σ l(u) > n^x, the sum ranging over the children u of v that were pruned in phase 1.

We then partition the pruned children of each heavy node into groups Gi such that the number of leaves in any such group is at most 2·n^x. Finally, for any node of T1 that lost some of its children, but not enough to become heavy, we add a group Bi to represent its pruned children.

We then add, under the node each group was pruned from, a leaf representing that group, except when doing so would hang the representative under a leaf of T1 and introduce an unnecessary unary path. Let T2 be the resulting tree. A simple way to form the groups is sketched below.
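One simple way to form the groups of phase 2 is a greedy packing of the pruned children of a heavy node; pruned, leaf_counts and threshold (standing for n^x) are names I introduce for this sketch.

    def group_pruned_children(pruned, leaf_counts, threshold):
        # Pack the pruned children of one heavy node into groups whose total
        # number of leaves never exceeds 2 * threshold (i.e. 2 * n^x).
        groups, current, weight = [], [], 0
        for v in pruned:
            if current and weight + leaf_counts[v] > 2 * threshold:
                groups.append(current)
                current, weight = [], 0
            current.append(v)
            weight += leaf_counts[v]
        if current:
            groups.append(current)
        return groups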

PHASE 3: CONTRACTING

Since the parenthesis labeling scheme needs a tree without unary paths (i.e. at least binary) to assign distinct labels to each node, and since the pruning phase may have left some unary paths in T2, we now contract these paths, in a way similar to the grouping of phase 2. For any maximal sequence u_1, u_2, …, u_k such that each u_i is the sole child of u_{i-1} in T2, and such that u_k has a sole child in T2, we partition the nodes along this path so that no group Ci thus formed has had more than 2·n^x total weight of children pruned in phase 1. We then represent each Ci by a single node and attach to it a leaf Li representing the pruned children.

Let T3 be the resulting tree.

ASSIGNING LABELS

We first label T3 and all the subtrees we have constructed with the parenthesis scheme.

We can now assign labels to nodes. A label is composed of: a type bit; the label in T3 of the node's representative; the index of the node, for nodes belonging to a Ci/Li pair; and the parenthesis label of the node within the subtree it belongs to.

• The type bit is 0 if u was not pruned in phase 1 (i.e. u is a node of T1), and 1 if it was.

• The representative of a node u is the node of T3 representing the subtree or group u belongs to, or u itself if u is a node of T3.

• The index of a node u belonging to a Ci is its depth along the contracted path, with 1 being nearest to the root of the tree. For a node v belonging to a subtree Li, it is the index of the nearest ancestor of v in the contracted path Ci. For any other node the index is 0.

• The subtrees are also labeled using the basic parenthesis scheme.

More concisely:

Label = type bit / label in T3 / index in Ci / (label in the subtree Gi or Li)

Label(u) = [0; L_T3; 0] if u is a node of T3 (neither pruned nor contracted)

= [0; L_T3; Ind(Ci)] if u belongs to a contracted group Ci

= [1; L_T3; Ind(Ci); L_Li] if u was pruned and belongs to a subtree grouped in some Li

= [1; L_T3; 0; L_Gi] if u was pruned and belongs to a group Gi (or Bi)

Label size:

We must separate the case where the type bit is 0 from the case where it is 1. Let us first recall a property of the trees constructed: card(T3) ≤ 3·n^(1-x).

Let L be the label size of a node u.

If the type of u is 0:

L = 1 + 2·log(3·n^(1-x)) + log(2·n^x)

= (2 − x)·log(n) + O(1)

If the type of u is 1:

L = 1 + log(3·n^(1-x)) + log(2·n^x) + 2·log(2·n^x)

= (1 + 2x)·log(n) + O(1)

To balance the size of these labels, we choose x = 1/3 .
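Setting the two bounds equal shows why x = 1/3 is the balancing value:

(2 − x)·log(n) = (1 + 2x)·log(n), hence 2 − x = 1 + 2x, hence x = 1/3,

so that both label types have length (5/3)·log(n) + O(1).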

ANSWERING ANCESTOR QUERIES

When answering an ancestor query of the type “is u an ancestor of v?”, we consider the following cases:

• Case 1: u and v were pruned during phase 1 and have the same representative gi in T3: compare the ranges of u and v in Gi

• Case 2: u and v were contracted as part of the same sub-path in phase 3: compare the indices of u and v

• Case 3: u and v have different representatives: compare the labels of their representatives in T3

• Case 3': u and v have different representatives forming a Ci/Li pair, with rep(u) = Ci: compare the indices of u and v

Together, these cases cover the configurations needed to answer an ancestor query from the labels alone.

THE RECURSIVE TECHNIQUE

We can generalize this technique to a recursive scheme: we iterate the 3-phase prune and contract algorithm on the T3 tree obtained after each iteration. Let T3_j be the tree obtained after iteration j. Between iterations, the parameter x_i is modified (increased) so as to gradually prune more of the original tree. We extend the notion of representative, so that the representative of u at iteration j, r_j(u), is the representative of r_{j-1}(u) in T3_j.

Let k be the number of iterations.

We introduce the notion of level:

The level of a node u is the number of the iteration at which it was pruned, or 0 if it has never been pruned. More formally, it is the minimum j such that r_j(u) is a leaf of T3_j and is not u. The type bit of the non-recursive scheme is exactly this level.

We generalize the labeling scheme; a label is now composed of:

• A prefix giving the level

• k+1 fixed-length fields indexed from k down to 0:

• The k-th field contains the label of r_k(u) in the final tree T3_k obtained after the last iteration

• For j varying from k−1 down to level − 1, the j-th field contains two parts:

• The index of r_{j-1}(u)

• The label of r_{j-1}(u) in its subtree

• For the fields from level − 2 down to 0:

• The index of r_{j-1}(u)

The authors prove that, by choosing x_i appropriately at each iteration i, we obtain maximum labels of size (3/2)·log(n) + O(log log(n)), thus improving on the 2·log(n) maximum label size of the basic parenthesis scheme by a constant factor.

2.3 A critical analysis of these results

Some things disturb me in this paper. The questions that spring to my mind after reading the paper carefully are:

• Why so complicated? The best algorithms are usually also simple ones, as the parenthesis scheme is.

• The maximum label size is reduced, but what about the average label size? That is, after all, the truly important quantity, since the whole DB must be stored in the end.

• Furthermore, though it is true that parenthesis-scheme labels have a maximal length of 2·log(n) for internal nodes, we are working here under a very strong constraint: the tree is at least binary, i.e. any internal node has at least two children. This implies that there are more leaves than internal nodes, which does wonders for the ratio of “long” internal labels to shorter leaf labels. In fact, if f is the number of leaves and n the total number of nodes, then f ≥ (n+1)/2, so there are at most f − 1 internal nodes. The total length needed to store the DB is then f·log(f) + (n − f)·2·log(f) = (2n − f)·log(f), which is largest in the worst case f = (n+1)/2, where it amounts to roughly (3/2)·n·log(n).

Thus the average label size with the parenthesis scheme is at most (3/2)·log(n), which is truly a very good figure, knowing that the lower bound without any ancestry information is log(n).

NB: I consider here that leaf labels are in fact implemented as being half the size of internal node labels; this is easily accomplished by adding a type bit, the length of which will fall into the O(1) term.

• Looking at the details, there is a complicated process to manage ancestry within Ci/Li pairs. Introducing the index field to work around this problem is obviously a reasonable solution, but it is that much more information to encode. Moreover, it is needed only in this particular case, which was introduced by the algorithm itself in the first place, yet it is an added field in every label. It simply does not feel right.

• Lastly, on the example above, label size goes up from 5/10 bits per node to 12/14 bits. Though constant additive factors explain this, it is a bit disturbing.

Having my doubts as to the gain in terms of average label size, I tried to compute this average size for the Prune and Contract algorithm, but was not successful, because of the many parameters that influence the result; in particular, the geometry of the tree plays an important role. Having failed in this direction, I can only resort to exhibiting an example geometry for which the Prune and Contract algorithm produces an average label size greater than that of the basic parenthesis scheme.

This example is by no means a worst-case scenario; otherwise the values computed here could be used to derive the maximum average label size of the P&C algorithm. It is just an example devised to pinpoint the problem.

Let us consider the following three-level tree: a single root R, (n−1)/3 children of this root forming U1, each of them with two children in U2, hence 2(n−1)/3 leaves in the tree. Ui denotes the set of nodes at depth i.

We can see that:

l(R) = 2(n−1)/3 > 2·n^x, while, for n large enough, l(u) = 2 ≤ n^x for any u in U1 and l(u) = 1 for any u in U2.

The P&C algorithm will therefore prune all the nodes but R in phase 1, and reconstruct G ≈ n^(1−x)/3 groups to represent these pruned nodes.

Label lengths will therefore be:

• For U0: 1 + 2·log(G) + log(2·n^x)

• For U1: 1 + log(G) + log(2·n^x) + 2·log(2·n^x)

• For U2: 1 + log(G) + log(2·n^x) + log(2·n^x)

And we have:

• 1 node of type U0

• (n-1)/3 nodes of type U1

• 2(n-1)/3 nodes of type U2

For L, the total length in bits needed to store the tree, we sum these terms:

L = L(U0) + (n−1)/3 · L(U1) + 2·(n−1)/3 · L(U2)

Neglecting constant factors and O(log(n)) terms before O(n) ones, and using log(G) ≈ (1−x)·log(n) (so that L(U1) ≈ (1+2x)·log(n) and L(U2) ≈ (1+x)·log(n)), we obtain:

L ≈ [ (1 + 2x) + 2·(1 + x) ] / 3 · n·log(n) = (3 + 4x)/3 · n·log(n)

For x = 1/3, this gives L ≈ (13/9)·n·log(n), i.e. an average label size of about 1.44·log(n), which must be compared with the average size obtained by the parenthesis scheme on this very tree, about (4/3)·log(n) ≈ 1.33·log(n) (two thirds of the nodes are leaves with labels of log(f) bits, one third are internal with 2·log(f) bits).
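The comparison can be checked numerically; the short sketch below plugs the per-class label lengths above into Python for one concrete n (the variable names are mine, and the exact values depend on which constants are kept or dropped):

    from math import log2, ceil

    n = 3 * 10**6 + 1              # 1 root, (n-1)/3 nodes in U1, 2(n-1)/3 leaves in U2
    x = 1 / 3
    nx = n ** x                    # threshold n^x
    f = 2 * (n - 1) // 3           # number of leaves
    G = ceil((n - 1) / (3 * nx))   # approximate number of groups rebuilt after pruning

    L_u0 = 1 + 2 * log2(G) + log2(2 * nx)
    L_u1 = 1 + log2(G) + log2(2 * nx) + 2 * log2(2 * nx)
    L_u2 = 1 + log2(G) + log2(2 * nx) + log2(2 * nx)
    pc_total = L_u0 + (n - 1) / 3 * L_u1 + 2 * (n - 1) / 3 * L_u2

    # Parenthesis scheme on the same tree: 2*log(f) bits per internal node, log(f) per leaf.
    paren_total = ((n - 1) / 3 + 1) * 2 * log2(f) + f * log2(f)

    print(pc_total / (n * log2(n)))     # about 1.5 here, tending towards 13/9 ~ 1.44 as n grows
    print(paren_total / (n * log2(n)))  # about 1.3 here, tending towards 4/3 ~ 1.33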

As we can see, the comparison does not favor the P&C scheme. Indexing a database with this scheme may increase the total space needed to store the index table, compared to using the basic parenthesis scheme, which moreover has the advantage of a simpler, hence faster, ancestor test.

(I am not quite sure of the validity of the next couple of paragraphs.)

We can also invalidate the recursive scheme using the same geometry for our sample tree, extended to k+1 levels in depth. The recursive tree geometry is the following:

Let T_1 be the tree I used for the single-iteration problem.

Let T_{i+1} be the tree constructed from T_i by giving every leaf of T_i two leaf children in T_{i+1}.

The tree resulting from one iteration of P&C on such a tree again matches the T_i model, due to the symmetry it presents. As I have shown above, indexing such a tree with P&C yields longer labels than the parenthesis scheme; therefore the recursive scheme also increases the average length of the labels.

(This is not very well formalized, but my feeling is that this approach would yield the result I announce.)

Index structures for path expressions: T-Index

3.1 Context

In this section I present a paper that proposes an ingenious index structure, tailored to accelerate the computation of queries that conform to a given template T, on semi-structured databases with a poorly known or irregular structure. For lack of space I will not present the details of this method, but only the general principle and one or two examples.

We consider a tree representing the database, where the arcs are labeled with the names of the attributes and the nodes represent the actual data. Such a tree is an accurate description of an XML database. It can also be considered as defining a language, because it describes a finite state machine. The authors use this property of the data tree to apply reduction techniques known from data mining and model checking, namely algorithms that build a bisimulation of this finite state machine with respect to a language defined by the template. The technique consists in building equivalence classes of states (nodes) that are equivalent with respect to the language.

We seek to answer queries of the type

Select xi1, xi2, ..., xik

From P1 x1 P2 x2 … Pn xn

where {xi1, ..., xik} is a subset of {x1, ..., xn} and the Pi are path expressions defined by

P ::= ε | F | (P|P) | (P.P) | P*

Example:

Select x

From (*.Restaurant) x (Menu.*.Dinner.*.Lasagna) y
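To make the semantics of such queries concrete, here is a naive way of evaluating a single-variable path query on tree-shaped data, by matching a regular expression against the label path from the root to each node. This is only a sketch under my own representation (children, tag, root); it is not the evaluation strategy of the paper.

    import re

    def root_paths(children, tag, root):
        # Yield (node, dotted label path from the root) for every node of a tree.
        stack = [(root, "")]
        while stack:
            u, path = stack.pop()
            yield u, path
            for v in children.get(u, []):
                stack.append((v, path + "." + tag[(u, v)] if path else tag[(u, v)]))

    def select(children, tag, root, regex):
        # Return the nodes whose root path matches the query, compiled to a regex.
        rx = re.compile(regex)
        return [u for u, p in root_paths(children, tag, root) if rx.fullmatch(p)]

    # The binding of x in  Select x From (*.Restaurant) x  becomes, roughly:
    # select(children, tag, root, r"(.*\.)?Restaurant")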

A template T defined with such path expressions can be instantiated by a class of queries.

Example:

Template T = (*.Restaurant) x1 P x2 Name x3 F x4 can be instantiated for example by

q1 = (*.Restaurant) x1 * x2 Name x3 McMickey x4

q2 = (*.Restaurant) x1 * x2 Name x3 _ x4

q3 = (*.Restaurant) x1 (ε|_) x2 Name x3 McMickey x4

3.2 1-index

The first type of index we build here as an example is the 1-index, corresponding to the template T = P x. Such an index is a concise representation of all the paths in the DB starting from the root. It places in the same node of I(DB), the 1-index of DB, all the nodes of DB that are reachable by the same sequences of tags from the root.

An example:

[Figure: a sample database DB and its 1-index I(DB).]

As we can see, answering q = Select x From t.a x will be faster on I(DB) than on DB, because I(DB) is a smaller graph (only two paths to explore instead of 5).
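On tree-shaped data every node has a single incoming path, so a 1-index can be sketched simply by grouping nodes on their root label path; on general graph data the paper relies on a bisimulation instead. The names below (children, tag, root) are my own assumptions for this sketch.

    from collections import defaultdict

    def one_index(children, tag, root):
        # classes: extent of each index node, keyed by the label path from the root
        # edges:   arcs of I(DB), labeled like the corresponding DB arcs
        classes, edges = defaultdict(set), set()
        stack = [(root, "")]
        while stack:
            u, path = stack.pop()
            classes[path].add(u)
            for v in children.get(u, []):
                child_path = path + "." + tag[(u, v)] if path else tag[(u, v)]
                edges.add((path, tag[(u, v)], child_path))
                stack.append((v, child_path))
        return classes, edges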

3.3 2-index

A 2-index is constructed to answer queries conforming to T = * x P y. It thus captures the language of paths between pairs of nodes, as opposed to the 1-index, which uses paths from the root. In a node of the 2-index I2(DB) we find pairs of nodes of DB: if (u,v) and (u',v') are in the same node of I2(DB), the same label paths lead from u to v and from u' to v'.

We place in the root the equivalence class of all pairs of the form (u,u) (empty path). We then construct the graph by adding an arc labeled t between index nodes n1 and n2 if (u,v) is in n1, (u,v') is in n2, and there is an arc labeled t between v and v' in DB.

We can note that evaluating q = Select x,y From * x t.a y on this graph saves us a costly examination of all the nodes in DB, since we just have to explore the t.a paths from the root of I2(DB).
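In the same spirit, a 2-index on tree data can be sketched by grouping the pairs (u, v), where v is a descendant of u (or u itself), on the label path from u to v; this again exploits the tree shape, whereas the paper's construction handles arbitrary graphs.

    from collections import defaultdict

    def two_index(children, tag, root):
        # classes: {label path between the two nodes: set of pairs (u, v)}
        # The class of the empty path "" is the root of I2(DB); it holds all pairs (u, u).
        classes = defaultdict(set)

        def visit(u):
            below = [(u, "")]                  # descendants of u, with the path from u
            for v in children.get(u, []):
                for w, p in visit(v):
                    below.append((w, tag[(u, v)] + ("." + p if p else "")))
            for w, p in below:
                classes[p].add((u, w))
            return below

        visit(root)
        return classes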

3.4 T-Index

This notion can be extended to the T-index, an index based on a template of the form T = P1 x1 … Pn xn. This is both a generalization and a specialization of the 1-index and the 2-index: both are particular T-indexes, but whereas any path query can be answered with a 1-index, a query must conform to the template T for a T-index to be useful.

Examples:

Suppose a sample of queries shows that:

• restaurants are often selected by their name; we could then use the template T = (*.Restaurant) x1 P x2 Name x3 F x4 to compute an index assisting these queries;

• most queries select fewer than four contiguous fields; the template T = _ x _ y _ z _ t could then be used.

Some queries that do not “fit” the template might nevertheless be rewritten to take advantage of a particular index, though the paper does not explore all possible rewriting rules. Furthermore, though this scheme is space-consuming, the index may be stored in compressed form. The authors also suggest that the exploration of the index can be further accelerated by hashing the outgoing arcs of each node on their label. Finally, they show that if the database is k-short, meaning that no node is further than k arcs from the root, the size of the index is independent of the size of the DB.

Using this type of index for heterogeneous data, or when the schema is not well known, in combination with the more standard B-tree or hashing techniques on the more regular parts of a DB, seems particularly effective.

Conclusion

I will avoid drawing the obvious conclusion that, with XML emerging as the standard for web data exchange, devising specific indexing schemes for semi-structured data has become a necessity.

I will just say that I really liked the second approach. It is very close in some respects to my current object of study, namely a reduction of Petri nets that exploits symmetries by creating a symbolic reachability graph, in which each node represents an equivalence class. Furthermore, it does not pretend to be the solution, just a good solution to certain problems, one that can be used jointly with the optimized algorithms database managers know and love; the paper is thus doubly convincing of the validity of this approach.

The first article was interesting from an intellectual point of view, but all I can say is that I am not convinced. I am also surprised, because the reasoning in the paper is sound and very thorough, yet the observations I make here on the average label length are hardly ground-breaking once one thinks about the problem, as the authors obviously have. The paper is so focused on the maximal length that this aspect is never mentioned, even though the indexing scheme is suggested for use in data warehouse projects.

-----------------------

Footnote: The best known labeling scheme, called the parenthesis scheme, uses a depth-first exploration of the tree to label the leaves with unique, ordered identifiers, and each inner node N with a pair representing the interval [m..M], where m (resp. M) is the smallest (resp. largest) leaf identifier occurring in the subtree rooted at N.

[Figure: the tree T used throughout the example, labeled with the parenthesis scheme; with f = 20 leaves this scheme yields labels of at most 10 bits. The tree has n = 33 nodes; n^x = n^(1/3) ≈ 3.2 is used as the threshold value for the partition.]

[Figure: the tree T1, with the values l(u) used to compute it.]

Note that a node v in T1 has the following properties:

• l(v) > n^x

• if v is an internal node, l(v) ≤ 2·n^x

• if v has a child u that was pruned with l(u) > n^x, then v is a leaf in T1 and l(v) ≤ 2·n^x

[Figure: the tree T1 with its subtree weights, and the groups G1..G4 and B1 formed in phase 2; heavy nodes are shown in black.]

[Figures: the trees T2 and T3 with their parenthesis labels and the representatives of the groups G1..G4, B1 and of the contracted pair C1/L1; and the three-level example tree of section 2.3, with root U0 = R, (n−1)/3 nodes in U1 and 2(n−1)/3 leaves in U2.]

[Figures: a sample database DB (nodes 1 to 13, arcs labeled a, b, c, d, t), its 1-index I(DB), and its 2-index I2(DB), whose nodes contain pairs of DB nodes.]
