Jaql: Querying JSON data on Hadoop

IBM Almaden Research Center

Kevin Beyer

Research Staff Member IBM Almaden Research Center

In collaboration with Vuk Ercegovac, Ning Li, Jun Rao, Eugene Shekita

© 2008 IBM Corporation

Outline

• Overview of Hadoop
• JSON
• Jaql query language

The Hadoop Stack

Components:

• Map-Reduce: parallel batch processing
• HBase: simple distributed database
• HDFS: distributed file system

Horizontal features:

• Used at large scale (e.g., 10,000 cores at Yahoo)
• Elastic (w/out data re-org)
• Fault tolerant (getting there...)
• Easy to administer

Non-features:

• No data model or types in HDFS or HBase
• No indexing
• No query language

HDFS Overview

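[Figure: a file is split into blocks (typically 64 MB) and stored by HDFS across servers (2-4 disks each) in racks 1 to N; each rack has its own switch, and the racks are connected through a switch.]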

• Single file system stored on direct-attached disks of commodity servers
• Replicates file blocks to tolerate failures
• Simplified file system interface (not POSIX)
  • Designed for large, sequential reads
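To make the block layout concrete, the following is a minimal sketch of how a file's byte range could be cut into fixed-size blocks. The helper name `split_into_blocks` and the 200 MB example file are illustrative assumptions and not part of any HDFS API; only the 64 MB block size comes from the slide.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) pairs covering a file of file_size bytes.

    Hypothetical helper for illustration only; not an HDFS API.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# Assumed example: a 200 MB file with the slide's 64 MB block size.
blocks = split_into_blocks(200 * MB)
print(len(blocks))          # 4 (three full blocks plus one partial block)
print(blocks[-1][1] // MB)  # 8 (size of the final block in MB)
```

Each block would then be replicated and read sequentially as a unit, which is why the interface favors large reads.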

HBase Overview

Logical view of table (no schema, no types; sorted by key):

key      columns (name: value)
p127532  itemType: "car", make: "VW", ..., doors: 2, ...
p187842  itemType: "apartment", rooms: 3, ..., rent: 1200, ..., location: "45E, 32N"

Physical view of table (in HDFS), column values:

• Are versioned
• Stored vertically in HDFS:
  • Redundancy through HDFS replication
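The table model described on this slide (schema-free rows, sorted by key, versioned column values, column-wise storage) can be approximated with a small in-memory sketch. The class `TinyTable` and its methods are hypothetical and are not the HBase client API; the second `rent` value of 1250 is an assumed data point added only to show versioning.

```python
class TinyTable:
    """Toy table: row key -> column name -> list of (version, value).

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self):
        self._rows = {}

    def put(self, key, column, value, version):
        # No schema and no types: any column name and any value is accepted.
        cells = self._rows.setdefault(key, {}).setdefault(column, [])
        cells.append((version, value))

    def get(self, key, column):
        """Return the value with the highest version, or None if absent."""
        cells = self._rows.get(key, {}).get(column)
        if not cells:
            return None
        return max(cells, key=lambda cell: cell[0])[1]

    def scan(self):
        """Yield (key, {column: latest value}) in sorted key order."""
        for key in sorted(self._rows):
            yield key, {col: self.get(key, col) for col in self._rows[key]}

    def column(self, name):
        """Vertical view: (key, latest value) pairs for a single column."""
        return [(key, self.get(key, name))
                for key in sorted(self._rows) if name in self._rows[key]]


table = TinyTable()
table.put("p187842", "itemType", "apartment", version=1)
table.put("p187842", "rent", 1200, version=1)
table.put("p127532", "itemType", "car", version=1)
table.put("p127532", "doors", 2, version=1)
table.put("p187842", "rent", 1250, version=2)  # assumed newer version

print([key for key, _ in table.scan()])  # ['p127532', 'p187842']
print(table.get("p187842", "rent"))      # 1250
```

The `column` method mirrors the "stored vertically" idea: values of one column are read together, independent of the other columns in each row.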

Map-Reduce Overview

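[Figure: input values Vi are read by map tasks M1 to M4, which emit [ Km, Vm ] pairs; the shuffle groups them into Km, [ Vm ] for reduce tasks R1 and R2, which write [ Vr ] to the output.]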

Programmer focus:

• Map: Vi → [ Km, Vm ]
• Reduce: Km, [ Vm ] → Vr

System provides:

• Parallelism
• Fault tolerance
• Key partitioning (shuffle)
• Synchronization
• Map task reads local block
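The division of labor above can be sketched as a single-process driver: the programmer supplies only the map and reduce functions, and the driver does the grouping by key. This is a minimal sketch, not Hadoop's API; `map_reduce`, `by_parity`, and `total` are hypothetical names, and the parallelism, fault tolerance, and data locality that the real system provides are not modeled.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Apply mapper to each input, group by key, then reduce each group.

    mapper:  Vi -> iterable of (Km, Vm)
    reducer: (Km, [Vm]) -> Vr
    """
    groups = defaultdict(list)
    for value in inputs:                # map phase
        for key, mapped in mapper(value):
            groups[key].append(mapped)  # shuffle: partition by key
    # Reduce phase: one call per distinct key, in sorted key order.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Assumed toy job: sum the numbers in each parity class.
def by_parity(n):
    yield ("even" if n % 2 == 0 else "odd", n)


def total(key, values):
    return key, sum(values)


print(map_reduce([1, 2, 3, 4, 5], by_parity, total))
# [('even', 6), ('odd', 9)]
```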

Example: Counting Words

Map input (Vi), a document line:

"The following example is simple and is the `Hello, World' for Map-Reduce"

Map output from Mi (Km, Vm), as [(Word, Count), ...]:

the, 2
following, 1
example, 1
...

Reduce input (Km, [Vm, ...]), as (Word, [Count, Count, ...]), and reduce output (Vr), as Word, Count:

R1: the, [2,1,13,7,7] → the, 30
R1: following, [2,1] → following, 3
R2: example, [1] → example, 1

Aggregate locally when possible (combine step)
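The word-count flow, including the local combine step, can be sketched in a few lines. This is an illustrative single-process version, not Hadoop code: the function names are hypothetical, and the tokenization rule (lowercase alphabetic runs, so "Map-Reduce" yields "map" and "reduce") is an assumption the slide does not specify.

```python
import re
from collections import Counter, defaultdict


def map_line(line):
    """Map: one document line -> [(word, count), ...].

    Counting within the line before emitting plays the role of the
    combine step: "the" is emitted once with count 2, not twice with 1.
    """
    words = re.findall(r"[a-z]+", line.lower())  # assumed tokenization
    return list(Counter(words).items())


def reduce_word(word, counts):
    """Reduce: (word, [count, ...]) -> (word, total)."""
    return word, sum(counts)


def count_words(lines):
    shuffled = defaultdict(list)
    for line in lines:
        for word, count in map_line(line):
            shuffled[word].append(count)  # shuffle: group counts by word
    return dict(reduce_word(w, c) for w, c in shuffled.items())


line = "The following example is simple and is the `Hello, World' for Map-Reduce"
mapped = dict(map_line(line))
print(mapped["the"], mapped["following"], mapped["example"])  # 2 1 1
```

Combining locally shrinks the data sent through the shuffle, since each map task emits at most one pair per distinct word.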

Outline

• Overview of Hadoop
• JSON
• Jaql query language
