Jaql: Querying JSON data on Hadoop

IBM Almaden Research Center

Kevin Beyer

Research Staff Member, IBM Almaden Research Center

In collaboration with Vuk Ercegovac, Ning Li, Jun Rao, Eugene Shekita

© 2008 IBM Corporation

Outline

- Overview of Hadoop
- JSON
- Jaql query language

The Hadoop Stack

Components:

- Map-Reduce: parallel batch processing
- HBase: simple distributed database
- HDFS: distributed file system

Horizontal features:

- Used at large scale (e.g., 10,000 cores at Yahoo)
- Elastic (without data reorganization)
- Fault tolerant (getting there...)
- Easy to administer

Non-features:

- No data model or types in HDFS or HBase
- No indexing
- No query language
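
The "parallel batch processing" that the Map-Reduce component provides can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_item_types`) and the sample records are hypothetical, and a real job would run the map and reduce steps on many servers at once.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key, visiting keys in sorted order."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def count_item_types(records):
    """Count records per itemType (hypothetical example job)."""
    def mapper(record):
        return [(record["itemType"], 1)]

    def reducer(key, values):
        return sum(values)

    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical input records, invented for illustration.
listings = [{"itemType": "car"}, {"itemType": "apartment"}, {"itemType": "car"}]
print(count_item_types(listings))
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be split across machines without coordination, which is what makes the model suited to batch work.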

HDFS Overview

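[Figure: a file is split into blocks (typically 64 MB) that HDFS spreads across servers (2-4 disks each) in racks 1 through N; each rack has its own switch, and the rack switches connect through a further switch.]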

- Single file system stored on direct-attached disks of commodity servers
- File blocks are replicated to survive failures
- Simplified file system interface
  - Not POSIX
  - Designed for large, sequential reads
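
As a rough illustration of block splitting and replication, the sketch below computes how many 64 MB blocks a file occupies and assigns each block's replicas to distinct servers. The replication factor of 3 is an assumption (the slide gives no number), the server names are hypothetical, and the round-robin placement is a simplification that ignores rack awareness.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # "typically 64 MB blocks", per the slide
REPLICATION = 3                # assumption: the slide does not state a factor


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file; the last block may be partly filled."""
    if file_size_bytes <= 0:
        return 0
    return (file_size_bytes + block_size - 1) // block_size


def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Round-robin placement so each block's replicas sit on distinct servers.

    Illustrative only: it does not model racks, disk usage, or failures.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


one_gib = 1024 * 1024 * 1024
blocks = block_count(one_gib)  # 1 GiB / 64 MiB = 16 blocks
placement = place_replicas(blocks, ["s1", "s2", "s3", "s4"])
print(blocks, placement[0], placement[3])
```

With replicas on different servers, losing one server leaves at least two copies of every block it held, under the assumed factor of 3.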

HBase Overview

Logical view of table:

key       column name: value
p127532   itemType: "car"         make: "VW"   doors: 2     ...
p187842   itemType: "apartment"   rooms: 3     rent: 1200   ...   location: "45E, 32N"

- No schema, no types
- Sorted by key

Physical view of table (in HDFS):

Column values
- Are versioned
- Stored vertically in HDFS:

Redundancy through HDFS replication
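
The logical view above (rows sorted by key, free-form columns, versioned values, no schema or types) can be modelled in a few lines. This is a toy in-memory sketch, not the HBase client API: the class `SketchTable` and its methods are hypothetical, and a simple counter stands in for whatever version stamps the real system uses.

```python
class SketchTable:
    """Toy model of the logical view: key -> column name -> versioned values."""

    def __init__(self):
        self._rows = {}   # key -> {column name -> [(version, value), ...]}
        self._clock = 0   # assumption: a logical counter used as the version

    def put(self, key, column, value):
        """Append a new version of a column value; older versions are kept."""
        self._clock += 1
        versions = self._rows.setdefault(key, {}).setdefault(column, [])
        versions.append((self._clock, value))

    def get(self, key, column, version=None):
        """Return the newest value, or the newest at or before `version`."""
        versions = self._rows.get(key, {}).get(column, [])
        if version is not None:
            versions = [v for v in versions if v[0] <= version]
        return versions[-1][1] if versions else None

    def scan(self):
        """Yield (key, latest column values) in key order."""
        for key in sorted(self._rows):
            yield key, {c: vals[-1][1] for c, vals in self._rows[key].items()}


table = SketchTable()
table.put("p187842", "itemType", "apartment")
table.put("p187842", "rent", 1200)
table.put("p127532", "itemType", "car")
table.put("p127532", "make", "VW")
table.put("p187842", "rent", 1250)  # hypothetical update, to show versioning

for key, columns in table.scan():
    print(key, columns)
```

Rows need not share columns and values carry no declared types, which matches the "no schema, no types" note; `get("p187842", "rent", version=2)` would still return the earlier value because old versions are retained.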
