Working Within the Data Lake


With AWS Glue


© 2020, Amazon Web Services, Inc. or its Affiliates.

Table of contents

1. Optimizing for Cost and Performance
2. Cataloging Data Schemas with AWS Glue
3. Transforming Data with AWS Glue
4. AWS Glue ML Transform and Workflows


Session's Focus - Working in the Data Lake

Catalog & Search: Amazon DynamoDB, Amazon Elasticsearch Service, AWS Glue

Data Ingestion: AWS Snowball, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, AWS Storage Gateway, AWS DataSync, AWS Transfer for SFTP, Amazon S3 Transfer Acceleration

Central Storage: S3 (scalable, secure, cost-effective)

Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito

Manage & Secure: AWS KMS, AWS IAM, AWS CloudTrail, Amazon CloudWatch

Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS

Optimizing for Cost and Performance



Partitioning: Pay for data your query needs, not to scan all of your data

Compression: Pay for what you store, not for what you process

Managed Services: Pay for what you use, not for what you run

Partitioning

Non-partitioned layout:

datalake
    20170515T1423-GB-01.tar.gz
    20170515T1423-GB-02.tar.gz
    20170515T1500-US-01.tar.gz
    20170516T1500-US-01.tar.gz
    20170516T1600-GB-01.tar.gz
    20170516T1600-GB-02.tar.gz

select * from datalake where dt='20170515' and country='US'

Partitioned layout:

datalake
    dt=20170515
        country=GB
            20170515T1423-GB-01.tar.gz
            20170515T1423-GB-02.tar.gz
        country=US
            20170515T1500-US-01.tar.gz
    dt=20170516
        country=GB
            20170516T1600-GB-01.tar.gz
            20170516T1600-GB-02.tar.gz
        country=US
            20170516T1500-US-01.tar.gz
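The move from a flat listing to `dt=`/`country=` folders can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own tooling: the helper names `partitioned_key` and `prune` are hypothetical, and the file-name pattern is inferred from the example names on the slide.

```python
import re

# Assumption: names look like <yyyymmdd>T<hhmm>-<CC>-<nn>.tar.gz, as on the slide.
_NAME = re.compile(r"^(?P<dt>\d{8})T\d{4}-(?P<country>[A-Z]{2})-\d{2}\.tar\.gz$")


def partitioned_key(filename, prefix="datalake"):
    """Build a Hive-style partitioned key (dt=.../country=...) for one file."""
    m = _NAME.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return f"{prefix}/dt={m['dt']}/country={m['country']}/{filename}"


def prune(keys, dt, country):
    """Keep only keys inside the requested partition, skipping everything else."""
    wanted = f"/dt={dt}/country={country}/"
    return [k for k in keys if wanted in k]


files = [
    "20170515T1423-GB-01.tar.gz",
    "20170515T1423-GB-02.tar.gz",
    "20170515T1500-US-01.tar.gz",
    "20170516T1500-US-01.tar.gz",
    "20170516T1600-GB-01.tar.gz",
    "20170516T1600-GB-02.tar.gz",
]

keys = [partitioned_key(f) for f in files]
matching = prune(keys, dt="20170515", country="US")
print(matching)
```

With this layout, the filter `dt='20170515' and country='US'` touches one of the six files; the flat layout gives the engine no way to skip any of them.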

Partitioning - Advantages

select count(*) from datalake where dt='20170515'

Layout            Run Time    Data Scanned   Cost      Results
Non-Partitioned   9.71 sec    74.1 GB        $0.36
Partitioned       2.16 sec    29.06 MB       $0.0001   77% faster, 99% cheaper

select count(*) from datalake where dt >= '20170515' and dt < '20170516'

Layout            Run Time    Data Scanned   Cost      Results
Non-Partitioned   10.41 sec   74.1 GB        $0.36
Partitioned       2.73 sec    871.39 MB      $0.004    73% faster, 98% cheaper
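The cost and speed-up figures can be reproduced from the run times and bytes scanned. The sketch below assumes a price of $5.00 per TB scanned and binary units (1 TB = 1024^4 bytes); the slide states neither, so treat both as assumptions. The function names are illustrative.

```python
# Assumption: on-demand price per TB scanned; not stated on the slide.
PRICE_PER_TB_USD = 5.00

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def scan_cost(bytes_scanned):
    """Cost in USD of a query that scans the given number of bytes."""
    return bytes_scanned / TB * PRICE_PER_TB_USD


def pct_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


# First query from the table: dt = '20170515'
flat_cost = scan_cost(74.1 * GB)
part_cost = scan_cost(29.06 * MB)
faster = pct_reduction(9.71, 2.16)
cheaper = pct_reduction(flat_cost, part_cost)

print(f"non-partitioned: ${flat_cost:.2f}")
print(f"partitioned:     ${part_cost:.4f}")
print(f"{int(faster)}% faster, {int(cheaper)}% cheaper")
```

Under these assumptions the results agree with the slide once percentages are rounded down, which suggests the reported savings come almost entirely from scanning less data.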


Compression

• Compressing your data can speed up your queries significantly

• Splittable formats enable parallel processing across nodes

Algorithm        Splittable     Compression Ratio   Algorithm Speed   Good For
Gzip (DEFLATE)   No             High                Medium            Raw Storage
bzip2            Yes            Very High           Slow              Very Large Files
LZO              Yes            Low                 Fast              Slow Analytics
Snappy           Yes and No *   Low                 Very Fast         Slow & Fast Analytics

* Depends on whether the source format is splittable and can output each record into a Snappy block
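The ratio-versus-speed trade-off in the table can be checked on your own data with the Python standard library. This sketch covers only gzip and bzip2, since LZO and Snappy need third-party packages; the sample payload and the `measure` helper are made up for illustration, and real results depend heavily on the data.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Compress `data`, verify the round trip, and report ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    if decompress(packed) != data:
        raise RuntimeError(f"{name}: round trip failed")
    return {"name": name, "ratio": len(data) / len(packed), "seconds": elapsed}


# Assumption: a small, highly repetitive CSV-like payload standing in for log data.
sample = b"20170515T1423,GB,click,/home\n" * 50_000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

for r in results:
    print(f"{r['name']:<6} ratio={r['ratio']:.1f}x  time={r['seconds'] * 1000:.1f} ms")
```

Running the same measurement on a representative slice of production data is a more reliable guide than any general table, because both ratio and speed vary with content and file size.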
