Working Within the Data Lake
With AWS Glue
© 2020, Amazon Web Services, Inc. or its Affiliates.
Table of contents
1. Optimizing for Cost and Performance
2. Cataloging Data Schemas with AWS Glue
3. Transforming Data with AWS Glue
4. AWS Glue ML Transform and Workflows
Session's Focus: Working in the Data Lake
Catalog & Search: Amazon DynamoDB, Amazon Elasticsearch Service, AWS Glue
Data Ingestion: AWS Snowball, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, AWS Storage Gateway, AWS DataSync, AWS Transfer for SFTP, Amazon S3 Transfer Acceleration
Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
Central Storage (scalable, secure, cost-effective): S3
Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS
Manage & Secure: AWS KMS, AWS IAM, AWS CloudTrail, Amazon CloudWatch
Optimizing for Cost and Performance
Partitioning: Pay for the data your query needs, not to scan all of your data
Compression: Pay for what you store, not for what you process
Managed Services: Pay for what you use, not for what you run
Partitioning
datalake
    20170515T1423-GB-01.tar.gz
    20170515T1423-GB-02.tar.gz
    20170515T1500-US-01.tar.gz
    20170516T1500-US-01.tar.gz
    20170516T1600-GB-01.tar.gz
    20170516T1600-GB-02.tar.gz
select * from datalake where dt='20170515' and country='US'
datalake
    dt=20170515
        country=GB
            20170515T1423-GB-01.tar.gz
            20170515T1423-GB-02.tar.gz
        country=US
            20170515T1500-US-01.tar.gz
    dt=20170516
        country=GB
            20170516T1600-GB-01.tar.gz
            20170516T1600-GB-02.tar.gz
        country=US
            20170516T1500-US-01.tar.gz
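The mapping from the flat file listing to the dt=/country= layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file-name pattern is inferred from the six example names on the slide, and the helper names partitioned_key and prune are hypothetical, not part of any AWS API.

```python
import re

# Assumed pattern, inferred from names such as 20170515T1423-GB-01.tar.gz
NAME_RE = re.compile(r"^(?P<dt>\d{8})T\d{4}-(?P<country>[A-Z]{2})-\d{2}\.tar\.gz$")


def partitioned_key(filename: str, root: str = "datalake") -> str:
    """Build a dt=/country= style key for one file (hypothetical helper)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return f"{root}/dt={match['dt']}/country={match['country']}/{filename}"


def prune(keys, dt: str, country: str, root: str = "datalake"):
    """Keep only the keys under one partition, as a partition-aware query would."""
    prefix = f"{root}/dt={dt}/country={country}/"
    return [key for key in keys if key.startswith(prefix)]


FILES = [
    "20170515T1423-GB-01.tar.gz",
    "20170515T1423-GB-02.tar.gz",
    "20170515T1500-US-01.tar.gz",
    "20170516T1500-US-01.tar.gz",
    "20170516T1600-GB-01.tar.gz",
    "20170516T1600-GB-02.tar.gz",
]
KEYS = [partitioned_key(name) for name in FILES]

# Only one of the six files has to be read for dt=20170515 and country=US.
print(prune(KEYS, "20170515", "US"))
```

With the flat layout, the same query would have to open all six files to find the matching rows; with the partitioned layout, the prefix alone rules out five of them.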
Partitioning - Advantages

select count(*) from datalake where dt='20170515'

                   Run Time     Data Scanned   Cost
Non-Partitioned    9.71 sec     74.1 GB        $0.36
Partitioned        2.16 sec     29.06 MB       $0.0001
Results: 77% faster, 99% cheaper

select count(*) from datalake where dt >= '20170515' and dt < '20170516'

                   Run Time     Data Scanned   Cost
Non-Partitioned    10.41 sec    74.1 GB        $0.36
Partitioned        2.73 sec     871.39 MB      $0.004
Results: 73% faster, 98% cheaper
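The figures on this slide can be checked with a little arithmetic. The sketch below assumes a price of $5 per TB scanned and binary units (1 TB = 1024 GB); neither is stated on the slide, but both reproduce the listed costs. The function names are illustrative only.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# Assumption: $5 per TB scanned. The slide does not state the price.
PRICE_PER_TIB_USD = 5.0


def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / TIB * PRICE_PER_TIB_USD


def percent_faster(before_sec: float, after_sec: float) -> float:
    """Run-time reduction as a percentage of the original run time."""
    return (before_sec - after_sec) / before_sec * 100


# First query on the slide: dt = '20170515'
print(round(scan_cost_usd(74.1 * GIB), 2))    # non-partitioned cost
print(round(scan_cost_usd(29.06 * MIB), 4))   # partitioned cost
print(int(percent_faster(9.71, 2.16)))        # run-time improvement, truncated
```

Under these assumptions the computed values line up with the table: $0.36 against $0.0001 for the first query, and a run-time reduction of 77%.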
Compression
• Compressing your data can speed up your queries significantly
• Splittable formats enable parallel processing across nodes
Algorithm        Splittable      Compression Ratio   Algorithm Speed   Good For
Gzip (DEFLATE)   No              High                Medium            Raw Storage
bzip2            Yes             Very High           Slow              Very Large Files
LZO              Yes             Low                 Fast              Slow Analytics
Snappy           Yes and No *    Low                 Very Fast         Slow & Fast Analytics

* Depends on whether the source format is splittable and can output each record into a Snappy block
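The ratio column can be explored with Python's standard library, which ships gzip and bzip2 codecs; LZO and Snappy need third-party packages and are left out here. This is a rough sketch: the sample data is made up and highly repetitive, so the ratios it prints say little about real datasets, and compression_report is a hypothetical helper.

```python
import bz2
import gzip


def compression_report(data: bytes) -> dict:
    """Return the compression ratio (original size / compressed size) per codec."""
    report = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        packed = codec.compress(data)
        # Sanity check: the compressed bytes must decompress to the input.
        if codec.decompress(packed) != data:
            raise RuntimeError(f"{name} round trip failed")
        report[name] = len(data) / len(packed)
    return report


# Made-up sample: one short record repeated many times.
SAMPLE = b"20170515T1423,GB,click\n" * 10000
REPORT = compression_report(SAMPLE)

for name, ratio in REPORT.items():
    print(f"{name}: {ratio:.1f}x smaller")
```

Timing the two compress calls on representative data would give a similar feel for the speed column.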