Beginning Apache Spark 2 (398 pages)

With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library -- Hien Luu






Beginning Apache Spark 2: With Resilient Distributed Datasets, Spark SQL, Structured Streaming and Spark Machine Learning library

Hien Luu, San Jose, California, USA

ISBN-13 (pbk): 978-1-4842-3578-2

ISBN-13 (electronic): 978-1-4842-3579-9

Library of Congress Control Number: 2018953881

Copyright © 2018 by Hien Luu

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr

Acquisitions Editor: Steve Anglin

Development Editor: Matthew Moodie

Coordinating Editor: Mark Powers

Cover designed by eStudioCalamar

Cover image designed by Freepik

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@ springer-, or visit . Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail editorial@; for reprint, paperback, or audio rights, please email bookpermissions@.

Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at bulk-sales.

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at 9781484235782. For more detailed information, please visit source-code.

Printed on acid-free paper



Table of Contents

About the Author

About the Technical Reviewer

Chapter 1: Introduction to Apache Spark
  Overview
  History
  Spark Core Concepts and Architecture
  Spark Clusters and the Resource Management System
  Spark Application
  Spark Driver and Executor
  Spark Unified Stack
  Spark Core
  Spark SQL
  Spark Structured Streaming and Streaming
  Spark MLlib
  Spark Graphx
  SparkR
  Apache Spark Applications
  Example Spark Application
  Summary

Chapter 2: Working with Apache Spark
  Downloading and Installing Spark
  Downloading Spark
  Installing Spark


  Having Fun with the Spark Scala Shell
  Useful Spark Scala Shell Commands and Tips
  Basic Interactions with Scala and Spark
  Introduction to Databricks
  Creating a Cluster
  Creating a Folder
  Creating a Notebook
  Setting Up the Spark Source Code
  Summary

Chapter 3: Resilient Distributed Datasets
  Introduction to RDDs
  Immutable
  Fault Tolerant
  Parallel Data Structures
  In-Memory Computing
  Data Partitioning and Placement
  Rich Set of Operations
  RDD Operations
  Creating RDDs
  Transformations
  Transformation Examples
  Actions
  Action Examples
  Working with Key/Value Pair RDD
  Creating Key/Value Pair RDD
  Key/Value Pair RDD Transformations
  Key/Value Pair RDD Actions
  Understand Data Shuffling
  Having Fun with RDD Persistence
  Summary


Chapter 4: Spark SQL (Foundations)
  Introduction to DataFrames
  Creating DataFrames
  Creating DataFrames from RDDs
  Creating DataFrames from a Range of Numbers
  Creating DataFrames from Data Sources
  Working with Structured Operations
  Introduction to Datasets
  Creating Datasets
  Working with Datasets
  Using SQL in Spark SQL
  Running SQL in Spark
  Writing Data Out to Storage Systems
  The Trio: DataFrames, Datasets, and SQL
  DataFrame Persistence
  Summary

Chapter 5: Spark SQL (Advanced)
  Aggregations
  Aggregation Functions
  Aggregation with Grouping
  Aggregation with Pivoting
  Joins
  Join Expressions and Join Types
  Working with Joins
  Dealing with Duplicate Column Names
  Overview of a Join Implementation
  Functions
  Working with Built-in Functions
  Working with User-Defined Functions


  Advanced Analytics Functions
  Aggregation with Rollups and Cubes
  Aggregation with Time Windows
  Window Functions
  Catalyst Optimizer
  Logical Plan
  Physical Plan
  Catalyst in Action
  Project Tungsten
  Summary

Chapter 6: Spark Streaming
  Stream Processing
  Concepts
  Stream Processing Engine Landscape
  Spark Streaming Overview
  Spark DStream
  Spark Structured Streaming
  Overview
  Core Concepts
  Structured Streaming Application
  Streaming DataFrame Operations
  Working with Data Sources
  Working with Data Sinks
  Deep Dive on Output Modes
  Deep Dive on Triggers
  Summary

Chapter 7: Spark Streaming (Advanced)
  Event Time
  Fixed Window Aggregation Over an Event Time
  Sliding Window Aggregation Over an Event Time


  Aggregation State
  Watermarking: Limit State and Handle Late Data
  Arbitrary Stateful Processing
  Arbitrary Stateful Processing with Structured Streaming
  Handling State Timeouts
  Arbitrary State Processing in Action
  Handling Duplicate Data
  Fault Tolerance
  Streaming Application Code Change
  Spark Runtime Change
  Streaming Query Metrics and Monitoring
  Streaming Query Metrics
  Monitoring Streaming Queries
  Summary

Chapter 8: Machine Learning with Spark
  Machine Learning Overview
  Machine Learning Terminologies
  Machine Learning Types
  Machine Learning Process
  Spark Machine Learning Library
  Machine Learning Pipelines
  Machine Learning Tasks in Action
  Classification
  Regression
  Recommendation
  Deep Learning Pipeline
  Summary

Index
