Practical Web Scraping for Data Science

Best Practices and Examples with Python -- Seppe vanden Broucke, Bart Baesens





Seppe vanden Broucke, Leuven, Belgium

Bart Baesens, Leuven, Belgium

ISBN-13 (pbk): 978-1-4842-3581-2

ISBN-13 (electronic): 978-1-4842-3582-9

Library of Congress Control Number: 2018940455

Copyright © 2018 by Seppe vanden Broucke and Bart Baesens

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Todd Green
Development Editor: James Markham
Coordinating Editor: Jill Balzano

Cover designed by eStudioCalamar

Cover image designed by Freepik

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@, or visit . Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@, or visit rights-permissions.

Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at .

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at 9781484235812. For more detailed information, please visit .

Printed on acid-free paper



Dedicated to our partners, kids and parents.



Table of Contents

About the Authors
About the Technical Reviewer
Introduction

Part I: Web Scraping Basics

Chapter 1: Introduction
1.1 What Is Web Scraping?
1.1.1 Why Web Scraping for Data Science?
1.1.2 Who Is Using Web Scraping?
1.2 Getting Ready
1.2.1 Setting Up
1.2.2 A Quick Python Primer

Chapter 2: The Web Speaks HTTP
2.1 The Magic of Networking
2.2 The HyperText Transfer Protocol: HTTP
2.3 HTTP in Python: The Requests Library
2.4 Query Strings: URLs with Parameters

Chapter 3: Stirring the HTML and CSS Soup
3.1 Hypertext Markup Language: HTML
3.2 Using Your Browser as a Development Tool
3.3 Cascading Style Sheets: CSS
3.4 The Beautiful Soup Library
3.5 More on Beautiful Soup

Part II: Advanced Web Scraping

Chapter 4: Delving Deeper in HTTP
4.1 Working with Forms and POST Requests
4.2 Other HTTP Request Methods
4.3 More on Headers
4.4 Dealing with Cookies
4.5 Using Sessions with Requests
4.6 Binary, JSON, and Other Forms of Content

Chapter 5: Dealing with JavaScript
5.1 What Is JavaScript?
5.2 Scraping JavaScript
5.3 Scraping with Selenium
5.4 More on Selenium

Chapter 6: From Web Scraping to Web Crawling
6.1 What Is Web Crawling?
6.2 Web Crawling in Python
6.3 Storing Results in a Database

Part III: Managerial Concerns and Best Practices

Chapter 7: Managerial and Legal Concerns
7.1 The Data Science Process
7.2 Where Does Web Scraping Fit In?
7.3 Legal Concerns

Chapter 8: Closing Topics
8.1 Other Tools
8.1.1 Alternative Python Libraries
8.1.2 Scrapy
8.1.3 Caching
8.1.4 Proxy Servers
8.1.5 Scraping in Other Programming Languages
8.1.6 Command-Line Tools
8.1.7 Graphical Scraping Tools
8.2 Best Practices and Tips

Chapter 9: Examples
9.1 Scraping Hacker News
9.2 Using the Hacker News API
9.3 Quotes to Scrape
9.4 Books to Scrape
9.5 Scraping GitHub Stars
9.6 Scraping Mortgage Rates
9.7 Scraping and Visualizing IMDB Ratings
9.8 Scraping IATA Airline Information
9.9 Scraping and Analyzing Web Forum Interactions
9.10 Collecting and Clustering a Fashion Data Set
9.11 Sentiment Analysis of Scraped Amazon Reviews
9.12 Scraping and Analyzing News Articles
9.13 Scraping and Analyzing a Wikipedia Graph
9.14 Scraping and Visualizing a Board Members Graph
9.15 Breaking CAPTCHA's Using Deep Learning

Index



About the Authors

Seppe vanden Broucke is an Assistant Professor of Data and Process Science at the Faculty of Economics and Business, KU Leuven, Belgium. His research interests include business data mining and analytics, machine learning, process management, and process mining. His work has been published in well-known international journals and presented at top conferences. Seppe's teaching includes Advanced Analytics, Big Data, and Information Management courses. He also frequently teaches for industry and business audiences. Besides work, Seppe enjoys traveling, reading (Murakami to Bukowski to Asimov), listening to music (Booka Shade to Miles Davis to Claude Debussy), watching movies and series (less so these days due to a lack of time), gaming, and keeping up with the news.

Bart Baesens is a Professor of Big Data and Analytics at KU Leuven, Belgium, and a lecturer at the University of Southampton, United Kingdom. He has done extensive research on big data and analytics, credit risk modeling, fraud detection, and marketing analytics. Bart has written more than 200 scientific papers and several books. Besides enjoying time with his family, he is also a diehard Club Brugge soccer fan. Bart is a foodie and amateur cook. He loves drinking a good glass of wine (his favorites are white Viognier or red Cabernet Sauvignon) either in his wine cellar or when overlooking the authentic red English phone booth in his garden. Bart loves traveling and is fascinated by World War I and reads many books on the topic.
