Applied Data Science

Ian Langmore

Daniel Krasner

Contents

I Programming Prerequisites

1 Unix
  1.1 History and Culture
  1.2 The Shell
  1.3 Streams
    1.3.1 Standard streams
    1.3.2 Pipes
  1.4 Text
  1.5 Philosophy
    1.5.1 In a nutshell
    1.5.2 More nuts and bolts
  1.6 End Notes

2 Version Control with Git
  2.1 Background
  2.2 What is Git
  2.3 Setting Up
  2.4 Online Materials
  2.5 Basic Git Concepts
  2.6 Common Git Workflows
    2.6.1 Linear Move from Working to Remote
    2.6.2 Discarding changes in your working copy
    2.6.3 Erasing changes
    2.6.4 Remotes
    2.6.5 Merge conflicts

3 Building a Data Cleaning Pipeline with Python
  3.1 Simple Shell Scripts
  3.2 Template for a Python CLI Utility

II The Classic Regression Models

4 Notation
  4.1 Notation for Structured Data

5 Linear Regression
  5.1 Introduction
  5.2 Coefficient Estimation: Bayesian Formulation
    5.2.1 Generic setup
    5.2.2 Ideal Gaussian World
  5.3 Coefficient Estimation: Optimization Formulation
    5.3.1 The least squares problem and the singular value decomposition
    5.3.2 Overfitting examples
    5.3.3 L2 regularization
    5.3.4 Choosing the regularization parameter
    5.3.5 Numerical techniques
  5.4 Variable Scaling and Transformations
    5.4.1 Simple variable scaling
    5.4.2 Linear transformations of variables
    5.4.3 Nonlinear transformations and segmentation
  5.5 Error Metrics
  5.6 End Notes

6 Logistic Regression
  6.1 Formulation
    6.1.1 Presenter's viewpoint
    6.1.2 Classical viewpoint
    6.1.3 Data generating viewpoint
  6.2 Determining the regression coefficient w
  6.3 Multinomial logistic regression
  6.4 Logistic regression for classification
  6.5 L1 regularization
  6.6 Numerical solution
    6.6.1 Gradient descent
    6.6.2 Newton's method
    6.6.3 Solving the L1 regularized problem
    6.6.4 Common numerical issues
  6.7 Model evaluation
  6.8 End Notes

7 Models Behaving Well
  7.1 End Notes

III Text Data

8 Processing Text
  8.1 A Quick Introduction
  8.2 Regular Expressions
    8.2.1 Basic Concepts
    8.2.2 Unix Command line and regular expressions
    8.2.3 Finite State Automata and PCRE
    8.2.4 Backreference
  8.3 Python RE Module
  8.4 The Python NLTK Library
    8.4.1 The NLTK Corpus and Some Fun things to do

IV Classification

9 Classification
  9.1 A Quick Introduction
  9.2 Naive Bayes
    9.2.1 Smoothing
  9.3 Measuring Accuracy
    9.3.1 Error metrics and ROC Curves
  9.4 Other classifiers
    9.4.1 Decision Trees
    9.4.2 Random Forest
    9.4.3 Out-of-bag classification
    9.4.4 Maximum Entropy

V Extras

10 High(er) performance Python
  10.1 Memory hierarchy
  10.2 Parallelism
  10.3 Practical performance in Python
    10.3.1 Profiling
    10.3.2 Standard Python rules of thumb
    10.3.3 For loops versus BLAS
    10.3.4 Multiprocessing Pools
    10.3.5 Multiprocessing example: Stream processing text files
    10.3.6 Numba
    10.3.7 Cython

What is data science? With the major technological advances of the last two decades, coupled with the internet explosion, a new breed of analyst has emerged. The exact role, background, and skill set of a data scientist are still being defined, and it is likely that by the time you read this some of what we say will seem archaic.

In very general terms, we view a data scientist as an individual who uses current computational techniques to analyze data. Now you might observe that there is nothing particularly novel in this, and subsequently ask what has forced the definition.[1] After all, statisticians, physicists, biologists, finance quants, etc. have been looking at data since their respective fields emerged. One short answer is that the data sphere has changed and, hence, a new set of skills is required to navigate it effectively. The exponential increase in computational power has provided new means to investigate the ever-growing amount of data being collected every second of the day. Any modern data analyst therefore has to make the time investment to learn the computational techniques necessary to deal with the volume and complexity of the data of today. In addition to those of mathematics and statistics, these software skills are transferable across domains, so it makes sense to create a job title that is also transferable. We could also point to the "data hype" created in industry as a culprit for the term data science, with the "science" creating an aura of validity and facilitating LinkedIn headhunting.

What skills are needed? One neat way we like to visualize the data science skill set is with Drew Conway's Venn diagram [Con]; see Figure 1. Math and statistics are what allow us to properly quantify a phenomenon observed in data. For the sake of narrative, let's take a complex, deterministic situation, such as whether or not someone will make a loan payment, and attempt to answer this question with a limited number of variables and an imperfect understanding of those variables' influence on the event we wish to predict. With the exception of your friendly real estate agent, we generally acknowledge our lack of soothsayer ability and instead make statements about the probability of this event. These statements take a mathematical form, for example

P[makes-loan-payment] = e^(α + β·credit score).

[1] William S. Cleveland decided to coin the term data science and wrote Data Science: An action plan for expanding the technical areas of the field of statistics [Cle]. His report outlined six points for a university to follow in developing a data analyst curriculum.


Figure 1: Drew Conway's Venn Diagram

The above quantifies the risk associated with this event. Deciding on the best coefficients α and β can be done quite easily by a host of software packages. In fact, anyone with decent hacking skills can achieve the goal. Of course, a simple model such as this would convince no one, and it calls for substantive expertise (more commonly called domain knowledge) to make real progress. In this case, a domain expert would note that additional variables, such as the loan-to-value ratio and a housing price index, are needed, as they have a huge effect on payment activity. These variables and many others would allow us to arrive at a "better" model

P[makes-loan-payment] = e^(α + β·X).    (1)

Finally, we have arrived at a model capable of fooling someone! We could keep adding variables until the model almost certainly fits the historical risk quite well. BUT, how do we know that this will allow us to quantify risk in the future? To make some sense of our uncertainty[2] about our model, we need to know exactly what (1) means. In particular, did we include too many variables and overfit? Did our method of solving (1) arrive at a good solution or just numerical noise? Most importantly, how appropriate is the logistic regression model to begin with? Answering these questions is often as much an art as a science, but in our experience, sufficient mathematical understanding is necessary to avoid getting lost.
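To make the coefficient-fitting step concrete, here is a minimal sketch (our own illustration, not code from the text) of estimating logistic regression coefficients by gradient descent, the method treated later in Section 6.6.1. The data, the "true" coefficients, and all variable names are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
credit_score = rng.normal(0.0, 1.0, size=n)        # standardized credit scores
X = np.column_stack([np.ones(n), credit_score])    # intercept column + feature
true_w = np.array([0.5, 1.5])                      # hypothetical "true" coefficients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulate payment outcomes from a logistic model: P[pay] = sigmoid(X @ w).
y = (rng.uniform(size=n) < sigmoid(X @ true_w)).astype(float)

# Gradient descent on the average negative log-likelihood.
w = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w -= 0.5 * grad

print(w)  # the estimates should land near true_w
```

With only one predictor this fit is easy; the questions raised above (too many variables, numerical noise, model appropriateness) become serious once X holds dozens of columns.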

[2] The distinction between uncertainty and risk has been discussed quite extensively by Nassim Taleb [Tal05, Tal10].
