Dimensionless.in



Article Format

|Meta Title: |Python Versus R (Python vs. R) Choice of a Data Science Language |

|(Max. 55 characters) | |

|Meta Description: |Data scientists often have to chose between Python and R. This article explains how and why this |

|(Max. 160 characters) |choice should be made and the best choice for a professional data scientist. It also presents two|

| |contrasting source code samples, one in Python and one in R. |

|Focus Keyword: |Python, R, Data Science, Programming, Data Scientist |

|(ie: Short Stories) | |

Python Versus R

Python and R are the two most commonly used languages for Data Science today. They are both fully open source products and completely free to use and modify as required under the GNU public license.

But which one is better? And, more importantly, which one should you learn?

Both are widely used and are standard tools in the hands of every data scientist.

The answer may surprise you – because as a professional data scientist, you should be ready to deal with both.

Python has certain use cases and so does R. The scenarios in which they are used vary. It is more often the environment and the needs of the client or your employer which dictates the choice between Python and R. Many things are easier in Python. But R also has its place in your development toolkit.

Python

Python is a general-purpose programming language released by Guido Van Rossum in 1991.

Since then, Python has been used in multiple environments for multiple purposes, including, but not limited to:

● Web Development (Django)

● Web Microservices (Flask)

● Zappa Serverless Framework for Python

● TensorFlow (Deep Learning Machine Learning Models)

● Keras (High-Level Abstractions to Simplify TensorFlow Development)

● Popular apps built in Python include Dropbox, BitTorrent, Morpheus, Calibre, Blender, and Mercurial – among many, many others.

Python has more appeal for software engineers. This is mainly because industry use, production-ready code can be usually written in Python. If you have the background of a software engineer or already know programming, Python is a better choice for you (especially if you’re a beginner).

Another situation where Python shines is the sheer number pre-existing libraries that are readily available and open sourced to use. The large number of packages available in the PyPI (short for Python Package Index) repository with over 121k packages that automate many programming tasks at various levels of abstraction, making life easy for the programmer. At least 6k out of the packages on PyPI are focused on data science. Python also excels in readability. Compared to R, Python is much easier to read and to understand. Python is faster than R, in some cases dramatically faster.

R

R is a statisticians programming language designed for statisticians by statisticians. It originated in the ‘90s through George Ross Ihaka and Robert Gentleman. R excels in academic use and in the hands of a statistician. People who have formal training in Statistics, such as a Statistics degree, find working with R extremely simple. The repository for R packages or libraries, called CRAN (Comprehensive R Archive Network) contains nearly 12k packages, roughly half of which are for data science. R also excels at data visualization. Analyzing data on a one-time basis is often simpler and more easily expressible in R.

Also, once upon a time, using Python meant linking many libraries together, some of which would become incompatible after feature revisions and library updates. That is no longer true because of Anaconda – see below. For a short time, deep learning was strictly a Python feature – which shifted the balance of the machine learning world towards Python, for a short time. However, with the release of Keras for TensorFlow in R, that factor changed as well, and deep learning models could now be used in R.

So, what is the answer? Which one should you use?

The answer – both.

Jupyter Notebook – Integrating Python and R

The Anaconda distribution from Continuum Analytics has completely disrupted the machine learning picture. Anaconda supports the standard libraries required for Python and machine learning – NumPy, SciPy, Pandas, SymPy, Seaborn, Matplotlib – as well as full support for R with an outstanding IDE called R Studio.

For Deep Learning it supports TensorFlow, Theano, Caffe, Scikit-Learn, and Torch. One of its most remarkable features is the introduction of the Jupyter Notebook, an integrated platform which supported the use of Python and R in the same environment, while keeping everything open source.

Another option is the Hydrogen plug-in for the Atom text editor. It allows you to enter any code that you can use in a Jupyter Notebook and returns the result in the editor. However, it is still in alpha and crashed with an error on my local machine. The Jupyter Lab application allows Python and R notebook editing in the same environment, using the concept of separate and even remote kernels.

As the machine learning field progresses, one can expect more plugins like Hydrogen (which I can’t wait to test once it’s out of alpha) produced in the very near future. So, Python excels in machine learning, while R excels in statistics.

But why should you learn both?

Because a professional data scientist needs to understand statistics and the mathematics behind the machine learning algorithms in great detail.

We shall examine two SVM machine learning models, one through Python code, and then another through R code. This will give us a good picture of how both languages work.

Python Code

This code performs binary classification using non-linear support vector machine using a Gaussian kernel. The target to predict is a XOR of the inputs.

The color map illustrates the decision function learned by the Support Vector Machine Classifier. (SVC)

print(__doc__)

import numpy as np

import matplotlib.pyplot as plt

from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-3, 3, 500),

np.linspace(-3, 3, 500))

np.random.seed(0)

X = np.random.randn(300, 2)

Y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

# fit the model

clf = svm.NuSVC()

clf.fit(X, Y)

# plot the decision function for each datapoint on the grid

Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

plt.imshow(Z, interpolation='nearest',

extent=(xx.min(), xx.max(), yy.min(), yy.max()), aspect='auto',

origin='lower', cmap=plt.cm.PuOr_r)

contours = plt.contour(xx, yy, Z, levels=[0], linewidths=2,

linetypes='--')

plt.scatter(X[:, 0], X[:, 1], s=30, c=Y, cmap=plt.cm.Paired,

edgecolors='k')

plt.xticks(())

plt.yticks(())

plt.axis([-3, 3, -3, 3])

plt.show()

.Output:

[pic]

References:



R Code

This program uses the iris dataset to illustrate the use of a non-linear SVM classifier. This code is deliberately a little more complex since it applies ML techniques to a full-fledged built in dataset - the iris dataset - one of the canonical data sets used to illustrate the capacities of the ML techniques traditionally. THis code also illustrates the usage of the built-in statistical functions of R.

You will need to install R package e1071 and add it to the compile list by calling library(e1071) before executing the code below. But don’t worry - installing new packages in R Studio is ridiculously simple.

# NOT RUN {

data(iris)

attach(iris)

## classification mode

# default with factor response:

library(e1071)

#loads the svm library into the compile path

model ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download