SD Module- Python
Assignment No. 3
Title:
Write Python code that loads any data set (example: game_medal.csv) and plots a graph.
Objectives:
Understand the basics of data preprocessing; learn the Pandas basic plot function, matplotlib, Seaborn, etc.
Problem Definition:
Develop Python code that loads any data set (example: game_medal.csv) and plots a graph.
Outcomes:
1. Students will be able to demonstrate data preprocessing in Python.
2. Students will be able to plot graphs in Python using the Pandas plot function.
3. Students will be able to demonstrate the matplotlib and seaborn packages.
Hardware Requirement: Any CPU with Pentium Processor or similar, 256 MB RAM or more, 1 GB Hard Disk or more
Software Requirements: 32/64 bit Linux/Windows Operating System, R Studio
Theory:
Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.
Why preprocessing?
Real-world data are generally:
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
Tasks in data preprocessing:
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: using multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same or similar analytical results.
• Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
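The tasks above can be sketched with pandas. This is a minimal illustration on an invented five-row table (the column names country and medals and all values are assumptions made up for the example, not part of the assignment data); data integration and data reduction are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one inconsistent code, one duplicate row, one missing value
raw = pd.DataFrame({
    "country": ["USA", "usa", "CHN", "CHN", "FRA"],
    "medals":  [10.0, 4.0, 7.0, 7.0, np.nan],
})

# Data cleaning: resolve inconsistent codes, drop duplicate rows, fill missing values
clean = raw.copy()
clean["country"] = clean["country"].str.upper()
clean = clean.drop_duplicates().reset_index(drop=True)
clean["medals"] = clean["medals"].fillna(clean["medals"].mean())

# Data transformation: min-max normalization to the range [0, 1]
lo, hi = clean["medals"].min(), clean["medals"].max()
clean["medals_norm"] = (clean["medals"] - lo) / (hi - lo)

# Data transformation: aggregation of medals per country
totals = clean.groupby("country")["medals"].sum()

# Data discretization: replace the numeric attribute with nominal labels
clean["level"] = pd.cut(clean["medals"], bins=[0, 5, 8, 100],
                        labels=["low", "medium", "high"])

print(clean)
print(totals)
```

Filling a missing value with the column mean is only one possible choice; dropping the row or using the median may suit other data better.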
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
1. Plotting categorical scatter plots with Seaborn
# Plotting categorical scatter plots with Seaborn

# importing the required modules
import matplotlib.pyplot as plt
import seaborn as sns

# x axis values
x = ['sun', 'mon', 'fri', 'sat', 'tue', 'wed', 'thu']

# y axis values
y = [5, 6.7, 4, 6, 2, 4.9, 1.8]

# plotting strip plot with seaborn
# (x and y are passed by keyword; recent seaborn releases do not accept them positionally)
ax = sns.stripplot(x=x, y=y)

# giving labels to x-axis and y-axis
ax.set(xlabel='Days', ylabel='Amount_spend')

# giving title to the plot
plt.title('My first graph')

# function to show plot
plt.show()
[pic]
Explanation: This is one kind of scatter plot of categorical data, drawn with the help of seaborn.
• Categorical data is represented on the x-axis and the corresponding values on the y-axis.
• The .stripplot() function defines the type of the plot and draws it on the canvas.
• The .set() function is used to set the labels of the x-axis and y-axis.
• The .title() function is used to give a title to the graph.
• To view the plot we use the .show() function.
2. Stripplot using inbuilt data-set given in seaborn :
# importing the required module
import matplotlib.pyplot as plt
import seaborn as sns
# use to set style of background of plot
sns.set(style ="whitegrid")
# loading data-set
iris = sns.load_dataset('iris');
# plotting strip plot with seaborn
# deciding the attributes of dataset on which plot should be made
ax = sns.stripplot(x = 'species', y = 'sepal_length', data = iris);
# giving title to the plot
plt.title('Graph')
# function to show plot
plt.show()
[pic]
Explanation:
• iris is one of the example datasets that seaborn can load by name.
• We use the .load_dataset() function to load one of seaborn's example datasets. To plot data from our own file, we first read it into a DataFrame (for example with pandas.read_csv) and pass that DataFrame as data.
• .set(style="whitegrid") is used to define the background of the plot. We can use "darkgrid" instead of "whitegrid" if we want a dark-colored background.
• In the .stripplot() function we define which attribute of the dataset goes on the x-axis and which goes on the y-axis. data = iris means that these attributes are taken from the given dataset.
• We can also draw this plot with matplotlib alone, but then more has to be set by hand. Seaborn works well with DataFrames because, for example, column names are automatically used as axis labels: in the above figure the column name species appears on the x-axis and the column name sepal_length on the y-axis. With plain matplotlib we have to define the x-axis and y-axis labels explicitly.
3. Swarmplot using inbuilt data-set given in seaborn :
# importing the required module
import matplotlib.pyplot as plt
import seaborn as sns
# use to set style of background of plot
sns.set(style ="whitegrid")
# loading data-set
iris = sns.load_dataset('iris');
# plotting swarm plot with seaborn
# deciding the attributes of dataset on which plot should be made
ax = sns.swarmplot(x = 'species', y = 'sepal_length', data = iris);
# giving title to the plot
plt.title('Graph')
# function to show plot
plt.show()
Explanation:
This is very similar to the strip plot; the difference is that a swarm plot does not allow markers to overlap. It shifts the markers along the categorical axis so that every point stays visible and the graph can be read without information loss, as seen in the plot.
• We use the .swarmplot() function to draw a swarm plot.
• Another difference between Seaborn and Matplotlib is that working with DataFrames doesn't go quite as smoothly with Matplotlib, which can be annoying when we are doing exploratory analysis with Pandas. This is what Seaborn makes easy: its plotting functions operate on DataFrames and arrays that contain a whole dataset.
[pic]
4. We can also swap the axes, so that the categories are drawn on the y-axis and the values on the x-axis. For example:
# importing the required module
import matplotlib.pyplot as plt
import seaborn as sns
# use to set style of background of plot
sns.set(style ="whitegrid")
# loading data-set
iris = sns.load_dataset('iris');
# plotting swarm plot with seaborn
# deciding the attributes of dataset on which plot should be made
ax = sns.swarmplot(x = 'sepal_length', y = 'species', data = iris);
# giving title to the plot
plt.title('Graph')
# function to show plot
plt.show()
Explanation: The same can be done with stripplot. In summary, Seaborn is a library built on top of matplotlib that tries to make a well-defined set of hard things easy.
[pic]
Matplotlib is arguably the most popular graphing and data visualization library for Python.
1. Plotting a line
# importing the required module
import matplotlib.pyplot as plt
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]
# plotting the points
plt.plot(x, y)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
Following steps were followed:
• Define the x-axis and corresponding y-axis values as lists.
• Plot them on canvas using .plot() function.
• Give a name to x-axis and y-axis using .xlabel() and .ylabel() functions.
• Give a title to your plot using .title() function.
• Finally, to view your plot, we use .show() function.
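The same steps can also be written with matplotlib's object-oriented interface, where the figure and axes are held in variables instead of being implicit. The sketch below assumes a headless run (the Agg backend is selected so that no display window is needed) and writes the picture into an in-memory buffer rather than calling .show(); both choices are assumptions made for the example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display required
import matplotlib.pyplot as plt

# x axis values and corresponding y axis values
x = [1, 2, 3]
y = [2, 4, 1]

# create one figure containing one set of axes
fig, ax = plt.subplots()

# plot the points and name the axes and the graph
ax.plot(x, y)
ax.set_xlabel('x - axis')
ax.set_ylabel('y - axis')
ax.set_title('My first graph!')

# save the figure as PNG into memory (a file name would work as well)
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

Keeping explicit fig and ax objects becomes useful once a script draws more than one plot.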
[pic]
2. Plotting two or more lines on same plot
import matplotlib.pyplot as plt
# line 1 points
x1 = [1,2,3]
y1 = [2,4,1]
# plotting the line 1 points
plt.plot(x1, y1, label = "line 1")
# line 2 points
x2 = [1,2,3]
y2 = [4,1,3]
# plotting the line 2 points
plt.plot(x2, y2, label = "line 2")
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('Two lines on same graph!')
# show a legend on the plot
plt.legend()
# function to show the plot
plt.show()
• Here, we plot two lines on same graph. We differentiate between them by giving them a name(label) which is passed as an argument of .plot() function.
• The small rectangular box giving information about type of line and its color is called legend. We can add a legend to our plot using .legend() function.
[pic]
3. Customization of Plots
import matplotlib.pyplot as plt
# x axis values
x = [1,2,3,4,5,6]
# corresponding y axis values
y = [2,4,1,5,2,6]
# plotting the points
plt.plot(x, y, color='green', linestyle='dashed', linewidth = 3,
marker='o', markerfacecolor='blue', markersize=12)
# setting x and y axis range
plt.ylim(1,8)
plt.xlim(1,8)
# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('Some cool customizations!')
# function to show the plot
plt.show()
Here we have made several customizations:
• setting the line-width, line-style, line-color.
• setting the marker, marker’s face color, marker’s size.
• overriding the x and y axis range. If overriding is not done, pyplot module uses auto-scale feature to set the axis range and scale.
[pic]
4. Bar chart
import matplotlib.pyplot as plt
# x-coordinates of the bars (bar centres with matplotlib's default align='center')
left = [1, 2, 3, 4, 5]
# heights of bars
height = [10, 24, 36, 40, 5]
# labels for bars
tick_label = ['one', 'two', 'three', 'four', 'five']
# plotting a bar chart
plt.bar(left, height, tick_label = tick_label,
width = 0.8, color = ['red', 'green'])
# naming the x-axis
plt.xlabel('x - axis')
# naming the y-axis
plt.ylabel('y - axis')
# plot title
plt.title('My bar chart!')
# function to show the plot
plt.show()
• Here, we use the plt.bar() function to plot a bar chart.
• The x-coordinates of the bars are passed along with the heights of the bars. By default each x-coordinate is the centre of its bar.
• You can also give names to the x-axis coordinates through the tick_label argument.
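The objectives also mention the Pandas plot function. A minimal sketch of the same bar chart drawn through pandas follows; it reuses the heights and labels from the example above, while the variable name heights, the single bar colour and the Agg backend are assumptions made for the illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display required
import pandas as pd

# the index supplies the bar labels, the values supply the bar heights
heights = pd.Series([10, 24, 36, 40, 5],
                    index=['one', 'two', 'three', 'four', 'five'])

# Series.plot.bar draws with matplotlib and returns the matplotlib Axes
ax = heights.plot.bar(width=0.8, color='green')
ax.set_xlabel('x - axis')
ax.set_ylabel('y - axis')
ax.set_title('My bar chart!')
```

Because the returned object is an ordinary matplotlib Axes, all the labelling functions shown earlier still apply.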
[pic]
5. Histogram
import matplotlib.pyplot as plt
# ages of the people surveyed (raw observations)
ages = [2,5,70,40,30,45,50,45,43,40,44,
        60,7,13,57,18,90,77,32,21,20,40]
# setting the range and no. of intervals
# (named value_range so that the built-in range() is not shadowed)
value_range = (0, 100)
bins = 10
# plotting a histogram
plt.hist(ages, bins=bins, range=value_range, color = 'green',
         histtype = 'bar', rwidth = 0.8)
# x-axis label
plt.xlabel('age')
# frequency label
plt.ylabel('No. of people')
# plot title
plt.title('My histogram')
# function to show the plot
plt.show()
• Here, we use the plt.hist() function to plot a histogram.
• The raw observations are passed as the ages list; the function computes the frequencies itself.
• The range can be set by defining a tuple containing the min and max value.
• The next step is to "bin" the range of values, that is, divide the entire range into a series of intervals and count how many values fall into each interval. Here we have defined bins = 10 over the range 0 to 100, so there are 10 intervals, each 100/10 = 10 wide.
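The binning step can be checked without drawing anything. The sketch below uses numpy.histogram on the same ages list to show the interval edges and the count in each interval; using NumPy here is an addition for illustration only.

```python
import numpy as np

ages = [2, 5, 70, 40, 30, 45, 50, 45, 43, 40, 44,
        60, 7, 13, 57, 18, 90, 77, 32, 21, 20, 40]

# 10 equal-width intervals between 0 and 100
counts, edges = np.histogram(ages, bins=10, range=(0, 100))

print(edges)   # 11 edges: 0, 10, 20, ..., 100
print(counts)  # [3 2 2 2 7 2 1 2 0 1]
```

Each interval includes its left edge and excludes its right edge, except the last one, which includes both; the counts add up to the 22 observations.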
[pic]
6. Scatter plot
import matplotlib.pyplot as plt
# x-axis values
x = [1,2,3,4,5,6,7,8,9,10]
# y-axis values
y = [2,4,5,7,6,8,9,11,12,12]
# plotting points as a scatter plot
plt.scatter(x, y, label= "stars", color= "green",
marker= "*", s=30)
# x-axis label
plt.xlabel('x - axis')
# y-axis label
plt.ylabel('y - axis')
# plot title
plt.title('My scatter plot!')
# showing legend
plt.legend()
# function to show the plot
plt.show()
• Here, we use plt.scatter() function to plot a scatter plot.
• Like a line, we define x and corresponding y – axis values here as well.
• marker argument is used to set the character to use as marker. Its size can be defined using s parameter.
[pic]
7. Pie Chart
import matplotlib.pyplot as plt
# defining labels
activities = ['eat', 'sleep', 'work', 'play']
# portion covered by each label
slices = [3, 7, 8, 6]
# color for each label
colors = ['r', 'y', 'g', 'b']
# plotting the pie chart
plt.pie(slices, labels = activities, colors=colors,
startangle=90, shadow = True, explode = (0, 0, 0.1, 0),
radius = 1.2, autopct = '%1.1f%%')
# plotting legend
plt.legend()
# showing the plot
plt.show()
• Here, we plot a pie chart by using plt.pie() method.
• First of all, we define the labels using a list called activities.
• Then, portion of each label can be defined using another list called slices.
• Color for each label is defined using a list called colors.
• shadow = True draws a shadow beneath the pie chart.
• startangle rotates the start of the pie chart by given degrees counterclockwise from the x-axis.
• explode is used to set the fraction of radius with which we offset each wedge.
• autopct is used to format the value shown on each wedge. Here, we have set it to show the percentage value to 1 decimal place.
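To see what the autopct format string produces, the percentages can be worked out by hand. This sketch assumes that each wedge's percentage is its share of the total and applies the same format string '%1.1f%%' to the slices from the example.

```python
activities = ['eat', 'sleep', 'work', 'play']
slices = [3, 7, 8, 6]

total = sum(slices)  # 24

# percentage of each slice, formatted like autopct = '%1.1f%%'
percent_labels = ['%1.1f%%' % (100 * s / total) for s in slices]

for name, text in zip(activities, percent_labels):
    print(name, text)
# eat 12.5%
# sleep 29.2%
# work 33.3%
# play 25.0%
```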
[pic]
8. Plotting curves of given equation
# importing the required modules
import matplotlib.pyplot as plt
import numpy as np
# setting the x - coordinates
x = np.arange(0, 2*(np.pi), 0.1)
# setting the corresponding y - coordinates
y = np.sin(x)
# plotting the points
plt.plot(x, y)
# function to show the plot
plt.show()
Here, we use NumPy, which is a general-purpose array-processing package in Python.
• To set the x – axis values, we use np.arange() method in which first two arguments are for range and third one for step-wise increment. The result is a numpy array.
• To get corresponding y-axis values, we simply use predefined np.sin() method on the numpy array.
• Finally, we plot the points by passing x and y arrays to the plt.plot() function.
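The x values produced by np.arange() can be inspected directly. The sketch below also shows np.linspace(), an alternative not used in the original example, which takes the number of points instead of the step and includes the end value.

```python
import numpy as np

# start 0, stop 2*pi (excluded), step 0.1
x = np.arange(0, 2 * np.pi, 0.1)
y = np.sin(x)

print(len(x))  # 63 points: 0.0, 0.1, ..., 6.2
print(x[-1])   # about 6.2, the last multiple of 0.1 below 2*pi

# alternative: 100 evenly spaced points, end value 2*pi included
x_lin = np.linspace(0, 2 * np.pi, 100)
y_lin = np.sin(x_lin)
```

With np.arange() the curve stops slightly short of 2*pi; np.linspace() is convenient when the end point must be part of the plot.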
[pic]
Mini Project-1
Develop Python Code that loads any data set (example – game_medal.csv) & plot the graph.
The data used was provided by The Guardian at Kaggle: Olympic Sports and Medals, 1896-2014. The first step will be to see the form of the data and manipulate it into a suitable format: rows as countries, columns as olympic games, values as medal counts.
Download Link-
Description of Data Sets
Which Olympic athletes have the most gold medals? Which countries are they from and how has it changed over time?
More than 35,000 medals have been awarded at the Olympics since 1896. The first two Olympiads awarded silver medals and an olive wreath for the winner, and the IOC retrospectively awarded gold, silver, and bronze to athletes based on their rankings. This dataset includes a row for every Olympic athlete that has won a medal since the first games.
Data was provided by the IOC Research and Reference Service and published by The Guardian's Datablog.
The Olympic Games.zip folder contains 3 different datasets:
• Dictionary.csv
• Summer.csv
• Winter.csv
Following Figure shows description of dictionary.csv
[pic]
Following Figure shows description of summer.csv
[pic]
Following Figure shows description of winter.csv
[pic]
Here we work on the summer.csv dataset.
Dataset Download Link-
[pic]
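Loading the file and drawing a first graph can be sketched as follows. Since the real file is not reproduced here, the example reads a four-row stand-in from memory; the rows, the athlete names and the exact column names are invented assumptions, and with the downloaded data the call would be pd.read_csv("summer.csv") with the path adjusted to where the file was saved. The Agg backend is assumed so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display required
import pandas as pd

# Invented stand-in for the real file; replace with pd.read_csv("summer.csv")
csv_text = """Year,City,Athlete,Country,Gender,Event,Medal
1896,Athens,A. Example,HUN,Men,100M Freestyle,Gold
1896,Athens,B. Example,AUT,Men,100M Freestyle,Silver
1900,Paris,C. Example,FRA,Men,Sabre Individual,Gold
1900,Paris,D. Example,FRA,Men,Sabre Individual,Silver
"""
medals = pd.read_csv(io.StringIO(csv_text))

# first look at the data: leading rows and column types
print(medals.head())
print(medals.dtypes)

# number of medal rows per country, drawn with the Pandas plot function
per_country = medals["Country"].value_counts()
ax = per_country.plot.bar()
ax.set_ylabel("Number of medals")
```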
Complete Code-
Stream graph: a stream graph is a visually appealing and story-rich way of presenting frequency data in multiple categories across a time-like dimension.
[pic]
Here we see that each entry is an Athlete representing a Country, of a given Gender, who won a Medal in some Event in the Olympics in a City in a particular Year. For team-based sports, multiple individuals can receive medals, but we'll want to count these medals only once.
Then using a groupby on Country and Year, if we count the Medals and unstack the result, we end up with a dataframe in the desired format.
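A minimal sketch of this step on invented rows follows. The column names are taken from the description above, the data and the name medal_table are made up for illustration, and treating rows that agree on Year, Country, Gender, Event and Medal as one team medal is an assumption about how the de-duplication could be done.

```python
import pandas as pd

# Invented rows: two athletes share one team gold medal
rows = [
    (2012, "USA", "Men", "Basketball", "Gold", "Player One"),
    (2012, "USA", "Men", "Basketball", "Gold", "Player Two"),
    (2012, "USA", "Women", "100M", "Silver", "Runner A"),
    (2012, "CHN", "Men", "Table Tennis Singles", "Gold", "Player Three"),
    (2008, "CHN", "Women", "Diving 10M", "Gold", "Diver A"),
]
df = pd.DataFrame(rows, columns=["Year", "Country", "Gender",
                                 "Event", "Medal", "Athlete"])

# count a team medal once: keep one row per (Year, Country, Gender, Event, Medal)
unique_medals = df.drop_duplicates(
    subset=["Year", "Country", "Gender", "Event", "Medal"])

# rows as countries, columns as Olympic years, values as medal counts
medal_table = (unique_medals.groupby(["Country", "Year"])["Medal"]
               .count()
               .unstack(fill_value=0))
print(medal_table)
```

Note that this rule would also merge two individual medals of the same colour awarded to one country in the same event, so it is an approximation.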
[pic]
Now, the New York Times (NYT) graph only includes eight named countries (the rest are grouped by continent). So we'll want to identify these countries in the list, based on their IOC country codes. There's some interesting trivia in which countries, regions and groups are included, excluded, merged or divided over time. At this point we can ignore the rest of the data and just focus on these categories.
countries = [
"USA", # United States of America
"CHN", # China
"RU1", "URS", "EUN", "RUS", # Russian Empire, USSR, Unified Team (post-Soviet collapse), Russia
"GDR", "FRG", "EUA", "GER", # East Germany, West Germany, Unified Team of Germany, Germany
"GBR", "AUS", "ANZ", # Great Britain, Australia, Australasia (includes New Zealand)
"FRA", # France
"ITA" # Italy
]
sm = summer.loc[countries]
sm.loc["Rest of world"] = summer.loc[summer.index.difference(countries)].sum()
sm = sm[::-1]
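The selection and the "Rest of world" row can be tried out on a small invented table. In this sketch the names table, selected and sm_toy and all values are hypothetical; codes that do not occur in the table are skipped before the lookup, because .loc raises a KeyError for missing labels in current pandas, and .copy() is used so that the new row is added to an independent DataFrame.

```python
import pandas as pd

# Invented medal table: rows are country codes, columns are years
table = pd.DataFrame({2008: [5, 3, 2, 1], 2012: [6, 4, 0, 2]},
                     index=["USA", "CHN", "KEN", "BRA"])

# "URS" is deliberately absent from the toy table
selected = ["USA", "CHN", "URS"]

# keep only the selected codes that actually occur in the table
present = [c for c in selected if c in table.index]
sm_toy = table.loc[present].copy()

# everything not selected is summed into one extra row
sm_toy.loc["Rest of world"] = table.loc[table.index.difference(selected)].sum()
print(sm_toy)
```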
Before any plotting, let's define colours similar to those in the NYT graph. For simplicity, I'll be using the named colours in matplotlib.
country_colors = {
"USA":"steelblue",
"CHN":"sandybrown",
"RU1":"lightcoral", "URS":"indianred", "EUN":"indianred", "RUS":"lightcoral",
"GDR":"yellowgreen", "FRG":"y", "EUA":"y", "GER":"y",
"GBR":"silver",
"AUS":"darkorchid", "ANZ":"darkorchid",
"FRA":"silver",
"ITA":"silver",
"Rest of world": "gainsboro"}
Let's present this data as a stacked bar plot. This will show: i) the total number of medals won (total height) and ii) compare the relative number of medals countries won in different years.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set_style("ticks")
sns.set_context("notebook", font_scale=1.2)
colors = [country_colors[c] for c in sm.index]
plt.figure(figsize=(12,8))
sm.T.plot.bar(stacked=True, color=colors, ax=plt.gca())
# Reverse the order of labels, so they match the data
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles[::-1], labels[::-1])
# Set labels and remove superfluous plot elements
plt.ylabel("Number of medals")
plt.title("Stacked barchart of select countries' medals at the Summer Olympics")
sns.despine()
[pic]
This plot is quite different from the desired graph. First, the bars don't have any continuity, which we'll achieve by using the plot.area method of DataFrames. Second, there are no entries for the years in which the Games were not held because of the World Wars (1916, 1940 and 1944), so we add those years as columns of missing values.
sm[1916] = np.nan # WW1
sm[1940] = np.nan # WW2
sm[1944] = np.nan # WW2
sm = sm[sm.columns.sort_values()]
plt.figure(figsize=(12,8))
sm.T.plot.area(color=colors, ax=plt.gca(), alpha=0.5)
# Reverse the order of labels, so they match the data
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend(handles[::-1], labels[::-1])
# Set labels and remove superfluous plot elements
plt.ylabel("Number of medals")
plt.title("Stacked areachart of select countries' medals at the Summer Olympics")
plt.xticks(sm.columns, rotation=90)
sns.despine()
[pic]
This is looking much better. There are two features we are missing: i) this plot has a baseline (i.e. the bottom of the chart) set at zero, whereas we want the baseline to wiggle about ii) the transitions between times are jagged.
To fix the baseline, instead of using pandas's plot.area method, we use the stackplot function from matplotlib. Here, we show what the different baselines look like.
for bl in ["zero", "sym", "wiggle", "weighted_wiggle"]:
    plt.figure(figsize=(6, 4))
    f = plt.stackplot(sm.columns, sm.fillna(0), colors=colors, baseline=bl, alpha=0.5, linewidth=1)
    # Make the edge of each layer slightly darker than its fill
    for i, a in enumerate(f):
        a.set_edgecolor(sns.dark_palette(colors[i])[-2])
    plt.title("Baseline: {}".format(bl))
    plt.axis('off')
    plt.show()
[pic][pic]
Conclusion/Analysis: Hence we are able to draw various plots using the seaborn, matplotlib and pandas packages on a suitable dataset.
Assignment Questions:
1. What is pandas?
2. What is matplotlib?
3. What is Seaborn?
4. What is a DataFrame?
5. What is the syntax for reading a CSV file in Python?
6. What is NumPy?
7. How do you drop duplicate pairs?
Oral Questions:
1. What do you mean by a histogram?
2. What do you mean by a scatter plot?
3. What do you mean by a pie chart?
4. What do you mean by a bar chart?
5. What do you mean by a heatmap?
---------------------------------------------------------------------
| W (4) | C (4) | D (4) | V (4) | T (4) | Total Marks with Sign |
|       |       |       |       |       |                       |
---------------------------------------------------------------------
SNJB’S K.B.J. COLLEGE OF ENGINEERING, CHANDWAD