DATA ANALYSIS and VISUALIZATION
is hard facts. These are units of information, often numeric, that are collected throughobservation.
The “data gets broadly divided into 2 kinds:
Qualitative or Categorical data
Quantitative or Numerical data
comprises categories like
on a BMI scale ascategories.
Similarly, it could be the flavor of ice cream. Categories could be
on the other hand is your numeric data like your actual
, volumeof your
is the process of evaluating data by applying statistical and/or logical techniques.
data cleaning/data wrangling
to improve data quality.
Getting data into the right format, getting rid of unnecessary data, correcting spellingmistakes, etc.
data using tools like Excel or Python etc. This may include plotting the dataout, creating pivot tables, and so on.
the data using statistical tools (i.e., finding correlations, trends,outliers, etc.).
is the representation of data or information in a graph, chart, or other visualformats.
Understanding of data in the raw form or tables would consume a lot of time and effortof stakeholders, readers, and users. Apart from this, it would be extremely difficult to interpretthe main message in that form.
Instead, if we choose a graphical representation, we would use charts, graphs and infographicsto express those messages, trends.
we are easily attracted towards graphical representation of the table. On top of that, it is easy tointerpret as well.
Understanding the Dataset
The dataset consists of 3 columns or
represents the balls of the over Rohit Sharma has faced in his entire IPL career.
accounts for the runs attributed to batsman for that particular ball.
provides the name of player who was dismissed on a particular ball.
Understanding the Approach
Understanding the most vulnerable ball for Rohit Sharma: Here, we will take into accountthe number of times the player himself was dismissed on any given ball of the over. Thecolumns we will use are:
with entries specific to
the data on
of the over and plot a
graph to interpret themost number of dismissals for a particular ball of over
1. Understanding the most productive ball for Rohit Sharma: Here, we will consider the total
runs scored by the player on any given ball of the over. The columns we will use are:
A. batsman_runs .
B. ball .
We will group the data on ball of the over and plot a bar graph to interpret the
most runs scored on a particular ball of over
Packages are imported in following manner.
In the next cell we have imported the following packages.
1. pandas . It is the most common library used by data scientists for data manipulation and
2. numpy . It adds support for arrays, along with a collection of mathematical functions to
operate on these arrays.
3. matplotlib . It is a plotting library for python. .pyplot is a sub-package or set of
functions available in matplotlib which we’ll be using
pd , np , plt are all aliases for their corresponding packages.
Alias are second name
assigned to values or variables.
%matplotlib inline is a “magic function” renders plots
Loading the Dataset
In the cell below, we have created a new pandas DataFrame by the name df and imported the
We have used .head() function to see the first 5 values of the dataset we created.
.head() can show up any number of values based on the parameter given.
If we want to see more, we can pass value in the function like df.head(10) will show first 10
values of the dataset
ball batsman_runs player_dismissed
0 2 0 NaN
1 3 0 NaN
2 4 0 RG Sharma
3 3 1 NaN
In : import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In : df = pd.read_csv(“https://raw.githubusercontent.com/jainharshit27/datasets/main/Rohi
1. Understanding the most vulnerable ball for Rohit Sharma
Reducing the dataset to our need
In the cell below, we have created a new pandas DataFrame by the name df_Rohit and
assigned it a filtered version of dataframe df such that only those observations are accepted
which have player_dismissed value as RG Sharma . This can be done like:
df[df[“player_dismissed”] == “RG Sharma”]
Here, df[“player_dismissed”] == “RG Sharma” , this value will mark observation True
wherever it is.
Passing that value through df will filter out the False values.
Then, we have grouped data using .grouby() function using various values of ball
feature/column. The groupby() function is then followed by .count() to summarize values
for other numerical columns in the dataframe. The resulting dataframe is then assigned to
dataframe df_Rohit_dismissed .
df_Rohit = df[df[“player_dismissed”] == “RG Sharma”]
df_Rohit_dismissed = df_Rohit.groupby(“ball”).count()