
Nov 17, 2021 | Uncategorized

DATA ANALYSIS and VISUALIZATION
Data
What is Data?
Data is hard facts. These are units of information, often numeric, that are collected through observation.
Data is broadly divided into 2 kinds:
1. Qualitative or Categorical data
2. Quantitative or Numerical data
Categorical data comprises categories such as overweight, underweight, or normal on a BMI scale.
Similarly, it could be the flavor of ice cream. Categories could be mango, vanilla, chocolate, pineapple, orange, etc.
Numerical data, on the other hand, is numeric data such as your actual weight, height, or the volume of your ice cream.
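To make the two kinds of data concrete, here is a minimal sketch in pandas. The table, the column names flavor and volume_ml, and all the values are invented for this illustration and are not part of the dataset used later in this post.

```python
import pandas as pd

# Hypothetical ice cream records; names and values are invented for illustration.
scoops = pd.DataFrame({
    "flavor": ["mango", "vanilla", "chocolate", "vanilla"],  # categorical
    "volume_ml": [120.0, 95.5, 110.0, 100.0],                # numerical
})

# Mark the flavor column explicitly as categorical data.
scoops["flavor"] = scoops["flavor"].astype("category")

# Categorical data is usually summarized by counting categories...
flavor_counts = scoops["flavor"].value_counts()

# ...while numerical data supports arithmetic such as a mean.
mean_volume = scoops["volume_ml"].mean()

print(scoops.dtypes)
print(flavor_counts)
print(mean_volume)
```

Counting makes sense for flavors, while averaging only makes sense for the volume column. That difference is a quick way to tell the two kinds of data apart.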
Data Analysis
What is Data Analysis?
Data Analysis is the process of evaluating data by applying statistical and/or logical techniques.
Data Analysis involves:
1. Performing data cleaning/data wrangling to improve data quality.
2. Getting data into the right format, getting rid of unnecessary data, correcting spelling mistakes, etc.
3. Manipulating data using tools like Excel or Python. This may include plotting the data, creating pivot tables, and so on.
4. Analyzing and interpreting the data using statistical tools (e.g., finding correlations, trends, and outliers).
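The steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the raw table, its city and sales columns, and every value in it are made up, and a grouped total stands in for a pivot table.

```python
import pandas as pd

# Hypothetical raw records with typical quality problems (invented values).
raw = pd.DataFrame({
    "city": ["Pune", "pune ", "Delhi", "Delhi", None],
    "sales": ["10", "20", "30", "bad", "50"],
})

# Steps 1-2. Cleaning: fix inconsistent spelling, convert types, drop unusable rows.
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()
clean["sales"] = pd.to_numeric(clean["sales"], errors="coerce")  # "bad" becomes NaN
clean = clean.dropna()

# Step 3. Manipulating: total sales per city (similar in spirit to a pivot table).
sales_by_city = clean.groupby("city")["sales"].sum()

# Step 4. Analyzing: a simple summary statistic.
average_sale = clean["sales"].mean()

print(sales_by_city)
print(average_sale)
```

Two of the five rows are dropped during cleaning, one for an unreadable sales figure and one for a missing city, which is why cleaning comes before any analysis.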
Data Visualization
What is Data Visualization?
Data Visualization is the representation of data or information in a graph, chart, or other visual format.
Understanding data in raw form or in tables would consume a lot of time and effort from stakeholders, readers, and users. Apart from this, it would be extremely difficult to interpret the main message in that form.
Instead, if we choose a graphical representation, we can use charts, graphs, and infographics to express those messages and trends.

https://peltiertech.com/images/img200811/pt_col_timebrand_comp.png
We are easily drawn to the graphical representation of a table. On top of that, it is easy to interpret as well.

 

Understanding the Dataset
The dataset consists of 3 columns or features, namely ball, batsman_runs and player_dismissed.
1. ball represents the ball of the over for each delivery Rohit Sharma has faced in his entire IPL career.
2. batsman_runs accounts for the runs attributed to the batsman for that particular ball.
3. player_dismissed provides the name of the player who was dismissed on a particular ball.
Understanding the Approach
1. Understanding the most vulnerable ball for Rohit Sharma: Here, we will take into account the number of times the player himself was dismissed on any given ball of the over. The columns we will use are:
A. player_dismissed with entries specific to RG Sharma
B. ball
We will group the data on ball of the over and plot a bar graph to interpret the most number of dismissals for a particular ball of the over.

2. Understanding the most productive ball for Rohit Sharma: Here, we will consider the total runs scored by the player on any given ball of the over. The columns we will use are:
A. batsman_runs
B. ball
We will group the data on ball of the over and plot a bar graph to interpret the most runs scored on a particular ball of the over.
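As a rough sketch of the second approach, the snippet below groups a handful of made-up deliveries by ball and sums the runs. The toy DataFrame and its values are invented for illustration. Only the three column names come from the dataset described above, and pandas itself is introduced in the next section.

```python
import pandas as pd

# Made-up deliveries with the same three columns as the dataset (values invented).
toy = pd.DataFrame({
    "ball": [1, 2, 3, 1, 2, 3, 4],
    "batsman_runs": [0, 4, 1, 6, 0, 2, 1],
    "player_dismissed": [None, None, None, None, "RG Sharma", None, None],
})

# Approach 2: total runs scored on each ball of the over.
runs_per_ball = toy.groupby("ball")["batsman_runs"].sum()

# The ball with the highest total is the "most productive" one.
most_productive_ball = runs_per_ball.idxmax()

print(runs_per_ball)
print(most_productive_ball)
```

Calling runs_per_ball.plot(kind="bar") on the result would then draw the bar graph mentioned above.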

Importing Packages
Packages are imported in the following manner.
import package_name
In the next cell we have imported the following packages.
1. pandas. It is one of the most common libraries used by data scientists for data manipulation and cleaning.
2. numpy. It adds support for arrays, along with a collection of mathematical functions to operate on these arrays.
3. matplotlib. It is a plotting library for Python. pyplot is a submodule of matplotlib that provides the plotting functions we'll be using.
pd, np, plt are all aliases for their corresponding packages. An alias is a second name assigned to a value or variable; here, pd stands for pandas, np for numpy and plt for matplotlib.pyplot.
%matplotlib inline is a "magic function" that renders plots inline in the notebook, directly below the cell that creates them.

Loading the Dataset
In the cell below, we have created a new pandas DataFrame by the name df and loaded the mentioned file into it.
We have used the .head() function to see the first 5 rows of the dataset we created.
.head() can show any number of rows based on the parameter given.
If we want to see more, we can pass a value to the function: df.head(10) will show the first 10 rows of the dataset.
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]: df = pd.read_csv("https://raw.githubusercontent.com/jainharshit27/datasets/main/Rohi
df.head()
   ball  batsman_runs player_dismissed
0     2             0              NaN
1     3             0              NaN
2     4             0        RG Sharma
3     3             1              NaN
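To see how the parameter of .head() changes the result, here is a small sketch on a made-up DataFrame. The name numbers and its single column n are hypothetical and are used only because they do not require downloading the real file.

```python
import pandas as pd

# Hypothetical 12-row frame, only to show how head() behaves.
numbers = pd.DataFrame({"n": range(12)})

first_five = numbers.head()    # default: first 5 rows
first_ten = numbers.head(10)   # first 10 rows

print(len(first_five), len(first_ten))
```

The same calls work on df once the dataset has been loaded.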

1. Understanding the most vulnerable ball for Rohit Sharma
Reducing the dataset to our need
In the cell below, we have created a new pandas DataFrame by the name df_Rohit and assigned it a filtered version of dataframe df such that only those observations are kept which have the player_dismissed value RG Sharma. This can be done like:
df[df["player_dismissed"] == "RG Sharma"]
Here, df["player_dismissed"] == "RG Sharma" marks an observation True wherever player_dismissed is RG Sharma and False everywhere else.
Passing that result through df[] filters out the rows marked False.
Then, we have grouped the data with the .groupby() function using the values of the ball feature/column. The groupby() function is followed by .count(), which counts the non-missing values in each of the other columns for every group. The resulting dataframe is assigned to df_Rohit_dismissed.

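# Keep only the balls on which RG Sharma himself was dismissed
df_Rohit = df[df["player_dismissed"] == "RG Sharma"]
# Count the dismissals for each ball of the over
df_Rohit_dismissed = df_Rohit.groupby("ball").count()
df_Rohit_dismissed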
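The same filter, group and count steps, followed by the bar graph from the approach, can be tried on a few made-up rows. This is a sketch under stated assumptions: the toy DataFrame and its values are invented, and the Agg backend line is only there so the snippet runs as a plain script. In a notebook that already uses %matplotlib inline it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up deliveries (values invented); same column names as the dataset.
toy = pd.DataFrame({
    "ball": [1, 2, 2, 4, 4, 4],
    "batsman_runs": [0, 0, 1, 0, 0, 4],
    "player_dismissed": ["RG Sharma", "RG Sharma", None,
                         "RG Sharma", "RG Sharma", None],
})

# Boolean mask: True only on rows where RG Sharma was dismissed.
mask = toy["player_dismissed"] == "RG Sharma"

# Keep the True rows, then count the remaining rows for each ball of the over.
toy_dismissed = toy[mask].groupby("ball").count()

# Bar graph of dismissals per ball of the over.
ax = toy_dismissed["player_dismissed"].plot(kind="bar")
ax.set_xlabel("ball of the over")
ax.set_ylabel("dismissals")
plt.tight_layout()
```

In this toy data the tallest bar sits at ball 4, which would be read as the most vulnerable ball of the over.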

 
