Topic-16-SUPERVISED LEARNING-TRAINING DATA PRE-PROCESSING
TRAINING DATA PRE-PROCESSING
The first step in the machine learning pipeline is to clean and transform the training data into a usable format for analysis and modelling. As such, data pre-processing addresses:
- Assumptions about data shape
- Incorrect data types
- Outliers or errors
- Missing values
- Categorical variables
Loading the Data Set
import pandas as pd
full_data = pd.read_csv("/content/titanic_dataset.csv")
Data Shape
After loading the dataset, I examine its shape to get a better sense of the data and the information it contains.
print("train data", full_data.shape)
View first few rows
full_data.head(5)
This code returns the first 5 rows of the DataFrame.
Data Info
full_data.info()
This code prints information about the DataFrame: the column names, the number of non-null values in each column, and each column's data type.
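The three inspection steps above can be tried without the CSV file. This is a minimal sketch that assumes a hypothetical five-row sample (the values in csv_text are made up for illustration and are not taken from titanic_dataset.csv):

```python
import io

import pandas as pd

# Hypothetical sample standing in for titanic_dataset.csv (illustrative values only).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Cabin
1,0,3,male,22,
2,1,1,female,38,C85
3,1,3,female,,
4,1,1,female,35,C123
5,0,3,male,,
"""

# read_csv accepts a file-like object, so no file on disk is needed here.
sample = pd.read_csv(io.StringIO(csv_text))

print("sample data", sample.shape)  # (number of rows, number of columns)
print(sample.head(3))               # first three rows
sample.info()                       # column names, non-null counts, dtypes
```

Empty fields in the CSV are read as NaN, which is why info() reports fewer non-null values for Age and Cabin than for the other columns.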
Missing Data
import seaborn as sns
sns.heatmap(full_data.isnull(), yticklabels = False, cbar = False, cmap = 'tab20c_r')
This will create a heatmap of the missing values in the DataFrame.
cmap = 'tab20c_r': This is the name of the colormap. 'tab20c_r' is the reversed version of 'tab20c', a qualitative colormap with 20 distinct colors; the _r suffix means that the colors are used in reverse order. Colormaps are useful in data visualization because they represent numerical values through colors, making it easier to identify patterns and variations in the data. Here, full_data.isnull() contains only True and False, so the heatmap displays missing values (NaNs) and non-missing values in two distinct colors, which makes them easy to tell apart.
In summary, the code creates a heatmap that visualizes the missing values in the DataFrame. Omitting the y-axis tick labels and the color bar makes the plot simpler and cleaner.
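A heatmap shows where values are missing; the share of missing values per column can also be computed directly. A minimal sketch, assuming a hypothetical toy frame (the column values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in 'Age' and 'Cabin'.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 38.0, 35.0, 54.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 8.05, 71.28, 53.10, 51.86],
})

# isnull() gives True/False per cell; the mean of a boolean column
# is the fraction of True values, i.e. the fraction of missing entries.
missing_share = toy.isnull().mean()
print(missing_share.sort_values(ascending=False))
```

Running the same two lines on full_data would give the per-column proportions that the heatmap only shows visually.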
The 'Age' variable is missing roughly 20% of its data. This proportion is likely small enough to fill in reasonable replacements using some form of imputation (using the knowledge of the other columns to fill in reasonable values). However, too much data is missing from the 'Cabin' column to do anything useful with it at a basic level. This column may need to be dropped from the data set altogether or changed to another feature such as 'Cabin Known: 1 or 0'.
We want to fill in the missing age data instead of just dropping the rows where the age is missing. One way to do this is to fill in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class.
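The difference between one overall mean and a mean per passenger class can be checked numerically. A minimal sketch on a hypothetical toy frame (the ages are invented, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing ages, three passenger classes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age":    [40.0, 36.0, 30.0, np.nan, 20.0, 26.0, np.nan],
})

# One number for everybody (NaN values are skipped by mean()).
overall_mean = toy["Age"].mean()

# One number per passenger class.
mean_by_class = toy.groupby("Pclass")["Age"].mean()

print("overall mean age:", overall_mean)
print(mean_by_class)
```

If the per-class means differ noticeably from the overall mean, filling by class keeps more of the structure in the data than filling every gap with the same value.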
import matplotlib.pyplot as plt
plt.figure(figsize = (10,7))
sns.boxplot(x = 'Pclass', y = 'Age', data = full_data, palette = 'GnBu_d').set_title('Age by Passenger Class')
plt.show()
plt.figure(figsize = (10,7))
This line creates a new figure 10 inches wide and 7 inches high. It sets the dimensions of the plot canvas before the box plot is drawn.
sns.boxplot
This call generates the box plot using Seaborn's boxplot function.
x='Pclass',
It specifies the 'Pclass' variable as the x-axis (categorical variable).
y='Age',
It specifies the 'Age' variable as the y-axis (numerical variable) to be plotted.
data=full_data,
The data for the plot is taken from the full_data DataFrame.
palette='GnBu_d')
The palette parameter sets the colors used for the boxes.
.set_title('Age by Passenger Class')
This sets the title of the plot to 'Age by Passenger Class'.
plt.show()
This displays the box plot.
In summary, the code creates a box plot to visualize the distribution of ages (y-axis) for different passenger classes (x-axis) in the full_data DataFrame. Each box represents the interquartile range (IQR) of the age values for a passenger class, with the median shown as a horizontal line inside the box. The whiskers extend to the lowest and highest values that lie within 1.5 times the IQR of the box edges, and any points outside this range are drawn as outliers. The chosen color palette ('GnBu_d') colors the boxes in shades running between green and blue.
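The quantities a box plot draws can be computed by hand, which makes the 1.5 times IQR rule concrete. A minimal sketch with NumPy, assuming a hypothetical list of ages and the usual convention that the fences sit 1.5 IQR beyond the quartiles:

```python
import numpy as np

# Hypothetical ages for one passenger class (illustrative values only).
ages = np.array([2, 18, 21, 24, 27, 30, 33, 36, 40, 71], dtype=float)

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences: 1.5 * IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences.
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences is drawn as an individual outlier point.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data points, not at the fences themselves, so a whisker can be shorter than 1.5 times the IQR.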
Average age values to impute based on Pclass for Age
Naturally, the wealthier passengers in the higher classes tend to be older. We'll use these average age values to impute Age based on Pclass.
The code defines a function called impute_age and then applies this function to impute missing values in the 'Age' column of the DataFrame full_data. Let's break down the code step by step:
def impute_age(cols):
This line defines the function impute_age, which takes a single argument cols.
The function is designed to receive one row at a time, holding the two values 'Age' and 'Pclass'.
Age = cols[0]:
This line extracts the 'Age' value from the row cols.
Pclass = cols[1]:
This line extracts the 'Pclass' value from the row cols.
if pd.isnull(Age):
This line checks whether the 'Age' value is missing (NaN) using the pd.isnull() function from pandas.
The following block of code inside the if statement is used for imputing missing values:
if Pclass == 1:
    return 37
elif Pclass == 2:
    return 29
else:
    return 24
- If the 'Pclass' value is 1, it imputes the missing 'Age' value with 37.
- If the 'Pclass' value is 2, it imputes the missing 'Age' value with 29.
- If the 'Pclass' value is anything other than 1 or 2 (including 3), it imputes the missing 'Age' value with 24.
else: return Age:
If the 'Age' value is not missing, it simply returns the original 'Age' value.
full_data['Age'] = full_data[['Age', 'Pclass']].apply(impute_age, axis=1):
This line applies the impute_age function to the 'Age' and 'Pclass' columns of the full_data DataFrame. The apply() method is used with axis=1 to apply the function row-wise, meaning that for each row the function is called with the corresponding 'Age' and 'Pclass' values, and the result (imputed value or original value) is assigned back to the 'Age' column.
In summary, the impute_age function fills missing values in the 'Age' column based on the corresponding 'Pclass' value, and the 'Age' column of full_data is updated with the result.
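Putting the pieces described above together gives roughly the following. This is a reconstruction from the description, run on a hypothetical toy frame instead of full_data. One deliberate change: it reads the row by label (cols['Age']) instead of by position (cols[0]), because recent pandas versions, as far as I know, deprecate integer keys as positions on a label-indexed Series. The fillna/map variant at the end is an alternative I am adding for comparison, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

def impute_age(cols):
    """Return the age for one row, filling a missing age from the class average."""
    age = cols["Age"]
    pclass = cols["Pclass"]
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    else:
        return age

# Hypothetical toy rows: three missing ages, one per passenger class.
toy = pd.DataFrame({
    "Age":    [22.0, np.nan, np.nan, np.nan, 54.0],
    "Pclass": [3, 1, 2, 3, 1],
})

# Row-wise version, as in the walkthrough above.
by_apply = toy[["Age", "Pclass"]].apply(impute_age, axis=1)

# Alternative without a Python-level loop: look up the class average for
# every row, then use it only where 'Age' is missing.
class_age = {1: 37, 2: 29, 3: 24}
by_fillna = toy["Age"].fillna(toy["Pclass"].map(class_age))

print(by_apply.tolist())
print(by_fillna.tolist())
```

Both versions give the same filled column; the second is usually faster on large frames because it avoids calling a Python function once per row.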
Removing the features:
The Cabin column has too many missing values to do anything useful with, so it would be best to remove it from the data frame altogether.
Remove Cabin feature
full_data.drop('Cabin', axis = 1, inplace = True)
This code removes the 'Cabin' feature (column) from the DataFrame full_data. Let's break down the code:
full_data:
This refers to the DataFrame containing the data.
drop():
This method is used to remove rows or columns from a DataFrame.
'Cabin':
This is the name of the column to be removed.
axis=1:
This parameter specifies that we want to remove the column ('Cabin'), not a row. The value 1 refers to columns, while 0 would refer to rows.
inplace=True:
This parameter is set to True, which means that the operation is performed directly on the full_data DataFrame itself, and the changes are applied in place.
If inplace=False (or not specified), the drop() method would return a new DataFrame with the 'Cabin' column removed, but the original full_data DataFrame would remain unchanged.
After running this code, the full_data DataFrame no longer contains the 'Cabin' feature, so no subsequent analysis or operation on the DataFrame will involve it.
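Dropping 'Cabin' outright is the route taken here, but the 'Cabin Known: 1 or 0' idea mentioned earlier can be sketched in two lines. The example below assumes a hypothetical toy frame and a hypothetical column name CabinKnown; it also uses drop() without inplace=True to show that the original frame is then left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Fare":  [71.28, 7.25, 8.05, 51.86],
})

# 1 where a cabin is recorded, 0 where it is missing.
toy["CabinKnown"] = toy["Cabin"].notnull().astype(int)

# Without inplace=True, drop() returns a new DataFrame and leaves 'toy' as it was.
trimmed = toy.drop("Cabin", axis=1)

print(trimmed)
print("original still has Cabin:", "Cabin" in toy.columns)
```

Whether the indicator is worth keeping depends on the model; the sketch only shows that the information "was a cabin recorded" can survive the removal of the column.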
Remove rows with missing data
full_data.dropna(inplace = True)
full_data:
This refers to the DataFrame containing the data.
dropna():
This method is used to remove rows with any missing data (NaN values) from the DataFrame.
inplace=True:
This parameter is set to True, which means that the operation is performed directly on the full_data DataFrame itself, and the changes are applied in place.
If inplace=False (or not specified), the dropna() method would return a new DataFrame with the rows containing missing data removed, but the original full_data DataFrame would remain unchanged.
After running this code, any rows in the full_data DataFrame that have one or more missing values are removed. As a result, full_data only contains rows with complete data, i.e., rows that have no missing values.
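Because dropna() removes a row for a gap in any column, it is worth counting rows before and after. A minimal sketch on a hypothetical toy frame; the subset argument shown at the end is an optional variation, not something the walkthrough above uses:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one gap in 'Age', one in 'Embarked'.
toy = pd.DataFrame({
    "Age":      [22.0, np.nan, 38.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare":     [7.25, 71.28, 8.05, 53.10],
})

rows_before = len(toy)

# Default behaviour: drop every row that has at least one NaN.
complete = toy.dropna()

# Variation: only look at selected columns when deciding what to drop.
only_embarked = toy.dropna(subset=["Embarked"])

print("rows before:", rows_before)
print("rows after dropna():", len(complete))
print("rows after dropna(subset=['Embarked']):", len(only_embarked))
```

Imputing 'Age' and removing 'Cabin' first, as done above, keeps this step from discarding most of the data set.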
Removing the irrelevant columns
Name and Ticket can be removed from the dataset, as these features do not provide additional information about a passenger's likelihood of survival.
Remove unnecessary columns
full_data.drop(['Name','Ticket'], axis = 1, inplace = True)
This code removes the 'Name' and 'Ticket' columns from the DataFrame full_data. Let's break down the code:
full_data:
This refers to the DataFrame containing the data.
drop():
This method is used to remove rows or columns from a DataFrame.
['Name', 'Ticket']:
This is a list of column names to be removed from the DataFrame. In this case, the columns 'Name' and 'Ticket' will be dropped.
axis=1:
This parameter specifies that we want to remove the columns ('Name' and 'Ticket'), not the rows. The value 1 refers to columns, while 0 would refer to rows.
inplace=True:
As above.
After running this code, the 'Name' and 'Ticket' columns are removed from the full_data DataFrame, which then consists only of the remaining columns.
# Convert objects to category data type
objcat = ['Sex','Embarked']
This line lists the columns with object data type (usually containing categorical variables) that are to be converted to the category data type: 'Sex' and 'Embarked'.
for colname in objcat:
    full_data[colname] = full_data[colname].astype('category')
This loop converts each of the listed columns from the object data type to the category data type in the DataFrame full_data.
for colname in objcat:
This is a loop that iterates over each element (column name) in the objcat list.
full_data[colname]
This is used to access the column specified by the current colname.
.astype('category')
This is the method call that converts the data type of the current column to the category data type.
By using this loop, all columns specified in objcat are converted to the category data type.
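What the conversion changes is easiest to see on a small example. A minimal sketch, assuming a hypothetical toy frame with the same two column names:

```python
import pandas as pd

# Hypothetical toy frame with two text columns.
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

objcat = ["Sex", "Embarked"]
for colname in objcat:
    toy[colname] = toy[colname].astype("category")

# Each column now stores a small set of categories plus an integer code per row.
print(toy.dtypes)
print(toy["Embarked"].cat.categories)  # the distinct values, in sorted order
print(toy["Embarked"].cat.codes)       # position of each row's value in that list
```

The values shown to the reader are unchanged; only the internal representation (and the dtype reported by info()) is different.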
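Numeric summary
full_data.describe()
The describe() method returns summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles and maximum.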
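By default describe() skips the category columns created in the previous step. A minimal sketch on a hypothetical toy frame, showing the default numeric summary and, as an optional extra, a summary of the categorical column via the include argument:

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns and one categorical column.
toy = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex":  pd.Categorical(["male", "female", "female", "male"]),
})

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# Optional: summarise categorical columns instead (count, unique, top, freq).
category_summary = toy.describe(include=["category"])

print(numeric_summary)
print(category_summary)
```

The numeric summary is a quick check for implausible minimum or maximum values, which ties back to the "outliers or errors" item at the top of this post.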
# Remove PassengerId
PassengerId can be removed from the dataset because it does not add any useful information in predicting a passenger's survival. The remaining variables are the correct data type.
full_data.drop('PassengerId', inplace = True, axis = 1)