Topic-16-SUPERVISED LEARNING-TRAINING DATA PRE-PROCESSING

 

TRAINING DATA PRE-PROCESSING

The first step in the machine learning pipeline is to clean and transform the training data into a usable format for analysis and modelling.

As such, data pre-processing addresses:

  • Assumptions about data shape
  • Incorrect data types
  • Outliers or errors
  • Missing values
  • Categorical variables

Loading the Data Set

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

full_data = pd.read_csv("/content/titanic_dataset.csv")

Data Shape

After loading the dataset, I examine its shape to get a better sense of the data and the information it contains.

 

print("train data" , full_data.shape)

View first few rows

 

full_data.head(5)

 

 

This code will return the first 5 rows of the full_data DataFrame.

Data Info

 

full_data.info()

 

 

This code prints a summary of the full_data DataFrame, including the number of rows and columns, the data types of the columns, the non-null count for each column, and the memory usage.
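To try the inspection steps without the CSV file, here is a minimal sketch on a tiny made-up DataFrame. The name toy and all of its values are assumptions for illustration only, not rows from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic CSV (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Age": [38.0, np.nan, 27.0, 22.0],
    "Sex": ["female", "male", "female", "male"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print("toy data", toy.shape)

# head(2) returns the first two rows as a new DataFrame
print(toy.head(2))

# info() prints dtypes, non-null counts and memory usage
toy.info()

# count() gives the same non-null counts as a Series, handy for further checks
non_null_counts = toy.count()
print(non_null_counts)
```

A non-null count lower than the row count, as for Age here, is the first hint that a column has missing values.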

Missing Data

From the entry totals above, there appears to be missing data. A heatmap will help visualize which features are missing the most information.

 

sns.heatmap(full_data.isnull(),yticklabels = False, cbar = False,cmap = 'tab20c_r')

 

This will create a heatmap of the missing values in the full_data DataFrame.

 

1.      sns: This is an alias for the Seaborn library, which is commonly used for data visualization.

2.      heatmap(): This is a Seaborn function used to create a heatmap.

 

3.      full_data.isnull(). This expression returns a DataFrame of the same shape as full_data, containing boolean values (True or False) indicating whether each element in full_data is NaN (missing value) or not.

4.      yticklabels = False: This parameter specifies that no y-axis tick labels should be displayed on the heatmap.

5.      cbar = False: This parameter specifies that no color bar should be displayed alongside the heatmap.

6.      cmap = 'tab20c_r': This parameter sets the colormap for the heatmap. 'tab20c' is a qualitative colormap with 20 distinct colors, and the _r at the end of the name indicates that the colors are used in reversed order compared to the original 'tab20c' colormap.

Colormaps are useful in data visualization because they help represent numerical values through colors, making it easier to identify patterns and variations in the data.

For example, when using 'tab20c_r' as the colormap in a heatmap, it will display missing values (NaNs) and non-missing values with distinct colors, making it easy to visually differentiate them and identify patterns in the data with missing values.

 

In summary, the code creates a heatmap that visualizes the missing values in the DataFrame full_data, using a reversed 'tab20c' colormap to represent the missing values as a pattern of colors.

The absence of y-axis tick labels and color bar makes the plot simpler and cleaner.
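The heatmap gives a visual impression. A numeric check of the same isnull() mask can back it up. The sketch below is an illustration on made-up data: the DataFrame toy and its values are assumptions, not the real Titanic columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values) with gaps in two columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Same boolean mask that the heatmap draws: True where a value is missing.
missing_mask = toy.isnull()

# The mean of a boolean column is the fraction of True values,
# i.e. the share of missing entries per column.
missing_share = missing_mask.mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the shares puts the columns with the most missing data first, which is the same ranking the heatmap shows by eye.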

 

 

 

 

The 'Age' variable is missing roughly 20% of its data. This proportion is likely small enough to fill with reasonable replacements through some form of imputation (using the other columns to estimate plausible values). However, too much of the 'Cabin' column is missing to do anything useful with it at a basic level. This column may need to be dropped from the data set altogether or changed into another feature such as 'Cabin Known: 1 or 0'.
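As a rough illustration of the 'Cabin Known: 1 or 0' idea, the sketch below builds such a flag on made-up data. The column name CabinKnown and the toy values are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values): two known cabins, two missing.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# notnull() is True where a cabin is recorded; astype(int) turns that into 1/0.
# "CabinKnown" is an illustrative column name, not one from the original data.
toy["CabinKnown"] = toy["Cabin"].notnull().astype(int)
print(toy)
```

This keeps the information about whether a cabin was recorded even if the sparse 'Cabin' column itself is later dropped.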

We want to fill in missing age data instead of just dropping the missing age data rows. One way to do this is by filling in the mean age of all the passengers (imputation). However, we can be smarter about this and check the average age by passenger class.
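Checking the average age by passenger class can be sketched with a groupby. The data below is made up for illustration, so the resulting averages are not the Titanic figures.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values), including two missing ages.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, 30.0, np.nan, 20.0, 28.0, np.nan],
})

# One mean for everybody; NaN values are skipped by default.
overall_mean = toy["Age"].mean()

# One mean per passenger class.
mean_by_class = toy.groupby("Pclass")["Age"].mean()

print("overall:", overall_mean)
print(mean_by_class)
```

If the per-class means differ noticeably from the overall mean, filling by class should give more plausible ages than a single global value.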

boxplot() function from the Seaborn library

plt.figure(figsize = (10,7))

sns.boxplot(x = 'Pclass', y = 'Age', data = full_data, palette= 'GnBu_d').set_title('Age by Passenger Class')

plt.show()

plt.figure(figsize = (10,7))

 

This line creates a new figure with a specific size of 10 inches in width and 7 inches in height. It sets the dimensions of the plot canvas before creating the box plot.

sns.boxplot

This line generates the box plot using Seaborn's boxplot() function.

x='Pclass',

It specifies the 'Pclass' variable as the x-axis (categorical variable).

y='Age',

It specifies the 'Age' variable as the y-axis (numerical variable) to be plotted.

data=full_data,

The data for the plot is taken from the full_data DataFrame.

palette='GnBu_d')

The palette='GnBu_d' parameter sets the color palette to 'GnBu_d', which is a green-blue sequential color palette.

.set_title('Age by Passenger Class'):

This sets the title of the plot to 'Age by Passenger Class'.

plt.show():

This displays the box plot.

 

In summary, the code creates a box plot to visualize the distribution of ages (y-axis) for different passenger classes (x-axis) in the full_data DataFrame.

Each box represents the interquartile range (IQR) of the age values for each passenger class, with the median shown as a horizontal line inside the box. The whiskers extend to the most extreme data points that lie within 1.5 times the IQR of the box edges, and any points beyond the whiskers are drawn as outliers. The chosen color palette ('GnBu_d') gives each passenger class box its own shade from a green-blue range.
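The quantities behind a box plot can also be computed by hand. This is a minimal sketch on a made-up list of ages, assuming the common 1.5 x IQR rule and pandas' default linear interpolation for quantiles. The names lower_fence and upper_fence are just illustrative.

```python
import pandas as pd

# Hypothetical ages (made-up values) with one very low and one very high value.
ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 71], dtype=float)

q1 = ages.quantile(0.25)      # lower edge of the box
median = ages.quantile(0.50)  # line inside the box
q3 = ages.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                 # height of the box

# Points beyond these fences are the ones drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

The whiskers would then end at the smallest and largest ages that still fall inside the two fences.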

 

 

 

 

 

Average age values to impute based on Pclass for Age

Naturally, the wealthier passengers in the higher classes tend to be older. We'll use these average age values to impute based on Pclass for Age.

 


 

The code for this step defines a function called impute_age and then applies this function to impute missing values in the 'Age' column of the DataFrame full_data. Let's break down the code step by step:

def impute_age(cols)::

This line defines the function impute_age, which takes a single argument cols.

 

The function is designed to receive one row at a time, holding that row's 'Age' and 'Pclass' values.

Age = cols[0]:

This line extracts the 'Age' value (the first entry) from the row cols.

Pclass = cols[1]:

This line extracts the 'Pclass' value (the second entry) from the row cols.

if pd.isnull(Age):

This line checks if the 'Age' value is missing (NaN) using the pd.isnull() function from pandas.

 

The following block of code inside the if statement is used for imputing missing values:

if Pclass == 1:

    return 37

elif Pclass == 2:

    return 29

else:

    return 24

·        If the 'Pclass' value is 1, it imputes the missing 'Age' value with 37.

·        If the 'Pclass' value is 2, it imputes the missing 'Age' value with 29.

·        If the 'Pclass' value is anything other than 1 or 2 (including 3), it imputes the missing 'Age' value with 24.

else: return Age:

If the 'Age' value is not missing, it simply returns the original 'Age' value.

 

full_data['Age'] = full_data[['Age', 'Pclass']].apply(impute_age, axis=1):

This line applies the impute_age function to the 'Age' and 'Pclass' columns of the full_data DataFrame. The apply() method is used along with axis=1 to apply the function row-wise, meaning that for each row in the DataFrame, the function is called with the corresponding 'Age' and 'Pclass' values, and the result (imputed value or original value) is assigned back to the 'Age' column.

 

In summary, the impute_age function is used to fill missing values in the 'Age' column based on the corresponding 'Pclass' value. The function is applied row by row to the 'Age' and 'Pclass' columns of the DataFrame full_data, and the 'Age' column is updated with the imputed values.
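Putting the pieces of the walkthrough together, a runnable sketch of the function could look like the following. It runs on a small made-up DataFrame (toy is an assumption, not the real data) and differs from the walkthrough in one respect: it reads the row values by label (cols["Age"]) instead of by position (cols[0]), because integer-position lookups on a labelled row are deprecated in recent pandas releases. The second half shows an alternative that is not in the original: filling each gap with the mean age of the row's own class, computed from the data instead of hard-coded.

```python
import numpy as np
import pandas as pd


def impute_age(cols):
    """Return the row's Age, or a class-based fallback if Age is missing."""
    age = cols["Age"]
    pclass = cols["Pclass"]
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age


# Hypothetical toy data (made-up values) with three missing ages.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, np.nan, 54.0],
    "Pclass": [3, 1, 2, 3, 1],
})

# axis=1 passes one row at a time to impute_age.
toy["Age"] = toy[["Age", "Pclass"]].apply(impute_age, axis=1)
print(toy)

# Alternative (an assumption, not the original approach): use each class's
# own mean age instead of fixed numbers.
toy2 = pd.DataFrame({
    "Age": [40.0, np.nan, 30.0, np.nan, 24.0, np.nan],
    "Pclass": [1, 1, 2, 2, 3, 3],
})
class_mean = toy2.groupby("Pclass")["Age"].transform("mean")
toy2["Age"] = toy2["Age"].fillna(class_mean)
print(toy2)
```

The hard-coded version is easy to read and explain, while the groupby version stays correct if the data changes, at the cost of being a little less explicit.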

 

Removing the features:

 

The Cabin column has too many missing values to do anything useful with, so it would be best to remove it from the data frame altogether.

Remove Cabin feature

full_data.drop('Cabin', axis = 1, inplace = True)

 

This code removes the 'Cabin' feature (column) from the DataFrame full_data. Let's break down the code:

full_data:

This refers to the DataFrame containing the data.

drop():

This method is used to remove rows or columns from a DataFrame.

'Cabin':

This is the name of the column to be removed, which is 'Cabin' in this case.

axis=1:

This parameter specifies that we want to remove the column ('Cabin'), not the row. The value 1 refers to columns, while 0 would refer to rows.

inplace=True:

This parameter is set to True, which means that the operation will be performed directly on the full_data DataFrame itself, and the changes will be applied in place.

 

If inplace=False (or not specified), the drop() method would return a new DataFrame with the 'Cabin' column removed, but the original full_data DataFrame would remain unchanged.

 

After running this code, the 'Cabin' column will be removed from the full_data DataFrame, and any subsequent analysis or operations on the DataFrame will not involve the 'Cabin' column.
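The difference between the two inplace settings is easy to see on a small example. The sketch below uses a made-up DataFrame. The names toy, without_cabin and result are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (made-up values).
toy = pd.DataFrame({
    "Age": [22, 38],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})

# Without inplace=True, drop() returns a new DataFrame and leaves toy alone.
without_cabin = toy.drop("Cabin", axis=1)
still_has_cabin = "Cabin" in toy.columns
print(still_has_cabin)

# With inplace=True, toy itself is modified and the call returns None.
result = toy.drop("Cabin", axis=1, inplace=True)
print(result)
print(list(toy.columns))
```

Because the in-place call returns None, writing toy = toy.drop(..., inplace=True) would overwrite the DataFrame with None, which is a common slip.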

Remove rows with missing data

 

full_data.dropna(inplace = True)

 

full_data:

This refers to the DataFrame containing the data.

dropna():

This method is used to remove rows with any missing data (NaN values) from the DataFrame.

inplace=True:

This parameter is set to True, which means that the operation will be performed directly on the full_data DataFrame itself, and the changes will be applied in place.

 

If inplace=False (or not specified), the dropna() method would return a new DataFrame with the rows containing missing data removed, but the original full_data DataFrame would remain unchanged.

 

After running this code, any rows in the full_data DataFrame that have one or more missing values will be removed from the DataFrame. As a result, the full_data DataFrame will only contain rows with complete data, i.e., rows that have no missing values. If there were any rows with NaN values in any of the columns before this operation, they will be removed, and the DataFrame will be updated accordingly.
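Since dropna() discards a whole row for a single missing value, it is worth counting rows before and after. The sketch below does this on made-up data and also shows the subset parameter, which limits the check to chosen columns. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values): one missing Age, one missing Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

rows_before = len(toy)

# Drop every row that has at least one missing value in any column.
complete = toy.dropna()

# Drop only the rows where 'Embarked' is missing; gaps in 'Age' are kept.
only_embarked = toy.dropna(subset=["Embarked"])

print(rows_before, len(complete), len(only_embarked))
```

Comparing the counts shows how much data each choice costs, which matters when rows are scarce.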

Removing the irrelevant columns

Name and Ticket can be removed from the dataset as these features do not provide additional information about a passenger's likelihood of survival.

Remove unnecessary columns

full_data.drop(['Name','Ticket'], axis = 1, inplace = True)

 

This code removes the 'Name' and 'Ticket' columns from the DataFrame full_data. Let's break down the code:

full_data:

This refers to the DataFrame containing the data.

drop():

This method is used to remove rows or columns from a DataFrame.

['Name', 'Ticket']:

This is a list of column names to be removed from the DataFrame. In this case, the columns 'Name' and 'Ticket' will be dropped.

axis=1:

This parameter specifies that we want to remove the columns ('Name' and 'Ticket'), not the rows. The value 1 refers to columns, while 0 would refer to rows.

inplace=True:

As above, this applies the change directly to full_data instead of returning a new DataFrame.

 

After running this code, the 'Name' and 'Ticket' columns will be removed from the full_data DataFrame. The DataFrame will no longer contain these two columns, and any subsequent analysis or operations on the DataFrame will not involve the 'Name' and 'Ticket' columns. The full_data DataFrame will only consist of the remaining columns after the removal.

 

# Convert objects to category data type

 

objcat = ['Sex','Embarked']

This list names the columns with object data type (usually containing categorical variables) that will be converted to the category data type: 'Sex' and 'Embarked'.

 

for colname in objcat:

    full_data[colname] = full_data[colname].astype('category')

This code uses a loop to convert multiple columns with object data type (presumably containing categorical variables) to the category data type in the DataFrame full_data. The loop iterates over the list objcat, which contains the column names that need to be converted. Here's what the code does step-by-step:

for colname in objcat::

This is a loop that iterates over each element (column name) in the objcat list.

full_data[colname]:

This is used to access the column specified by the current colname during each iteration of the loop.

 

astype('category'):

This is the method call that converts the data type of the current column to the category data type.

 

By using this loop, all columns specified in objcat (in this case, 'Sex' and 'Embarked') will be converted from the object data type to the category data type. The category data type is more memory-efficient for columns with a limited number of unique values and can improve performance in certain categorical data operations. After running this loop, the DataFrame full_data will have the specified columns converted to the category data type.
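The memory claim can be checked directly. The sketch below repeats the same loop on a made-up DataFrame with 1,000 rows and compares memory_usage(deep=True) before and after. The data and the variable names are assumptions for illustration, and the exact byte counts will vary with the pandas version.

```python
import pandas as pd

# Hypothetical toy data (made-up values): few unique strings, many rows.
toy = pd.DataFrame({
    "Sex": ["male", "female"] * 500,
    "Embarked": ["S", "C", "Q", "S"] * 250,
})

# deep=True also counts the memory held by the strings themselves.
bytes_before = toy.memory_usage(deep=True).sum()

objcat = ["Sex", "Embarked"]
for colname in objcat:
    toy[colname] = toy[colname].astype("category")

bytes_after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(bytes_before, bytes_after)

# Each category column stores its unique labels once, plus small integer codes.
print(list(toy["Embarked"].cat.categories))
```

The saving comes from storing each distinct label only once, so it grows with the number of rows and shrinks as the number of unique values approaches the row count.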

 

# Numeric summary

full_data.describe()

 

The describe() method in pandas is used to generate a numeric summary of the DataFrame. It provides descriptive statistics for each numerical column in the DataFrame full_data, such as count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum.
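To see what describe() produces, here is a minimal sketch on a made-up DataFrame with one numeric and one text column. The values are assumptions chosen so the statistics are easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data (made-up values).
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Sex": ["male", "female", "female", "male"],
})

# By default only numeric columns are summarised, so 'Sex' is left out.
summary = toy.describe()
print(summary)

# Individual statistics can be read with .loc[row_label, column_name].
print(summary.loc["mean", "Age"])
print(summary.loc["25%", "Age"], summary.loc["75%", "Age"])
```

The result is itself a DataFrame, so single statistics can be pulled out and reused, for example as fill values during imputation.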

 

# Remove PassengerId

 

PassengerId can be removed from the dataset because it does not add any useful information in predicting a passenger's survival. The remaining variables have the correct data types.

full_data.drop('PassengerId', inplace = True, axis = 1)

 

 

 
