Thursday, November 14, 2019

Beginning Machine Learning - Logistic Regression Algorithm - Titanic Dataset

While there are many great tutorials online that I've used, this one is based mostly on the "Machine Learning Full Course - Learn Machine Learning 10 Hours | Machine Learning Tutorial | Edureka" video. Some of the other sites I've used are also in the references below:


Logistic Regression is used in situations where the outcome is binary: either True or False, on or off, yes or no, 0 or 1.

Whereas in linear regression the value to predict is continuous, in logistic regression the value to predict is categorical, i.e. on or off, yes or no, 0 or 1, etc. Logistic regression therefore solves a classification problem.
While linear regression fits a straight line, logistic regression fits an S-Curve. The S-Curve, or Sigmoid function, maps any input to a value between 0 and 1 and is what is used to predict the Y value.
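To make the S-Curve concrete, here is a minimal sketch of the sigmoid function (my own illustration, not from the course). It squashes any real-valued input into the range 0 to 1, which is what lets logistic regression output a probability:

#!/usr/bin/env python3
# Minimal sketch of the sigmoid function: sigmoid(z) = 1 / (1 + e^-z)
import numpy as np
from matplotlib import pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))       # plots the characteristic S-Curve
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.show()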


A good example of where logistic regression is used is predicting the weather, e.g. whether or not it will rain; predicting a continuous value such as the temperature would instead be done via linear regression.

The steps to be considered are:
Collect data -> Analyze the data -> Perform data wrangling -> Set up our train and test sets -> Do an accuracy check to see how our algorithm is performing (a bare-bones sketch of these steps follows, then the full script).
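Here is a minimal sketch of that pipeline with scikit-learn, assuming a generic CSV file with a placeholder 'target' column (the file name and column name are stand-ins, not the actual Titanic data). The real walkthrough is in the full script below:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv('data.csv')                    # collect data (placeholder file)
df.info()                                       # analyze the data
df = df.dropna()                                # data wrangling (simplified)
X = df.drop('target', axis=1)                   # features ('target' is a placeholder)
y = df['target']                                # value to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)                     # train
print(accuracy_score(y_test, model.predict(X_test)))  # accuracy check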

#!/usr/bin/env python3

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

'''
This code is part of me continuing my machine learning journey and is focused on
logistic regression. 
Author: Nik Alleyne
Author Blog: www.securitynik.com
filename: titanicLogisticRegression.py

This uses the titanic dataset an example can be found at 
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

pclass - Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)
survived - (0 = No, 1 = Yes)
name - Name
sex - Sex
age - Age 
sibsp - Number of Siblings/Spouse Aboard 
parch - Number of parents/Children Aboard
ticket - Ticket Number
fare - Passenger Fare (British Pounds)
cabin - Cabin
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat - Lifeboat
body - Body Identification Number
home.dest - Home/Destination
'''


def main():
    # Read the Excel file
    df = pd.read_excel('./titanic.xls')
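    # Note: pandas relies on a separate engine (the xlrd package) to parse
    # .xls files; if read_excel raises an ImportError, installing xlrd
    # should resolve it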

    #Let's drop a few columns which may not be relevant
    df.drop(['boat'], axis=1, inplace=True)    
    df.drop(['body'], axis=1, inplace=True)
    df.drop(['home.dest'], axis=1, inplace=True)

    # print information on the dataset
    # (note: df.info() prints directly and returns None, so call it on its own)
    print('[*] Information on the dataset:')
    df.info()
    
    #Let's see if we can read a few of the records
    print('[*] First 10 records are: \n{}'.format(df.head(10)))
    print('[*] The total number of entries is:{}'.format(len(df.index)))

    # Let's visualize the data from different perspectives
    sns.set(style='darkgrid')
    sns.countplot(x='sex', data=df)
    plt.show()

    '''
    Interestingly, the graph above shows the ratio of men to women
    on the titanic was almost 2 to 1
    '''

    #Let's now look at the survivors vs non survivors
    sns.countplot(x = 'survived', data=df)
    plt.show()
    '''
    From the graph returned here, it showed that just about 62%
    of the passengers did not survive
    '''
    
    # From the survivors, how many were men vs women
    sns.countplot(x='survived', hue='sex', data=df)
    plt.show()
    '''
    The graph for this finding was important to me.
    Even though there were more males on the titanic, significantly
    more females survived compared to males
    '''


    
    #Let's see how the passengers were distributed by class
    sns.countplot(x = 'pclass', data=df)
    plt.show()
    '''
    From the graph produced here, it showed the majority of the passengers
    were in class 3, and surprisingly (to me) there were more passengers in
    first class than in second class
    '''

    # Now let's see if class had anything to do with their survival
    sns.countplot(x = 'survived', hue='pclass', data=df)
    plt.show()
    '''
    Believe it or not, it would seem like class made a difference.
    Those in 3rd class were more likely to have not survived, and
    1st class had the most folks who survived. Then again, it could be
    possible that since 3rd class had the most passengers, this is why
    most of them did not survive. That could be true. However, do remember
    from above, there were more men than women on the titanic, yet there
    were significantly more women who survived than men
    '''

    # Finally, what was the age distribution of the passengers?
    sns.countplot(x='age', data=df)
    plt.show()
    '''
    Looks like age 24 had the largest number of passengers on the titanic. Interesting
    '''

    # Since there are entries that are blank or null, we need to clean up those records
    print('[*] Checking for null entries ... \n {}'.format(df.isnull()))
    '''
    The results show that we have entries which are null
    '''

    #Let's now see exactly which columns have null values and their count
    print('[*] Count of columns with null values \n{}'.format(df.isnull().sum()))
    '''
    From the results returned, age, cabin and embarked consists of null values
    '''

    # For the columns with null values, we could fill them in with dummy values.
    # The cabin column has so many null values, though, that we simply drop it
    df.drop(['cabin'], axis=1, inplace=True)

    #clean up nan entries
    df.dropna(inplace=True)
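    # Note: dropna() discards every row that still has any null value
    # (mostly the rows with missing ages), so the dataset shrinks here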

    #Let's see if we have any null entries again. They should all be gone
    print('[*] Count of columns with null values \n{}'.format(df.isnull().sum()))
    '''
    Very nice! The results from this show that all the null values have
    been cleaned up. Nice clean dataset
    '''

    '''Still have to wrangle some of this data. We have to convert the string
    values to numerical ones. For example, string values exist for name, sex,
    ticket and embarked. We have to convert these to categorical variables in
    order to implement logistic regression.
    Basically, we need to ensure no strings are present as we implement machine learning.
    We will thus use pandas to help us out here with creating dummy variables
    '''

    #Currently 3 classes, let's simplify this via binary dummy columns
    pClass = pd.get_dummies(df['pclass'], drop_first=True)
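    # drop_first=True dropped the first dummy column (class 1), since its
    # value is implied when the remaining columns are all 0; this avoids
    # keeping a redundant, perfectly correlated column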
    print('[*] Here is what the class currently looks like \n{}'.format(pClass))
    
    # Let's get the sex to a binary value: True or false for male or female
    male_female = pd.get_dummies(df['sex'], drop_first=True)
    print('[*] Here is what the sex column currently looks like \n{}'.format(male_female))
    
    # Since embarked consists of one of 3 categories, we can do what we did to the class here
    embark = pd.get_dummies(df['embarked'], drop_first=True)
    print('[*] Here is what the embark column currently looks like \n{}'.format(embark))

    #Let's now add these new columns into our existing dataset
    df = pd.concat([df,pClass,male_female,embark], axis=1)
    print('[*] Our data now looks like \n {}'.format(df.head()))
    
    
    #Now that we have the new columns added, we need to remove the previous values and any other irrelevant columns
    df.drop(['pclass', 'sex', 'name', 'embarked', 'ticket'], axis=1, inplace=True)
    print('[*] Our finalized dataset:\n{}'.format(df.head()))


    '''
    Let's now look at splitting our data into our training set and test set
    specifically we would like to test if someone survives the titanic
    '''
    # first our features/independent variables. Use everything other than the y column
    X = df.drop('survived', axis=1)
    
    #our y axis / dependent variable. The value we would like to predict
    y = df['survived']

    #Split our dataset: 70% for training and 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
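    # random_state=1 fixes the shuffle seed, so the same split (and the
    # same scores below) is reproduced on every run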

    
    lr = LogisticRegression(verbose=True, solver='lbfgs')
    lr.fit(X_train, y_train)
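    # Note: with the default max_iter=100, the lbfgs solver may warn that it
    # did not converge on this data; if so, passing a larger max_iter
    # (e.g. LogisticRegression(max_iter=1000)) is a common fix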

    # Time for a prediction on the test data
    my_prediction = lr.predict(X_test)
    print('[*] Prediction on survival based on test data:\n{}'.format(my_prediction))

    #Time to test the accuracy of our model
    print('[*] Our Classification report \n {}'.format(classification_report(y_test, my_prediction)))
    

    '''Looking at the accuracy from the perspective of confusion matrix
     To learn more about confusion matrix see:
     https://en.wikipedia.org/wiki/Confusion_matrix
     https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
    '''
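    # For this binary problem, scikit-learn lays the matrix out as:
    #   [[TN, FP],
    #    [FN, TP]]
    # where rows are the actual values (0 = did not survive, 1 = survived)
    # and columns are the predicted values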
    print('[*] Results from Confusion Matrix:\n{}'.format(confusion_matrix(y_test,my_prediction)))

    # Now let's calculate accuracy the easy way
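    # accuracy_score is simply (TN + TP) / total test samples, i.e. the
    # fraction of correct predictions from the confusion matrix above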
    print('[*] Accuracy of the model is:{}'.format(accuracy_score(y_test,my_prediction)))


if __name__ == '__main__':
    main()

References:
https://www.youtube.com/watch?v=GwIo3gDZCVQ&list=PL9ooVrP1hQOHUfd-g8GUpKI3hHOwM_9Dn&index=1
https://stackoverflow.com/questions/46623583/seaborn-countplot-order-categories-by-count
