Logistic regression is used in situations where the outcome is binary: true or false, on or off, yes or no, 0 or 1.
Whereas in linear regression the value to predict is continuous, in logistic regression the value to predict is categorical, i.e. on or off, yes or no, 0 or 1, etc. In other words, logistic regression solves a classification problem.
While linear regression fits a straight line, logistic regression fits an S-curve. This S-curve, the sigmoid function, squashes any input into a value between 0 and 1, which can then be used to predict the Y value.
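As a quick illustration (a toy sketch of my own, separate from the titanic script below), here is the sigmoid function and how it maps real numbers into the (0, 1) range:

```python
import numpy as np

def sigmoid(z):
    # squash any real-valued input into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))     # exactly 0.5 -- the midpoint of the S-curve
print(sigmoid(10))    # close to 1
print(sigmoid(-10))   # close to 0
```

A common convention is to predict class 1 when sigmoid(z) >= 0.5 and class 0 otherwise.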
A good example of where logistic regression is used is predicting the weather, e.g. whether it will rain or not; predicting a continuous quantity such as the temperature would instead be a job for linear regression.
The steps to be considered are:
Collect the data -> Analyze the data -> Perform data wrangling -> Set up our train and test sets -> Do an accuracy check to see how our algorithm is performing
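The steps above can be sketched end to end on a toy dataset before we get to the real titanic data (this is a minimal sketch of my own using scikit-learn's synthetic data generator, not part of the script below):

```python
# Minimal end-to-end sketch: data -> split -> fit -> accuracy check
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "Collect" a synthetic binary-classification dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# Set up our train and test sets: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Fit the model and check its accuracy on the held-out test set
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print('Accuracy: {}'.format(acc))
```

The titanic script that follows walks through the same pipeline, with the data wrangling steps that a real dataset requires.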
#!/usr/bin/env python3
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

'''
This code is part of me continuing my machine learning journey
and is focused on logistic regression.

Author: Nik Alleyne
Author Blog: www.securitynik.com
filename: titanicLogisticRegression.py

This uses the titanic dataset; an example can be found at
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

pclass    - Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)
survived  - Survival (0 = No, 1 = Yes)
name      - Name
sex       - Sex
age       - Age
sibsp     - Number of Siblings/Spouses Aboard
parch     - Number of Parents/Children Aboard
ticket    - Ticket Number
fare      - Passenger Fare (British Pound)
cabin     - Cabin
embarked  - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat      - Lifeboat
body      - Body Identification Number
home.dest - Home/Destination
'''

def main():
    # Read the Excel file
    df = pd.read_excel('./titanic.xls')

    # Drop a few columns which may not be relevant
    df.drop(['boat'], axis=1, inplace=True)
    df.drop(['body'], axis=1, inplace=True)
    df.drop(['home.dest'], axis=1, inplace=True)

    # Print information on the dataset
    # (df.info() prints directly, so it is called on its own rather than
    # inside a format string, which would also print a trailing 'None')
    print('[*] Information on the dataset:')
    df.info()

    # Read a few of the records
    print('[*] First 10 records are: \n{}'.format(df.head(10)))
    print('[*] The total number of entries is: {}'.format(len(df.index)))

    # Visualize the data from different perspectives
    sns.set(style='darkgrid')
    sns.countplot(x='sex', data=df)
    plt.show()
    '''
    Interestingly, this graph showed that men outnumbered women on the
    titanic by almost two to one
    '''

    # Survivors vs non-survivors
    sns.countplot(x='survived', data=df)
    plt.show()
    '''
    From the graph returned here, just about 62% of the passengers did
    not survive
    '''

    # From the survivors, how many were men vs women?
    sns.countplot(x='survived', hue='sex', data=df)
    plt.show()
    '''
    The graph for this finding was important to me. Even though there were
    more males on the titanic, a significant number of females survived
    compared to males
    '''

    # How were the passengers distributed by class?
    sns.countplot(x='pclass', data=df)
    plt.show()
    '''
    The graph produced here showed the majority of the passengers were in
    class 3, and surprisingly (to me) there were more passengers in first
    class than in second class
    '''

    # Now let's see if class had anything to do with their survival
    sns.countplot(x='survived', hue='pclass', data=df)
    plt.show()
    '''
    Believe it or not, it would seem class made a difference. Those in 3rd
    class were more likely not to have survived, while 1st class had the
    most survivors. Then again, since 3rd class had the most passengers,
    that alone could explain why most of them did not survive. However,
    remember from above: there were more men than women on the titanic,
    yet significantly more women survived than men
    '''

    # Finally, what was the age distribution of the passengers?
    sns.countplot(x='age', data=df)
    plt.show()
    '''
    Looks like age 24 had the largest number of passengers on the titanic.
    Interesting
    '''

    # Since there are entries that are blank, we need to clean up those records
    print('[*] Checking for null entries ... \n{}'.format(df.isnull()))
    '''
    The results show that we have entries which are null
    '''

    # See exactly which columns have null values and their count
    print('[*] Count of columns with null values \n{}'.format(df.isnull().sum()))
    '''
    From the results returned, age, cabin and embarked contain null values
    '''

    # Drop the cabin column, since it has so many null values
    df.drop(['cabin'], axis=1, inplace=True)

    # Clean up the remaining NaN entries
    df.dropna(inplace=True)

    # Check for null entries again. They should all be gone
    print('[*] Count of columns with null values \n{}'.format(df.isnull().sum()))
    '''
    Very nice! The results show that all the null values have been cleaned
    up. Nice clean dataset
    '''

    '''
    Still have to wrangle some of this data. String values exist for name,
    sex, ticket and embarked, and these must be converted to numerical
    (dummy/categorical) variables in order to implement logistic regression.
    Basically, we need to ensure no strings are present when we fit the
    model. We will use pandas to create the dummy variables
    '''
    # Currently 3 classes; encode them as dummy/indicator variables
    pClass = pd.get_dummies(df['pclass'], drop_first=True)
    print('[*] Here is what the class currently looks like \n{}'.format(pClass))

    # Get the sex to a binary value: True or False for male
    male_female = pd.get_dummies(df['sex'], drop_first=True)
    print('[*] Here is what the sex column currently looks like \n{}'.format(male_female))

    # Since embarked consists of one of 3 categories, do what we did with class
    embark = pd.get_dummies(df['embarked'], drop_first=True)
    print('[*] Here is what the embark column currently looks like \n{}'.format(embark))

    # Add these new columns into our existing dataset
    df = pd.concat([df, pClass, male_female, embark], axis=1)
    print('[*] Our data now looks like \n{}'.format(df.head()))

    # Now that we have the new columns added, remove the original string
    # columns and any other irrelevant columns
    df.drop(['pclass', 'sex', 'name', 'embarked', 'ticket'], axis=1, inplace=True)
    print('[*] Our finalized dataset:\n{}'.format(df.head()))

    '''
    Let's now split our data into a training set and a test set.
    Specifically, we would like to predict whether someone survives
    the titanic
    '''
    # First our features/independent variables: everything except the y column
    X = df.drop('survived', axis=1)

    # Our dependent variable: the value we would like to predict
    y = df['survived']

    # Split our dataset: 70% for training and 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1)

    lr = LogisticRegression(verbose=True, solver='lbfgs')
    lr.fit(X_train, y_train)

    # Time for a prediction on the test data
    my_prediction = lr.predict(X_test)
    print('[*] Prediction on survival based on test data:\n{}'.format(my_prediction))

    # Time to test the accuracy of our model
    print('[*] Our classification report \n{}'.format(
        classification_report(y_test, my_prediction)))

    '''
    Looking at the accuracy from the perspective of a confusion matrix.
    To learn more about confusion matrices, see:
    https://en.wikipedia.org/wiki/Confusion_matrix
    https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
    '''
    print('[*] Results from Confusion Matrix:\n{}'.format(
        confusion_matrix(y_test, my_prediction)))

    # Now let's calculate accuracy the easy way
    print('[*] Accuracy of the model is: {}'.format(
        accuracy_score(y_test, my_prediction)))


if __name__ == '__main__':
    main()
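To make the confusion matrix output easier to read, here is a small self-contained example of my own (the labels are made up for illustration) showing how scikit-learn lays out the binary confusion matrix and how accuracy can be computed from its four cells:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and predictions (0 = did not survive, 1 = survived)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# For binary labels, ravel() unpacks the 2x2 matrix in this order:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 2

# Accuracy is the fraction of correct predictions: (TP + TN) / total
acc = (tp + tn) / (tp + tn + fp + fn)
print(acc)  # -> 0.666..., matching accuracy_score(y_true, y_pred)
```

This is exactly what `accuracy_score` computes "the easy way" at the end of the script above.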
References:
https://www.youtube.com/watch?v=GwIo3gDZCVQ&list=PL9ooVrP1hQOHUfd-g8GUpKI3hHOwM_9Dn&index=1
https://stackoverflow.com/questions/46623583/seaborn-countplot-order-categories-by-count