Thursday, November 14, 2019

Beginning Machine Learning - Logistic Regression Algorithm - Titanic Dataset

While there have been many great tutorials online that I've used, this one is mostly from the "Machine Learning Full Course - Learn Machine Learning 10 Hours | Machine Learning Tutorial | Edureka" video. Some of the other sites I've used are also within the references.


Logistic Regression is used in situations where the outcome is binary: either True or False, on or off, yes or no, 0 or 1.

Whereas in linear regression the value to predict is continuous, in logistic regression the value to predict is categorical, i.e. on or off, yes or no, 0 or 1, etc. Logistic regression solves a classification problem.
While linear regression fits a straight line, logistic regression fits an S-curve. The S-curve, or sigmoid function, squashes any input into a value between 0 and 1, which can then be read as a probability and used to predict the Y value.
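
To make the S-curve concrete, here is a minimal sketch (mine, not from the tutorial) that plots the sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)):

#!/usr/bin/env python3

import numpy as np
from matplotlib import pyplot as plt

# Evaluate the sigmoid over a range of inputs and plot the S-curve
z = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-z))

plt.plot(z, sigmoid, label='sigmoid(z) = 1 / (1 + e^-z)')
plt.axhline(0.5, linestyle='--', color='gray', label='0.5 decision threshold')
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.legend()
plt.show()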


A good example of where logistic regression is used is predicting the weather, e.g. whether or not it will rain (a yes/no outcome). Weather prediction can also be done via linear regression, e.g. predicting the actual temperature, which is a continuous value.

The steps to be considered are:
 Collect data -> Analyze the data -> Perform data wrangling -> Set up the train and test sets -> Do an accuracy check to see how the algorithm is performing (a minimal scikit-learn skeleton of this flow is sketched below).
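
In scikit-learn terms, that flow looks roughly like this skeleton (assuming X and y are an already-wrangled feature matrix and label column); the full working program follows:

# Skeleton of the workflow: split -> fit -> predict -> accuracy check
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))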

#!/usr/bin/env python3

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

'''
This code is part of me continuing my machine learning journey and is focused on
logistic regression. 
Author: Nik Alleyne
Author Blog: www.securitynik.com
filename: titanicLogisticRegression.py

This uses the titanic dataset an example can be found at 
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

pclass - Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)
survived - (0 = No, 1 = Yes)
name - Name
sex - Sex
age - Age 
sibsp - Number of Siblings/Spouses Aboard 
parch - Number of Parents/Children Aboard
ticket - Ticket Number
fare - Passenger Fare (British pounds)
cabin - Cabin
embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) 
boat - Lifeboat
body - Body Identification Number
home.dest - Home/Destination
'''


def main():
    # Read the Excel file
    df = pd.read_excel('./titanic.xls')

    #Let's drop a few columns which may not be relevant
    df.drop(['boat'], axis=1, inplace=True)    
    df.drop(['body'], axis=1, inplace=True)
    df.drop(['home.dest'], axis=1, inplace=True)

    # Print information on the dataset. Note that df.info() prints directly
    # to stdout and returns None, so call it on its own rather than inside format()
    print('[*] Information on the dataset:')
    df.info()
    
    #Let's see if we can read a few of the records
    print('[*] First 10 records are: \n{}'.format(df.head(10)))
    print('[*] The total number of entries is:{}'.format(len(df.index)))

    # Let's visualize the data from different perspectives
    sns.set(style='darkgrid')
    sns.countplot(x='sex', data=df)
    plt.show()

    ''' 
    Interestingly, the graph above shows that men outnumbered women
    on the Titanic by almost two to one
    '''

    #Let's now look at the survivors vs non survivors
    sns.countplot(x = 'survived', data=df)
    plt.show()
    ''' 
    The graph returned here shows that about 62% of the 
    passengers did not survive
    '''
    
    # From the survivors, how many were men vs women
    sns.countplot(x='survived', hue='sex', data=df)
    plt.show()
    '''
    The graph for this finding was important to me.
    Even though there were more males on the Titanic, significantly 
    more females survived compared to males
    '''


    
    #Let's see how the passengers were distributed by class
    sns.countplot(x = 'pclass', data=df)
    plt.show()
    ''' 
    The graph produced here shows that the majority of the passengers were
    in class 3, and surprisingly (to me) there were more passengers in 
    first class than in second class
    '''

    # Now let's see if class had anything to do with their survival
    sns.countplot(x = 'survived', hue='pclass', data=df)
    plt.show()
    '''
    Believe it or not, it would seem that class made a difference. 
    Those in 3rd class were more likely not to have survived, while 
    1st class had the most survivors. Then again, it could be 
    possible that since 3rd class had the most passengers, that is why 
    most of them did not survive. However, do remember from above that 
    there were more men than women on the Titanic, yet significantly 
    more women survived than men
    '''

    # Finally, what was the age distribution of the passengers?
    sns.countplot(x='age', data=df)
    plt.show()
    '''
    Looks like 24 was the most common age among the passengers on the Titanic. Interesting
    '''

    # Since some records have blank or missing entries, we need to clean those up
    print('[*] Checking for null entries ... \n {}'.format(df.isnull()))
    '''
    The results show that we have entries which are null
    '''

    #Let's now see exactly which columns have null values and their count
    print('[*] Count of columns with null values \n{}'.format(df.isnull().sum()))
    '''
    From the results returned, the age, cabin and embarked columns contain null values
    '''

    # Deal with the columns that have null values.
    # Let's drop the cabin column, since it has so many null values
    df.drop(['cabin'], axis=1, inplace=True)

    # Drop any remaining rows that contain NaN entries
    df.dropna(inplace=True)

    #Let's see if we have any null entries again. They should be all gone
    print('[*] Count of columns with null values \n{}'.format(df.isnull().sum()))
    '''
    Very nice! The results from this show that all the null values have 
    been cleaned up. Nice clean dataset
    '''

    '''Still have to wrangle some of this data. We have to convert the string values
    to numbers. For example, string values exist for name, sex, ticket and embarked.
    We have to convert these to categorical (dummy) variables in order to implement
    logistic regression. Basically, we need to ensure no strings are present when we
    apply machine learning. We will thus use pandas to help us out here with
    creating dummy variables (see the illustrative comment below)
    '''
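    # Illustrative example (made-up values, not from the dataset):
    # pd.get_dummies(['male', 'female', 'male'], drop_first=True) yields a
    # single 'male' column with values [1, 0, 1]; drop_first=True removes the
    # redundant first level, since 'female' is implied when 'male' is 0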

    # Currently 3 classes; let's simplify this into binary dummy columns
    pClass = pd.get_dummies(df['pclass'], drop_first=True)
    print('[*] Here is what the class currently looks like \n{}'.format(pClass))
    
    # Let's get the sex to a binary value: 1 or 0 for male or female
    male_female = pd.get_dummies(df['sex'], drop_first=True)
    print('[*] Here is what the sex column currently looks like \n{}'.format(male_female))
    
    # Since embarked consists of one of 3 categories, we can do what we did to the class here
    embark = pd.get_dummies(df['embarked'], drop_first=True)
    print('[*] Here is what the embark column currently looks like \n{}'.format(embark))

    #Let's now add these new columns into our existing dataset
    df = pd.concat([df,pClass,male_female,embark], axis=1)
    print('[*] Our data now looks like \n {}'.format(df.head()))
    
    
    #Now that we have the new columns added, we need to remove the previous values and any other irrelevant columns
    df.drop(['pclass', 'sex', 'name', 'embarked', 'ticket'], axis=1, inplace=True)
    print('[*] Our finalized dataset:\n{}'.format(df.head()))


    '''
    Let's now look at splitting our data into training and test sets.
    Specifically, we would like to predict whether someone survives the Titanic.
    '''
    # First our features/independent variables. Use everything other than the y column
    X = df.drop('survived', axis=1)
    
    # Our y / dependent variable. The value we would like to predict
    y = df['survived']

    # Split our dataset: 70% for training and 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

    
    lr = LogisticRegression(verbose=True, solver='lbfgs')
    lr.fit(X_train, y_train)

    # Time for a prediction on the test data
    my_prediction = lr.predict(X_test)
    print('[*] Prediction on survival based on test data:\n{}'.format(my_prediction))

    #Time to test the accuracy of our model
    print('[*] Our Classification report \n {}'.format(classification_report(y_test, my_prediction)))
    

    '''Looking at the accuracy from the perspective of the confusion matrix.
     To learn more about confusion matrix see:
     https://en.wikipedia.org/wiki/Confusion_matrix
     https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
    '''
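    # For this binary problem, scikit-learn lays the confusion matrix out as:
    #     [[TN  FP]
    #      [FN  TP]]
    # rows are the actual classes (0, 1); columns are the predicted classes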
    print('[*] Results from Confusion Matrix:\n{}'.format(confusion_matrix(y_test,my_prediction)))

    # Now let's calculate accuracy the easy way
    print('[*] Accuracy of the model is:{}'.format(accuracy_score(y_test,my_prediction)))


if __name__ == '__main__':
    main()
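
As a possible follow-on (not part of the original script): once lr has been fit, one could score a single hypothetical passenger. This is only a sketch with made-up values, and it assumes it is placed at the end of main(), where lr and X are still in scope.

    # Hypothetical passenger (made-up values): age 25, no siblings/spouse,
    # no parents/children, fare 7.25, 3rd class (pclass dummies 2=0, 3=1),
    # male, embarked at Southampton (embarked dummies Q=0, S=1).
    # Reusing X.columns keeps the feature order identical to the training frame.
    passenger = pd.DataFrame([[25, 0, 0, 7.25, 0, 1, 1, 0, 1]], columns=X.columns)
    print('[*] Predicted survival (0=No, 1=Yes): {}'.format(lr.predict(passenger)))
    print('[*] Class probabilities: {}'.format(lr.predict_proba(passenger)))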

References:
https://www.youtube.com/watch?v=GwIo3gDZCVQ&list=PL9ooVrP1hQOHUfd-g8GUpKI3hHOwM_9Dn&index=1
https://stackoverflow.com/questions/46623583/seaborn-countplot-order-categories-by-count

Beginning Machine Learning - Linear Regression

This post is the second part of my journey to learn machine learning. Hopefully I'm improving along the way :-). Feel free to add your comments on what I should do differently.

#!/usr/bin/env python3

'''
    This code is based on me learning more about Linear Regression 
    This is part of me expanding my knowledge on machine learning

    This version of the code uses scikit-learn

    Author: Nik Alleyne
    blog: www.securitynik.com
    filename: linearRegresAlgo_v3.py

'''


import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split



def main():
    print('[*] Beginning Linear regression ...')
    
    # Reading Data - This file was downloaded from GitHub. 
    # See the reference section for the URL
    df = pd.read_csv('./headbrain.csv',sep=',', dtype='int64', verbose=True)

    
    print('[*] First 10 records \n {}' .format(df.head(10)))
    print('[*] Quick description of the dataframe: \n{}'.format(df.describe()))
    print('[*] (rows, columns): {}'.format(df.shape))

    # Let's now create the X and Y arrays
    X = np.array(df['Head Size(cm^3)'].values).reshape(-1, 1)
    Y = np.array(df['Brain Weight(grams)'].values)
    
    #Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state=10)

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    print('[*] When X is 4234 the predicted value of y is {}'.format(lr.predict([[4234]])))
    
    # Note: this R2 is computed over the full dataset (train + test),
    # not just the held-out test split
    r_sqr_score = lr.score(X, Y)
    print('[*] The R2 score is {}'.format(r_sqr_score))



if __name__ == '__main__':
    main()
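
As a side note (not in the original script), the fitted slope and intercept can be read straight off the model; a small sketch, assuming lr has been fit as above:

    # Inspect the fitted line y = b1*x + b0
    print('[*] Slope (b1): {}'.format(lr.coef_[0]))
    print('[*] Intercept (b0): {}'.format(lr.intercept_))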


The output from the above code is as follows:


root@securitynik:~/ML# ./linearRegresAlgo_v3.py 
[*] Beginning Linear regression ...
Tokenization took: 0.06 ms
Type conversion took: 0.21 ms
Parser memory cleanup took: 0.00 ms
[*] First 10 records 
    Gender  Age Range  Head Size(cm^3)  Brain Weight(grams)
0       1          1             4512                 1530
1       1          1             3738                 1297
2       1          1             4261                 1335
3       1          1             3777                 1282
4       1          1             4177                 1590
5       1          1             3585                 1300
6       1          1             3785                 1400
7       1          1             3559                 1255
8       1          1             3613                 1355
9       1          1             3982                 1375
[*] Quick description of the dataframe: 
           Gender   Age Range  Head Size(cm^3)  Brain Weight(grams)
count  237.000000  237.000000       237.000000           237.000000
mean     1.434599    1.535865      3633.991561          1282.873418
std      0.496753    0.499768       365.261422           120.340446
min      1.000000    1.000000      2720.000000           955.000000
25%      1.000000    1.000000      3389.000000          1207.000000
50%      1.000000    2.000000      3614.000000          1280.000000
75%      2.000000    2.000000      3876.000000          1350.000000
max      2.000000    2.000000      4747.000000          1635.000000
[*] (rows, columns): (237, 4)
[*] When X is 4234 the predicted value of y is [1441.04828161]
[*] The R2 score is 0.6388174521966088



Beginning Machine Learning - Rebuilding the Linear Regression Algorithm

Over the last few months, I've been caught up with expanding my knowledge on machine learning. As a result, these next few posts are all about me documenting my learning. As stated in many of my previous posts, this is all about making it easier for me to be able to refresh my memory in the future.

While there have been many great tutorials online that I've used, this one is mostly from the "Machine Learning Full Course - Learn Machine Learning 10 Hours | Machine Learning Tutorial | Edureka" video on YouTube. Some of the other sites I've used are also within the references.

In this post I'm rebuilding the linear regression algorithm, and in the next post we use scikit-learn's LinearRegression.


#!/usr/bin/env python3

'''
    This code is based on me learning more about Linear Regression 
    This is part of me expanding my knowledge on machine learning
    In this version I'm rebuilding the algorithm 

    Author: Nik Alleyne
    blog: www.securitynik.com
    filename: linearRegresAlgo_v2.py

'''


import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)



def main():
    print('[*] Beginning Linear regression ...')
    
    # Reading Data - This file was downloaded from GitHub. 
    # See the reference section for the URL
    df = pd.read_csv('./headbrain.csv',sep=',', dtype='int64', verbose=True)

    # Gather information on the shape of the dataset
    print('[*] {} rows, columns in the training dataset'.format(df.shape))
    print('[*] First 10 records of the training dataset')
    print(df.head(10))

    # Let's now create the X and Y arrays
    X = df['Head Size(cm^3)'].values
    Y = df['Brain Weight(grams)'].values

    #Find the mean of X and Y
    mean_x = np.mean(X)
    mean_y = np.mean(Y)
    print('[*] The mean of X is {} || The mean of Y is {} '.format(mean_x, mean_y))
    
    # Calculating the coefficients
    # See formula here: https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/multiple-regression/methods-and-formulas/methods-and-formulas/#coefficient-coef
    numerator = 0
    denominator = 0

    for i in range(len(X)):
        numerator += ((X[i] - mean_x) * (Y[i] - mean_y))
        denominator += (X[i] - mean_x) ** 2
    b1 = numerator / denominator
    b0 = mean_y - (b1 * mean_x)
    print('[*] Coefficients:-> Slope (b1): {} || Intercept (b0): {}'.format(b1, b0))

    # When compared to the equation y = mx+c, we can say m = b1 & c = b0

    # create the graph
    max_x = np.max(X) + 100
    min_x = np.min(X) - 100

    # Calculating line values x and y
    x = np.linspace(min_x, max_x, 1000)
    y = b0 + b1 * x

    #plotting the line
    plt.plot(x,y, color='r', label='Regression Line')
    plt.scatter(X, Y, c='b', label='Scatter Plot')

    plt.xlabel('Head Size(cm^3)')
    plt.ylabel('Brain Weight(grams)')
    plt.legend()
    plt.show()

    # Let's now use the R2 score (coefficient of determination) to determine how good the model is
    # Formula can be found here
    # https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/multiple-regression/methods-and-formulas/methods-and-formulas/#coefficient-coef
    ss_total = 0
    ss_error = 0

    for i in range(len(X)):
        y_pred = b0 + b1 * X[i]
        ss_total += (Y[i] - mean_y) ** 2
        ss_error += (Y[i] - y_pred) ** 2
    r_sq = 1 - (ss_error/ss_total)
    print('[*] Your R2 value is: {}'.format(r_sq))




if __name__ == '__main__':
    main()
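
The loops above implement the least-squares formulas b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2), b0 = mean_y - b1 * mean_x, and R2 = 1 - ss_error / ss_total. As a side note (not in the original script), the same values fall out in vectorized form with NumPy; a small sketch, assuming X and Y are the same NumPy arrays used in main():

# Vectorized equivalents of the coefficient and R2 loops above
def least_squares(X, Y):
    mean_x, mean_y = X.mean(), Y.mean()
    b1 = ((X - mean_x) * (Y - mean_y)).sum() / ((X - mean_x) ** 2).sum()
    b0 = mean_y - b1 * mean_x
    return b1, b0

def r_squared(X, Y, b0, b1):
    y_pred = b0 + b1 * X
    ss_error = ((Y - y_pred) ** 2).sum()
    ss_total = ((Y - Y.mean()) ** 2).sum()
    return 1 - ss_error / ss_total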


When we run the code, we get:



root@securitynik:~/ML# ./linearRegresAlgo_v2.py | more
[*] Beginning Linear regression ...
Tokenization took: 0.06 ms
Type conversion took: 0.23 ms
Parser memory cleanup took: 0.00 ms
[*] (237, 4) rows, columns in the training dataset
[*] First 10 records of the training dataset
   Gender  Age Range  Head Size(cm^3)  Brain Weight(grams)
0       1          1             4512                 1530
1       1          1             3738                 1297
2       1          1             4261                 1335
3       1          1             3777                 1282
4       1          1             4177                 1590
5       1          1             3585                 1300
6       1          1             3785                 1400
7       1          1             3559                 1255
8       1          1             3613                 1355
9       1          1             3982                 1375
[*] The mean of X is 3633.9915611814345 || The mean of Y is 1282.873417721519
[*] Coefficients:-> Slope (b1): 0.26342933948939945 || Intercept (b0): 325.57342104944223
[*] Your R2 value is: 0.6393117199570003



That's it, my first shot at machine learning. In the next post we use scikit-learn rather than building the algorithm ourselves.


References:
https://www.youtube.com/watch?v=GwIo3gDZCVQ&list=PL9ooVrP1hQOHUfd-g8GUpKI3hHOwM_9Dn&index=1
https://matplotlib.org/3.1.1/tutorials/introductory/customizing.html#sphx-glr-tutorials-introductory-customizing-py
Headbrain.csv 
read_csv
Calculating Coefficient
R2