Sunday, December 22, 2019

Coming Soon!!! Learning by Practicing - Mastering TShark Network Forensics - Moving from Zero to Hero

If you are looking for the right book to help you expand your mastery of TShark network forensics, this is the book you have been waiting for.

In Mastering TShark Network Forensics, you learn TShark from the basics to the not-so-basic. There are numerous challenges throughout the book to reinforce the content as you go along.

Along with the various challenges in the chapters, there is a section fully dedicated to solving real-world challenges using TShark. This is meant to provide further reinforcement of the content previously learned.

As a final takeaway, you leverage TShark with Python to perform IP threat intelligence, comparing IPs found on a known blacklist with what has been captured in your environment.
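
To give a rough idea of what that looks like, here is a minimal sketch (not taken from the book; the file names "capture.pcap" and "blacklist.txt" and the field selection are made-up assumptions for illustration):

#!/usr/bin/env python3
# A minimal sketch (not from the book) of IP threat intelligence with TShark and Python:
# compare the IPs seen in a capture against a known blacklist.
# 'capture.pcap' and 'blacklist.txt' are made-up names for illustration.
import subprocess

# Have tshark print just the source and destination IP of every packet, one packet per line
tshark_output = subprocess.run(
    ['tshark', '-r', 'capture.pcap', '-T', 'fields', '-e', 'ip.src', '-e', 'ip.dst'],
    capture_output=True, text=True).stdout

captured_ips = set()
for line in tshark_output.splitlines():
    captured_ips.update(field for field in line.split('\t') if field)

# Assume one IP per line in the blacklist file
with open('blacklist.txt') as blacklist_file:
    blacklisted_ips = {line.strip() for line in blacklist_file if line.strip()}

# Any overlap between the two sets is worth investigating
for bad_ip in sorted(captured_ips & blacklisted_ips):
    print('[!] Blacklisted IP seen in your environment: {}'.format(bad_ip))

Because "-T fields -e" prints only the selected fields, the comparison reduces to a simple set intersection.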
 
 
Download the sample chapters here to confirm this is the TShark book you have been waiting for.



NOTE: All PCAPS can be found on the Author's GitHub page located here

Do enjoy the read! Please do leave a comment on what you liked, what you didn't like and, most importantly, what I can do differently next time.

Thursday, December 12, 2019

Come hang out at one of my upcoming classes to expand your knowledge on Intrusion Detection, Incident Handling, Hacker Techniques & Exploits

New Schedule coming out soon!!

Build on your Red & Blue Team skills from a practical perspective while learning about the Cyber Kill Chain

It's finally here! If you are looking for the right book to help you expand your network forensics knowledge, this is the book you need.

In Hack and Detect we leverage the Cyber Kill Chain for practical hacking and, more importantly, its detection using network forensics. In this book you will use Kali and many of its tools, including Metasploit, to hack, and then do lots of detecting via logs and packet analysis. We also implement mitigation strategies to limit or prevent future compromises.

Grab your copy from Amazon to learn more.
https://www.amazon.com/dp/1731254458





Alternatively, grab the updated and production-ready sample chapters here to get a sneak peek of what you can expect.

NOTE: All sample logs, pcaps, vbscripts, etc. can be found on the book's GitHub page located here. This means if you don't wish to build your own lab, you have all you need to follow along.

Alternatively, you can use this link: https://bit.ly/NikAlleyne-Hackand-Detect


Do enjoy the read! Please do leave a comment on what you liked, what you didn't like and, most importantly, what I can do differently next time, if I decide to go down this road again. :-)

Wireless Security Analysis with Pandas

For this part of my Pandas learning, I drove around the neighbourhood collecting wireless information which I could use as part of this analysis.

To prepare to capture the traffic, I did as follows:
root@securitynik:~/PA-Pandas# airmon-ng start wlan0
root@securitynik:~/PA-Pandas# ifconfig 

...

wlan0mon: flags=867<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,ALLMULTI>  mtu 1500
        unspec 00-C0-CA-75-0B-E5-30-3A-00-00-00-00-00-00-00-00  txqueuelen 1000  (UNSPEC)
        RX packets 92483  bytes 19493415 (18.5 MiB)
        RX errors 0  dropped 968  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Then start the actual capture:
root@securitynik:~/PA-Pandas# airodump-ng --write securitynik-Wi-Fi-Test wlan0mon

Once the capture started, among the files it created were:
root@securitynik:~/PA-Pandas# ls -al *.csv
-rw-r--r-- 1 root root  436908 Nov 17 17:41 securitynik-Wi-Fi-Test-01.csv
-rw-r--r-- 1 root root  207204 Nov 17 17:41 securitynik-Wi-Fi-Test-01.kismet.csv
-rw-r--r-- 1 root root 5387235 Nov 17 17:41 securitynik-Wi-Fi-Test-01.log.csv

We will use the "securitynik-Wi-Fi-Test-01.csv" file. This file currently has 3884 lines as shown below.
root@securitynik:~/PA-Pandas# cat securitynik-Wi-Fi-Test-01.csv | wc --lines

3884

Looking at some sample data from the "securitynik-Wi-Fi-Test-01.csv" file
root@securitynik:~/PA-Pandas# cat securitynik-Wi-Fi-Test-01.csv | more

BSSID, First time seen, Last time seen, channel, Speed, Privacy, Cipher, Authentication, Power, # beacons, # IV, LAN IP, ID-length, ESSID, Key

00:15:FF:5D:60:C4, 2019-11-17 16:25:17, 2019-11-17 17:27:03, -1,  -1, , ,   ,  -1,        0,        0,   0.  0.  0.  0,   0, ,
88:DC:96:25:E8:94, 2019-11-17 16:25:11, 2019-11-17 17:27:04,  3, 270, WPA2, CCMP, PSK, -59,        8,        0,   0.  0.  0.  0,   6, @ASPMC,
70:B3:17:1C:BA:80, 2019-11-17 16:25:11, 2019-11-17 17:27:01,  1, 195, WPA2, CCMP, PSK, -60,       17,        0,   0.  0.  0.  0,  13, SnaponIncMISS,
00:02:6F:FD:FD:1C, 2019-11-17 16:25:15, 2019-11-17 17:27:02, 11, 130, WPA2, CCMP, PSK, -61,        9,        1,   0.  0.  0.  0,   5, FGSEG,
82:D2:94:B7:86:83, 2019-11-17 16:25:19, 2019-11-17 17:27:01,  1, 360, WPA2, CCMP, PSK, -61,       54,        0,   0.  0.  0.  0,   0, ,

...

While trying to read the file into Pandas, an error occurred. This, it seems, is because the file contains two sections. Therefore, we will split it into two.

From below, we see the two section headers:

root@securitynik:~/PA-Pandas# cat securitynik-Wi-Fi-Test-01.csv | grep -i BSSID
BSSID, First time seen, Last time seen, channel, Speed, Privacy, Cipher, Authentication, Power, # beacons, # IV, LAN IP, ID-length, ESSID, Key
Station MAC, First time seen, Last time seen, Power, # packets, BSSID, Probed ESSIDs

Now that we have a snapshot of the data, let's switch to our code:

#!/usr/bin/env python3

'''
This code is me learning about Pandas Data Science. 
This is based on the Pandas for Data Science training from Pentester Academy
I decided to do things from my own perspective to some extent
I drove around the neighbourhood and captured the Wi-Fi information, so that
I can get my own perspective from my own data

Feel free to use this code as you see fit

Author: Nik Alleyne
Author Blog: www.securitynik.com


'''

from io import StringIO
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import subprocess as sp
import sys



def usage():
    print('[*] Usage Information: ')
    print('[*] ./pandas-Wi-Fi.py <filename>. e.g. ./pandas-Wi-Fi.py my.csv')
    print('[*] Author: Nik Alleyne')
    print('[*] Author Blog: www.securitynik.com')
    sys.exit(-1)



def wifi_data_analysis(csv_file):
    print('[*] Opening the csv file ... ')
    wifi_data = open(csv_file, 'r').read()

    '''
    Need to split the data into two sections before creating the Pandas dataframe
    As can be seen below, there are two headers sections. One of these is for the Access Point
    and the other is for the client

root@securitynik:~/PA-Pandas# cat securitynik-Wi-Fi-Test-01.csv | grep -i BSSID
BSSID, First time seen, Last time seen, channel, Speed, Privacy, Cipher, Authentication, Power, # beacons, # IV, LAN IP, ID-length, ESSID, Key
Station MAC, First time seen, Last time seen, Power, # packets, BSSID, Probed ESSIDs

    '''
    print('[+] Splitting the data into an AP and a client section')
    client_header = 'Station MAC, First time seen, Last time seen, Power, # packets, BSSID, Probed ESSIDs'
    ap_client_split = wifi_data.index(client_header)

    # Get AP section
    wifi_ap_data = StringIO(wifi_data[:ap_client_split])

    # Get Client section
    wifi_client_data = StringIO(wifi_data[ap_client_split:])

    '''
    This was a pain in the ass. Kept getting errors when attempting to create the DataFrame.
    Fortunately, this link helped to solve the problem
    https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data
    '''

    print('[+] Creating Access Point DataFrame ...')
    access_point_df = pd.read_csv(wifi_ap_data, sep=',', header=0, skipinitialspace=True, error_bad_lines=False, warn_bad_lines=False, parse_dates=['First time seen', 'Last time seen'])

    print('\n[*] Access Point column information before conversion {}'.format(access_point_df.columns))

    # My understanding is that we are better off renaming those columns to names without spaces
    access_point_df.rename(columns={ 'First time seen' : 'FirstTimeSeen', 'Last time seen' : 'LastTimeSeen', '# beacons' : 'BeaconsCount', '# IV' : 'IV', 'LAN IP' : 'LAN-IP', 'ID-length' : 'ID-Length' }, inplace=True)




    print('\n[*] Sample Access Point data \n {}'.format(access_point_df.head()))
    print('\n[*] Getting the overall count of Access Point data \n {}'.format(access_point_df.count()))
    print('\n[*] Overall you have {} rows and {} columns in the AP dataframe \n'.format(access_point_df.shape[0], access_point_df.shape[1]))
    print('\n[*] Data types in the AP dataframe \n {}'.format(access_point_df.dtypes))

    # Looking for the unique Access Point SSID
    print('\n[*] Here are the ESSIDs found ... \n {}' .format(list(set(access_point_df.ESSID))))
    
    # Get a count of the total unique SSIDs returned
    print('\n[*] Total unique SSIDs returned was:{} \n' .format(len(list(set(access_point_df.ESSID)))))

    # Looking for situations where there are NANs
    print('[*] Do we have any "nan" values \n {}'.format(access_point_df.ESSID.hasnans))

    # Now that we see we have ESSID with NAN values, let's replace them
    access_point_df.ESSID.fillna('HIDDEN ESSID', inplace=True)

    # Let's now check again for those nan values
    print('[*] Do we have any "nan" values \n {}'.format(access_point_df.ESSID.hasnans))
    print('[*] First 5 records after the replacement of nans \n {}' .format(access_point_df.head()))
    # Good stuff, we replaced all the nan values

    # Looking at the frequency with which the SSIDs have been seen
    print('[*] Frequency of the SSID seen \n {}'.format(access_point_df.ESSID.value_counts()))

    # Plot the graph of the usage
    access_point_df.ESSID.value_counts().plot(kind='pie', figsize=(10,5))
    plt.show()

    # Looking at the channels in use
    print('\n[*] Frequency of the channels being seen \n {}'.format(access_point_df.channel.value_counts()))
    access_point_df.channel.value_counts().plot(kind='bar', figsize=(10,5))
    plt.show()
    
    # Time now for some grouping
    # first group by ESSID and the channels they are seen on
    print('\n[*] Grouping by SSID and channel ... \n {}'.format(access_point_df.groupby(['ESSID', 'channel'])['channel'].count()))

    # Looking at unstack
    print('\n[*] Looking at unstacking ... \n {}'.format(access_point_df.groupby(['ESSID', 'channel'])['channel'].count().unstack()))
    
    # The result above produced a number of channels with 'nan' values. Time to fill that with 0s
    print('\n[*] Filled the NANs with 0 ... \n {}'.format(access_point_df.groupby(['ESSID', 'channel'])['channel'].count().unstack().fillna(0)))

    # Create graph of the grouping information
    access_point_df.groupby(['ESSID', 'channel'])['channel'].count().unstack().fillna(0).plot(kind='bar', stacked=True, figsize=(10,5)).legend(bbox_to_anchor=(1.1,1))
    plt.show()


    # Extract the OUI from the MAC address - basically the first 3 bytes
    oui_manufacturer = access_point_df.BSSID.str.extract('(..:..:..)', expand=False)
    print('\n[*] Here are the first 10 manufacturer OUIs \n {} '.format(oui_manufacturer.head(10)))

    # Print the counts of each OUI
    print('\n[*] Here are the manufacturer OUIs with their counts \n {} '.format(oui_manufacturer.value_counts()))

    
    '''
        Client information and analysis start from here
    '''
    print('*'*100)
    print('[+] Creating Client DataFrame ...')
    client_df = pd.read_csv(wifi_client_data, sep=',', header=0, skipinitialspace=True, error_bad_lines=False, warn_bad_lines=False, parse_dates=['First time seen', 'Last time seen'])
    print('\n[*] Client column information before conversion {}'.format(client_df.columns))

    # Once again, addressing the space issue between column names
    client_df.rename(columns= {'Station MAC' : 'StationMAC', 'First time seen' : 'FirstTimeSeen', 'Last time seen' : 'LastTimeSeen', 'Power' : 'Power', '# packets' : 'PacketCount', 'BSSID' : 'BSSID', 'Probed ESSIDs' : 'ProbedESSIDs'}, inplace=True)

    print('\n[*] Sample client data \n {}'.format(client_df.head()))
    print('\n[*] Getting the overall count of client data \n {}'.format(client_df.count()))
    print('\n[*] Overall you have {} rows and {} columns in the client dataframe \n'.format(client_df.shape[0], client_df.shape[1]))
    print('\n[*] Data types in the client dataframe \n {}'.format(client_df.dtypes))

    # Taking a look at the BSSIDs the clients are associated with
    print('\n[*] Here are your client BSSIDs \n {}'.format(client_df.BSSID.head()))

    # Looking at the probed ESSIDs
    print('\n[*] Here are the ESSIDs the clients are probing for ... \n {}'.format(client_df.ProbedESSIDs))



def main():
    sp.call(['clear'])
    sns.set_color_codes('dark')
    # Checking the command line to ensure 1 argument is passed to the command
    if (len(sys.argv) != 2 ):
        usage()
    else:
        print('[*] Reading command line arguments ... ')
        if (sys.argv[1].endswith('.csv')):
            print('[*] Found a CSV file ... ')
        else:
            print('[!] File is not .csv file. Exiting!!')
            sys.exit(-1)
            
    # Reading the CSV file
    wifi_data_analysis(sys.argv[1])



if __name__ == '__main__':
    main()

References:
seaborn.set_color_codes
Code Academy - Seaborn Styling, Part 2: Color
Pandas Read CSV
Pandas DataFrame Plot

Posts in this series:
Beginning Numpy
Beginning Pandas
Pandas String Operations, etc.

Pandas String Operations, etc.

Still learning about Pandas


#!/usr/bin/env python3

'''
    Pandas strings, etc

'''

import pandas as pd
import numpy as np
import string

def main():
    # Create the first series consisting of name and age
    series_name_age = pd.Series(np.random.randint(1,50,26), name='age' ,index=list(string.ascii_lowercase[:26]))
    series_name_age.index.name = 'Name'  
    print('[*] Content of series_name_age \n{}'.format(series_name_age))

    #Create a second series consisting of name and income
    series_name_income = pd.Series(np.random.randint(100000,500000,26), name='Income', index=list(string.ascii_lowercase[:26]))
    series_name_income.index.name = 'Name' 
    print('\n[*] Content of series_name_income \n{}'.format(series_name_income))

    # Considering the values reported in the income series, flag those making above 400K
    print('\n[*] Here is the list of people making above 400K \n {}'.format(series_name_income > 400000))
    
    # While the above only showed True or False, let's see the actual values
    print('\n[*] Actual income values \n{}'.format(series_name_income[series_name_income > 400000]))


    # Check to see if everyone makes a salary above 100000
    print('\n[*] Does everyone make above 100000? \n{}'.format((series_name_income > 100000).all()))

    # Check to see if everyone makes a salary above 400000
    print('\n[*] Does everyone make above 400000? \n{}'.format((series_name_income > 400000).all()))

    # Check to see if anyone, not everyone makes above 450000
    print('\n[*] Does anyone make above 450000? \n{}'.format((series_name_income > 450000).any()))


    # To convert a series to a different type just do as shown below:
    print('\n[*] Series_name_income as String \n{}'.format(series_name_income.to_string()))
    print('\n[*] Series_name_income as List \n{}'.format(series_name_income.to_list()))
    print('\n[*] Series_name_income as Dict \n{}'.format(series_name_income.to_dict()))
    print('\n[*] Series_name_income as Json \n{}'.format(series_name_income.to_json()))


    #Let's test to see if any of the values which were generated for income or age were duplicated
    print('\n[*] These are the unique values for age: \n{}'.format(series_name_age.unique()))
    print('\n[*] These are the unique values for income: \n{}'.format(series_name_income.unique()))

    # Let's now look for numbers which might have been duplicated and the number of times they appear
    print('\n[*] Age values usage and their occurrences: \n{}'.format(series_name_age.value_counts()))
    print('\n[*] Income value usage and their occurrences: \n{}'.format(series_name_income.value_counts()))

    # Let's get the minimum income and age
    print('\n[*] The minimum value for age: \n{}'.format(series_name_age.min()))
    print('\n[*] The minimum value for income: \n{}'.format(series_name_income.min()))

    # Let's get the maximum income and age
    print('\n[*] The max value for age: \n{}'.format(series_name_age.max()))
    print('\n[*] The max value for income: \n{}'.format(series_name_income.max()))

    # Now that we have the min and max of age and income, let's find the mean
    print('\n[*] The mean value for age to two decimals: \n{:.2f}'.format(series_name_age.mean()))
    print('\n[*] The mean value for income to two decimals: \n{:.2f}'.format(series_name_income.mean()))


if __name__ == '__main__':
    main()

Posts in this series:
Beginning Numpy
Beginning Pandas
Pandas String Operations, etc.

Pandas GroupBy

Learning about Pandas GroupBy from the perspective of the Iris Dataset

#!/usr/bin/env python3

'''
    Using the iris dataset to learn more about groupby
'''

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

def main():
    iris_df = pd.read_csv('./iris.data')
    print('[*] First 10 records \n {}'.format(iris_df.head(10)))

    #Get the column names
    print('\n[*] Column names \n{}'.format(iris_df.columns))

    # To determine the different species within the dataset
    print('\n[*] The number of unique species is: {}'.format(len(set(iris_df.species))))
    print('[*] The unique species in the dataset are:  \n{}'.format(set(iris_df.species)))
    
    # Let's now group these by species
    group_by_species = iris_df.groupby('species')
    
    # Get the groups and their indices
    print('\n[*] Iris dataset now grouped by species \n {}'.format(group_by_species.indices))
    
    # Let's get the keys for above
    print('\n[*] Iris dataset keys \n {}'.format(group_by_species.indices.keys()))

    # Let's get the values for above
    print('\n[*] Iris dataset values \n {}'.format(group_by_species.indices.values()))

    # Iterating through the group
    for key, value in group_by_species:
        print('\n \\//-->>   Group Starts Here   <<--\\//')
        print('\n [*]{0} {1} \n'.format(key, value))
        print('\n \\//-->>   Group Ends Here   <<--\\// \n')


    # Rather than iterating, we could have just viewed the contents as a list
    print('\n\n[*] List view - Datasets group by species \n {}'.format(list(group_by_species)))

    # Get a specific group
    print('\n[*] Data for the Iris-setosa group \n {}'.format(group_by_species.get_group('Iris-setosa')))


if __name__ == '__main__':
    main()

Posts in this series:
Beginning Numpy
Beginning Pandas
Pandas String Operations, etc.

Pandas DataFrame Basics

* While the Pandas Series is like an array, the Pandas DataFrame is like a spreadsheet.
* It has both rows and columns, which are generally labeled
* The row labels make up the index
* A DataFrame has two axes: "axis=0" and "axis=1".
* "axis=0" runs down the rows (the index). An operation along "axis=0" is computed over all rows, producing one result per column.
* "axis=1" runs across the columns. An operation along "axis=1" is computed over all columns, producing one result per row (see the short example below).
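
To make the two axes concrete, here is a minimal sketch (the DataFrame and its values are made up for illustration):

#!/usr/bin/env python3
# Minimal sketch of axis=0 vs axis=1; the DataFrame and its values are made up
import pandas as pd

scores_df = pd.DataFrame({'math': [90, 60, 75], 'english': [70, 85, 80]},
                         index=['User-1', 'User-2', 'User-3'])

# axis=0 runs down the rows: one mean per column
print(scores_df.mean(axis=0))    # math 75.0, english 78.33...

# axis=1 runs across the columns: one mean per row
print(scores_df.mean(axis=1))    # User-1 80.0, User-2 72.5, User-3 77.5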


#!/usr/bin/env python3

import numpy as np
import pandas as pd


def main():
    my_data = {
            'User-1': [10, 'M', 'Cricketer'],
            'User-2': [30, 'F', 'BasketBall' ],
            'User-3': [15, 'F', 'Table Tennis'],
            'User-4': [100, 'M', 'History'],
            'User-5': [50, 'F', 'Soccer']
            }

    users_df = pd.DataFrame(my_data)
    print('\n[*] Current view of the dataframe \n {}'.format(users_df))
    print('\n[*] Here are your indexes \n {}'.format(users_df.index))
    print('\n[*] Here are your columns \n {}'.format(users_df.columns))
    print('\n[*] Here are your values \n {}'.format(users_df.values))

    # Add a new column
    users_df['new_index'] = ['Num', 'Sex', 'Sports']
    print('[*] The new dataframe \n {}'.format(users_df))

    #Change the index to the newly created column and make the change on the existing dataframe
    users_df.set_index('new_index', inplace=True)
    
    # Add a name to the columns axis; it is displayed above the column labels
    users_df.columns.name = 'New Index'
    print('\n[*] users_df with new index column \n{}'.format(users_df))

    # to access a single column
    print('\n[*] Print information on User-2 \n {}'.format(users_df['User-2']))
    
    # To access multiple columns, leverage a list
    print('\n[*] Print information on User-2 and User-5 \n {}'.format(users_df[['User-2', 'User-5']]))

    # Access the information for the entire 'Num' row
    print('\n[*] Print the Num row \n {}'.format(users_df.loc['Num']))

    # To figure out the type of data returned
    print('\n[*] Type for the return column \n {}'.format(type(users_df.loc['Num'])))

    # Print the sport for User-2. Notice the usage of '.at'. Also this has to be row,column
    print('\n[*] Print information on User-2 sports  \n {}'.format(users_df.at['Sports', 'User-2']))

    # Let's now transpose our dataframe. That is make the columns rows and the rows into columns
    users_transpose_df = users_df.T
    print('\n[*] Here we transpose the dataframe. We made the columns into rows and the rows into columns \n {}'.format(users_transpose_df))
    
    # find everyone whose Num is less than 50
    print('\n[*] Here is everyone whose Num is less than 50 \n {}'.format((users_transpose_df.Num < 50)))

    # Create a new column based on the information just returned
    users_transpose_df['derived_num_lt_50'] = users_transpose_df.Num < 50

    print('\n[*] Here is your new dataframe with its derived column \n {}'.format(users_transpose_df))

    # Let's now add a row and print it out
    users_transpose_df.loc['User-6'] = [70, 'M', 'Volleyball', 0]
    print('\n[*] New row added for User-6\n {}'.format(users_transpose_df))

    # Let's now describe the dataframe
    print('\n[*] Describing the dataframe \n {}'.format(users_transpose_df.describe()))

    # We can also describe specific column. In this case the Num
    print('\n[*] Describing the Num column \n {}'.format(users_transpose_df.Num.describe()))

    # Whereas the index was set above, we can reset the index
    print('\n[*] Index reset. Note the new index to the left with the incrementing numbers \n {}'.format(users_transpose_df.reset_index()))
    

if __name__ == '__main__':
    main()

Beginning Pandas

In this second part of my learning ...

* Pandas works well with text-based data and is extremely powerful.
* Pandas allows us to look at data from both a macro and micro perspective.
* One of its main features is the DataFrame. DataFrames can be considered similar to Excel spreadsheets.
* It also has the capability for Series, which are array-like.
* Pandas index does not have to start at 0
* Pandas index does not have to be ordered
* Pandas index does not have to be a number. Can be a list of strings
* Pandas indexes are very flexible
* A Series provides both the index and its values
* If an index is not specifically defined, Pandas will create an incrementing index
* Can even create Pandas series from Python dictionaries
* Pandas has the capability to do both position-based and label-based lookup. The two should not be confused
* To use label-based lookup, use ".loc"
* For position-based lookup, use ".iloc"
* Can pass a list of index values to both ".loc" and ".iloc"
* Alternatively, we can use ".ix". This tries to first look up by label and, if that fails, by position. Note that ".ix" is deprecated in favour of ".loc" and ".iloc".
* Pandas is built on Numpy


#!/usr/bin/env python3

import numpy as np
import pandas as pd


'''
side_by_side function from Wes McKinney, author of Pandas
if using python3, see this link for error you may get relating to the adjoin function
https://stackoverflow.com/questions/38156965/pandas-cannot-import-name-adjoin
'''
def side_by_side(*objs, **kwds):
  from pandas.io.formats.printing import adjoin
  space = kwds.get('space', 4)
  reprs = [repr(obj).split('\n') for obj in objs]
  print(adjoin(space, *reprs))



def main():
    print('[*] You are running pandas version {}'.format(pd.__version__))
    print('[*] You are running numpy version {}'.format(np.__version__))

    # Create a panda series with 6 random values between 1 and 100 from numpy
    rand_series = pd.Series(np.random.randint(1,100,6), index=['rand0', 'rand1', 'rand2', 'rand3', 'rand4', 'rand5'], name='Rand Series')
    rand_series.index.name = 'Rand Value'
    print('\n[*] \n{}'.format(rand_series))

    # We can also create a series from a python list
    list_series = pd.Series([98, 80, -50, 70, -10, 15], index=['numA','numB', 'numC', 'numD', 'numE', 'numF' ], name='list_series')
    list_series.index.name = 'List Value'
    print('\n[*] Current values in list_series \n{}'.format(list_series))

    # Create series from python dictionary, when used as below, all values were converted to Not a Number (NAN)
    #dict_series = pd.Series({10:1, 8:2, 11:4, 20:3, 5:9, 6:20}, index=['dict0', 'dict1', 'dict2', 'dict3', 'dict4', 'dict5'])

    # Use this instead
    dict_series = pd.Series({'dict0':1, 'dict1':2, 'dict2':4, 'dict3':3, 'dict4':9, 'dict5':20}, name='Dict Series')
    dict_series.index.name = 'Dict Value'
    print('\n[*] Current values in the dict_series \n {}'.format(dict_series))
    print()
    # Now call the side_by_side function which was defined above
    # Note: side_by_side() prints directly and returns None, so call it on its own rather than inside format()
    print('[*] The 3 series side-by-side')
    side_by_side(rand_series, list_series, dict_series)

    #print the value of a specific index using its string name for a label based lookup
    print('\n[*] The value of numB is:{}\n '.format(list_series['numB']))

    #print the values for specific labels using a label based lookup with .loc
    print('\n[*] The values of rand1 and rand5 are: \n{} '.format(rand_series.loc[['rand1', 'rand5']]))

    #for position based lookup use iloc; position 5 is the last entry, numF
    print('\n[*] The value of numF is:{}\n '.format(list_series.iloc[5]))

    #The .ix accessor tried a label based lookup first, then fell back to position
    #However, .ix is deprecated (and removed in newer Pandas releases), so .loc is used here instead
    print('\n[*] The values of rand2 and rand3 are:\n{} '.format(rand_series.loc[['rand2', 'rand3']]))
    


if __name__ == '__main__':
    main()


References
NumPy Reference
pandas: powerful Python data analysis toolkit
Pandas Series

Posts in this series:
Beginning Numpy
Beginning Pandas
Pandas String Operations, etc.