* Pandas works well with text-based data and is extremely powerful.
* Pandas allows us to look at data from both a macro and micro perspective.
* One of its main features is the DataFrame. DataFrames can be considered similar to Excel spreadsheets.
* It also provides the Series, which is array-like.
* A Pandas index does not have to start at 0.
* A Pandas index does not have to be ordered.
* A Pandas index does not have to be numeric. It can be a list of strings.
* Pandas indexes are very flexible.
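* A Series carries both an index and the values attached to it (available as ".index" and ".values").
* If an index is not explicitly defined, Pandas creates an incrementing integer index starting at 0.
* A Pandas Series can even be created from a Python dictionary, in which case the keys become the index.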
* Pandas supports both position-based and label-based lookup. The two should not be confused.
* For label-based lookup, use ".loc".
* For position-based lookup, use ".iloc".
* A list of index values can be passed to both ".loc" and ".iloc".
* Older Pandas versions also offered ".ix", which tried a label-based lookup first and fell back to position if that did not work. It was deprecated in Pandas 0.20 and removed in 1.0, so use ".loc" or ".iloc" instead.
* Pandas is built on NumPy.
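The difference between label-based and position-based lookup is easiest to see on an integer index that neither starts at 0 nor is sorted. The following is a minimal sketch rather than part of the script below; the Series `s` and its values are made up for illustration.

```python
import pandas as pd

# Illustrative Series: integer labels that are neither zero-based nor ordered.
s = pd.Series([10, 20, 30], index=[5, 2, 9])

by_label = s.loc[2]       # label-based: the row whose index label is 2
by_position = s.iloc[2]   # position-based: the third row, whatever its label

# Both accessors also accept a list of labels or positions.
labels_subset = s.loc[[9, 5]]
positions_subset = s.iloc[[0, 1]]

# With no explicit index, pandas supplies an incrementing one: 0, 1, 2, ...
default = pd.Series(["a", "b", "c"])

print(by_label, by_position)
print(list(default.index))
```

Here the same number, 2, selects different rows depending on the accessor, which is why the two kinds of lookup should not be mixed up.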
    #!/usr/bin/env python3

    import numpy as np
    import pandas as pd

    '''
    side_by_side function from Wes McKinney, author of Pandas.
    If using python3, see this link for an error you may get relating to the adjoin function:
    https://stackoverflow.com/questions/38156965/pandas-cannot-import-name-adjoin
    '''
    def side_by_side(*objs, **kwds):
        from pandas.io.formats.printing import adjoin
        space = kwds.get('space', 4)
        reprs = [repr(obj).split('\n') for obj in objs]
        print(adjoin(space, *reprs))


    def main():
        print('[*] You are running pandas version {}'.format(pd.__version__))
        print('[*] You are running numpy version {}'.format(np.__version__))

        # Create a pandas Series with 6 random integers from 1 to 99 (the upper bound of randint is exclusive)
        rand_series = pd.Series(np.random.randint(1, 100, 6),
                                index=['rand0', 'rand1', 'rand2', 'rand3', 'rand4', 'rand5'],
                                name='Rand Series')
        rand_series.index.name = 'Rand Value'
        print('\n[*] \n{}'.format(rand_series))

        # We can also create a Series from a Python list
        list_series = pd.Series([98, 80, -50, 70, -10, 15],
                                index=['numA', 'numB', 'numC', 'numD', 'numE', 'numF'],
                                name='list_series')
        list_series.index.name = 'List Value'
        print('\n[*] Current values in list_series \n{}'.format(list_series))

        # Create a Series from a Python dictionary. When used as below, all values became
        # Not a Number (NaN), because none of the index labels match the dictionary keys
        #dict_series = pd.Series({10:1, 8:2, 11:4, 20:3, 5:9, 6:20}, index=['dict0', 'dict1', 'dict2', 'dict3', 'dict4', 'dict5'])

        # Use this instead: the dictionary keys become the index
        dict_series = pd.Series({'dict0': 1, 'dict1': 2, 'dict2': 4, 'dict3': 3, 'dict4': 9, 'dict5': 20},
                                name='Dict Series')
        dict_series.index.name = 'Dict Value'
        print('\n[*] Current values in the dict_series \n{}'.format(dict_series))
        print()

        # Now call the side_by_side function defined above.
        # It prints its own output and returns None, so it is not passed to format()
        print('[*] The 3 series side-by-side')
        side_by_side(rand_series, list_series, dict_series)

        # Print the value of a specific index using its string name for a label-based lookup
        print('\n[*] The value of numB is:{}\n '.format(list_series['numB']))

        # Print the values at specific labels using a label-based lookup with .loc
        print('\n[*] The values of rand1 and rand5 are: \n{} '.format(rand_series.loc[['rand1', 'rand5']]))

        # For position-based lookup use .iloc; numE is the fifth entry, so its position is 4
        print('\n[*] The value of numE is:{}\n '.format(list_series.iloc[4]))

        # Older pandas offered .ix, which tried labels first and then positions.
        # It was removed in pandas 1.0, so .loc is used here for the label lookup
        print('\n[*] The values of rand2 and rand3 are:\n{} '.format(rand_series.loc[['rand2', 'rand3']]))


    if __name__ == '__main__':
        main()
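The commented-out dict_series line in the script produces NaN because an explicit index passed alongside a dictionary is used to select keys, not to rename them. A small sketch of that behaviour follows; the dictionary contents and the "missing" label are made up for illustration.

```python
import pandas as pd

# Illustrative dictionary; the keys are what pandas matches index labels against.
data = {"dict0": 1, "dict1": 2, "dict2": 4}

# With no index argument, the keys become the index labels.
from_keys = pd.Series(data)

# With an explicit index, each label is looked up in the dictionary.
# Labels that are not keys ("missing" here) get NaN.
partial = pd.Series(data, index=["dict2", "dict0", "missing"])

print(from_keys)
print(partial)
```

In the script's original attempt the keys were integers and the index labels were strings, so no label matched any key and every value came out as NaN.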