Arrive at the HackWeek with a basic overview of the Pandas library for tabular data manipulation.
Data - an introduction to the world of Pandas
Note: This is an edited version of Cliburn Chan’s original tutorial, as part of his Stat-663 course at Duke. All changes remain licensed as the original, under the terms of the MIT license.
pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labeled data both. It is a fundamental high-level building block for doing practical, real world data analysis in Python.
pandas is well suited for:
Tabular data with heterogeneously-typed columns, as you might find in an SQL table or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series data.
Arbitrary matrix data with row and column labels
Virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.
Key features
Easy handling of missing data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes
Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.
A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. It is directly inspired by the R DataFrame.
try:apple=pd.read_csv('data/apple.csv',index_col=0,parse_dates=True)except:apple=web.DataReader('AAPL','yahoo',start=datetime.datetime(2015,1,1),end=datetime.datetime(2015,12,31))# Let's save this data to a CSV file so we don't need to re-download it on every run:apple.to_csv('data/apple.csv')
apple.head()
Open
High
Low
Close
Adj Close
Volume
Date
2015-01-02
111.389999
111.440002
107.349998
109.330002
103.866470
53204600
2015-01-05
108.290001
108.650002
105.410004
106.250000
100.940392
64285500
2015-01-06
106.540001
107.430000
104.629997
106.260002
100.949890
65797100
2015-01-07
107.199997
108.199997
106.699997
107.750000
102.365440
40105900
2015-01-08
109.230003
112.150002
108.699997
111.889999
106.298531
59364500
apple.tail()
Open
High
Low
Close
Adj Close
Volume
Date
2015-12-24
109.000000
109.000000
107.949997
108.029999
104.380112
13570400
2015-12-28
107.589996
107.690002
106.180000
106.820000
103.210999
26704200
2015-12-29
106.959999
109.430000
106.860001
108.739998
105.066116
30931200
2015-12-30
108.580002
108.699997
107.180000
107.320000
103.694107
25213800
2015-12-31
107.010002
107.029999
104.820000
105.260002
101.703697
40635300
Let’s save this data to a CSV file so we don’t need to re-download it on every run:
/Users/fperez/usr/conda/envs/s159/lib/python3.6/site-packages/pandas/plotting/_core.py:1714: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
series.name = label
Data conversions
One of the nicest features of pandas is the ease of converting tabular data across different storage formats. We will illustrate by converting the titanic dataframe into multiple formats.
/Users/fperez/usr/conda/envs/s159/lib/python3.6/site-packages/pandas/core/generic.py:1534: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
chunksize=chunksize, dtype=dtype)
t3=pd.read_sql('select * from titanic',con)t3.head(2)
survived
sex
age
fare
embarked
class
who
deck
embark_town
new column
0
0
male
22.0
7.2500
S
Third
man
None
Southampton
0
1
1
female
38.0
71.2833
C
First
woman
C
Cherbourg
0
JSON
t3.to_json('titanic.json')
t4=pd.read_json('titanic.json')t4.head(2)
age
class
deck
embark_town
embarked
fare
new column
sex
survived
who
0
22.0
Third
None
Southampton
S
7.2500
0
male
0
man
1
38.0
First
C
Cherbourg
C
71.2833
0
female
1
woman
t4=t4[t3.columns]t4.head(2)
survived
sex
age
fare
embarked
class
who
deck
embark_town
new column
0
0
male
22.0
7.2500
S
Third
man
None
Southampton
0
1
1
female
38.0
71.2833
C
First
woman
C
Cherbourg
0
HDF5
The HDF5 format was designed in the Earth Sciences community but it can be an excellent general purpose tool. It’s efficient and type-safe, so you can store complex dataframes in it and recover them back without information loss, using the to_hdf method:
t4.to_hdf('titanic.h5','titanic')
/Users/fperez/usr/conda/envs/s159/lib/python3.6/site-packages/pandas/core/generic.py:1471: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['sex', 'embarked', 'class', 'who', 'deck', 'embark_town']]
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
t5=pd.read_hdf('titanic.h5','titanic')t5.head(2)
survived
sex
age
fare
embarked
class
who
deck
embark_town
new column
0
0
male
22.0
7.2500
S
Third
man
None
Southampton
0
1
1
female
38.0
71.2833
C
First
woman
C
Cherbourg
0
Feather
You may need to install the Feather support first:
conda install -c conda-forge feather-format
t6=t5.reset_index(drop=True)t6.head()
survived
sex
age
fare
embarked
class
who
deck
embark_town
new column
0
0
male
22.0
7.2500
S
Third
man
None
Southampton
0
1
1
female
38.0
71.2833
C
First
woman
C
Cherbourg
0
2
1
female
4.0
16.7000
S
Third
child
G
Southampton
0
3
0
female
28.0
7.8958
S
Third
woman
None
Southampton
0
4
0
male
NaN
7.8958
S
Third
man
None
Southampton
0
t6.to_feather('titanic.feather')
t7=pd.read_feather('titanic.feather')t7.head()
survived
sex
age
fare
embarked
class
who
deck
embark_town
new column
0
0
male
22.0
7.2500
S
Third
man
None
Southampton
0
1
1
female
38.0
71.2833
C
First
woman
C
Cherbourg
0
2
1
female
4.0
16.7000
S
Third
child
G
Southampton
0
3
0
female
28.0
7.8958
S
Third
woman
None
Southampton
0
4
0
male
NaN
7.8958
S
Third
man
None
Southampton
0
Key Points
With Pandas, Python provides an excellent environment for the analysis of tabular data.