Vector Data Processing using Python Tools

GeoPandas Advanced Topics

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What additional capabilities does GeoPandas provide, including data access, plotting and analysis?

  • How does it integrate with other common Python tools?

  • How do GeoPandas data objects integrate with analyses of raster data over vector geospatial features?

Objectives
  • Explore additional GeoPandas capabilities in reading from PostGIS and using its plot method.

  • Learn how to dissolve (aggregate) polygons into larger units, and apply spatial joins across GeoDataFrames, as examples of GeoPandas spatial operators.

  • Learn how to explore project (CRS) information and reproject.

  • Plot choropleth maps

  • Using folium to create interactive maps from a GeoDataFrame

  • Explore simple but powerful capabilities offered by the rasterstats package to generate summaries and statistics of raster properties over vector features, and explore these via GeoPandas.

View episode content in a Jupyter Notebook

Open the Jupyter Notebook in nbviewer

GeoPandas: Advanced topics

1. Introduction

We covered the basics of GeoPandas in the previous episode and notebook. Here, we’ll extend that introduction to illustrate additional aspects of GeoPandas and its interactions with other Python libraries, covering fancier mapping, analysis (unitary and binary spatial operators), raster zonal stats + GeoPandas.

Here are the main sections in this episode / notebook:

2. Set up packages and data file path

We’ll use these throughout the rest of the tutorial. Note that we’re not using the latest and greatest Matplotlib, 2.x. Some package dependency is forcing the use of Matplotlib 1.5.

%matplotlib inline

from __future__ import (absolute_import, division, print_function)
import os
import json
import psycopg2

import matplotlib as mpl
import matplotlib.pyplot as plt
# The two statemens below are used mainly to set up a plotting
# default style that's better than the default from Matplotlib 1.x
# Matplotlib 2.0 supposedly has better default styles.
import seaborn as sns
plt.style.use('bmh')

from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame

data_pth = "../data"
mpl.__version__, pd.__version__, gpd.__version__
('1.5.3', '0.20.3', '0.3.0')
with open(os.path.join(data_pth, "db.json")) as f:
    db_conn_dict = json.load(f)

3. Read HydroBASINS North America / Western Washington

Read HydroBASINS “all-levels” (lev00) hierarchical watersheds dataset for the Pacific NW, from Amazon Cloud PostgreSQL/PostGIS database. Watersheds in the dataset are at the finest (highest resolution) “Pfastetter” hierarchical level, level 12.

read_postgis is called as before, except now we’ll apply a SQL filter (server side) to the PostGIS dataset to select only the Pfastetter level-4 watershed with code 7831: WHERE pfaf_4 = 7831. This is most of Western Washington. Watershed polygons will still be read at their original level 12 resolution.

conn = psycopg2.connect(**db_conn_dict)
hydrobas_ww = gpd.read_postgis(
    "SELECT * FROM hybas_na_lev00_v1c WHERE pfaf_4 = 7831", 
    conn, geom_col='polygongeom', crs={'init': u'epsg:4326'}, 
    coerce_float=False)
conn.close()
len(hydrobas_ww)
413

413 polygon features returned. Let’s examine the attributes for the first feature.

hydrobas_ww.iloc[0]
gid                                                        19945
hybas_id                                             7.00001e+09
next_down                                                      0
next_sink                                            7.00001e+09
main_bas                                             7.00001e+09
dist_sink                                                      0
dist_main                                                      0
sub_area                                                   135.4
up_area                                                    135.4
endo                                                           0
coast                                                          0
order                                                          1
sort                                                       19945
pfaf_1                                                         7
pfaf_2                                                        78
pfaf_3                                                       783
pfaf_4                                                      7831
pfaf_5                                                     78310
pfaf_6                                                    783101
pfaf_7                                                   7831010
pfaf_8                                                  78310101
pfaf_9                                                 783101010
pfaf_10                                              7.83101e+09
pfaf_11                                              7.83101e+10
pfaf_12                                              7.83101e+11
polygongeom    (POLYGON ((-123.35 46.36666666666669, -123.349...
Name: 0, dtype: object

Plot a categorical map with coloring based on the aggregating column pfaf_7. Watershed boundaries are at the high-resolution Pfastetter level 12.

hydrobas_ww.plot(column='pfaf_7', cmap='Paired', categorical=True, figsize=(14, 8));

png

4. Dissolve into larger watersheds, and reproject

Dissolve source polygons into larger watersheds based on attribute values

Apply GeoDataFrame dissolve aggregation method (implemented from lower level shapely operators) on level-7 Pfastetter codes (pfaf_7) shown in the plot above. Aggregate attributes, retaining only pfaf_7 and pfaf_6 (plus polygongeom, of course). This operation results in only 17 polygons, from the original 413.

cols = ['pfaf_6', 'pfaf_7', 'polygongeom']
hydrobas_ww_p7 = hydrobas_ww[cols].dissolve(by='pfaf_7', aggfunc='first', as_index=False)
len(hydrobas_ww_p7)
17

Let’s examine some of the features.

hydrobas_ww_p7.head()
pfaf_7 polygongeom pfaf_6
0 7831010 (POLYGON ((-123.5722222222222 46.2458333333333... 783101
1 7831020 POLYGON ((-123.1791666666666 46.33333333333336... 783102
2 7831031 (POLYGON ((-123.9597222222222 46.9666666666667... 783103
3 7831032 POLYGON ((-123.8583333333333 47.39583333333336... 783103
4 7831033 POLYGON ((-124.3 47.34583333333336, -124.30221... 783103

Plot the results. Looks like the previous plot, except the polygon boundaries are now the pfaf_7 watersheds (same as the background colors).

hydrobas_ww_p7.plot(column='pfaf_7', cmap='Paired', categorical=True, figsize=(14, 8));

png

Beware of “invalid” geometries!

Beware that dissolve may fail if there are “invalid” geometries. This code is based on a GeoDataFrame examined in the previous, intro notebook. The 6 geometries/points reported are invalid (and are reported by the is_valid() method). This dissolve statement does work, though.

seas_grp = seas[['oceans', 'geometry']]
seas_oceans_diss = seas_grp[seas_grp.geometry.is_valid].dissolve(by='oceans')

Ring Self-intersection at or near point 10.407218181818182 54.821390909090908
Self-intersection at or near point -79.365827272727287 76.296645454545455
Ring Self-intersection at or near point 10.979445510225332 54.380555030408686
Ring Self-intersection at or near point 133.61550925464189 -4.3005540903175188
Ring Self-intersection at or near point 121.91067196634913 -5.0593090510592447
Ring Self-intersection at or near point 115.29553592754269 -7.0082630551828515

Reproject (transform) to WA State Plane South, epsg:2927

Partly so we can calculate polygon areas in linear units, not geodetic degrees. But also because that’s the projection used by most state and local governments in Washington.

No need to go to a web site to learn more about what epsg:2927 is. Use pyepsg, which issues queries to http://epsg.io web services.

import pyepsg

Extract the epsg code from the string returned by crs['init'], then query epsg 2927.

hydrobas_ww_p7_epsg_str = hydrobas_ww_p7.crs['init'].split(':')[1]
pyepsg.get(hydrobas_ww_p7_epsg_str)
<GeodeticCRS: 4326, WGS 84>
pyepsg.get('2927')
<ProjectedCRS: 2927, NAD83(HARN) / Washington South (ftUS)>

Apply the crs transformation (reprojection) using to_crs method.

hydrobas_ww_p7_wasp = hydrobas_ww_p7.to_crs(epsg=2927)

Plot the reprojected map. Note that, being in a planar project (not geodetic), the shape looks different compared to the previous map. More “normal”. And the axes are now in feet relative to some origin.

hydrobas_ww_p7_wasp.plot(column='pfaf_7', cmap='Paired', categorical=True, figsize=(14, 8));

png

5. Plot choropleth map based on calculated watershed areas

As the projection is in feet, auto-calculated polygon areas will be in feet2. So let’s convert to miles2 first (why not!). We’ll add a new column to the GeoDataFrame.

hydrobas_ww_p7_wasp['area_mi2'] = hydrobas_ww_p7_wasp.geometry.area / 27878400
hydrobas_ww_p7_wasp.head(3)
pfaf_7 polygongeom pfaf_6 area_mi2
0 7831010 (POLYGON ((863343.6925921033 347890.2242189167... 783101 1375.137396
1 7831020 POLYGON ((963803.8027083561 376154.9965688475,... 783102 2107.945774
2 7831031 (POLYGON ((776917.1877152125 614568.383332147,... 783103 528.472846

Pandas groupby aggregation

Now you could get the area of a pfaf_6 watershed via simple Pandas DataFrame groupby aggregation (sum).

Plot the choloropleth, using area_mi2.

The “fisher_jenks” value segmentation scheme (using 7 segments, k=7) used is one of the available pysal.esda.mapclassify.Map_Classifier classifiers from the powerful PySAL package (Python Spatial Analysis Library); GeoPandas can use these classifiers if PySAL is installed, as it is here. To get the list of classifiers, use:

import pysal
print(pysal.esda.mapclassify.Map_Classifier.__doc__)
f, ax = plt.subplots(1, figsize=(10, 6))
ax.set_title('Watersheds by area ($mi^2$)')
hydrobas_ww_p7_wasp.plot(column='area_mi2', scheme='fisher_jenks', k=7, 
                         cmap=plt.cm.Blues, legend=True, ax=ax)
ax.set_axis_off()
plt.axis('equal');

png

6. Choropleth map as an interactive map with folium

Folium is very cool, specially for use in Jupyter notebooks; or to export into stand-alone HTML.

import folium
folium.__version__
'0.4.0'

m.choropleth internally splits the geometry from the other attributes in hydrobas_ww_p7_wasp, and rejoins them based on the key pfaf_7. key_on uses an attribute reference derived from GeoJSON representations; this is awkward, and hopefully will be simplified in future folium implementations.

m = folium.Map(location=[47.8, -122.5], zoom_start=7, tiles="cartodbpositron")

m.choropleth(
    geo_data=hydrobas_ww_p7_wasp,
    data=hydrobas_ww_p7_wasp,
    columns=['pfaf_7', 'area_mi2'],
    key_on='feature.properties.pfaf_7',
    legend_name='Area (sq mi)', 
    fill_color='YlGn',
    fill_opacity=0.4,
    highlight=True)

m

png

This map is interactive, so play with it (zoom and pan). There is a lot more to explore in Folium! This is just a teaser.

7. Spatial join, sjoin, of polygons on points

We’ll use an old, local snapshot of NANOOS coastal and marine monitoring stations in the Pacific NW, from the NANOOS Visualization System (NVS) Data Explorer. While many stations are moorings on marine waters, some are inshore or in tidal shores and will overlap the watershed boundaries. The point file is in the GeoPackage format, an OGC format implemented in SQLite.

nanoosstations_gdf = gpd.read_file(os.path.join(data_pth, "nanoos_nvs.gpkg"))
len(nanoosstations_gdf)
194
nanoosstations_gdf.iloc[-1]
platform_label                                     WADOH_HendrsnBay1
platform_type                                   Fixed Shore Platform
name                                     Henderson Bay site, W shore
short_name                                       WADOH Henderson Bay
lat                                                          47.3612
lon                                                         -122.697
state                                                     Washington
region                                                   Puget Sound
provider                                                       WADOH
provider_url                                   http://www.doh.wa.gov
provider_type                                                  State
data_source                                                  Decagon
data_source_url                               http://www.decagon.com
url                http://www.doh.wa.gov/CommunityandEnvironment/...
status                                                        online
status_date                                     2016-06-24T18:00:00Z
nvs_station_url    http://nvs.nanoos.org/Explorer?action=oiw:fixe...
geometry                                   POINT (-122.6966 47.3612)
Name: 193, dtype: object

Points are on the coasts of the Pacific NW (BC, WA, OR) and out in the open ocean.

nanoosstations_gdf.plot(markersize=15);

png

Apply “inner” spatial join with sjoin operator from the gpd.tools package. An inner join will retain only overlapping features. Then plot as a map overlay on top of hydrobas_ww_p7, categorizing (coloring) each point by the pfaf_6 watershed it’s in.

nanoossta_hydrobas_ww_gdf = gpd.tools.sjoin(nanoosstations_gdf, hydrobas_ww_p7, how="inner")
len(nanoossta_hydrobas_ww_gdf)
63
f, ax = plt.subplots(1, figsize=(10, 6))
ax.set_axis_off()
plt.axis('equal')
hydrobas_ww_p7.plot(ax=ax, cmap='Greys_r', linewidth=1)
nanoossta_hydrobas_ww_gdf.plot(column='pfaf_6', markersize=30, 
                               categorical=True, legend=True, ax=ax);

png

8. rasterstats: “zonal” statistics from polygons on rasters

We’ll end by mixing features from a GeoDataFrame with a raster, applying zonal statistics using the cool and light weight rasterstats package.

Monthly Juy long-term climatology precipitation. The original monthly time series data are from the PRISM Climate Group; the monthly climatology and Pacific NW clip were created by your truly and Don Setiawan for the BiGCZ project.

ppt_july_tif_pth = os.path.join(data_pth, 'prism_precipitation_july_climatology.tif')

rasterio

Rasterstas uses rasterio to read rasters (and fiona to read vector datasets), so we’ll first do a quick exploration of rasterio.

import rasterio
import rasterio.plot as rioplot
rasterio.__version__
'0.36.0'
ppt_july = rasterio.open(ppt_july_tif_pth)
ppt_july
<open RasterReader name='../data/prism_precipitation_july_climatology.tif' mode='r'>

Examine the metadata read from the raster file (we can confirm CRS is epsg:4326), then plot the raster.

ppt_july.meta
    {"affine": Affine(0.04, 0.0, -125.12921197411002,
           0.0, -0.04, 49.045740291264174),
     "count": 1,
     "crs": CRS({"init": "epsg:4326"}),
     "driver": "GTiff",
     "dtype": "float32",
     "height": 177,
     "nodata": -9999.0,
     "transform": (-125.12921197411002, 0.04, 0.0, 49.045740291264174, 0.0, -0.04),
     "width": 353}

Plotting issue

with_bounds=True is supposed to display the X & Y coordinate labels (lat & lon) from the metadata. We don’t know why this isn’t working.

rioplot.show(ppt_july, with_bounds=True, cmap=plt.cm.Blues);

png

Apply rasterstas zonal_stats

Apply zonal_stats from rasterstats package. Can pass a GeoDataFrame directly (instead of the file path to a GIS file) because it implements our old friend, the __geo_interface__ method. For the raster, we pass its file path.

zonal_stats returns a geojson with the original properties plus the zonal statistics.

import rasterstats as rs
rs.__version__
'0.12.0'

Unfortunately, rasterstats requires the geometry column to be named “geometry”. So, we’ll rename it first.

hydrobas_ww_p7 = hydrobas_ww_p7.rename(columns={'polygongeom': 'geometry'})
hydrobas_ww_p7._geometry_column_name = 'geometry'
zonal_ppt_gjson = rs.zonal_stats(hydrobas_ww_p7, ppt_july_tif_pth, prefix='pptjuly_',
                                 geojson_out=True)
type(zonal_ppt_gjson), len(zonal_ppt_gjson)
(list, 17)
zonal_ppt_gdf = GeoDataFrame.from_features(zonal_ppt_gjson)
zonal_ppt_gdf.head(2)
geometry pfaf_6 pfaf_7 pptjuly_count pptjuly_max pptjuly_mean pptjuly_min
0 (POLYGON ((-123.5722222222222 46.2458333333333... 783101 7831010 259 49.824776 33.674382 23.438837
1 POLYGON ((-123.1791666666666 46.33333333333336... 783102 7831020 404 96.738289 29.443886 15.356521

And finally, a choropleth map of July precipitation by watershed!

f, ax = plt.subplots(1, figsize=(8, 6))
ax.set_title('Mean July precipitation ($mm/month$) by watershed')
zonal_ppt_gdf.plot(column='pptjuly_mean', scheme='Equal_Interval', k=5, 
                   cmap=plt.cm.Blues, legend=True, linewidth=1, ax=ax)
plt.axis('equal');

png

Key Points