Vector Data Processing using Python Tools

GeoPandas Advanced Topics

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What additional capabilities does GeoPandas provide, including data access, plotting and analysis?

  • How does it integrate with other common Python tools?

  • How do GeoPandas data objects integrate with analyses of raster data over vector geospatial features?

Objectives
  • Explore additional GeoPandas capabilities in reading from PostGIS and using its plot method.

  • Learn how to dissolve (aggregate) polygons into larger units, and apply spatial joins across GeoDataFrames, as examples of GeoPandas spatial operators.

  • Learn how to explore project (CRS) information and reproject.

  • Plot choropleth maps

  • Using folium to create interactive maps from a GeoDataFrame

  • Explore simple but powerful capabilities offered by the rasterstats package to generate summaries and statistics of raster properties over vector features, and explore these via GeoPandas.

View episode content in a Jupyter Notebook

Open the Jupyter Notebook in nbviewer

GeoPandas: Advanced topics

1. Introduction

We covered the basics of GeoPandas in the previous episode and notebook. Here, we’ll extend that introduction to illustrate additional aspects of GeoPandas and its interactions with other Python libraries, covering fancier mapping, analysis (unitary and binary spatial operators), raster zonal stats and geopandas.

Here are the main sections in this episode / notebook:

2. Set up packages, data file path, PostgreSQL connectivity

We’ll use these throughout the rest of the tutorial.

%matplotlib inline

import os
import json
import psycopg2

import matplotlib.pyplot as plt
# The two statemens below are used mainly to set up a plotting
# default style that's better than the default from matplotlib
import seaborn as sns
plt.style.use('bmh')

from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame

data_pth = "../data"
with open(os.path.join(data_pth, ".db.json")) as f:
    db_conn_dict = json.load(f)

3. Read HydroBASINS North America / Western Washington from PostGIS

Read HydroBASINS “all-levels” (lev00) hierarchical watersheds dataset for North America from Amazon Cloud PostgreSQL/PostGIS database. Watersheds in the dataset are at the finest (highest resolution) “Pfastetter” hierarchical level, level 12.

from_postgis is called as before, except now we’ll apply a SQL filter to the PostGIS dataset to select only the Pfastetter level-4 watershed with code 7831: WHERE pfaf_4 = 7831. This is most of Western Washington. Watershed polygons will still be read at their original level 12 resolution.

conn = psycopg2.connect(**db_conn_dict)
hydrobas_ww = GeoDataFrame.from_postgis(
    "SELECT * FROM hybas_na_lev00_v1c WHERE pfaf_4 = 7831", 
    conn, geom_col='geometry', crs={'init': u'epsg:4326'}, 
    coerce_float=False)
conn.close()
len(hydrobas_ww)
413

413 polygon features returned. Let’s examine the attributes for the first feature.

hydrobas_ww.iloc[0]
gid                                                      20038
hybas_id                                           7.00002e+09
next_down                                                    0
next_sink                                          7.00002e+09
main_bas                                           7.00002e+09
dist_sink                                                    0
dist_main                                                    0
sub_area                                                  25.4
up_area                                                   25.4
endo                                                         0
coast                                                        1
order                                                        0
sort                                                     20038
pfaf_1                                                       7
pfaf_2                                                      78
pfaf_3                                                     783
pfaf_4                                                    7831
pfaf_5                                                   78310
pfaf_6                                                  783103
pfaf_7                                                 7831033
pfaf_8                                                78310330
pfaf_9                                               783103300
pfaf_10                                            7.83103e+09
pfaf_11                                            7.83103e+10
pfaf_12                                            7.83103e+11
geometry     (POLYGON ((-124.3416666666666 47.4583333333333...
Name: 0, dtype: object

Plot a categorical map, with coloring based on the aggregating column pfaf_7. Watershed boundaries are at the high-resolution Pfastetter level 12.

hydrobas_ww.plot(column='pfaf_7', cmap='Paired', categorical=True, figsize=(14,6));

png

4. Dissolve into larger watersheds, and reproject

Dissolve source polygons into larger watersheds based on attribute values

Apply GeoDataFrame dissolve aggregation method (implemented from lower level shapely operators) on level-7 Pfastetter codes (pfaf_7) shown in the plot above. Aggregate attributes, retaining only pfaf_7 and pfaf_6 (plus geometry, of course). Note that this operation results in only 17 polygons, from the original 413.

cols = ['pfaf_6', 'pfaf_7', 'geometry']
hydrobas_ww_p7 = hydrobas_ww[cols].dissolve(by='pfaf_7', aggfunc='first', as_index=False)
len(hydrobas_ww_p7)
17
hydrobas_ww_p7.crs

The CRS was not retained during the dissolve operation. Apply it by copying it from hydrobas_ww.

hydrobas_ww_p7.crs = hydrobas_ww.crs
hydrobas_ww_p7.crs
{'init': u'epsg:4326'}

Let’s examine some of the features.

hydrobas_ww_p7.head()


pfaf_7 geometry pfaf_6
0 7831010 (POLYGON ((-123.5722222222222 46.2458333333333... 783101
1 7831020 POLYGON ((-123.1791666666666 46.33333333333336... 783102
2 7831031 (POLYGON ((-123.9597222222222 46.9666666666667... 783103
3 7831032 POLYGON ((-123.8583333333333 47.39583333333336... 783103
4 7831033 POLYGON ((-124.3 47.34583333333336, -124.30221... 783103


Plot the results. Looks like the previous plot, except the polygon boundaries are now the pfaf_7 watersheds (same as the background colors).

hydrobas_ww_p7.plot(column='pfaf_7', cmap='Paired', categorical=True, figsize=(14,6));

png

Beware of “invalid” geometries!

Beware that dissolve may fail if there are “invalid” geometries. See this code from the previous, intro notebook. The 6 geometries/points reported are invalid (and are reported by the is_valid() method). This dissolve statement does work, though. ```python seas_grp = seas[[‘oceans’, ‘geometry’]] seas_oceans_diss = seas_grp[seas_grp.geometry.is_valid].dissolve(by=’oceans’)

Ring Self-intersection at or near point 10.407218181818182 54.821390909090908 Self-intersection at or near point -79.365827272727287 76.296645454545455 Ring Self-intersection at or near point 10.979445510225332 54.380555030408686 Ring Self-intersection at or near point 133.61550925464189 -4.3005540903175188 Ring Self-intersection at or near point 121.91067196634913 -5.0593090510592447 Ring Self-intersection at or near point 115.29553592754269 -7.0082630551828515 ```

Reproject (transform) to WA State Plane South, epsg:2927

Partly so we can calculate polygon areas in linear units, not geodetic degrees. But also because that’s the projection used by most state and local governments in Washington.

No need to go to a web site to learn more about what epsg:2927 is. Use pyepsg, which issues queries to http://epsg.io web services.

import pyepsg

Extract the epsg code from the string returned by crs['init'].

hydrobas_ww_p7_epsg_str = hydrobas_ww_p7.crs['init'].split(':')[1]
pyepsg.get(hydrobas_ww_p7_epsg_str)
<GeodeticCRS: 4326, WGS 84>

Query epsg 2927.

pyepsg.get('2927')
<ProjectedCRS: 2927, NAD83(HARN) / Washington South (ftUS)>

Apply the crs transformation (reprojection) using to_crs method.

hydrobas_ww_p7_wasp = hydrobas_ww_p7.to_crs(epsg=2927)

Plot the reprojected map.

Note that, being in a planar project (not geodetic), the shape looks different compared to the previous map. More “normal”. And the axes are now in feet relative to some origin.

hydrobas_ww_p7_wasp.plot(column='pfaf_7', cmap='Paired', categorical=True, figsize=(14,6));

png

5. Plot choropleth map based on calculated watershed areas

As the projection is in feet, auto-calculated polygon areas will be in feet2. So let’s convert to miles2 first (why not!). We’ll add a new column to the GeoDataFrame.

hydrobas_ww_p7_wasp['area_mi2'] = hydrobas_ww_p7_wasp.geometry.area / 27878400
hydrobas_ww_p7_wasp.head(3)


pfaf_7 geometry pfaf_6 area_mi2
0 7831010 (POLYGON ((863343.6925921033 347890.2242189167... 783101 1375.137396
1 7831020 POLYGON ((963803.8027083561 376154.9965688475,... 783102 2107.945774
2 7831031 (POLYGON ((776917.1877152125 614568.383332147,... 783103 528.472846


Pandas groupby aggregation

Now you could get the area of a pfaf_6 watershed via simple Pandas DataFrame groupby aggregation (sum).

Plot the choloropleth, using area_mi2.

“fisher_jenks” value segmentation scheme (using 7 segments, k=7) used is one of the available pysal.esda.mapclassify.Map_Classifier classifiers from the powerful PySAL package (Python Spatial Analysis Library); GeoPandas can use these classifiers if PySAL is installed. To get the list of classifiers, use: python import pysal print(pysal.esda.mapclassify.Map_Classifier.__doc__)

f, ax = plt.subplots(1, figsize=(6, 4))
hydrobas_ww_p7_wasp.plot(column='area_mi2', scheme='fisher_jenks', k=7, 
                         alpha=0.9, cmap=plt.cm.Blues, legend=True, ax=ax)
plt.axis('equal')
ax.set_title('Watersheds by area ($mi^2$)');

png

6. Choropleth map as an interactive map with folium

Folium is very cool, specially for use in Jupyter notebooks; or to export into stand-alone HTML.

import folium

Folium likes GeoJSON for geometries. GeoPandas makes conversion easy.

geo_str = hydrobas_ww_p7.to_json()

m.choropleth will do an attribute join between geo_str and hydrobas_ww_p7_wasp based on the key pfaf_7.

m = folium.Map(location=[47.8, -122.5], zoom_start=7, 
               tiles="cartodbpositron")

m.choropleth(geo_str=geo_str,
             data=hydrobas_ww_p7_wasp, columns=['pfaf_7', 'area_mi2'],
             key_on='feature.properties.pfaf_7',
             fill_color='YlGn', fill_opacity=0.4,
             legend_name='area')
m

png

This map is interactive, so play with it (zoom and pan) in the notebook. There is a lot more to explore in Folium! This is just a teaser.

7. Spatial join, sjoin, of polygons on points

We’ll use an old, local snapshot of NANOOS coastal and marine monitoring stations in the Pacific NW, from the NANOOS Visualization System (NVS) Data Explorer. While many stations are moorings on marine waters, some are inshore or in tidal shores and will overlap the watershed boundaries.

The point file is in the GeoPackage format, an OGC format implemented in SQLite.

nanoosstations_gdf = gpd.read_file(os.path.join(data_pth, "nanoos_nvs.gpkg"))
len(nanoosstations_gdf)
194
nanoosstations_gdf.iloc[-1]
data_source                                                  Decagon
data_source_url                               http://www.decagon.com
geometry                                   POINT (-122.6966 47.3612)
lat                                                          47.3612
lon                                                         -122.697
name                                     Henderson Bay site, W shore
nvs_station_url    http://nvs.nanoos.org/Explorer?action=oiw:fixe...
platform_label                                     WADOH_HendrsnBay1
platform_type                                   Fixed Shore Platform
provider                                                       WADOH
provider_type                                                  State
provider_url                                   http://www.doh.wa.gov
region                                                   Puget Sound
short_name                                       WADOH Henderson Bay
state                                                     Washington
status                                                        online
status_date                                     2016-06-24T18:00:00Z
url                http://www.doh.wa.gov/CommunityandEnvironment/...
Name: 193, dtype: object

Points in the coasts of the Pacific NW (BC, WA, OR) and out in the open ocean.

nanoosstations_gdf.plot(markersize=4);

png

Apply “inner” spatial join with sjoin operator from the gpd.tools package. An inner join will retain only overlapping features. Then plot as a map overlay on top of hydrobas_ww_p7, categorizing (coloring) points by the pfaf_6 watershed they’re in.

nanoossta_hydrobas_ww_gdf = gpd.tools.sjoin(nanoosstations_gdf, hydrobas_ww_p7, how="inner")
len(nanoossta_hydrobas_ww_gdf)
``

    63


```python
f, ax = plt.subplots(1, figsize=(6, 6))
plt.axis('equal')
hydrobas_ww_p7.plot(ax=ax)
nanoossta_hydrobas_ww_gdf.plot(column='pfaf_6', markersize=6, 
                               categorical=True, legend=True, ax=ax);

png

8. rasterstats: “zonal” statistics from polygons on rasters

We’ll end by mixing features from a GeoDataFrame with a raster, applying zonal statistics using the cool and light weight rasterstats package.

Monthly Juy long-term climatology precipitation. The original monthly time series data are from the PRISM Climate Group; the monthly climatology and Pacific NW clip were created by your truly and Don Setiawan for the BiGCZ project.

ppt_july_tif_pth = os.path.join(data_pth, 'prism_precipitation_july_climatology.tif')

rasterio

Rasterstas uses rasterio to read rasters (and fiona to read vector datasets), so we’ll first do a quick exploration of rasterio.

import rasterio
import rasterio.plot as rioplot
ppt_july = rasterio.open(ppt_july_tif_pth)
ppt_july
<open DatasetReader name='../data/prism_precipitation_july_climatology.tif' mode='r'>

Examine the metadata read from the raster file (we can confirm CRS is epsg:4326), then plot the raster.

ppt_july.meta
{'count': 1,
 'crs': CRS({'init': u'epsg:4326'}),
 'driver': u'GTiff',
 'dtype': 'float32',
 'height': 177,
 'nodata': -9999.0,
 'transform': Affine(0.04, 0.0, -125.12921197411002,
       0.0, -0.04, 49.045740291264174),
 'width': 353}

Plotting issue

with_bounds=True is supposed to display the X & Y coordinate labels (lat & lon) from the metadata. We don’t why this isn’t working.

rioplot.show(ppt_july, with_bounds=True, cmap=plt.cm.Blues);

png

Apply rasterstas zonal_stats

Apply zonal_stats from rasterstats package. Can pass a GeoDataFrame directly (instead of the file path to a GIS file) because it implements our old friend, the __geo_interface__ method. For the raster, we pass its file path.

zonal_stats returns a geojson with the original properties plus the zonal statistics.

import rasterstats as rs
zonal_ppt_gjson = rs.zonal_stats(hydrobas_ww_p7, ppt_july_tif_pth,
                                 prefix='pptjuly_',
                                 geojson_out=True)
len(zonal_ppt_gjson)
17
zonal_ppt_gdf = GeoDataFrame.from_features(zonal_ppt_gjson)
zonal_ppt_gdf.head(2)


geometry pfaf_6 pfaf_7 pptjuly_count pptjuly_max pptjuly_mean pptjuly_min
0 (POLYGON ((-123.5722222222222 46.2458333333333... 783101 7831010 259 49.824776 33.674382 23.438837
1 POLYGON ((-123.1791666666666 46.33333333333336... 783102 7831020 404 96.738289 29.443886 15.356521


And finally, a choropleth map of July precipitation by watershed!

f, ax = plt.subplots(1, figsize=(6, 5))
zonal_ppt_gdf.plot(column='pptjuly_mean', scheme='Equal_Interval', k=5, 
                   alpha=1, cmap=plt.cm.Blues, legend=True, ax=ax)
plt.axis('equal')
ax.set_title('Mean July precipitation ($mm/month$) by watershed');

png

Key Points