Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is GeoPandas?

  • What functionality and advantages does GeoPandas offer over other Python geospatial tools?

  • What geospatial storage, analytical and plotting capabilities does it include?

  • What is its relationship to Pandas?

Objectives
  • Learn to use GeoPandas by reading from common vector geospatial formats (shape files, GeoJSON, etc), PostGIS databases, and from geospatial data generated on the fly.

  • Learn to leverage Pandas functionality in GeoPandas, for effective, mixed attribute-based and geospatial analyses.

  • Learn about common geospatial operations, including reprojection, spatial joins, and overlays.

View episode content in a Jupyter Notebook

Open the Jupyter Notebook in nbviewer

GeoPandas: Pandas + geometry data type + custom geo goodness

1. Background

GeoPandas adds a spatial geometry data type to Pandas and enables spatial operations on these types, using shapely. GeoPandas leverages Pandas together with several core open source geospatial packages and practices to provide a uniquely simple and convenient framework for handling geospatial feature data, operating on both geometries and attributes jointly, and as with Pandas, largely eliminating the need to iterate over features (rows). Also as with Pandas, it adds a very convenient and fine-tuned plotting method, and read/write methods that handle multiple file and “serialization” formats.

Additional Notes

GeoPandas is still fairly young, but it builds on mature and stable and widely used packages (Pandas, shapely, etc). Expect continued growth, and possibly some kinks.

When should you use GeoPandas?

When it may not be the best tool?

2. Set up packages and data file path

We’ll use these throughout the rest of the tutorial. Note that we’re not using the latest and greatest Matplotlib, 2.x. Some package dependency is forcing the use of Matplotlib 1.5.

%matplotlib inline

from __future__ import (absolute_import, division, print_function)
import os

import matplotlib as mpl
import matplotlib.pyplot as plt
# The two statemens below are used mainly to set up a plotting
# default style that's better than the default from Matplotlib 1.x
# Matplotlib 2.0 supposedly has better default styles.
import seaborn as sns
plt.style.use('bmh')

from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame

data_pth = "../data"
mpl.__version__, pd.__version__, gpd.__version__
('1.5.3', '0.20.3', '0.3.0')

3. GeoSeries: The geometry building block

Like a Pandas Series, a GeoSeries is the building block for the more broadly useful and powerful GeoDataFrame that we’ll focus on in this tutorial. Here we’ll first take a bit of time to examine a GeoSeries.

A GeoSeries is made up of an index and a GeoPandas geometry data type. This data type is a shapely.geometry object, and therefore inherits their attributes and methods such as area, bounds, distance, etc.

GeoPandas has six classes of geometric objects, corresponding to the three basic single-entity geometric types and their associated homogeneous collections of multiple entities:

A GeoSeries is then a list of geometry objects and their associated index values.

Entries (rows) in a GeoSeries can store different geometry types

GeoPandas does not constrain the geometry column to be of the same geometry type. This can lead to unexpected problems if you’re not careful! Specially if you’re used to thinking of a GIS file format like shape files, which store a single geometry type. Also beware that certain export operations (say, to shape files …) will fail if the list of geometry objects is heterogeneous.

But enough theory! Let’s get our hands dirty (so to speak) with code. We’ll start by illustrating how GeoSeries are constructured.

Create a GeoSeries from a list of shapely Point objects constructed directly from WKT text

Though you will rarely need this raw approach!

from shapely.wkt import loads

GeoSeries([loads('POINT(1 2)'), loads('POINT(1.5 2.5)'), loads('POINT(2 3)')])
0        POINT (1 2)
1    POINT (1.5 2.5)
2        POINT (2 3)
dtype: object

Create a GeoSeries from a list of shapely Point objects using the Point constructor

gs = GeoSeries([Point(-120, 45), Point(-121.2, 46), Point(-122.9, 47.5)])
gs
0        POINT (-120 45)
1      POINT (-121.2 46)
2    POINT (-122.9 47.5)
dtype: object
type(gs), len(gs)
(geopandas.geoseries.GeoSeries, 3)

A GeoSeries (and a GeoDataFrame) can store a CRS implicitly associated with the geometry column. This is useful as essential spatial metadata and for transformation (reprojection) to another CRS. Let’s assign the CRS.

gs.crs = {'init': 'epsg:4326'}

The plot method accepts standard matplotlib.pyplot style options, and can be tweaked like any other matplotlib figure.

Note: There may be something odd happening with plot markersize

gs.plot(marker='*', color='red', markersize=100, figsize=(4, 4))
plt.xlim([-123, -119.8])
plt.ylim([44.8, 47.7]);

png

Let’s get a bit fancier, as a stepping stone to GeoDataFrames. First, we’ll define a simple dictionary of lists, that we’ll use again later.

data = {'name': ['a', 'b', 'c'],
        'lat': [45, 46, 47.5],
        'lon': [-120, -121.2, -122.9]}

Note this convenient, compact approach to create a list of Point shapely objects out of X & Y coordinate lists:

geometry = [Point(xy) for xy in zip(data['lon'], data['lat'])]
geometry
[<shapely.geometry.point.Point at 0x7f2acc658c50>,
 <shapely.geometry.point.Point at 0x7f2acc6582b0>,
 <shapely.geometry.point.Point at 0x7f2acc658f98>]

We’ll wrap up by creating a GeoSeries where we explicitly define the index values.

gs = GeoSeries(geometry, index=data['name'])
gs
a        POINT (-120 45)
b      POINT (-121.2 46)
c    POINT (-122.9 47.5)
dtype: object

4. GeoDataFrames: The real power tool

Additional notes

Start with a simple, manually constructed illustration

We’ll build on the GeoSeries examples. Let’s reuse the data dictionary we defined earlier, this time to create a DataFrame.

df = pd.DataFrame(data)
df
lat lon name
0 45.0 -120.0 a
1 46.0 -121.2 b
2 47.5 -122.9 c

Now we use the DataFrame and the “list-of-shapely-Point-objects” approach to create a GeoDataFrame. Note the use of two DataFrame attribute columns, which are effectively just two simple Pandas Series.

geometry = [Point(xy) for xy in zip(df['lon'], df['lat'])]
gdf = GeoDataFrame(df, geometry=geometry)

There’s nothing new to visualize, but this time we’re using the plot method from a GeoDataFrame, not from a GeoSeries. They’re not exactly the same thing under the hood.

gdf.plot(marker='*', color='green', markersize=50, figsize=(3, 3));

png

FINALLY, we get to work with real data! Load and examine the simple “oceans” shape file

gpd.read_file is the workhorse for reading GIS files. It leverages the fiona package.

oceans = gpd.read_file(os.path.join(data_pth, "oceans.shp"))
oceans.head()
my_polygon ID Oceans geometry
0 S.Atlantic 1 South Atlantic Ocean POLYGON ((-67.26025728926088 -59.9309210526315...
1 N.Pacific 0 North Pacific Ocean (POLYGON ((180 66.27034771241199, 180 0, 101.1...
2 Southern 3 Southern Ocean POLYGON ((180 -60, 180 -90, -180 -90, -180 -60...
3 Arctic 2 Arctic Ocean POLYGON ((-100.1196521436255 52.89103112710165...
4 Indian 5 Indian Ocean POLYGON ((19.69705552221351 -59.94160091330382...

The crs was read from the shape file’s prj file:

oceans.crs
    {"init": "epsg:4326"}

Now we finally plot a real map (or blobs, depending on your aesthetics), from a dataset that’s global-scale and stored in “geographic” (latitude & longitude) coordinates. It’s not the actual ocean shapes defined by coastal boundaries, but bear with me. A colormap has been applied to distinguish the different Oceans.

oceans.plot(cmap='Set2', figsize=(10, 10));

png

oceans.shp stores both Polygon and Multi-Polygon geometry types (a Polygon may also be viewed as a Multi-Polygon with 1 member). We can get at the geometry types and other geometry properties easily.

oceans.geom_type
0         Polygon
1    MultiPolygon
2         Polygon
3         Polygon
4         Polygon
5    MultiPolygon
6         Polygon
dtype: object
# Beware that these area calculations are in degrees, which is fairly useless
oceans.geometry.area
0     5287.751094
1    11805.894558
2    10822.509589
3     9578.786157
4     9047.879388
5     9640.457926
6     8616.721287
dtype: float64
oceans.geometry.bounds
minx miny maxx maxy
0 -71.183612 -60.000000 28.736134 0.000000
1 -180.000000 0.000000 180.000000 67.479386
2 -180.000000 -90.000000 180.000000 -59.806846
3 -180.000000 47.660532 180.000000 90.000000
4 19.697056 -59.945004 146.991853 37.102940
5 -180.000000 -60.000000 180.000000 2.473291
6 -106.430148 0.000000 45.468236 76.644442

The envelope method returns the bounding box for each polygon. This could be used to create a new spatial column or GeoSeries; directly for plotting; etc.

oceans.envelope.plot(cmap='Set2', figsize=(8, 8), alpha=0.7, edgecolor='black');

png

Does it seem weird that some envelope bounding boxes, such as the North Pacific Ocean, span all longitudes? That’s because they’re Multi-Polygons with edges at the ends of the -180 and +180 degree coordinate range.

oceans[oceans['Oceans'] == 'North Pacific Ocean'].plot(figsize=(8, 8));
plt.ylim([-100, 100]);

png

Load “Natural Earth” countries dataset, bundled with GeoPandas

Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software.” A subset comes bundled with GeoPandas and is accessible from the gpd.datasets module. We’ll use it as a helpful global base layer map.

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(2)
pop_est continent name iso_a3 gdp_md_est geometry
0 28400000.0 Asia Afghanistan AFG 22270.0 POLYGON ((61.21081709172574 35.65007233330923,...
1 12799293.0 Africa Angola AGO 110300.0 (POLYGON ((16.32652835456705 -5.87747039146621...

Its CRS is also EPSG:4326:

world.crs
    {"init": "epsg:4326"}
world.plot(figsize=(8, 8));

png

Map plot overlays: Plotting multiple spatial layers

Here’s a compact, quick way of using the GeoDataFrame plot method to overlay two GeoDataFrames while style customizing the styles for each layer.

world.plot(ax=oceans.plot(cmap='Set2', figsize=(10, 10)), facecolor='gray');

png

We can also compose the plot using conventional matplotlib steps and options that give us more control.

f, ax = plt.subplots(1, figsize=(12, 6))
ax.set_title('Countries and Ocean Basins')
# Other nice categorical color maps (cmap) include 'Set2' and 'Set3'
oceans.plot(ax=ax, cmap='Paired')
world.plot(ax=ax, facecolor='lightgray', edgecolor='gray')
ax.set_ylim([-90, 90])
ax.set_axis_off()
plt.axis('equal');

png

5. Extras: Reading from other data source types; fancier plotting

Read PostgreSQL/PostGIS dataset from the Amazon Cloud

The fact that it’s on an Amazon Cloud is irrelevant. The approach is independent of the location of the database server; it could be on your computer.

import json
import psycopg2

First we’ll read the database connection information from a hidden JSON file, to add a level of security and not expose all that information on the github GeoHackWeek repository. This is also a good practice for handling sensitive information.

with open(os.path.join(data_pth, "db.json")) as f:
    db_conn_dict = json.load(f)

Open the database connection, returning a connection object:

conn = psycopg2.connect(**db_conn_dict)

Now that we’ve used the connection information, we’ll overwrite the user and password keys (for security) and print out the dictionary, to give you a look at what needs to be in it:

db_conn_dict['user'] = '*****'
db_conn_dict['password'] = '*****'
db_conn_dict
    {"database": "geohack",
     "host": "dssg2017.csya4zsfb6y4.us-east-1.rds.amazonaws.com",
     "password": "*****",
     "port": 5432,
     "user": "*****"}

Finally, the magic: Read in the world_seas PostGIS dataset (a spatially enabled table in the PostgreSQL database) into a GeoDataFrame, using the opened connection object. Note the use of a simple SQL query string: select * from world_seas

seas = gpd.read_postgis("select * from world_seas", conn, 
                        geom_col='geom', crs={'init': 'epsg:4326'}, 
                        coerce_float=False)

Limitations in reading from PostGIS

GeoPandas apparently can not automatically read geom_col and crs from PostGIS. They must be specified explicitly. That’s a hassle that hopefully will be fixed in the future. Want to contribute one of these enhancement??

Close the connection. Clean up after yourself.

conn.close()

Let’s take a look at the GeoDataFrame.

seas.head()
gid name id gazetteer is_generic oceans geom
0 1 Inner Seas off the West Coast of Scotland 18 4283 False North Atlantic Ocean (POLYGON ((-6.496945454545455 58.0874909090909...
1 2 Mediterranean Sea - Western Basin 28A 4279 False North Atlantic Ocean (POLYGON ((12.4308 37.80325454545454, 12.41498...
2 3 Mediterranean Sea - Eastern Basin 28B 4280 False North Atlantic Ocean (POLYGON ((23.60853636363636 35.60874545454546...
3 4 Sea of Marmara 29 3369 False North Atlantic Ocean (POLYGON ((26.21790909090909 40.05290909090909...
4 5 Black Sea 30 3319 False North Atlantic Ocean (POLYGON ((29.04846363636364 41.25555454545454...

More advanced plotting and data filtering

Color the layer based on one column that aggregates individual polygons; using a categorical map, as before, but explicitly selecting the column (column='oceans') and categorical mapping (categorical=True); displaying an auto-generated legend, while displaying all polygon boundaries. Each “oceans” entry (ocean basins, actually) contain one or more ‘seas’.

# The geopandas plot method doesn't currently support the matplotlib legend location parameter,
# so we can't control the legend location w/o using additional matplotlib machinery
seas.plot(column='oceans', categorical=True, legend=True, figsize=(14, 6));

png

Additional plotting examples

See http://darribas.org/gds16/content/labs/lab_03.html for great examples of lots of other cool GeoPandas map plotting tips.

Combine what we’ve learned. A map overlay, using world as a background layer, and filtering seas based on an attribute value (from oceans column) and an auto-derived GeoPandas geometry attribute (area). world is in gray, while the filtered seas is in color.

seas_na_arealt1000 = seas[(seas['oceans'] == 'North Atlantic Ocean') 
                          & (seas.geometry.area < 1000)]
seas_na_arealt1000.plot(ax=world.plot(facecolor='lightgray', figsize=(8, 8)), 
                        cmap='Paired', edgecolor='black')

# Use the bounds geometry attribute to set a nice
# geographical extent for the plot, based on the filtered GeoDataFrame
bounds = seas_na_arealt1000.geometry.bounds

plt.xlim([bounds.minx.min()-5, bounds.maxx.max()+5])
plt.ylim([bounds.miny.min()-5, bounds.maxy.max()+5]);

png

Save the filtered seas GeoDataFrame to a shape file

The to_file method uses the fiona package to write to a GIS file. The default driver for output file format is ‘ESRI Shapefile’, but many others are available because fiona leverages GDAL/OGR.

seas_na_arealt1000.to_file(os.path.join(data_pth, "seas_na_arealt1000.shp"))

Read from OGC WFS GeoJSON response into a GeoDataFrame

Use an Open Geospatial Consortium (OGC) Web Feature Service (WFS) request to obtain geospatial data from a remote source. OGC WFS is an open geospatial standard.

We won’t go into all details about what’s going on. Suffice it to say that we issue an OGC WFS request for all features from the layer named “oa:goainv” found in a GeoServer instance from NANOOS, requesting the response in GeoJSON format. Then we use the geojson package to “load” the raw response (a GeoJSON string) into a geojson feature object (a dictionary-like object).

The “oa:goainv” layer is a global dataset of monitoring sites and cruises where data relevant to ocean acidification are collected. It’s a work in progress from the Global Ocean Acidification Observation Network (GOA-ON); for additional information see the GOA-ON Data Portal.

import requests
import geojson

wfs_url = "http://data.nanoos.org/geoserver/ows"
params = dict(service='WFS', version='1.0.0', request='GetFeature',
              typeName='oa:goaoninv', outputFormat='json')

r = requests.get(wfs_url, params=params)
wfs_geo = geojson.loads(r.content)

Let’s examine the general characteristics of this GeoJSON object, including its __geo_interface__ interface, which we discussed earlier.

print(type(wfs_geo))
print(wfs_geo.keys())
print(len(wfs_geo.__geo_interface__['features']))
<class 'geojson.feature.FeatureCollection'>
dict_keys(['type', 'totalFeatures', 'crs', 'features'])
544

Now use the from_features constructor method to create a GeoDataFrame directly from the geojson.feature.FeatureCollection object.

wfs_gdf = GeoDataFrame.from_features(wfs_geo)

Finally, let’s visualize the data set as a simple map overlay plot; and as an example, display the values for the last feature.

wfs_gdf.plot(ax=world.plot(cmap='Set3', figsize=(10, 6)),
             marker='o', color='red', markersize=15);

png

wfs_gdf.iloc[-1]
Oceans                                                    North Atlantic Ocean
additional_organizations                                                      
agency                                                                        
city                                                                          
comments                                                                      
comments_about_overlaps                                                       
contact_email                     akoertzinger@geomar.de; tsteinhoff@geomar.de
contact_name                                  Arne Körtzinger;Tobias Steinhoff
country                                                                Germany
data_url                                                 http://www.socat.info
department                                                                    
deploy_date                                                                   
depth_range                                                                   
duration                                                                      
frequency                                           1 crossing every 2.5 weeks
geometry                                                  POINT (-53.53 46.27)
id                                                                         589
latitude                                                                 46.27
line_xy                      -53.53,46.27::-53.28,46.33::-53.03,46.38::-52....
location                                                                      
longitude                                                               -53.53
method                                                                        
method_documentation         Carbon system including CO2 fluxes in the Nort...
organization                   GEOMAR Helmholtz Centre for Ocean Research Kiel
organization_abbreviation                                               GEOMAR
overlaps_with                                                                 
parameters                       pCO2; dissolved oxygen; salinity; temperature
parameters_planned                                                            
platform_name                                                           GEOMAR
platform_name_kml                                                      589.kml
platform_type                                                              VOS
point_xy                                                          -53.53,46.27
project                                                                       
sensors                                                                       
source_doc                                                     on-line request
track_pt_lat                                                             46.27
track_pt_lon                                                            -53.53
type                                                                open ocean
url                             http://www.geomar.de/en/research/expeditionen/
Name: 543, dtype: object

Key Points