Tutorial: Exploring raster and vector geographic data with rasterio and geopandas

in utopian-io •  7 years ago  (edited)

Repository

https://github.com/mapbox/rasterio

What Will I Learn?

  • You will learn how to manipulate geographic data with geopandas and rasterio.
  • You will learn how to integrate vector geographic data with raster files using rasterio and geopandas.

Requirements

  • Intermediate python knowledge.
  • Basic rasterio and geopandas knowledge.

Difficulty

  • Intermediate

image.png

Rationale

This is a small project project of geographic data exploration. The main tools for this task are: Rasterio and Geopandas. This analysis began as an attempt to measure the access to public ways in each of Guatemala municipalities. First I have used Open Street Maps data, obtained from here. I also have at hand Guatemala's roads official data, obtained from the national institute of geography.. When comparing to official roads data, it can be seen OSM does not have all roads, also, OSM roads data contains streets for some places and has a very confusing set of categories. In general OSM roads data can be misleading. I will show you both and the code I have used to manipulate both datasets. The outcome measurement of this analysis is the proportion of people that live near a road. This outcome may be useful to analyse health seeking beahviour, for instance, or in general to analyse the population access to economical and social activities such as labor, consumption, trading, education and transportation.

First I am going to show the analysis on OSM data and after that I will show the IGN data analysis. Both analysis are the same. The only change is the roads data used.

For population estimations I have used Worldpop project data. I have introduced this before. That post also contains an introduction to rasterio. There is a possible issue with this analysis: Worldpop project used roads data to estimate population density. However, there are lots of other covariates that worldpop uses to model population density, such as poverty, night lights, deforestation, and many others. Even though this analysis is unconclusive and generates more questions than answers. it has been a great exercise to learn more about Fiona, Rasterio and GeoPandas.

Using OSM data:

import fiona
import rasterio
import rasterio.features
from rasterio import mask
import pandas as pd
import geopandas
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mlp
import scipy.ndimage as ndi
import re

%matplotlib inline

Reading the roads shapefile with geopandas is as simple as this:

gtroads_osm = geopandas.read_file("../../../DATOS/OSM/hotosm_gtm_roads_lines.shp/hotosm_gtm_roads_lines.shp")

It uses fiona to read and converts the data to a GeoDataFrame object, which allows the data to be manipulated with Pandas syntax. Let's take a quick look to the data:

image.png

Now load the Worldpop raster file:

wpgt_file = "../../../DATOS/WorldPop/GTM_ppp_v2b_2015/GTM_ppp_v2b_2015.tif"
wpgt_r = rasterio.open(wpgt_file)
wpgt = wpgt_r.read()

wpgt_r is a RasterReader object, while wpgt is the first band read from the raster file. A raster file can contain more than one band. A band is a 2D array of values. In this case, the WorldPop band contains the estimated population for a ~100m x 100m square.

Now one of the main things I want to show in this tutorial is the way to convert a vector geographical object to a raster:

rasterizado = rasterio.features.rasterize(
    [(x.geometry, 1) for i, x in gtroads_osm.iterrows()],
    out_shape=wpgt_r.shape,
    transform=wpgt_r.transform,
    fill=0,
    all_touched=True,
    dtype=rasterio.uint8, )

Rasterio features module has a rasterize function that allows you to convert a vector object to an image. The first argument of this function is a list of tuples. Each tuple has two members, the first is the geometry to be rasterized and the second member is the value that you want to give to that geometry. Next we define the shape of the output band, the transform (both of these are just taken from the original worldpop raster file), the fill value for pixels that have no rasterized vector, the all_touched option draws in all pixels that are touched by the geometries.

To manipulate this new band, I have had to save it in a raster file. This is because there is no way to put a band in memory in a Raster object in rasterio. So the band is saved in a new raster file, and then loaded again.

# To write the rasterized map to a file:
profile = wpgt_r.profile
profile.update(
    dtype=rasterio.uint8,
    count=1,
    compress='lzw', 
    nodata=0)
rasterizado[rasterizado < 1] = 0
with rasterio.open("GtRoads_OSM_100m_x_100m.tif", 'w', **profile) as out:
    out.write_band(1, rasterizado)

As you can see, you need a profile to make a raster file. We copy the profile from the worldpop raster object and then we use it to oepn a rasterio file writer object which exports our new band.

# I have not found a way to get a raster from in memory data. So I have loaded it from the file written above.
gtroads_osm_raster = rasterio.open("GtRoads_OSM_100m_x_100m.tif", 'r')
gtroads_osm_r = gtroads_osm_raster.read()

To get a mask of the raster pixels that are close to a pixel containing a road, I have used a max filter from scipy's ndimage library.

gtroads_osm_filtered = ndi.maximum_filter(gtroads_osm_r, size= 5 )

I have used a size of 5, which means that it makes a filter centered in each pixel with size 5x5. This will make a mask of geographical regions that are at between 200m and 300m of distance from a road, because each pixel has been assumed to have this size: 100m x 100m. This is a raw approximation, but let's stick with it for the sake of simplicity.

Now I use this mask to get the population that lives at ~ 300m or less from a road. Roads band is composed of 1s and 0s. To get the populations of the cells with 1, we may just multiply the worldpop band by the roads filtered band:

gtroads_access_osm = wpgt[0] * gtroads_osm_filtered[0]

plt.rcParams['figure.figsize'] = 14, 14
plt.imshow(np.log10((np.fmax( gtroads_access_osm+1, 1))), vmin = 0, interpolation="bilinear")
bar = plt.colorbar(fraction=0.03)
bar.set_ticks([0, 0.301, 1.041, 1.708, 2.0043])
bar.set_ticklabels([0, 1, 10, 50, 100])

image.png

Note that matplotlib's imshow function allows us to plot this 2D large array. I prefer to use transform the data with the log10 function because it has a wide range and cells with low population count become invisible without this rescaling. This is the worldpop dataset masked by proximity to guatemalan roads and streets from OSM data. It looks very nice, but it is better to have the actual numbers by municipality, so let's now import the municipalities geometries and aggregate this raster by them.

# Now let's get Guatemala municipalities shapes
gpmunis_json = fiona.open("../../../DATOS/IGN/GT-IGN-cartografia_basica-Division politica Administrativa (Municipios).geojson", "r", "GeoJSON")
gpmunis = geopandas.GeoDataFrame.from_features(gpmunis_json)

To aggregate the data points that are contained in each municipality polygon shape, we use the mask module. The mask function takes a raster band and a geometry. It generates a masked array subr and the bounds of the geometry within the raster, bounds.

munis_pop = []
for muni in gpmunis_json:
    subr, bounds = mask.mask(wpgt_r, [muni["geometry"]])
    munis_pop.append({
        "road_access": np.sum(gtroads_access_osm[subr.data[0]>0]),
        "total_pop": np.sum(subr.data[subr.data>0]),
        "codigo": muni["properties"]["COD_MUNI__"]
    }) 
    print("processes muni ", muni["properties"]["COD_MUNI__"])

As you can see, I have masked only over the worldpop raster band. To get the population close to roads (road_access) I use the subr mask as a filter (it filters only the positive values greater than 0) over the gtroads_access_osm array, which contains the population counts for cells close to roads. All this data is put in a list of dictionaries and then it is converted to a dataframe and merged with the municipalities geometries dataset.

mpdf = pd.DataFrame(munis_pop)
mpdf["perc_access"] = mpdf.road_access / mpdf.total_pop
gpmunis_osm = gpmunis.copy().merge(mpdf, left_on="COD_MUNI__", right_on="codigo", how="left")

Now we can visualize this in a nice map easily with GeoPandas. Note that GeoPandas takes care of everything, including the legend of the choropleth:

plt.rcParams['figure.figsize'] = 14, 14

plot = gpmunis_osm.plot(column = "perc_access", legend= True)
plot.legend(loc=2, prop={'size': 6})

image.png

Using IGN Data

Now I want to repeat this process with IGN data in order to compare between these two. IGN data has only roads and will be quite different. Let's see:

gtroads_ign = geopandas.read_file("../../../DATOS/IGN/IGN-cartografia_basica-Red de CArreteras.geojson", 
                                  mode="r", driver="GeoJSON")

print(len(gtroads_ign))
gtroads_ign.head()

image.png

It is unfortunate that IGN data has the properties encoded in raw html. So I made a quick parser to manage this inconvenient:

prev = None
def process_description_in_ugly_ign_format(input_str):
    if not input_str.startswith("Continuation"):
        return pd.Series(index = re.findall("atr-name\"\>([\w ]*)\<", input_str), 
                         data = re.findall("atr-value\"\>([^\<]*)\<", input_str))
    return pd.Series([])
    
properties = \
    gtroads_ign.description.apply(process_description_in_ugly_ign_format).fillna(method="ffill")
gtroads_ign = gtroads_ign.merge(properties, left_index=True, right_index=True)

I make use of regular expressions to find properties names and values. I put them in a pandas DataFrame and then I merge it with the original GeoDataFrame.

image.png

Now I perform the process described above for the OSM data:

rgtroads_ign = rasterio.features.rasterize(
    [(x.geometry, 1) for i,x in gtroads_ign.iterrows()],
    out_shape=wpgt_r.shape,
    transform=wpgt_r.transform,
    fill=0,
    all_touched=True,
    dtype=rasterio.uint8, )

# To write the rasterized map to a file:
profile = wpgt_r.profile
profile.update(
    dtype=rasterio.uint8,
    count=1,
    compress='lzw', 
    nodata=0)
rgtroads_ign[rgtroads_ign < 1] = 0
with rasterio.open("GtRoads_IGN_100m_x_100m.tif", 'w', **profile) as out:
    out.write_band(1, rgtroads_ign)
    
gtroads_raster_ign = rasterio.open("GtRoads_IGN_100m_x_100m.tif", 'r')
gtroads_ign_rdata = gtroads_raster_ign.read()

gtroads_ign_filtered = ndi.maximum_filter(gtroads_ign_rdata, size= 5 )

gtroads_access_ign = wpgt[0] * gtroads_ign_filtered[0]

plt.rcParams['figure.figsize'] = 14, 14
plt.imshow(np.log10((np.fmax( gtroads_access_ign+1, 1))), vmin = 0, interpolation="bilinear")
bar = plt.colorbar(fraction=0.03)
bar.set_ticks([0, 0.301, 1.041, 1.708, 2.0043])
bar.set_ticklabels([0, 1, 10, 50, 100])

image.png

You can see it looks quite different to the OSM rasterized map. Much less streets and roads. However, you can easily find roads that are missing in the OSM dataset (specially in Quiché district).

munis_acc_ign = []
for muni in gpmunis_json:
    subr, bounds = mask.mask(wpgt_r, [muni["geometry"]])
    munis_acc_ign.append({
        "road_access" : np.sum(gtroads_access_ign[subr.data[0]>0]),
        "total_pop"   : np.sum(subr.data[subr.data>0]),
        "Codigo"      : muni["properties"]["COD_MUNI__"]
    })
    print("processes muni ", muni["properties"]["COD_MUNI__"])

mpdf_ign = pd.DataFrame(munis_acc_ign)
mpdf_ign["perc_access"] = np.round(100 * mpdf_ign.road_access / mpdf_ign.total_pop)

gpmunis_ign = gpmunis.copy().merge(mpdf_ign, left_on="COD_MUNI__", right_on="Codigo", how="left")

plt.rcParams['figure.figsize'] = 14, 14

plot = gpmunis_ign.plot(column = "perc_access", legend= True)
plot.legend(loc=2, prop={'size': 6})

image.png

Conclusion

In this tutorial I have shown some of the features in rasterio for masking, rasterizing and manipulation of raster files. I have made minor use of geopandas package to load GIS geometries in GeoJSON and Shapefile formats and to make a choropleth visualization without much effort. It has been shown that these tools , rasterio, fiona, geopandas, numpy, scipy, and others, can be used together to manipulate vector and rasterized geographical data.

Curriculum

Proof of Work Done

https://github.com/Guitlle/Guatemala-roads-access

Authors get paid when people like you upvote their post.
If you enjoyed what you read here, create your account today and start earning FREE STEEM!
Sort Order:  
  ·  7 years ago (edited)

If I'm not mistaken (Its been long time since I touch python :D) almost library that derived from pandas is loading data into memory. For example, if I load data from PostGIS database using geopandas then It will load from that database into memory which is some computer with limited RAM can become hang. Any idea to anticipate this?

Also, if it's not a burden, would you share the memory usage and how long it takes to display the map? (just the estimation is okay)

And some tips for me to make your next tutorial more readable. Try to zoom in (CTRL + mousescroll) your Jupyter notebook before taking a screenshot :)

Hey @drsensor
Here's a tip for your valuable feedback! @Utopian-io loves and incentivises informative comments.

Contributing on Utopian
Learn how to contribute on our website.

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!

Hi @drsensor. Thanks for commenting. Yes, this tutorial is handling data in memory. My computer has 8 GB of RAM and the files I used are in total less than 1GB ( I think ) . I can try to get you the actual amount of memory being used when I got a chance to do it. If you have to use much more data than this, then you either need more RAM (there are cloud servers with lots of RAM that you can rent for a few hours and a few dollars) or you need to optimize all this processing. Optimizing raw data analysis is a whole subject in its own. You may use cython and map reduce algorithms.

Next time I will try to make the images compatible with small devices. Thanks for the suggestion!

  ·  7 years ago (edited)

Nice, thank you for your time, looking forward on that .
I see, seems it will take ~800MB RAM. Be careful not to accidentally re-run the notebook (.ipynb) more than 7 times (800MB x 8). If you want to re-run it make sure to exit the notebook first to free the memory.

Actually long ago, I do a project about image processing task and after reading my notes, I do something like lazy evaluation which only loads and computes on specific parts when it's needed. There is some library that I want to use back then but it only one of these libs that I actually use (time constraint project 😂 ). Maybe you want to experiment with one of these libraries (if you hit memories problem or have some cluster computers):

  • Blaze: I'm interested in trying this but never had a chance. You can do batching, set the chunk size, then do something like external memory computation (maybe)
  • Spark/PySpark: At that time I'm not really interest using this because rigorous installation process for a single laptop and seems doesn't suit my use case.
  • Dask: I use this. This really suitable for my use case because of some part of the algorithm can run on GPU which is I can distribute the computation between CPU and GPU evenly (but not perfect). The downside is it really take a time to build good DAG. In my notes seems like I use delayed function a lot when loading images and do pre-processing tasks.

Thanks for the great links. I didn't know a couple of these. I have not encountered the oportunity to handle data this big, but hopefully I will and these options will come handy. Looks you know quite a bit about computationally intensive analysis. Hope to see some tutorials from you about this ;)

That is a fantastic piece of work you did here.
Just a quick question to you, why using Guatemala in particular? is it your home country? :)
And a suggestion for your future work, try to add more comments to your code, but also some more sub-headers.
Again, beautiful work!

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.


Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

Thanks @mcfarhat . This analysis is part of a larger project, so I seized the oportunity to write a tutorial of the skills learned during this. Thanks for your suggestion I will take it into account.

Hey @elguille
Thanks for contributing on Utopian.
Congratulations! Your contribution was Staff Picked to receive a maximum vote for the tutorials category on Utopian for being of significant value to the project and the open source community.

We’re already looking forward to your next contribution!

Contributing on Utopian
Learn how to contribute on our website or by watching this tutorial on Youtube.

Want to chat? Join us on Discord https://discord.gg/h52nFrV.

Vote for Utopian Witness!