Introduction to spatial site-suitability analysis in R


1. Introduction

The purpose of this tutorial is to introduce spatial analysis and basic geoprocessing in R. In this tutorial, readers will build a ‘site suitability’ model – a common spatial analysis approach for locating a land use in space given a set of spatial constraints or ‘decision factors’.

Typically, we employ some version of a site suitability analysis when asking a “where” question. In this tutorial, we ask: where should we site a casino in Gettysburg, Pennsylvania, conditional on a set of decision factors that describe the Gettysburg National Military Park, the town boundary, highways, and a series of tabular demographic and zoning variables.

Section 2 of this tutorial focuses on basic geospatial data visualization routines. We use both the native sf plotting functionality as well as ggplot. Section 3 introduces vector geoprocessing using the sf package. Section 4 describes the site suitability workflow and the final section illustrates more advanced spatial data visualization. Throughout, references are made to comparable functionality in ArcGIS, the most ubiquitous desktop GIS software.

The data can be downloaded here. Let’s begin with some basic setup routines.
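The setup below is a minimal sketch: the library calls are standard, but the shapefile names are assumptions – substitute the file names from your own download.

```r
# Load the two workhorse packages for this tutorial
library(sf)        # simple features: spatial classes and geoprocessing
library(tidyverse) # dplyr wrangling and ggplot visualization

# Read in the layers. The file names below are placeholders --
# point st_read at the shapefiles in your downloaded data folder.
blocks        <- st_read("blocks.shp")
gettysburg    <- st_read("gettysburg_boundary.shp")
nationalParks <- st_read("national_parks.shp")
highways      <- st_read("highways.shp")
```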

Let’s check out the class of the blocks layer and return the first 6 rows.
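A quick inspection, assuming blocks was loaded as above:

```r
class(blocks) # both "sf" and "data.frame"
head(blocks)  # first 6 rows, including the geometry column
```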

The response tells us that this object is both a data.frame and an sf, or simple features collection. This is a geospatial layer, similar in function to a shapefile. What makes sf so powerful is that, because blocks is a data frame, it can be wrangled using standard dplyr verbs, as demonstrated below.

The output above also tells us the number of spatial features and columns in the layer, blocks. It tells us that the geometry type is MULTIPOLYGON – which, to keep it simple, can be described as “a set of polygons”. Other sf geometry types include POLYGON, POINT and LINESTRING.

We can also see the spatial extent, or bounding box, of the layer, as well as the coordinate system, denoted by its epsg code. This data is fake and thus has no projection; however, a great reference for projection information can be found at spatialreference.org.

Finally, the geometry column stores all the spatial information associated with a given feature.

These are Census Block units. Here is a quick data dictionary:

  1. STFID – Unique ID from the U.S. Census
  2. AREA_HA – Area in hectares
  3. pop2000 – Population in 2000
  4. hu2000 – Number of housing units in 2000
  5. pden2000 – Population density in 2000
  6. hden2000 – Housing unit density in 2000
  7. zoning – A categorical zoning classification

2. Plotting Simple Features

There are two options for plotting sf objects. First, sf has native plotting routines that can be used for building quick visualizations. Second, ggplot, the R standard for data visualization, can also take sf objects as inputs. ggplot makes it possible to create high quality maps in R. Let’s start with the native sf plot method.
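Passing the whole layer to plot draws the default sf small-multiple view:

```r
plot(blocks) # one mini-map per attribute column
```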


With just a handful of characters, sf will return a ‘small multiple’ plot, visualizing the range of values for each variable in the blocks shapefile. We can plot just one variable by using index notation like so:
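Using index notation to pull out a single column, for example:

```r
plot(blocks["pop2000"]) # a single choropleth of the pop2000 field
```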


We can even plot a handful of geometries like so:
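One way to do this is to subset a few rows and plot only their geometry; the choice of the first five features here is arbitrary:

```r
plot(st_geometry(blocks[1:5, ])) # just the shapes of the first five features
```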


Now, let’s take the ggplot approach. Plotting an sf object with ggplot is as simple as using the geom_sf function. Here we are going to fill the block polygons with the pop2000 variable. Note that in the below code, the relevant data frame is specified in the geom_sf call and not in the ggplot call.

The code block below may throw an error which reads something to the effect of, Error in…could not find function “geom_sf”. This error may result from the fact that the current version of ggplot on CRAN does not support the geom_sf functionality.

If you do get this error, install the latest version of ggplot from GitHub by following these steps in RStudio:

  1. Run devtools::install_github("tidyverse/ggplot2").
  2. This may prompt you to install RTools. If so, do install.
  3. Once RTools is installed, it should automatically install devtools. If not, run Step 1 again.
  4. If another error pops up about a missing “dependency”, then manually install the missing package as well from Tools -> Install Packages.

To plot in ggplot:
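A minimal version looks like this – note that blocks is passed to geom_sf, not to ggplot:

```r
ggplot() +
  geom_sf(data = blocks, aes(fill = pop2000))
```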


Let’s say that we wanted to create a small multiples plot that visualizes two maps: population density and housing unit density.

One of the major innovations of sf over previous spatial analysis tools in R is that it allows us to wrangle shapefiles and other spatial data types using dplyr verbs. Now, wrangling a shapefile is just as easy as wrangling a standard R data frame.

The code block below shows how we create the small multiple map. The tidyverse makes it possible to “pipe” one command to the next using the %>% notation. We begin with mutate, which recodes the numeric zoning variable to a string. We then select the three fields we need: population density, housing unit density and zoning. We convert from sf to data.frame. Then we use the gather function to reformat the data from wide to long format.

What is wide vs. long format? The census blocks are currently in wide format, where rows represent units of analyses and columns represent variables. In order for ggplot to interpret these data into a small multiple map, we must reformat the data into long form, such that each row represents a single value for a given variable and corresponding zoning category.

The function that converts data from wide to long form is called gather. To get some intuition for what’s happening, run the below code block through the gather function and inspect the resulting data frame.
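A sketch of that wrangle, stopping at gather so the long-form result can be inspected (the simple as.character recode of zoning stands in for whatever recoding scheme the actual data calls for):

```r
longBlocks <- blocks %>%
  mutate(zoning = as.character(zoning)) %>% # numeric code -> string
  select(pden2000, hden2000, zoning) %>%    # the geometry column is "sticky" and comes along
  as.data.frame() %>%
  gather(variable, value, pden2000, hden2000)

head(longBlocks) # one row per feature/variable pair
```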


The next step in our pipe below is to convert back from data.frame to sf using the st_sf command. We can convert to sf because gather maintains the geometry column, which is a requirement for simple features.

Finally, the whole dplyr statement is piped into ggplot where colour is set to NA to remove the boundaries. This is done to show the higher densities in downtown Gettysburg. Notice that we call geom_sf to map the sf object. We then pass variable to the facet_wrap operation which creates one map per variable in our long data.
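Putting the whole pipe together, the small multiple map might be built like so (the zoning recode is again a stand-in):

```r
blocks %>%
  mutate(zoning = as.character(zoning)) %>%
  select(pden2000, hden2000, zoning) %>%
  as.data.frame() %>%
  gather(variable, value, pden2000, hden2000) %>%
  st_sf() %>%                                 # back to simple features
  ggplot() +
    geom_sf(aes(fill = value), colour = NA) + # colour = NA removes the boundaries
    facet_wrap(~variable)                     # one map per variable
```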

While this approach is incredibly efficient compared to, say, laying out individual maps in the Layout View of ArcGIS, the major drawback is that we cannot create individual legends for each variable. This is problematic when mapping two variables with different scales, such as the maps shown above. If you want to create small multiple plots with individual legends, check out the gridExtra package, which we use below.

3. Basic geoprocessing with sf

The code block at the end of Section 2 shows that it is possible to wrangle simple features just like a standard data frame. These dplyr verbs easily replicate ArcGIS functionality like Select By Attributes, Field Calculator etc.

In this section we cover some basic geoprocessing in sf, including area calculation, buffering and selection by overlay (a.k.a. Select By Location).

Let’s start with area. The below code block creates a new field in blocks that calculates area. The area is returned in units that correspond to the coordinate system of the input sf object. This units data class cannot be read by ggplot nor operated on mathematically, so we have to recode area into, in this case, an integer. Note that in the code block below, we call as.integer in the aes argument to recode on the fly.
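A sketch of the area calculation:

```r
blocks %>%
  mutate(area = st_area(.)) %>%           # st_area returns a 'units' object
  ggplot() +
    geom_sf(aes(fill = as.integer(area))) # recode to integer on the fly
```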


Now onto buffer. The below code block singles out the first polygon of blocks using top_n, buffers it by 100 meters and plots it with the original polygon overlaid. Note that with ggplot, it is possible to overlay geom_sf calls.
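A sketch, using simple row indexing in place of top_n to grab one polygon:

```r
oneBlock <- blocks[1, ] # a single polygon

ggplot() +
  geom_sf(data = st_buffer(oneBlock, 100)) + # 100 meter buffer
  geom_sf(data = oneBlock, fill = "white")   # original polygon drawn on top
```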


Let’s move on to selecting by an overlay. To illustrate this functionality, we will create some dummy data by taking a feature located centrally in blocks, finding its centroid using st_centroid, buffering it, and using the resulting feature to select those blocks that intersect it.
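The dummy data might be created like this; the row index and buffer distance are arbitrary choices:

```r
dummyBuffer <- blocks[100, ] %>% # a hypothetical centrally located feature
  st_centroid() %>%              # the polygon's center point
  st_buffer(500)                 # a 500 meter buffer, purely illustrative
```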


Now that we have our input and overlay features (blocks and dummyBuffer, respectively), we are ready to perform the selection, grabbing every block that intersects the buffer. This couldn’t be easier. In the code block below, our reference for subsetting is an entire sf object, the dummyBuffer. We create a new object called selection and plot it using the native sf plotting function. Note the add argument which allows us to do a quick overlay.
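The selection and quick overlay:

```r
selection <- blocks[dummyBuffer, ] # spatial subset: blocks intersecting the buffer

plot(st_geometry(blocks))
plot(st_geometry(selection), col = "grey", add = TRUE) # overlay on the same plot
```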


Turning back to ggplot momentarily, notice above in our ggplot overlay of blocks and dummyBuffer that there is no legend telling us the difference between the two layers. In order to do so we have to put both sets of geometries into the same sf object and map a field to an aes argument in geom_sf.

The code block below does this by creating a field called ‘geography’ in both the buffer and the blocks. Using the base R command rbind, or “row bind”, we join together the 554 blocks with the 1 buffer geometry (this is very similar to a merge in ArcGIS). Note that in order to successfully rbind these geometries together, they need to have the same columns, hence the select function. Finally, notice that the geography field is mapped to the aes argument in the ggplot function. And voila, a legend.
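A sketch of the rbind approach (the label strings are illustrative):

```r
bothLayers <- rbind(
  blocks      %>% mutate(geography = "Census blocks") %>% select(geography),
  dummyBuffer %>% mutate(geography = "Buffer")        %>% select(geography))

ggplot() +
  geom_sf(data = bothLayers, aes(fill = geography)) # the mapped field yields a legend
```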


Let’s take this data visualization a bit further by adding a title, subtitle and caption using the labs argument. Also, we map specific colors, legend labels and a legend title to the polygons with scale_fill_manual.
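A sketch; the colors and label text are arbitrary choices:

```r
bothLayers <- rbind(
  blocks      %>% mutate(geography = "Census blocks") %>% select(geography),
  dummyBuffer %>% mutate(geography = "Buffer")        %>% select(geography))

ggplot() +
  geom_sf(data = bothLayers, aes(fill = geography)) +
  scale_fill_manual(values = c("darkred", "grey80"),          # in level order
                    labels = c("Selection buffer", "Census blocks"),
                    name   = "Geography") +
  labs(title    = "Selecting blocks by overlay",
       subtitle = "Gettysburg, PA",
       caption  = "Tutorial data")
```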


4. Site suitability analysis: Locating a casino in Gettysburg, PA


Gettysburg, Pennsylvania is the site of one of the most historically significant military conflicts in United States history. Situated in the rolling farmland of Central Pennsylvania, Gettysburg is not only a battlefield museum and national park, it is a vibrant community with a well-respected university and a bustling tourist economy.

For several years, advocates attempted to locate a casino in Gettysburg. For this exercise, we are going to assume that the City Council has given you a set of location criteria or “decision factors” that you are to use when siting this casino. Below is the list of decision factors and the associated tools we will use for calculation:

  1. It can’t be downtown, so it should be within 1 mile of the Gettysburg town boundary (Buffer)
  2. It has to be at least ¼ mile away from the National Park (Buffer)
  3. It has to be adjacent to a major thoroughfare (Buffer)
  4. It has to be a commercial zone (dplyr)
  5. It has to have an existing population density less than the study area median (dplyr)
  6. It has to be greater than 40 acres (dplyr).

Let’s start by buffering the Gettysburg town boundary by 1 mile (1609 meters). Notice in the second geom_sf call when we plot the buffer, fill = NA and size = 2, which returns the polygon outline with a comparatively larger line weight.
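A sketch of decision factor one:

```r
gettysburgBuffer <- st_buffer(gettysburg, 1609) # 1 mile, in meters

ggplot() +
  geom_sf(data = gettysburg) +
  geom_sf(data = gettysburgBuffer, fill = NA, size = 2) # outline only, heavier line
```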


Next, for decision factor two, we buffer the National Park layer, nationalParks. Notice that nrow(nationalParks) tells us there are 4 features. Before we buffer these features in the same manner as above, note what is accomplished by running plot(st_buffer(nationalParks[,1], 402)). Here we embed the st_buffer operation directly in the plot command; 402 is roughly the number of meters in 1/4 mile.

Below, we buffer using a dplyr convention:
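For example:

```r
nationalParksBuffer <- nationalParks %>%
  st_buffer(402) # 402 meters is roughly 1/4 mile
```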


Decision factor 3 asks that the casino be located adjacent to a major thoroughfare. The code block below returns a map that selects all the blocks that intersect a 50 meter buffer of the highways. Here, however, we embed the st_buffer call inside the subsetting notation, all inside geom_sf. Embedding in such a way allows us to select, geoprocess and map, all in one code block. We then create a permanent highway buffer.
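A sketch of both steps:

```r
# Select, geoprocess and map in one block:
# blocks intersecting a 50 meter highway buffer
ggplot() +
  geom_sf(data = blocks[st_buffer(highways, 50), ])

# A permanent buffer layer for use below
highwayBuffer <- st_buffer(highways, 50)
```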


We tackle the final three non-spatial decision factors in one dplyr statement below. First, we use mutate to convert area into acres. Then we use filter to select blocks that are greater than 40 acres, have population densities less than the study area median and have commercial zoning. Note that we plot using the sf plot routines.
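A sketch; the zoning value "Commercial" stands in for however commercial zones are coded in the actual data:

```r
attributeSelect <- blocks %>%
  mutate(acres = AREA_HA * 2.47105) %>%            # hectares -> acres
  filter(acres > 40,
         pden2000 < median(pden2000, na.rm = TRUE), # below the study area median
         zoning == "Commercial")                    # assumed zoning label

plot(attributeSelect["acres"])
```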


So let’s create one plot below that shows the blocks we selected as part of attributeSelect with the highway buffer, park buffer, and the Gettysburg 1 mile buffer, all overlaid. Recall, the suitable property is within that 1 mile Gettysburg buffer. Can you spot the one suitable block?


Now let’s formalize into one cohesive selection strategy using a series of overlays, similar to ‘Select by Location’ in ArcMap. Our strategy is going to be as follows.

  1. Find thoseBlocks that are part of attributeSelect and inside the gettysburgBuffer.
  2. Of thoseBlocks, find those that are not in the nationalParksBuffer.

Hopefully, we should be left with one block that is suitable.

Notice that the nationalParksBuffer consists of four separate POLYGONs (nationalParksBuffer$geometry). The first step is to “dissolve” these four POLYGONs into one MULTIPOLYGON. There is no single sf equivalent to the ArcMap “dissolve” operation; instead, we use a combination of group_by and summarize from the dplyr package, like so:
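The dissolve, sketched with a constant grouping field:

```r
dissolve <- nationalParksBuffer %>%
  mutate(dissolveField = 1) %>% # a constant field to group every feature together
  group_by(dissolveField) %>%
  summarize()                   # sf unions the grouped geometries

dissolve$geometry # one MULTIPOLYGON
```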


You can see that dissolve consists of 1 MULTIPOLYGON. The next step is to find all of thoseBlocks that fall inside the gettysburgBuffer. We use a spatial subset as below, pulling from the blocks that we found in attributeSelect above.
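The spatial subset:

```r
thoseBlocks <- attributeSelect[gettysburgBuffer, ] # blocks inside the 1 mile buffer
```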

Finally, we use st_disjoint – the operation for ‘does not intersect’ – to find all of thoseBlocks that DO NOT intersect the dissolved national parks buffer, dissolve. This creates thoseBlocks_Not_in_ParksBuffer, which has a class of matrix. We can feed that matrix into a subset command to create the finalSelection, which we then plot.
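A sketch of the final selection:

```r
# sparse = FALSE returns a logical matrix: TRUE where a block does NOT intersect dissolve
thoseBlocks_Not_in_ParksBuffer <- st_disjoint(thoseBlocks, dissolve, sparse = FALSE)

finalSelection <- thoseBlocks[thoseBlocks_Not_in_ParksBuffer[, 1], ]
plot(st_geometry(finalSelection))
```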


We plot the result below in ggplot.


5. Some advanced approaches to spatial data visualization

As a final step, let’s create one last data visualization using the grid and gridExtra packages to add multiple plots to one ‘canvas’. Here we include our four basic decision factors, our final suitability map and an annotation.

While this framework is useful for combining multiple plots, it obviously is not a substitute for desktop publishing software like Inkscape or InDesign.


To create this plot, we have to take the following steps:

  1. For those decision factors that involve a buffer, such as the National Park, we have to create a new layer that includes the original geometry and the buffer geometry. To do this, we have to remove every variable except for geometry from the original layers. This will allow us to rbind both layers together (remember, we can’t rbind two layers with varying field names). We then add a new field to each called Legend that we will use to symbolize the difference between the original layer and its buffer.
  2. We create a plot object for each decision factor (i.e. parksDecisionFactor) that includes the theme argument, where we can get specific about the aesthetics of the plot.
  3. We do this for each decision factor and for the “Final Site Suitability” plot.
  4. We use gridExtra to define a ‘layout matrix’ and plot all the plots as one.

While there is a lot of code below, keep in mind that much of it is copy/paste and even more of it could be replaced with functions, if one were so inclined. For example, we define a standalone mapTheme() defining the theme for each plot.

Let’s begin with the creation of a mapTheme() function that will replicate all the theme information for each of the below data visualizations.
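The exact settings are a matter of taste; a sketch might look like:

```r
library(ggplot2)

# A reusable theme: bold title, no axis furniture, legend along the bottom.
# Every individual setting here is illustrative.
mapTheme <- function() {
  theme(
    plot.title       = element_text(size = 14, face = "bold"),
    axis.text        = element_blank(),
    axis.ticks       = element_blank(),
    axis.title       = element_blank(),
    panel.background = element_blank(),
    panel.grid       = element_blank(),
    legend.position  = "bottom")
}
```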

Now, we build the first decision factor for nationalParks.
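Following the rbind-then-Legend recipe from step 1 (the title and label text are illustrative, and mapTheme() is the function defined above):

```r
parksDecisionFactor <-
  ggplot() +
  geom_sf(data = rbind(
            # select() with no arguments keeps only the sticky geometry column
            nationalParks       %>% select() %>% mutate(Legend = "National Park"),
            nationalParksBuffer %>% select() %>% mutate(Legend = "1/4 mile buffer")),
          aes(fill = Legend)) +
  labs(title = "Decision factor 2: 1/4 mile from the National Park") +
  mapTheme()
```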


Next, let’s create the decision factor for the Gettysburg town and buffer.


Notice we also use the guides argument, which in this case, forces the legend to display the legend items in 2 rows.

Now the highways decision factor:


For the highways, we are only showing one geography, so we don’t need to rbind anything. The same holds true for the decision factor attributeSelect, which includes the demographic and acreage criteria.


Now we combine the decision factors into one combinationPlot. Recall that finalSelection is the one census block that meets our selection criteria. The combine layer rbinds four of the geometries into one layer. This allows us to represent them all with one legend.

It is very important to remember that we cannot have multiple scale_fill_manual legends, hence the need to put them all together into combine.
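The combine layer might be assembled like this; the Legend labels are illustrative:

```r
combine <- rbind(
  gettysburgBuffer    %>% select() %>% mutate(Legend = "1 mile town buffer"),
  nationalParksBuffer %>% select() %>% mutate(Legend = "Park buffer"),
  highwayBuffer       %>% select() %>% mutate(Legend = "Highway buffer"),
  finalSelection      %>% select() %>% mutate(Legend = "Suitable Site"))
```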

We want the theme to be different for the combinationPlot than for the four decision factor plots. As such, instead of the mapTheme() function, here we define theme using individual arguments.

Before you run the third statement below, run levels(combine$Legend). This is the default order that the legend will be created. We want the “Suitable Site” to come last, so we reorder the levels.
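One way to do the reordering uses fct_relevel from forcats (part of the tidyverse); with after = Inf, the named level moves to the end:

```r
combine$Legend <- factor(combine$Legend)
levels(combine$Legend) # the default, alphabetical order

# Move "Suitable Site" to the end so it appears last in the legend
combine$Legend <- fct_relevel(combine$Legend, "Suitable Site", after = Inf)
```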


Now we are ready to combine all of the visualizations into one. First make sure that you have the grid and gridExtra libraries installed. Then we are going to create a list of the 5 plots we wish to include.

Next, we create a “layout matrix” where we specify the row by column order for the plots we included in the list. Finally we run grid.arrange using the top and bottom arguments to specify a title and subtitle respectively.
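A sketch of the grid.arrange step; apart from parksDecisionFactor and combinationPlot, the plot object names are assumptions, and the layout is one possibility:

```r
library(grid)
library(gridExtra)

plotList <- list(parksDecisionFactor, gettysburgDecisionFactor,
                 highwaysDecisionFactor, attributeDecisionFactor,
                 combinationPlot)

layout <- rbind(c(1, 2),
                c(3, 4),
                c(5, 5)) # the final map spans the bottom row

grid.arrange(grobs = plotList, layout_matrix = layout,
             top    = "Casino site suitability: Gettysburg, PA",
             bottom = "A tutorial in sf and ggplot")
```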


And that’s all. Thanks for reading…

Ken Steif, PhD is the founder of Urban Spatial. He is also the director of the Master of Urban Spatial Analytics program at the University of Pennsylvania. You can follow him on Twitter @KenSteif.