
# Dataviz tutorial: Mapping San Francisco home prices using R

Authors: Ken Steif & Simon Kassel

I constantly preach to my students that no matter how policy-relevant your work is, if it cannot be conveyed to a non-technical decision maker, it isn’t useful. This is why data visualization is such an important tool for data scientists.

Because so much public policy has spatial implications, much of the data visualization we produce at Urban Spatial is geographic in nature.

My colleague Simon and I recently worked together on a machine learning model of gentrification using Census data throughout the U.S. In that piece we discuss the importance of parcel level data.

We thought we’d revisit the subject here using higher resolution data and report on our findings by way of data visualization.

Given our mutual interest in making maps in R, we share our code in this piece in the hope of getting others interested in mapping with ggplot and associated packages.

Our data consist of 17,527 single-family home sales in San Francisco between 2009 and 2015. The data have been cleaned and each sale has been associated with a neighborhood.

In this tutorial we’ll explore the rapid neighborhood change that has occurred in San Francisco in recent years by constructing time series plots as well as point and polygon maps.

To begin, open a new R script, set your workspace and a couple of selection options, and install and load the following libraries:
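The original library list did not survive in this copy of the post, so what follows is only a minimal setup sketch. The package set (ggplot2 and dplyr) and the two options shown are assumptions, not the authors' exact list.

```r
# Run once if the packages are not yet installed (assumed package set):
# install.packages(c("ggplot2", "dplyr"))

# Point R at a project folder of your choice (placeholder path):
# setwd("path/to/your/project")

library(ggplot2)  # plotting
library(dplyr)    # data wrangling

# Two commonly used session options (assumed, not taken from the original):
options(scipen = 999)              # print large prices without scientific notation
options(stringsAsFactors = FALSE)  # keep text columns as character
```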

Next, we’re going to define two themes that will tell ggplot how to construct both maps and plots. Defining our themes up front ensures that we don’t have to repeat this code over and over again for every plot we generate below.
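As a rough illustration of the idea, the two themes can be written as small functions that return ggplot2 theme objects. The function names `plotTheme` and `mapTheme` and every styling choice here are hypothetical stand-ins, not the themes used for the figures in this post.

```r
library(ggplot2)

# Hypothetical theme for charts: minimal look with a bold title
plotTheme <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title       = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position  = "right"
    )
}

# Hypothetical theme for maps: strip axes and grid lines
mapTheme <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      axis.text  = element_blank(),
      axis.title = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank(),
      plot.title = element_text(face = "bold")
    )
}

# Usage: add a theme to a plot like any other ggplot component
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + plotTheme()
```

Wrapping each theme in a function means a later styling change only has to be made in one place.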

We’ll also define some color palettes. Check out the ‘Zonum Solutions’ Color Ramp Generator for defining custom color ramps.

We’ll create several separate ramps depending on how many colors we are going to need for a given plot.
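One way to sketch this, assuming you only have the two end colors of a ramp, is base R's `colorRampPalette`, which interpolates as many steps as you ask for. The hex values below are arbitrary placeholders, not the palette used in the figures.

```r
# Hypothetical end colors for the ramp (placeholders)
ramp <- colorRampPalette(c("#0DA3A0", "#FFD700", "#FF6154"))

# Separate ramps sized for plots that need different numbers of classes
palette_5 <- ramp(5)
palette_8 <- ramp(8)

palette_5  # a character vector of 5 hex colors
```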

Next we’ll retrieve the data. First the home price data:
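The download location of the sales file is not reproduced here, so the sketch below only illustrates the reading step. It simulates a small table, writes it to a temporary CSV, and reads it back with `read.csv`. The column names (`SaleYr`, `SalePrice`, `Neighborhood`, `lon`, `lat`) and all values are assumptions made for illustration; the real file may be laid out differently.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the home sales table (all values are synthetic)
simulated <- data.frame(
  SaleYr       = sample(2009:2015, n, replace = TRUE),
  SalePrice    = round(rlnorm(n, meanlog = 13.7, sdlog = 0.5)),
  Neighborhood = sample(c("Inner Mission", "Marina", "Bayview"), n, replace = TRUE),
  lon          = runif(n, -122.51, -122.36),
  lat          = runif(n, 37.71, 37.81)
)

# Write to a temporary file, then read it the way a downloaded CSV would be read
csv_path <- tempfile(fileext = ".csv")
write.csv(simulated, csv_path, row.names = FALSE)
sales <- read.csv(csv_path, stringsAsFactors = FALSE)

str(sales)
```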

We’ll also download and unzip a shapefile of neighborhoods in San Francisco.

Let’s build some plots. First off, let’s check out the distribution of home prices for the entire dataset.
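A histogram of this kind could be sketched as follows. The data are synthetic (log-normal prices) and the `SalePrice` column name is an assumption, so the shape will not match the real figure.

```r
library(ggplot2)

set.seed(1)
# Synthetic prices standing in for the real sales data
sales <- data.frame(SalePrice = round(rlnorm(500, meanlog = 13.7, sdlog = 0.5)))

home_value_hist <- ggplot(sales, aes(x = SalePrice)) +
  geom_histogram(bins = 40, fill = "#0DA3A0") +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    title = "Distribution of home sale prices (synthetic data)",
    x = "Sale price",
    y = "Count"
  )

print(home_value_hist)
```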

[Figure: histogram of home sale prices]

It seems as though there may be some outliers. We’ll remove anything greater than 2.5 standard deviations from the mean.
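A minimal sketch of that filter, assuming a `SalePrice` column and reading the rule as an upper cutoff only (mean plus 2.5 standard deviations), might look like this. The data are synthetic, with one extreme sale added on purpose.

```r
set.seed(1)
# Synthetic prices plus one deliberately extreme sale
sales <- data.frame(SalePrice = c(round(rlnorm(500, meanlog = 13.7, sdlog = 0.5)), 25e6))

# Upper cutoff: mean plus 2.5 standard deviations (one-sided, by assumption)
cutoff <- mean(sales$SalePrice) + 2.5 * sd(sales$SalePrice)

# Keep only sales at or below the cutoff
sales_trimmed <- sales[sales$SalePrice <= cutoff, , drop = FALSE]

nrow(sales) - nrow(sales_trimmed)  # number of sales removed
```

A two-sided version would also drop sales below the mean minus 2.5 standard deviations, which rarely matters for right-skewed price data.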

Next, we’ll check out the distribution of prices for each year using a violin plot.
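One possible version of such a violin plot is sketched below, with the yearly mean overlaid as a white circle via `stat_summary`. The data are synthetic and the upward drift in prices is built into the simulation, so nothing here reproduces the real result.

```r
library(ggplot2)

set.seed(1)
# Synthetic sales: 100 per year, with prices drifting upward by construction
sales <- data.frame(SaleYr = rep(2009:2015, each = 100))
sales$SalePrice <- round(
  rlnorm(nrow(sales), meanlog = 13.4 + 0.08 * (sales$SaleYr - 2009), sdlog = 0.4)
)

home_value_violin <- ggplot(sales, aes(x = factor(SaleYr), y = SalePrice)) +
  geom_violin(fill = "#0DA3A0", color = NA) +
  # White circle at the mean sale price for each year
  stat_summary(fun = mean, geom = "point", shape = 21, fill = "white", size = 3) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    title = "Sale price distribution by year (synthetic data)",
    x = "Year",
    y = "Sale price"
  )

print(home_value_violin)
```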

[Figure: violin plot of sale prices by year]

The white circles denote sale price means for each year. Not only do prices increase over time, but by the end of the time series, there are far fewer sales under the $1 million mark and many more above it.

Let’s make some maps. Our first step is to download a basemap using the fantastic ggmap package. We’ll create a bounding box delineated by the neighborhood shapefile and then download the basemap.
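The bounding box step can be sketched without the shapefile by taking the extent of a set of coordinates. Here synthetic points stand in for the neighborhood boundaries, and the `lon`/`lat` names are assumptions. The tile download itself is left out because it needs network access and, depending on the tile provider, an API key.

```r
set.seed(1)
# Synthetic coordinates standing in for the neighborhood shapefile's extent
pts <- data.frame(
  lon = runif(100, -122.51, -122.36),
  lat = runif(100, 37.71, 37.81)
)

# Named left/bottom/right/top vector: the usual way to describe a map extent
bbox <- c(
  left   = min(pts$lon),
  bottom = min(pts$lat),
  right  = max(pts$lon),
  top    = max(pts$lat)
)

bbox
```

ggmap's tile functions accept a bounding box in this left/bottom/right/top form; check the documentation of your installed ggmap version for the exact download function and its key requirements.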

[Figure: basemap of San Francisco]

Let’s put this basemap to work by using it to create a small-multiple plot of prices by year. The ‘facet’ command in ggplot makes this possible. Note that here we are calling the map theme created above.
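A stripped-down sketch of the small-multiple idea follows, using `facet_wrap` to draw one panel per year. It omits the basemap (which requires a tile download) and uses `theme_void()` in place of the map theme. The data and column names are synthetic assumptions.

```r
library(ggplot2)

set.seed(1)
# Synthetic point data: 100 sales per year scattered over a San Francisco-sized box
sales <- data.frame(
  SaleYr    = rep(2009:2015, each = 100),
  lon       = runif(700, -122.51, -122.36),
  lat       = runif(700, 37.71, 37.81),
  SalePrice = round(rlnorm(700, meanlog = 13.7, sdlog = 0.5))
)

point_map <- ggplot(sales, aes(x = lon, y = lat, color = SalePrice)) +
  geom_point(size = 0.8) +
  # One panel per sale year
  facet_wrap(~ SaleYr, nrow = 2) +
  scale_color_gradientn(
    colors = c("#0DA3A0", "#FFD700", "#FF6154"),
    labels = scales::dollar
  ) +
  coord_quickmap() +
  theme_void() +
  labs(title = "Sale price by year (synthetic data)", color = "Sale price")

print(point_map)
```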

[Figure: point map of sale prices by year]

Sales at $2 million and above spread out across the landscape over time, almost with a contagion effect. By 2015, these highest-priced sales abut Interstate 280 in the southern section of the city.

To see the full extent of the change, it may be easier to look only at the first and last years of the time series. Let’s use the ‘subset’ command to pull out just the first and last years.
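For illustration, pulling out the first and last years with `subset` could look like this on a tiny made-up table (the column names are assumptions):

```r
# Tiny synthetic table: three sales per year
sales <- data.frame(
  SaleYr    = rep(2009:2015, each = 3),
  SalePrice = seq(500000, 1500000, length.out = 21)
)

# Keep only the first and last years of the series
first_last <- subset(sales, SaleYr %in% c(min(SaleYr), max(SaleYr)))

table(first_last$SaleYr)
```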

 

[Figure: point map of sale prices, first and last years]

The price appreciation is readily apparent. Let’s zoom in on just one neighborhood in the Mission District, where economic and cultural change is reported almost daily. First we’ll create a new data frame of just sales in the ‘Inner Mission Neighborhood’ and readjust the basemap. Then we’ll build a faceted time series map of sales.

[Figure: point map of sales in the Mission District by year]

2015 appears to have been a watershed year for the Inner Mission, but what about the other neighborhoods in San Francisco? To display neighborhood-level trends, we’ll move from point maps to polygon maps. To begin, we need to do a bit of data wrangling: generating a new data frame with median, standard deviation, sale count, percent change and other statistics by neighborhood and time. We’ll then output our new data frame in a format suitable for mapping.
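The summary step could be sketched with dplyr as below. The table is tiny and hand-made so the arithmetic is easy to check, the column names are assumptions, and percent change is measured against each neighborhood's first year, which is one reasonable reading of the description rather than a confirmed detail. The join to the neighborhood polygons is not shown.

```r
library(dplyr)

# Hand-made sales: two neighborhoods, two years, three sales each
sales <- data.frame(
  Neighborhood = rep(c("Inner Mission", "Marina"), each = 6),
  SaleYr       = rep(rep(c(2009, 2015), each = 3), times = 2),
  SalePrice    = c(400, 500, 600, 900, 1000, 1100,        # Inner Mission
                   1800, 2000, 2200, 2100, 2200, 2300) * 1000  # Marina
)

nhood_stats <- sales %>%
  group_by(Neighborhood, SaleYr) %>%
  summarise(
    medianPrice = median(SalePrice),
    sdPrice     = sd(SalePrice),
    saleCount   = n(),
    .groups     = "drop"
  ) %>%
  group_by(Neighborhood) %>%
  arrange(SaleYr, .by_group = TRUE) %>%
  # Percent change in the median relative to the neighborhood's first year
  mutate(pctChange = (medianPrice - first(medianPrice)) / first(medianPrice) * 100) %>%
  ungroup()

nhood_stats
```

In this toy table the Inner Mission median doubles (100 percent) while the Marina median rises by 10 percent.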

Now we’re ready to build some neighborhood maps. First we’ll map median home price.
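A polygon (choropleth) map in ggplot boils down to `geom_polygon` with a fill mapped to the statistic of interest. The sketch below uses two made-up square "neighborhoods" in place of the real shapefile; the helper `square()`, the column names and the prices are all hypothetical.

```r
library(ggplot2)

# Hypothetical helper: the four corners of a unit square as polygon vertices
square <- function(id, x0, y0, medianPrice) {
  data.frame(
    id          = id,
    long        = x0 + c(0, 1, 1, 0),
    lat         = y0 + c(0, 0, 1, 1),
    medianPrice = medianPrice
  )
}

# Two invented neighborhoods with invented median prices
polys <- rbind(
  square("A", 0, 0, 750000),
  square("B", 1, 0, 1500000)
)

nhood_map <- ggplot(polys, aes(x = long, y = lat, group = id, fill = medianPrice)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(low = "#0DA3A0", high = "#FF6154", labels = scales::dollar) +
  coord_equal() +
  theme_void() +
  labs(title = "Median home price by neighborhood (synthetic data)", fill = "Median price")

print(nhood_map)
```

The same pattern works for percent change: map `fill` to that column instead of the median.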

[Figure: median home value by neighborhood]

Next we’ll map percent change in prices.

[Figure: percent change in neighborhood prices over time]

A picture begins to emerge when we look at percent change. It’s clear that wealthy areas around The Marina, Russian Hill and Nob Hill did not change much, while areas to the south changed dramatically. Let’s explore the rate of change by generating some time series plots for the highest appreciating neighborhoods.

First, a bit of data wrangling.

Now let’s create the time series plot.
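As a rough sketch of both steps, the code below ranks neighborhoods by percent change between the first and last years, keeps the top two, and draws one line per neighborhood. The neighborhood labels, years and prices are invented, and the choice of "top two" is arbitrary.

```r
library(ggplot2)

# Invented neighborhood-by-year medians
nhood_stats <- data.frame(
  Neighborhood = rep(c("A", "B", "C"), each = 3),
  SaleYr       = rep(c(2009, 2012, 2015), times = 3),
  medianPrice  = c(400, 600, 900, 1000, 1050, 1100, 500, 700, 1000) * 1000,
  stringsAsFactors = FALSE
)

# Wrangling: percent change from the first to the last year, per neighborhood
first_yr <- nhood_stats[nhood_stats$SaleYr == min(nhood_stats$SaleYr), ]
last_yr  <- nhood_stats[nhood_stats$SaleYr == max(nhood_stats$SaleYr), ]
change   <- merge(first_yr, last_yr, by = "Neighborhood", suffixes = c("_first", "_last"))
change$pctChange <- (change$medianPrice_last / change$medianPrice_first - 1) * 100

# Keep the two fastest-appreciating neighborhoods
top_nhoods <- head(change$Neighborhood[order(-change$pctChange)], 2)

time_series <- ggplot(
  subset(nhood_stats, Neighborhood %in% top_nhoods),
  aes(x = SaleYr, y = medianPrice, color = Neighborhood)
) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Median price over time (synthetic data)", x = "Year", y = "Median price")

print(time_series)
```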

[Figure: time series of home prices in the fastest appreciating neighborhoods]

This plot shows how quickly home prices appreciated in the City’s fastest changing neighborhoods. It is always helpful to add a bit of geographical context to plots like this. Next, we’ll generate a map cutout and merge it with the time series plot.

Here is the code to generate the map cutout.

And here is the code to merge it with the time series plot using ‘grid.arrange’.
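A minimal sketch of the combination, assuming the gridExtra package is installed: build the cutout and the time series as two separate ggplot objects, then place them side by side with `grid.arrange`. The polygons, prices and the helper `square()` are hypothetical stand-ins for the real neighborhood data.

```r
library(ggplot2)
library(gridExtra)

# Invented time series for one neighborhood
ts_data <- data.frame(
  SaleYr      = 2009:2015,
  medianPrice = seq(500000, 1100000, length.out = 7)
)
ts_plot <- ggplot(ts_data, aes(x = SaleYr, y = medianPrice)) +
  geom_line() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Year", y = "Median price")

# Hypothetical helper: a unit square standing in for a neighborhood outline
square <- function(id, x0, highlight) {
  data.frame(id = id, long = x0 + c(0, 1, 1, 0), lat = c(0, 0, 1, 1), highlight = highlight)
}
polys <- rbind(square("A", 0, TRUE), square("B", 1, FALSE))

# Map cutout: highlight the neighborhood shown in the time series
cutout <- ggplot(polys, aes(x = long, y = lat, group = id, fill = highlight)) +
  geom_polygon(color = "white") +
  scale_fill_manual(values = c("TRUE" = "#FF6154", "FALSE" = "grey80")) +
  coord_equal() +
  theme_void() +
  theme(legend.position = "none")

# Draw both side by side; the time series gets twice the width of the cutout
combined <- grid.arrange(cutout, ts_plot, ncol = 2, widths = c(1, 2))
```

`grid.arrange` draws immediately; `arrangeGrob` takes the same arguments and returns the layout without drawing, which is handy when saving to a file.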

[Figure: time series with map cutout]

Finally, is there anything more concrete that we can conclude from these data? Anecdotally, we know that San Francisco is changing. We know some neighborhoods are changing faster than others.

Typically, we would join these home sale data to external social, demographic and economic factors to explain why neighborhoods are changing so fast. As we pointed out in our recent piece predicting gentrification however, there are a whole host of ‘endogenous’ or price-related features that are important to note when understanding neighborhood change.

Our final plot conveys one such phenomenon by plotting percent change in sale price as a function of initial, 2009 prices.
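The plotting pattern is a scatterplot with a fitted line, sketched below. The data are simulated, and the negative relationship is built into the simulation purely so the sketch has something to draw. It is not evidence for or against the finding. Column names are assumptions.

```r
library(ggplot2)

set.seed(1)
# Simulated neighborhoods: the downward slope is imposed by construction
nhoods <- data.frame(price2009 = runif(30, 400000, 2500000))
nhoods$pctChange <- 120 - 35 * (nhoods$price2009 / 1e6) + rnorm(30, sd = 10)

change_scatter <- ggplot(nhoods, aes(x = price2009, y = pctChange)) +
  geom_point(color = "#0DA3A0") +
  # Straight-line fit to summarize the relationship
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "#FF6154") +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    title = "Percent change vs. initial price (simulated data)",
    x = "Median price, 2009",
    y = "Percent change"
  )

print(change_scatter)
```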

[Figure: scatterplot of percent change in sale price versus 2009 price]

This plot shows that, by and large, lower priced neighborhoods in 2009 saw the greatest rates of price appreciation throughout the study period.

If you’re interested in this phenomenon, you may wish to check out the Rent Gap theory, a now-classic Marxist geography explanation of gentrification first suggested by the late Neil Smith in 1979.

We hope you enjoyed this tutorial. If you want to access the entire code base in one script, you can access it via Simon’s GitHub page.

Ken Steif, PhD is the founder of Urban Spatial. He is also the director of the Master of Urban Spatial Analytics program at the University of Pennsylvania. You can follow him on Twitter @KenSteif.

Simon Kassel will be receiving his Master of City Planning from the University of Pennsylvania in the Spring of 2017. You can follow him on Twitter @SimonKassel.