Authors: Ken Steif & Simon Kassel
I constantly preach to my students that no matter how policy-relevant your work is, if it cannot be conveyed to a non-technical decision maker, it isn't useful. This conviction is why data visualization is such an important tool for data scientists.
Because so much public policy has spatial implications, much of the data visualization we produce at Urban Spatial is geographic in nature.
My colleague Simon and I recently worked together on a machine learning model of gentrification using Census data throughout the U.S. In that piece we discuss the importance of parcel-level data.
We thought we’d revisit the subject here using higher resolution data and report on our findings by way of data visualization.
Given our mutual interest in making maps in R, we share our code in this piece in an effort to get others interested in mapping with ggplot and associated packages.
Our data consists of 17,527 single family home sales in San Francisco between 2009 and 2015. The data have been cleaned and each sale has been associated with a neighborhood.
In this tutorial we’ll explore the rapid neighborhood change that has occurred in San Francisco in recent years by constructing time series plots as well as point and polygon maps.
To begin, open up a new R script, set your working directory and a couple of options, and install/load the following libraries:
```r
# Set a working directory on your local machine
setwd("")

# Turn off scientific notation
options(scipen = "999")

# Ensure strings come in as character types
options(stringsAsFactors = FALSE)

# Load packages
# Note: you must have installed each of these packages before loading them
# Note 2: There may be some versioning issues with ggplot & ggmap.
# Check out this stackoverflow thread: http://bit.ly/2lXHRFJ
library(ggplot2)
library(ggmap)
library(maptools)
library(ggthemes)
library(rgeos)
library(broom)
library(dplyr)
library(plyr)
library(grid)
library(gridExtra)
library(reshape2)
library(scales)
```
Next, we’re going to define two themes that will tell ggplot how to construct both maps and plots. Defining our themes up front ensures that we don’t have to repeat this code over and again for every plot we generate below.
We’ll also define some color palettes. Check out the ‘Zonum Solutions’ Color Ramp Generator for defining custom color ramps.
We’ll create several separate ramps depending on how many colors we are going to need for a given plot.
```r
# Define one that we will use for plots
plotTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 18, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size = 12),
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 8),
    axis.title.x = element_text(hjust = 1),
    axis.title.y = element_text(hjust = 1),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"))
}

# And another that we will use for maps
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 18, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size = 12),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"))
}

# Define some palettes
palette_9_colors <- c("#0DA3A0","#2999A9","#458FB2","#6285BB","#7E7CC4","#9A72CD","#B768D6","#D35EDF","#F055E9")
palette_8_colors <- c("#0DA3A0","#2D97AA","#4D8CB4","#6E81BF","#8E76C9","#AF6BD4","#CF60DE","#F055E9")
palette_7_colors <- c("#2D97AA","#4D8CB4","#6E81BF","#8E76C9","#AF6BD4","#CF60DE","#F055E9")
palette_1_colors <- c("#0DA3A0")
```
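As an aside, you don't have to hand-code a ramp for every length you might need. Base R's grDevices::colorRampPalette will interpolate a ramp of any length between two endpoint colors — it won't reproduce a Zonum ramp exactly, but it is a handy sketch of the same idea, using the teal and magenta endpoints from above:

```r
# Interpolate an n-color ramp between the teal and magenta endpoints used above
ramp <- colorRampPalette(c("#0DA3A0", "#F055E9"))
palette_9_alt <- ramp(9)  # a drop-in alternative to palette_9_colors
palette_7_alt <- ramp(7)
```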
Next we’ll retrieve the data. First the home price data:
```r
# Read in a csv of home sale transactions directly from github.
sf <- read.csv("https://raw.githubusercontent.com/simonkassel/Visualizing_SF_home_prices_R/master/Data/SF_home_sales_demo_data.csv")

# We will need to consider sale year as a categorical variable,
# so we convert it from a numeric variable to a factor
sf$SaleYr <- as.factor(sf$SaleYr)
```
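Before moving on, it's worth a quick sanity check on what came in — confirming the row count and the fields the rest of the tutorial relies on. A minimal check of our own, not part of the original script:

```r
# Confirm dimensions and the fields used throughout the tutorial
nrow(sf)  # should be 17,527 sales
str(sf[, c("SalePrice", "SaleYr", "Neighborhood", "long", "lat")])
summary(sf$SalePrice)
```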
We’ll also download and unzip a shapefile of neighborhoods in San Francisco.
```r
# Define the URL of the zipped shapefile
URL <- "https://github.com/simonkassel/Visualizing_SF_home_prices_R/raw/master/Data/SF_neighborhoods.zip"

# Download the shapefile to your working directory and unzip it.
download.file(URL, "SF_neighborhoods.zip")
unzip("SF_neighborhoods.zip")

# Read it into R as a spatial polygons data frame & plot it
neighb <- readShapePoly("SF_neighborhoods")
plot(neighb)
```
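One caveat: maptools::readShapePoly ignores the shapefile's .prj file and has been deprecated in favor of rgdal::readOGR. If it gives you trouble, an equivalent read with rgdal looks like the sketch below — an assumption on our part, since rgdal isn't loaded above, so install and load it first:

```r
# Alternative: read the shapefile with rgdal, which also picks up the projection
library(rgdal)
neighb <- readOGR(dsn = ".", layer = "SF_neighborhoods")
plot(neighb)
```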
Let’s build some plots. First off, let’s check out the distribution of home prices for the entire dataset.
```r
home_value_hist <- ggplot(sf, aes(SalePrice)) +
  geom_histogram(fill = palette_1_colors) +
  plotTheme() +
  labs(x = "Sale Price($)", y = "Count",
       title = "Distribution of San Francisco home prices",
       subtitle = "Nominal prices (2009 - 2015)",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma)

# Plotting it:
home_value_hist

# And saving it to the working directory:
ggsave("plot1_histogram.png", home_value_hist, width = 8, height = 4, device = "png")
```
It seems as though there may be some outliers. We’ll remove anything greater than 2.5 standard deviations from the mean.
```r
# Keep only sales within 2.5 standard deviations of the mean price
sf <- sf[which(sf$SalePrice < mean(sf$SalePrice) + (2.5 * sd(sf$SalePrice))), ]
```
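Curious how many sales that one-liner drops? Run this just before it — a quick check of our own, not in the original script:

```r
# How many sales exceed the 2.5 standard deviation threshold?
threshold <- mean(sf$SalePrice) + 2.5 * sd(sf$SalePrice)
sum(sf$SalePrice >= threshold)
```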
Next, we’ll check out the distribution of prices for each year using a violin plot.
```r
home_value_violin <- ggplot(sf, aes(x = SaleYr, y = SalePrice, fill = SaleYr)) +
  geom_violin(color = "grey50") +
  scale_fill_manual(values = palette_7_colors) +
  stat_summary(fun.y = mean, geom = "point", size = 2, colour = "white") +
  plotTheme() + theme(legend.position = "none") +
  scale_y_continuous(labels = comma) +
  labs(x = "Year", y = "Sale Price($)",
       title = "Distribution of San Francisco home prices",
       subtitle = "Nominal prices (2009 - 2015); Sale price means visualized as points",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

home_value_violin
ggsave("plot2_violin.png", home_value_violin, width = 8, height = 4, device = "png")
The white circles denote sale price means for each year. Not only do prices increase over time, but by the end of the time series there are far fewer sales under the $1 million mark and many more above it.
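To put numbers behind that impression, here is a quick tabulation — ours, not part of the original walkthrough — of the median price and the share of sales above $1 million by year:

```r
# Median sale price and share of sales over $1M, by year
ddply(sf, ~SaleYr, summarise,
      medianPrice = median(SalePrice),
      shareOver1M = round(mean(SalePrice > 1000000), 2))
```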
Let's make some maps. Our first step is to download a basemap using the fantastic ggmap package. We'll create a bounding box delineated by the neighborhood shapefile and then download the basemap.
```r
# Define the bounding box
bbox <- neighb@bbox

# Manipulate these values slightly so that we get some padding on our basemap
# between the edge of the data and the edge of the map
sf_bbox <- c(left = bbox[1, 1] - .01, bottom = bbox[2, 1] - .005,
             right = bbox[1, 2] + .01, top = bbox[2, 2] + .005)

# Download the basemap
basemap <- get_stamenmap(
  bbox = sf_bbox,
  zoom = 13,
  maptype = "toner-lite")

# Map it
bmMap <- ggmap(basemap) + mapTheme() +
  labs(title = "San Francisco basemap")
bmMap
```
Let's put this basemap to work, using it to create a small multiple plot of prices by year. The facet_wrap command in ggplot makes this possible. Note that here we are calling the map theme created above.
```r
prices_mapped_by_year <- ggmap(basemap) +
  geom_point(data = sf, aes(x = long, y = lat, color = SalePrice),
             size = .25, alpha = 0.6) +
  facet_wrap(~SaleYr, scales = "fixed", ncol = 4) +
  coord_map() +
  mapTheme() + theme(legend.position = c(.85, .25)) +
  scale_color_gradientn("Sale Price",
                        colors = palette_8_colors,
                        labels = scales::dollar_format(prefix = "$")) +
  labs(title = "Distribution of San Francisco home prices",
       subtitle = "Nominal prices (2009 - 2015)",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

prices_mapped_by_year
```
Prices at $2 million and above move out across the landscape, almost with a contagion effect. By 2015, these highest-priced sales abut Interstate 280 in the southern section of the city.
To see the full extent of the change, it may be easier to look only at the first and last years of the time series. Let’s use the ‘subset’ command to pull out just the first and last years.
We'll stack the two maps and increase the point size.

```r
prices_mapped_2009_2015 <- ggmap(basemap) +
  geom_point(data = subset(sf, sf$SaleYr == 2015 | sf$SaleYr == 2009),
             aes(x = long, y = lat, color = SalePrice),
             size = 1, alpha = 0.75) +
  facet_wrap(~SaleYr, scales = "fixed", ncol = 1) +
  coord_map() +
  mapTheme() +
  scale_color_gradientn("Sale Price",
                        colors = palette_8_colors,
                        labels = scales::dollar_format(prefix = "$")) +
  labs(title = "Distribution of San Francisco home prices",
       subtitle = "Nominal prices (2009 & 2015)",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

prices_mapped_2009_2015
```
The price appreciation is readily apparent. Let's zoom in on just one neighborhood in the Mission District, where economic and cultural change is reported almost daily. First we'll create a new data frame of just the sales in the Inner Mission neighborhood and download a new basemap at the appropriate scale. Then we'll build a faceted time series map of sales.
```r
# Subset sales for the Inner Mission neighborhood
missionSales <- sf[which(sf$Neighborhood == "Inner Mission"), ]

# Create a new basemap at the appropriate scale
centroid_lon <- median(missionSales$long)
centroid_lat <- median(missionSales$lat)
missionBasemap <- get_map(location = c(lon = centroid_lon, lat = centroid_lat),
                          source = "stamen", maptype = "toner-lite", zoom = 15)

# Create a facet map by year
mission_mapped_by_year <- ggmap(missionBasemap) +
  geom_point(data = missionSales, aes(x = long, y = lat, color = SalePrice),
             size = 2) +
  facet_wrap(~SaleYr, scales = "fixed", ncol = 4) +
  coord_map() +
  mapTheme() + theme(legend.position = c(.85, .25)) +
  scale_color_gradientn("Sale Price",
                        colors = palette_8_colors,
                        labels = scales::dollar_format(prefix = "$")) +
  labs(title = "Distribution of Mission District home prices",
       subtitle = "Nominal prices (2009 - 2015)",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

mission_mapped_by_year
```
2015 appears to have been a watershed year for the Inner Mission, but what about the other neighborhoods in San Francisco? To display neighborhood-level trends, we'll move from point maps to polygon maps. To begin, we need to do a bit of data wrangling: generating a new data frame with median, standard deviation, sale count, percent change and other statistics by neighborhood and year. We'll then output our new data frame in a format suitable for mapping.
```r
# We'll start by summarizing our existing sales data frame
sf.summarized <- ddply(sf, c("Neighborhood", "SaleYr"), summarise,
                       medianPrice = median(SalePrice),
                       saleCount = length(SaleYr),
                       sdPrice = sd(SalePrice),
                       minusSd = medianPrice - sdPrice,
                       plusSD = medianPrice + sdPrice,
                       .progress = "text")
head(sf.summarized, 10)

# Now we'll calculate the average annual home sale count for each neighborhood
yearly_sales <- ddply(sf.summarized, ~Neighborhood, summarise,
                      avg.yearly.sales = mean(saleCount))

# Then we will use a left join to join the average sale count data frame
# to the original summarized data frame.
sf.summarized <- left_join(sf.summarized, yearly_sales, by = "Neighborhood")

# Next, we'll calculate the % change in neighborhood median home value.
# In order to do this we will need to reshape the data again,
# using the dcast function in the package reshape2.
medByYear <- dcast(sf.summarized, Neighborhood ~ SaleYr, value.var = "medianPrice")

# Check out the reshaped data frame
head(medByYear)

# We now have a data frame in which each row is a neighborhood, each column
# is a year and the values are the corresponding median prices.
# Now we can easily calculate the % change from 2009 to 2015.
medByYear$pctChange <- (medByYear$`2015` - medByYear$`2009`) / medByYear$`2009`

# And join it back to our sf.summarized dataset
sf.summarized <- left_join(sf.summarized, medByYear[, c("Neighborhood", "pctChange")],
                           by = "Neighborhood")

# Some neighborhoods have very low annual sale counts.
# We will remove these neighborhoods from the dataset by converting them to NA.
sf.summarized$pctChange <- ifelse(sf.summarized$avg.yearly.sales < 10, NA,
                                  sf.summarized$pctChange)

# Remember the neighborhood shapefile we imported at the beginning?
# We must now convert it to a format that ggplot understands.
neighb.tidy <- tidy(neighb, region = c('nbrhood'))

# Look at the resulting data frame to see how it has been transformed
head(neighb.tidy)

# Create an identical field to 'id' but with a name that will allow us
# to join the data frame to our summarized price data
neighb.tidy$Neighborhood <- neighb.tidy$id

# Now we're going to join these data frames together so that when we map the
# neighborhood polygons we can symbolize them using the summary stats we created
sf.summarized_tidy <- join(sf.summarized, neighb.tidy, by = "Neighborhood", match = "all")
```
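Joins like these fail silently when key values don't line up, so before mapping it's worth confirming that every summarized neighborhood found a matching polygon — a quick check of our own, not in the original script:

```r
# Neighborhoods in the sales summary with no matching polygon;
# character(0) means everything matched
setdiff(sf.summarized$Neighborhood, neighb.tidy$Neighborhood)
```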
Now we're ready to build some neighborhood maps. First we'll map median home price.
```r
neighb_map <- ggmap(basemap) +
  geom_polygon(data = sf.summarized_tidy,
               aes(x = long, y = lat, group = group, fill = medianPrice),
               colour = "white", alpha = 0.75, size = 0.25) +
  scale_fill_gradientn("Neighborhood \nMedian \nSale Price",
                       colors = palette_8_colors,
                       labels = scales::dollar_format(prefix = "$")) +
  mapTheme() + theme(legend.position = c(.85, .25)) +
  coord_map() +
  facet_wrap(~SaleYr, nrow = 2) +
  labs(title = "Median home price by neighborhood, San Francisco",
       subtitle = "Nominal prices (2009 - 2015)",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

neighb_map
```
Next we’ll map percent change in prices.
```r
# Plot the percent change in neighborhood median home value over time
change_map <- ggmap(basemap) +
  geom_polygon(data = sf.summarized_tidy[which(sf.summarized_tidy$SaleYr == 2015), ],
               aes(x = long, y = lat, group = group, fill = pctChange),
               colour = "white", alpha = 0.75, size = 0.25) +
  coord_map() +
  scale_fill_gradientn("% Change", colors = palette_8_colors,
                       labels = scales::percent_format()) +
  mapTheme() +
  theme(legend.position = "bottom", legend.direction = "horizontal",
        legend.key.width = unit(.5, "in")) +
  labs(title = "Percent change in median home prices, San Francisco",
       subtitle = "Nominal prices (2009 - 2015)",
       caption = "Source: San Francisco Office of the Assessor-Recorder\nNeighborhoods with too few annual transactions are withheld\n@KenSteif & @SimonKassel")

change_map
```
A picture begins to emerge when we look at percent change. It's clear that wealthy areas around the Marina, Russian Hill and Nob Hill did not change much, while areas to the south changed dramatically. Let's explore the rate of change by generating time series plots for the highest-appreciating neighborhoods.
First, a bit of data wrangling.
```r
# Let's look at the top 8 most appreciating neighborhoods.
topPctChange <- unique(sf.summarized$pctChange) %>%
  sort(decreasing = TRUE) %>%
  head(8)

# We'll use these percentages to subset our neighborhoods data frame
sfForTimeSeries <- sf.summarized[which(sf.summarized$pctChange %in% topPctChange), ]
```
Now let’s create the time series plot.
```r
time.series <- ggplot(sfForTimeSeries, aes(x = SaleYr, group = Neighborhood)) +
  geom_line(aes(y = medianPrice)) +
  geom_ribbon(aes(ymin = minusSd, ymax = plusSD, fill = Neighborhood),
              alpha = 0.75) +
  facet_wrap(~Neighborhood, scales = "fixed", nrow = 4) +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
  ylab("Neighborhood median home price") + xlab(NULL) +
  plotTheme() +
  theme(legend.position = "none",
        panel.spacing.y = unit(1, "lines")) +
  scale_fill_manual(values = palette_8_colors) +
  labs(title = "Time series for highest growth neighborhoods, San Francisco",
       subtitle = "Nominal prices (2009-2015); Median; Ribbon indicates 1 standard deviation",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

time.series
```
This plot shows how quickly home prices appreciated in the City’s fastest changing neighborhoods. It is always helpful to add a bit of geographical context to plots like this. Next, we’ll generate a map cutout and merge it with the time series plot.
Here is the code to generate the map cutout.
```r
sampleNeighborhoods <- ggmap(basemap) +
  geom_polygon(data = neighb.tidy, aes(x = long, y = lat, group = group),
               colour = NA, fill = "black", alpha = 1) +
  geom_polygon(data = neighb.tidy[which(neighb.tidy$Neighborhood %in%
                                          sfForTimeSeries$Neighborhood), ],
               aes(x = long, y = lat, group = group, fill = Neighborhood),
               colour = "black") +
  coord_map() +
  scale_fill_manual(values = palette_9_colors) +
  mapTheme() +
  theme(legend.position = "bottom", legend.direction = "horizontal",
        legend.key.size = unit(0.15, "in"), legend.title = element_blank()) +
  guides(fill = guide_legend(ncol = 2))

sampleNeighborhoods
```
And here is the code to merge it with the time series plot using ‘grid.arrange’.
```r
# First we'll create a blank grob (or grid object) which will allow us to
# appropriately space out our plot below
blank <- grid.rect(gp = gpar(col = "white"))

# Next, we will create an ordered list of the plots we want to use, including 'blank'
gs <- list(time.series, blank, sampleNeighborhoods)

# Finally, we will denote a matrix that will dictate the layout of the plots.
# Each number in the matrix refers to a position in the list of plots.
lay <- rbind(c(1,1,2),
             c(1,1,2),
             c(1,1,3),
             c(1,1,3),
             c(1,1,2),
             c(1,1,2))

# Arrange the plot
grid.arrange(grobs = gs, layout_matrix = lay)
```
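Note that ggsave won't capture grid.arrange directly, since grid.arrange draws to the device rather than returning a ggplot object. If you want to save the composite, one option is gridExtra::arrangeGrob, which builds the same layout as a grob that ggsave can write out — a sketch, with a file name of our own choosing:

```r
# Build the layout as a grob without drawing it, then save it to disk
combined <- arrangeGrob(grobs = gs, layout_matrix = lay)
ggsave("plot_time_series_map.png", combined, width = 10, height = 8, device = "png")
```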
Finally, is there anything more concrete that we can conclude from these data? Anecdotally, we know that San Francisco is changing. We know some neighborhoods are changing faster than others.
Typically, we would join these home sale data to external social, demographic and economic factors to explain why neighborhoods are changing so fast. As we pointed out in our recent piece predicting gentrification, however, there is a whole host of 'endogenous' or price-related features that are important to consider when trying to understand neighborhood change.
Our final plot conveys one such phenomenon by plotting percent change in sale price as a function of initial, 2009 prices.
```r
# First we create a data frame of just 2009 and remove neighborhoods
# that have an NA value for pctChange due to insufficient sale volume
sf.2009 <- sf.summarized[which(sf.summarized$SaleYr == 2009), ] %>% na.omit()

# Then create the scatterplot using neighborhood name labels instead of points
change_scatterplot <- ggplot(sf.2009, aes(x = pctChange, y = medianPrice, label = Neighborhood)) +
  geom_label(data = sf.2009[which(!sf.2009$pctChange %in% topPctChange), ],
             aes(label = Neighborhood), fill = "grey20", size = 2, color = "white") +
  geom_label(data = sf.2009[which(sf.2009$pctChange %in% topPctChange), ],
             aes(label = Neighborhood, fill = Neighborhood), size = 2, color = "white") +
  scale_fill_manual(values = palette_9_colors) +
  geom_smooth(method = "lm", se = FALSE) +
  plotTheme() + theme(legend.position = "none") +
  scale_y_continuous(labels = dollar_format(prefix = "$")) +
  scale_x_continuous(labels = percent,
                     limits = c(min(sf.2009$pctChange - .04),
                                max(sf.2009$pctChange + .025))) +
  labs(x = "% Change", y = "Median Sale Price (2009)",
       title = "Change in home price as a function of initial price",
       subtitle = "Median price; Change between 2009 - 2015",
       caption = "Source: San Francisco Office of the Assessor-Recorder\n@KenSteif & @SimonKassel")

change_scatterplot
```
This plot shows that, by and large, the neighborhoods that were lowest-priced in 2009 saw the greatest rates of price appreciation over the study period.
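If you'd rather put a number on that relationship than eyeball the regression line, a quick follow-up — ours, not in the original script — is to test the correlation, or inspect the same linear fit geom_smooth drew above:

```r
# Correlation between 2009-2015 appreciation and initial (2009) median price
cor.test(sf.2009$pctChange, sf.2009$medianPrice)

# The linear fit underlying the geom_smooth line in the scatterplot
summary(lm(medianPrice ~ pctChange, data = sf.2009))
```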
If you're interested in this phenomenon, you may wish to check out rent gap theory, a now-classic Marxist geography explanation of gentrification first suggested by the late Neil Smith in 1979.
We hope you enjoyed this tutorial. If you want the entire code base in one script, you can find it on Simon's GitHub page.
Ken Steif, PhD is the founder of Urban Spatial. He is also the director of the Master of Urban Spatial Analytics program at the University of Pennsylvania. You can follow him on Twitter @KenSteif.
Simon Kassel will receive his Master of City Planning degree from the University of Pennsylvania in the spring of 2017. You can follow him on Twitter @SimonKassel.