Analyzing Nationwide Utility Rates with R, SQL and Plotly

R and SQL make excellent complements for analyzing data due to their respective strengths. The sqldf package provides an interface for working with SQL in R by querying data from a database into an R data.frame. This post will demonstrate how to query and analyze data using the sqldf package in conjunction with the graphing libraries plotly and ggplot2 as well as some other packages that provide useful statistical tests and other functions.

The data that will be examined in this post is the U.S. Electric Utility Companies and Rates dataset, compiled by data.gov. The data is described as follows:

This dataset, compiled by NREL using data from Ventyx and the U.S. Energy Information Administration dataset 861, provides average residential, commercial and industrial electricity rates by zip code for both investor owned utilities (IOU) and non-investor owned utilities. Note: the file includes average rates for each utility, but not the detailed rate structure data found in the OpenEI U.S. Utility Rate Database. A more recent version of this data is also available through the NREL Utility Rate API with more search options. This data was released by NREL/Ventyx in February 2011.

Start by loading the sqldf package and the other packages that will be used.

library(sqldf)
library(plotly)
library(ggplot2)
library(reshape2)
library(dplyr)
library(agricolae)
library(cluster)

Loading the Data Using R and SQL

The dataset comes in two csv files. To load the data to use with SQL, a file connection is created for each csv.

iou <- file('../data/iouzipcodes2011.csv')
nou <- file('../data/noniouzipcodes2011.csv')

A combined data.frame with both files can be created using a SQL query with the sqldf package. The two datasets are appended using the UNION ALL operator. Here is a link to where I found this useful method to load the data into R using SQL.

utility_df <- sqldf('SELECT * FROM iou 
                     UNION ALL 
                     SELECT * FROM nou', dbname='utility_db')

To quickly test the data was loaded successfully into an R data.frame using the SQL query above, inspect the first five rows of the database using the following query.

sqldf('SELECT * FROM utility_df LIMIT 5')

##     zip eiaid     utility_name state service_type      ownership comm_rate
## 1 35218   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 2 35219   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 3 35214   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 4 35215   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 5 35216   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
##     ind_rate  res_rate
## 1 0.06029244 0.1149433
## 2 0.06029244 0.1149433
## 3 0.06029244 0.1149433
## 4 0.06029244 0.1149433
## 5 0.06029244 0.1149433

This query is essentially the same as the head() function as seen below.

head(utility_df, 5)

##     zip eiaid     utility_name state service_type      ownership comm_rate
## 1 35218   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 2 35219   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 3 35214   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 4 35215   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
## 5 35216   195 Alabama Power Co    AL      Bundled Investor Owned 0.1057612
##     ind_rate  res_rate
## 1 0.06029244 0.1149433
## 2 0.06029244 0.1149433
## 3 0.06029244 0.1149433
## 4 0.06029244 0.1149433
## 5 0.06029244 0.1149433

Analyzing the Data with R and SQL

Let's say we are interested in finding average electricity rates for each group (commercial, residential, industrial) at a state level. The following query is used to collect the desired data into an R data.frame. The AVG() aggregation used in the query as we are interested in the overall average of all utilities by state.

state.avg.rates <- sqldf('SELECT state, AVG(comm_rate) as comm_rate, AVG(ind_rate) as ind_rate, AVG(res_rate) as res_rate  
                          FROM utility_df 
                          GROUP BY state 
                          ORDER BY state')

head(state.avg.rates)

##   state  comm_rate   ind_rate   res_rate
## 1    AK 0.31816373 0.08144623 0.35434410
## 2    AL 0.08038278 0.06528392 0.08398812
## 3    AR 0.07966575 0.06387541 0.09221964
## 4    AZ 0.10138372 0.07438303 0.11264911
## 5    CA 0.10370644 0.07208640 0.13453128
## 6    CO 0.10266863 0.07872797 0.11884693

The query outputs a data.frame containing each state's average electricity rates for each group. This data can be visualized in a choropleth graph with the plotly library.

# Define what is displayed when hovering over the states in the choropleth graph.

state.avg.rates$hover <- with(state.avg.rates, paste(state, '<br>', 'Commercial Rate', round(comm_rate, 3), 
                              '<br>', 'Industrial Rate', round(ind_rate, 3),
                              '<br>', 'Residential Rate', round(res_rate, 3)))

# Set the state borders to white
state.border <- list(color = toRGB('white'))

# Define how the choropleth graph will be displayed
geo <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

# Build the graph and plot.
p <- plot_ly(state.avg.rates, z = round(state.avg.rates$res_rate, 3), text = state.avg.rates$hover, locations = state.avg.rates$state, type = 'choropleth',
        locationmode = 'USA-states', colors = 'Blues', marker = list(line = state.border)) %>%
  layout(title = 'Average Electricity Rates by State', geo = geo)

p

The shading represents the average residential rates by state as specified in the plotting function. Several quick inferences can be made from the choropleth graph:

Hawaii and Alaska have the highest residential electricity rates by far at about .35, with Vermont next at .17.
Most of the Mid and Central United States average between a residential rate of 0.09 to .13.
Washington, Oregon, Nevada, and Tennesse have the lowest residential electricity rates.

The states can also be clustered using the kmeans() function in R and plotted with the clustplot function from the cluster package. The number of clusters is set at six due to the number of ticks output from the choropleth graph (completely arbitrary and should not be done in practice, for example, purposes only).

state.avg.rates$state <- as.factor(state.avg.rates$state)
rownames(state.avg.rates) <- state.avg.rates$state
clust <- kmeans(state.avg.rates[,2:4], 6)
clusplot(state.avg.rates[,2:4], clust$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main = "Cluster Plot of States' Electricity Rates")

As one might expect from the choropleth graph, Hawaii and Alaska are put into their own cluster. The extreme values of Hawaii and Alaska make it harder to visualize the clusters of the other states. Let's perform another query using SQL's WHERE clause to remove the states in question.

state.avg.rates2 <- sqldf('SELECT state, AVG(comm_rate) as comm_rate, AVG(ind_rate) as ind_rate, AVG(res_rate) as res_rate  
                           FROM utility_df 
                           WHERE NOT state IN ("AK", "HI") 
                           GROUP BY state 
                           ORDER BY state')

With Hawaii and Alaska removed, plot a new choropleth graph to display the state average residential electricity rate.

p <- plot_ly(state.avg.rates2, z = round(state.avg.rates2$res_rate, 3), text = paste('Residential Rate:', round(state.avg.rates2$res_rate, 3)), locations = state.avg.rates2$state, type = 'choropleth',
        locationmode = 'USA-states', colors = 'Blues', marker = list(line = state.border)) %>%
  layout(title = 'Average Residential Electricity Rates by State', geo = geo)

p

The new graph is much more clear in visualizing how the average residential electricity rates differ by state. Plotting a new cluster plot with the re-queried data provides a better view of how the states are clustered.

state.avg.rates2$state <- as.factor(state.avg.rates2$state)
rownames(state.avg.rates2) <- state.avg.rates2$state
clust2 <- kmeans(state.avg.rates2[,2:4], 6)
clusplot(state.avg.rates2[,2:4], clust2$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

As suspected from our original inferences, the states with the lowest rates (Oregon, Washington, Tennesse and Nevada) are placed in their own respective cluster. Vermont is set in its own cluster. Most of the states are arranged in clusters five and six, with mostly Northeastern states occupying separate clusters.

The data includes rate by the type of ownership of a particular utility. Are there any significant differences in rates depending on the type of ownership?

To see the different classifications of ownership, a quick query can be run.

sqldf('SELECT DISTINCT ownership FROM utility_df')

##               ownership
## 1        Investor Owned
## 2           Cooperative
## 3             Municipal
## 4 Political Subdivision
## 5                 State
## 6 Retail Power Marketer
## 7               Federal

The query shows there are seven types of ownership classification. Let's write another query to pull the rates by state and ownership type.

ownership.rate <- sqldf('SELECT state, ownership, utility_name, 
                         AVG(comm_rate) as comm_rate, AVG(ind_rate) as ind_rate, AVG(res_rate) as res_rate 
                         FROM utility_df 
                         GROUP BY state, ownership, utility_name  
                         ORDER BY state')

With the data queried into an R data.frame, it can then be melted and reshaped using the reshape2 and dplyr packages to reshape into the format we need to plot.

ownership.melted <- melt(ownership.rate, id.vars = 'ownership', 
                         measure.vars = c('res_rate', 'comm_rate', 'ind_rate'), variable.name = 'rate')

ownership.rates2 <- dcast(ownership.melted, formula = ownership ~ rate, fun.aggregate = mean)
ownership.rates2 <- ownership.rates2[order(-ownership.rates2$res_rate),]

p <- plot_ly(ownership.rates2, x=ownership.rates2$ownership, y=round(ownership.rates2$res_rate, 3), type='bar') %>%
  layout(title = 'Average Residential Electricity Rates by Utility Ownership', yaxis = list(title = 'Electricity Rate'))

p

The plot shows most ownership classifications are similar regarding average residential electricity rates. Federal rates are by far the lowest, which could be due to these rates representing low-income housing or other federally sponsored housing projects.

To see if there are any significant differences between the residential rates of the ownership classifications, we can run a Kruskal-Wallis post-hoc test using the agricolae package.

ownership.res <- filter(ownership.melted, rate == 'res_rate') # First filter all rate types except for Residential
kruskal(ownership.res$value, ownership.res$ownership, console = TRUE)

## 
## Study: ownership.res$value ~ ownership.res$ownership
## Kruskal-Wallis test's
## Ties or no Ties
## 
## Critical Value: 83.92807
## Degrees of freedom: 6
## Pvalue Chisq  : 5.551115e-16 
## 
## ownership.res$ownership,  means of the ranks
## 
##                       ownership.res.value   r
## Cooperative                      939.7176 857
## Federal                          217.2857   7
## Investor Owned                   724.7676 185
## Municipal                        814.1767 566
## Political Subdivision            565.5976  82
## Retail Power Marketer           1284.5000   2
## State                            765.4286   7
## 
## Post Hoc Analysis
## 
## t-Student: 1.961361
## Alpha    : 0.05
## Groups according to probability of treatment differences and alpha level.
## 
## Treatments with the same letter are not significantly different.
## 
##                       ownership.res$value groups
## Retail Power Marketer           1284.5000      a
## Cooperative                      939.7176      a
## Municipal                        814.1767      a
## State                            765.4286     ab
## Investor Owned                   724.7676     ab
## Political Subdivision            565.5976     bc
## Federal                          217.2857      c

The Kruskal-Wallis test output corresponds to the quick inferences made above on the graph. Utilities owned by Retail Power Marketers, Cooperatives and Municipalities are not significantly different from each other. State and Investor Owned utilities are not significantly distinct from the former three types and Political Subdivisions. Federally owned services are significantly different than each other ownership classification, as was expected.

Further Exploration of the Data

Suppose one is interested in finding the utilities with the highest residential electricity rates. We can write a query to pull this data from the database and plot. The query returns the top 20 utilities to preserve readability as specified with the LIMIT clause.

utilities <- sqldf('SELECT utility_name, state, res_rate 
                    FROM utility_df 
                    GROUP BY utility_name, state 
                    ORDER BY res_rate DESC 
                    LIMIT 20')

p <- plot_ly(utilities, x=utilities$utility_name, y=round(utilities$res_rate, 3), type='bar', group = utilities$state) %>%
  layout(title = 'Highest Residential Rates by Utility', yaxis = list(title = 'Electricity Rate'))

## Warning in plot_ly(utilities, x = utilities$utility_name, y = round(utilities$res_rate, : The group argument has been deprecated. Use `group_by()` or split instead.
## See `help('plotly_data')` for examples

Interestingly, 18 of the 20 utilities queried are in Alaska.

Which states have the most utilities? Intuition would say larger states such as Texas and California would have the most utilities.

owned <- sqldf('SELECT state, COUNT(DISTINCT utility_name) as utilities 
                FROM utility_df 
                GROUP BY state 
                ORDER BY utilities DESC 
                LIMIT 10')

p <- plot_ly(owned, x=owned$state, y=owned$utilities, type='bar') %>%
  layout(title = 'States with the Most Utilities', yaxis = list(title = 'Count of Utilities'))

p

Contrary to our intuition, several states with smaller areas such as Georgia, Iowa and Tennesse are those with the most utilities.

Of those states in the above chart, are there significant differences in the ownership classification of the utilities?

owned.type <- sqldf('SELECT state, ownership, COUNT(DISTINCT utility_name) as utilities 
                FROM utility_df 
                WHERE state IN ("TX", "IA", "TN", "NC", "IN", "GA", "MN", "MO", "WI", "OH")
                GROUP BY state, ownership 
                ORDER BY utilities DESC')

ggplot(owned.type, aes(state, utilities)) + 
  geom_bar(aes(fill=ownership), position='dodge', stat='identity') + 
  labs(title = 'Utilities by Ownership Type', ylab = 'Count of Utilities')

Cooperative and municipal-owned utilities are the most common type of ownership in the states considered. To see the proportions of the different types of ownership, the following query can be used. The query includes all states as we are curious to see if cooperative and municipal-owned utilities are indeed representative of the majority of ownership types.

utility.type <- sqldf('SELECT ownership, COUNT(DISTINCT utility_name) as count
                       FROM utility_df 
                       GROUP BY ownership 
                       ORDER BY count DESC')

utility.type$percent <- utility.type$count / sum(utility.type$count) * 100 # Create ratio to find proportions of ownership classifications

p <- plot_ly(utility.type, x=utility.type$ownership, y=round(utility.type$percent, 2), type='bar') %>% 
  layout(title = 'Proportion of Utilities by Ownership Classification', yaxis = list(title = '% of Total Utilities'))

p

As one might anticipate from the previous graph, cooperative and municipal-owned utilities roughly 85% of all utilities in the United States as recorded by the dataset.

Summary

SQL and R make a great combination when exploring and analyzing datasets using the sqldf package package and other packages available in R. Although the dataset examined in this post was only about 5mb and therefore doesn't need to be loaded into a database for analysis, it provides a good starting point for working with SQL and R in unison. I hope to do more posts shortly using SQL and R on larger datasets.