The consumer complaints database provided by the Bureau of Consumer Financial Protection can be downloaded as a 190 MB CSV file.
Although the CSV file is not large relative to other available datasets that can exceed many gigabytes in size, it still provides good motivation for aggregating the data with SQL and reading the results into a Pandas DataFrame. This can all be done conveniently with Pandas' IO tools.
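As a minimal sketch of that pattern (using the same file, database, and table names as the rest of this post; it is shown for orientation only, since the cells below do the same thing step by step), the round trip looks roughly like this:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///databases/complaints.db')                # SQLite database file
for chunk in pd.read_csv('raw/Consumer_Complaints.csv', chunksize=25000):
    chunk.to_sql('data', engine, if_exists='append')                       # append each chunk to the 'data' table
complaints_by_product = pd.read_sql_query('SELECT Product, COUNT(*) AS Complaints '
                                          'FROM data GROUP BY Product', engine)  # aggregate back into a DataFrame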
This post will explore creating a database file from the Consumer Complaints data and analyzing it with Pandas and Plotly. I was inspired to use SQL and Plotly together by this blog post, which examines one of the datasets on the NYC Open Data Portal.
import pandas as pd
import numpy as np
from IPython.display import display
import sqlite3
from sqlalchemy import create_engine
import string
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy import stats
import plotly.plotly as py
from plotly.tools import FigureFactory as FF
from plotly.graph_objs import Bar, Scatter, Marker, Layout, Choropleth, Histogram
Before creating a new database file, the Consumer Complaints CSV file can be inspected to get a quick view of what is included.
display(pd.read_csv('raw/Consumer_Complaints.csv', nrows=5).head())
Building the Database
Create the database connection using SQLAlchemy's create_engine method.
db_conn = create_engine('sqlite:///databases/complaints.db')
The number of rows in the CSV to process on each iteration:
chunks = 25000
The CSV data is processed in chunks and appended to the database created above. As each chunk is processed, the column names are cleaned: hyphens are replaced with spaces, whitespace is stripped, each word is capitalized, and the remaining spaces are removed.
for data in pd.read_csv('raw/Consumer_Complaints.csv', chunksize=chunks,
                        iterator=True, encoding='utf-8'):
    # Clean the column names: swap hyphens for spaces, strip whitespace, capitalize
    # each word, then remove the spaces (e.g. 'Timely response?' -> 'TimelyResponse?')
    data = data.rename(columns={col: col.replace('-', ' ') for col in data.columns})
    data = data.rename(columns={col: col.strip() for col in data.columns})
    data = data.rename(columns={col: string.capwords(col) for col in data.columns})
    data = data.rename(columns={col: col.replace(' ', '') for col in data.columns})
    # Append the cleaned chunk to the 'data' table
    data.to_sql('data', db_conn, if_exists='append')
To confirm the data was processed and appended to the database correctly, a quick query can be run to display the first 10 rows.
pd.read_sql_query('SELECT * FROM data LIMIT 10', db_conn)
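A total row count gives a further check that every chunk was appended:
pd.read_sql_query('SELECT COUNT(*) AS TotalRows FROM data', db_conn)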
With the data processed into a database file, we can begin analyzing the data.
Which Product has the most complaints?
query = pd.read_sql_query('SELECT Product, COUNT(*) as `Complaints` '
                          'FROM data '
                          'GROUP BY Product '
                          'ORDER BY `Complaints` DESC', db_conn)
py.iplot([Bar(x=query.Product, y=query.Complaints)], filename = 'ConsumerComplaints_Products with most complaints')
Credit- and debt-related products have the most complaints, led by mortgages.
Which Companies have the Most Complaints?
query_responses = pd.read_sql_query('SELECT Company, COUNT(*) as `Complaints` '
'FROM data '
'GROUP BY Company '
'ORDER BY `Complaints` DESC '
'LIMIT 10 ', db_conn)
py.iplot([Bar(x=query_responses.Company, y=query_responses.Complaints)], filename='ConsumerComplaints_Companies with most Complaints')
The largest financial firms, including Bank of America, Wells Fargo, and J.P. Morgan, are near the top, which could simply be due to their size relative to other firms.
Using SQL queries together with Pandas makes reshaping and aggregating the data much more straightforward. For example, we can quickly pull the relevant counts from the database and compute a ratio to find which companies have the lowest timely response rates.
query_comp = pd.read_sql_query('SELECT Company, '
'COUNT(CASE WHEN `TimelyResponse?` = "Yes" THEN 1 ELSE NULL END) As YesCount, '
'COUNT(CASE WHEN `TimelyResponse?` = "No" THEN 1 ELSE NULL END) As NoCount, '
'COUNT(*) as Total '
'FROM data '
'GROUP BY Company '
'HAVING COUNT(*) > 500 '
'ORDER BY YesCount DESC', db_conn)
query_comp['Timely_Response_Ratio'] = query_comp.YesCount / query_comp.Total * 100
bot_10_response = query_comp.sort_values('Timely_Response_Ratio', ascending=True)[0:10]
responses = 'Timely Responses: ' + bot_10_response.YesCount.astype(str) + '<br>' + 'Untimely Responses: ' + bot_10_response.NoCount.astype(str)
py.iplot([Bar(x=bot_10_response.Company, y=bot_10_response.Timely_Response_Ratio,
text=responses)], filename='ConsumerComplaints_LowestTimelyResponse')
Note that none of the firms with the most complaints appear here.
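A quick set intersection between the top-10 complaint companies and the bottom-10 timely responders can be used to check this:
set(query_responses.Company) & set(bot_10_response.Company)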
It was seen earlier that mortgage-related products receive the most complaints. Which specific issues are complained about the most?
query3 = pd.read_sql_query('SELECT Product, Issue, COUNT(*) `Number of Complaints` '
'FROM data '
'WHERE Product = "Mortgage" '
'GROUP BY Issue '
'ORDER BY `Number of Complaints` DESC', db_conn)
py.iplot([Bar(x=query3.Issue, y=query3['Number of Complaints'])], filename='ConsumerComplaints_MostComplainedProductIssue')
The majority of mortgage complaints relate to loan modification, collection, servicing, payments, and other closely related issues.
Plotly can also plot choropleth maps, which can show which states have the most complaints.
state_query = pd.read_sql_query('SELECT State, Product, COUNT(*) as `Complaints` '
'FROM data '
'WHERE State <> "None" '
'GROUP BY State', db_conn
)
dat = [dict(
type = 'choropleth',
locations = state_query['State'],
z = state_query['Complaints'],
locationmode = 'USA-states'
)]
layout = dict(geo = dict(
scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))
fig = dict(data=dat, layout=layout)
py.iplot(fig, filename='ConsumerComplaints_StatesWithMostComplaints')
As one would expect, the more populated states such as California, Texas, Florida and New York have the most complaints.
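Raw counts largely track population, so complaints per capita would be a fairer comparison. As a rough sketch, assuming a hypothetical state_population Series indexed by two-letter state abbreviation (population data is not part of this dataset):
# state_population is a hypothetical pandas Series, e.g. pd.Series({'CA': 39000000, 'TX': 28000000, ...})
per_capita = state_query.set_index('State')['Complaints'] / state_population
per_capita.sort_values(ascending=False).head(10)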
Is there a particular pattern to which companies receive the most complaints in each state? Would one expect the company with the most complaints to be concentrated in regions where the firm has a large presence (Bank of America on the East Coast, J.P. Morgan in the Northeast, etc.)?
query_comps = pd.read_sql_query('SELECT State, Company, COUNT(*) as `Complaints` '
'FROM data '
'WHERE State <> "None" '
'AND Company <> "None" '
'GROUP BY Company, State '
'ORDER BY `Complaints` DESC, Company, State', db_conn)
# The query is ordered by complaint count, so the first row per state is the most-complained-about company
state_grouped = query_comps.groupby('State', as_index=False).first().sort_values('Complaints', ascending=False)
# Encode companies as integer codes for the choropleth z values; keep a code-to-company mapping for the colorbar
state_grouped['Company_Code'] = state_grouped['Company'].astype('category').cat.codes
company_codes = pd.DataFrame(state_grouped['Company'].unique(), index=state_grouped['Company_Code'].unique(), columns=['Code'])
dat = [dict(
type = 'choropleth',
locations = state_grouped['State'],
z = state_grouped['Company_Code'],
locationmode = 'USA-states',
text = state_grouped['Company'],
colorscale = [[0.0, 'rgb(255,255,179)'], [0.1, 'rgb(141,211,199)'], [0.2, 'rgb(190,186,218)'],[0.3, 'rgb(251,128,114)'],\
[0.4, 'rgb(128,177,211)'], [0.5, 'rgb(253,180,98)'], [0.6, 'rgb(179,222,105)'], [0.7, 'rgb(252,205,229)'],\
[0.8, 'rgb(217,217,217)'], [0.9, 'rgb(188,128,189)'], [1, 'rgb(204,235,197)']],
colorbar = dict(
tickmode = 'array',
tickvals = list(company_codes.index),
ticktext = list(company_codes.Code))
)]
layout = dict(
geo = dict(
scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))
fig = dict(data=dat, layout=layout)
py.iplot(fig, filename='ConsumerComplaints_MostComplainedCompanyByState')
The choropleth map shows that Bank of America receives the most complaints in states across the United States, while Wells Fargo generally tops states in the middle of the country. Interestingly, Equifax receives the most complaints in many Southeastern states.
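To quantify this, the number of states topped by each company can be tabulated:
state_grouped['Company'].value_counts()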
Are mortgage complaints centered in a particular area of the United States?
query_products = pd.read_sql_query('SELECT State, Product, COUNT(*) as `Complaints` '
'FROM data '
'WHERE State <> "None" '
'AND Product <> "None" '
'GROUP BY Product, State '
'ORDER BY `Complaints` DESC, Product, State', db_conn)
product_grouped = query_products.groupby('State', as_index=False).first().sort_values('Complaints', ascending=False)
product_grouped['Product_Code'] = product_grouped['Product'].astype('category').cat.codes
product_codes = pd.DataFrame(product_grouped['Product'].unique(), index=product_grouped['Product_Code'].unique(), columns=['Code'])
dat = [dict(
type = 'choropleth',
locations = product_grouped['State'],
z = product_grouped['Product_Code'],
locationmode = 'USA-states',
text = product_grouped['Product'],
colorscale = [[0.0, 'rgb(255,255,179)'], [0.25, 'rgb(141,211,199)'], [0.50, 'rgb(190,186,218)'],[0.75, 'rgb(251,128,114)'],\
[1, 'rgb(128,177,211)']],
colorbar = dict(
tickmode = 'array',
tickvals = list(product_codes.index),
ticktext = list(product_codes.Code))
)]
layout = dict(
geo = dict(
scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))
fig = dict(data=dat, layout=layout)
py.iplot(fig, filename='ConsumerComplaints_MostComplainedProductByState')
Mortgage-related issues receive the most complaints by far across the United States. Interestingly, debt collection and credit reporting are the most complained-about products in the central United States.
The dataset contains tags that indicate whether the consumer is elderly ('Older American') and/or a servicemember. Using these tags, we can test whether having a particular tag is associated with slower or faster response times, using the SciPy and statsmodels packages.
tag_responses = pd.read_sql_query('SELECT Company, Tags, `TimelyResponse?`, '
                                  'COUNT(CASE WHEN `TimelyResponse?` = "Yes" THEN 1 ELSE NULL END) As YesCount, '
                                  'COUNT(CASE WHEN `TimelyResponse?` = "No" THEN 1 ELSE NULL END) As NoCount, '
                                  'COUNT(*) as Total '
                                  'FROM data '
                                  'GROUP BY Company, Tags '
                                  'ORDER BY YesCount DESC', db_conn)
tag_responses.rename(columns={'TimelyResponse?':'TimelyResponse'}, inplace=True)
tag_responses.Tags = tag_responses.Tags.fillna('No Tag')
tag_responses['Timely_Response_Ratio'] = tag_responses.YesCount / tag_responses.Total * 100
pivoted = pd.pivot_table(tag_responses, index=['TimelyResponse', 'Company'], columns=['Tags'], values=['Timely_Response_Ratio']).reset_index()
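Before testing, it is worth checking how many companies fall into each tag group, since very small groups would weaken the comparison:
pivoted.Timely_Response_Ratio.count()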
Using the Kruskal-Wallis implementation in SciPy (scipy.stats.mstats.kruskalwallis), we can test whether there is a significant difference in the timely response ratio across the tag groups.
stats.mstats.kruskalwallis(pivoted.Timely_Response_Ratio['No Tag'], pivoted.Timely_Response_Ratio['Older American'],
pivoted.Timely_Response_Ratio['Older American, Servicemember'], pivoted.Timely_Response_Ratio['Servicemember'])
The p-value is well above 0.05, so there is no evidence of a significant difference in timely response ratios between the tag groups. Each pair of tags can also be compared using Tukey's HSD test.
print(pairwise_tukeyhsd(tag_responses.Timely_Response_Ratio, tag_responses.Tags))
Tukey's test reports no significant differences between any pair of groups.
Summary
In this notebook, SQL and Pandas were used together to analyze a larger dataset, with visualizations from the Plotly library. Pandas provides many tools that smooth the workflow between large datasets and SQL, and Python itself is well suited to working with databases, especially with packages such as SQLAlchemy.