Analyzing the Consumer Complaints Database with Python, SQL and Plotly

The Consumer Complaints database, provided by the Bureau of Consumer Financial Protection, can be downloaded as a 190 MB CSV file.

Although the CSV file is not large relative to other available datasets, which can exceed many gigabytes in size, it still provides good motivation for aggregating the data with SQL and reading the results into a Pandas DataFrame. This can all be done conveniently with Pandas's IO tools.
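The end-to-end pattern is: stream the CSV in chunks, append each chunk to a SQL table, then let the database do the aggregation. Here is a minimal sketch of that pattern, using a tiny in-memory CSV and an in-memory SQLite database in place of the real files:

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical in-memory CSV standing in for the real Consumer_Complaints.csv
csv_data = io.StringIO(
    'Product,State\n'
    'Mortgage,CA\n'
    'Credit card,NY\n'
    'Mortgage,TX\n')

engine = create_engine('sqlite:///:memory:')  # throwaway in-memory database

# Stream the CSV in chunks, appending each chunk to a SQL table
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunk.to_sql('data', engine, if_exists='append', index=False)

# Aggregate on the database side, returning the result as a DataFrame
counts = pd.read_sql_query(
    'SELECT Product, COUNT(*) AS Complaints '
    'FROM data GROUP BY Product ORDER BY Complaints DESC', engine)
print(counts)
```

Only the chunk currently being processed is held in memory, which is what makes this workable for files larger than RAM.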

This post will explore creating a database file from the Consumer Complaints data and analyzing it with Pandas and Plotly. I was inspired to use SQL and Plotly in conjunction by this blog post, which examines a dataset from the NYC Open Data Portal.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import sqlite3
from sqlalchemy import create_engine
import string
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy import stats
In [2]:
import plotly.plotly as py
from plotly.tools import FigureFactory as FF
from plotly.graph_objs import Bar, Scatter, Marker, Layout, Choropleth, Histogram

Before creating a new database file, the Consumer Complaints CSV file can be inspected to get a quick view of what it includes.

In [3]:
display(pd.read_csv('raw/Consumer_Complaints.csv', nrows=5))
Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code Tags Consumer consent provided? Submitted via Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID
0 12/5/2011 Credit card NaN Rewards NaN NaN NaN Discover NY 13835 NaN NaN Postal mail 12/6/2011 Closed without relief Yes No 2922
1 12/5/2011 Credit card NaN Billing disputes NaN NaN NaN Capital One NY 11582 NaN NaN Referral 12/5/2011 Closed with relief Yes No 2928
2 12/5/2011 Credit card NaN Closing/Cancelling account NaN NaN NaN Bank of America NC 28411 Servicemember NaN Phone 12/9/2011 Closed with relief Yes Yes 2929
3 12/5/2011 Mortgage Other mortgage Loan modification,collection,foreclosure NaN NaN NaN Wells Fargo & Company VA 20194 NaN NaN Web 12/6/2011 Closed without relief Yes Yes 2931
4 12/5/2011 Credit card NaN Identity theft / Fraud / Embezzlement NaN NaN NaN JPMorgan Chase & Co. IL 60202 NaN NaN Referral 12/5/2011 Closed without relief Yes No 2934

Building the Database

Create the database connection using SQLAlchemy's create_engine method.

In [4]:
db_conn = create_engine('sqlite:///databases/complaints.db')

The number of rows of the CSV to process on each iteration:

In [5]:
chunks = 25000

The CSV data is processed in chunks and appended to the database created above. As each chunk is processed, the column names are cleaned: hyphens and surrounding whitespace are removed and each word is capitalized, yielding CamelCase names.

In [6]:
for data in pd.read_csv('raw/Consumer_Complaints.csv', chunksize=chunks,
                        iterator=True, encoding='utf-8'):

    # Clean the column names: replace hyphens with spaces, strip whitespace,
    # capitalize each word, then remove the spaces to produce CamelCase names
    data = data.rename(columns={col: string.capwords(col.replace('-', ' ').strip()).replace(' ', '')
                                for col in data.columns})

    data.to_sql('data', db_conn, if_exists='append')

To confirm the data was processed and appended to the database correctly, a quick query can be run to display the first 10 rows of the table.

In [7]:
pd.read_sql_query('SELECT * FROM data LIMIT 10', db_conn)
Out[7]:
index DateReceived Product SubProduct Issue SubIssue ConsumerComplaintNarrative CompanyPublicResponse Company State ZipCode Tags ConsumerConsentProvided? SubmittedVia DateSentToCompany CompanyResponseToConsumer TimelyResponse? ConsumerDisputed? ComplaintId
0 0 12/5/2011 Credit card None Rewards None None None Discover NY 13835 None None Postal mail 12/6/2011 Closed without relief Yes No 2922
1 1 12/5/2011 Credit card None Billing disputes None None None Capital One NY 11582 None None Referral 12/5/2011 Closed with relief Yes No 2928
2 2 12/5/2011 Credit card None Closing/Cancelling account None None None Bank of America NC 28411 Servicemember None Phone 12/9/2011 Closed with relief Yes Yes 2929
3 3 12/5/2011 Mortgage Other mortgage Loan modification,collection,foreclosure None None None Wells Fargo & Company VA 20194 None None Web 12/6/2011 Closed without relief Yes Yes 2931
4 4 12/5/2011 Credit card None Identity theft / Fraud / Embezzlement None None None JPMorgan Chase & Co. IL 60202 None None Referral 12/5/2011 Closed without relief Yes No 2934
5 5 12/5/2011 Credit card None Collection debt dispute None None None Bank of America CA 95648 None None Referral 12/5/2011 Closed with relief Yes No 2937
6 6 12/5/2011 Credit card None Credit determination None None None JPMorgan Chase & Co. TX 78041 None None Fax 12/9/2011 Closed without relief Yes Yes 2939
7 7 12/5/2011 Credit card None Billing disputes None None None Capital One VT 5043 None None Referral 12/5/2011 Closed with relief Yes No 2940
8 8 12/5/2011 Credit card None Credit card protection / Debt protection None None None Capital One KS 67579 Older American None Phone 12/9/2011 Closed without relief Yes No 2942
9 9 12/5/2011 Credit card None Credit reporting None None None JPMorgan Chase & Co. CA 94127 None None Referral 12/5/2011 Closed without relief Yes Yes 1940

With the data processed into a database file, we can begin analyzing the data.

Which Product has the most complaints?

In [8]:
query = pd.read_sql_query('SELECT Product, COUNT(*) as `Complaints` '
                          'FROM data '
                          'GROUP BY Product '
                          'ORDER BY `Complaints` DESC', db_conn)
In [9]:
py.iplot([Bar(x=query.Product, y=query.Complaints)], filename = 'ConsumerComplaints_Products with most complaints')
Out[9]:

Credit and debt related products have the most complaints, led by mortgages.

Which Companies have the Most Complaints?

In [10]:
query_responses = pd.read_sql_query('SELECT Company, COUNT(*) as `Complaints` '
                           'FROM data '
                           'GROUP BY Company '
                           'ORDER BY `Complaints` DESC '
                           'LIMIT 10 ', db_conn)
In [11]:
py.iplot([Bar(x=query_responses.Company, y=query_responses.Complaints)], filename='ConsumerComplaints_Companies with most Complaints')
Out[11]:

The largest financial firms, including Bank of America, Wells Fargo and J.P. Morgan, are near the top, which could simply reflect their size relative to other firms.

Combining SQL queries with Pandas makes reshaping and aggregating data much more straightforward. For example, we can quickly pull the relevant counts from the database and compute a ratio to find which companies have the lowest timely response rates.

In [12]:
query_comp = pd.read_sql_query('SELECT Company, '
                           'COUNT(CASE WHEN `TimelyResponse?` = "Yes" THEN 1 ELSE NULL END) As YesCount, '
                           'COUNT(CASE WHEN `TimelyResponse?` = "No" THEN 1 ELSE NULL END) As NoCount, '
                           'COUNT(*) as Total '
                           'FROM data '
                           'GROUP BY Company '
                           'HAVING COUNT(*) > 500 '
                           'ORDER BY YesCount DESC', db_conn)
In [13]:
query_comp['Timely_Response_Ratio'] = query_comp.YesCount / query_comp.Total * 100
In [14]:
bot_10_response = query_comp.sort_values('Timely_Response_Ratio', ascending=True)[0:10]
In [15]:
responses = ('Timely Responses: ' + bot_10_response.YesCount.astype(str) +
             '<br>Untimely Responses: ' + bot_10_response.NoCount.astype(str))
In [16]:
py.iplot([Bar(x=bot_10_response.Company, y=bot_10_response.Timely_Response_Ratio, 
              text=responses)], filename='ConsumerComplaints_LowestTimelyResponse')
Out[16]:

Note that none of the firms with the most complaints appear here.

It was seen earlier that mortgages have the most complaints. Which specific issues are complained about the most?

In [17]:
query3 = pd.read_sql_query('SELECT Product, Issue, COUNT(*) `Number of Complaints` '
                          'FROM data '
                          'WHERE Product = "Mortgage" '
                          'GROUP BY Issue '
                          'ORDER BY `Number of Complaints` DESC', db_conn)
In [18]:
py.iplot([Bar(x=query3.Issue, y=query3['Number of Complaints'])], filename='ConsumerComplaints_MostComplainedProductIssue')
Out[18]:

The majority of mortgage complaints relate to loan modification, collection and foreclosure, followed by loan servicing, payments and escrow accounts.

Choropleth maps can be plotted using Plotly. These maps can display which states have the most complaints.

In [19]:
state_query = pd.read_sql_query('SELECT State, COUNT(*) as `Complaints` '
                                'FROM data '
                                'WHERE State IS NOT NULL '
                                'GROUP BY State', db_conn)
In [20]:
dat = [dict(
    type = 'choropleth',
    locations = state_query['State'],
    z = state_query['Complaints'],
    locationmode = 'USA-states'
    )]

layout = dict(geo = dict(
    scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))

fig = dict(data=dat, layout=layout)

py.iplot(fig, filename='ConsumerComplaints_StatesWithMostComplaints')
Out[20]:

As one would expect, the more populated states such as California, Texas, Florida and New York have the most complaints.

Is there a particular pattern to which companies receive the most complaints by state? Would one expect the company with the most complaints in a state to be concentrated in regions where the firm has a large presence (Bank of America on the East Coast, J.P. Morgan in the Northeast, etc.)?

In [21]:
query_comps = pd.read_sql_query('SELECT State, Company, COUNT(*) as `Complaints` '
                                'FROM data '
                                'WHERE State IS NOT NULL '
                                'AND Company IS NOT NULL '
                                'GROUP BY Company, State '
                                'ORDER BY `Complaints` DESC, Company, State', db_conn)

# For each state, keep the company with the most complaints, then encode the
# companies as numeric category codes to use as the choropleth's z values
state_grouped = query_comps.groupby('State', as_index=False).first().sort_values('Complaints', ascending=False)
state_grouped['Company_Code'] = state_grouped['Company'].astype('category').cat.codes
company_codes = pd.DataFrame(state_grouped['Company'].unique(), index=state_grouped['Company_Code'].unique(), columns=['Code'])
In [22]:
dat = [dict(
    type = 'choropleth',
    locations = state_grouped['State'],
    z =  state_grouped['Company_Code'],
    locationmode = 'USA-states',
    text = state_grouped['Company'],
    colorscale = [[0.0, 'rgb(255,255,179)'], [0.1, 'rgb(141,211,199)'], [0.2, 'rgb(190,186,218)'],[0.3, 'rgb(251,128,114)'],\
            [0.4, 'rgb(128,177,211)'], [0.5, 'rgb(253,180,98)'], [0.6, 'rgb(179,222,105)'], [0.7, 'rgb(252,205,229)'],\
            [0.8, 'rgb(217,217,217)'], [0.9, 'rgb(188,128,189)'], [1, 'rgb(204,235,197)']],
    colorbar = dict(
        tickmode = 'array',
        tickvals = list(company_codes.index),
        ticktext = list(company_codes.Code))
    )]

layout = dict(
    geo = dict(
        scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))

fig = dict(data=dat, layout=layout)

py.iplot(fig, filename='ConsumerComplaints_MostComplainedCompanyByState')
Out[22]:

The choropleth map shows Bank of America receives the most complaints throughout most of the United States, while Wells Fargo generally leads in the Midwest and central US. Interestingly, Equifax receives the most complaints in many Southeastern states.

Are mortgage complaints centered in a particular area of the United States?

In [23]:
query_products = pd.read_sql_query('SELECT State, Product, COUNT(*) as `Complaints` '
                                   'FROM data '
                                   'WHERE State IS NOT NULL '
                                   'AND Product IS NOT NULL '
                                   'GROUP BY Product, State '
                                   'ORDER BY `Complaints` DESC, Product, State', db_conn)

# For each state, keep the product with the most complaints, then encode the
# products as numeric category codes to use as the choropleth's z values
product_grouped = query_products.groupby('State', as_index=False).first().sort_values('Complaints', ascending=False)
product_grouped['Product_Code'] = product_grouped['Product'].astype('category').cat.codes
product_codes = pd.DataFrame(product_grouped['Product'].unique(), index=product_grouped['Product_Code'].unique(), columns=['Code'])
In [24]:
dat = [dict(
    type = 'choropleth',
    locations = product_grouped['State'],
    z =  product_grouped['Product_Code'],
    locationmode = 'USA-states',
    text = product_grouped['Product'],
    colorscale = [[0.0, 'rgb(255,255,179)'], [0.25, 'rgb(141,211,199)'], [0.50, 'rgb(190,186,218)'],[0.75, 'rgb(251,128,114)'],\
            [1, 'rgb(128,177,211)']],
    colorbar = dict(
        tickmode = 'array',
        tickvals = list(product_codes.index),
        ticktext = list(product_codes.Code))
    )]

layout = dict(
    geo = dict(
        scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))

fig = dict(data=dat, layout=layout)

py.iplot(fig, filename='ConsumerComplaints_MostComplainedProductByState')
Out[24]:

Mortgage related issues receive the most complaints by far across the United States. Interestingly enough, debt collection and credit reporting are the most complained about in the central United States.

The dataset contains tags indicating whether the consumer is an older American and/or a servicemember. Using these tags, we can test whether having a particular tag is associated with a slower or faster response rate, using the SciPy and statsmodels packages.

In [25]:
tag_responses = pd.read_sql_query('SELECT Company, Tags, `TimelyResponse?`, '
                 'COUNT(CASE WHEN `TimelyResponse?` = "Yes" THEN 1 ELSE NULL END) As YesCount, '
                 'COUNT(CASE WHEN `TimelyResponse?` = "No" THEN 1 ELSE NULL END) As NoCount, '
                 'COUNT(*) as Total '
                 'FROM data '
                 'GROUP BY Company '
                 'ORDER BY YesCount DESC', db_conn)
In [26]:
tag_responses.rename(columns={'TimelyResponse?':'TimelyResponse'}, inplace=True)
In [27]:
tag_responses.Tags = np.where(tag_responses.Tags.isnull(), 'No Tag', tag_responses.Tags)
tag_responses['Timely_Response_Ratio'] = tag_responses.YesCount / tag_responses.Total * 100
In [28]:
pivoted = pd.pivot_table(tag_responses, index=['TimelyResponse', 'Company'], columns=['Tags'], values=['Timely_Response_Ratio']).reset_index()

Using the Kruskal-Wallis implementation in SciPy, we can test whether there is a significant difference in the timely response ratio across the designated tags.

In [29]:
stats.mstats.kruskalwallis(pivoted.Timely_Response_Ratio['No Tag'], pivoted.Timely_Response_Ratio['Older American'], 
                          pivoted.Timely_Response_Ratio['Older American, Servicemember'], pivoted.Timely_Response_Ratio['Servicemember'])
Out[29]:
KruskalResult(statistic=0.34332534569847822, pvalue=0.95168441327165698)

The p-value is well above 0.05, so we cannot conclude there is any significant difference in response rates across the tags. The tags can also be compared pairwise using Tukey's HSD test.

In [30]:
print(pairwise_tukeyhsd(tag_responses.Timely_Response_Ratio, tag_responses.Tags))
                     Multiple Comparison of Means - Tukey HSD,FWER=0.05                     
============================================================================================
            group1                        group2            meandiff  lower    upper  reject
--------------------------------------------------------------------------------------------
            No Tag                    Older American         0.6005  -5.4108   6.6118 False 
            No Tag            Older American, Servicemember -0.1383  -12.7571 12.4805 False 
            No Tag                    Servicemember          1.9458  -3.8194   7.711  False 
        Older American        Older American, Servicemember -0.7388  -14.5275 13.0499 False 
        Older American                Servicemember          1.3453  -6.6628   9.3534 False 
Older American, Servicemember         Servicemember          2.0841  -11.5991 15.7673 False 
--------------------------------------------------------------------------------------------

Tukey's test likewise reports no significant difference between any pair of groups.
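To see how the Kruskal-Wallis test behaves, here is a minimal, self-contained illustration using made-up timely-response ratios. The three groups are hypothetical and chosen to overlap heavily, so the test should not find a significant difference:

```python
from scipy import stats

# Three hypothetical groups of timely-response ratios (percentages);
# the values overlap heavily, so no significant difference is expected
group_a = [94.9, 95.4, 96.8, 97.1, 98.0]
group_b = [94.2, 95.0, 96.5, 96.9, 97.3]
group_c = [95.5, 95.8, 96.1, 96.4, 97.0]

# The test ranks all values together and compares the rank sums per group;
# a large p-value means the null of identical distributions is not rejected
statistic, pvalue = stats.mstats.kruskalwallis(group_a, group_b, group_c)
print(statistic, pvalue)
```

Because the test operates on ranks rather than the raw values, it makes no normality assumption, which suits ratio data like this that is bounded above by 100.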

Summary

This notebook explored using SQL and Pandas to analyze larger datasets, with visualizations from the Plotly library. Pandas provides many tools for smoother workflows between large datasets and SQL, and Python itself works well with databases, especially through packages such as SQLAlchemy.