Analyzing the Consumer Complaints Database with Python, SQL and Plotly

The Consumer Complaints database, provided by the Bureau of Consumer Financial Protection, can be downloaded as a 190 MB CSV file.

Although the CSV file is not large relative to other available datasets, which can exceed many gigabytes in size, it still provides good motivation for aggregating the data with SQL and reading the results into a Pandas DataFrame. This can all be done conveniently with Pandas's IO tools.
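The end-to-end pattern is: stream the CSV in chunks, append each chunk to a SQL table, then let the database do the aggregation. Here is a minimal sketch of that pattern, using a tiny in-memory CSV and an in-memory SQLite database in place of the real files:

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical in-memory CSV standing in for the real Consumer_Complaints.csv
csv_data = io.StringIO(
    'Product,State\n'
    'Mortgage,CA\n'
    'Credit card,NY\n'
    'Mortgage,TX\n')

engine = create_engine('sqlite:///:memory:')  # throwaway in-memory database

# Stream the CSV in chunks, appending each chunk to a SQL table
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunk.to_sql('data', engine, if_exists='append', index=False)

# Aggregate on the database side, returning the result as a DataFrame
counts = pd.read_sql_query(
    'SELECT Product, COUNT(*) AS Complaints '
    'FROM data GROUP BY Product ORDER BY Complaints DESC', engine)
print(counts)
```

Only the chunk currently being processed is held in memory, which is what makes this workable for files larger than RAM.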

This post will explore creating a database file from the Consumer Complaints data and analyzing it with Pandas and Plotly. I was inspired to use SQL and Plotly in conjunction by this blog post, which examines a dataset from the NYC Open Data Portal.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import sqlite3
from sqlalchemy import create_engine
import string
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy import stats
In [2]:
import plotly.plotly as py
from plotly.tools import FigureFactory as FF
from plotly.graph_objs import Bar, Scatter, Marker, Layout, Choropleth, Histogram

Before creating a new database file, the Consumer Complaints CSV file can be inspected to get a quick view of what it includes.

In [3]:
display(pd.read_csv('raw/Consumer_Complaints.csv', nrows=5))
Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code Tags Consumer consent provided? Submitted via Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID
0 12/5/2011 Credit card NaN Rewards NaN NaN NaN Discover NY 13835 NaN NaN Postal mail 12/6/2011 Closed without relief Yes No 2922
1 12/5/2011 Credit card NaN Billing disputes NaN NaN NaN Capital One NY 11582 NaN NaN Referral 12/5/2011 Closed with relief Yes No 2928
2 12/5/2011 Credit card NaN Closing/Cancelling account NaN NaN NaN Bank of America NC 28411 Servicemember NaN Phone 12/9/2011 Closed with relief Yes Yes 2929
3 12/5/2011 Mortgage Other mortgage Loan modification,collection,foreclosure NaN NaN NaN Wells Fargo & Company VA 20194 NaN NaN Web 12/6/2011 Closed without relief Yes Yes 2931
4 12/5/2011 Credit card NaN Identity theft / Fraud / Embezzlement NaN NaN NaN JPMorgan Chase & Co. IL 60202 NaN NaN Referral 12/5/2011 Closed without relief Yes No 2934

Building the Database

Create the database connection using SQLAlchemy's create_engine method.

In [4]:
db_conn = create_engine('sqlite:///databases/complaints.db')

The number of rows of the CSV to process on each iteration:

In [5]:
chunks = 25000

The CSV data is processed in chunks and appended to the database created above. As each chunk is processed, the column names are cleaned: hyphens and surrounding whitespace are removed and each word is capitalized, yielding CamelCase names.

In [6]:
for data in pd.read_csv('raw/Consumer_Complaints.csv', chunksize=chunks,
                        iterator=True, encoding='utf-8'):

    # Clean the column names: replace hyphens with spaces, strip whitespace,
    # capitalize each word, then remove the spaces to produce CamelCase names
    data = data.rename(columns={col: string.capwords(col.replace('-', ' ').strip()).replace(' ', '')
                                for col in data.columns})

    data.to_sql('data', db_conn, if_exists='append')

To confirm the data was processed and appended to the database correctly, a quick query can be run to display the first 10 rows of the table.

In [7]:
pd.read_sql_query('SELECT * FROM data LIMIT 10', db_conn)
Out[7]:
index DateReceived Product SubProduct Issue SubIssue ConsumerComplaintNarrative CompanyPublicResponse Company State ZipCode Tags ConsumerConsentProvided? SubmittedVia DateSentToCompany CompanyResponseToConsumer TimelyResponse? ConsumerDisputed? ComplaintId
0 0 12/5/2011 Credit card None Rewards None None None Discover NY 13835 None None Postal mail 12/6/2011 Closed without relief Yes No 2922
1 1 12/5/2011 Credit card None Billing disputes None None None Capital One NY 11582 None None Referral 12/5/2011 Closed with relief Yes No 2928
2 2 12/5/2011 Credit card None Closing/Cancelling account None None None Bank of America NC 28411 Servicemember None Phone 12/9/2011 Closed with relief Yes Yes 2929
3 3 12/5/2011 Mortgage Other mortgage Loan modification,collection,foreclosure None None None Wells Fargo & Company VA 20194 None None Web 12/6/2011 Closed without relief Yes Yes 2931
4 4 12/5/2011 Credit card None Identity theft / Fraud / Embezzlement None None None JPMorgan Chase & Co. IL 60202 None None Referral 12/5/2011 Closed without relief Yes No 2934
5 5 12/5/2011 Credit card None Collection debt dispute None None None Bank of America CA 95648 None None Referral 12/5/2011 Closed with relief Yes No 2937
6 6 12/5/2011 Credit card None Credit determination None None None JPMorgan Chase & Co. TX 78041 None None Fax 12/9/2011 Closed without relief Yes Yes 2939
7 7 12/5/2011 Credit card None Billing disputes None None None Capital One VT 5043 None None Referral 12/5/2011 Closed with relief Yes No 2940
8 8 12/5/2011 Credit card None Credit card protection / Debt protection None None None Capital One KS 67579 Older American None Phone 12/9/2011 Closed without relief Yes No 2942
9 9 12/5/2011 Credit card None Credit reporting None None None JPMorgan Chase & Co. CA 94127 None None Referral 12/5/2011 Closed without relief Yes Yes 1940

With the data processed into a database file, we can begin analyzing the data.

Which Product has the most complaints?

In [8]:
query = pd.read_sql_query('SELECT Product, COUNT(*) as `Complaints` '
                          'FROM data '
                          'GROUP BY Product '
                          'ORDER BY `Complaints` DESC', db_conn)
In [9]:
py.iplot([Bar(x=query.Product, y=query.Complaints)], filename = 'ConsumerComplaints_Products with most complaints')
Out[9]:

Credit and debt related products have the most complaints, led by mortgages.

Which Companies have the Most Complaints?

In [10]:
query_responses = pd.read_sql_query('SELECT Company, COUNT(*) as `Complaints` '
                           'FROM data '
                           'GROUP BY Company '
                           'ORDER BY `Complaints` DESC '
                           'LIMIT 10 ', db_conn)
In [11]:
py.iplot([Bar(x=query_responses.Company, y=query_responses.Complaints)], filename='ConsumerComplaints_Companies with most Complaints')
Out[11]:

The largest financial firms, including Bank of America, Wells Fargo and J.P. Morgan, are near the top, which could simply reflect their size relative to other firms.

Combining SQL queries with Pandas makes reshaping and aggregating data much more straightforward. For example, we can quickly pull the relevant counts from the database and compute a ratio to find which companies have the lowest timely response rates.

In [12]:
query_comp = pd.read_sql_query('SELECT Company, '
                           'COUNT(CASE WHEN `TimelyResponse?` = "Yes" THEN 1 ELSE NULL END) As YesCount, '
                           'COUNT(CASE WHEN `TimelyResponse?` = "No" THEN 1 ELSE NULL END) As NoCount, '
                           'COUNT(*) as Total '
                           'FROM data '
                           'GROUP BY Company '
                           'HAVING COUNT(*) > 500 '
                           'ORDER BY YesCount DESC', db_conn)
In [13]:
query_comp['Timely_Response_Ratio'] = query_comp.YesCount / query_comp.Total * 100
In [14]:
bot_10_response = query_comp.sort_values('Timely_Response_Ratio', ascending=True)[0:10]
In [15]:
responses = ('Timely Responses: ' + bot_10_response.YesCount.astype(str) +
             '<br>Untimely Responses: ' + bot_10_response.NoCount.astype(str))
In [16]:
py.iplot([Bar(x=bot_10_response.Company, y=bot_10_response.Timely_Response_Ratio, 
              text=responses)], filename='ConsumerComplaints_LowestTimelyResponse')
Out[16]:

Note that none of the firms with the most complaints appear here.

It was seen earlier that mortgages have the most complaints. Which specific issues are complained about the most?

In [17]:
query3 = pd.read_sql_query('SELECT Product, Issue, COUNT(*) `Number of Complaints` '
                          'FROM data '
                          'WHERE Product = "Mortgage" '
                          'GROUP BY Issue '
                          'ORDER BY `Number of Complaints` DESC', db_conn)
In [18]:
py.iplot([Bar(x=query3.Issue, y=query3['Number of Complaints'])], filename='ConsumerComplaints_MostComplainedProductIssue')
Out[18]:

The majority of mortgage complaints relate to loan modification, collection and foreclosure, followed by loan servicing, payments and escrow accounts.

Choropleth maps can be plotted using Plotly. These maps can display which states have the most complaints.

In [19]:
state_query = pd.read_sql_query('SELECT State, COUNT(*) as `Complaints` '
                                'FROM data '
                                'WHERE State IS NOT NULL '
                                'GROUP BY State', db_conn)
In [20]:
dat = [dict(
    type = 'choropleth',
    locations = state_query['State'],
    z = state_query['Complaints'],
    locationmode = 'USA-states'
    )]

layout = dict(geo = dict(
    scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))

fig = dict(data=dat, layout=layout)

py.iplot(fig, filename='ConsumerComplaints_StatesWithMostComplaints')
Out[20]:

As one would expect, the more populated states such as California, Texas, Florida and New York have the most complaints.

Is there a particular pattern to which companies receive the most complaints by state? Would one expect the company with the most complaints in a state to be concentrated in regions where the firm has a large presence (Bank of America on the East Coast, J.P. Morgan in the Northeast, etc.)?

In [21]:
query_comps = pd.read_sql_query('SELECT State, Company, COUNT(*) as `Complaints` '
                                'FROM data '
                                'WHERE State IS NOT NULL '
                                'AND Company IS NOT NULL '
                                'GROUP BY Company, State '
                                'ORDER BY `Complaints` DESC, Company, State', db_conn)

# For each state, keep the company with the most complaints, then encode the
# companies as numeric category codes to use as the choropleth's z values
state_grouped = query_comps.groupby('State', as_index=False).first().sort_values('Complaints', ascending=False)
state_grouped['Company_Code'] = state_grouped['Company'].astype('category').cat.codes
company_codes = pd.DataFrame(state_grouped['Company'].unique(), index=state_grouped['Company_Code'].unique(), columns=['Code'])
In [22]:
dat = [dict(
    type = 'choropleth',
    locations = state_grouped['State'],
    z =  state_grouped['Company_Code'],
    locationmode = 'USA-states',
    text = state_grouped['Company'],
    colorscale = [[0.0, 'rgb(255,255,179)'], [0.1, 'rgb(141,211,199)'], [0.2, 'rgb(190,186,218)'],[0.3, 'rgb(251,128,114)'],\
            [0.4, 'rgb(128,177,211)'], [0.5, 'rgb(253,180,98)'], [0.6, 'rgb(179,222,105)'], [0.7, 'rgb(252,205,229)'],\
            [0.8, 'rgb(217,217,217)'], [0.9, 'rgb(188,128,189)'], [1, 'rgb(204,235,197)']],
    colorbar = dict(
        tickmode = 'array',
        tickvals = list(company_codes.index),
        ticktext = list(company_codes.Code))
    )]

layout = dict(
    geo = dict(
        scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))

fig = dict(data=dat, layout=layout)

py.iplot(fig, filename='ConsumerComplaints_MostComplainedCompanyByState')
Out[22]:

The choropleth map shows Bank of America receives the most complaints throughout most of the United States, while Wells Fargo generally leads in the Midwest and central US. Interestingly, Equifax receives the most complaints in many Southeastern states.

Are mortgage complaints centered in a particular area of the United States?

In [23]:
query_products = pd.read_sql_query('SELECT State, Product, COUNT(*) as `Complaints` '
                                   'FROM data '
                                   'WHERE State IS NOT NULL '
                                   'AND Product IS NOT NULL '
                                   'GROUP BY Product, State '
                                   'ORDER BY `Complaints` DESC, Product, State', db_conn)

# For each state, keep the product with the most complaints, then encode the
# products as numeric category codes to use as the choropleth's z values
product_grouped = query_products.groupby('State', as_index=False).first().sort_values('Complaints', ascending=False)
product_grouped['Product_Code'] = product_grouped['Product'].astype('category').cat.codes
product_codes = pd.DataFrame(product_grouped['Product'].unique(), index=product_grouped['Product_Code'].unique(), columns=['Code'])
In [24]:
dat = [dict(
    type = 'choropleth',
    locations = product_grouped['State'],
    z =  product_grouped['Product_Code'],
    locationmode = 'USA-states',
    text = product_grouped['Product'],
    colorscale = [[0.0, 'rgb(255,255,179)'], [0.25, 'rgb(141,211,199)'], [0.50, 'rgb(190,186,218)'],[0.75, 'rgb(251,128,114)'],\
            [1, 'rgb(128,177,211)']],
    colorbar = dict(
        tickmode = 'array',
        tickvals = list(product_codes.index),
        ticktext = list(product_codes.Code))
    )]

layout = dict(
    geo = dict(
        scope='usa', showlakes=True, lakecolor = 'rgb(255, 255, 255)'))

fig = dict(data=dat, layout=layout)

py.iplot(fig, filename='ConsumerComplaints_MostComplainedProductByState')
Out[24]:

Mortgage related issues receive the most complaints by far across the United States. Interestingly enough, debt collection and credit reporting are the most complained about in the central United States.

The dataset contains tags indicating whether the consumer is an older American and/or a servicemember. Using these tags, we can test whether having a particular tag is associated with a slower or faster response rate, using the SciPy and statsmodels packages.

In [25]:
tag_responses = pd.read_sql_query('SELECT Company, Tags, `TimelyResponse?`, '
                 'COUNT(CASE WHEN `TimelyResponse?` = "Yes" THEN 1 ELSE NULL END) As YesCount, '
                 'COUNT(CASE WHEN `TimelyResponse?` = "No" THEN 1 ELSE NULL END) As NoCount, '
                 'COUNT(*) as Total '
                 'FROM data '
                 'GROUP BY Company '
                 'ORDER BY YesCount DESC', db_conn)
In [26]:
tag_responses.rename(columns={'TimelyResponse?':'TimelyResponse'}, inplace=True)
In [27]:
tag_responses.Tags = np.where(tag_responses.Tags.isnull(), 'No Tag', tag_responses.Tags)
tag_responses['Timely_Response_Ratio'] = tag_responses.YesCount / tag_responses.Total * 100
In [28]:
pivoted = pd.pivot_table(tag_responses, index=['TimelyResponse', 'Company'], columns=['Tags'], values=['Timely_Response_Ratio']).reset_index()

Using the Kruskal-Wallis implementation in SciPy, we can test whether there is a significant difference in the timely response ratio across the designated tags.

In [29]:
stats.mstats.kruskalwallis(pivoted.Timely_Response_Ratio['No Tag'], pivoted.Timely_Response_Ratio['Older American'], 
                          pivoted.Timely_Response_Ratio['Older American, Servicemember'], pivoted.Timely_Response_Ratio['Servicemember'])
Out[29]:
KruskalResult(statistic=0.34332534569847822, pvalue=0.95168441327165698)

The p-value is well above 0.05, so we cannot conclude there is any significant difference in response rates across the tags. The tags can also be compared pairwise using Tukey's HSD test.

In [30]:
print(pairwise_tukeyhsd(tag_responses.Timely_Response_Ratio, tag_responses.Tags))
                     Multiple Comparison of Means - Tukey HSD,FWER=0.05                     
============================================================================================
            group1                        group2            meandiff  lower    upper  reject
--------------------------------------------------------------------------------------------
            No Tag                    Older American         0.6005  -5.4108   6.6118 False 
            No Tag            Older American, Servicemember -0.1383  -12.7571 12.4805 False 
            No Tag                    Servicemember          1.9458  -3.8194   7.711  False 
        Older American        Older American, Servicemember -0.7388  -14.5275 13.0499 False 
        Older American                Servicemember          1.3453  -6.6628   9.3534 False 
Older American, Servicemember         Servicemember          2.0841  -11.5991 15.7673 False 
--------------------------------------------------------------------------------------------

Tukey's test likewise reports no significant difference between any pair of groups.
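To see how the Kruskal-Wallis test behaves, here is a minimal, self-contained illustration using made-up timely-response ratios. The three groups are hypothetical and chosen to overlap heavily, so the test should not find a significant difference:

```python
from scipy import stats

# Three hypothetical groups of timely-response ratios (percentages);
# the values overlap heavily, so no significant difference is expected
group_a = [94.9, 95.4, 96.8, 97.1, 98.0]
group_b = [94.2, 95.0, 96.5, 96.9, 97.3]
group_c = [95.5, 95.8, 96.1, 96.4, 97.0]

# The test ranks all values together and compares the rank sums per group;
# a large p-value means the null of identical distributions is not rejected
statistic, pvalue = stats.mstats.kruskalwallis(group_a, group_b, group_c)
print(statistic, pvalue)
```

Because the test operates on ranks rather than the raw values, it makes no normality assumption, which suits ratio data like this that is bounded above by 100.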

Summary

This notebook explored using SQL and Pandas to analyze larger datasets, with visualizations from the Plotly library. Pandas provides many tools for smoother workflows between large datasets and SQL, and Python itself works well with databases, especially through packages such as SQLAlchemy.