Extract and Analyze the Seattle Pet Licenses Dataset

The city of Seattle makes available its database of pet licenses issued from 2005 to the beginning of 2017 as part of the city's ongoing Open Data Initiative. This post will explore extracting the data from Seattle's Open Data portal using requests, then transform the extracted JSON data into a workable dataset with pandas to analyze and investigate the pet license database.

About Seattle Pet Licensing

The city of Seattle requires pets over eight weeks old be licensed. There are several benefits to licensing one's pet, including a return ride home if your pet is lost, and easier contact from a veterinarian if your pet is unfortunately injured. If the licensing is performed at the Seattle Animal Shelter on the third Saturday of any given month, a free rabies vaccine is included, as well as other vaccines and a microchip for a small additional fee.

Getting Started

Import the %matplotlib inline magic function for displaying the output visualizations inline in the notebook and import the libraries that will be used in this section of the analysis.

In [1]:
%matplotlib inline

import requests
import pandas as pd
import numpy as np
from urllib.error import HTTPError
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

import warnings
warnings.simplefilter('ignore')
/Users/aaron/anaconda/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Here, we set the formatting for the plots created with seaborn that appear later in the analysis.

In [2]:
sns.set(font_scale=2, palette=sns.color_palette("Set2", 10))
sns.set_style("ticks")

Getting the Data

We extract the data programmatically from the Seattle Open Data Portal. The portal utilizes Socrata to host its data. Therefore we will be working with the Socrata API. The Seattle pet licenses database API endpoint can be found on the dataset's information page by clicking the API button in the top-right corner.

Socrata's paging documentation states there is a return limit of 1,000 for each call. Therefore, we need to page the results to capture the entire dataset. The default format returned from the API is JSON, though other formats such as CSV and XML are also available. We select the default JSON format and collect the results into a list that we can later use to build a pandas DataFrame. The following block extracts the data from the API into a list until there are no more results to return.

In [3]:
count = 1000
offset = 0
params = {'$limit': count, '$offset': offset}
endpoint = 'https://data.seattle.gov/resource/87w5-fd6d.json'

results = []

while True:

    try:
        r = requests.get(endpoint, params=params)
        rcontent = r.json()
        
        if rcontent == []:
            break

        results.append(rcontent)
        offset += count
        params['$offset'] = offset

    except HTTPError as err:

        if err.response.status_code == '404':
            break
        else:
            print(err.response.status_code)

The data is now stored in a list that we can iterate through and normalize each JSON result using pandas's handy json_normalize function. Once all the JSON data is collected into a DataFrame, we can inspect the number of rows and print the first few rows of the data.

In [4]:
petlicenses = pd.DataFrame()

for i in results:
    petlicenses = petlicenses.append(pd.io.json.json_normalize(i))

print(len(petlicenses))
petlicenses.head()
66042
Out[4]:
animal_s_name license_issue_date license_number primary_breed secondary_breed species zip_code
0 Ozzy 2005-03-29T00:00:00.000 130651 Dachshund, Standard Smooth Haired NaN Dog 98104
1 Jack 2009-12-23T00:00:00.000 898148 Schnauzer, Miniature Terrier, Rat Dog 98107
2 Ginger 2006-01-20T00:00:00.000 29654 Retriever, Golden Retriever, Labrador Dog 98117
3 Pepper 2006-02-07T00:00:00.000 75432 Manx Mix Cat 98103
4 Addy 2006-08-04T00:00:00.000 729899 Retriever, Golden NaN Dog 98105

The row count is just over 66k, which is the same number of rows given on the dataset's portal page. We see the data has seven columns, and are the same as listed on the data's webpage. We now have a nice, tidy dataset that was created programmatically with a combination of libraries! Isn't that much more satisfying than just downloading the data as a .csv from the website? (Of course, it is).

It is often a good idea to save the extracted and transformed data onto the disk to avoid having to repeatedly get the data from the API if it is to be used later. We quickly save the DataFrame as a .csv using the to_csv method.

In [5]:
petlicenses.to_csv('../../data/seattle pet licenses/seattle_pet_licenses.csv', index=False, encoding='utf-8')

Exploratory Data Analysis of the Seattle Pet Licenses Dataset

Now that we have a tidy dataset to work with, we can move onto exploring the data! As a start, let's see how many pet licenses have been issued by species of pet.

In [6]:
petlicenses['species'].value_counts()
Out[6]:
Dog          43076
Cat          22915
Livestock       51
Name: species, dtype: int64

We see a vast disparity, almost 2 to 1, between the count of licenses issued to dogs and cats. 'Livestock,' according to the pet licensing municipal code, include miniature goats and potbellied pigs. Therefore, we wouldn't expect to see too many licenses issued to livestock in comparison to dogs or cats, but considering the dataset encompasses all of Seattle's issued pet licenses over a 12 year period, 51 seems rather low.

According to a survey conducted by the American Pet Products Association conducted between 2017 and 2018, there were an estimated 94.2 million cats, averaging two cats per household and an estimated 89.7 million dogs averaging just under 1.5 dogs per household. Assuming these numbers are proportional to the pet ownership population in Seattle, why are dogs licensed almost twice as much as cats? Some possible factors include:

  • One of the main benefits of getting a license for one's pet is the pet can be returned at no cost should the pet, unfortunately, go missing. As dogs are often outdoors much more frequently than cats for walks and such, perhaps the thought of one's dog becoming lost is more prevalent in dog owners' minds.
  • According to another survey conducted in 2012 by the American Veterinary Medical Association, 66.7% of dog owners surveyed thought their dog(s) are a part of their family, compared to 56.1% of cat owners surveyed. The perception of family inclusiveness could be another factor for the difference in licensing rates as it can be reasonably assumed that owners who consider their pets part of their family would be more likely to purchase a license.
    • The same AVMA survey also found that the average amount spent on veterinary care for each dog per year was \$227 compared to just \$90 for cats. Although this could be further evidence of the general attitude towards dogs and cats as family members, it is typically more difficult to determine if a cat is ill as they are not prone to showing they are not well, therefore this could also play a role in the difference in amount spent on care.

As we are more interested in dog and cat pet licenses for this analysis and the low overall volume of licenses issued to livestock-classified pets, we remove the livestock licenses from the data.

In [7]:
petlicenses = petlicenses.loc[petlicenses['species'] != 'Livestock']

Finding Most Common Dog and Cat Names

The pet's name for each license is also available in the licenses dataset. Using this information, can we find any similarity in the names of cats and dogs given by their owners? To answer this question, let's first plot the top ten cat and dog names by total count.

In [8]:
f, ax = plt.subplots(1, 2, figsize=(20, 7))

c = petlicenses[petlicenses['species'] == 'Cat']['animal_s_name'].value_counts()[0:10].plot(kind='bar', 
                                                                                            ax=ax[0], 
                                                                                            title='Top Ten Cat Names')
d = petlicenses[petlicenses['species'] == 'Dog']['animal_s_name'].value_counts()[0:10].plot(kind='bar', 
                                                                                            ax=ax[1], 
                                                                                            title='Top Ten Dog Names')
c.set_ylabel('Count')
d.set_ylabel('Count')

sns.despine()

The most common name for both dogs and cats is Lucy, with the names Max, Luna, Bella, Charlie, and Max also reaching the top 10 names for both animals. Given that 60% of the top ten cat and dog names are the same between animals, we can reasonably assume there is a decent amount of commonality in names amongst dogs and cats.

Now let's examine the 50 most common cat and dog names by total count and see the intersection of names using some of Python's available set operations. We print the names that are shared by cats and dogs in the top 50 names, as well as the total number.

In [9]:
cat_names = petlicenses[petlicenses['species'] == 'Cat']['animal_s_name'].value_counts()[0:50].index.tolist()
dog_names = petlicenses[petlicenses['species'] == 'Dog']['animal_s_name'].value_counts()[0:50].index.tolist()

common_names = set(cat_names).intersection(dog_names)
print(len(common_names), common_names)
30 {'Penny', 'Zoe', 'Milo', 'Lucy', 'Buddy', 'Oscar', 'Chloe', 'Sadie', 'Ruby', 'Bella', 'Oliver', 'Olive', 'Lily', 'Sophie', 'Sam', 'Maggie', 'Stella', 'Henry', 'Lulu', 'Lola', 'Charlie', 'Toby', 'Molly', 'Gracie', 'Luna', 'Ginger', 'Daisy', 'Pepper', 'Jack', 'Max'}

Interestingly, 30 of the 50 top cat and dog names are shared, the same ratio (60%) found in the list of top ten names.

Determining the Most Common Pet Breeds

According to the American Kennel Club, Labrador Retrievers, German Sheperds, and Golden Retrievers are the most popular dog breeds of 2016 and have been since 2013. The top three most popular cat breeds according to a recent article by The Spruce are the Siamese, Persian, and Maine Coon. Using the primary breed information available in the pet license database, we can see if dog and cat owners in Seattle confirm the results as reported by the articles.

Here, we plot the top 20 most frequently appearing primary breeds of dogs and cats.

In [10]:
f, ax = plt.subplots(1, 2, figsize=(20, 8))

petlicenses[petlicenses['species'] == 'Cat']['primary_breed'].value_counts()[0:20].plot(kind='bar', 
                                                                                        ax=ax[0], 
                                                                                        title='Top 20 Cat Breeds')
petlicenses[petlicenses['species'] == 'Dog']['primary_breed'].value_counts()[0:20].plot(kind='bar', 
                                                                                        ax=ax[1], 
                                                                                       title='Top 20 Dog Breeds')

sns.despine()

Based on the available data, we see the top three dog breeds are Labrador and Golden Retrievers and Short Coat Chihuahuas, with German and Australian Shepherds making up the fourth and fifth most popular breeds. The Australian Shepherd and Chihuahua are the 16th and 30th most popular breed on the AKC's list, respectively. Therefore, the results reported by the American Kennel Club are similar to most frequently appearing breeds in the data available, with a few deviations.

Technically speaking, domestic cats are not considered a pure breed, as they are the mix of several different breeds. Excluding the domestic breeds, we see the next most popular cat breeds are the American Shorthair, Siamese, and the LaPerm. The American Shorthair is the twelveth most popular breed in The Spruce's list while the LaPerm doesn't make an appearance.

Dog breeds also have a more even distribution while cat breeds are heavily skewed towards the domestic breeds, particularly the short hair variety. There are many potential reasons for this, of which the primary one is cat breeding is relatively new in comparison to breeding dogs for specific traits. Thus there are far fewer cat breeds recognized by the relevant governing bodies than dogs. It is also much more difficult to identify the breed of a cat, the only surefire way being genetic testing.

Analysis of Pet Licensing by Geography

The zip code of the pet's home is also given in the dataset. As the data is user-generated, we need to perform a few cleaning steps to get the zip codes all in a similar format before we can analyze the data further.

Zip codes are often given a four-digit suffix beginning with a hyphen for more specificity regarding the particular area. This suffix is not required for our purposes. Therefore we convert the zip code into a string and split the data by the hype and keep only the first piece.

In [11]:
petlicenses['zip_code'] = petlicenses['zip_code'].astype(str)
petlicenses['zip_code'] = petlicenses['zip_code'].str.split('-').str[0]

Now that we have a cleaner set of zip codes, we can start by plotting which zips have the most licensed pets. As there are quite a lot of zip codes, we focus on the top 15 by volume. A count column filled with $1$s is also added for convenience when plotting.

In [12]:
petlicenses['count'] = 1

petlicenses_zip = pd.pivot_table(petlicenses, 
                                     index=['zip_code'], 
                                     values=['count'], aggfunc=np.sum)

top15_zips = petlicenses_zip.sort_values('count', ascending=False)[0:15]

z = top15_zips.plot(kind='bar', figsize=(15, 5), legend=False, title='Count of Licensed Pets by Zip Code')

z.set_ylabel('Count')
z.set_xlabel('Zip Code')
z.xaxis.labelpad = 30
sns.despine()

Only three zip codes, 98115, 98103, and 98117 have more then 5,000 licensed pets. Anecdotally, the fact that the 98115 zip code has the most licensed pets doesn't suprise as I lived in the 98115 zip code for a number of years and could always see a number of dogs being walked throughout the day.

In [13]:
petlicenses_species_zip = pd.pivot_table(petlicenses, 
                                     index=['zip_code', 'species'], 
                                     values=['count'], aggfunc=np.sum).sort_values(by='count', ascending=False).reset_index()

petlicenses_species_zip = petlicenses_species_zip.loc[petlicenses_species_zip['zip_code'].isin(top15_zips.index)]

plt.figure(figsize=(18, 7))

sns.set(font_scale=1.75, palette=sns.color_palette("Set2", 10))
sns.set_style("ticks")

bar = sns.barplot(x='zip_code', y='count', hue='species', data=petlicenses_species_zip)
bar.set_title('Count of Licensed Dogs and Cats by Zip Code')
bar.set_ylabel('Count')
bar.set_xlabel('Zip Code')

bar.xaxis.labelpad = 10
sns.despine()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[13]:

As one would reasonably expect given the knowledge of an almost 2:1 ratio in the number of cats licensed compared to dogs, the top 15 zip codes by total licensed pets also shown a significant difference in licensed dogs and cats. However, some of these differences are not as stark, for example, the zip codes 98133 and 98105 show a fair number of cats licensed compared to dogs.

Earlier in the analysis, we found the most frequently appearing cat and dog breeds in the pet licenses database. Does the frequency of breeds appearing differ when viewed at a zip code level? That is, are there areas of Seattle that favor a particular dog or cat breed over others? One could reasonably hypothesize that smaller dogs such as Chihuahuas, Miniature Poodles, and Pugs would be more common in the denser areas of the city due to space and that some apartment buildings charge extra for larger breeds of dogs.

The following creates two new data frames that contain the top five cat and dog breeds for the top five zip codes. A rank column is added to find the most frequently occurring breeds by each zip code using pandas's groupby and rank() methods that are then used to filter the breeds by the top five.

In [14]:
breeds_zipcode = pd.pivot_table(petlicenses, 
                               index=['primary_breed', 'zip_code', 'species'], 
                               values=['count'], aggfunc=np.sum).reset_index().sort_values('count', ascending=False)

cats = breeds_zipcode.loc[breeds_zipcode['species'] == 'Cat']
dogs = breeds_zipcode.loc[breeds_zipcode['species'] == 'Dog']

top5_zips = top15_zips[0:5]

cats['rank'] = cats.groupby('zip_code')['count'].rank(method='dense', ascending=False)

cats = cats.loc[cats['zip_code'].isin(top5_zips.index)]

cats_top5 = cats.loc[cats['rank'] <= 5.0]

cats_top5 = pd.pivot_table(cats_top5, 
                           index=['zip_code', 'primary_breed'], 
                           values=['count'], 
                           aggfunc=np.sum).reset_index().sort_values(by='count', ascending=False)

dogs['rank'] = dogs.groupby('zip_code')['count'].rank(method='dense', ascending=False)
dogs = dogs.loc[dogs['zip_code'].isin(top5_zips.index)]

dogs_top5 = dogs.loc[dogs['rank'] <= 5.0]

dogs_top5 = pd.pivot_table(dogs_top5, 
                           index=['zip_code', 'primary_breed'], 
                           values=['count'], 
                           aggfunc=np.sum).reset_index().sort_values(by='count', ascending=False)

Plot the top five most frequently occurring dog and cat breeds within the zip codes with the most licensed pets.

In [15]:
f, ax = plt.subplots(2, 1, figsize=(20, 18), sharex=True)

cat_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=cats_top5, ax=ax[0])
dog_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=dogs_top5, ax=ax[1])

cat_bar.set_title('Most Popular Cat Breeds by Zip Code')
cat_bar.set_ylabel('Count')
cat_bar.set_xlabel('Zip Code')

dog_bar.set_title('Most Popular Dog Breeds by Zip Code')
dog_bar.set_ylabel('Count')
dog_bar.set_xlabel('Zip Code')

cat_bar.xaxis.labelpad = 20
dog_bar.xaxis.labelpad = 20

sns.despine()

cat_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
dog_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[15]:

As one might reasonably suspect, the most common breeds overall are also the most frequently appearing within each zip code, although there are several ties for the fifth most popular dog breed. Is there any more variety when the domestic cat breeds and retriever dog breeds are removed from the results before ranking?

In [16]:
cats_non_domestic = cats.loc[cats['primary_breed'].str.contains('Domestic') == False]

cats_non_domestic['rank'] = cats_non_domestic.groupby('zip_code')['count'].rank(method='dense', ascending=False)

cats_non_domestic_top5 = cats_non_domestic.loc[cats_non_domestic['rank'] <= 5.0]

cats_non_domestic_top5 = pd.pivot_table(cats_non_domestic_top5, 
                           index=['zip_code', 'primary_breed'], 
                           values=['count'], 
                           aggfunc=np.sum).reset_index().sort_values(by='count', ascending=False)

dogs_non_retriever = dogs.loc[dogs['primary_breed'].str.contains('Retriever') == False]

dogs_non_retriever['rank'] = dogs_non_retriever.groupby('zip_code')['count'].rank(method='dense', ascending=False)

dogs_non_retriever_top5 = dogs_non_retriever.loc[dogs_non_retriever['rank'] <= 5.0]

dogs_non_retriever_top5 = pd.pivot_table(dogs_non_retriever_top5, 
                           index=['zip_code', 'primary_breed'], 
                           values=['count'], 
                           aggfunc=np.sum).reset_index().sort_values(by='count', ascending=False)

f, ax = plt.subplots(2, 1, figsize=(20, 18), sharex=True)

sns.set(font_scale=2, palette=sns.color_palette("Set2", 10))
sns.set_style("ticks")

cat_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=cats_non_domestic_top5, ax=ax[0])
dog_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=dogs_non_retriever_top5, ax=ax[1])

cat_bar.set_title('Most Popular Cat Breeds (excl. Domestic)')
cat_bar.set_ylabel('Count')
cat_bar.set_xlabel('Zip Code')

dog_bar.set_title('Most Popular Dog Breeds (excl. Retrievers)')
dog_bar.set_ylabel('Count')
dog_bar.set_xlabel('Zip Code')

cat_bar.xaxis.labelpad = 20
dog_bar.xaxis.labelpad = 20

sns.despine()

cat_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
dog_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[16]:

After the removal of the domestic cat breeds, we see there is a relatively even distribution of several of the most popular breeds including the Siamese, American Shorthair, LaPerm, and the Maine Coon. There also appears to be several Manx and Russian Blues in the 98117 and 98125 zip codes, which is unique from the other zip codes with the most licensed pets. The frequency of breeds is also somewhat consistent across zip codes, with Siamese and American Shorthairs being the most popular followed by LaPerms and Maine Coons. Could these results be indicative of a general breed preference amongst Seattle cat owners? Given the comparatively low frequency of these breeds appearing in comparison to the domestic breeds, there likely isn't much of a relationship; however, it could be interesting to compare these results with a survey or some other data set that gave cat owner breed preferences.

Similar to the Siamese cat, Short Hair Chihuahuas are the most popular dog breeds after the exclusion of Labrador and golden retrievers. What is perhaps most interesting is Miniature Poodles are the third most popular breed in and only in the 98115 zip code. We also see American Pit Bull Terriers, and Australian Cattle Dogs are popular in the 98107 and 98125 zip codes, but not in the other zip codes. As seen in the top cat breeds, the top dog breeds across the considered zip codes are quite consistent, with Australian and German Shepherds, Border Collies and Terriers following Short Hair Chihuahuas in general.

Time Series Analysis of Issued Pet Licenses

The Seattle pet licenses dataset contains the date the license was registered with the city, which gives us the ability to investigate how the volume of licenses being issued has changed over time. To begin analyzing the time series, we first convert the datasets issue_date column into a datetime format using pandas's to_datetime function.

In [17]:
petlicenses['issue_date'] = pd.to_datetime(petlicenses['license_issue_date'])

Plot the number of licenses issued by day for all the days included in the dataset.

In [18]:
petlicense_date = pd.pivot_table(petlicenses, index=['issue_date'], values=['count'], aggfunc=np.sum)
p = petlicense_date.plot(figsize=(16, 5), legend=False, title = 'Daily Pet License Volume')

p.set_ylabel('Volume')
p.set_xlabel('Date')
p.xaxis.labelpad = 30
sns.despine()

There appears to be very little to no pet licenses being issued from the beginning in 2005 up until about mid-2013 when some noticeable volume began to emerge. It isn't until near the end of 2014 that the number of issued pet licenses balloons to over a hundred or more each day.

What could cause such an extreme shift in volume that has continued for just over three years after almost no activity for close to 10 years? Some cursory searching on Google News did not return anything definitive that could be a possible factor of the sudden change, nor did any further investigation on the Seattle Animal Shelter website.

To get a better visualization of how the data appears to trend, split the data by pre-2015 and post-2015 and plot.

In [19]:
f, ax = plt.subplots(1, 2, figsize=(20, 6))

petlicense_date2015 = petlicense_date.loc[petlicense_date.index >='2015-01-01']
petlicense_date2014 = petlicense_date.loc[(petlicense_date.index < '2015-01-01') & (petlicense_date.index >= '2014-01-01')]

p1 = petlicense_date2014.plot(ax=ax[0], legend=False, title='Pet License Volume 2014 - 2015')
p2 = petlicense_date2015.plot(ax=ax[1], legend=False, title='Pet License Volume 2015 - 2017')

p1.set_ylabel('Volume')
p1.set_xlabel('Date')
p1.xaxis.labelpad = 30
sns.despine()

p2.set_ylabel('Volume')
p2.set_xlabel('Date')
p2.xaxis.labelpad = 30
sns.despine()

The data plotted separately further displays the stark difference in pet licensing volume that began in November of 2014 and has continued as far as the data is available. Post-2015, it seems there are a few peaks at the end of each year around November and December, but other than that there doesn't appear to be any significant trend.

We can explore the trend and other components of the time series by performing time series decomposition. Decomposing a time-series into unique components that each represent a particular facet of the time series data helps identify any cyclical or seasonal trends and is an essential piece of any time series analysis.

The statsmodels library provides a function [seasonal_decompose] that performs decomposition on a time series. The frequency parameter freq is set to 12 to indicate monthly frequency, though other values such as 7 (weekly frequency) or 3 (quarterly frequency), could also be used.

In [20]:
decomp2015 = sm.tsa.seasonal_decompose(petlicense_date2015, freq=12)

To get a more visually appealing graph of the decomposed time-series, we access the attributes that represent the fitted components and plot using pandas's convenient visualization methods.

In [21]:
f, ax = plt.subplots(4, 1, figsize=(20, 20), sharex=True, squeeze=True)

decomp2015.observed.plot(ax=ax[0], title='Observed', legend=False)
decomp2015.trend.plot(ax=ax[1], title='Trend', legend=False)
decomp2015.seasonal.plot(ax=ax[2], title='Seasonal', legend=False)
g = decomp2015.resid.plot(ax=ax[3], title='Residual', legend=False)

sns.despine()
g.set_xlabel('Date')
Out[21]:
Text(0.5,0,'Date')