Extract and Analyze the Seattle Pet Licenses Dataset

The city of Seattle makes available its database of pet licenses issued from 2000 to the beginning of 2020 as part of the city's ongoing Open Data Initiative. This post explores extracting the data from Seattle's Open Data portal using requests, then transforming the extracted JSON data into a workable dataset with pandas to analyze and investigate the pet license database.

About Seattle Pet Licensing

The city of Seattle requires pets over eight weeks old to be licensed. There are several benefits to licensing one's pet, including a return ride home if your pet is lost and easier contact from a veterinarian if your pet is injured. If the licensing is performed at the Seattle Animal Shelter on the third Saturday of any given month, a free rabies vaccine is included, as well as other vaccines and a microchip for a small additional fee.

Getting Started

Enable the %matplotlib inline magic command for displaying output visualizations inline in the notebook, and import the libraries that will be used in this section of the analysis.

In [1]:
%matplotlib inline

import requests
import pandas as pd
import numpy as np
from requests.exceptions import HTTPError
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

import warnings

Here, we set the formatting for the plots created with seaborn that appear later in the analysis.

In [2]:
sns.set(font_scale=2, palette=sns.color_palette("Set2", 10), color_codes=False)

Getting the Data

We extract the data programmatically from the Seattle Open Data Portal. The portal utilizes Socrata to host its data. Therefore we will be working with the Socrata API. The Seattle pet licenses database API endpoint can be found on the dataset's information page by clicking the API button in the top-right corner.

Socrata's paging documentation states there is a return limit of 1,000 for each call. Therefore, we need to page the results to capture the entire dataset. The default format returned from the API is JSON, though other formats such as CSV and XML are also available. We select the default JSON format and collect the results into a list that we can later use to build a pandas DataFrame. The following block extracts the data from the API into a list until there are no more results to return.

In [3]:
count = 1000
offset = 0
params = {'$limit': count, '$offset': offset}
endpoint = 'https://data.seattle.gov/resource/87w5-fd6d.json'

results = []

while True:
    try:
        r = requests.get(endpoint, params=params)
        r.raise_for_status()
        rcontent = r.json()

        # An empty page means the entire dataset has been collected.
        if rcontent == []:
            break

        results.append(rcontent)
        offset += count
        params['$offset'] = offset

    except HTTPError as err:
        # Stop paging if the endpoint itself cannot be found.
        if err.response.status_code == 404:
            break
        raise
The data is now stored in a list that we can iterate through and normalize each JSON result using pandas's handy json_normalize function. Once all the JSON data is collected into a DataFrame, we can inspect the number of rows and print the first few rows of the data.

In [4]:
petlicenses = pd.concat(
    [pd.json_normalize(page) for page in results], ignore_index=True
)

print(len(petlicenses))
petlicenses.head()

animal_s_name license_issue_date license_number primary_breed secondary_breed species zip_code
0 NaN 2000-01-03T00:00:00.000 263574 Shepherd NaN Dog 98119
1 Fancy 2000-01-05T00:00:00.000 119820 Retriever, Labrador NaN Dog 98106
2 Skip 2000-01-06T00:00:00.000 10401 Siberian Husky Mix Dog NaN
3 Kanga 2000-01-12T00:00:00.000 941592 German Shepherd NaN Dog 98107
4 Oscar 2000-01-24T00:00:00.000 422763 Retriever, Golden NaN Dog NaN

The row count is just over 49k, which matches the number of rows given on the dataset's portal page. We see the data has seven columns, the same as those listed on the dataset's webpage. We now have a nice, tidy dataset that was created programmatically with a combination of libraries! Isn't that much more satisfying than just downloading the data as a .csv from the website? (Of course it is.)
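For reference, json_normalize flattens any nested JSON keys into dotted column names, which is why it is preferable to the plain DataFrame constructor for API payloads. A minimal sketch with a hypothetical nested record (the real licenses API returns flat objects, so this is illustration only):

```python
import pandas as pd

# Hypothetical nested record for illustration; the actual API returns
# flat objects with keys like animal_s_name and license_number.
records = [
    {"animal_s_name": "Fancy", "species": "Dog",
     "license": {"number": "119820", "issue_date": "2000-01-05"}},
]

flat = pd.json_normalize(records)

# The nested 'license' dict becomes 'license.number' and 'license.issue_date'.
print(sorted(flat.columns))
```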

It is often a good idea to save the extracted and transformed data to disk to avoid repeatedly hitting the API if the data will be used later. We quickly save the DataFrame as a .csv using the to_csv method.

In [7]:
petlicenses.to_csv('../../data/seattle_pet_licenses.csv', index=False, encoding='utf-8')

Exploratory Data Analysis of the Seattle Pet Licenses Dataset

Now that we have a tidy dataset to work with, we can move onto exploring the data! As a start, let's see how many pet licenses have been issued by species of pet.

In [8]:
petlicenses['species'].value_counts()

Dog     33220
Cat     15971
Goat       36
Pig         3
Name: species, dtype: int64

We see a vast disparity, more than 2 to 1, between the count of licenses issued to dogs and cats. Some Seattle residents have registered goats and pigs as pets as well, though, as one may expect, the volume of registrations for goats and pigs is far lower than more common household pets.

According to a survey conducted by the American Pet Products Association between 2017 and 2018, there were an estimated 94.2 million cats, averaging two cats per household, and an estimated 89.7 million dogs, averaging just under 1.5 dogs per household. Assuming these numbers are proportional to the pet-owning population in Seattle, why are dogs licensed almost twice as often as cats? Some possible factors include:

  • One of the main benefits of getting a license for one's pet is that the pet can be returned at no cost should it go missing. As dogs are outdoors much more frequently than cats for walks and the like, perhaps the thought of one's dog becoming lost is more prevalent in dog owners' minds.
  • According to another survey, conducted in 2012 by the American Veterinary Medical Association, 66.7% of dog owners surveyed considered their dog(s) part of the family, compared to 56.1% of cat owners surveyed. This perception of family membership could be another factor in the difference in licensing rates, as it can be reasonably assumed that owners who consider their pets part of the family are more likely to purchase a license.
    • The same AVMA survey also found that the average amount spent on veterinary care per year was $227 for each dog compared to just $90 for each cat. This could be further evidence of the general attitude towards dogs and cats as family members, though cats are also less prone to outwardly showing illness, which makes it harder to know when they need care and could likewise contribute to the difference in spending.

As we are more interested in dog and cat licenses for this analysis, and given the low overall volume of licenses issued to other types of pets, we remove the goat and pig licenses from the data.

In [12]:
petlicenses = petlicenses.loc[petlicenses['species'].str.contains('Dog|Cat')]
print(petlicenses['species'].unique())
['Dog' 'Cat']

Finding Most Common Dog and Cat Names

The pet's name for each license is also available in the licenses dataset. Using this information, can we find any similarity in the names of cats and dogs given by their owners? To answer this question, let's first plot the top ten cat and dog names by total count.

In [13]:
f, ax = plt.subplots(1, 2, figsize=(20, 7))

cat_names = petlicenses[petlicenses['species'] == 'Cat']['animal_s_name'].value_counts()
dog_names = petlicenses[petlicenses['species'] == 'Dog']['animal_s_name'].value_counts()

c = cat_names[0:10].plot(kind='bar', ax=ax[0],
                         title='Top Ten Cat Names')

d = dog_names[0:10].plot(kind='bar', ax=ax[1],
                         title='Top Ten Dog Names')



The most common name for both dogs and cats is Lucy, with names such as Max, Luna, Bella, and Charlie also reaching the top ten for both animals. Given that 60% of the top ten cat and dog names are shared between the two species, we can reasonably assume there is a decent amount of commonality in names amongst dogs and cats.

Now let's examine the 50 most common cat and dog names by total count and see the intersection of names using some of Python's available set operations. We print the names that are shared by cats and dogs in the top 50 names, as well as the total number.

In [14]:
cat_names = cat_names[0:50].index.tolist()
dog_names = dog_names[0:50].index.tolist()

common_names = set(cat_names).intersection(dog_names)
print(len(common_names), common_names)
29 {'Maggie', 'Lucy', 'Milo', 'Ruby', 'Oliver', 'Jasper', 'Buddy', 'Lily', 'Max', 'Oscar', 'Olive', 'Henry', 'Sophie', 'Molly', 'Penny', 'Charlie', 'Leo', 'Rosie', 'Chloe', 'Gracie', 'Pepper', 'Sasha', 'Luna', 'Bella', 'Stella', 'Jack', 'Lola', 'Daisy', 'Abby'}

Interestingly, 29 of the top 50 cat and dog names are shared (58%), almost precisely the same ratio found in the list of top ten names.
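The same set machinery also gives the names unique to each species via set difference. A toy sketch with hypothetical name lists (not the actual top-50 results):

```python
# Hypothetical top-name lists for illustration only.
cat_top = ['Lucy', 'Luna', 'Max', 'Oliver', 'Cleo']
dog_top = ['Lucy', 'Charlie', 'Max', 'Bella', 'Cooper']

shared = set(cat_top) & set(dog_top)    # names common to both lists
cat_only = set(cat_top) - set(dog_top)  # names unique to the cat list
dog_only = set(dog_top) - set(cat_top)  # names unique to the dog list

print(sorted(shared))    # -> ['Lucy', 'Max']
print(sorted(cat_only))  # -> ['Cleo', 'Luna', 'Oliver']
```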

Determining the Most Common Pet Breeds

According to the American Kennel Club, Labrador Retrievers, German Shepherds, and Golden Retrievers are the most popular dog breeds of 2016 and have been since 2013. The top three most popular cat breeds, according to a recent article by The Spruce are the Siamese, Persian, and Maine Coon. Using the primary breed information available in the pet license database, we can see if dog and cat owners in Seattle confirm the results as reported by the articles.

Here, we plot the top 20 most frequently appearing primary breeds of dogs and cats.

In [15]:
f, ax = plt.subplots(1, 2, figsize=(20, 8))

cat_breeds = petlicenses[petlicenses['species'] == 'Cat']['primary_breed'].value_counts()
dog_breeds = petlicenses[petlicenses['species'] == 'Dog']['primary_breed'].value_counts()

c = cat_breeds[0:20].plot(kind='bar', ax=ax[0],
                          title='Top 20 Cat Breeds')

d = dog_breeds[0:20].plot(kind='bar', ax=ax[1],
                          title='Top 20 Dog Breeds')


Based on the available data, we see the top three dog breeds are Labrador Retrievers, Golden Retrievers, and Short Coat Chihuahuas, with German Shepherds and Terriers making up the fourth and fifth most popular breeds. Technically speaking, domestic cats are not considered a pure breed, as they are a mix of several different breeds. Excluding the domestic breeds, the next most popular cat breeds are the American Shorthair, Siamese, and LaPerm. The American Shorthair is the twelfth most popular breed on The Spruce's list, while the LaPerm doesn't make an appearance at all.

Dog breeds also show a more even distribution, while cat breeds are heavily skewed towards the domestic breeds, particularly the short hair variety. There are many potential reasons for this, the primary one being that cat breeding is relatively new compared to breeding dogs for specific traits; thus far fewer cat breeds are recognized by the relevant governing bodies. It is also much more challenging to identify the breed of a cat, the only surefire way being genetic testing.

Analysis of Pet Licensing by Geography

The zip code of the pet's home is also given in the dataset. As the data is user-generated, we need to perform a few cleaning steps to get the zip codes all in a similar format before we can analyze the data further.

Zip codes are often given a four-digit suffix beginning with a hyphen for more specificity regarding the particular area. This suffix is not required for our purposes. Therefore we convert the zip code into a string, split it on the hyphen, and keep only the first piece.

In [16]:
petlicenses['zip_code'] = petlicenses['zip_code'].astype(str)
petlicenses['zip_code'] = petlicenses['zip_code'].str.split('-').str[0]
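One caveat worth noting: astype(str) converts missing zip codes into the literal string 'nan' rather than a true missing value, so they survive the split and show up as a 'nan' category downstream. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical zip codes, including a suffixed one and a missing value.
zips = pd.Series(['98115-1234', '98103', np.nan])

# astype(str) stringifies NaN to 'nan'; the split then keeps the prefix.
cleaned = zips.astype(str).str.split('-').str[0]
print(cleaned.tolist())  # -> ['98115', '98103', 'nan']
```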

Now that we have a cleaner set of zip codes, we can start by plotting which zip codes have the most licensed pets. As there are quite a lot of zip codes, we focus on the top 15 by volume. A count column filled with 1s is also added for convenience when plotting.

In [17]:
petlicenses['count'] = 1

petlicenses_zip = pd.pivot_table(petlicenses, index=['zip_code'], 
                                 values=['count'], aggfunc=np.sum)

top15_zips = petlicenses_zip.sort_values('count', ascending=False)[0:15]

z = top15_zips.plot(kind='bar', figsize=(15, 5), 
                    title='Count of Licensed Pets by Zip Code')

z.set_xlabel('Zip Code')
z.xaxis.labelpad = 30

Only three zip codes (98115, 98103, and 98117) have more than 3,000 licensed pets. Anecdotally, the fact that 98115 has the most licensed pets doesn't surprise me, as I lived in that zip code for several years and could always see numerous dogs being walked throughout the day.

In [19]:
petlicenses_species_zip = pd.pivot_table(petlicenses, 
                                         index=['zip_code', 'species'], 
                                         values=['count'], aggfunc=np.sum).reset_index()

petlicenses_species_zip = petlicenses_species_zip.loc[petlicenses_species_zip['zip_code'].isin(top15_zips.index)]

plt.figure(figsize=(18, 7))

sns.set(font_scale=1.75, palette=sns.color_palette("Set2", 10), color_codes=False)

bar = sns.barplot(x='zip_code', y='count', hue='species', data=petlicenses_species_zip)
bar.set_title('Count of Licensed Dogs and Cats by Zip Code')
bar.set_xlabel('Zip Code')

bar.xaxis.labelpad = 10
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

As one would reasonably expect, given the almost 2:1 ratio in the number of dogs licensed compared to cats, the top 15 zip codes by total licensed pets also show a significant difference between licensed dogs and cats.

Earlier in the analysis, we found the most frequently appearing cat and dog breeds in the pet licenses database. Does the frequency of breeds differ when viewed at the zip code level? That is, are there areas of Seattle that favor a particular dog or cat breed over others? One could reasonably hypothesize that smaller dogs such as Chihuahuas, Miniature Poodles, and Pugs would be more common in the denser areas of the city, both due to space constraints and because some apartment buildings charge extra for larger breeds of dogs.

The following creates two new DataFrames containing the top five cat and dog breeds for the top five zip codes. A rank column is added using pandas's groupby and rank() methods to find the most frequently occurring breeds in each zip code, which is then used to filter the breeds down to the top five.
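Before applying it to the licenses data, here is how dense ranking behaves on a toy frame (hypothetical breeds and counts, for illustration): within each group, ties share a rank, and the next distinct count receives the next consecutive integer.

```python
import pandas as pd

# Hypothetical counts of licensed pets by breed within two zip codes.
toy = pd.DataFrame({
    'zip_code': ['98115', '98115', '98115', '98103', '98103'],
    'primary_breed': ['Lab', 'Chihuahua', 'Pug', 'Lab', 'Beagle'],
    'count': [10, 10, 5, 8, 3],
})

# Rank counts within each zip code, highest count first; ties share a rank.
toy['rank'] = toy.groupby('zip_code')['count'].rank(method='dense', ascending=False)
print(toy['rank'].tolist())  # -> [1.0, 1.0, 2.0, 1.0, 2.0]
```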

In [20]:
breeds_zipcode = pd.pivot_table(petlicenses, 
                                index=['primary_breed', 'zip_code', 'species'], 
                                values=['count'], aggfunc=np.sum).reset_index().sort_values('count', 
                                                                                            ascending=False)

cats = breeds_zipcode.loc[breeds_zipcode['species'] == 'Cat'].copy()
dogs = breeds_zipcode.loc[breeds_zipcode['species'] == 'Dog'].copy()

top5_zips = top15_zips[0:5]

cats['rank'] = cats.groupby('zip_code')['count'].rank(method='dense', 
                                                      ascending=False)

cats = cats.loc[cats['zip_code'].isin(top5_zips.index)]

cats_top5 = cats.loc[cats['rank'] <= 5.0]

cats_top5 = pd.pivot_table(cats_top5, 
                           index=['zip_code', 'primary_breed'], 
                           values=['count'], aggfunc=np.sum).reset_index()

dogs['rank'] = dogs.groupby('zip_code')['count'].rank(method='dense', 
                                                      ascending=False)

dogs = dogs.loc[dogs['zip_code'].isin(top5_zips.index)]

dogs_top5 = dogs.loc[dogs['rank'] <= 5.0]

dogs_top5 = pd.pivot_table(dogs_top5, 
                           index=['zip_code', 'primary_breed'], 
                           values=['count'], aggfunc=np.sum).reset_index()

Plot the top five most frequently occurring dog and cat breeds within the zip codes with the most licensed pets.

In [21]:
f, ax = plt.subplots(2, 1, figsize=(20, 18), sharex=True)

cat_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=cats_top5, ax=ax[0])
dog_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=dogs_top5, ax=ax[1])

cat_bar.set_title('Most Popular Cat Breeds by Zip Code')
cat_bar.set_xlabel('Zip Code')

dog_bar.set_title('Most Popular Dog Breeds by Zip Code')
dog_bar.set_xlabel('Zip Code')

cat_bar.xaxis.labelpad = 20
dog_bar.xaxis.labelpad = 20


cat_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
dog_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

As one might reasonably suspect, the most common breeds overall are also the most frequently appearing within each zip code, although there are several ties for the fifth most popular dog breed. Is there any more variety when the domestic cat breeds and retriever dog breeds are removed from the results before ranking?

In [22]:
cats_non_domestic = cats.loc[~cats['primary_breed'].str.contains('Domestic')].copy()

cats_non_domestic['rank'] = cats_non_domestic.groupby('zip_code')['count'].rank(method='dense', ascending=False)

cats_non_domestic_top5 = cats_non_domestic.loc[cats_non_domestic['rank'] <= 5.0]

cats_non_domestic_top5 = pd.pivot_table(cats_non_domestic_top5, 
                                        index=['zip_code', 'primary_breed'], 
                                        values=['count'], aggfunc=np.sum).reset_index().sort_values(by='count', ascending=False)

dogs_non_retriever = dogs.loc[~dogs['primary_breed'].str.contains('Retriever')].copy()

dogs_non_retriever['rank'] = dogs_non_retriever.groupby('zip_code')['count'].rank(method='dense', ascending=False)

dogs_non_retriever_top5 = dogs_non_retriever.loc[dogs_non_retriever['rank'] <= 5.0]

dogs_non_retriever_top5 = pd.pivot_table(dogs_non_retriever_top5, 
                                         index=['zip_code', 'primary_breed'], 
                                         values=['count'], aggfunc=np.sum).reset_index().sort_values(by='count', ascending=False)

f, ax = plt.subplots(2, 1, figsize=(20, 18), sharex=True)

sns.set(font_scale=2, palette=sns.color_palette("Set2", 10), color_codes=False)

cat_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=cats_non_domestic_top5, ax=ax[0])
dog_bar = sns.barplot(x='zip_code', y='count', hue='primary_breed', data=dogs_non_retriever_top5, ax=ax[1])

cat_bar.set_title('Most Popular Cat Breeds (excl. Domestic)')
cat_bar.set_xlabel('Zip Code')

dog_bar.set_title('Most Popular Dog Breeds (excl. Retrievers)')
dog_bar.set_xlabel('Zip Code')

cat_bar.xaxis.labelpad = 20
dog_bar.xaxis.labelpad = 20


cat_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
dog_bar.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

After the removal of the domestic cat breeds, we see a relatively even distribution among several of the most popular breeds, including the Siamese, American Shorthair, LaPerm, and Maine Coon. The frequency of breeds is also fairly consistent across zip codes, with Siamese and American Shorthairs the most popular, followed by LaPerms and Maine Coons. Could these results indicate a general breed preference among Seattle cat owners? Given how rarely these breeds appear compared to the domestic breeds, there likely isn't much of a relationship; however, it could be interesting to compare these results with a survey or another dataset of cat owners' breed preferences. Similar to the Siamese cat, the Short Hair Chihuahua is the most popular dog breed once Labrador and Golden Retrievers are excluded. Perhaps most interesting, Miniature Poodles appear only in the 98115 and 98117 zip codes, and Beagles only in 98103.

Time Series Analysis of Issued Pet Licenses

The Seattle pet licenses dataset contains the date each license was registered with the city, which gives us the ability to investigate how the volume of licenses being issued has changed over time. To begin analyzing the time series, we first convert the dataset's license_issue_date column into a datetime format using pandas's to_datetime function.

In [23]:
petlicenses['issue_date'] = pd.to_datetime(petlicenses['license_issue_date'])

Plot the number of licenses issued by day for all the days included in the dataset.

In [24]:
petlicense_date = pd.pivot_table(petlicenses, index=['issue_date'], values=['count'], aggfunc=np.sum)
p = petlicense_date.plot(figsize=(16, 5), legend=False, title='Daily Pet License Volume')

p.xaxis.labelpad = 30

There appear to be few to no pet licenses issued from the beginning of 2000 until about mid-2013, when some noticeable volume begins to emerge. It isn't until near the end of 2014 that the number of issued pet licenses balloons to a hundred or more each day.

What could cause such an extreme shift in volume, one that has continued through the end of the available data after almost no activity for close to 10 years? Some cursory searching on Google News did not return anything definitive that could explain the sudden change, nor did further investigation on the Seattle Animal Shelter website.

To get a better visualization of how the data appears to trend, split the data by pre-2015 and post-2015, and plot.

In [25]:
f, ax = plt.subplots(1, 2, figsize=(20, 6))

petlicense_date2015 = petlicense_date.loc[petlicense_date.index >='2015-01-01']
petlicense_date2014 = petlicense_date.loc[(petlicense_date.index < '2015-01-01') & (petlicense_date.index >= '2014-01-01')]

p1 = petlicense_date2014.plot(ax=ax[0], legend=False, title='Pet License Volume 2014 - 2015')
p2 = petlicense_date2015.plot(ax=ax[1], legend=False, title='Pet License Volume 2015 - 2020')

p1.xaxis.labelpad = 30

p2.xaxis.labelpad = 30

Plotting the data separately further displays the stark difference in pet licensing volume that began in November of 2014 and has continued for as long as the data is available. Post-2015, there appear to be a few peaks at the end of each year, around November and December; beyond that, there doesn't appear to be any significant trend.

We can explore the trend and other components of the time series by performing time series decomposition. Decomposing a time series into components that each represent a particular facet of the data helps identify any cyclical or seasonal patterns and is an essential piece of any time series analysis. In the additive model used here, each observation is modeled as observed = trend + seasonal + residual.

The statsmodels library provides a function, seasonal_decompose, that performs this decomposition on a time series. The seasonal period parameter freq (renamed period in newer releases of statsmodels) is set to 12 here; other values, such as 7 for a weekly cycle in daily data, could also be used depending on the assumed seasonality.

In [26]:
decomp2015 = sm.tsa.seasonal_decompose(petlicense_date2015, freq=12)

To get a more visually appealing graph of the decomposed time-series, we access the attributes that represent the fitted components and plot using pandas's convenient visualization methods.

In [27]:
f, ax = plt.subplots(4, 1, figsize=(20, 20), sharex=True, squeeze=True)

decomp2015.observed.plot(ax=ax[0], title='Observed', legend=False)
decomp2015.trend.plot(ax=ax[1], title='Trend', legend=False)
decomp2015.seasonal.plot(ax=ax[2], title='Seasonal', legend=False)
g = decomp2015.resid.plot(ax=ax[3], title='Residual', legend=False)
g.set_xlabel('Date')
