The Austin Animal Center is the largest no-kill animal shelter in the United States, sheltering and protecting more than 18,000 animals each year. As part of the City of Austin's Open Data Initiative, the Center makes its data detailing shelter pet intakes and outcomes publicly available. According to the data portal, over 90% of animal outcomes are adoptions, transfers to other shelter partners, or the return of lost pets to their owners.
Given the data made available by the AAC, is there a predictable pattern or visible trend in shelter pet outcomes? By performing exploratory data analysis we can unveil potential patterns and trends that help inform which features and terms are included in our final prediction model.
This notebook will walk through extracting the AAC's shelter pet outcome dataset using the requests library. Once the data is extracted, we will clean it and enrich it through feature engineering, creating new and potentially more useful indicators of shelter animal outcomes.
The exploratory data analysis will rely primarily on the ubiquitous pandas for loading and manipulating the extracted data, and seaborn for visualizing it. The packages are imported below; we also invoke the %matplotlib inline magic command so that visualizations render directly in the notebook.
%matplotlib inline
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from requests.exceptions import HTTPError
Obtain Shelter Pet Data from the Austin Open Data Portal
The first step in analyzing the data is obtaining the data! The shelter outcome data can be extracted in a variety of ways, including going to the Austin Animal Center Outcomes webpage and downloading the data directly in standard formats, but that's not as fun as extracting the data programmatically with Python =p.
Clicking the API button in the top-right corner of the AAC webpage reveals the API endpoint that will be called to get the data. There are several export options, including CSV and JSON.
The API is powered by Socrata, which returns at most 1,000 records per call. Therefore, we need to page through the results 1,000 at a time, as noted in the Socrata API paging documentation, until the total number of records is reached. The dataset information states there are 77.9k total rows in the dataset.
We set the endpoint, the number of records to return on each call, and the number of pages to request. Although the dataset states there are only about 78k rows in the data, the pages variable is set to 100 anyway, and we break out of the call loop once there are no results to return.
endpoint = 'https://data.austintexas.gov/resource/9t4d-g238.json'
count = 1000
pages = 100
We extract the data using the requests library and append the results to a list; the loop stops once a call returns a 404 error or an empty result. Note: this approach is not optimal. A cleaner approach would be to use a while statement and check the HTTP status code on each iteration until it returns a 404; however, this quick-and-dirty loop works fine for our purposes. Once the extraction of the data is complete, the length of the list is returned.
results = []
params = {'$limit': count, '$offset': 0}

for i in range(0, pages):
    try:
        r = requests.get(endpoint, params=params)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        page = r.json()
        if not page:  # no more records to return
            break
        results.append(page)
        params['$offset'] += count
    except requests.exceptions.HTTPError as err:
        if err.response.status_code == 404:
            break
        else:
            print(err.response.status_code)
len(results)
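The while-style alternative mentioned in the note above can be sketched as a small paging helper. Here fetch_page is a hypothetical stand-in for the requests.get call, so the paging logic can be demonstrated without hitting the network:

```python
def fetch_all_pages(fetch_page, page_size=1000):
    """Collect records from a paged API until an empty page is returned.

    fetch_page(limit, offset) should return a list of records, mirroring a
    Socrata $limit/$offset call such as:
        requests.get(endpoint, params={'$limit': limit, '$offset': offset}).json()
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        if not page:  # an empty page means we have walked past the last record
            break
        records.extend(page)
        offset += page_size
    return records

# Simulate a 2,500-record dataset to exercise the paging logic.
fake_data = [{'animal_id': i} for i in range(2500)]
demo_fetch = lambda limit, offset: fake_data[offset:offset + limit]
print(len(fetch_all_pages(demo_fetch)))  # 2500
```

The same helper would work against the real endpoint by wrapping requests.get in the fetch_page callable.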
We convert the list of JSON results into a pandas DataFrame using the very convenient json_normalize function: each page of results is normalized into a DataFrame, and the pages are concatenated into a single DataFrame, outcome_df.
outcome_df = pd.concat([pd.json_normalize(page) for page in results], ignore_index=True)
len(outcome_df)
The DataFrame contains roughly 78k rows of data, just as the dataset page stated.
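For reference, pd.json_normalize flattens a list of JSON records into columns; a minimal toy example (the records here are invented for illustration):

```python
import pandas as pd

records = [{'animal_id': 'A001', 'outcome_type': 'Adoption'},
           {'animal_id': 'A002', 'outcome_type': 'Transfer'}]

# Each dictionary becomes a row; each key becomes a column.
df = pd.json_normalize(records)
print(list(df.columns))  # ['animal_id', 'outcome_type']
print(len(df))           # 2
```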
As we are only interested in the outcomes of cats for this particular task, filter on the animal_type column and reset the index.
cats_df = outcome_df[outcome_df['animal_type'] == 'Cat'].reset_index(drop=True)
Print the number of cat shelter outcomes as well as the first 5 rows of the filtered dataset.
print(len(cats_df))
cats_df.head()
We successfully extracted the Austin Animal Center's shelter outcome dataset and have already started manipulating the data to our needs! We now turn our attention to the process of feature engineering, an often manual but extremely valuable step in creating models. Before proceeding, however, it can be worthwhile to save the extracted data to a file in case we need to return to it. There are many options for saving a pandas DataFrame, as noted in the pandas documentation.
outcome_df.to_csv('../../data/aac_outcomes_working.csv', index=False, encoding='utf-8')
cats_df.to_csv('../../data/cats_working.csv', index=False, encoding='utf-8')
Feature Engineering
Feature engineering, although often laborious and potentially time-consuming, can yield significantly better results in the modeling phase as the data is richer with information machine learning algorithms can leverage. The step of feature engineering is another phase where a data scientist or analyst's domain knowledge about the particular data being analyzed can enhance and play a vital role in the success of the analysis and model of the problem.
In the end, and possibly for the foreseeable future, machine learning is at its core pattern recognition and statistics performed with computers, and it cannot replace the specialized knowledge of a skilled analyst.
Each row represents one outcome, so we start by adding a simple count column set to 1 for every row; summing it within a group then gives the number of observations in that group.
cats_df['count'] = 1
The sex_upon_outcome column denotes the cat's sex and whether it was spayed or neutered at the time of the outcome. This variable can be split into two new features, sex and Spay/Neuter. These features can be added with the np.where() function, which is essentially a vectorized if/else that works with pandas Series.
cats_df['sex'] = np.where(cats_df['sex_upon_outcome'].str.contains('Male'), 'Male', 'Female')
cats_df['Spay/Neuter'] = np.where(cats_df['sex_upon_outcome'].str.contains('Intact'), 'No', 'Yes')
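To make the np.where pattern concrete, here is a toy Series (values invented for illustration) run through the same two transformations:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Neutered Male', 'Intact Female', 'Spayed Female'])

# np.where(condition, value_if_true, value_if_false), vectorized over the Series.
sex = np.where(s.str.contains('Male'), 'Male', 'Female')
fixed = np.where(s.str.contains('Intact'), 'No', 'Yes')
print(list(sex))    # ['Male', 'Female', 'Female']
print(list(fixed))  # ['Yes', 'No', 'Yes']
```

Note the match is case-sensitive, so 'Female' does not match 'Male'.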
The cat's age upon outcome is not given in a very analysis-friendly format. Let's inspect the first 5 rows of the age_upon_outcome column to see what we're dealing with:
cats_df['age_upon_outcome'].head()
The cat's age is approximated with a numeric value and a period unit such as 'weeks', 'months', or 'years'; therefore, we will only be able to obtain an approximation of the cat's age, which is typically the best accuracy achievable in a shelter environment.
We will transform the column into a value in days that can be converted into months and years as needed. Start by splitting the age_upon_outcome column using the .str.split method. Note we could also pass the expand=True parameter to return a DataFrame of the split columns.
cats_df = cats_df[cats_df['age_upon_outcome'] != 'NULL']
cats_df['Periods'] = cats_df['age_upon_outcome'].str.split(' ').str[0].fillna(0).astype(int)
cats_df['Period Range'] = cats_df['age_upon_outcome'].str.split(' ').str[1].fillna(0)
cats_df['Period Range'].unique()
Taking advantage of the .unique() Series method, we can find the period units we need to account for during the conversion.
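As a quick illustration of the expand=True alternative mentioned above, run on a toy Series rather than the notebook's data:

```python
import pandas as pd

ages = pd.Series(['2 years', '3 weeks', '1 month'])

# expand=True returns a DataFrame with one column per split part.
split_df = ages.str.split(' ', expand=True)
split_df.columns = ['Periods', 'Period Range']
print(split_df['Periods'].tolist())       # ['2', '3', '1']
print(split_df['Period Range'].tolist())  # ['years', 'weeks', 'month']
```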
With the period units known, we can approximate the cat's age in days upon outcome by multiplying the number of periods by the appropriate number of days, again using the np.where() function. Once the approximate age in days is obtained, we also derive the age in years by dividing the days by 365.
cats_df['Period Range'] = np.where(cats_df['Period Range'].str.contains('day', na=False), 1,
                          np.where(cats_df['Period Range'].str.contains('week', na=False), 7,
                          np.where(cats_df['Period Range'].str.contains('month', na=False), 30,
                          np.where(cats_df['Period Range'].str.contains('year', na=False), 365, 0)))).astype(int)
cats_df['outcome_age_(days)'] = cats_df['Period Range'] * cats_df['Periods']
cats_df['outcome_age_(years)'] = cats_df['outcome_age_(days)'] / 365
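The nested np.where conversion above could also be written with a mapping dictionary and Series.map, which some may find easier to extend; a sketch on toy data:

```python
import pandas as pd

days_per_period = {'day': 1, 'week': 7, 'month': 30, 'year': 365}
ranges = pd.Series(['years', 'week', 'months', 'day'])

# Strip the plural 's', then map each unit to an approximate day count
# (unrecognized units fall back to 0).
days = ranges.str.rstrip('s').map(days_per_period).fillna(0).astype(int)
print(days.tolist())  # [365, 7, 30, 1]
```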
In the shelter environment, kittens, defined as cats of approximately six months of age or less, are often adopted much more quickly than adult cats. Kittens can also be transferred to other area organizations for treatments and to get spayed or neutered if the shelter does not have the appropriate facilities or staff on hand to perform the procedure. Therefore, it makes sense to create a few more features that designate whether a cat was of adult or kitten age upon outcome.
cats_df['Cat/Kitten (outcome)'] = np.where(cats_df['outcome_age_(days)'] >= 180, 'Cat', 'Kitten')
cats_df['sex_age_outcome'] = cats_df['sex_upon_outcome'] + ' ' + cats_df['Cat/Kitten (outcome)']
cats_df['age_group'] = pd.cut(cats_df['outcome_age_(years)'], 10)
cats_df['age_group'].value_counts()
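pd.cut, used above to create age_group, bins a numeric column into equal-width intervals; a toy example with invented ages shows the shape of the result:

```python
import pandas as pd

ages = pd.Series([0.2, 0.4, 1.5, 8.0, 14.0])

# Three equal-width bins spanning the observed range of values.
groups = pd.cut(ages, 3)
print(groups.cat.categories.size)   # 3
print(groups.value_counts().sum())  # 5 (every value falls in some bin)
```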
Next, we want to convert the date columns into formats that are workable with pandas for the rest of the analysis, so we apply the handy pd.to_datetime() function. The month and year of the cat's birth and of the outcome are then extracted and stored in new columns.
cats_df['date_of_birth'] = pd.to_datetime(cats_df['date_of_birth'])
cats_df['dob_year'] = cats_df['date_of_birth'].dt.year
cats_df['dob_month'] = cats_df['date_of_birth'].dt.month
cats_df['dob_monthyear'] = pd.to_datetime(cats_df['monthyear']).dt.to_period('M')
cats_df['datetime'] = pd.to_datetime(cats_df['datetime'])
cats_df['outcome_month'] = cats_df['datetime'].dt.month
cats_df['outcome_year'] = cats_df['datetime'].dt.year
cats_df['outcome_weekday'] = cats_df['datetime'].dt.day_name()
cats_df['outcome_hour'] = cats_df['datetime'].dt.hour
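The .dt accessor pattern used above can be seen on a toy timestamp (an invented date):

```python
import pandas as pd

# Parse a string timestamp, then pull components out with the .dt accessor.
ts = pd.to_datetime(pd.Series(['2017-04-11 09:00:00']))
print(ts.dt.year.iloc[0])        # 2017
print(ts.dt.month.iloc[0])       # 4
print(ts.dt.hour.iloc[0])        # 9
print(ts.dt.day_name().iloc[0])  # Tuesday
```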
There are far fewer recognized cat breeds than dog breeds: dogs have been selectively bred for desired traits for millennia, while the cat's role in human society (keeping vermin out of food and grain stores) stayed essentially constant over the same period, so the animal has more or less stayed the same. Only in the last century or so has there been significant breeding for traits such as coat pattern and behavior. Also, unlike many dog breeds, cat breeds are extremely difficult to determine without genetic testing and aren't officially recognized by the CFA (Cat Fanciers' Association) without the proper paperwork. As a result, the vast majority of cats arriving in shelters are classified as the mixed-breed Domestic type, which is generally divided into three hair lengths: 'short', 'medium', and 'long'.
We can see the counts of cat breeds using the .value_counts() method.
cats_df['breed'].value_counts()
As suspected, the Domestic breed is by far the most common. Let's clean up the breed names by removing 'mix' (we will assume the term 'mix' isn't relevant to the breed itself, which is certainly true for Domestic breeds) and trimming any surrounding whitespace.
cats_df['breed'] = cats_df['breed'].str.lower().str.replace('mix', '').str.replace('medium hair', 'mediumhair').str.strip()
The Cat Fanciers' Association (CFA) recognizes more than 40 pedigreed cat breeds. Let's create two new features that distinguish whether a cat is a CFA-recognized breed or a domestic breed. As we saw, most of the cats will fall into the domestic category, but we will be able to analyze whether a cat's breed (or breed type) has any impact on its outcome.
We create a list of the CFA breeds, convert it to lowercase, and join the elements with a pipe (|) to form a regex alternation pattern.
cfa_breeds = ['Abyssinian', 'American Bobtail', 'American Curl', 'American Shorthair', 'American Wirehair',
'Balinese', 'Bengal', 'Birman', 'Bombay', 'British Shorthair', 'Burmese', 'Burmilla',
'Chartreux', 'Colorpoint Shorthair', 'Cornish Rex', 'Devon Rex', 'Egyptian Mau', 'European Burmese',
'Exotic', 'Havana Brown', 'Japanese Bobtail', 'Korat', 'LaPerm', 'Maine Coon', 'Manx',
'Norwegian Forest Cat', 'Ocicat', 'Oriental', 'Persian', 'Ragamuffin', 'Ragdoll', 'Russian Blue',
'Scottish Fold', 'Selkirk Rex', 'Siamese', 'Siberian', 'Singapura', 'Somali', 'Tonkinese',
'Turkish Angora', 'Turkish Van']
cfa_breeds = [i.lower() for i in cfa_breeds]
cfa_breeds = '|'.join(cfa_breeds)
Some of the records contain two breeds, which we need to split to correctly categorize the cats as a domestic or pedigreed breed. This is done using the .str.split() method with expand=True, in combination with the pd.concat() function to join the resulting DataFrame to our original dataset.
Each record is then categorized based on the resulting two breed columns. We mark the cat as a CFA-recognized breed if either breed column contains a breed from the list above (we assume that a record listing a Domestic breed alongside a CFA breed, or vice versa, counts as a CFA breed); cats matching no CFA breed are categorized as domestic.
cats_df = pd.concat([cats_df,
                     cats_df['breed'].str.split('/', expand=True).rename(columns={0: 'breed1', 1: 'breed2'})],
                    axis=1)
cats_df['cfa_breed'] = (cats_df['breed1'].str.contains(cfa_breeds, na=False) |
                        cats_df['breed2'].str.contains(cfa_breeds, na=False))
cats_df['domestic_breed'] = ~cats_df['cfa_breed']
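Matching a pipe-joined pattern with .str.contains behaves as below. This toy sketch uses a two-breed pattern as a stand-in for the full cfa_breeds string; na=False treats a missing second breed as a non-match rather than propagating a missing value:

```python
import pandas as pd

pattern = 'siamese|maine coon'  # stand-in for the full cfa_breeds pattern
breeds = pd.Series(['domestic shorthair', 'siamese', 'maine coon/domestic', None])

# na=False converts the None entry to False instead of NaN.
matches = breeds.str.contains(pattern, na=False)
print(matches.tolist())  # [False, True, True, False]
```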
The last given feature that may be of interest in our analysis is the color of the cat. As before, we can find the counts of each color using the .value_counts() method.
cats_df['color'].value_counts()
We see there is quite a variety of colors and, as with the breeds, there are at times two colors given. If we want to use this information in our visualizations and model, we need to clean up the data with some more feature engineering!
Since we are dealing with not just coat colors but coat patterns, such as calico, tabby, colorpoint, and others, it would make sense to break up the data into more useful information based on the given color and if a coat pattern is given. Luckily, there aren't too many coat patterns to deal with, and we can list them as with the breeds. An excellent resource to see what each coat pattern looks like can be found at the University of Florida's Shelter Medicine website.
coat_patterns = ['tabby', 'tortie', 'tortoiseshell', 'calico', 'smoke', 'torbie',
'colorpoint', 'point', 'brindle', 'agouti', 'bicolor', 'tricolor']
We want to extract the given coat pattern and color information; one way is to use a regex string in conjunction with pandas' str.extract.
# Create the regex string by joining the coat patterns with | and wrapping them in parentheses
coat_reg = '(' + '|'.join(coat_patterns) + ')'
The color column is converted to lower case, and if a coat pattern matching one of the values in our list is present, it is extracted into a new column, coat_pattern. We also remove the coat pattern from the color column so we don't create duplicate values.
cats_df['color'] = cats_df['color'].str.lower()
cats_df['coat_pattern'] = cats_df['color'].str.extract(coat_reg, expand=False)
cats_df['color'] = cats_df['color'].str.replace(coat_reg, '', regex=True).str.strip()
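On a toy color Series (values invented), the extract-then-remove steps look like this:

```python
import pandas as pd

coat_reg = '(tabby|calico)'  # trimmed-down version of the pattern built above
colors = pd.Series(['brown tabby', 'black', 'calico'])

# Pull any matching coat pattern into its own column...
extracted = colors.str.extract(coat_reg, expand=False).fillna('(none)')
# ...then remove it from the color string and trim leftover whitespace.
remaining = colors.str.replace(coat_reg, '', regex=True).str.strip()
print(extracted.tolist())  # ['tabby', '(none)', 'calico']
print(remaining.tolist())  # ['brown', 'black', '']
```

Note the last value is empty: a pattern-only color like 'calico' leaves nothing behind, which is exactly the case handled with 'Breed Specific' below.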
The transformed color column is then split into a two-column DataFrame using the .str.split() method with the expand parameter set to True. This operation leaves some empty values in the first color column, color1, because we removed the coat pattern from the original column earlier. These blank values occur only when a coat pattern alone, such as tabby, torbie, or tortoiseshell, was given as the color; therefore, we find any rows where color1 is blank and change them to 'Breed Specific'. The extracted coat colors and patterns are then combined into a new column, coat.
cats_df = pd.concat([cats_df,
cats_df['color'].str.split('/', expand=True).rename(columns={0:'color1', 1:'color2'})], axis=1)
cats_df['color1'] = np.where(cats_df['color1'] == '', 'Breed Specific', cats_df['color1'])
cats_df['coat'] = np.where(cats_df['color1'] == 'Breed Specific', cats_df['coat_pattern'], cats_df['color1'])
Conclusion
That wraps up the feature engineering portion of our data analysis task! Although adding more features to the data can be somewhat tedious, it can result in a much-improved model for predicting outcomes. The process of feature engineering also allows the analyst or data scientist working with the data to apply their domain knowledge and enrich the data's information.
As a final step, we save the new data with the added features to a csv file that we will use in the following steps of analyzing the shelter outcome data.
cats_df.to_csv('../../data/cats_df_clean.csv', index=False, encoding='utf-8')
The next step in the analysis is to visualize the data and generate hypotheses about which features appear to have a significant impact on the outcome of the cat.