The Austin Animal Center provides its animal intake and outcome datasets on Socrata. When an animal is taken into the shelter, it is given a unique identifier that is also used in the outcomes dataset. We have already performed exploratory data analysis on the Austin Animal Center's intakes and outcomes individually and found several interesting facets of the data. In this analysis, we merge the intakes and outcomes datasets using pandas so that we can perform exploratory data analysis on the merged data. With the data merged, we will be able to explore in more depth the transition from intake to outcome.
Getting Started
As before, we import the libraries that will be used throughout the analysis. The %matplotlib inline magic command is used to display the visualizations inline in the notebook.
%matplotlib inline
import requests
import pandas as pd
import numpy as np
import seaborn as sns
from urllib.error import HTTPError
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore')
The following imports a Python module containing the get_soda_api_data() function and the other transformation functions used in the intakes analysis. The code is available as a GitHub Gist.
import aac_analysis as aac
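For reference, here is a minimal sketch of what a paginated SODA extraction function might look like. The actual get_soda_api_data() implementation lives in the Gist; this version simply assumes the endpoint supports the standard SODA $limit and $offset paging parameters.

```python
import requests
import pandas as pd

def get_soda_api_data(endpoint, page_size=50000):
    """Page through a SODA endpoint and return all records as a DataFrame.

    Sketch only; the real implementation is in the aac_analysis Gist.
    """
    records, offset = [], 0
    while True:
        # SODA endpoints accept $limit/$offset query parameters for paging.
        resp = requests.get(endpoint, params={'$limit': page_size, '$offset': offset})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        offset += page_size
    return pd.DataFrame(records)
```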
Getting the Data
As seen in previous analyses, the Austin Animal Center datasets are hosted on the Socrata Open Data API (SODA). We've already extracted the intakes dataset in the most recent analysis, so we only need to extract the outcomes data. We extract the AAC outcomes dataset using the get_soda_api_data() function from the aac_analysis module.
outcomes_endpoint = 'https://data.austintexas.gov/resource/9t4d-g238.json'
outcomes_df = aac.get_soda_api_data(outcomes_endpoint)
Print the dimensions of the extracted data and the first five rows.
print(outcomes_df.shape)
outcomes_df.head()
There are slightly more than 80k rows and 12 columns, which match the metadata available on the outcomes data page on Socrata (at the time of this writing). Now that we have the data extracted from the SODA API and converted into a pandas DataFrame, let's save the data as a CSV file to avoid calling the Socrata API more times than necessary.
outcomes_df.to_csv('../../../datasets/austin animal center intake outcomes/aac_outcomes.csv',
encoding='utf-8', index=False)
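To keep re-runs cheap, the fetch-then-cache pattern can be wrapped in a small helper. This is just a sketch; load_cached is a hypothetical name, not part of the aac_analysis module.

```python
import os

import pandas as pd

def load_cached(path, fetch):
    """Return the DataFrame cached at `path`, calling fetch() and caching on a miss."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch()  # e.g. lambda: aac.get_soda_api_data(outcomes_endpoint)
    df.to_csv(path, encoding='utf-8', index=False)
    return df
```

On the first call the data is pulled from the API and written to disk; subsequent calls read the CSV instead.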
Creating and preparing the merged dataset
Now that we have the outcomes dataset, we begin adding more features to the data, as we did for the intakes. Since we already created the necessary functions during the intakes analysis, this step is merely a matter of applying them to the data. First, however, we create one more function that adds features derived from the animal's estimated date of birth, which is provided in the outcomes data but not the intakes.
def transform_birthdates(df):
    df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
    df['dob_year'] = df['date_of_birth'].dt.year
    df['dob_month'] = df['date_of_birth'].dt.month
    df['dob_monthyear'] = pd.to_datetime(df['monthyear']).dt.to_period('M')
    return df
We now add more features to the outcomes using our previously developed functions and the newly written transform_birthdates() function.
outcomes_df = aac.transform_age(outcomes_df, 'age_upon_outcome')
outcomes_df = aac.transform_date(outcomes_df, 'outcome')
outcomes_df = aac.create_unique_id(outcomes_df, 'outcome')
outcomes_df = transform_birthdates(outcomes_df)
outcomes_df.head()
The additional feature columns are now added to the outcomes data! The last step in creating the dataset is to load the intakes data we previously worked with and merge the two sets. We load the intakes data, saved at the end of the previous analysis with all the added features already included, using pandas's read_csv.
intakes_df = pd.read_csv('../../../datasets/austin animal center intake outcomes/aac_intakes_transformed.csv')
Both datasets are now loaded as DataFrames. Before proceeding to merge the data, we take an additional step of dropping the duplicate columns in both datasets. First, let's check to see what and how many duplicate columns are in both datasets.
set(intakes_df.columns).intersection(outcomes_df.columns)
We want to keep some of the columns like breed and color in one dataset, but drop the duplicates and other no longer needed columns from the other. We choose to keep the still-relevant columns in the outcomes dataset, though not for any particular reason, as the values should be the same across the data. One could investigate further to see whether an animal's descriptive values like color actually change from intake to outcome, but this isn't germane to our analysis.
cols_to_drop = ['index', 'name']
outcomes_df.drop(['animal_type', 'breed', 'color'] + cols_to_drop, axis=1, inplace=True)
intakes_df.drop(cols_to_drop, axis=1, inplace=True)
The unneeded and duplicate columns are now removed from the datasets, so we can proceed to merge the data! The unique identifier we created from the animal's provided ID and its intake and outcome number is set as the index for both datasets. This unique ID allows us to perform the merge appropriately, as some of the animal IDs in both datasets are not unique due to some animals being taken into the shelter on multiple occasions. The merge itself is performed using pandas's merge function, which joins two DataFrames SQL-style.
outcomes_df.set_index('animal_id_new', inplace=True)
intakes_df.set_index('animal_id_new', inplace=True)
in_out = pd.merge(outcomes_df, intakes_df, how='inner',
right_index=True, left_index=True, suffixes=['_outcome', '_intake'])
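If we wanted to check how well the two indexes line up before committing to an inner join, merge's indicator flag labels each row as coming from both frames or only one. A toy example with made-up frames, not the AAC data:

```python
import pandas as pd

# Two tiny frames sharing a unique-id index, with one unmatched row on each side.
outcomes = pd.DataFrame({'outcome_type': ['Adoption', 'Transfer']}, index=['A1_1', 'A2_1'])
intakes = pd.DataFrame({'intake_type': ['Stray', 'Owner Surrender']}, index=['A1_1', 'A3_1'])

# An outer merge with indicator=True shows which ids matched ('both')
# and which exist in only one frame ('left_only' / 'right_only').
check = pd.merge(outcomes, intakes, how='outer',
                 left_index=True, right_index=True, indicator=True)
print(check['_merge'].value_counts())
```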
Let's check the columns of the new dataset along with the dimensions and the first few rows of the data.
print(in_out.columns)
print(in_out.shape)
in_out.head()
With one merged dataset that contains the available data from an animal's intake to its outcome, we can find the amount of time the animal was in the shelter before its outcome. The intake and outcome datetimes are converted into datetime objects using pandas's to_datetime() function. We can then subtract the times to create a time_in_shelter column.
in_out['outcome_datetime'] = pd.to_datetime(in_out['outcome_datetime'])
in_out['intake_datetime'] = pd.to_datetime(in_out['intake_datetime'])
in_out['time_in_shelter'] = in_out['outcome_datetime'] - in_out['intake_datetime']
Using the describe() method, we can get a handy table of descriptive statistics that gives us a general sense of the distribution of the newly created time_in_shelter column.
in_out['time_in_shelter'].describe()
Interestingly, the minimum time an animal was in the shelter is $-1,607$ days, which is obviously not a real value. A negative number of days in the shelter would imply that some intake datetimes are more recent than their respective outcome datetimes. To investigate a little further, we can filter the dataset for intake datetimes that are more recent than their outcomes.
in_later = in_out[['intake_datetime', 'outcome_datetime']][in_out['intake_datetime'] > in_out['outcome_datetime']]
print(len(in_later))
in_later.head()
The length of the filtered data indicates there are just over 500 records with a more recent intake than outcome. As there are just under 80,000 rows in the entire dataset, these incorrect datetimes are likely simple data-entry errors. Assuming the intake and outcome dates were switched at entry, we can correct them by swapping the values wherever the intake is more recent than the outcome.
in_out['intake_datetime'], in_out['outcome_datetime'] = np.where(in_out['intake_datetime'] > in_out['outcome_datetime'],
[in_out['outcome_datetime'], in_out['intake_datetime']],
[in_out['intake_datetime'], in_out['outcome_datetime']])
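The swap pattern is easier to see on a toy frame: np.where broadcasts the condition across the stacked pair of columns, so both assignments happen in one vectorized step. Illustrative dates only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'intake': pd.to_datetime(['2017-01-05', '2017-03-01']),
                   'outcome': pd.to_datetime(['2017-01-01', '2017-03-10'])})

# Row 0 has an intake after its outcome, so its values are swapped;
# row 1 is already in order and is left untouched.
df['intake'], df['outcome'] = np.where(df['intake'] > df['outcome'],
                                       [df['outcome'], df['intake']],
                                       [df['intake'], df['outcome']])
print(df)
```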
We can now update the time_in_shelter column using the corrected dates and print the descriptive statistics table to see if the distribution is more in line with reality.
in_out['time_in_shelter'] = pd.to_datetime(in_out['outcome_datetime']) - pd.to_datetime(in_out['intake_datetime'])
in_out['time_in_shelter'].describe()
Much better! We see the mean number of days from intake to outcome is just over two weeks, and about 50% of animals entering the shelter remain there for less than five days. The longest stay in the shelter is 1,606 days, or just under 4.5 years. To double-check that the intake and outcome dates are adjusted, we can filter the dataset again as before; if no intake dates are more recent than their outcome, the returned DataFrame should be empty.
in_out[['intake_datetime', 'outcome_datetime']][in_out['intake_datetime'] > in_out['outcome_datetime']]
The filtered DataFrame is indeed empty! Now that we have merged the intakes and outcomes datasets and cleaned and transformed the data, we can proceed to perform exploratory data analysis! Before moving to the next step, let's quickly remove some of the no longer needed columns left over from the merge and save the data to a CSV.
del in_out['Unnamed: 0']
del in_out['monthyear']
del in_out['age_upon_outcome_Periods']
del in_out['age_upon_outcome_Period Range']
del in_out['datetime']
in_out.to_csv('../../../datasets/austin animal center intake outcomes/aac_intakes_outcomes.csv',
encoding='utf-8', index=False)
Analyzing animals' journey from intake to outcome
Now we get to the meat of the analysis! The exploratory data analysis we perform here will be instrumental in later steps when constructing prediction models, as we will already have an understanding of which variables appear crucial in determining an animal's length of stay at the shelter or its outcome type. Before diving in, we quickly set some default styles for the seaborn plots using set.
sns.set(style="white", context="talk", font_scale=1.2)
To start, let's get a sense of the distribution of animal types that are included in the data using value_counts.
in_out['animal_type'].value_counts()
As one might reasonably assume, the majority of animals taken into the shelter are cats and dogs. Unfortunately, there is no information on what the Other category entails, while birds and livestock are taken into the shelter rather rarely by comparison. There are only nine livestock records, which is too few for our analysis and not of much interest, so we filter out these records with .loc.
in_out = in_out.loc[in_out['animal_type'] != 'Livestock']
Let's investigate the mean duration an animal stays at the shelter until its outcome, based on the animal's age group and type. Before plotting, we create a new column that is the float representation of the time_in_shelter column, which has a timedelta data type. This conversion is done by accessing the total_seconds() method of the dt accessor and dividing by the number of seconds in one day.
in_out['time_in_shelter'] = pd.to_timedelta(in_out['time_in_shelter'])
in_out['time_in_shelter_days'] = in_out['time_in_shelter'].dt.total_seconds() / (24 * 60 * 60)
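The conversion can be sanity-checked on a hand-picked value; one and a half days should come out as exactly 1.5:

```python
import pandas as pd

# One and a half days, expressed as a timedelta.
stay = pd.to_timedelta(pd.Series(['1 days 12:00:00']))

# 129,600 seconds divided by 86,400 seconds per day gives 1.5 days.
days = stay.dt.total_seconds() / (24 * 60 * 60)
print(days[0])  # 1.5
```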
animal_shelter_time = in_out.groupby(['animal_type',
'age_upon_intake_age_group'])['time_in_shelter_days'].mean().reset_index()
bar = sns.barplot(x='age_upon_intake_age_group', y='time_in_shelter_days',
hue='animal_type', data=animal_shelter_time, order=['(-0.025, 2.5]', '(2.5, 5.0]', '(5.0, 7.5]',
'(7.5, 10.0]', '(10.0, 12.5]', '(12.5, 15.0]',
'(15.0, 17.5]', '(17.5, 20.0]', '(20.0, 22.5]',
'(22.5, 25.0]'])
bar.set_title('Mean Time in Shelter', fontsize=32)
bar.set_ylabel('Mean Days in Shelter', fontsize=28)
bar.set_xlabel('Age Group At Intake', fontsize=28)
bar.set_xticklabels(bar.get_xticklabels(), rotation=90)
bar.xaxis.labelpad = 10
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.tight_layout()
sns.despine(bottom=True)
Cats stay in the shelter longer than any other animal, and their length of stay increases with age up until about 12.5 years old. Interestingly, the time in the shelter for dogs appears relatively flat regardless of age. Anecdotally, I have seen older cats remain at local shelters longer on average, so these results aren't too surprising in that regard. It is unfortunate, however: since cats are prolific breeders, and in an environment such as Austin likely breed year-round, kittens are probably coming into the shelter at a much higher rate than older cats. As kittens are generally adopted much quicker than adult cats, older cats can end up staying in a shelter environment longer than younger cats. The data seems to confirm this, as the majority of cat intakes (and intakes overall) are within 2.5 years of age.
print('Overall:')
print(in_out['age_upon_intake_age_group'].value_counts())
print('Cats:')
print(in_out.loc[in_out['animal_type'] == 'Cat']['age_upon_intake_age_group'].value_counts())
By combining the intake and outcome types, we can see the count of overall intakes and their resulting outcomes. As we saw in the separate intake and outcome analyses, stray animal intakes and adoption or transfer outcomes represent the vast majority of cases, so combinations of these intakes and outcomes will likely make up most of the data. The following barplot depicts the top 10 intake-to-outcome paths by total count.
in_out['Intake_Outcome'] = in_out['intake_type'] + ' -> ' + in_out['outcome_type']
intake_outcomes = in_out['Intake_Outcome'].value_counts()[0:10].plot(kind='bar',
fontsize=16,
figsize=(15, 10))
intake_outcomes.set_ylabel('Count', fontsize=22)
intake_outcomes.set_xlabel('Intake to Outcome', fontsize=22)
intake_outcomes.set_title('Intakes to Outcomes', fontsize=26)
plt.tight_layout()
sns.despine(bottom=True)
As we suspected, stray intakes to adoption or transfer outcomes are the majority of cases in the data, followed by lost animals being returned to their owners (Stray -> Return to Owner) and pets being surrendered by their owners and later adopted (Owner Surrender -> Adoption).
Judging from the above plot, the first few intake-to-outcome paths appear to represent the majority of overall cases, but we can quickly find the cumulative proportions of the paths with a quick and dirty loop.
for i in range(1, 11):
    print(in_out['Intake_Outcome'].value_counts()[0:i].sum() / in_out['Intake_Outcome'].value_counts().sum())
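The loop can also be replaced with a vectorized one-liner: value_counts accepts normalize=True, and the running total is just a cumsum. A sketch on a toy series; on the real column this would be in_out['Intake_Outcome'].value_counts(normalize=True).cumsum():

```python
import pandas as pd

# Toy stand-in for the Intake_Outcome column.
paths = pd.Series(['Stray -> Adoption', 'Stray -> Adoption',
                   'Stray -> Transfer', 'Owner Surrender -> Adoption'])

# normalize=True gives each path's proportion of the total;
# cumsum turns those into cumulative proportions, ending at 1.0.
cumulative = paths.value_counts(normalize=True).cumsum()
print(cumulative)
```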
The reported cumulative proportions show that the ten most common intake-to-outcome paths represent about 95% of all cases. Stray intakes to adoptions or transfers alone make up just over half of all animal records. Now that we know more about the volume and types of intakes to outcomes at the shelter, we can turn our attention to which intakes stay in the shelter the longest and their common outcomes. The following groups the intake-to-outcome paths by the mean number of days in the shelter.
intake_outcome_days = in_out.groupby('Intake_Outcome')['time_in_shelter_days'].mean().sort_values(ascending=False).plot(kind='bar',
fontsize=16,
figsize=(18, 10))
intake_outcome_days.set_ylabel('Mean Days in Shelter', fontsize=26)
intake_outcome_days.set_xlabel('Intake to Outcome', fontsize=26)
intake_outcome_days.set_title('Intakes to Outcomes', fontsize=30)
plt.tight_layout()
sns.despine(bottom=True)
Missing outcome cases and euthanasia requests that end in adoption appear to take the longest to resolve on average. Stray and wildlife intakes, as well as owner surrenders that end up in adoptions, take approximately one month. As shelters are often short on space and resources, knowing the average amount of time an animal may be in the shelter before its outcome is valuable information for keeping overflow under control.
In the outcomes analysis, we noted that an intact animal (not spayed or neutered) would not be adopted before being altered, to help keep the animal population down. We assumed that transfer outcomes represented intact intakes being transferred to a different facility to be spayed or neutered. Now that the combination of the two datasets gives us both the intake sex and outcome sex, we can see whether this assumption was accurate. The following plots the top 15 intake-to-outcome paths by overall volume, which we saw earlier make up over 95% of all cases, by the animal's intact status upon intake and outcome.
in_out['intact_intake_outcome'] = in_out['sex_upon_intake'] + ' -> ' + in_out['sex_upon_outcome']
in_out['intact_intake_outcome'] = in_out['intact_intake_outcome'].str.replace('Neutered|Spayed', 'Altered', regex=True).str.strip()
in_out['intact_intake_outcome'] = in_out['intact_intake_outcome'].str.replace('Male|Female', '', regex=True).str.strip()
top_outcomes = in_out.loc[in_out['Intake_Outcome'].isin(in_out['Intake_Outcome'].value_counts()[0:15].index)]
top_outcomes_sex = top_outcomes.groupby(['intact_intake_outcome',
'Intake_Outcome'])['count'].sum().reset_index()
fact = sns.factorplot(x='Intake_Outcome', y='count', col='intact_intake_outcome',
                      kind='bar', sharey=False, col_wrap=3, size=6, data=top_outcomes_sex, ci=None)
fact.set_xticklabels(rotation=90)
fact.set_xlabels('Outcome')
plt.tight_layout()
sns.despine(bottom=True)
First, it appears a few records may be mislabeled: the intake status is altered but the outcome is intact, which would imply the spay or neuter operation was somehow reversed. Fortunately, this mislabeling appears in only four cases. As the Intact -> Altered bar plot shows, by far the most common intake to outcome is a stray intake to adoption, meaning the animal is spayed or neutered without being transferred. This indicates our original assumption, that intact animals are transferred to partner facilities before being adopted, was incorrect, and that spays and neuters do occur on-site before adoption. The majority of altered-animal intake-to-outcome paths are either owners surrendering their pets to the shelter or the shelter returning a missing pet to its owner, followed by already-altered strays being adopted.
Next, let's examine how an animal's condition at intake relates to its eventual outcome by combining the intake condition and outcome type.
in_out['intake_cond_outcome'] = in_out['intake_condition'] + ' -> ' + in_out['outcome_type']
intake_cond_out = in_out.groupby(['intake_condition', 'outcome_type'])['count'].sum().reset_index()
g = sns.factorplot(x='outcome_type', y='count', col='intake_condition',
col_wrap=3, kind='bar', size=6, sharey=False, data=intake_cond_out)
g.set_xticklabels(rotation=90)
g.set_xlabels('Outcome')
plt.tight_layout()
sns.despine(bottom=True)
Feral and nursing animals are mostly transferred to other facilities, while pregnant animals are likely to be adopted or transferred. The shelter would likely prefer to adopt a pregnant pet to an owner who is willing to care for the resulting offspring rather than further burden any of the shelter's partner facilities. About 50% of sick animals entering the shelter needed to be euthanized; the remaining 50% were either transferred or adopted. It is possible an animal will still need to be euthanized after being moved, but hopefully the partner facility has the resources to administer more effective care. Injured animals are also often euthanized, suggesting the injury was severe and euthanasia was, unfortunately, the best course of action. However, injured animals are also adopted or returned to their owners at a seemingly higher proportion than sick animals, implying the animal has recovered or will recover from its injury.
Does the condition of the animal upon intake help determine how long an animal may need to stay at the shelter? One would assume that sick or injured animals may need to stay longer to convalesce and that older animals may remain in the shelter for more days on average before being adopted.
intake_cond_out_time = in_out.groupby(['intake_cond_outcome'])['time_in_shelter_days'].mean().sort_values(ascending=False).plot(kind='bar',
figsize=(20, 10),
fontsize=20)
intake_cond_out_time.set_ylabel('Mean Days in Shelter', fontsize=26)
intake_cond_out_time.set_xlabel('Intake Condition to Outcome', fontsize=26)
intake_cond_out_time.set_title('Mean Days in Shelter by Intake Condition', fontsize=30)
plt.tight_layout()
sns.despine(bottom=True)
Most non-normal intake conditions that eventually lead to an adoption appear to take the most days on average, ranging from about 50 to 60 days. Animals that go missing from the shelter also seem to take the longest to resolve on average. Although it's unclear how an animal would go missing from the shelter, the length of time between the intake and the missing outcome is likely how long the staff keeps searching for the animal before it receives the 'Missing' outcome categorization.
Conclusion
By combining the Austin Animal Center's intakes and outcomes datasets, along with some cleaning and feature engineering, we were able to investigate the complete process from an animal's intake to its outcome. Exploratory data analysis on the merged and transformed dataset revealed several interesting pieces of information that give us a better sense of what to expect when an animal arrives at the shelter, in terms of its length of stay and possible outcomes. This will be of critical value when building and validating prediction models, as we now have prior knowledge of the factors that appear important in predicting an animal's length of stay in the shelter or its outcome.
in_out = in_out.loc[:,:'time_in_shelter_days']
in_out.to_csv('../../../datasets/austin animal center intake outcomes/aac_intakes_outcomes.csv',
encoding='utf-8', index=False)