Extraction and Feature Engineering of Animal Austin Center's Shelter Outcomes Dataset using Requests and Pandas

The Austin Animal Center is the largest no-kill animal shelter and shelters and protects over 18,000 animals each year. As part of the City of Austin's Open Data Initiative, the Center makes available their data detailing shelter pet intake and outcomes. According to the data portal, over 90% of animal outcomes are adoptions, transfers to other shelter partners or returning lost pets to owners.

Given the data made available by the AAC, is there a predictable pattern or visible trend to shelter pet outcomes? By performing exploratory data analysis we can unveil potential patterns and trends to help inform what features and terms are included in our final prediction model.

This notebook will walkthrough extracting the AAC's shelter pet outcome dataset using the requests library. Once the data is extracted, we will clean the data and enrich the data by leveraging feature engineering to create new and potentially more useful indicators of shelter animal outcomes.

The libraries used to perform the exploratory data analysis will be primarily the ubiquitous pandas for loading and manipulating the extracted data, and seaborn for visualizing the data. The packages are imported below; we also call the %matplotlib inline function for printing the visualizations.

In [1]:
%matplotlib inline
In [2]:
import requests
import pandas as pd
import numpy as np
from urllib.error import HTTPError
import matplotlib.pyplot as plt

Obtain Shelter Pet Data from the Austin Open Data Portal

The first step in analyzing the data is obtaining the data! The shelter outcome data can be extracted in a variety of ways, including going to the Austin Animal Center Outcomes webpage and downloading the data directly to standard formats, but that's not as fun as extracting the data programmatically with Python =p.

Clicking the API button in the top-right corner of the AAC webpage reveals the API endpoint that will be called to get the data. There are several export options, including CSV and JSON.

The API is powered by Socrata, which has a call limit of 1,000. Therefore, we need to page through the results and get the results 1,000 at a time, as noted by the Socrata API paging documentation until the total amount of records is reached. The dataset information states there are 77.9k total rows in the dataset.

We set the endpoint, the number of records to return on each call, and the number of pages we want to page through. Although the data states there are just over only 78k rows in the data, the pages variable is set to 100 anyway, and we break the call loop once there are no results to return.

In [3]:
endpoint = 'https://data.austintexas.gov/resource/9t4d-g238.json'
count = 1000
pages = 100

We extract the data using the requests library and append the results to a list. Once the call returns a 404 error, the loop stops and the results are returned.

Note: This approach is not optimal, a likely better approach would be to use a while statement and check the HTTP status code on each loop until it returns a 404; however, this quick and dirty loop works fine for our purposes. Once the extraction of the data is complete, the length of the list is returned.

In [4]:
results = []
params = {'$limit': count, '$offset': 0}

for i in range(0, pages):
    
    try:
        r = requests.get(endpoint, params=params)
        results.append(r.json())
        params['$offset'] = count
        count += 1000
        
    except HTTPError as err:
        if err.response.status_code == '404':
            break
        else:
            print(err.response.status_code)

len(results)
Out[4]:
100

We convert the list of JSON results into a pandas DataFrame by using the very convenient json_normalize function. Iterate through the list of results and convert the returned JSON into a DataFrame and append to our initialized, empty DataFrame outcome_df.

In [5]:
outcome_df = pd.DataFrame()

for i in results:
    outcome_df = outcome_df.append(pd.io.json.json_normalize(i))

len(outcome_df)
Out[5]:
78256

The DataFrame contains x rows of data, just as the data webpage stated.

As we are only interested in the outcomes of cats for this particular task, filter the animal_type column and reset the index.

In [6]:
cats_df = outcome_df[outcome_df['animal_type'] == 'Cat'].reset_index()
del cats_df['index']

Print the number of cat shelter outcomes as well as the first 5 rows of the filtered dataset.

In [7]:
print(len(cats_df))
cats_df.head()
29422
Out[7]:
age_upon_outcome animal_id animal_type breed color date_of_birth datetime monthyear name outcome_subtype outcome_type sex_upon_outcome
0 2 weeks A684346 Cat Domestic Shorthair Mix Orange Tabby 2014-07-07T00:00:00 2014-07-22T16:04:00 2014-07-22T16:04:00 NaN Partner Transfer Intact Male
1 1 month A685067 Cat Domestic Shorthair Mix Blue Tabby/White 2014-06-16T00:00:00 2014-08-14T18:45:00 2014-08-14T18:45:00 Lucy NaN Adoption Intact Female
2 3 months A678580 Cat Domestic Shorthair Mix White/Black 2014-03-26T00:00:00 2014-06-29T17:45:00 2014-06-29T17:45:00 *Frida Offsite Adoption Spayed Female
3 1 year A675405 Cat Domestic Medium Hair Mix Black/White 2013-03-27T00:00:00 2014-03-28T14:55:00 2014-03-28T14:55:00 Stella Luna NaN Return to Owner Spayed Female
4 3 weeks A670420 Cat Domestic Shorthair Mix Black/White 2013-12-16T00:00:00 2014-01-09T19:29:00 2014-01-09T19:29:00 NaN Partner Transfer Intact Male

We successfully extracted the Austin Animal Center's shelter outcome dataset and have already started manipulating the data to our needs! We now turn our attention to the process of feature engineering, an often manual but extremely valuable step in creating models. Before beginning; however, it can be worthwhile to save the data we extracted to a file in case we need to return to it. There are many options for saving a pandas DataFrame, as noted by the pandas documentation.

In [8]:
outcome_df.to_csv('../../data/aac_outcomes_working.csv', index=False, encoding='utf-8')
cats_df.to_csv('../../data/cats_working.csv', index=False, encoding='utf-8')

Feature Engineering

Feature engineering, although often laborious and potentially time-consuming, can yield significantly better results in the modeling phase as the data is richer with information machine learning algorithms can leverage. The step of feature engineering is another phase where a data scientist or analyst's domain knowledge about the particular data being analyzed can enhance and play a vital role in the success of the analysis and model of the problem.

In the end, and possibly for the foreseeable future, machine learning is at the core pattern recognition and statistics with computers, and cannot replace the specialized knowledge of a skilled analyst.

Each row represents one outcome. Therefore we can start by adding a simple count column to denote the number of observations in each group.

In [9]:
cats_df['count'] = 1

The sex_upon_outcome column denotes the cat's gender and if it was spayed or neutered upon the shelter outcome. This variable can be split into two new features, sex and Spay/Neuter. These features can be added with the np.where() function, which is essentially an if statement that works with pandas Series.

In [10]:
cats_df['sex'] = np.where(cats_df['sex_upon_outcome'].str.contains('Male'), 'Male', 'Female')
cats_df['Spay/Neuter'] = np.where(cats_df['sex_upon_outcome'].str.contains('Intact'), 'No', 'Yes')

The cat's age upon outcome is given in not a very data analysis-friendly format. Let's inspect the first 5 rows of the age_upon_outcome column to see what we're dealing with:

In [11]:
cats_df['age_upon_outcome'].head()
Out[11]:
0     2 weeks
1     1 month
2    3 months
3      1 year
4     3 weeks
Name: age_upon_outcome, dtype: object

The cat's age is approximated with a numeric value and a date interval such as 'weeks', 'months', and 'years', therefore we will only be able to get an approximation of the cat's age, which is typically the best achievable regarding of accuracy in the shelter environment.

We will transform the column into a day value that can be converted into months and years as needed. Start by splitting the age_upon_outcome column using the .str.split method. Note we could also use the expand=True parameter to return a DataFrame of the split columns.

In [12]:
cats_df = cats_df[cats_df['age_upon_outcome'] != 'NULL']
cats_df['Periods'], cats_df['Period Range'] = cats_df['age_upon_outcome'].str.split(' ').str[0].fillna(0).astype(int), cats_df['age_upon_outcome'].str.split(' ').str[1].fillna(0)
In [13]:
cats_df['Period Range'].unique()
Out[13]:
array(['weeks', 'month', 'months', 'year', 'years', 'days', 'week', 'day'], dtype=object)

Taking advantage of the .unique() Series method, we can find the date intervals we need to account for during the conversion.

With the period lengths, we can approximate the age of the cat upon the outcome in days by multiplying the number of periods by the appropriate number of days. This is done using the np.where() function. Once the approximate age in days is obtained, we also determine the age in years by dividing the days by 365.

In [14]:
cats_df['Period Range'] = np.where(cats_df['Period Range'].str.contains('day'), 1, 
                                   np.where(cats_df['Period Range'].str.contains('week'), 7, 
                                            np.where(cats_df['Period Range'].str.contains('month'), 30, 
                                                     np.where(cats_df['Period Range'].str.contains('year'), 365, 0)))).astype(int)

cats_df['outcome_age_(days)'] = cats_df['Period Range'] * cats_df['Periods']
cats_df['outcome_age_(years)'] = cats_df['outcome_age_(days)'] / 365

In the shelter environment, kittens, defined as cats of approximately six months of age or less, are often adopted much more quickly than adult cats. Kittens can also be transferred to other area organizations for treatments and to get spayed or neutered if the shelter does not have the appropriate facilities or staff on hand to perform the procedure. Therefore, it makes sense to create a few more features that designate whether a cat was of adult or kitten age upon outcome.

In [15]:
cats_df['Cat/Kitten (outcome)'] = np.where(cats_df['outcome_age_(days)'] >= 180, 'Cat', 'Kitten')
cats_df['sex_age_outcome'] = cats_df['sex_upon_outcome'] + ' ' + cats_df['Cat/Kitten (outcome)']
In [16]:
cats_df['age_group'] = pd.cut(cats_df['outcome_age_(years)'], 10)
cats_df['age_group'].value_counts()
Out[16]:
(-0.022, 2.2]    25107
(2.2, 4.4]        1554
(4.4, 6.6]         901
(6.6, 8.8]         724
(8.8, 11.0]        599
(11.0, 13.2]       297
(13.2, 15.4]       151
(15.4, 17.6]        58
(17.6, 19.8]        16
(19.8, 22.0]        14
Name: age_group, dtype: int64

Next, we want to convert the date columns into formats that are workable with pandas and for future data analysis. Thus, we apply the handy pd.to_datetime() function. The month and year the cat was born and the time of outcome is then extracted and stored in new columns.

In [17]:
cats_df['date_of_birth'] = pd.to_datetime(cats_df['date_of_birth'])

cats_df['dob_year'] = cats_df['date_of_birth'].dt.year
cats_df['dob_month'] = cats_df['date_of_birth'].dt.month
cats_df['dob_monthyear'] = pd.to_datetime(cats_df['monthyear']).dt.to_period('M')

cats_df['datetime'] = pd.to_datetime(cats_df['datetime'])
cats_df['outcome_month'] = cats_df['datetime'].dt.month
cats_df['outcome_year'] = cats_df['datetime'].dt.year
cats_df['outcome_weekday'] = cats_df['datetime'].dt.weekday_name
cats_df['outcome_hour'] = cats_df['datetime'].dt.hour

There are far fewer recognized cat breeds than dog breeds, as dogs have been selected for desired traits for millennia, while the cat has more or less stayed the same as its role in human society during the same time was essentially constant (keeping vermin out of food and grain stores). Only in the last century or so has there been more breeding for desired traits such as coat pattern and behavior. Also, unlike numerous dogs breeds, cat breeds are extremely difficult to determine without genetic testing and aren't officially recognized by the CFA (Cat Fancier's Association) without the proper paperwork. As a result, the vast majority of cats who arrive in shelters are of the Domestic breed, generally classified into three different hair lengths, 'short', 'medium', and 'long', which are defined as cats of mixed breed.

We can see the counts of cat breeds using the .value_counts() method.

In [18]:
cats_df['breed'].value_counts()
Out[18]:
Domestic Shorthair Mix                     23334
Domestic Medium Hair Mix                    2323
Domestic Longhair Mix                       1228
Siamese Mix                                  998
Domestic Shorthair                           386
American Shorthair Mix                       211
Snowshoe Mix                                 148
Domestic Medium Hair                         131
Maine Coon Mix                               106
Manx Mix                                      79
Russian Blue Mix                              65
Siamese                                       60
Domestic Longhair                             45
Himalayan Mix                                 32
Ragdoll Mix                                   22
Persian Mix                                   20
Siamese/Domestic Shorthair                    13
Bengal Mix                                    12
Angora Mix                                    12
American Curl Shorthair Mix                   11
Balinese Mix                                  10
Japanese Bobtail Mix                          10
Maine Coon                                    10
Persian                                        9
Tonkinese Mix                                  8
Snowshoe                                       8
Bombay Mix                                     7
British Shorthair Mix                          6
Himalayan                                      6
Cymric Mix                                     6
                                           ...  
Chartreux Mix                                  2
Domestic Shorthair/Siamese                     2
British Shorthair                              2
Domestic Longhair/Rex                          2
Devon Rex                                      2
Munchkin Longhair Mix                          2
Turkish Van Mix                                2
Bengal                                         2
Munchkin Shorthair Mix                         2
Bengal/Domestic Shorthair                      2
Domestic Medium Hair/Maine Coon                2
Manx/Domestic Longhair                         2
Snowshoe/Ragdoll                               2
Domestic Shorthair/Manx                        1
Manx/Siamese                                   1
Angora/Persian                                 1
Domestic Shorthair/Maine Coon                  1
Domestic Longhair/Russian Blue                 1
American Wirehair Mix                          1
Domestic Medium Hair/Manx                      1
Manx                                           1
Domestic Shorthair/Abyssinian                  1
Domestic Shorthair/British Shorthair           1
Snowshoe/Domestic Shorthair                    1
Havana Brown Mix                               1
Domestic Shorthair/Domestic Medium Hair        1
Ocicat Mix                                     1
Siamese/Angora                                 1
Domestic Longhair/Domestic Longhair            1
Domestic Shorthair/Domestic Shorthair          1
Name: breed, Length: 79, dtype: int64

As suspected, the Domestic breed is the most common by far. Let's clean up the breed names by removing 'mix' (we will assume the term 'mix' isn't relevant to the type of breed, which is certainly true for Domestic breeds) and trimming the values to remove any whitespace.

In [19]:
cats_df['breed'] = cats_df['breed'].str.lower().str.replace('mix', '').str.replace('medium hair', 'mediumhair').str.strip()

The Cat Fancier's Association currently recognizes 42 cat breeds. Let's create two new features that distinguish the cat as a breed recognized by the CFA or a domestic breed. As we saw, most of the cats will be classified into the domestic category, but we will be able to analyze if a cats breed (or breed type) has any impact on its outcome.

We create a list of all the CFA breeds and then convert the list to lowercase and join all the elements with a bar.

In [20]:
cfa_breeds = ['Abyssinian', 'American Bobtail', 'American Curl', 'American Shorthair', 'American Wirehair', 
              'Balinese', 'Bengal', 'Birman', 'Bombay', 'British Shorthair', 'Burmese', 'Burmilla', 
              'Chartreux', 'Colorpoint Shorthair', 'Cornish Rex', 'Devon Rex', 'Egyptian Mau', 'European Burmese', 
              'Exotic', 'Havana Brown', 'Japanese Bobtail', 'Korat', 'LaPerm', 'Maine Coon', 'Manx', 
              'Norwegian Forest Cat', 'Ocicat', 'Oriental', 'Persian', 'Ragamuffin', 'Ragdoll', 'Russian Blue', 
              'Scottish Fold', 'Selkirk Rex', 'Siamese', 'Siberian', 'Singapura', 'Somali', 'Tonkinese', 
              'Turkish Angora', 'Turkish Van']

cfa_breeds = [i.lower() for i in cfa_breeds]
cfa_breeds = '|'.join(cfa_breeds)

Some of the records contain two breeds, which we need to split to correctly categorize the cats as a domestic or pedigree breed. This is done using the .str.split() method with expand = True in combination with the pd.concat() function to join the resulting DataFrame and our original dataset.

The cat record is then categorized based on the resulting two breed columns. We categorize the cat as a CFA recognized breed if one of the breed columns contains a breed in the list above (we assume that records with a Domestic/CFA Breed or vice-versa is a CFA breed).

In [21]:
cats_df = pd.concat([cats_df, 
                     cats_df['breed'].str.split('/', expand=True).rename(columns={0:'breed1', 1:'breed2'})], 
                    axis=1)

cats_df['cfa_breed'] = np.where((cats_df['breed1'].str.contains(cfa_breeds)) | 
                                (cats_df['breed2'].str.contains(cfa_breeds)), 
                                True, False)

cats_df['domestic_breed'] = np.where((cats_df['breed1'].str.contains(cfa_breeds) == False) | 
                                     (cats_df['breed2'].str.contains(cfa_breeds) == False), 
                                     True, False)

The last given feature that may be of interest in our analysis is the color of the cat. As before, we can find the counts of each color using the value_counts() function.

In [22]:
cats_df['color'].value_counts()
Out[22]:
Brown Tabby                  4445
Black                        3847
Black/White                  2804
Brown Tabby/White            2338
Orange Tabby                 2180
Tortie                       1335
Calico                       1318
Blue Tabby                   1130
Orange Tabby/White           1095
Blue                         1058
Torbie                        845
Blue/White                    792
Blue Tabby/White              545
White/Black                   508
Cream Tabby                   498
Lynx Point                    438
White                         334
Seal Point                    321
White/Brown Tabby             287
Cream Tabby/White             236
Flame Point                   200
White/Blue                    196
Black Tabby                   177
Torbie/White                  175
Gray Tabby                    132
Black Smoke                   105
White/Orange Tabby            101
Lilac Point                    77
Tortie Point                   75
Gray                           74
                             ... 
Torbie/Calico                   1
Blue/Calico                     1
Brown/Buff                      1
Tortie/Black Smoke              1
Lilac Point/Cream               1
Agouti/Brown Tabby              1
Blue/Blue                       1
Blue/Brown                      1
Agouti/White                    1
Brown Brindle                   1
Orange Tiger                    1
Fawn                            1
Lynx Point/Blue                 1
Tricolor/Calico                 1
Blue Smoke/Gray                 1
Yellow                          1
Brown Tabby/Black Brindle       1
Brown Tiger/White               1
Calico/Orange                   1
Blue Point/Cream                1
Brown Merle/Brown Tabby         1
Lynx Point/Tan                  1
Brown Tabby/Gray Tabby          1
Blue/Cream                      1
Orange Tabby/Black              1
Tortie/Tortie                   1
White/Calico Point              1
Brown Merle                     1
Tortie Point/Blue               1
White/Lilac Point               1
Name: color, Length: 231, dtype: int64

We see there is quite a variety of different colors and similar to the listed breeds, there are at times two given colors. If we want to use this information in our visualizations and model, we need to clean up the data with some more feature engineering!

Since we are dealing with not just coat colors but coat patterns, such as calico, tabby, colorpoint, and others, it would make sense to break up the data into more useful information based on the given color and if a coat pattern is given. Luckily, there aren't too many coat patterns to deal with, and we can list them as with the breeds. An excellent resource to see what each coat pattern looks like can be found at the University of Florida's Shelter Medicine website.

In [23]:
coat_patterns = ['tabby', 'tortie', 'tortoiseshell', 'calico', 'smoke', 'torbie',
                 'colorpoint', 'point', 'brindle', 'agouti', 'bicolor', 'tricolor']

We want to extract the given coat pattern and color information, one way of which is using a regex string in conjunction with pandas' str.extract.

In [24]:
# Create the regex string by joining the coat patterns with | and encapsulating them in parantheses
coat_reg = '(' + '|'.join(coat_patterns) + ')'

The color column is converted to lower case, and if a coat pattern is given that matches one of the values in our list, it is extracted into a new column, coat pattern. We also remove the coat pattern from the color column, so we don't create duplicate values.

In [25]:
cats_df['color'] = cats_df['color'].str.lower()

cats_df['coat_pattern'] = cats_df['color'].str.extract(coat_reg, expand=False)

cats_df['color'] = cats_df['color'].str.replace(coat_reg, '').str.strip()

The transformed color column is then split into a two column DataFrame using the .str.split() method with the parameter expand set to True. This operation will create some empty values in the first color column, color1, as we removed some information from the original column earlier. These blank values would then be present only when a breed specific color, such as tabby, torbie or tortoiseshell, was given; therefore, we search for any rows in the color1 column that are blank and change them to 'Breed Specific'. The extracted coat colors and patterns are then combined into a new column, coat.

In [26]:
cats_df = pd.concat([cats_df, 
                     cats_df['color'].str.split('/', expand=True).rename(columns={0:'color1', 1:'color2'})], axis=1)

cats_df['color1'] = np.where(cats_df['color1'] == '', 'Breed Specific', cats_df['color1'])

cats_df['coat'] = np.where(cats_df['color1'] == 'Breed Specific', cats_df['coat_pattern'], cats_df['color1'])

Conclusion

That wraps up the feature engineering portion of our data analysis task! Although add more features to the data can be somewhat tedious, it can result in a much-improved model for prediction of outcomes. The process of feature engineering also allows the individual analyst or data scientist working with the data to apply their domain knowledge to the data to enrich its information.

As a final step, we save the new data with the added features into a csv that we will use in the following steps of analyzing the shelter outcome data.

In [27]:
cats_df.to_csv('../../data/cats_df_clean.csv', index=False, encoding='utf-8')

The next step in the analysis is to visualize the data and hopefully generate any hypotheses and features that appear to have a significant impact on the outcome of the cat.

References

Related Posts