Analyzing Adoptable Pet Descriptions from Petfinder with IBM Watson Part One

Many animals listed on Petfinder also have a description written by the shelter that provides further details about the pet. These descriptions help increase interest among potential adopters by establishing a more personal connection to the animal beyond cute pictures alone (though I can never get enough of cute cat pictures).

Do these descriptions vary in tone depending on the type of animal, its age, or other attributes? Combining several Python libraries (petpy, textacy, and pandas) with the IBM Watson Tone Analyzer API, we will take the first step toward answering these questions by extracting, cleaning, and transforming adoptable pet data and descriptions from the Petfinder API.

Getting Started

Before diving in, import the libraries that we will use throughout the analysis.

In [42]:
from petpy import Petfinder
import os
import textacy
from textacy import preprocessing
import html
import numpy as np
import pandas as pd
from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

Obtaining the Needed API Keys from Petfinder and IBM Watson

To receive a Petfinder API key, an account must be created on Petfinder's developer page. After creating an account, Petfinder will generate an API key and a secret key that are used to create an authenticated connection to the API.

The steps to getting started with IBM Watson are a bit more involved. First, create an account by clicking the 'Get Started for Free' button on the Watson Tone Analyzer home page. A Lite plan offers 2,500 free API calls per month, which is plenty for our purposes. The next step is to create a service instance of the Tone Analyzer. The documentation on creating an account and a service instance has much more information if needed. After a service instance of the Tone Analyzer has been created, the required credentials to connect to the API programmatically will be provided.

The libraries for interacting with the Petfinder API and IBM Watson will need to be installed before authenticating with the obtained credentials for each respective API. Install the following libraries (if not already installed) to begin using the APIs; textacy, which we use later for text preprocessing, can be installed the same way.

pip install --upgrade petpy
pip install --upgrade ibm-watson
pip install --upgrade textacy

Authenticating the API Connections

Once the respective API credentials are obtained, we can authenticate our connections to the APIs. It is recommended to keep the API credentials out of the code itself to avoid exposing the keys to the public. One approach to securing the generated API keys is to store them as environment variables and load them using os.environ.

In [3]:
pf_key = os.environ['PETFINDER_KEY']
pf_secret_key = os.environ['PETFINDER_SECRET_KEY']

watson_tone_service_api_key = os.environ['WATSON_TONE_SERVICE_API_KEY']
watson_tone_service_url = os.environ['WATSON_TONE_SERVICE_URL']
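Note that os.environ raises a bare KeyError when a variable is not set. As an optional aside (the require_env helper below is made up for illustration and is not part of the original workflow), a small wrapper can fail with a clearer message:

def require_env(name):
    # Look up an environment variable, raising a descriptive error if unset.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError('Environment variable ' + name + ' is not set.')
    return value

pf_key = require_env('PETFINDER_KEY')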

Authenticating with the Petfinder API involves initializing the Petfinder class with the API key and secret key received from Petfinder. The connection to the IBM Watson Tone Analyzer API is made by initializing the ToneAnalyzerV3 class using the service URL and API key obtained previously. Another example of authenticating with the Tone Analyzer service (and other IBM services) can be found in the Python-SDK library on GitHub.

In [56]:
pf = Petfinder(pf_key, pf_secret_key)

authenticator = IAMAuthenticator(watson_tone_service_api_key)
tone_analyzer = ToneAnalyzerV3(authenticator=authenticator, 
                               version='2017-09-21')
tone_analyzer.set_service_url(watson_tone_service_url)
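Before moving on, it can be worth spending one of the free monthly calls on a quick smoke test to confirm the credentials work; a minimal sketch, using an arbitrary sample sentence:

# One-off test call; raises an ApiException if authentication failed.
test = tone_analyzer.tone('I am so excited to meet my new best friend!',
                          content_type='text/plain').get_result()
print(test['document_tone']['tones'])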

Now that our connections to the needed APIs have been authenticated, we can begin extracting the required data and analyzing the tones of adoptable pet descriptions!

Obtaining and Preparing the Data

Our strategy for preparing the pet adoption data is as follows:

  • Extract a sample of adoptable cat and dog descriptions from the Petfinder API.
  • Combine the extracted data and remove unnecessary data points.
  • Preprocess the provided animal descriptions to clean the text.
  • Obtain the tone scores of the cleaned pet descriptions using the IBM Watson Tone Analyzer.

Extracting Adoptable Cat and Dog Descriptions from Petfinder

As we only have 2,500 free calls to the Tone Analyzer API per month, we limit the number of results returned from the Petfinder database to 1,000 records each for cats and dogs. The animals method makes it easy to extract a sample of adoptable pet information from the database.

In [7]:
cats = pf.animals(animal_type='cat', status='adoptable', 
                  results_per_page=100, pages=10, return_df=True)
dogs = pf.animals(animal_type='dog', status='adoptable', 
                  results_per_page=100, pages=10, return_df=True)

The cat and dog data extracted from the Petfinder database are then concatenated using pd.concat. We also print the shape of the DataFrame to ensure the number of records is what we would expect. As we extracted 1,000 records each for cats and dogs, the total number of rows should be 2,000.

In [8]:
cat_dog = pd.concat([cats, dogs])
cat_dog.shape
Out[8]:
(2000, 44)

To avoid calling the Petfinder API more than necessary, and for reproducible results, it is good practice to save the raw extracted results from the API to a CSV file or another format before applying any transformations or data cleansing.

In [13]:
cat_dog.to_csv('../data/cat_dog.csv', index=False, encoding='utf-8')
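On subsequent runs, the saved file can be reloaded instead of calling the Petfinder API again, for example:

# Reload the raw extract for reproducible reruns without new API calls.
cat_dog = pd.read_csv('../data/cat_dog.csv', encoding='utf-8')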

We also want to make sure all the pet records have a description available for analysis. Therefore, we filter out any records with a missing description before proceeding. After removing the empty or missing descriptions, we print the new shape of the DataFrame to see how many records were removed.

In [19]:
cat_dog = cat_dog.loc[~pd.isnull(cat_dog['description'])]
cat_dog.shape
Out[19]:
(1325, 44)

Nearly 700 records did not have a description, so let's make sure we still have a reasonable number of each animal type.

In [24]:
print('Cats: ' + str(len(cat_dog.loc[cat_dog['animal_type'] == 'cat'])))
print('Dogs: ' + str(len(cat_dog.loc[cat_dog['animal_type'] == 'dog'])))
Cats: 674
Dogs: 651

Great! We still have a roughly equal number of adoptable cat and dog records. Although we lost a fair number of records after removing those without a description, we should still have a large enough sample of each type.
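As an aside, the same check can be written more compactly with pandas' value_counts:

# Count the records per animal type in one call.
cat_dog['animal_type'].value_counts()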

In [25]:
cat_dog.columns
Out[25]:
Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender',
       'size', 'coat', 'tags', 'name', 'description', 'photos', 'videos',
       'status', 'status_changed_at', 'published_at', 'distance',
       'breeds.primary', 'breeds.secondary', 'breeds.mixed', 'breeds.unknown',
       'colors.primary', 'colors.secondary', 'colors.tertiary',
       'attributes.spayed_neutered', 'attributes.house_trained',
       'attributes.declawed', 'attributes.special_needs',
       'attributes.shots_current', 'environment.children', 'environment.dogs',
       'environment.cats', 'contact.email', 'contact.phone',
       'contact.address.address1', 'contact.address.address2',
       'contact.address.city', 'contact.address.state',
       'contact.address.postcode', 'contact.address.country', 'animal_id',
       'animal_type', 'organization_id.1'],
      dtype='object')

Preprocessing the Pet Descriptions

Text preprocessing is a crucial step when analyzing text, especially text from the web. Fortunately, the Petfinder platform provides something of a standardized way to create pet descriptions, so the text we will be dealing with should be cleaner than text from other online sources (social media and forums, for example). We also want to alter the descriptions as little as possible to preserve their original tone and quality.

textacy is a wonderful library for processing and analyzing text that is built on top of the natural language library spaCy. The textacy library makes it much more straightforward to preprocess text with its preprocessing module. There are many text-processing functions available in textacy; however, we will only take advantage of normalize_whitespace to normalize the pet descriptions and hopefully get better results from the Watson Tone Analyzer. We also use the unescape function from Python's standard html library to convert any escaped HTML entities into their actual characters. The last line replaces any remaining HTML-escaped apostrophes (&#39;) with a real apostrophe.

These preprocessing functions are applied to each description using pandas' apply method. A new column, description_clean, stores the cleaned descriptions so that the originals are preserved just in case.

In [51]:
cat_dog['description_clean'] = cat_dog['description'].apply(
    lambda x: preprocessing.normalize_whitespace(str(x))).apply(
    lambda x: html.unescape(str(x))).apply(
    lambda x: x.replace('&#39;', "'"))
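To see what each step does, here is an illustrative before-and-after on a made-up description string:

raw = 'Sweetie is a sweet girl!\n\nShe&amp;#39;s great   with kids.'
step1 = preprocessing.normalize_whitespace(raw)  # collapse repeated whitespace
step2 = html.unescape(step1)                     # '&amp;#39;' becomes '&#39;'
clean = step2.replace('&#39;', "'")              # fix the leftover apostrophe
# clean == "Sweetie is a sweet girl!\nShe's great with kids."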

Analyzing the Tone of the Cat and Dog Descriptions

We can now use the cleaned description text as input for the Watson Tone Analyzer API. For each row in the DataFrame, we run the tone method of the ToneAnalyzerV3 class we initialized and set the content_type parameter to plain text and the sentences parameter to False as we are not interested in sentence-level tone analysis for this task. We iterate through the DataFrame rows using the iterrows() method.

As with most APIs, the return data type from the Tone Analyzer API is a JSON object. Therefore, we must coerce the result into a tabular structure to make it easier to analyze. Fortunately, pandas provides a convenient function, json_normalize, for normalizing structured JSON into a flat data table. We will leverage this function to coerce the Tone Analyzer JSON results into a pandas DataFrame and combine each pet's data into a new DataFrame.
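As an illustration, the document_tone portion of a response has roughly the following shape (the scores below are invented), and json_normalize flattens its tones list into one row per tone:

# A made-up response in the Tone Analyzer's documented format.
example = {'document_tone': {'tones': [
    {'score': 0.76, 'tone_id': 'joy', 'tone_name': 'Joy'},
    {'score': 0.83, 'tone_id': 'confident', 'tone_name': 'Confident'}]}}

pd.json_normalize(example['document_tone'], 'tones')
#    score    tone_id  tone_name
# 0   0.76        joy        Joy
# 1   0.83  confident  Confident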

In [81]:
tone_frames = []

for idx, row in cat_dog.iterrows():
    # Get the document-level tones for each cleaned description.
    tones = tone_analyzer.tone(row['description_clean'],
                               content_type='text/plain;charset=utf-8',
                               sentences=False).get_result()
    tones_df = pd.json_normalize(tones['document_tone'], 'tones')

    # Attach the pet's record to each of its tone rows.
    tones_df[cat_dog.columns.tolist()] = pd.DataFrame([row], index=tones_df.index)
    tone_frames.append(tones_df)

cat_dog_tones = pd.concat(tone_frames)

Once the iteration through the DataFrame is complete and we have combined the output of the Watson Tone Analyzer API with the original data, we inspect the first few rows of the resulting DataFrame to make sure the output is what we expect.

In [85]:
cat_dog_tones[['animal_type', 'tone_name', 'score', 'description_clean']].head()
Out[85]:
animal_type tone_name score description_clean
0 cat Joy 0.545500 Primary Color: Brown Tabby Weight: 12.19lbs Ag...
0 cat Joy 0.756990 Sweetie Pie is a great choice for any home, so...
1 cat Confident 0.828525 Sweetie Pie is a great choice for any home, so...
2 cat Analytical 0.596122 Sweetie Pie is a great choice for any home, so...
0 cat Tentative 0.715720 Jewel is a little shy at first. She would like...

We can see the Tone Analyzer fields tone_id, tone_name, and score have been appended to the pet data! As before, it is a good idea to save the results to a file to avoid calling any API more than necessary (and incurring additional fees).

In [86]:
cat_dog_tones.to_csv('../data/cat_dog_tones.csv', index=False, encoding='utf-8')

We have accomplished the first step of extracting and preparing the data for analysis! The tone scores returned by the Watson Tone Analyzer will allow us to investigate whether there are any significant differences in how shelters represent adoptable cats and dogs.

In the next post, we will analyze and visualize the compiled tone scores of the adoptable cat and dog descriptions to see if any such differences emerge.
