Many animals listed on Petfinder are also given a description by the shelter that provides further details and information on the pet. These descriptions help increase interest among potential adopters by establishing a more personal connection to the animal beyond just cute pictures (not that I can ever get enough of cute cat pictures).
Do these descriptions vary in tone depending on the type of animal, the animal's age, or other attributes? By combining several Python libraries (petpy, textacy, and pandas) with the IBM Watson Tone Analyzer API, we will take the first step toward answering these questions by cleaning and transforming the extracted data and adoptable pet descriptions from the Petfinder API.
Getting Started
Before diving in, import the libraries that we will use throughout the analysis.
from petpy import Petfinder
import os
import textacy
from textacy import preprocessing
import html
import numpy as np
import pandas as pd
from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
Receiving the Needed API Keys from Petfinder and IBM Watson
To receive a Petfinder API key, an account must be created on Petfinder's developer page. After creating an account, Petfinder will generate an API key and a secret key that will be used to create an authenticated connection to the API.
The steps to getting started with IBM Watson are a bit more involved. First, create an account by clicking the 'Get Started for Free' button on the Watson Tone Analyzer home page. A Lite plan offers 2,500 free API calls per month, which is plenty for our purposes. The next step is to create a service instance of the Tone Analyzer. The documentation on creating an account and a service instance has much more information if needed. After a service instance of the Tone Analyzer has been created, the required credentials to connect to the API programmatically will be provided.
The libraries for interacting with the Petfinder API and IBM Watson will need to be installed before authenticating with the obtained credentials for each respective API. Install the following libraries (if not already installed) to begin using the APIs.
pip install --upgrade petpy
pip install --upgrade ibm-watson
Authenticating the API Connections
Once the respective API credentials are obtained, we can authenticate our connections to the APIs. It is recommended to obfuscate the API credentials to avoid exposing the keys to the public. One approach to securing the generated API keys is to store them as environment variables and load them using os.environ.
pf_key = os.environ['PETFINDER_KEY']
pf_secret_key = os.environ['PETFINDER_SECRET_KEY']
watson_tone_service_api_key = os.environ['WATSON_TONE_SERVICE_API_KEY']
watson_tone_service_url = os.environ['WATSON_TONE_SERVICE_URL']
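Note that indexing os.environ with a missing key raises a bare KeyError. A small helper (hypothetical, not part of any library) can fail with a clearer message instead; the key value below is a placeholder purely for illustration:

```python
import os

def require_env(name):
    """Read a required credential from the environment, failing fast if unset."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set; "
                           f"export it before running the analysis.")
    return value

# Placeholder value for illustration only; use your real key in practice.
os.environ.setdefault('PETFINDER_KEY', 'demo-key')
print(require_env('PETFINDER_KEY'))  # demo-key
```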
Authenticating with the Petfinder API involves initializing the Petfinder class with the key and secret key received from Petfinder. The connection to the IBM Watson Tone Analyzer API is made by initializing the ToneAnalyzerV3 class with the service URL and API key obtained previously. Another example of authenticating with the Tone Analyzer service (and other IBM services) can be found in the Python SDK library on GitHub.
pf = Petfinder(pf_key, pf_secret_key)
authenticator = IAMAuthenticator(watson_tone_service_api_key)
tone_analyzer = ToneAnalyzerV3(authenticator=authenticator,
                               version='2017-09-21')
tone_analyzer.set_service_url(watson_tone_service_url)
Now that our connections to the needed APIs have been authenticated, we can begin extracting the required data and analyzing the tones of adoptable pet descriptions!
Obtaining and Preparing the Data
Our strategy for preparing the pet adoption data is as follows:
- Extract a sample of adoptable cat and dog descriptions from the Petfinder API.
- Combine the extracted data and remove unnecessary data points.
- Preprocess the provided animal descriptions to clean the text.
- Obtain the tone scores of the cleaned pet descriptions using the IBM Watson Tone Analyzer.
Extracting Adoptable Cat and Dog Descriptions from Petfinder
As we only have 2,500 free calls to the Tone Analyzer API, we limit the number of results returned from the Petfinder database to 1,000 cat and 1,000 dog records. The animals method makes it easy to extract a sample of adoptable pet information from the database.
cats = pf.animals(animal_type='cat', status='adoptable',
                  results_per_page=100, pages=10, return_df=True)
dogs = pf.animals(animal_type='dog', status='adoptable',
                  results_per_page=100, pages=10, return_df=True)
The cat and dog data extracted from the Petfinder database are then concatenated using pd.concat. We also print the shape of the DataFrame to ensure the number of records is what we would expect. As we extracted 1,000 cat and 1,000 dog records, the total number of rows should be 2,000.
cat_dog = pd.concat([cats, dogs])
cat_dog.shape
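As a toy illustration of the concatenation step (made-up data, not the actual Petfinder records):

```python
import pandas as pd

cats_demo = pd.DataFrame({'animal_type': ['cat'] * 3, 'name': ['Mia', 'Tux', 'Ash']})
dogs_demo = pd.DataFrame({'animal_type': ['dog'] * 2, 'name': ['Rex', 'Bo']})

# pd.concat stacks the frames row-wise; the row count is the sum of the inputs.
combined = pd.concat([cats_demo, dogs_demo], ignore_index=True)
print(combined.shape)  # (5, 2)
```

Note that pd.concat keeps each input's original index by default; ignore_index=True renumbers the rows of the combined frame.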
To avoid calling the Petfinder API more than necessary, and for reproducible results, it is a good idea to save the raw extracted results from the API to a CSV or another format before applying any transformations or data cleansing.
cat_dog.to_csv('../data/cat_dog.csv', index=False, encoding='utf-8')
We also want to make sure all the pet records have a description available for analysis. Therefore, we filter out any records with a missing description before proceeding. After removing the empty or missing descriptions, we print the new shape of the DataFrame to see how many records were removed.
cat_dog = cat_dog.loc[~pd.isnull(cat_dog['description'])]
cat_dog.shape
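The filtering pattern can be seen on a small made-up frame:

```python
import pandas as pd

pets = pd.DataFrame({'name': ['Mia', 'Rex', 'Bo'],
                     'description': ['Sweet lap cat', None, 'Loyal and playful']})

# Keep only the rows where the description is not missing.
with_desc = pets.loc[~pd.isnull(pets['description'])]
print(with_desc.shape)  # (2, 2)
```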
Nearly 700 records did not have a description, so let's make sure we still have a reasonable number of each animal type.
print('Cats: ' + str(len(cat_dog.loc[cat_dog['animal_type'] == 'cat'])))
print('Dogs: ' + str(len(cat_dog.loc[cat_dog['animal_type'] == 'dog'])))
Great! We still have a roughly equal number of adoptable cat and dog records. Although we lost a fair number of records by removing those without a description, we should still have a large enough sample of each type.
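An equivalent way to get the per-type counts in a single call is pandas' value_counts, shown here on toy data:

```python
import pandas as pd

cat_dog_demo = pd.DataFrame({'animal_type': ['cat', 'dog', 'cat', 'dog', 'dog']})

# value_counts tallies each unique value in the column.
counts = cat_dog_demo['animal_type'].value_counts()
print(counts['cat'], counts['dog'])  # 2 3
```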
cat_dog.columns
Preprocessing the Pet Descriptions
Text preprocessing is a crucial step when analyzing text, especially text from the web. Fortunately, the Petfinder platform provides something of a standardized way to create pet descriptions. Therefore, the text we will be dealing with will hopefully be cleaner than other online sources (social media and forums, for example). Also, we want to keep the descriptions unaltered as much as possible to preserve the original tone and quality of the description.
textacy is a wonderful library for processing and analyzing text that is built on top of the natural language library spaCy. The textacy library makes it much more straightforward to preprocess text with its preprocessing module. There are many text-processing functions available in textacy; however, we will only take advantage of normalize_whitespace to normalize the pet descriptions and hopefully get better results from the Watson Tone Analyzer. We also use the unescape function from Python's standard html library to convert any escaped HTML entities into their actual characters. The last line replaces any HTML-escaped apostrophe (&#39;) with a real apostrophe.
The functions from the preprocessing module and others are applied using pandas' apply method. A new column, description_clean, is created to store the cleaned descriptions, preserving the originals just in case.
cat_dog['description_clean'] = cat_dog['description'].apply(
    lambda x: preprocessing.normalize_whitespace(str(x))).apply(
    lambda x: html.unescape(str(x))).apply(
    lambda x: x.replace('&#39;', "'"))
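The effect of the cleaning steps can be seen on a sample string. Since textacy may not be installed everywhere, the whitespace normalization below is approximated with a standard-library regex as a stand-in for textacy's normalize_whitespace:

```python
import html
import re

raw = 'Mia  is a sweet   girl.\n\nShe can&#39;t wait to meet you!'

# Collapse runs of whitespace (a stand-in for textacy's normalize_whitespace).
cleaned = re.sub(r'\s+', ' ', raw).strip()
# Convert escaped HTML entities such as &#39; back to their characters.
cleaned = html.unescape(cleaned)
print(cleaned)  # Mia is a sweet girl. She can't wait to meet you!
```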
Analyzing the Tone of the Cat and Dog Descriptions
We can now use the cleaned description text as input to the Watson Tone Analyzer API. For each row in the DataFrame, we call the tone method of the ToneAnalyzerV3 class we initialized, setting the content_type parameter to plain text and the sentences parameter to False, as we are not interested in sentence-level tone analysis for this task. We iterate through the DataFrame rows using the iterrows() method.
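For reference, the document-level portion of a Tone Analyzer response has the following shape (the tones and scores below are made up for illustration):

```python
# Illustrative response structure; the tone values and scores are invented.
sample_response = {
    'document_tone': {
        'tones': [
            {'score': 0.88, 'tone_id': 'joy', 'tone_name': 'Joy'},
            {'score': 0.61, 'tone_id': 'confident', 'tone_name': 'Confident'},
        ]
    }
}

# Each detected tone is a dict with a score, an identifier, and a display name.
for tone in sample_response['document_tone']['tones']:
    print(tone['tone_name'], tone['score'])
```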
As with most APIs, the Tone Analyzer API returns a JSON object. Therefore, we must coerce the result into a tabular structure to make it easier to analyze. Fortunately, pandas provides a convenient function, json_normalize, for normalizing structured JSON into a flat data table. We will leverage this function to coerce the Tone Analyzer JSON results into a pandas DataFrame and combine each pet's data into a new DataFrame.
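Applied to an illustrative response (made-up scores), json_normalize flattens the nested tones list into one row per detected tone:

```python
import pandas as pd

sample = {'document_tone': {'tones': [
    {'score': 0.88, 'tone_id': 'joy', 'tone_name': 'Joy'},
    {'score': 0.61, 'tone_id': 'confident', 'tone_name': 'Confident'},
]}}

# The second argument is the record path: the list of dicts to flatten.
flat = pd.json_normalize(sample['document_tone'], 'tones')
print(flat.shape)             # (2, 3)
print(flat.columns.tolist())  # ['score', 'tone_id', 'tone_name']
```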
tone_dfs = []
for idx, row in cat_dog.iterrows():
    tones = tone_analyzer.tone(row['description_clean'],
                               content_type='text/plain;charset=utf-8',
                               sentences=False).get_result()
    # Flatten the nested 'tones' list into one row per detected tone.
    tones_df = pd.json_normalize(tones['document_tone'], 'tones')
    # Repeat the pet's attributes on every tone row for that pet.
    pet_df = pd.DataFrame([row] * len(tones_df)).reset_index(drop=True)
    tone_dfs.append(pd.concat([tones_df, pet_df], axis=1))
cat_dog_tones = pd.concat(tone_dfs, ignore_index=True)
Once the iteration through the DataFrame is complete and we have combined the Watson Tone Analyzer API output with the original data, we inspect the first few rows of the resulting DataFrame to make sure the output is what we expect.
cat_dog_tones[['animal_type', 'tone_name', 'score', 'description_clean']].head()
We can see the Tone Analyzer API data (tone_id, tone_name, and score) have been appended to the data! As before, it is a good idea to save the results to a file to avoid calling the APIs more than necessary (and incurring additional fees).
cat_dog_tones.to_csv('../data/cat_dog_tones.csv', index=False, encoding='utf-8')
We have accomplished the first step of extracting and preparing the data for analysis! Using the scores output by the Tone Analyzer API will allow us to see whether there are any significant differences in how shelters represent adoptable cats and dogs. In the next post, we will analyze and visualize the tone scores of the compiled adoptable cat and dog descriptions to see if there are any significant differences in the tones used when describing the two types of animals.