Many animals listed on Petfinder are also given a description by the shelter that provides further details and information on the pet. These descriptions are useful for increasing interest among potential adopters by helping to establish a more personal connection to the animal beyond cute pictures alone (not that I can ever get enough of cute cat pictures).
Do these descriptions vary in tone depending on the type of animal, the animal's age, or other attributes? Through the combination of several Python libraries, including petpy, pandas, and the IBM Watson Tone Analyzer API, we will take the first step in answering these questions and more by cleaning and transforming the extracted data and adoptable pet descriptions from the Petfinder API.
Before diving in, import the libraries that we will use throughout the analysis.
from petpy import Petfinder
import os
import textacy
from textacy import preprocessing
import html
import numpy as np
import pandas as pd
from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
Receiving the Needed API Keys from Petfinder and IBM Watson
To receive a Petfinder API key, an account must be created on Petfinder's developer page. After creating an account, Petfinder will generate an API and secret key that will be used to create an authenticated connection to the API.
The steps to getting started with IBM Watson are a bit more involved. First, create an account by clicking the 'Get Started for Free' button on the Watson Tone Analyzer home page. A Lite plan offers 2,500 free API calls per month, which is plenty for our purposes. The next step is to create a service instance of the Tone Analyzer. The documentation on creating an account and a service instance has much more information if needed. After a service instance of the Tone Analyzer has been created, the required credentials to connect to the API programmatically will be provided.
The libraries for interacting with the Petfinder API and IBM Watson will need to be installed before authenticating with the obtained credentials for each respective API. Install the following libraries (if not already installed) to begin using the APIs.
pip install --upgrade petpy
pip install --upgrade ibm-watson
Authenticating the API Connections
Once the respective API credentials are obtained, we can authenticate our connections to the APIs. It is recommended to obfuscate the API credentials to avoid exposing the keys to the public. One approach to securing the generated API keys is to store them as environment variables and load them with os.environ.
pf_key = os.environ['PETFINDER_KEY']
pf_secret_key = os.environ['PETFINDER_SECRET_KEY']
watson_tone_service_api_key = os.environ['WATSON_TONE_SERVICE_API_KEY']
watson_tone_service_url = os.environ['WATSON_TONE_SERVICE_URL']
Authenticating with the Petfinder API involves initializing the
Petfinder class with the received key from Petfinder. The connection to the IBM Watson Tone Analyzer API is made through initializing the
ToneAnalyzerV3 class using the service URL and API key obtained previously. Another example of authenticating with the Tone Analyzer service (and other IBM services) can be found in the Python-SDK library on Github.
pf = Petfinder(pf_key, pf_secret_key)

authenticator = IAMAuthenticator(watson_tone_service_api_key)
tone_analyzer = ToneAnalyzerV3(authenticator=authenticator, version='2017-09-21')
tone_analyzer.set_service_url(watson_tone_service_url)
Now that our connections to the needed APIs have been authenticated, we can begin extracting the required data and analyzing the tones of adoptable pet descriptions!
Our strategy for preparing the pet adoption data is as follows:
- Extract a sample of adoptable cat and dog descriptions from the Petfinder API.
- Combine the extracted data and remove unnecessary data points.
- Preprocess the provided animal descriptions to clean the text.
- Obtain the tone scores of the cleaned pet descriptions using the IBM Watson Tone Analyzer.
Extracting Adoptable Cat and Dog Descriptions from Petfinder
As we only have 2,500 free calls to the Tone Analyzer API per month, we limit the number of results returned from the Petfinder database to 1,000 cat and 1,000 dog records. The animals method makes it easy to extract a sample of adoptable pet information from the database.
cats = pf.animals(animal_type='cat', status='adoptable',
                  results_per_page=100, pages=10, return_df=True)
dogs = pf.animals(animal_type='dog', status='adoptable',
                  results_per_page=100, pages=10, return_df=True)
The cat and dog data extracted from the Petfinder database are then concatenated using pd.concat. We also print the shape of the DataFrame to ensure the number of records is what we expect. As we are extracting 1,000 cat and 1,000 dog records, the total number of rows should be 2,000.
cat_dog = pd.concat([cats, dogs])
cat_dog.shape
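As a quick illustration of how pd.concat stacks two frames row-wise and how .shape confirms the resulting record count, here is a small sketch with made-up data (the frame names and values are invented for demonstration):

```python
import pandas as pd

# Two small stand-in frames, analogous to the cat and dog extracts.
cats_demo = pd.DataFrame({'animal_type': ['cat', 'cat'],
                          'name': ['Mittens', 'Tiger']})
dogs_demo = pd.DataFrame({'animal_type': ['dog', 'dog'],
                          'name': ['Rex', 'Fido']})

# Rows are stacked; columns are aligned by name.
combined = pd.concat([cats_demo, dogs_demo])
print(combined.shape)  # (4, 2)
```

The same check on the real data is what tells us whether the two 1,000-record extracts combined as expected.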
To avoid calling the Petfinder API more than necessary, and for reproducible results, it is good practice to save the raw extracted results from the API to a CSV or another format before applying any transformations or data cleansing.
cat_dog.to_csv('../data/cat_dog.csv', index=False, encoding='utf-8')
We also want to make sure all the pet records have a description available for analysis; therefore, we filter out any records with a missing description before proceeding. After removing empty or missing descriptions, we print the new shape of the DataFrame to see how many records were removed.
cat_dog = cat_dog.loc[~pd.isnull(cat_dog['description'])]
cat_dog.shape
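The same filtering idiom, shown on a tiny invented frame so the effect of ~pd.isnull is clear (both NaN and None count as missing):

```python
import numpy as np
import pandas as pd

pets = pd.DataFrame({
    'name': ['Mittens', 'Rex', 'Tiger'],
    'description': ['A sweet lap cat.', np.nan, None],
})

# Keep only rows whose description is not null.
pets = pets.loc[~pd.isnull(pets['description'])]
print(pets.shape)  # (1, 2)
```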
Nearly 700 records did not have a description, so let's make sure we still have a reasonable number of each animal type.
print('Cats: ' + str(len(cat_dog.loc[cat_dog['animal_type'] == 'cat'])))
print('Dogs: ' + str(len(cat_dog.loc[cat_dog['animal_type'] == 'dog'])))
Cats: 674
Dogs: 651
Great! We still have a roughly equal number of adoptable cat and dog records. Although we lost a fair number of records after removing those without a description, we should still have a large enough sample of each type. The columns available in the combined DataFrame are:

Index(['id', 'organization_id', 'url', 'type', 'species', 'age', 'gender', 'size', 'coat', 'tags', 'name', 'description', 'photos', 'videos', 'status', 'status_changed_at', 'published_at', 'distance', 'breeds.primary', 'breeds.secondary', 'breeds.mixed', 'breeds.unknown', 'colors.primary', 'colors.secondary', 'colors.tertiary', 'attributes.spayed_neutered', 'attributes.house_trained', 'attributes.declawed', 'attributes.special_needs', 'attributes.shots_current', 'environment.children', 'environment.dogs', 'environment.cats', 'contact.email', 'contact.phone', 'contact.address.address1', 'contact.address.address2', 'contact.address.city', 'contact.address.state', 'contact.address.postcode', 'contact.address.country', 'animal_id', 'animal_type', 'organization_id.1'], dtype='object')
Preprocessing the Pet Descriptions
Text preprocessing is a crucial step when analyzing text, especially text from the web. Fortunately, the Petfinder platform provides something of a standardized way to create pet descriptions. Therefore, the text we will be dealing with will hopefully be cleaner than other online sources (social media and forums, for example). Also, we want to keep the descriptions unaltered as much as possible to preserve the original tone and quality of the description.
textacy is a wonderful library for processing and analyzing text that is built on top of the natural language library spaCy. The textacy library makes it much more straightforward to preprocess text with its preprocessing module. There are many text-processing methods available in textacy; however, we will only take advantage of the normalize_whitespace method to normalize the pet descriptions and hopefully get better results from the Watson Tone Analyzer. We also use the unescape function from Python's standard html library to convert any escaped HTML entities into their actual characters. The last line replaces any remaining HTML-escaped apostrophe with a real apostrophe.
The preprocessing methods from the preprocessing module and others are applied with pandas' apply method. A new column, description_clean, is created to store the cleaned descriptions, preserving the original descriptions just in case.
cat_dog['description_clean'] = cat_dog['description'].apply(
    lambda x: preprocessing.normalize_whitespace(str(x))).apply(
    lambda x: html.unescape(str(x))).apply(
    lambda x: x.replace('&#039;', "'"))
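As a quick sanity check of the HTML-unescaping step, here is what Python's standard html module does to an invented description string containing escaped entities:

```python
import html

# A made-up raw description with HTML-escaped characters.
raw = 'Mittens &amp; Rex are best friends. She&#039;s very playful!'
clean = html.unescape(raw)
print(clean)  # Mittens & Rex are best friends. She's very playful!
```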
Analyzing the Tone of the Cat and Dog Descriptions
We can now use the cleaned description text as input to the Watson Tone Analyzer API. For each row in the DataFrame, we call the tone method of the ToneAnalyzerV3 class we initialized, setting the content_type parameter to plain text and the sentences parameter to False, as we are not interested in sentence-level tone analysis for this task. We iterate through the DataFrame rows using the iterrows method.
As with most APIs, the data returned from the Tone Analyzer API is a JSON object. Therefore, we must coerce the result into a tabular structure to make it easier to analyze. Fortunately, pandas provides a convenient function, json_normalize, for normalizing structured JSON into a flat data table. We will leverage this function to coerce the Tone Analyzer JSON results into a pandas DataFrame and combine the respective pet's data into a new DataFrame.
cat_dog_tones = pd.DataFrame()

for idx, row in cat_dog.iterrows():
    tones = tone_analyzer.tone(row['description_clean'],
                               content_type='text/plain;charset=utf-8').get_result()
    tones_df = pd.io.json.json_normalize(tones['document_tone'], 'tones')
    tones_df[cat_dog.columns.tolist()] = pd.DataFrame([row], index=tones_df.index)
    cat_dog_tones = cat_dog_tones.append(tones_df)
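To make the flattening step concrete, here is a small sketch with a hand-written payload shaped like a Tone Analyzer document response (the tone names and scores are invented; newer pandas versions expose the function as pd.json_normalize):

```python
import pandas as pd

# A minimal mock of the JSON returned by tone_analyzer.tone(...).get_result()
mock_response = {
    'document_tone': {
        'tones': [
            {'score': 0.76, 'tone_id': 'joy', 'tone_name': 'Joy'},
            {'score': 0.60, 'tone_id': 'analytical', 'tone_name': 'Analytical'},
        ]
    }
}

# json_normalize flattens the nested 'tones' list into one row per detected tone.
tones_df = pd.json_normalize(mock_response['document_tone'], 'tones')
print(tones_df.shape)  # (2, 3)
```

Each pet description can therefore yield several rows, one per detected tone, which is why the same description appears multiple times in the combined results below.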
Once the iteration through the DataFrame is complete and we have combined the Watson Tone Analyzer API output with the original data, we inspect the first few rows of the resulting DataFrame to make sure the output is what we expect.
cat_dog_tones[['animal_type', 'tone_name', 'score', 'description_clean']].head()
   animal_type  tone_name   score     description_clean
0  cat          Joy         0.545500  Primary Color: Brown Tabby Weight: 12.19lbs Ag...
0  cat          Joy         0.756990  Sweetie Pie is a great choice for any home, so...
1  cat          Confident   0.828525  Sweetie Pie is a great choice for any home, so...
2  cat          Analytical  0.596122  Sweetie Pie is a great choice for any home, so...
0  cat          Tentative   0.715720  Jewel is a little shy at first. She would like...
We can see the tone names and scores from the Tone Analyzer API have been appended to the data! As before, it is a good idea to save the results to a file to avoid calling the APIs more than necessary (and incurring additional fees).
cat_dog_tones.to_csv('../data/cat_dog_tones.csv', index=False, encoding='utf-8')
We have accomplished the first step of extracting and preparing the data for analysis! We now proceed to investigate the scores of the tones in the adoptable pet descriptions as analyzed by the algorithms used under the hood by the Watson Tone Analyzer API. Using the scores output from the Tone Analyzer API will allow us to see if there are any significant differences in how shelters represent adoptable cats and dogs.
In the next post, we will analyze and visualize the tone scores we received from the IBM Watson Tone Analyzer for the compiled adoptable cat and dog descriptions to see if there are any significant differences between the tones used when describing each type of animal.