Combining the multiprocessing package for concurrent use of multiple CPUs with the petpy package for interacting with the Petfinder API allows one to find and download a vast number of animal images for use in other tasks, such as image classification.
This post will introduce how to use the multiprocessing and petpy packages to quickly and easily download a large set of cat images across all of the breeds available in the Petfinder database. We will end up with a collection of just under 45,000 cat images sorted by user-defined breed classifications.
Start by importing the various packages and modules that will be needed.
import petpy
import os
import pandas as pd
import urllib.request
import urllib.error
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool with the same API as the process-based Pool
Get the available cat breeds
Create a connection to the Petfinder API by calling the Petfinder class from the petpy package with your API key. You can obtain an API key by creating an account on the Petfinder developer page.
key = os.getenv('PETFINDER_KEY')
pf = petpy.Petfinder(key)
Use the breed_list() method to get the available cat breeds in the Petfinder database.
cat_breeds = pf.breed_list('cat', return_df=True)['cat breeds'].tolist()
Extracting cat breed records with multiprocessing
To speed up the process of extracting the pet records in the Petfinder database for each breed, we will use the multiprocessing library to spread the task across multiple cores. Note that multiprocessing.dummy provides a thread-based Pool, which suits this I/O-bound task well.
The machine I am working with has a quad-core CPU, so I set the maximum number of workers that can run at one time to twice that amount. It is generally not recommended to exceed the machine's core count by too much (I've heard no more than double the number of cores is the recommended maximum for longer tasks), as the overall process can slow down if the program has to switch between workers more than is required.
pool = ThreadPool(processes=8)
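If you would rather derive the pool size from the machine than hard-code it, here is a minimal sketch; the 2x multiplier simply mirrors the rule of thumb above and is an assumption, not a hard rule.
# Hypothetical alternative: size the pool from the core count.
# os.cpu_count() can return None, hence the fallback value.
n_workers = (os.cpu_count() or 4) * 2
pool = ThreadPool(processes=n_workers)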
To leverage the concurrency provided by multiprocessing, we first define a worker function that wraps the pet_find() method. We will pull up to 1,500 pet records (500 records per page across 3 pages) for each individual breed.
def get_cat_breeds(cat):
    # Pull up to 3 pages of 500 records each for the given breed.
    breeds = pf.pet_find('US', animal='cat', breed=cat, count=500, pages=3, return_df=True)
    return breeds
With the worker function and the pool initialized, we can begin extracting the pet records concurrently using the pet_find() method in petpy. We also time the duration of the process using the %%time magic function available in Jupyter Notebook.
%%time
# Use the Jupyter Notebook time magic function to record how long it takes to accumulate the results.
cats = pool.map(get_cat_breeds, cat_breeds)
pool.close()
pool.join()
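As an aside, Pool objects also support the context manager protocol; an equivalent sketch that cleans up the pool automatically would look like this.
# Equivalent using the pool as a context manager; terminate() is called
# automatically on exit, so no explicit close()/join() is needed here
# because map() already blocks until all results are collected.
with ThreadPool(processes=8) as pool:
    cats = pool.map(get_cat_breeds, cat_breeds)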
The entire process took just under five and a half minutes, likely due to the time taken to convert the JSON results from the API into pandas DataFrames. The completed pool process returns the collected results as a list, which we can convert to a DataFrame with concat().
cats = pd.concat(cats)
The process compiled roughly 32,000 individual pet records that matched the breeds in the cat_breeds list. Admittedly, the process of coercing the JSON results from the Petfinder API into pandas DataFrames could likely be much more efficient; however, we were able to find just under 32,000 adoptable cat records from animal shelters across the United States and return the results in a clean and tidy DataFrame, so this seems like an acceptable trade-off.
len(cats)
Cleaning the data to get the image URLs
As we are only interested in the photos associated with each cat, we can clean the extracted data set to reshape it and remove the data that is not needed for the task at hand.
The following helper function is used to clean and reshape the data we need.
def get_images(df):
    try:
        del df['media.photos.photo']  # This column may need to be deleted manually.
    except KeyError:
        pass

    # Keep only the columns that contain 'id', 'breed', and 'photo'.
    photos = df[df.columns[df.columns.str.contains('id|breed|photo')]]

    # Melt the data to reshape it from wide to long and remove any NAs introduced by empty photo records.
    photos_melted = pd.melt(photos, id_vars=['id', 'breed0', 'breed1'])
    photos_melted.dropna(subset=['value'], inplace=True)
    del photos_melted['variable']

    # The Petfinder API gives two fields for breed, thus we want to split these into individual data sets.
    breeds1 = photos_melted.loc[:, ['id', 'breed0', 'value']]
    breeds2 = photos_melted.loc[:, ['id', 'breed1', 'value']]
    breeds2.dropna(inplace=True)

    # The columns of each breed data set are renamed and appended, with the index column deleted.
    breeds1.rename(columns={'breed0': 'breed', 'value': 'img'}, inplace=True)
    breeds2.rename(columns={'breed1': 'breed', 'value': 'img'}, inplace=True)
    breed_photos = breeds1.append(breeds2).drop_duplicates().reset_index()
    del breed_photos['index']

    return breed_photos
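If pd.melt is unfamiliar, here is a tiny standalone illustration, with made-up column names, of the wide-to-long reshape the helper relies on.
# Toy example only: two photo columns melt into a single 'value' column,
# with each 'id' repeated once per original photo column.
toy = pd.DataFrame({'id': [1, 2],
                    'photo0': ['a.jpg', 'b.jpg'],
                    'photo1': ['c.jpg', None]})
print(pd.melt(toy, id_vars=['id']))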
cat_breed_images = get_images(cats)
The DataFrame returned by our helper function contains only the data of interest.
cat_breed_images.head()
The cleaning of the data left us with just over 20,000 unique cat records from the Petfinder database.
len(cat_breed_images['id'].unique())
The Petfinder API provides several different sizes of each image uploaded and associated with a record, for thumbnails, search results, and individual pet profiles. We want to extract the size information from each image URL so we can keep only the largest images for future tasks.
cat_breed_images['image_width'] = (cat_breed_images['img']
                                   .str.split('width=', n=1).str[1]
                                   .str.split('&', n=1).str[0]
                                   .astype(int))
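An equivalent, arguably more readable way to pull the width out of the URL is a regular expression with str.extract; a sketch, assuming every URL contains a width= parameter:
# Hypothetical alternative: capture the digits following 'width=' directly.
# expand=False returns a Series rather than a one-column DataFrame.
cat_breed_images['image_width'] = (cat_breed_images['img']
                                   .str.extract(r'width=(\d+)', expand=False)
                                   .astype(int))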
We can get a quick count of each unique image width using the value_counts() method.
cat_breed_images['image_width'].value_counts()
As there is the same number of images for each size, we can go ahead and filter the data set to keep only the images listed with a 500-pixel width.
cat_images_largest = cat_breed_images.groupby('id').apply(lambda x: x[x['image_width'] == 500])
Remove some of the added columns resulting from the groupby and apply operations as well as reset the index.
del cat_images_largest['id']
cat_images_largest.reset_index(inplace=True)
del cat_images_largest['level_1']
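If the grouped structure isn't otherwise needed, a plain boolean mask gives the same rows without introducing the extra index columns in the first place; a minimal alternative sketch:
# Alternative: filter directly on the width column, which avoids the
# groupby/apply and the subsequent index cleanup entirely.
cat_images_largest = (cat_breed_images[cat_breed_images['image_width'] == 500]
                      .reset_index(drop=True))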
Also, replace spaces with underscores and strip '/' characters to help clean up the breed names.
cat_images_largest['breed'] = cat_images_largest['breed'].str.replace(' ', '_')
cat_images_largest['breed'] = cat_images_largest['breed'].str.replace('/', '')
Apply the drop_duplicates() method to the DataFrame to remove duplicate pet records and images.
breed_images = cat_images_largest.drop_duplicates(subset=['img', 'id'])
Our cat image DataFrame is now reshaped into the format we need with only unique pet records, so we are almost ready to begin downloading the images. First, we can get a quick count of the number of images we have for each respective breed by using the value_counts() method as we did earlier.
breed_images['breed'].value_counts()
As we would expect, more common breeds such as the Domestic Short Hair, Medium Hair, and Long Hair have the most images available. Although the Petfinder API lists the Tuxedo, Calico, and Tabby as breeds, these are actually just coat colorings and are not genetically distinct enough to be considered individual 'breeds'. As the API data is user-entered by shelters and organizations listing cats for adoption, this is to be expected. I decided to keep those images rather than filter them out.
Downloading the cat images
Before downloading the images, we first need to coerce our results stored in a DataFrame into a list of lists so we can take advantage of the multiprocessing module. First, keep only the first 5,000 images for each breed; for our current data set this only cuts a few thousand images from the Domestic Short Hair breed. 5,000 is an admittedly arbitrary cutoff.
breed_images_5000 = breed_images.groupby('breed').head(5000).reset_index()
del breed_images_5000['index']
We then take the columns of the DataFrame we need and convert them each to a list.
urls, breed, index = breed_images_5000['img'].tolist(), breed_images_5000['breed'].tolist(), breed_images_5000.index.tolist()
breed_list = [index, breed, urls]
As of now, our list is just a list of three lists containing the information we need. We must rearrange it into a format that allows us to easily feed the values into the Pool process as it iterates through them. Therefore, we create a new list and iterate through the breed_list collection, appending the values of each list into the newly created list (a zip-based equivalent is shown below).
breed_list_new = []
for i in range(0, len(breed_list[0])):
    breed_list_new.append([breed_list[0][i], breed_list[1][i], breed_list[2][i]])
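The same rearrangement can be written more compactly with zip, which pairs the three lists element-wise:
# Equivalent one-liner: zip yields one (index, breed, url) tuple per record.
breed_list_new = [list(row) for row in zip(index, breed, urls)]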
len(breed_list_new)
We see we have just under 45,000 images with URLs compiled in the new list.
To keep the images organized after downloading, we first create individual directories within the main directory where we will store the downloaded images. To do this, use the unique() method of a pandas Series and convert the result to a list, like so:
breed_dirs = list(breed_images_5000['breed'].unique())
We then create individual directories for the breed images by iterating through the list and using the makedirs() function in the os module.
for i in breed_dirs:
    os.makedirs('cat_breeds/' + str(i))
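One caveat: makedirs() raises FileExistsError if a directory already exists, so re-running the cell fails. A re-run-safe sketch using exist_ok:
# Re-run-safe variant: exist_ok=True ignores directories that already
# exist, and os.path.join builds the path portably.
for breed in breed_dirs:
    os.makedirs(os.path.join('cat_breeds', str(breed)), exist_ok=True)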
With the directories created, we can proceed to writing the worker function that will be used to download the images in the Pool process, as we did previously when compiling the pet record results. The downloaded images will follow the naming convention BREEDNAME_INDEX. We also make sure to handle urllib's HTTPError exception when grabbing the images from the URLs.
def download_breed_images(breed_img):
    try:
        urllib.request.urlretrieve(breed_img[2],
                                   os.path.join('cat_breeds/', str(breed_img[1]),
                                                str(breed_img[1]) + str(breed_img[0]) + '.jpg'))
    except urllib.error.HTTPError as err:
        print(err.code)
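Note that urlretrieve() can also raise urllib.error.URLError for network-level failures (DNS errors, timeouts) that the HTTPError handler above won't catch. A slightly broader variant might look like this; since HTTPError is a subclass of URLError, it must be handled first.
# Broader error handling (sketch): catch network-level failures as well.
def download_breed_images_safe(breed_img):
    try:
        urllib.request.urlretrieve(breed_img[2],
                                   os.path.join('cat_breeds', str(breed_img[1]),
                                                str(breed_img[1]) + str(breed_img[0]) + '.jpg'))
    except urllib.error.HTTPError as err:
        print(err.code)
    except urllib.error.URLError as err:
        print(err.reason)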
As the process is I/O bound, in that all we are doing is calling a URL and downloading the stored image, we can increase the number of workers since each iteration should be quick. I chose a value of 5x the number of cores available on my machine, again an arbitrary choice that may or may not be the most efficient =).
pool = ThreadPool(processes=20)
We are now ready to download the images to our machine! As before, we start the pool using the map() method and track how long the process takes to download the images. Any HTTPErrors that arise will also be printed with their error codes.
%%time
pool.map(download_breed_images, breed_list_new)
pool.close()
pool.join()
The entire process to download just under 45,000 images took about six and a half minutes and produced only two HTTP errors! The images are stored in a separate cat_breeds directory with subdirectories containing the respective breed images.
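As a quick sanity check, one can count the downloaded files per breed directory (assuming the notebook runs from the same working directory as the download):
# Count downloaded images across all breed subdirectories.
total = sum(len(os.listdir(os.path.join('cat_breeds', d)))
            for d in os.listdir('cat_breeds'))
print(total)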
Summary
I hope this post served as a fun and useful introduction to what is possible with the multiprocessing module and the petpy library. Please note that because the Petfinder API is a public API with users around the country inputting data, and because records are continually added and removed as pets move through the shelter system to adoption, the results obtained above will likely differ when the exercise is performed at other times.
The images that were downloaded during this exercise can also be downloaded as a tar.gz file using the following Dropbox link (warning: the file is about 1.5GB).