Download 45,000 Adoptable Cat Images in 6.5 Minutes with petpy and multiprocessing

Combining the multiprocessing package for running tasks concurrently with the petpy package for interacting with the Petfinder API makes it possible to find and download a large number of animal images for use in other tasks, such as image classification.

This post introduces how to use the multiprocessing and petpy packages to quickly and easily download a large set of cat images covering all the breeds available in the Petfinder database. We will end up with a collection of just under 45,000 cat images sorted by user-defined breed classifications.

Start by importing the various packages and modules that will be needed.

In [1]:
import petpy
import os
import pandas as pd
import urllib.request
import urllib.error
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Get the available cat breeds

Create a connection to the Petfinder API by calling the Petfinder class from the petpy package with your given API key. You can receive an API key by creating an account on the Petfinder developer page.

In [2]:
key = os.getenv('PETFINDER_KEY')
pf = petpy.Petfinder(key)
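
If the environment variable is not set, os.getenv() returns None and the problem may only surface later, when the first API call fails. A minimal sketch of a guard is shown below; the helper name load_api_key is hypothetical and not part of petpy.

```python
import os

def load_api_key(var_name='PETFINDER_KEY'):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.getenv(var_name)
    if not key:
        # Fail early with a readable message instead of passing None to the API client.
        raise RuntimeError(var_name + ' is not set; export it before starting the notebook.')
    return key
```

The returned value could then be passed to petpy.Petfinder() in place of the bare os.getenv() call.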

Use the breed_list() method to get the available cat breeds in the Petfinder database.

In [3]:
cat_breeds = pf.breed_list('cat', return_df=True)['cat breeds'].tolist()

Extracting cat breed records with multiprocessing

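To speed up the process of extracting the pet records in the Petfinder database for each breed, we will use the multiprocessing library to run the requests concurrently. Note that the Pool from multiprocessing.dummy, imported above as ThreadPool, is backed by threads rather than separate processes, which suits work that mostly waits on network responses.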

The machine I am working with has a quad-core CPU, so I set the maximum number of workers that can run at one time to twice that. It is generally not recommended to exceed the number of cores by too much (I've heard that double the core count is the recommended maximum for longer tasks), as the program can slow down if it has to switch between workers more than necessary.

In [4]:
pool = ThreadPool(processes=8)
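
Rather than hard-coding the pool size, the same "twice the core count" rule of thumb could be derived from the machine itself. This is only a sketch; the helper name worker_count and its multiplier parameter are hypothetical.

```python
import os

def worker_count(multiplier=2):
    """Return multiplier times the number of CPU cores reported by the OS."""
    # os.cpu_count() can return None when the count is unknown, so fall back to 1.
    cores = os.cpu_count() or 1
    return multiplier * cores
```

On a quad-core machine that reports four cores, worker_count() gives the value of 8 used above, and the result could be passed as ThreadPool(processes=worker_count()).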

To leverage the concurrency provided by multiprocessing, we first define a worker function that wraps the pet_find() method. We will pull up to 1,500 pet records for each individual breed.

In [5]:
def get_cat_breeds(cat):
    breeds = pf.pet_find('US', animal='cat', breed=cat, count=500, pages=3, return_df=True)
    return breeds

With the worker function and the pool initialized, we can begin extracting the pet records using the pet_find() method in petpy concurrently. We also time the duration of the process using the %%time magic function available in Jupyter Notebook.

In [6]:
%%time
# Use the Jupyter Notebook time magic to record how long it takes to accumulate the results.
cats = pool.map(get_cat_breeds, cat_breeds) 
pool.close()
pool.join()
Wall time: 5min 25s

The entire process took just under five and a half minutes, much of it likely spent converting the JSON results from the API into pandas DataFrames. The completed pool call returns the collected results as a list, which we can combine into a single DataFrame with concat().

In [7]:
cats = pd.concat(cats)
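
As a small, self-contained illustration of this map-then-concat pattern, the sketch below swaps the API call for a hypothetical stand-in, fake_breed_lookup, so it runs without an API key. The breed names and the 'record' column are made up for the example.

```python
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool

def fake_breed_lookup(breed):
    # Hypothetical stand-in for get_cat_breeds(): returns a tiny DataFrame
    # instead of calling the Petfinder API.
    return pd.DataFrame({'breed': [breed, breed], 'record': [1, 2]})

# map() returns one result per input, in the same order as the inputs.
with ThreadPool(processes=4) as demo_pool:
    frames = demo_pool.map(fake_breed_lookup, ['Abyssinian', 'Bengal', 'Siamese'])

# Stack the per-breed DataFrames into a single DataFrame.
combined = pd.concat(frames, ignore_index=True)
```

Because map() preserves input order, the rows of the combined DataFrame appear breed by breed in the order the breeds were supplied.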

The extraction compiled just under 32,000 individual pet records that matched the breeds in the cat_breeds list. Admittedly, the process of coercing the JSON results from the Petfinder API into pandas DataFrames could likely be much more efficient; however, we were able to collect these adoptable cat records from animal shelters across the United States and return them in a clean and tidy DataFrame, so this seems like an acceptable trade-off.

In [8]:
len(cats)
Out[8]:
31858

Cleaning the data to get the image URLs

As we are only interested in the associated photo images for each cat, we can clean the data set we extracted to reshape and remove the data that is not needed for the task at hand.

The following helper function is used to clean and reshape the data we need.

In [13]:
def get_images(df):
    try:
        del df['media.photos.photo'] # This column may need to be deleted manually.
    except KeyError:
        # The column is not present, so there is nothing to delete.
        pass
    
    # Keep only the columns that contain 'id', 'breed', and 'photo'
    photos = df[df.columns[df.columns.str.contains('id|breed|photo')]]
    # Melt the data to reshape it from wide to long and remove any NAs introduced from empty photo records.
    photos_melted = pd.melt(photos, id_vars=['id', 'breed0', 'breed1'])
    photos_melted.dropna(subset=['value'], inplace=True)
    del photos_melted['variable']
    
    # The Petfinder API gives two fields for breed, thus we want to split these into individual data sets 
    breeds1 = photos_melted.loc[:,['id', 'breed0', 'value']]
    breeds2 = photos_melted.loc[:,['id', 'breed1', 'value']]
    breeds2.dropna(inplace=True)
    
    # The columns of each breed dataset are renamed and appended with the index column deleted.
    breeds1.rename(columns={'breed0':'breed', 'value':'img'}, inplace=True)
    breeds2.rename(columns={'breed1':'breed', 'value':'img'}, inplace=True)
    
    # DataFrame.append() was removed in pandas 2.0, so combine the two sets with pd.concat().
    breed_photos = pd.concat([breeds1, breeds2]).drop_duplicates().reset_index(drop=True)
    
    return breed_photos
In [14]:
cat_breed_images = get_images(cats)

The resulting DataFrame that is output from our helper function contains only the data that is of interest.

In [15]:
cat_breed_images.head()
Out[15]:
id breed img
0 40181161 Abyssinian http://photos.petfinder.com/photos/pets/401811...
1 40181148 Abyssinian http://photos.petfinder.com/photos/pets/401811...
2 38018075 Abyssinian http://photos.petfinder.com/photos/pets/380180...
3 38017865 Domestic Short Hair http://photos.petfinder.com/photos/pets/380178...
4 38017999 Abyssinian http://photos.petfinder.com/photos/pets/380179...

The cleaning of the data left us with just over 20,000 unique cat records from the Petfinder database.

In [17]:
len(cat_breed_images['id'].unique())
Out[17]:
20394
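
To make the wide-to-long reshaping inside get_images() more concrete, here is a toy version on made-up records. The column names photo_a and photo_b and the URL strings are hypothetical; the real photo column names come from the API response.

```python
import pandas as pd

# Hypothetical wide-format records: one row per pet, one column per photo.
wide = pd.DataFrame({
    'id': [1, 2],
    'breed0': ['Abyssinian', 'Tabby'],
    'breed1': [None, 'Calico'],
    'photo_a': ['url1a', 'url2a'],
    'photo_b': [None, 'url2b'],
})

# Melt to one row per (pet, photo) pair, then drop the rows for missing photos.
long_df = pd.melt(wide, id_vars=['id', 'breed0', 'breed1'])
long_df = long_df.dropna(subset=['value']).drop(columns='variable')
```

The pet with two photos ends up with two rows and the pet with one photo keeps a single row, which is why the number of rows after melting is larger than the number of unique pet ids.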

The Petfinder API provides several different sizes of each uploaded image associated with a record, for use in thumbnails, search results and individual pet profiles. We want to extract the width from each image URL so that we can keep only the larger images for future tasks.

In [18]:
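# Take the text after 'width=' and before the next '&'. The split limit is passed as n= because recent pandas versions only accept it as a keyword.
cat_breed_images['image_width'] = cat_breed_images['img'].str.split('width=', n=1).str[1].str.split('&').str[0].astype(int)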
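
An alternative way to pull out the width, sketched here as an assumption rather than the approach used in this post, is a regular expression with str.extract(). The URLs below are hypothetical and only mimic the 'width=' query parameter.

```python
import pandas as pd

# Hypothetical URLs that only mimic the 'width=' parameter in the real image links.
urls = pd.Series([
    'http://example.com/photos/1/?bust=1&width=500&-x.jpg',
    'http://example.com/photos/1/?bust=1&width=60&-pnt.jpg',
])

# Capture the digits that follow 'width=' and convert them to integers.
widths = urls.str.extract(r'width=(\d+)', expand=False).astype(int)
```

This avoids the chained split calls, at the cost of failing on the astype(int) step if any URL lacks a width parameter.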

We can get a quick count of each unique image width using the value_counts() method.

In [19]:
cat_breed_images['image_width'].value_counts()
Out[19]:
500    71959
95     71959
60     71959
50     71959
300    71959
Name: image_width, dtype: int64

As there are the same number of images for each size, we can go ahead and filter the data set to keep only the images that are listed with a 500 pixel width.

In [20]:
cat_images_largest = cat_breed_images.groupby('id').apply(lambda x: x[x['image_width'] == 500])

Remove some of the added columns resulting from the groupby and apply operations as well as reset the index.

In [21]:
del cat_images_largest['id']
cat_images_largest.reset_index(inplace=True)
del cat_images_largest['level_1']
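
Since the filter does not depend on anything group-specific, a plain boolean mask should select the same rows in one step, without the extra index levels to clean up afterwards. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up records: two image sizes for each of two pets.
demo = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'breed': ['Tabby', 'Tabby', 'Bengal', 'Bengal'],
    'image_width': [500, 60, 500, 95],
})

# Keep only the 500-pixel-wide images and renumber the rows.
largest = demo[demo['image_width'] == 500].reset_index(drop=True)
```

One difference to be aware of: the mask keeps the rows in their original order, whereas the groupby result is ordered by id.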

Also, replace spaces with underscores and remove '/' characters to help clean up the breed names.

In [22]:
cat_images_largest['breed'] = cat_images_largest['breed'].str.replace(' ', '_')
cat_images_largest['breed'] = cat_images_largest['breed'].str.replace('/', '')
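
The effect of the two replacements can be checked on a couple of sample names. The raw name 'Sphynx / Hairless Cat' is an assumption about how the breed appears before cleaning; regex=False is passed so that both patterns are treated as literal text regardless of the pandas version.

```python
import pandas as pd

# Sample breed names; the second raw name is an assumed pre-cleaning form.
breeds = pd.Series(['Domestic Short Hair', 'Sphynx / Hairless Cat'])

# Spaces become underscores first, then the slash is dropped entirely.
cleaned = (breeds.str.replace(' ', '_', regex=False)
                 .str.replace('/', '', regex=False))
```

Removing the slash after the spaces have been replaced is what leaves a double underscore in names that originally contained ' / '.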

Apply the drop_duplicates() method to the DataFrame to remove duplicate pet records and images.

In [23]:
breed_images = cat_images_largest.drop_duplicates(subset=['img', 'id'])

Our cat image DataFrame is now reshaped into the format we need with only unique pet records, thus we should be almost ready to begin downloading the images. First, we can get a quick count of the number of breed images we have for each respective breed by using the value_counts() method as we did earlier.

In [24]:
breed_images['breed'].value_counts()
Out[24]:
Domestic_Short_Hair                     9658
Domestic_Medium_Hair                    3631
Tabby                                   3630
Domestic_Long_Hair                      3441
American_Shorthair                      2935
Calico                                  2911
Siamese                                 2867
Tortoiseshell                           2804
Tuxedo                                  2538
Maine_Coon                              1930
Russian_Blue                            1438
Tiger                                   1354
Torbie                                  1259
Dilute_Calico                           1211
Dilute_Tortoiseshell                    1142
Bombay                                  1135
Manx                                     687
Bengal                                   636
Extra-Toes_Cat__Hemingway_Polydactyl     494
Turkish_Van                              470
Persian                                  407
Snowshoe                                 383
Bobtail                                  338
Abyssinian                               324
Ragdoll                                  243
Oriental_Short_Hair                      224
Himalayan                                186
Turkish_Angora                           181
British_Shorthair                        146
Egyptian_Mau                             133
                                        ... 
American_Curl                             46
Balinese                                  40
Nebelung                                  40
Oriental_Tabby                            39
Birman                                    36
Selkirk_Rex                               26
Ocicat                                    23
Scottish_Fold                             23
Tonkinese                                 22
Siberian                                  20
Chausie                                   19
Munchkin                                  17
Chartreux                                 14
Japanese_Bobtail                          14
Pixie-Bob                                 14
Applehead_Siamese                         12
Ragamuffin                                11
Cornish_Rex                                9
Devon_Rex                                  9
Cymric                                     8
American_Wirehair                          7
Korat                                      7
LaPerm                                     6
Somali                                     6
Javanese                                   6
Burmilla                                   5
Chinchilla                                 5
Oriental_Long_Hair                         4
Singapura                                  3
Sphynx__Hairless_Cat                       1
Name: breed, Length: 65, dtype: int64

As we would expect, more common breeds such as the Domestic Short Hair, Medium Hair and Long Hair have the most images available. Although the Petfinder API lists the Tuxedo, Calico, and Tabby as breeds, they are actually just colorings and are not genetically distinct enough to be considered individual 'breeds'. As the data is entered by the shelters and organizations listing cats for adoption, this is to be expected. I decided to keep those images rather than filter them out.

Downloading the cat images

Before downloading the images, we first need to coerce our results that are stored in a DataFrame into a list of lists for us to take advantage of the multiprocessing module. First, remove all but the first 5,000 images for each breed, which for our current dataset will only cut a few thousand images for the Domestic Short Hair breed. 5,000 is an admittedly arbitrary number.

In [25]:
breed_images_5000 = breed_images.groupby('breed').head(5000).reset_index()
del breed_images_5000['index']
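
To see how groupby().head() caps each group, here is a toy example with a cap of 2 instead of 5,000. The breed labels and image names are made up.

```python
import pandas as pd

# Made-up data: three images for breed 'A' and two for breed 'B'.
demo = pd.DataFrame({
    'breed': ['A', 'A', 'A', 'B', 'B'],
    'img': ['a1', 'a2', 'a3', 'b1', 'b2'],
})

# Keep at most the first two rows of each breed, preserving the original row order.
capped = demo.groupby('breed').head(2).reset_index(drop=True)
```

Only breed 'A' loses a row here, just as only the Domestic Short Hair breed is trimmed in the real data.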

We then take the columns of the DataFrame we need and convert them each to a list.

In [26]:
urls, breed, index = breed_images_5000['img'].tolist(), breed_images_5000['breed'].tolist(), breed_images_5000.index.tolist()
In [27]:
breed_list = [index, breed, urls]

As of now, our list is just a list of three lists containing the information we need. We must rearrange the list of lists to be in a format that allows us to easily input the values into the Pool process as it iterates through the values. Therefore, we create a new list and iterate through the breed_list collection and append the values of each list into the newly created list.

In [28]:
breed_list_new = []
for i in range(0, len(breed_list[0])):
    breed_list_new.append([breed_list[0][i], breed_list[1][i], breed_list[2][i]])
In [34]:
len(breed_list_new)
Out[34]:
44987

We see we have just under 45,000 images with URLs compiled in the new list.
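
The same rearrangement from three parallel lists into one list of [index, breed, url] rows can also be written with zip(). A short sketch with placeholder values:

```python
# Placeholder values standing in for the real index, breed and URL lists.
index = [0, 1, 2]
breed = ['Abyssinian', 'Bengal', 'Bengal']
urls = ['url0', 'url1', 'url2']

# zip() pairs up the i-th element of each list; list() turns each tuple into a list.
rows = [list(row) for row in zip(index, breed, urls)]
```

Each element of rows has the same [index, breed, url] layout as the entries of breed_list_new.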

To keep the images organized after downloading, we first create individual directories in the main directory where we will store the downloaded images. To do this, use the unique() method of a pandas Series and convert the result to a list, like so:

In [29]:
breed_dirs = list(breed_images_5000['breed'].unique())

We then create individual directories for the breed images by iterating through the list and using the makedirs() function in the os package.

In [30]:
for i in breed_dirs:
    os.makedirs('cat_breeds/' + str(i))
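
Note that os.makedirs() raises an error if a directory already exists, which can happen when the cell is re-run. A sketch of a re-runnable variant using exist_ok=True is shown below; the helper name make_breed_dirs is hypothetical, and a temporary directory stands in for cat_breeds.

```python
import os
import tempfile

def make_breed_dirs(root, breeds):
    """Create one subdirectory per breed under root, skipping any that already exist."""
    for name in breeds:
        os.makedirs(os.path.join(root, str(name)), exist_ok=True)

# Demonstrate in a throwaway directory; calling twice does not raise.
demo_root = tempfile.mkdtemp()
make_breed_dirs(demo_root, ['Abyssinian', 'Bengal'])
make_breed_dirs(demo_root, ['Abyssinian', 'Bengal'])
```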

With the directories created, we can proceed to writing the worker function that will be used to download the images in the Pool process, as we did previously when compiling the pet record results. Each downloaded image is named with the breed name followed directly by its index (BREEDNAMEINDEX.jpg). We also make sure to catch the HTTPError raised by urllib when an image cannot be retrieved from its URL.

In [31]:
def download_breed_images(breed_img):
    try:
        urllib.request.urlretrieve(breed_img[2], 
                                   os.path.join('cat_breeds/', 
                                                str(breed_img[1]), str(breed_img[1]) + str(breed_img[0]) + '.jpg'))
    except urllib.error.HTTPError as err:
        print(err.code)
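
A possible variation, offered as a sketch rather than the version used in this post, returns failures instead of printing them so they can be inspected after the pool finishes. The names image_path and download_one are hypothetical; the file naming matches the worker above.

```python
import os
import urllib.error
import urllib.request

def image_path(root, breed, index):
    # Hypothetical helper: <root>/<breed>/<breed><index>.jpg, as in the worker above.
    return os.path.join(root, str(breed), str(breed) + str(index) + '.jpg')

def download_one(item, root='cat_breeds'):
    """Download one [index, breed, url] item; return None on success or (url, reason) on failure."""
    index, breed, url = item
    try:
        urllib.request.urlretrieve(url, image_path(root, breed, index))
        return None
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return (url, err.code)
    except urllib.error.URLError as err:
        # Covers failures without an HTTP status, such as DNS or connection errors.
        return (url, str(err.reason))
```

With this shape, the list returned by pool.map() could be filtered for non-None entries to get the failed URLs and their error codes.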

As the process is I/O bound, in that all we are doing is requesting a URL and downloading the stored image, we can increase the number of workers in the pool, since each one spends most of its time waiting on the network. I chose a value of 5x the number of cores available on my machine, again an arbitrary choice that may or may not be the most efficient =).

In [32]:
pool = ThreadPool(processes=20)

We are now ready to download the images to our machine! As before, we start the pool by using the map() method and track the amount of time the process takes to download the images. Any HTTPErrors that arise will also print with the error code.

In [33]:
%%time
pool.map(download_breed_images, breed_list_new)
pool.close()
pool.join()
415
415
Wall time: 6min 37s

The entire process to download just under 45,000 images took about 6 and a half minutes and only had two HTTP errors! The images will be stored in a separate directory cat_breeds with subdirectories containing the respective breed images.

Summary

I hope this post served as a fun and useful introduction to what is possible with the multiprocessing module and the petpy library. Please note that the Petfinder API is a public API with users around the country inputting data, and records are continually added and removed as pets move through the shelter system to adoption, so the results obtained above will likely differ when the process is run at a different time.

The images that were downloaded during this exercise can also be downloaded as a tar.gz file using the following Dropbox link (warning: the file is about 1.5GB).
