Building a Poetry Database in PostgreSQL with Python, poetpy, pandas and SQLAlchemy

The PoetryDB API stores its data in MongoDB, a popular NoSQL database. A NoSQL database is a solid choice for the kind of data PoetryDB stores, such as unstructured text. But what if we wanted a more traditional SQL database built from the PoetryDB API data, for use in other projects where a relational database would be preferred? By extracting the data from the PoetryDB API with a few Python libraries, we can recreate the NoSQL PoetryDB database as a SQL database, which gives us more freedom to create additional data features and avoids hitting the PoetryDB database more than necessary.

Getting Started

In this example, we walk through a sample use case of extracting data from a database through an API and then structuring that data in a cohesive manner that allows us to create a relational database we can query with SQL statements. The database we create with the extracted data will use PostgreSQL.

The Python libraries used in this example are poetpy, a Python wrapper for the PoetryDB API written by yours truly, pandas for transforming and cleansing the data as needed, and SQLAlchemy for handling the SQL side of things. We start by importing the needed libraries as usual.

In [1]:
from poetpy import get_poetry
import pandas as pd
from pandas.io.json import json_normalize  # in pandas >= 1.0, use pd.json_normalize instead
import sqlalchemy
import os

import warnings
warnings.simplefilter('ignore')  # suppress warnings to keep the notebook output clean

Getting the Poetry Data

We can't have a useful database without any data! Before hitting any API, it is often useful to devise a strategy for getting the desired data efficiently, without requesting the API more than needed. According to the PoetryDB API documentation, we can get a list of authors, which we can then iterate over to get each author's poetry and other available information from the database.

We can use the poetpy function get_poetry to return a dictionary object of the available authors in the PoetryDB database.

In [2]:
authors = get_poetry('author')

The returned dictionary contains a list of the available authors, which we can quickly inspect to make sure our API call was successful.

In [3]:
authors['authors'][0:5]
Out[3]:
['Adam Lindsay Gordon',
 'Alan Seeger',
 'Alexander Pope',
 'Algernon Charles Swinburne',
 'Ambrose Bierce']

To store the extracted authors for later export to a flat file such as a CSV and loading into a database, we convert the returned dictionary into a pandas DataFrame using the from_dict method.

In [4]:
authors_df = pd.DataFrame.from_dict(authors)

Now that we have a list of authors to iterate over, we can extract the remaining data from the PoetryDB database! For each author, we extract the titles, content, and line counts of their poetry, normalize the returned JSON into a DataFrame with pandas's handy json_normalize function, and append the resulting data to a list. After every author has been iterated over, the list of results is concatenated into one DataFrame with the pd.concat function. This gives us a complete dataset of all the available information in the PoetryDB API as a pandas DataFrame.

In [5]:
poems = []
for author in authors['authors']:
    author_poems = get_poetry('author', author, 'author,title,lines,linecount')
    author_poems_df = json_normalize(author_poems)
    poems.append(author_poems_df)
    
poems_df = pd.concat(poems)
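
As an aside, this loop makes one request per author, so a long author list means many API calls. If the API proves slow or intermittently unavailable, the same loop can be written a touch more defensively; a minimal sketch (the half-second delay is a courtesy pause of our choosing, not a documented rate limit):

import time

poems = []
for author in authors['authors']:
    try:
        author_poems = get_poetry('author', author, 'author,title,lines,linecount')
    except Exception:
        continue  # skip an author that errors rather than aborting the whole run
    poems.append(json_normalize(author_poems))
    time.sleep(0.5)  # brief pause between requests to be gentle on the API

poems_df = pd.concat(poems)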

The PoetryDB API data is now collected into one DataFrame! We can inspect the first few rows with the head method to see the data that was returned.

In [6]:
poems_df.head()
Out[6]:
author linecount lines title
0 Adam Lindsay Gordon 16 [‘WHERE shall we go for our garlands glad, At ... A Song of Autumn
1 Adam Lindsay Gordon 56 [The ocean heaves around us still, With long a... An Exile's Farewell
0 Alan Seeger 24 [I have a rendezvous with Death, At some dispu... I Have A Rendezvous With Death
1 Alan Seeger 104 [I, , Ay, it is fitting on this holiday,, Comm... Ode in Memory of the American Volunteers Falle...
2 Alan Seeger 83 [In that fair capital where Pleasure, crowned,... Fragments

We see each value in the lines column of the DataFrame is still a list of the strings that make up the particular poem. To flatten each poem into a single string, we can use the apply method to map a lambda function over each row of the column, joining the list elements with a newline separator.

In [7]:
poems_df['lines'] = poems_df['lines'].apply(lambda x: ' \n '.join(x))

Let's inspect the first couple of rows of the lines column to ensure the operation returned what we expect.

In [8]:
poems_df['lines'].head(2)
Out[8]:
0    ‘WHERE shall we go for our garlands glad \n At...
1    The ocean heaves around us still \n With long ...
Name: lines, dtype: object

With the data extracted from the PoetryDB database and transformed into a tabular data structure, we then save the datasets to CSV files using the to_csv method. The exported CSV files will be used to insert the data into our PostgreSQL database.

In [9]:
poems_df.to_csv('../data/poetrydb_copy.csv', index=False, encoding='utf-8')
authors_df.to_csv('../data/poetrydb_authors.csv', index=False, encoding='utf-8')

Building the PostgreSQL database

The installation of PostgreSQL is beyond the scope of this example; however, there is a handy tutorial available that details the steps for installing PostgreSQL on particular operating systems.

During the installation of PostgreSQL, a postgres database is created that we can use for testing our relational database. PostgreSQL differs slightly from some other SQL engines in its use of schemas, namespaces within a database that group tables and other objects; in MySQL, for example, a schema is synonymous with a database. The installation will prompt for a master username and password, which we will use to connect to the postgres database on localhost.

This is the section of the example where SQLAlchemy comes into play. The first step in connecting to a database with SQLAlchemy is to use the create_engine function. According to its documentation, create_engine takes a connection string that details the connection info with the following structure:

terminal
dialect+driver://username:password@host:port/database

Where dialect is a SQL engine such as postgresql, mysql, mssql, and so on. Thus, we use the newly created postgres database along with the username and password specified during the installation of PostgreSQL.
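
For instance, a few illustrative connection URLs (the usernames, passwords, and database names here are placeholders):

# postgresql://user:password@localhost:5432/dbname            (default PostgreSQL driver)
# postgresql+psycopg2://user:password@localhost:5432/dbname   (driver named explicitly)
# mysql+pymysql://user:password@localhost:3306/dbname         (MySQL, for comparison)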

In [10]:
engine = sqlalchemy.create_engine('postgresql://postgres:root@localhost:5432/postgres')

The engine variable is then used to create the connection to the postgres database.

In [11]:
conn = engine.connect()
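
A quick compatibility note: this example passes raw SQL strings directly to conn.execute(), which is the SQLAlchemy 1.x style. On SQLAlchemy 1.4/2.0, textual SQL must be wrapped in the text() construct and transactions managed explicitly; a minimal sketch of the newer style:

from sqlalchemy import text

# SQLAlchemy 1.4/2.0 style: engine.begin() opens a transaction and commits on success
with engine.begin() as connection:
    connection.execute(text('SELECT 1'))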

We can now begin working with the PostgreSQL database and insert our extracted data! The first step is to create a schema that we will use as our local database for testing purposes. Using our database connection, we can execute a CREATE SCHEMA statement against the postgres database.

In [12]:
conn.execute('CREATE SCHEMA poetry')
Out[12]:
<sqlalchemy.engine.result.ResultProxy at 0x10850b128>

The output indicates the operation was successful! The next step is to create the necessary schema tables and load the data we extracted earlier from the PoetryDB API into those tables. There are several approaches to creating the tables and loading the data. One approach, and typically the most general way to load data into PostgreSQL, is to create the tables and then load a flat file such as a CSV into them using the psql command line. Another approach is to use pandas and SQLAlchemy to load the data directly from a DataFrame into a PostgreSQL database or schema.

As the last step before loading the data, let's use our SQLAlchemy connection to create the tables that will store our data. For longer SQL statements, it can be a good idea to write the statement or query as a multi-line string for clarity.

In [13]:
create_author_table = '''
CREATE TABLE poetry.authors
(
    id serial PRIMARY KEY, 
    author VARCHAR(255)
);'''

create_poems_table = '''
CREATE TABLE poetry.poems
(
    id serial PRIMARY KEY,
    author VARCHAR(255),
    linecount INT,
    lines TEXT,
    title VARCHAR(510)
);
'''
    
conn.execute(create_author_table)
conn.execute(create_poems_table)
Out[13]:
<sqlalchemy.engine.result.ResultProxy at 0x10850b518>
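
As an optional sanity check, the information_schema catalog can confirm that both tables now exist in the poetry schema; this should return one row each for authors and poems.

conn.execute("SELECT table_name FROM information_schema.tables "
             "WHERE table_schema = 'poetry'").fetchall()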

Using psql and \copy to load the data

psql is the PostgreSQL interactive terminal and is very useful for working with PostgreSQL databases and schemas. For those with a MySQL background, psql is quite similar to the mysql interactive terminal. The following command launches the psql terminal.

terminal
psql -h localhost -U postgres postgres

Where the -h flag specifies the host to connect to, which in this case is localhost. The -U postgres argument specifies the username postgres to use to connect to the postgres database. For those having issues launching the psql terminal, it is usually because the command is not on the PATH. Here is a good StackOverflow page that explains in detail how to add the psql command to the PATH on Mac.

After launching psql, the \dt command displays the tables and relations in the current schema.

The \copy command inserts data from a flat file such as a CSV into a schema table. The path to the data file generally needs to be absolute. We list the target columns in parentheses to avoid accidentally inserting values into the table's id column, which acts as its PRIMARY KEY.

terminal
\copy poetry.authors(author) FROM '/Users/aaronschlegel/Dropbox/Projects/poetpy/data/poetrydb_authors.csv' DELIMITER ',' CSV HEADER;
terminal
\copy poetry.poems(author, linecount, lines, title) FROM '/Users/aaronschlegel/Dropbox/Projects/poetpy/data/poetrydb_copy.csv' DELIMITER ',' CSV HEADER;

If the \copy is successful, the terminal will output the number of rows that were inserted into the table. We can now perform queries on the tables!
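
As a third option, the same bulk load can be scripted from Python with psycopg2's copy_expert method, which drives the server-side COPY ... FROM STDIN. A sketch, assuming psycopg2 is installed and using the CSV written earlier:

import psycopg2

pg = psycopg2.connect(host='localhost', user='postgres',
                      password='root', dbname='postgres')
# the connection context manager wraps the COPY in a transaction and commits on success
with pg, pg.cursor() as cur, open('../data/poetrydb_copy.csv') as f:
    cur.copy_expert('COPY poetry.poems(author, linecount, lines, title) '
                    'FROM STDIN WITH CSV HEADER', f)
pg.close()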

Using SQLAlchemy and pandas to load the data

The DataFrame class has a handy method, to_sql, for inserting data into a SQL table directly. Since we already created two tables earlier, we will give these new tables a different name to distinguish them. The first argument of to_sql is the name of the SQL table to create, while the second required argument is a SQLAlchemy connectable, such as the connection we created previously. To get a sequential id column inserted into the SQL table at the same time, we also name the DataFrame's index column id with the optional index_label argument. We must also set the optional schema argument to the poetry schema we created earlier (since we are working with PostgreSQL); otherwise, the tables will be created in the default public schema.

In [14]:
poems_df.to_sql('poems_df', conn, schema='poetry', index_label='id')
authors_df.to_sql('authors_df', conn, schema='poetry', index_label='id')

No errors or warnings were issued, so the data insertion should have been successful! In the next section, we perform some sample queries on the newly created SQL tables to ensure the data is what we expect.
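
Before moving on, it is worth noting that to_sql accepts a few optional parameters that often come in handy; for example, if_exists controls what happens when the target table already exists, and chunksize batches the inserts (the values below are illustrative):

poems_df.to_sql('poems_df', conn, schema='poetry', index_label='id',
                if_exists='replace',  # drop and recreate the table if it already exists
                chunksize=500)        # write the rows in batches of 500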

Example Queries

Perhaps unsurprisingly, there are multiple ways to query our PostgreSQL schema tables. The first is to use the .execute() method of our database connection variable conn, which we created earlier in the example. Let's say we are interested in finding the first three authors from the authors table in our poetry schema. The SQL query can be written as:

SELECT author FROM poetry.authors LIMIT 3

The above query can be passed to the .execute() method as a string to query the database. The .fetchall() method is chained to the end of .execute() to extract all the results.

In [15]:
conn.execute('SELECT author FROM poetry.authors LIMIT 3').fetchall()
Out[15]:
[('Adam Lindsay Gordon',), ('Alan Seeger',), ('Alexander Pope',)]

We see the statement returned the first three authors as a list of tuples, as expected! More information on using SQL queries with SQLAlchemy can be found in SQLAlchemy's tutorial.
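
Query parameters can also be bound rather than interpolated into the query string, which is safer when values come from user input. With SQLAlchemy's text() construct, a parameterized query looks something like this on the 1.x API used here:

from sqlalchemy import text

query = text('SELECT title FROM poetry.poems WHERE author = :name LIMIT 3')
conn.execute(query, name='Alan Seeger').fetchall()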

Another very useful way to query a database is with the pandas function read_sql_query (the related function read_sql returns the same results). The required arguments are the query and a database connection. The benefit of read_sql_query is that the results are pulled directly into a pandas DataFrame.

In [16]:
pd.read_sql_query('SELECT authors FROM poetry.authors_df LIMIT 3', conn)
Out[16]:
authors
0 Adam Lindsay Gordon
1 Alan Seeger
2 Alexander Pope
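
Relatedly, when the goal is simply to pull an entire table into a DataFrame, the pandas function read_sql_table does so without writing any SQL:

pd.read_sql_table('authors_df', conn, schema='poetry').head(3)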

Conclusion

In this example, we walked through a sample use case of extracting data from a database through an API and then using that data to create a relational database we can use for further analysis, without worrying about hitting the API database more than needed. In future examples, we will enhance the extracted data with more information collected from different sources and with feature engineering on the data already available.

About PoetryDB

PoetryDB was created and is currently maintained by @thundercomb. They blog about poetry, related technology, and other topics at thecombedthunderclap.blogspot.com.
