The PoetryDB API stores its data in MongoDB, a popular NoSQL database. Indeed, a NoSQL database is a solid choice for the type of data stored in PoetryDB (unstructured text, for example). However, what if we wanted to create a more traditional SQL database with the PoetryDB API data for use in other projects where a relational database would be preferred? By extracting the data from the PoetryDB API with a combination of a few Python libraries, we can recreate the NoSQL PoetryDB database as a SQL database, which gives us more freedom to create additional data features and avoids the need to hit the PoetryDB database more than necessary.
Getting Started¶
In this example, we walk through a sample use case of extracting data from a database using an API and then structuring that data in a cohesive manner that allows us to create a relational database that we can then query with SQL statements. The database we create with the extracted data will use PostgreSQL.
The Python libraries that will be used in this example are poetpy, a Python wrapper for the PoetryDB API written by yours truly, pandas for transforming and cleansing the data as needed, and sqlalchemy for handling the SQL side of things. We start by importing the needed libraries as per usual.
from poetpy import get_poetry
import pandas as pd
from pandas.io.json import json_normalize
import sqlalchemy
import os
import warnings
warnings.simplefilter('ignore')
Getting the Poetry Data¶
We can't have a useful database without any data! Before beginning to hit any API, it is often useful to devise a strategy for getting the wanted data in an efficient manner that avoids requesting the API more than needed. According to the PoetryDB API documentation, we can get a list of authors which we can then use to iterate over to get each author's poetry and other available information from the database.
We can use the poetpy function get_poetry to return a dictionary object of the available authors in the PoetryDB database.
authors = get_poetry('author')
The returned dictionary contains a list of the available authors, which we can quickly inspect to make sure our API call was successful.
authors['authors'][0:5]
To store the extracted authors for later exporting into a flat file such as a CSV and loading into a database, we convert the returned dictionary into a pandas DataFrame using the from_dict method.
authors_df = pd.DataFrame.from_dict(authors)
Now that we have a list of authors to iterate over, we can extract the remaining data from the PoetryDB database! For each of the authors in the database, we extract the titles, content, and linecounts of their poetry, normalize the returned JSON into a DataFrame with pandas's handy json_normalize function, and append the resulting data to a list. After each author in the list has been iterated over, the list of appended results is concatenated into one DataFrame with the pd.concat function. This operation gives us a complete dataset of all the available information in the PoetryDB API as a pandas DataFrame.
poems = []
for author in authors['authors']:
    # Request each author's poems with only the fields we need.
    author_poems = get_poetry('author', author, 'author,title,lines,linecount')
    author_poems_df = json_normalize(author_poems)
    poems.append(author_poems_df)

poems_df = pd.concat(poems)
The PoetryDB API data is now collected into one DataFrame! We can inspect the first few rows of the DataFrame with the head method to see the resulting data that was returned.
poems_df.head()
We see each value in the lines column of the DataFrame is still a list of strings that comprise the particular poem. To edit the lines column and extract the poetry lines, we can use the apply method to apply a lambda function over each row in the DataFrame that joins each string in the list into one string.
poems_df['lines'] = poems_df['lines'].apply(lambda x: ' \n '.join(x))
Let's inspect the first couple of rows of the lines column to ensure the operation returned what we expect.
poems_df['lines'].head(2)
With the data extracted from the PoetryDB database and transformed into a tabular data structure, we save the datasets to CSV files using the to_csv method. The exported CSV files will be used to insert the data into our PostgreSQL database.
poems_df.to_csv('../data/poetrydb_copy.csv', index=False, encoding='utf-8')
authors_df.to_csv('../data/poetrydb_authors.csv', index=False, encoding='utf-8')
Building the PostgreSQL database¶
The installation of PostgreSQL is beyond the scope of this example; however, there is a handy tutorial available that details the steps for installing PostgreSQL on particular operating systems.
During the installation of PostgreSQL, a postgres database is created that we can use for testing our relational database. PostgreSQL works slightly differently than other SQL engines in that it employs the concept of schemas for managing data, which in other types of SQL would be a database. The installation will prompt the user to create a master username and password, which we will use to connect to the localhost postgres database.
This is the section of the example where SQLAlchemy comes into play. The first step in connecting to a database with SQLAlchemy is to employ the create_engine function. According to the function's documentation, create_engine takes a string parameter that details the connection info based on the following structure:
dialect+driver://username:password@host:port/database
Where dialect is a SQL engine such as postgresql, mysql, mssql, and so on. Thus, we use the newly created postgres database along with the username and password specified during the installation of PostgreSQL.
engine = sqlalchemy.create_engine('postgresql://postgres:root@localhost:5432/postgres')
The engine variable is then used to create the connection to the postgres database.
conn = engine.connect()
We can now begin working with the PostgreSQL database and insert our extracted data! The first step is to create a schema which we will use as our local database for testing purposes. Using our database connection, we can send a query to the postgres database to create a schema using the CREATE SCHEMA statement.
conn.execute('CREATE SCHEMA poetry')
The output indicates the operation was successful! The next step is to create the necessary schema tables and load the data we extracted earlier from the PoetryDB API into those tables. There are several approaches to creating the tables and loading the data into them. One approach, and typically the most general way to load data into PostgreSQL, is to create the tables and then load a flat file such as a CSV into the tables using the psql command line. Another approach is to use pandas and SQLAlchemy to load the data directly from a DataFrame into a PostgreSQL database or schema.
As the last step before loading the data, let's use our SQLAlchemy connection to the database to create the tables that we will use to store our data. For more extended SQL statements, it can be a good idea to write out the statement or query as a multi-line string for clarity.
create_author_table = '''
CREATE TABLE poetry.authors
(
id serial PRIMARY KEY,
author VARCHAR(255)
);'''
create_poems_table = '''
CREATE TABLE poetry.poems
(
id serial PRIMARY KEY,
author VARCHAR(255),
linecount INT,
lines TEXT,
title VARCHAR(510)
);
'''
conn.execute(create_author_table)
conn.execute(create_poems_table)
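As an optional check before loading any data, we can use the same connection to confirm both tables now exist in the poetry schema. This is only a sketch that queries PostgreSQL's standard information_schema catalog; it is not required for the rest of the example.
conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'poetry'").fetchall()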
Using psql and \copy to load the data¶
psql is a PostgreSQL interactive terminal and is very useful for working with PostgreSQL databases and schemas. For those with a MySQL background, psql is quite similar to the mysql interactive terminal. The following is used to launch the psql terminal.
psql -h localhost -U postgres postgres
Where the -h flag specifies the host to connect to, which in this case is localhost. The -U postgres argument specifies the username postgres to use to connect to the postgres database. For those having issues launching the psql terminal, it is usually due to the command not being set in the PATH. Here is a good StackOverflow page that explains in detail how to add the psql command to the PATH on Mac.
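As an illustration only, and assuming PostgreSQL was installed through the Postgres.app distribution on macOS, adding a line such as the following to ~/.bash_profile (or ~/.zshrc) puts the bundled psql binary on the PATH; the exact directory depends on how and where PostgreSQL was installed.
export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"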
After launching psql, using the \dt command will display the current tables and relations in the current schema.
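Since the tables in this example live in the poetry schema rather than on the default search path, \dt also accepts an optional pattern to scope the listing; for example, the following should list only the tables created above.
\dt poetry.*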
The \copy command is used to insert data from a standard flat file such as a CSV into a schema table. The path to the data file to load into the table generally needs to be the absolute path. We specify the columns to insert the data into within the parentheses to avoid accidentally inserting rows into the schema table's id column, which acts as its PRIMARY KEY.
\copy poetry.authors(author) FROM '/Users/aaronschlegel/Dropbox/Projects/poetpy/data/poetrydb_authors.csv' DELIMITER ',' CSV HEADER;
\copy poetry.poems(author, linecount, lines, title) FROM '/Users/aaronschlegel/Dropbox/Projects/poetpy/data/poetrydb_copy.csv' DELIMITER ',' CSV HEADER;
If the \copy command is successful, the terminal will output the number of rows that were inserted into the table. We can now perform queries on the tables!
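As an optional sanity check, a quick row count through the SQLAlchemy connection we opened earlier confirms the tables are populated; this is only a sketch, and the counts will depend on the current contents of the PoetryDB API.
conn.execute('SELECT COUNT(*) FROM poetry.authors').fetchall()
conn.execute('SELECT COUNT(*) FROM poetry.poems').fetchall()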
Using SQLAlchemy and pandas to load the data¶
The DataFrame class has a handy method, to_sql, for inserting data into a SQL table directly. As we already created two tables earlier, we will give these a different name to identify the new tables created from the DataFrames. The name of the SQL table to insert into is designated by the first argument of the to_sql method, while the second required argument is a database connection, just like the one we created previously with SQLAlchemy! To get a sequential id column inserted into the SQL table simultaneously, we also specify that the DataFrame index column is named id with the optional index_label argument. We also want to be sure to set the optional schema argument to the poetry schema (since we are working with PostgreSQL) that we created earlier in the example. Otherwise, the tables will be created in the default public schema.
poems_df.to_sql('poems_df', conn, schema='poetry', index_label='id')
authors_df.to_sql('authors_df', conn, schema='poetry', index_label='id')
There were no errors or warnings issued; therefore, the data insertion should have been successful! In the next section, we perform some sample queries on the newly created SQL tables to ensure the data is what we expect.
Example Queries¶
Perhaps unsurprisingly, there are multiple ways to query our PostgreSQL schema tables. The first is to use the .execute() method of our database connection variable conn, which we created earlier in the example. Let's say we are interested in finding the first 3 authors from the authors table in our poetry schema. The SQL query can be written as:
SELECT author FROM poetry.authors LIMIT 3
The above query can be passed as a string argument to the .execute() method to query the database. The .fetchall() method is chained to the end of .execute() to extract all the results.
conn.execute('SELECT author FROM poetry.authors LIMIT 3').fetchall()
We see the statement returned the first three authors as a list of tuples as expected! More information on using SQL queries with SQLAlchemy can be found in SQLAlchemy's tutorial.
Another method for querying a database that can be very useful is the pandas function read_sql_query. The function read_sql can also be used to return the same results. The required arguments for the function are the query and a database connection. The benefit of using the read_sql_query function is that the results are pulled directly into a pandas DataFrame.
pd.read_sql_query('SELECT authors FROM poetry.authors_df LIMIT 3', conn)
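For completeness, a minimal sketch of the equivalent call with read_sql, which dispatches to read_sql_query when given a query rather than a table name:
pd.read_sql('SELECT authors FROM poetry.authors_df LIMIT 3', conn)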
Conclusion¶
In this example, we walked through a sample use case of extracting data from a database through an API and then using that data to create a relational database for further analysis, without worrying about hitting the API database more than needed. In further examples, we will enhance the data we extracted from the API with more information collected from different sources and with feature engineering on the already available data.
About PoetryDB¶
PoetryDB was created and is currently maintained by @thundercomb. They blog about poetry, related technology, and other topics at thecombedthunderclap.blogspot.com.