The PoetryDB API stores its data in MongoDB, a popular NoSQL database. Indeed, a NoSQL database is a solid choice for the type of data that is stored in PoetryDB (unstructured text, for example). However, what if we wanted to create a more traditional SQL database with the PoetryDB API data for use in other projects where a relational database would be preferred? By extracting the data from the PoetryDB API using a combination of a few Python libraries, we can recreate the NoSQL PoetryDB database as a SQL database which will allow us more freedom to create additional data features and avoid the need to hit the PoetryDB database more than necessary.
In this example, we walk through a sample use case of extracting data from a database using an API and then structuring that data in a cohesive manner that allows us to create a relational database that we can then query with SQL statements. The database we will create with the extracted data will use Postgresql.
The Python libraries that will be used in this example are poetpy, a Python wrapper for the PoetryDB API written by yours truly, pandas for transforming and cleansing the data as needed, and sqlalchemy for handling the SQL side of things. We start by importing the needed libraries as per usual.
from poetpy import get_poetry import pandas as pd from pandas.io.json import json_normalize import sqlalchemy import os import warnings warnings.simplefilter('ignore')
We can't have a useful database without any data! Before beginning to hit any API, it is often useful to devise a strategy for getting the wanted data in an efficient manner that avoids requesting the API more than needed. According to the PoetryDB API documentation, we can get a list of authors which we can then use to iterate over to get each author's poetry and other available information from the database.
We can use the
get_poetry to return a dictionary object of the available authors in the PoetryDB database.
authors = get_poetry('author')
The returned dictionary contains a list of the available authors, which we can quickly inspect to make sure our API call was successful.
['Adam Lindsay Gordon', 'Alan Seeger', 'Alexander Pope', 'Algernon Charles Swinburne', 'Ambrose Bierce']
authors_df = pd.DataFrame.from_dict(authors)
Now that we have a list of authors to iterate over, we can extract the remaining data from the PoetryDB database! For each of the authors in the database, we extract the titles, content, and linecounts of their poetry, normalize the returned JSON into a
json_normalize function and append the resulting data to a list. After each author in the list has been iterated over, the list with the appended results are then concatenated into one
DataFrame with the
pd.concat function. This operation will give us a complete dataset of all the available information in the PoetryDB API as a pandas
poems =  for author in authors['authors']: author_poems = get_poetry('author', author, 'author,title,lines,linecount') author_poems_df = json_normalize(author_poems) poems.append(author_poems_df) poems_df = pd.concat(poems)
The PoetryDB API data is now collected into one
DataFrame! We can inspect the first few rows of the
DataFrame to see the resulting data that was returned with the
|0||Adam Lindsay Gordon||16||[‘WHERE shall we go for our garlands glad, At ...||A Song of Autumn|
|1||Adam Lindsay Gordon||56||[The ocean heaves around us still, With long a...||An Exile's Farewell|
|0||Alan Seeger||24||[I have a rendezvous with Death, At some dispu...||I Have A Rendezvous With Death|
|1||Alan Seeger||104||[I, , Ay, it is fitting on this holiday,, Comm...||Ode in Memory of the American Volunteers Falle...|
|2||Alan Seeger||83||[In that fair capital where Pleasure, crowned,...||Fragments|
We see each value in the
lines column of the
DataFrame is still a list of strings that comprise the particular poem. To edit the
lines column to extract the poetry lines, we can use the
apply method to apply a
lambda function over each row in the
DataFrame to join each string in the list as one string.
poems_df['lines'] = poems_df['lines'].apply(lambda x: ' \n '.join(x))
Let's inspect the first couple rows of the
lines column to ensure the operation returned what we expect.
0 ‘WHERE shall we go for our garlands glad \n At... 1 The ocean heaves around us still \n With long ... Name: lines, dtype: object
With the data extracted from the PoetryDB database and transformed into a tabular data structure, we then save the datasets into a csv file using the
to_csv method. The exported csv files will be used to insert the data into our Postgresql database.
poems_df.to_csv('../data/poetrydb_copy.csv', index=False, encoding='utf-8') authors_df.to_csv('../data/poetrydb_authors.csv', index=False, encoding='utf-8')
The installation of Postgresql is beyond the scope of this example; however, there is a handy tutorial available which details the steps for installing Postgresql for particular operating systems.
During the installation of Postgresql, a
postgres database is created that we can use for testing our relational database. Postgresql works slightly different than other SQL engines in that it employs the concept of schemas for managing data, which in other types of SQL would be a database. The installation will prompt the user to create a master username and password, which we will use to connect to the localhost
This is the section of the example where SQLAlchemy comes into play. The first step in connecting to a database with SQLAlchemy is to employ the
create_engine function. According to the function's documentation, the
create_engine function takes a string parameter that details the connection info based on the following structure:
dialect is a SQL engine such as
mssql, and so on. Thus, we use the newly created
postgres database along with the username and password specified during the installation of
engine = sqlalchemy.create_engine('postgresql://postgres:root@localhost:5432/postgres')
engine variable is then used to create the connection to the
conn = engine.connect()
We can now begin working with the
Postgresql database and insert our extracted data! The first step is to create a schema which we will use as our local database for testing purposes. Using our database connection, we can send a query to the
postgres database to create a schema using the
CREATE SCHEMA statement.
conn.execute('CREATE SCHEMA poetry')
<sqlalchemy.engine.result.ResultProxy at 0x10850b128>
The output indicates the operation was successful! The next step is to create the necessary schema tables and load the data we extracted earlier from the PoetryDB API into those tables. There are several approaches to creating the tables and loading the data into those tables. One approach and typically the most general way to load data into Postgresql is to create the tables and then load a flat file such as a csv into the tables using the
psql command line. Another approach is using
SQLAlchemy to load the data directly from a
DataFrame into a Postgresql database or schema.
As the last step before loading the data, let's use our
SQLAlchemy connection to the database to create the tables that we will use to store our data. For more extended SQL statements, it can be a good idea to write out the statement or query as a multi-line string for clarity.
create_author_table = ''' CREATE TABLE poetry.authors ( id serial PRIMARY KEY, author VARCHAR(255) );''' create_poems_table = ''' CREATE TABLE poetry.poems ( id serial PRIMARY KEY, author VARCHAR(255), linecount INT, lines TEXT, title VARCHAR(510) ); ''' conn.execute(create_author_table) conn.execute(create_poems_table)
<sqlalchemy.engine.result.ResultProxy at 0x10850b518>
psql is a Postgresql interactive terminal and is very useful for working with Postgresql databases and schemas. For those with a MySQL background,
psql is quite similar to the
mysql interactive terminal. The following is used to launch the
terminal psql -h localhost -U postgres postgres
-h flag specifies the host to connect, which in this case is
-U postgres argument specifies the username
postgres to use to connect to the
postgres database. For those having issues launching the
psql terminal, it is usually due to the command not being set in the PATH. Here is a good StackOverflow page that explains in detail how to add the
psql command to the PATH on Mac.
psql, using the
\dt command will display the current tables and relations in the current schema.
\copy command is used to insert data from a standard flat-file such as a csv into a schema table. The path to the data file to load into the table generally needs to be the absolute path. We specify the columns to insert the data within the parentheses to avoid accidentally inserting the rows into the schema table's
id column, which acts as its
terminal \copy poetry.authors(author) FROM '/Users/aaronschlegel/Dropbox/Projects/poetpy/data/poetrydb_authors.csv' DELIMITER ',' CSV HEADER;
terminal \copy poetry.poems(author, linecount, lines, title) FROM '/Users/aaronschlegel/Dropbox/Projects/poetpy/data/poetrydb_copy.csv' DELIMITER ',' CSV HEADER;
\copy is successful, the terminal will output the number of rows that were inserted into the table. We can now perform queries on the tables!
DataFrame class has a handy method,
to_sql for inserting data into a SQL table directly. As we already created two tables earlier, we will give these a different name to identify the new tables created from the
DataFrame. The name of the SQL table to insert into is designated by the first argument of the
to_sql method, while the second required argument is a database connection string, just like the one we created previously with SQLAlchemy! To get a sequential
id column inserted into the SQL table simultaneously, we will also specify that the
DataFrame index column is named
id with the optional
index_label argument. We also want to be sure to set the
schema optional argument to the
poetry schema (since we are working with Postgresql) that we created earlier in the example. Otherwise, the tables will be created in the default
poems_df.to_sql('poems_df', conn, schema='poetry', index_label='id') authors_df.to_sql('authors_df', conn, schema='poetry', index_label='id')
There were no errors or warnings issued. Therefore the data insertion should have been successful! In the next section, we perform some sample queries on the newly created SQL tables to ensure the data is what we expect.
Perhaps unsurprisingly, there are multiple ways to query our Postgresql schema tables. The first is to use the
.execute() method from our database connection variable
conn, which we created earlier in the example. Let's say we are interested in finding the first 3 authors from the
author table in our
poetry schema. The SQL query can be written as:
SELECT author FROM poetry.authors LIMIT 3
The above query can be passed as an argument to the
.execute() method as a string to query the database. The
.fetchall() is chained to the end of the
.execute() method to extract all the results.
conn.execute('SELECT author FROM poetry.authors LIMIT 3').fetchall()
[('Adam Lindsay Gordon',), ('Alan Seeger',), ('Alexander Pope',)]
We see the statement returned the first three authors as a list of tuples as expected! More information on using SQL queries with SQLAlchemy can be found in SQLAlchemy's tutorial.
Another method for querying a database that can be very useful is to use the
read_sql_query. The function
read_sql can also be used to return the same results. The required arguments for the function are the query and a connection string. The benefit of using the
read_sql_query function is the results are pulled directly into a pandas
pd.read_sql_query('SELECT authors FROM poetry.authors_df LIMIT 3', conn)
|0||Adam Lindsay Gordon|
In this example, we displayed a sample use case for extracting data from a database through an API and then using that data to create the database that we can use for further analysis and more without worrying about hitting the API database more than needed. In further examples, we will enhance the data we extracted from the API with more information collected from different sources and feature engineering on the already available data.