The PoetryDB API stores its data in MongoDB, a popular NoSQL database. A NoSQL database is a solid choice for the kind of data PoetryDB holds, such as unstructured text. But what if we wanted a more traditional SQL database of the PoetryDB data, for use in other projects where a relational database would be preferred? By extracting the data from the PoetryDB API with a few Python libraries, we can recreate the NoSQL PoetryDB database as a SQL database, which gives us more freedom to create additional data features and avoids querying the PoetryDB API more than necessary.
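As a rough illustration of the idea, the sketch below loads PoetryDB-style JSON records into a small relational schema using only the Python standard library. It is a minimal sketch under assumptions, not the approach taken in the post itself: the table names `poems` and `poem_lines`, the helper names `fetch_poems` and `store_poems`, and the choice of SQLite are all hypothetical, and the records are assumed to carry `author`, `title`, and `lines` fields.

```python
import json
import sqlite3
from urllib.parse import quote
from urllib.request import urlopen


def fetch_poems(author):
    """Fetch one author's poems from the PoetryDB API (makes a network call)."""
    url = "https://poetrydb.org/author/" + quote(author)
    with urlopen(url) as resp:
        data = json.load(resp)
    # Assumption: a successful response is a JSON list; anything else is treated as "no poems".
    return data if isinstance(data, list) else []


def store_poems(conn, poems):
    """Insert poem records into two related tables, skipping poems already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS poems ("
        "id INTEGER PRIMARY KEY, author TEXT NOT NULL, title TEXT NOT NULL, "
        "linecount INTEGER, UNIQUE (author, title))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS poem_lines ("
        "poem_id INTEGER REFERENCES poems(id), line_no INTEGER, text TEXT, "
        "PRIMARY KEY (poem_id, line_no))"
    )
    for poem in poems:
        cur = conn.execute(
            "INSERT OR IGNORE INTO poems (author, title, linecount) VALUES (?, ?, ?)",
            (poem["author"], poem["title"], len(poem["lines"])),
        )
        if cur.rowcount == 0:
            continue  # this (author, title) pair is already in the database
        poem_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO poem_lines (poem_id, line_no, text) VALUES (?, ?, ?)",
            [(poem_id, n, text) for n, text in enumerate(poem["lines"], start=1)],
        )
    conn.commit()
```

Calling `store_poems(sqlite3.connect("poetry.db"), fetch_poems("Emily Dickinson"))` would populate a local file once, after which later analysis can run against SQLite instead of the API. Splitting lines into their own table is one possible design; it makes per-line queries straightforward at the cost of a join.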
R and SQL complement each other well for data analysis because of their respective strengths. The sqldf package provides an interface for working with SQL in R by running SQL statements and returning the results as an R data.frame. This post will demonstrate how to query and analyze data using the sqldf package together with the graphing libraries plotly and ggplot2, as well as some other packages that provide useful statistical tests and other functions.
The consumer complaints database provided by the Bureau of Consumer Financial Protection can be downloaded as a 190 MB CSV file.
Although the CSV file is not large compared with other available datasets, which can exceed many gigabytes in size, it still provides good motivation for aggregating the data with SQL and outputting the result into a pandas DataFrame. This can all be done conveniently with pandas's IO tools.
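One way this workflow might look is sketched below: the CSV is read in chunks, appended to a SQLite table, and then aggregated with a SQL query whose result comes back as a DataFrame. This is a hedged sketch rather than the post's actual code. The function name `aggregate_csv`, the table name `complaints`, the chunk size, and the assumption that the file has a `Product` column are all illustrative.

```python
import sqlite3

import pandas as pd


def aggregate_csv(csv_source, conn, chunksize=50_000):
    """Load a CSV into SQLite in chunks, then count rows per product with SQL."""
    # Reading in chunks keeps memory use bounded regardless of file size.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        chunk.to_sql("complaints", conn, if_exists="append", index=False)

    # Assumption: the CSV has a column named "Product".
    query = """
        SELECT Product AS product, COUNT(*) AS n_complaints
        FROM complaints
        GROUP BY Product
        ORDER BY n_complaints DESC, product
    """
    # The aggregated result is small, so it fits comfortably in a DataFrame.
    return pd.read_sql_query(query, conn)
```

With a local copy of the file, a call such as `aggregate_csv("complaints.csv", sqlite3.connect("complaints.db"))` would return one row per product. Only the aggregated result is held in memory as a DataFrame, which is the main appeal of pushing the grouping into SQL.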