Working with SQL in Python – Digital product development agency

When you are starting off in Machine Learning you will play around with a lot of static datasets. This is normally in the form of an CSV document and sometimes more complex setups for computer vision problems.

While this is great, at some point you will need to communicate with relational databases. Having a solid foundation in databases can be indispensable in a data scientist or machine learning engineer role. You will often be working with data points spread across many systems and databases. Creating a database that brings together all your datapoint into one system that cleans the data as you expect can make your problems much easier to solve.

We will dive deeper into merging datasets in another post. Realistically you still start with a simple existing open database. We can find one we will look at here on Kaggle.


One of the most common ways in Python to talk with relational databases is using a library called SQLAlchemy. This tool is amazing in its coverage. The main databases you will run into are SQLite, Postgresql, MySQL, and MS-SQL, to name some of the many supported databases. With our local SQLite3 server running, let’s take a look at how you would go about making a query and loading data into a pandas DataFrame.

# Import packages
from sqlalchemy import create_engine
import pandas as pd

As with all our python scrips, we first need to import the packages we need. Luckily for our simple example, we just need a sub package from SQLAlchemy and pandas. Connecting to an SQL database is pretty simple—we just need to pass in the address for the SQLite server. We still need to connect to the database before we can talk to it. Make sure to make a note with SQLAlchemy that you need to explicitly connect and disconnect from your server.

engine = create_engine('sqlite:///soccer.sqlite')
con = engine.connect()

With our connection made we can run SQL queries simply by running the .execute() method on the connection object created in the last step. Here comes an important part of working with SQL in Python. SQL commands are pure strings. If you come from a Ruby background like myself you will be sad to hear we don’t get methods to easily pull data with SQL like you find in Rails. This is for the best as you will benefit from an understanding strong understanding of programming SQL.

rs = con.execute('SELECT * FROM Country')
df = pd.DataFrame(rs.fetchall())

Once we have our results from the executed query we can load these into a Pandas data frame simply by passing the results into a data frame initialization method. To get all the data from the results we will need to use .fetchall() method.

# Be kind to your Database Admins and Close your connections when finished.

# Print first 5 rows of DataFrame

Finally, we need to make sure to close the connection once we are done. You can keep this open if you have some more work to do. However, we can run these commands in a more idiomatic way.

with engine.connect() as con:
    rs = con.execute("SELECT player_api_id, birthday FROM Player”)
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

You will be used to this way of working with a database if you have experience working with files or TensorFlow v1 sessions. The beautiful thing about working with the database this way is you have made a very readable way of working inside a database connection with no fears of leaving connections opened by mistake.

You’ll notice with this last command that we didn’t fetch all results. We capped the results to three by using the .fetchmany() method.

Kaggle’s SQL Lessons.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button

Adblock Detected

Please consider supporting us by disabling your ad blocker