Creating a SQLite3 Schema Using Spotify Dataset
Introduction
This blog builds a SQLite3 schema to store the Spotify Songs dataset. The schema should follow normalization rules to reduce data redundancy and prevent inconsistencies during insertion, update and deletion. Normalization is a design technique that divides large tables into smaller ones and connects them through relationships. There are many database normal forms; from least to most normalized they are:
- UNF: Unnormalized form
- 1NF: First normal form
- 2NF: Second normal form
- 3NF: Third normal form
- EKNF: Elementary key normal form
- BCNF: Boyce-Codd normal form
- 4NF: Fourth normal form
- ETNF: Essential tuple normal form
- 5NF: Fifth normal form
- DKNF: Domain-key normal form
- 6NF: Sixth normal form
The main normal forms applied in this blog are 1NF, 2NF and 3NF. They are defined as
- First normal form
- Table has a primary key
- No repeating groups of columns
- Each cell contains a single value
- Second normal form
- All columns in each row depend fully on candidate keys
- Third normal form
- No transitive dependencies between non-candidate columns
According to these normalization rules as well as some personal choices, the Spotify dataset can be separated into 6 small tables and stored more logically. These tables can form a SQLite3 database named spotify.db, using which the names of all playlists that contain instrumentals can be obtained.
Spotify Dataset
The Spotify dataset used in this blog comes from the spotifyr package authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. This package collects information about songs through Spotify’s API. Following the example code, the dataset can be loaded in Python as follows:
import pandas as pd
import numpy as np
spotify_songs = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
spotify_songs = spotify_songs.drop_duplicates()
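As a quick illustration of what the drop_duplicates call above removes, the following toy frame (hypothetical values, not real dataset rows) shows that only fully identical rows are dropped:

```python
import pandas as pd

# Hypothetical rows: the first two are exact duplicates, the third differs.
toy = pd.DataFrame({
    'track_id':    ['a1', 'a1', 'a1'],
    'playlist_id': ['p1', 'p1', 'p2'],
})
deduped = toy.drop_duplicates()
print(deduped.shape)  # (2, 2): only the exact-duplicate row is removed
```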
The codebook of this dataset can be read here. It has 32833 rows and 23 columns, which makes the raw table tedious to analyze as a single unit. Thus a detailed dissection is necessary.
Data Cutting
Analysis
1NF
To start with, let’s go through the first normal form rule. As can be seen from the data, the requirements of no repeating column groups and single-valued cells are automatically satisfied. However, the primary key condition is not met. Although track_id seems to be the primary key, it contains repeated values and is therefore not eligible. We can verify this by building a generalized checking function.
def check(table, primary_key):
    """
    This function checks the validity of a primary key.
    """
    if np.any(table[primary_key].isnull()):
        print('Mistake: the primary key of {} contains NULL value!'.format(table.name))
    if table[primary_key].drop_duplicates().shape[0] != table.shape[0]:
        print('Mistake: the primary key of {} contains REPETITIVE value!'.format(table.name))
Then the following code shows that repeated values exist in track_id, which may result from the playlist variables. Since a song can appear in multiple playlists, and a playlist can include multiple songs, these two parts should be separated into different data frames.
>>> spotify_songs.name = 'spotify_songs'
>>> check(spotify_songs, 'track_id')
Mistake: the primary key of spotify_songs contains REPETITIVE value!
We choose track_id rather than track_name as the primary key. Although it does not happen in our case, two songs may share exactly the same name, because names are not protected by copyright. Thus track_id, the unique ID of each song, should serve as its identity. The same logic extends to the other tables.
Now the Spotify dataset can be shown as below.
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name , track_artist ,track_popularity , track_album_id ,track_album_name ,track_album_release_date , danceability ,energy , key , loudness , mode , speechiness ,acousticness , instrumentalness , liveness ,valence , tempo , duration_ms | track_id |
playlist | playlist_id , playlist_name , playlist_genre ,playlist_subgenre | playlist_id |
If we continue to employ the check function to verify playlist_id as a primary key of table playlist, we will see that it contains repetitive values as well. This is because a single playlist can contain multiple genres, and a genre can exist in several playlists. Therefore, the playlist dataset should be divided once more. The final result based on the first normal form is
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name , track_artist ,track_popularity , track_album_id ,track_album_name ,track_album_release_date ,danceability , energy , key , loudness ,mode , speechiness , acousticness ,instrumentalness , liveness , valence ,tempo , duration_ms | track_id |
playlist | playlist_id , playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
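To confirm that the split removes the repetition, the check function defined above can be applied to miniature versions of the new tables. The rows below are hypothetical stand-ins, not real dataset values:

```python
import numpy as np
import pandas as pd

def check(table, primary_key):
    """Check the validity of a primary key (same logic as above)."""
    if np.any(table[primary_key].isnull()):
        print('Mistake: the primary key of {} contains NULL value!'.format(table.name))
    if table[primary_key].drop_duplicates().shape[0] != table.shape[0]:
        print('Mistake: the primary key of {} contains REPETITIVE value!'.format(table.name))

# Hypothetical miniatures of the split tables
playlist = pd.DataFrame({'playlist_id': ['p1', 'p2'],
                         'playlist_name': ['Lush Lofi', 'Deep Focus']})
playlist.name = 'playlist'
genre = pd.DataFrame({'playlist_subgenre': ['lo-fi', 'album rock'],
                      'playlist_genre': ['r&b', 'rock']})
genre.name = 'genre'

check(playlist, 'playlist_id')     # prints nothing: the key is valid
check(genre, 'playlist_subgenre')  # prints nothing: the key is valid
```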
3NF
Then we turn to the third normal form, which requires no transitive dependencies between non-candidate columns. A transitive dependency means that, given three columns A, B and C in a table, if A depends on B and B depends on C, then A depends on C through B. This indirect dependency needs to be ruled out during our dataset construction. In table song, track_album_name and track_album_release_date depend merely on track_album_id, and since track_album_id depends on the primary key track_id, track_album_name and track_album_release_date transitively depend on track_id as well. This clearly violates the third normal form, so table song should be divided further. Besides, a new table should be introduced to record the relationships between all of the tables. The final result is shown below.
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name ,track_artist , track_popularity ,danceability , energy , key ,loudness , mode , speechiness ,acousticness , instrumentalness ,liveness , valence , tempo ,duration_ms | track_id |
album | track_album_id , track_album_name ,track_album_release_date | track_album_id |
playlist | playlist_id ,playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
connection | track_id , playlist_id ,playlist_subgenre , track_album_id | track_id ,playlist_id ,playlist_subgenre |
Here the primary keys of tables song, album, playlist and genre are put into one dataset with a composite primary key of track_id, playlist_id and playlist_subgenre. A composite key rather than a single key is chosen because these variables are independent of one another: a playlist can contain many songs as well as many subgenres; a subgenre can belong to many playlists and songs; and a song can occur in several playlists under different subgenres. These complex relationships require the three variables to work together to identify each row.
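This can be seen on a hypothetical miniature of table connection, where no single column is unique but the three columns together identify each row:

```python
import pandas as pd

# Hypothetical rows: the same song sits in two playlists, and two rows share a
# subgenre, so no single column is unique, but the three together are.
connection = pd.DataFrame({
    'track_id':          ['t1', 't1', 't2'],
    'playlist_id':       ['p1', 'p2', 'p1'],
    'playlist_subgenre': ['lo-fi', 'lo-fi', 'album rock'],
})
key = ['track_id', 'playlist_id', 'playlist_subgenre']
print(connection['track_id'].is_unique)                                 # False
print(connection.drop_duplicates(key).shape[0] == connection.shape[0])  # True
```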
2NF
The second normal form requires all columns in each row to depend fully on the candidate keys. This means that, when the primary key is composite, there should be no partial dependency: any column that is not part of the primary key should depend on all of the primary key columns rather than on one or a subset of them. The composite primary key in our example only appears in table connection. In this table, track_album_id depends only on track_id, one of the primary key columns, so a further partition is necessary. If we move track_album_id into table song, where track_id is the single primary key, all three normal forms are satisfied. So our resulting schema is
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name ,track_artist , track_popularity ,danceability , energy , key ,loudness , mode , speechiness ,acousticness , instrumentalness ,liveness , valence , tempo ,duration_ms , track_album_id | track_id |
album | track_album_id , track_album_name ,track_album_release_date | track_album_id |
playlist | playlist_id ,playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
connection | track_id , playlist_id ,playlist_subgenre | track_id ,playlist_id ,playlist_subgenre |
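The partial dependency argued above, track_album_id depending on track_id alone, can be checked empirically. On a hypothetical toy version of the pre-2NF connection table (column names mirror the dataset, values are made up):

```python
import pandas as pd

# Hypothetical pre-2NF connection rows: whichever playlist a track sits in,
# it always carries the same album.
conn = pd.DataFrame({
    'track_id':       ['t1', 't1', 't2'],
    'playlist_id':    ['p1', 'p2', 'p1'],
    'track_album_id': ['al1', 'al1', 'al2'],
})
# If every track_id maps to exactly one album, the dependency is on
# track_id alone, i.e. a partial dependency on the composite key.
albums_per_track = conn.groupby('track_id')['track_album_id'].nunique()
print((albums_per_track == 1).all())  # True
```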
Personal Choice
Although the current schema doesn’t violate the rules of the normal forms, table song still includes many variables, a number of which describe audio features of the songs. For clearer presentation, splitting those into a separate feature table is a good choice. The result is
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name ,track_artist , track_popularity ,track_album_id | track_id |
album | track_album_id , track_album_name ,track_album_release_date | track_album_id |
playlist | playlist_id ,playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
feature | track_id , danceability , energy ,key , loudness , mode , speechiness ,acousticness , instrumentalness ,liveness , valence , tempo ,duration_ms | track_id |
connection | track_id , playlist_id ,playlist_subgenre | track_id ,playlist_id ,playlist_subgenre |
Implementation
The Python code implementing the data cutting is shown below.
song = spotify_songs[['track_id', 'track_name', 'track_artist', 'track_popularity', 'track_album_id']].drop_duplicates()
album = spotify_songs[['track_album_id', 'track_album_name', 'track_album_release_date']].drop_duplicates()
playlist = spotify_songs[['playlist_id', 'playlist_name']].drop_duplicates()
genre = spotify_songs[['playlist_subgenre', 'playlist_genre']].drop_duplicates()
feature = spotify_songs[['track_id', 'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo','duration_ms']].drop_duplicates()
feature = feature.round(4)
connection = spotify_songs[['track_id', 'playlist_id', 'playlist_subgenre']].drop_duplicates()
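One way to sanity-check the cut is to verify that it is lossless: joining the pieces back together should reproduce the original rows. A sketch on a hypothetical two-row stand-in for the dataset (feature columns omitted for brevity):

```python
import pandas as pd

# Hypothetical miniature of spotify_songs covering the columns the cut uses.
toy_songs = pd.DataFrame({
    'track_id': ['t1', 't1'],
    'track_name': ['Song A', 'Song A'],
    'track_artist': ['Artist', 'Artist'],
    'track_popularity': [50, 50],
    'track_album_id': ['al1', 'al1'],
    'track_album_name': ['Album', 'Album'],
    'track_album_release_date': ['2019-01-01', '2019-01-01'],
    'playlist_id': ['p1', 'p2'],
    'playlist_name': ['Lush Lofi', 'Deep Focus'],
    'playlist_subgenre': ['lo-fi', 'lo-fi'],
    'playlist_genre': ['r&b', 'r&b'],
})

# The same cut as above, applied to the toy frame.
song = toy_songs[['track_id', 'track_name', 'track_artist',
                  'track_popularity', 'track_album_id']].drop_duplicates()
album = toy_songs[['track_album_id', 'track_album_name',
                   'track_album_release_date']].drop_duplicates()
playlist = toy_songs[['playlist_id', 'playlist_name']].drop_duplicates()
genre = toy_songs[['playlist_subgenre', 'playlist_genre']].drop_duplicates()
connection = toy_songs[['track_id', 'playlist_id',
                        'playlist_subgenre']].drop_duplicates()

# Joining the pieces back together reproduces the original rows,
# so no information was lost in the decomposition.
rebuilt = (connection
           .merge(song, on='track_id')
           .merge(album, on='track_album_id')
           .merge(playlist, on='playlist_id')
           .merge(genre, on='playlist_subgenre'))
print(rebuilt.shape[0] == toy_songs.drop_duplicates().shape[0])  # True
```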
The correlations between the tables are illustrated by an entity-relationship diagram (figure omitted here).
Create a SQLite3 Schema
To transform pandas DataFrames into SQLite tables, we can use the sqlite3 package together with the DataFrame method to_sql. The code is shown below.
import sqlite3
con = sqlite3.connect("Data/spotify.db")
song.to_sql('song', con, if_exists = 'replace', index = False)
album.to_sql('album', con, if_exists = 'replace', index = False)
playlist.to_sql('playlist', con, if_exists = 'replace', index = False)
genre.to_sql('genre', con, if_exists = 'replace', index = False)
feature.to_sql('feature', con, if_exists = 'replace', index = False)
connection.to_sql('connection', con, if_exists = 'replace', index = False)
Thus the target database spotify.db is stored in folder Data and contains 6 tables. We can check that by running
%load_ext sql
%sql sqlite:///Data/spotify.db
%%sql
SELECT type, name
FROM sqlite_master
WHERE type = 'table'
type | name |
---|---|
table | song |
table | album |
table | playlist |
table | genre |
table | feature |
table | connection |
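The same listing can be obtained without the ipython-sql magic, using the sqlite3 module directly. A self-contained sketch with an in-memory database and a single toy table:

```python
import sqlite3
import pandas as pd

# In-memory database with one toy table, for illustration only.
con = sqlite3.connect(':memory:')
pd.DataFrame({'track_id': ['t1']}).to_sql('song', con, index=False)

# Query sqlite_master just like the %%sql cell above.
tables = con.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)  # [('table', 'song')]
con.close()
```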
Problem Solution
In this section, we explore the names of all playlists that contain instrumentals. The threshold for judging whether a track is instrumental comes from the definition of the variable instrumentalness: values above 0.5 are intended to represent instrumental tracks, with confidence increasing as the value approaches 1.0. To find the playlist names, which are stored in table playlist, the first step is to combine tables feature and connection via an INNER JOIN on their common variable track_id. The next step is to join the result with playlist on playlist_id, again using INNER JOIN. From the combination of the three tables we can filter the rows with instrumentalness larger than 0.5 and count the number of such songs in each playlist. Since many playlists satisfy the condition, only the top 10 are listed, ordered by the number of qualifying songs.
%%sql
SELECT playlist_name AS 'playlist name', COUNT(DISTINCT feature.track_id) AS 'Number of Songs'
FROM feature INNER JOIN connection
ON feature.track_id = connection.track_id
INNER JOIN playlist
ON connection.playlist_id = playlist.playlist_id
WHERE instrumentalness > 0.5
GROUP BY playlist_name
ORDER BY COUNT(DISTINCT feature.track_id) DESC
LIMIT 10
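The same join-filter-group logic can be expressed in pandas as a cross-check. The miniature tables below are hypothetical stand-ins for feature, connection and playlist:

```python
import pandas as pd

# Hypothetical miniature tables mirroring the schema.
feature = pd.DataFrame({'track_id': ['t1', 't2', 't3'],
                        'instrumentalness': [0.9, 0.6, 0.1]})
connection = pd.DataFrame({'track_id': ['t1', 't2', 't3'],
                           'playlist_id': ['p1', 'p1', 'p2']})
playlist = pd.DataFrame({'playlist_id': ['p1', 'p2'],
                         'playlist_name': ['Lush Lofi', 'Deep Focus']})

# INNER JOIN feature -> connection -> playlist, then filter and count,
# mirroring the SQL query above.
joined = (feature.merge(connection, on='track_id')
                 .merge(playlist, on='playlist_id'))
result = (joined[joined['instrumentalness'] > 0.5]
          .groupby('playlist_name')['track_id'].nunique()
          .sort_values(ascending=False))
print(result.to_dict())  # {'Lush Lofi': 2}
```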
Therefore, among all of the playlists that include instrumentals, “Lush Lofi” has the largest number of such songs. For audiences fond of instrumental music, it might be an excellent choice. Besides, according to the following query, 262 of the 471 playlists contain at least one instrumental track, meaning that playlists with instrumentals slightly outnumber those with only vocal content in the Spotify dataset.
%%sql
SELECT COUNT(DISTINCT playlist.playlist_id) AS 'total number of playlist'
FROM feature INNER JOIN connection
ON feature.track_id = connection.track_id
INNER JOIN playlist
ON connection.playlist_id = playlist.playlist_id
WHERE instrumentalness > 0.5
total number of playlist |
---|
262 |