Creating a SQLite3 Schema Using Spotify Dataset
Introduction
This blog builds a SQLite3 schema to store the Spotify Songs dataset. The schema should follow normalization rules to reduce data redundancy and prevent inconsistencies during insertion, update and deletion. Normalization is a design technique that divides large tables into smaller ones and connects them through relationships. There are many database normal forms; from least to most normalized they are:
- UNF: Unnormalized form
- 1NF: First normal form
- 2NF: Second normal form
- 3NF: Third normal form
- EKNF: Elementary key normal form
- BCNF: Boyce-Codd normal form
- 4NF: Fourth normal form
- ETNF: Essential tuple normal form
- 5NF: Fifth normal form
- DKNF: Domain-key normal form
- 6NF: Sixth normal form
The main normal forms applied in this blog are 1NF, 2NF and 3NF. They are defined as
- First normal form
- Table has a primary key
- No repeating groups of columns
- Each cell contains a single value
- Second normal form
- All columns in each row depend fully on candidate keys
- Third normal form
- No transitive dependencies between non-candidate columns
According to these normalization rules as well as some personal choices, the Spotify dataset can be separated into 6 small tables and stored more logically. These tables can form a SQLite3 database named spotify.db, using which the names of all playlists that contain instrumentals can be obtained.
Spotify Dataset
The Spotify dataset used in this blog comes from the spotifyr package authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. This package collects information about songs through Spotify’s API. Following the example code, the dataset can be loaded in Python as follows:
import pandas as pd
import numpy as np
spotify_songs = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
spotify_songs = spotify_songs.drop_duplicates()
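As a quick illustration of what the drop_duplicates call above removes, the following toy frame (hypothetical values, not real dataset rows) shows that only fully identical rows are dropped:

```python
import pandas as pd

# Hypothetical rows: the first two are exact duplicates, the third differs.
toy = pd.DataFrame({
    'track_id':    ['a1', 'a1', 'a1'],
    'playlist_id': ['p1', 'p1', 'p2'],
})
deduped = toy.drop_duplicates()
print(deduped.shape)  # (2, 2): only the exact-duplicate row is removed
```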
The codebook of this dataset can be read here. It has 32833 rows and 23 columns, which makes the raw table tedious to analyze as a single unit. Thus a detailed dissection is necessary.
Data Cutting
Analysis
1NF
To start with, let’s go through the first normal form rule. As can be seen from the data, the requirements of no repeating column groups and single-valued cells are automatically satisfied. However, the primary key condition is not met. Although track_id seems to be the primary key, it contains repeated values and is therefore not eligible. We can verify this by building a generalized checking function.
def check(table, primary_key):
    """
    This function checks the validity of a primary key.
    """
    if np.any(table[primary_key].isnull()):
        print('Mistake: the primary key of {} contains NULL value!'.format(table.name))
    if table[primary_key].drop_duplicates().shape[0] != table.shape[0]:
        print('Mistake: the primary key of {} contains REPETITIVE value!'.format(table.name))
Then the following code shows that repeated values exist in track_id, which may result from the playlist variables. Since a song can appear in multiple playlists, and a playlist can include multiple songs, these two parts should be separated into different data frames.
>>> spotify_songs.name = 'spotify_songs'
>>> check(spotify_songs, 'track_id')
Mistake: the primary key of spotify_songs contains REPETITIVE value!
We choose track_id rather than track_name as the primary key. Although it does not happen in our case, two songs may share exactly the same name, because names are not protected by copyright. Thus track_id, the unique ID of each song, should serve as its identity. The same logic extends to the other tables.
Now the Spotify dataset can be shown as below.
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name , track_artist ,track_popularity , track_album_id ,track_album_name ,track_album_release_date , danceability ,energy , key , loudness , mode , speechiness ,acousticness , instrumentalness , liveness ,valence , tempo , duration_ms | track_id |
playlist | playlist_id , playlist_name , playlist_genre ,playlist_subgenre | playlist_id |
If we continue to employ the check function to verify playlist_id as a primary key of table playlist, we will see that it contains repetitive values as well. This is because a single playlist can contain multiple genres, and a genre can exist in several playlists. Therefore, the playlist dataset should be divided once more. The final result based on the first normal form is
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name , track_artist ,track_popularity , track_album_id ,track_album_name ,track_album_release_date ,danceability , energy , key , loudness ,mode , speechiness , acousticness ,instrumentalness , liveness , valence ,tempo , duration_ms | track_id |
playlist | playlist_id , playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
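To confirm that the split removes the repetition, the check function defined above can be applied to miniature versions of the new tables. The rows below are hypothetical stand-ins, not real dataset values:

```python
import numpy as np
import pandas as pd

def check(table, primary_key):
    """Check the validity of a primary key (same logic as above)."""
    if np.any(table[primary_key].isnull()):
        print('Mistake: the primary key of {} contains NULL value!'.format(table.name))
    if table[primary_key].drop_duplicates().shape[0] != table.shape[0]:
        print('Mistake: the primary key of {} contains REPETITIVE value!'.format(table.name))

# Hypothetical miniatures of the split tables
playlist = pd.DataFrame({'playlist_id': ['p1', 'p2'],
                         'playlist_name': ['Lush Lofi', 'Deep Focus']})
playlist.name = 'playlist'
genre = pd.DataFrame({'playlist_subgenre': ['lo-fi', 'album rock'],
                      'playlist_genre': ['r&b', 'rock']})
genre.name = 'genre'

check(playlist, 'playlist_id')     # prints nothing: the key is valid
check(genre, 'playlist_subgenre')  # prints nothing: the key is valid
```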
3NF
Then we turn to the third normal form, which requires no transitive dependencies between non-candidate columns. A transitive dependency means that, given three columns A, B and C in a table, if A depends on B and B depends on C, then A depends on C through B. This indirect dependency needs to be ruled out during our dataset construction. In table song, track_album_name and track_album_release_date depend merely on track_album_id, and since track_album_id depends on the primary key track_id, track_album_name and track_album_release_date transitively depend on track_id as well. This clearly violates the third normal form, so table song should be divided further. Besides, a new table should be introduced to record the relationships between all of the tables. The final result is shown below.
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name ,track_artist , track_popularity ,danceability , energy , key ,loudness , mode , speechiness ,acousticness , instrumentalness ,liveness , valence , tempo ,duration_ms | track_id |
album | track_album_id , track_album_name ,track_album_release_date | track_album_id |
playlist | playlist_id ,playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
connection | track_id , playlist_id ,playlist_subgenre , track_album_id | track_id ,playlist_id ,playlist_subgenre |
Here the primary keys of tables song, album, playlist and genre are put into one dataset with a composite primary key of track_id, playlist_id and playlist_subgenre. A composite key rather than a single key is chosen because these variables are independent of one another: a playlist can contain many songs as well as many subgenres; a subgenre can belong to many playlists and songs; and a song can occur in several playlists under different subgenres. These complex relationships require the three variables to work together to identify each row.
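This can be seen on a hypothetical miniature of table connection, where no single column is unique but the three columns together identify each row:

```python
import pandas as pd

# Hypothetical rows: the same song sits in two playlists, and two rows share a
# subgenre, so no single column is unique, but the three together are.
connection = pd.DataFrame({
    'track_id':          ['t1', 't1', 't2'],
    'playlist_id':       ['p1', 'p2', 'p1'],
    'playlist_subgenre': ['lo-fi', 'lo-fi', 'album rock'],
})
key = ['track_id', 'playlist_id', 'playlist_subgenre']
print(connection['track_id'].is_unique)                                 # False
print(connection.drop_duplicates(key).shape[0] == connection.shape[0])  # True
```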
2NF
The second normal form requires all columns in each row to depend fully on the candidate keys. This means that, when the primary key is composite, there should be no partial dependency: any column that is not part of the primary key should depend on all of the primary key columns rather than on one or a subset of them. The composite primary key in our example only appears in table connection. In this table, track_album_id depends only on track_id, one of the primary key columns, so a further partition is necessary. If we move track_album_id into table song, where track_id is the single primary key, all three normal forms are satisfied. So our resulting schema is
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name ,track_artist , track_popularity ,danceability , energy , key ,loudness , mode , speechiness ,acousticness , instrumentalness ,liveness , valence , tempo ,duration_ms , track_album_id | track_id |
album | track_album_id , track_album_name ,track_album_release_date | track_album_id |
playlist | playlist_id ,playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
connection | track_id , playlist_id ,playlist_subgenre | track_id ,playlist_id ,playlist_subgenre |
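The partial dependency argued above, track_album_id depending on track_id alone, can be checked empirically. On a hypothetical toy version of the pre-2NF connection table (column names mirror the dataset, values are made up):

```python
import pandas as pd

# Hypothetical pre-2NF connection rows: whichever playlist a track sits in,
# it always carries the same album.
conn = pd.DataFrame({
    'track_id':       ['t1', 't1', 't2'],
    'playlist_id':    ['p1', 'p2', 'p1'],
    'track_album_id': ['al1', 'al1', 'al2'],
})
# If every track_id maps to exactly one album, the dependency is on
# track_id alone, i.e. a partial dependency on the composite key.
albums_per_track = conn.groupby('track_id')['track_album_id'].nunique()
print((albums_per_track == 1).all())  # True
```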
Personal Choice
Although the current schema doesn’t violate the rules of the normal forms, table song still includes many variables, a number of which describe audio features of the songs. For clearer presentation, splitting those into a separate feature table is a good choice. The result is
Table | Variables | Primary Key |
---|---|---|
song | track_id , track_name ,track_artist , track_popularity ,track_album_id | track_id |
album | track_album_id , track_album_name ,track_album_release_date | track_album_id |
playlist | playlist_id ,playlist_name | playlist_id |
genre | playlist_subgenre , playlist_genre | playlist_subgenre |
feature | track_id , danceability , energy ,key , loudness , mode , speechiness ,acousticness , instrumentalness ,liveness , valence , tempo ,duration_ms | track_id |
connection | track_id , playlist_id ,playlist_subgenre | track_id ,playlist_id ,playlist_subgenre |
Implementation
The Python code implementing the data cutting is shown below.
song = spotify_songs[['track_id', 'track_name', 'track_artist', 'track_popularity', 'track_album_id']].drop_duplicates()
album = spotify_songs[['track_album_id', 'track_album_name', 'track_album_release_date']].drop_duplicates()
playlist = spotify_songs[['playlist_id', 'playlist_name']].drop_duplicates()
genre = spotify_songs[['playlist_subgenre', 'playlist_genre']].drop_duplicates()
feature = spotify_songs[['track_id', 'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo','duration_ms']].drop_duplicates()
feature = feature.round(4)
connection = spotify_songs[['track_id', 'playlist_id', 'playlist_subgenre']].drop_duplicates()
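One way to sanity-check the cut is to verify that it is lossless: joining the pieces back together should reproduce the original rows. A sketch on a hypothetical two-row stand-in for the dataset (feature columns omitted for brevity):

```python
import pandas as pd

# Hypothetical miniature of spotify_songs covering the columns the cut uses.
toy_songs = pd.DataFrame({
    'track_id': ['t1', 't1'],
    'track_name': ['Song A', 'Song A'],
    'track_artist': ['Artist', 'Artist'],
    'track_popularity': [50, 50],
    'track_album_id': ['al1', 'al1'],
    'track_album_name': ['Album', 'Album'],
    'track_album_release_date': ['2019-01-01', '2019-01-01'],
    'playlist_id': ['p1', 'p2'],
    'playlist_name': ['Lush Lofi', 'Deep Focus'],
    'playlist_subgenre': ['lo-fi', 'lo-fi'],
    'playlist_genre': ['r&b', 'r&b'],
})

# The same cut as above, applied to the toy frame.
song = toy_songs[['track_id', 'track_name', 'track_artist',
                  'track_popularity', 'track_album_id']].drop_duplicates()
album = toy_songs[['track_album_id', 'track_album_name',
                   'track_album_release_date']].drop_duplicates()
playlist = toy_songs[['playlist_id', 'playlist_name']].drop_duplicates()
genre = toy_songs[['playlist_subgenre', 'playlist_genre']].drop_duplicates()
connection = toy_songs[['track_id', 'playlist_id',
                        'playlist_subgenre']].drop_duplicates()

# Joining the pieces back together reproduces the original rows,
# so no information was lost in the decomposition.
rebuilt = (connection
           .merge(song, on='track_id')
           .merge(album, on='track_album_id')
           .merge(playlist, on='playlist_id')
           .merge(genre, on='playlist_subgenre'))
print(rebuilt.shape[0] == toy_songs.drop_duplicates().shape[0])  # True
```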
The correlations between the tables are illustrated by an entity-relationship diagram (figure omitted here).
Create a SQLite3 Schema
To transform pandas DataFrames into SQLite tables, we can use the sqlite3 package together with the DataFrame method to_sql. The code is shown below.
import sqlite3
con = sqlite3.connect("Data/spotify.db")
song.to_sql('song', con, if_exists = 'replace', index = False)
album.to_sql('album', con, if_exists = 'replace', index = False)
playlist.to_sql('playlist', con, if_exists = 'replace', index = False)
genre.to_sql('genre', con, if_exists = 'replace', index = False)
feature.to_sql('feature', con, if_exists = 'replace', index = False)
connection.to_sql('connection', con, if_exists = 'replace', index = False)
Thus the target database spotify.db is stored in folder Data and contains 6 tables. We can check that by running
%load_ext sql
%sql sqlite:///Data/spotify.db
%%sql
SELECT type, name
FROM sqlite_master
WHERE type = 'table'
type | name |
---|---|
table | song |
table | album |
table | playlist |
table | genre |
table | feature |
table | connection |
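The same listing can be obtained without the ipython-sql magic, using the sqlite3 module directly. A self-contained sketch with an in-memory database and a single toy table:

```python
import sqlite3
import pandas as pd

# In-memory database with one toy table, for illustration only.
con = sqlite3.connect(':memory:')
pd.DataFrame({'track_id': ['t1']}).to_sql('song', con, index=False)

# Query sqlite_master just like the %%sql cell above.
tables = con.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)  # [('table', 'song')]
con.close()
```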
Problem Solution
In this section, we explore the names of all playlists that contain instrumentals. The threshold for judging whether a track is instrumental comes from the definition of the variable instrumentalness: values above 0.5 are intended to represent instrumental tracks, with confidence increasing as the value approaches 1.0. To find the playlist names, which are stored in table playlist, the first step is to combine tables feature and connection via an INNER JOIN on their common variable track_id. The next step is to join the result with playlist on playlist_id, again using INNER JOIN. From the combination of the three tables we can filter the rows with instrumentalness larger than 0.5 and count the number of such songs in each playlist. Since many playlists satisfy the condition, only the top 10 are listed, ordered by the number of qualifying songs.
%%sql
SELECT playlist_name AS 'playlist name', COUNT(DISTINCT feature.track_id) AS 'Number of Songs'
FROM feature INNER JOIN connection
ON feature.track_id = connection.track_id
INNER JOIN playlist
ON connection.playlist_id = playlist.playlist_id
WHERE instrumentalness > 0.5
GROUP BY playlist_name
ORDER BY COUNT(DISTINCT feature.track_id) DESC
LIMIT 10
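The same join-filter-group logic can be expressed in pandas as a cross-check. The miniature tables below are hypothetical stand-ins for feature, connection and playlist:

```python
import pandas as pd

# Hypothetical miniature tables mirroring the schema.
feature = pd.DataFrame({'track_id': ['t1', 't2', 't3'],
                        'instrumentalness': [0.9, 0.6, 0.1]})
connection = pd.DataFrame({'track_id': ['t1', 't2', 't3'],
                           'playlist_id': ['p1', 'p1', 'p2']})
playlist = pd.DataFrame({'playlist_id': ['p1', 'p2'],
                         'playlist_name': ['Lush Lofi', 'Deep Focus']})

# INNER JOIN feature -> connection -> playlist, then filter and count,
# mirroring the SQL query above.
joined = (feature.merge(connection, on='track_id')
                 .merge(playlist, on='playlist_id'))
result = (joined[joined['instrumentalness'] > 0.5]
          .groupby('playlist_name')['track_id'].nunique()
          .sort_values(ascending=False))
print(result.to_dict())  # {'Lush Lofi': 2}
```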
Therefore, among all of the playlists that include instrumentals, “Lush Lofi” has the largest number of such songs. For audiences fond of instrumental music, it might be an excellent choice. Besides, according to the following query, 262 of the 471 playlists contain at least one instrumental track, meaning that playlists with instrumentals slightly outnumber those with only vocal content in the Spotify dataset.
%%sql
SELECT COUNT(DISTINCT playlist.playlist_id) AS 'total number of playlist'
FROM feature INNER JOIN connection
ON feature.track_id = connection.track_id
INNER JOIN playlist
ON connection.playlist_id = playlist.playlist_id
WHERE instrumentalness > 0.5
total number of playlist |
---|
262 |