Creating a SQLite3 Schema Using Spotify Dataset

Introduction

This blog post creates a SQLite3 schema to store the Spotify Songs dataset. The schema should follow normalization rules to reduce data redundancy and prevent inconsistencies on insertion, update and deletion. Normalization is a design technique that divides large tables into smaller ones and connects them through relationships. There are many database normal forms; from least normalized to most normalized they are:

  • UNF: Unnormalized form
  • 1NF: First normal form
  • 2NF: Second normal form
  • 3NF: Third normal form
  • EKNF: Elementary key normal form
  • BCNF: Boyce-Codd normal form
  • 4NF: Fourth normal form
  • ETNF: Essential tuple normal form
  • 5NF: Fifth normal form
  • DKNF: Domain-key normal form
  • 6NF: Sixth normal form

The main normal forms applied in this blog are 1NF, 2NF and 3NF. They are defined as follows:

  • First normal form
    • Table has a primary key
    • No repeating groups of columns
    • Each cell contains a single value
  • Second normal form
    • All columns in each row depend fully on candidate keys
  • Third normal form
    • No transitive dependencies between non-candidate columns

Following these normalization rules, along with some personal choices, the Spotify dataset can be separated into 6 smaller tables and stored more logically. These tables form a SQLite3 database named spotify.db, from which the names of all playlists that contain instrumentals can be obtained.

Spotify Dataset

The Spotify dataset used in this blog comes from the spotifyr package authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. This package collects information about songs through Spotify’s API. Following the example code, the dataset can be loaded in Python as follows:

import pandas as pd
import numpy as np

spotify_songs = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
spotify_songs = spotify_songs.drop_duplicates()

The codebook of this dataset can be read here. The data has 32833 rows and 23 columns, which is too unwieldy to analyze as a single table, so a detailed dissection is necessary.

Data Cutting

Analysis

1NF

To start with, let’s go through the first normal form. As can be seen from the data, the requirements of no repeating column groups and single-valued cells are already satisfied. However, the primary key condition is not met: although track_id looks like a primary key, it contains repeated values and is therefore not eligible. We can verify this by building a generalized checking function.

def check(table, primary_key):
    """
    Check that primary_key is a valid primary key for table:
    it must contain no NULL and no repeated values.
    """
    if np.any(table[primary_key].isnull()):
        print('Mistake: the primary key of {} contains NULL value!'.format(table.name))
    if table[primary_key].drop_duplicates().shape[0] != table.shape[0]:
        print('Mistake: the primary key of {} contains REPETITIVE value!'.format(table.name))

The following code then shows that repeated values exist in track_id, which results from the playlist variables. Since a song can appear in multiple playlists, and a playlist can likewise include multiple songs, these two parts should be separated into different data frames.

>>> spotify_songs.name = 'spotify_songs'
>>> check(spotify_songs, 'track_id')
Mistake: the primary key of spotify_songs contains REPETITIVE value!

We choose track_id rather than track_name as the primary key. Although it does not happen in our data, two songs may share exactly the same name, because names are not protected by copyright. Thus track_id, the unique ID of each song, is the proper identifier. The same logic extends to the other tables.

Now the Spotify dataset can be shown as below.

Table | Variables | Primary Key
--- | --- | ---
song | track_id, track_name, track_artist, track_popularity, track_album_id, track_album_name, track_album_release_date, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms | track_id
playlist | playlist_id, playlist_name, playlist_genre, playlist_subgenre | playlist_id

If we continue to employ the check function to verify playlist_id as the primary key of table playlist, we will see that it contains repeated values as well. This is because a single playlist can contain multiple subgenres, and a subgenre can appear in several playlists. Therefore, the playlist table should be divided once more. The final result based on the first normal form is

Table | Variables | Primary Key
--- | --- | ---
song | track_id, track_name, track_artist, track_popularity, track_album_id, track_album_name, track_album_release_date, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms | track_id
playlist | playlist_id, playlist_name | playlist_id
genre | playlist_subgenre, playlist_genre | playlist_subgenre
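The repetition behind this split can be reproduced with the check function. Below is a self-contained sketch using toy rows (hypothetical ids and names, not the real dataset), with check repeated so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# same check() as defined earlier, repeated so this sketch is self-contained
def check(table, primary_key):
    if np.any(table[primary_key].isnull()):
        print('Mistake: the primary key of {} contains NULL value!'.format(table.name))
    if table[primary_key].drop_duplicates().shape[0] != table.shape[0]:
        print('Mistake: the primary key of {} contains REPETITIVE value!'.format(table.name))

# toy rows (hypothetical values): one playlist carries two subgenres,
# so its playlist_id repeats
playlist = pd.DataFrame({
    'playlist_id':       ['p1', 'p1', 'p2'],
    'playlist_name':     ['Chill', 'Chill', 'Focus'],
    'playlist_genre':    ['pop', 'pop', 'rock'],
    'playlist_subgenre': ['dance pop', 'electropop', 'album rock'],
})
playlist.name = 'playlist'
check(playlist, 'playlist_id')
# prints: Mistake: the primary key of playlist contains REPETITIVE value!
```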

3NF

Then we turn to the third normal form, which requires no transitive dependencies between non-candidate columns. A transitive dependency means that, given a table with three columns A, B and C, if A depends on B and B depends on C, then A depends on C only through B. This situation must be ruled out in our schema. In table song, track_album_name and track_album_release_date depend only on track_album_id, and since track_album_id depends on the primary key track_id, track_album_name and track_album_release_date depend on track_id transitively. This clearly violates the third normal form, so table song should be divided further. In addition, a new table is needed to record the relationships between all of the tables. The result is shown below.

Table | Variables | Primary Key
--- | --- | ---
song | track_id, track_name, track_artist, track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms | track_id
album | track_album_id, track_album_name, track_album_release_date | track_album_id
playlist | playlist_id, playlist_name | playlist_id
genre | playlist_subgenre, playlist_genre | playlist_subgenre
connection | track_id, playlist_id, playlist_subgenre, track_album_id | track_id, playlist_id, playlist_subgenre

Here the primary keys of tables song, album, playlist and genre are gathered in one table with the composite primary key (track_id, playlist_id, playlist_subgenre). A composite key rather than a single key is needed because these variables vary independently: a playlist can contain many songs and many subgenres, a subgenre can belong to many playlists and songs, and a song can occur in several playlists under different subgenres. These many-to-many relationships mean the three variables must work together to identify each row.
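Whether the three columns really form a valid composite key can be verified by searching for duplicated combinations; a minimal sketch with toy rows (hypothetical ids, not the real data):

```python
import pandas as pd

# toy rows: no (track_id, playlist_id, playlist_subgenre) combination repeats,
# even though each individual column contains duplicates
connection = pd.DataFrame({
    'track_id':          ['t1', 't1', 't2'],
    'playlist_id':       ['p1', 'p2', 'p1'],
    'playlist_subgenre': ['dance pop', 'dance pop', 'album rock'],
})
has_dup = connection.duplicated(['track_id', 'playlist_id', 'playlist_subgenre']).any()
print(has_dup)  # False: the composite key identifies each row
```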

2NF

The second normal form asks all columns in each row to depend fully on the candidate keys. When the primary key is composite, there must be no partial dependency: any column outside the primary key should depend on all of the key columns rather than on a subset of them. The composite primary key in our example appears only in table connection, where track_album_id depends on track_id alone. Hence further partitioning is necessary. If we move track_album_id into table song, where track_id is the single-column primary key, all three normal forms are satisfied. The resulting schema is

Table | Variables | Primary Key
--- | --- | ---
song | track_id, track_name, track_artist, track_popularity, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms, track_album_id | track_id
album | track_album_id, track_album_name, track_album_release_date | track_album_id
playlist | playlist_id, playlist_name | playlist_id
genre | playlist_subgenre, playlist_genre | playlist_subgenre
connection | track_id, playlist_id, playlist_subgenre | track_id, playlist_id, playlist_subgenre

Personal Choice

Although the current schema does not violate any of the normal forms, table song still includes many variables, several of which describe audio features of the songs. For a clearer presentation, moving these feature columns into their own table is a good choice. The result is

Table | Variables | Primary Key
--- | --- | ---
song | track_id, track_name, track_artist, track_popularity, track_album_id | track_id
album | track_album_id, track_album_name, track_album_release_date | track_album_id
playlist | playlist_id, playlist_name | playlist_id
genre | playlist_subgenre, playlist_genre | playlist_subgenre
feature | track_id, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms | track_id
connection | track_id, playlist_id, playlist_subgenre | track_id, playlist_id, playlist_subgenre

Implementation

The Python code that implements the data cutting is

song = spotify_songs[['track_id', 'track_name', 'track_artist', 'track_popularity', 'track_album_id']].drop_duplicates()
album = spotify_songs[['track_album_id', 'track_album_name', 'track_album_release_date']].drop_duplicates()
playlist = spotify_songs[['playlist_id', 'playlist_name']].drop_duplicates()
genre = spotify_songs[['playlist_subgenre', 'playlist_genre']].drop_duplicates()
feature = spotify_songs[['track_id', 'danceability', 'energy', 'key', 'loudness', 'mode', 
                         'speechiness', 'acousticness', 'instrumentalness', 'liveness', 
                         'valence', 'tempo','duration_ms']].drop_duplicates()
feature = feature.round(4)  # round the audio features to 4 decimal places
connection = spotify_songs[['track_id', 'playlist_id', 'playlist_subgenre']].drop_duplicates()
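After cutting, a quick sanity check is that every foreign key value in a child table also appears in its parent table. A self-contained sketch with toy frames (hypothetical rows standing in for the real song and connection frames built above):

```python
import pandas as pd

# toy parent/child frames standing in for song and connection
song = pd.DataFrame({'track_id': ['t1', 't2']})
connection = pd.DataFrame({'track_id':    ['t1', 't1', 't2'],
                           'playlist_id': ['p1', 'p2', 'p1']})

# rows whose track_id has no match in song would be orphans
orphans = connection[~connection['track_id'].isin(song['track_id'])]
print(len(orphans))  # 0: referential integrity holds
```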

The relationships between the tables are summarized below.

[Figure: entity-relationship diagram linking song, album, playlist, genre, feature and connection]

Create a SQLite3 Schema

To write the pandas DataFrames into SQLite tables, we can open a connection with the sqlite3 package and store each table with the DataFrame.to_sql method. The code is shown below.

import sqlite3

con = sqlite3.connect("Data/spotify.db")
song.to_sql('song', con, if_exists = 'replace', index = False)
album.to_sql('album', con, if_exists = 'replace', index = False)
playlist.to_sql('playlist', con, if_exists = 'replace', index = False)
genre.to_sql('genre', con, if_exists = 'replace', index = False)
feature.to_sql('feature', con, if_exists = 'replace', index = False)
connection.to_sql('connection', con, if_exists = 'replace', index = False)
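One caveat: to_sql infers column types but does not declare primary or foreign keys, so the constraints above exist only in our design. If enforcement matters, the tables can be created explicitly first; a sketch for one table using an in-memory database and a hypothetical row:

```python
import sqlite3

con = sqlite3.connect(':memory:')  # in-memory database for illustration
con.execute("""
    CREATE TABLE playlist (
        playlist_id   TEXT PRIMARY KEY,
        playlist_name TEXT
    )
""")
con.execute("INSERT INTO playlist VALUES (?, ?)", ('p1', 'Chill'))

# a second row with the same playlist_id is now rejected by the constraint
rejected = False
try:
    con.execute("INSERT INTO playlist VALUES (?, ?)", ('p1', 'Another'))
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
con.close()
```

If tables are pre-created this way, loading with to_sql(..., if_exists='append') keeps the declared schema, whereas 'replace' would drop it.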

The target database spotify.db is now stored in the folder Data and contains 6 tables, which we can confirm by running

%load_ext sql
%sql sqlite:///Data/spotify.db

%%sql

SELECT type, name
FROM sqlite_master
WHERE type = 'table'
type | name
--- | ---
table | song
table | album
table | playlist
table | genre
table | feature
table | connection

Problem Solution

In this section, we explore the names of all playlists that contain instrumentals. The threshold comes from the definition of the variable instrumentalness: values above 0.5 are intended to represent instrumental tracks, with confidence increasing as the value approaches 1.0. Since the playlist names live in table playlist, the first step is to join feature and connection with INNER JOIN on their common column track_id; the next step joins the result with playlist on playlist_id. From the combined tables we filter the rows with instrumentalness above 0.5 and count the number of such songs in each playlist. Since many playlists satisfy the condition, only the top 10 are listed, ordered by the number of songs they contain.

%%sql

SELECT playlist_name AS 'playlist name', COUNT(DISTINCT feature.track_id) AS 'Number of Songs'
FROM feature INNER JOIN connection
ON feature.track_id = connection.track_id
INNER JOIN playlist
ON connection.playlist_id = playlist.playlist_id
WHERE instrumentalness > 0.5
GROUP BY playlist_name
ORDER BY COUNT(DISTINCT feature.track_id) DESC
LIMIT 10

[Table: top 10 playlists ordered by the number of instrumental tracks they contain; “Lush Lofi” ranks first]

Therefore, among all playlists that include instrumentals, “Lush Lofi” has the largest number of such songs; for listeners fond of instrumental music, it might be an excellent choice. Moreover, according to the following query, 262 of the 471 playlists in the dataset contain at least one instrumental track, meaning playlists with instrumentals slightly outnumber playlists with only vocal content.

%%sql

SELECT COUNT(DISTINCT playlist.playlist_id) AS 'total number of playlist'
FROM feature INNER JOIN connection
ON feature.track_id = connection.track_id
INNER JOIN playlist
ON connection.playlist_id = playlist.playlist_id
WHERE instrumentalness > 0.5
total number of playlist
---
262
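The same join logic can be exercised end to end on a toy database using only the sqlite3 module (hypothetical rows, so the counts here are illustrative, not the 262/471 above):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE feature    (track_id TEXT, instrumentalness REAL);
    CREATE TABLE connection (track_id TEXT, playlist_id TEXT);
    CREATE TABLE playlist   (playlist_id TEXT, playlist_name TEXT);
    INSERT INTO feature VALUES ('t1', 0.9), ('t2', 0.1), ('t3', 0.7), ('t4', 0.8);
    INSERT INTO connection VALUES ('t1', 'p1'), ('t2', 'p1'), ('t3', 'p1'), ('t4', 'p2');
    INSERT INTO playlist VALUES ('p1', 'Lofi'), ('p2', 'Piano');
""")
rows = con.execute("""
    SELECT playlist_name, COUNT(DISTINCT feature.track_id)
    FROM feature
    INNER JOIN connection ON feature.track_id = connection.track_id
    INNER JOIN playlist   ON connection.playlist_id = playlist.playlist_id
    WHERE instrumentalness > 0.5
    GROUP BY playlist_name
    ORDER BY COUNT(DISTINCT feature.track_id) DESC
""").fetchall()
print(rows)  # [('Lofi', 2), ('Piano', 1)]
con.close()
```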