Data Extraction Using Star Wars API

Oct 17, 2020 6 min read Python, API, Database, JSON

All the code in this post can be found in my GitHub repository.

Introduction

This post extracts data through SWAPI, the Star Wars API, in Python. The details of SWAPI can be found in its documentation. API stands for Application Programming Interface, an interface through which one piece of software requests data or services from another. Through an API, information can be retrieved in different data formats. SWAPI provides six attributes about Star Wars. They are

films
people
planets
species
starships
vehicles

SWAPI can render these data in two encodings, which are

JSON (default)
Wookiee

In this post, the data is extracted in JSON format using the requests library in Python. Once the data is collected, two questions need to be answered:

The name of the oldest person (or robot or alien)
The titles of all the films the oldest creature appeared in

Data Extraction

To find the name of the oldest creature in the Star Wars universe, the data under the people attribute should be collected. From the codebook, we learn that this information is stored individually for each person: the data for person $i$ is available at the URL https://swapi.dev/api/people/i. Given the exact URL, the JSON data can be read with the get function in the requests package. The first problem, however, is that we do not know how many characters SWAPI contains. In other words, we do not know the range of the person id $i$, so the exact URLs cannot be written down directly.

To resolve this, we first inspect the summary data for people to determine the range of $i$. This can be done with the following code.

import requests

url = 'https://swapi.dev/api/people/'
people = requests.get(url).json()
people

The returned variable people is a dictionary with four keys: count, next, previous and results. count gives the total number of people in SWAPI, which is the information we need. A trap here is that results appears to hold the information about every person as a list, which would make iterating over person ids unnecessary. On closer inspection, however, results contains only one page of people, 10 in total, which is far from a complete extraction. This also explains why the next and previous keys exist: they hold the URLs of the next and previous pages of people.

With the total number of people stored in count, we can build the URL for each person. The URLs have the format https://swapi.dev/api/people/i, where i runs from 1 to count. Note that this range assumes the ids are numbered without gaps: for every id in the range that does not exist, one person with an id above count is not reached. A URL that does not exist produces a 404 error, and storing such a response would be a waste of time and space. Luckily, the requests package makes this easy to check. For example, the status of the placeholder URL https://examples can be checked with

requests.get('https://examples').status_code

If the returned value is 404, the URL was not found. Conversely, 200 means the request succeeded. Taking this into account, the final list of people can be built by iterating over the person ids.

people_set = []
for i in range(1, people['count'] + 1):
    # build the URL by string formatting; os.path.join would insert backslashes on Windows
    content = requests.get(f'{url}{i}/')
    # skip ids that do not exist
    if content.status_code != 404:
        people_set.append(content.json())

In the end, people_set contains 81 people. The first element is shown below.

{'name': 'Luke Skywalker',
 'height': '172',
 'mass': '77',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': 'http://swapi.dev/api/planets/1/',
 'films': ['http://swapi.dev/api/films/1/',
  'http://swapi.dev/api/films/2/',
  'http://swapi.dev/api/films/3/',
  'http://swapi.dev/api/films/6/'],
 'species': [],
 'vehicles': ['http://swapi.dev/api/vehicles/14/',
  'http://swapi.dev/api/vehicles/30/'],
 'starships': ['http://swapi.dev/api/starships/12/',
  'http://swapi.dev/api/starships/22/'],
 'created': '2014-12-09T13:50:51.644000Z',
 'edited': '2014-12-20T21:17:56.891000Z',
 'url': 'http://swapi.dev/api/people/1/'}
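
As an alternative to iterating over ids, the next links described earlier can be followed page by page, which does not depend on how the ids are numbered. The following is only a minimal sketch: the function name fetch_all_pages and the get_json parameter are assumptions made for illustration, and the two pages used in the demonstration are made-up data, not real SWAPI output.

```python
from typing import Callable, Dict, List, Optional


def fetch_all_pages(first_url: str, get_json: Callable[[str], Dict]) -> List[Dict]:
    """Collect every record by following the 'next' links page by page."""
    records: List[Dict] = []
    url: Optional[str] = first_url
    while url:  # 'next' is None on the last page
        page = get_json(url)
        records.extend(page["results"])
        url = page.get("next")
    return records


# Offline demonstration with two fake pages (hypothetical data).
fake_pages = {
    "page1": {"count": 3, "next": "page2", "previous": None,
              "results": [{"name": "A"}, {"name": "B"}]},
    "page2": {"count": 3, "next": None, "previous": "page1",
              "results": [{"name": "C"}]},
}
all_records = fetch_all_pages("page1", fake_pages.__getitem__)
print(len(all_records))  # 3
```

With the requests library, the same function could presumably be called as fetch_all_pages(url, lambda u: requests.get(u).json()), where url is the people URL defined earlier.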

Problem Solution

The name of the oldest person (or robot or alien)

From the codebook, we can see that age information is stored in the variable birth_year. It is the birth year of the person in the in-universe standard of BBY or ABY, meaning Before the Battle of Yavin or After the Battle of Yavin. The Battle of Yavin occurs at the end of Star Wars Episode IV: A New Hope. The oldest creature must have been born before the Battle of Yavin rather than after it, with the largest possible time gap. Therefore, we can extract the numeric part of every birth_year that ends with BBY and select the maximum; its index is that of the oldest creature. In Python, this can be implemented as:

import pandas as pd

# extract birth_year information
birth_year_series = pd.Series([p['birth_year'] for p in people_set])

# numeric part of the BBY entries
# (regex=True is stated explicitly because the default became regex=False in pandas 2.0)
bby_years = (
    birth_year_series[birth_year_series.str.endswith('BBY')]
    .str.replace('[a-zA-Z]+', '', regex=True)
    .astype(float)
)

# calculate the oldest age
age_oldest = bby_years.max()

# find location of oldest age in the list
# (the filtered Series keeps its original labels, which are positions in people_set)
index = bby_years.idxmax()

# corresponding name
print('Name of the oldest creature:', people_set[index]['name'])

And the result is

Name of the oldest creature: Yoda

Note that the digits should not be extracted with series.str.extract('(\d+)'), because for an item like ‘41.9BBY’ only the first run of digits, ‘41’, would be extracted instead of ‘41.9’, which may lead to wrong results.
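
A regular expression that allows an optional decimal part keeps values such as 41.9 intact. The following is a minimal sketch using only the standard library re module rather than pandas; the function name years_before_yavin, the sign convention (negative for ABY) and the 'unknown' test value are assumptions made for illustration.

```python
import re
from typing import Optional

# Digits with an optional decimal part, followed by the era suffix.
_BIRTH_YEAR = re.compile(r"^(\d+(?:\.\d+)?)(BBY|ABY)$")


def years_before_yavin(birth_year: str) -> Optional[float]:
    """Return years before the Battle of Yavin (negative for ABY), or None if unparseable."""
    match = _BIRTH_YEAR.match(birth_year.strip())
    if match is None:
        return None
    value = float(match.group(1))
    return value if match.group(2) == "BBY" else -value


print(years_before_yavin("19BBY"))    # 19.0
print(years_before_yavin("41.9BBY"))  # 41.9
print(years_before_yavin("unknown"))  # None
```

Under this convention, the oldest creature is the one with the largest returned value, ignoring entries that return None.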

By the way, Yoda is a small, green humanoid alien who is powerful with the Force and often appears as an old, wise mentor.


The titles of all the films the oldest creature appeared in

With the index of the oldest creature, Yoda, the URLs of the films he appeared in can be read from the films variable. Requesting these URLs, which belong to the films attribute, returns the titles and other information about the films.

film_urls = people_set[index]['films']
titles = []
release_dates = []
# request each film resource and keep its title and release date
for film_url in film_urls:
    film = requests.get(film_url).json()
    titles.append(film['title'])
    release_dates.append(film['release_date'])

df_title = pd.DataFrame(dict(title=titles, release_date=release_dates))
df_title

title	release_date
The Empire Strikes Back	1980-05-17
Return of the Jedi	1983-05-25
The Phantom Menace	1999-05-19
Attack of the Clones	2002-05-16
Revenge of the Sith	2005-05-19

Therefore, Yoda has appeared in 5 films, and their titles are shown above.

In summary, the oldest creature in the Star Wars universe is Yoda, a humanoid alien. He has appeared in 5 films: The Empire Strikes Back, Return of the Jedi, The Phantom Menace, Attack of the Clones and Revenge of the Sith.
