Malaria is a life-threatening disease caused by parasites that are transmitted to humans through the bites of infected mosquitoes. It has afflicted humans throughout history and remains one of the major causes of death. According to the World Health Organization (WHO), almost half of the world’s population is at risk of Malaria. To better understand this powerful enemy, the Wellcome Trust and Sage Bionetworks have hosted a Data Challenge based on relevant Malaria datasets.
Three datasets are provided in this project. They contain information about the incidence of and deaths from Malaria by country, for all ages, across the world. To understand how broadly and deeply Malaria affects the world, a world map showing the deaths caused by Malaria in each country is a meaningful starting point.
The dataset used to draw the map is malaria_deaths.csv, which gives the age-standardized number of deaths caused by Malaria per 100,000 people for each country from 1990 to 2016. However, this information alone is not enough to draw a map. An additional dataset containing the geographic boundaries of each country is needed. It can be obtained from the Natural Earth website, where the green “Download countries” button links to the file we need. The two datasets can then be merged with the following code.
import pandas as pd
import numpy as np
import geopandas as gpd
malaria_deaths = pd.read_csv('Data/malaria_deaths.csv')
# shorter column name
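malaria_deaths.rename(columns = {'Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people)':
                                 'Deaths per 100,000'}, inplace = True)
# bin the death rates; include_lowest = True keeps a rate of exactly 0 in the first bin
# (pd.cut intervals are right-closed by default, so 0 would otherwise become NaN)
malaria_deaths['bin'] = pd.cut(malaria_deaths['Deaths per 100,000'], [0, 25, 50, 100, 150, 200, 250, 300],
                               labels = ['0-25', '25-50', '50-100', '100-150', '150-200', '200-250', '>250'],
                               include_lowest = True)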
# geographic data
gdf = gpd.read_file('Data/ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp')[['ADM0_A3', 'geometry']].to_crs('+proj=robin')
gdf = gdf[gdf['ADM0_A3'] != 'ATA'] ## Antarctica is not necessary for our analysis
# generalized function to get the merged data for a given year
def merged_data(year):
    df = malaria_deaths[malaria_deaths['Year'] == year][['Entity', 'Code', 'bin']]
    return gdf.merge(df, left_on = 'ADM0_A3', right_on = 'Code', how = 'left')
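A note on the binning step: pd.cut uses right-closed intervals by default, so a value exactly equal to the lowest edge matches no bin unless include_lowest=True is passed, and any value above the top edge also becomes NaN. The sketch below illustrates this on hypothetical toy rates (not values from malaria_deaths.csv), using the same edges and labels as above.

```python
import pandas as pd

# Toy death rates (hypothetical values), including the edge case 0.0
rates = pd.Series([0.0, 12.5, 25.0, 80.0, 260.0])

edges = [0, 25, 50, 100, 150, 200, 250, 300]
labels = ['0-25', '25-50', '50-100', '100-150', '150-200', '200-250', '>250']

# Default behaviour: intervals are (left, right], so 0.0 matches no bin
default_bins = pd.cut(rates, edges, labels=labels)

# include_lowest=True closes the first interval on the left
closed_bins = pd.cut(rates, edges, labels=labels, include_lowest=True)

print(pd.DataFrame({'rate': rates,
                    'default': default_bins,
                    'include_lowest': closed_bins}))
```

Since unbinned rows are indistinguishable from countries with no data once they are merged, it may be worth checking how many rates in the real dataset fall exactly on 0 or above 300.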
The merged_data function returns a dataset that combines the death rates with the geographic information for each country, which can be used to draw a map for each year. The resulting figure is shown below.
The gray areas of Figure 1 mark countries for which Malaria death data is not available. Among the countries with data, Malaria is clearly most severe in Africa: almost all cases with more than 100 deaths per 100,000 people appear there. This should draw the attention of the relevant health organizations. In addition, the global situation in 2015 is much better than in 1990, which is consistent with the efforts made by WHO to contain the disease.
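The countries without data follow directly from the left merge in merged_data: every geographic row is kept, and rows with no matching country code receive NaN in the bin column. A minimal sketch with hypothetical stand-in tables (the geometry column is omitted, and the three codes are chosen only for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the geographic table (geometry column omitted)
geo = pd.DataFrame({'ADM0_A3': ['NGA', 'FRA', 'GRL']})

# Hypothetical stand-in for one year of binned death rates
deaths = pd.DataFrame({'Code': ['NGA', 'FRA'],
                       'bin': ['100-150', '0-25']})

# how='left' keeps every row of geo; unmatched countries get NaN
merged = geo.merge(deaths, left_on='ADM0_A3', right_on='Code', how='left')

# Countries that would have no death data to plot
missing = merged.loc[merged['bin'].isna(), 'ADM0_A3'].tolist()
print(missing)  # ['GRL']
```

When plotting with geopandas, the missing_kwds argument of GeoDataFrame.plot can be used to style such rows, for example in gray; whether the original figure was produced this way is not shown in this article.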
After this overview of the geographic distribution of Malaria, we turn to the age group most vulnerable to the disease. According to the relevant literature, children are the demographic group most at risk of severe infection. The data supports this conclusion.
The data used here is malaria_deaths_age.csv. It provides country-level Malaria deaths for different age groups from 1990 to 2016. Since our focus is the overall distribution of deaths across demographic groups, the country information is not needed and can be aggregated away with groupby as follows.
malaria_deaths_age = pd.read_csv('Data/malaria_deaths_age.csv', usecols = ['entity', 'code', 'year', 'age_group', 'deaths'])
deaths_sum = malaria_deaths_age.groupby(['year', 'age_group'])['deaths'].sum()
years = np.unique(malaria_deaths_age['year'])
ages = np.unique(malaria_deaths_age['age_group'])
Based on deaths_sum, years and ages, a line chart can be plotted to represent the demographic death distribution.
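The plotting code itself is not shown here. As a sketch of one possible preparation step, the grouped series can be reshaped so that each age group becomes its own column, which is the layout a line chart needs. The records below are hypothetical toy values with the same column names as malaria_deaths_age, and the age group labels are assumptions.

```python
import pandas as pd

# Hypothetical toy records with the same columns as malaria_deaths_age
toy = pd.DataFrame({
    'year': [1990, 1990, 1990, 1990, 2016, 2016, 2016, 2016],
    'entity': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'age_group': ['Under 5', 'Under 5', '5-14', '5-14'] * 2,
    'deaths': [100.0, 50.0, 10.0, 5.0, 60.0, 30.0, 8.0, 4.0],
})

# Same aggregation as deaths_sum: total deaths per (year, age_group)
toy_sum = toy.groupby(['year', 'age_group'])['deaths'].sum()

# Pivot age groups into columns: one row per year, one column per group
wide = toy_sum.unstack('age_group')
print(wide)
```

With matplotlib installed, wide.plot() would then draw one line per age group against the year; this is only one way to obtain such a chart and not necessarily how Figure 2 was drawn.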
Figure 2 confirms that children are indeed the group most vulnerable to Malaria, and that vulnerability decreases with age, as the lines for older age groups sit lower. Moreover, the total number of deaths among children under 5 is greater than the sum of all the other groups, indicating that early childhood is the period that requires the most attention.
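The comparison between one age group and all the others combined can also be checked numerically rather than read off the chart. The sketch below computes each group's share of the yearly total on hypothetical totals; the numbers and the 'Under 5' label are assumptions for illustration, and the real values would come from deaths_sum.

```python
import pandas as pd

# Hypothetical totals per (year, age_group); real values come from deaths_sum
totals = pd.Series(
    [150.0, 15.0, 40.0, 90.0, 12.0, 30.0],
    index=pd.MultiIndex.from_product(
        [[1990, 2016], ['Under 5', '5-14', '15-49']],
        names=['year', 'age_group'],
    ),
)

# Divide each group total by the total of its year across all groups
share = totals / totals.groupby(level='year').transform('sum')

# Share of the youngest group in each year
under5_share = share.xs('Under 5', level='age_group')
print(under5_share)
```

A share above 0.5 in a given year means that this group accounts for more deaths than all the other groups together.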
Besides Malaria’s geographic and demographic distribution, we may also be interested in how economic conditions relate to Malaria infection. In this part, the country-level incidence of Malaria per 1,000 population at risk is used as the indicator of Malaria severity. In addition to the Malaria dataset malaria_inc.csv, we need information about the economic situation of each country. The most widely used overall economic index is the Gross Domestic Product (GDP) per capita, which in this project is adjusted for Purchasing Power Parity (PPP). The continent of each country is added as well. These datasets can be obtained from the project’s GitHub pages, and the data cleaning and merging process is as follows:
malaria_inc = pd.read_csv('Data/malaria_inc.csv')
malaria_inc.rename(columns = {'Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)':
                              'Incidence per 1,000'}, inplace = True)
# data of population and GDP per capita
data_pop_gdp = pd.read_csv('Data/data_pop_gdp.csv', encoding='Windows-1252')
data_pop_gdp = data_pop_gdp.replace(['Gross domestic product based on purchasing-power-parity (PPP) per capita GDP'],
                                    'GDP')
# add region of country
region = pd.read_csv('Data/region.csv', encoding='Windows-1252')
def data_year(year):
    # merge datasets; even rows of data_pop_gdp are read as GDP, odd rows as population
    df_pop_gdp = pd.DataFrame({'Country': np.array(data_pop_gdp['Country'][::2]),
                               'GDP': np.array(data_pop_gdp[str(year)][::2]),
                               'Population': np.array(data_pop_gdp[str(year)][1:][::2])})
    df_target = df_pop_gdp.merge(malaria_inc[malaria_inc['Year'] == year][['Entity','Incidence per 1,000']],
                                 left_on = 'Country', right_on = 'Entity')
    # strip thousands separators before converting to numbers
    df_target['GDP'] = df_target['GDP'].str.replace(',', '').astype(float)
    df_target['Population'] = df_target['Population'].str.replace(',', '').astype(float)
    df_target = df_target.merge(region, on = 'Country')
    # encode regions as integers; group_codes is returned for the legend
    group_codes = {k: idx for idx, k in enumerate(df_target['Region'].unique())}
    df_target['Region'] = df_target['Region'].apply(lambda x: group_codes[x])
    # countries to annotate: 4 most populous, 1 with the highest incidence, 3 with the highest GDP
    top = pd.concat([df_target.sort_values('Population', ascending = False)[:4],
                     df_target.sort_values('Incidence per 1,000', ascending = False)[:1],
                     df_target.sort_values('GDP', ascending = False)[:3]]).drop_duplicates()
    return df_target, group_codes, top
The data_year function returns three elements: the first is the main dataset to be plotted, while the other two are used for setting the legend and annotations. The final figure is shown below.
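The slicing inside data_year relies on the rows of data_pop_gdp alternating between the two indicators for each country. The following sketch reproduces that step on a hypothetical miniature table; the country names, the 'Subject Descriptor' column and all values are assumptions for illustration, not the real file.

```python
import pandas as pd

# Hypothetical miniature of data_pop_gdp: two rows per country,
# GDP first and population second, with numbers stored as text
raw = pd.DataFrame({
    'Country': ['Country A', 'Country A', 'Country B', 'Country B'],
    'Subject Descriptor': ['GDP', 'Population', 'GDP', 'Population'],
    '2015': ['1,200.5', '3.2', '15,000.0', '48.1'],
})

# Even rows hold GDP, odd rows hold population
tidy = pd.DataFrame({
    'Country': raw['Country'][::2].to_numpy(),
    'GDP': raw['2015'][::2].to_numpy(),
    'Population': raw['2015'][1::2].to_numpy(),
})

# Remove thousands separators, then convert the text to floats
for col in ['GDP', 'Population']:
    tidy[col] = tidy[col].str.replace(',', '').astype(float)

print(tidy)
```

This approach silently misaligns the columns if any country is missing one of its two rows, so checking that the row count is even and that each country appears exactly twice may be a useful safeguard.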
In Figure 3, GDP per capita and the incidence of Malaria per 1,000 people are shown on logarithmic scales for clearer visualization. The point sizes represent relative population. Several points are labelled with their country names because of their large population, high GDP per capita, or severe Malaria incidence. As shown, the incidence of Malaria appears to be negatively correlated with economic condition, suggesting that wealthier countries tend to face a lower risk of Malaria infection. In addition, the whole cloud of points shifts slightly downwards and to the right from 2000 to 2015. This might reflect global improvements in infectious disease control and in economic conditions.
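The visual impression of a negative relationship can be quantified with a correlation coefficient computed on the same logarithmic scales. The sketch below uses hypothetical values that are not taken from the project's data; with the real data, the first element returned by data_year could be used in place of the toy table.

```python
import numpy as np
import pandas as pd

# Hypothetical values, NOT taken from the project's data
toy = pd.DataFrame({
    'GDP': [800.0, 1500.0, 4000.0, 12000.0, 30000.0],
    'Incidence per 1,000': [350.0, 220.0, 60.0, 5.0, 0.5],
})

# Logarithms are undefined for zero, so keep strictly positive rows only
positive = toy[(toy['GDP'] > 0) & (toy['Incidence per 1,000'] > 0)]

log_gdp = np.log10(positive['GDP'])
log_inc = np.log10(positive['Incidence per 1,000'])

# Pearson correlation between the two log-transformed variables
r = log_gdp.corr(log_inc)
print(round(r, 3))
```

A coefficient close to -1 would support the negative association seen in the figure, although it says nothing about causation.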
From the visualizations above, we gain a general understanding of the spread and distribution of Malaria over time. WHO has made significant progress in controlling the disease at the global level. However, there is still a long way to go before Malaria is eradicated. In the future, more detailed analysis and visualization may be needed.
The detailed code for the visualizations and animations can be accessed at the project’s GitHub pages.