Further data processing
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.If you have questions or suggestions about this tutorial, please open an issue here.
This tutorial uses the World Resources Institute Global Power Plants Dataset to explore data pre-processing in Julia. The dataset is created from multiple sources and is under continuous update, which means that there are lots of missing data, non-standard characters, etc Hence plenty of material to work with!
More tutorials on the manipulation of DataFrames can be found here And some more information can be found on this wikipage.
import MLJ: schema, std, mean, median, coerce, coerce!, scitype
using DataFrames
using UrlDownload
Import data
raw_data = urldownload("https://github.com/tlienart/DataScienceTutorialsData.jl/blob/master/data/wri_global_power_plant_db_be_022020.csv?raw=true")
data = DataFrame(raw_data);
This dataset contains information on power generation plants for a number of countries around the world. The level of disaggregation is the power plant. For each plant, there is information about its name, localisation, capacity, and many other features. The schema function enables us to get a quick overview of the variables it contains, including their machine and scentific types.
schema(data)
┌──────────────────────────┬────────────────────────────┬──────────────────────────┐
│ names │ scitypes │ types │
├──────────────────────────┼────────────────────────────┼──────────────────────────┤
│ country │ Textual │ String3 │
│ country_long │ Textual │ String │
│ name │ Textual │ String │
│ gppd_idnr │ Textual │ String15 │
│ capacity_mw │ Continuous │ Float64 │
│ latitude │ Continuous │ Float64 │
│ longitude │ Continuous │ Float64 │
│ primary_fuel │ Textual │ String15 │
│ other_fuel1 │ Union{Missing, Textual} │ Union{Missing, String15} │
│ other_fuel2 │ Union{Missing, Textual} │ Union{Missing, String7} │
│ other_fuel3 │ Union{Missing, Textual} │ Union{Missing, String7} │
│ commissioning_year │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ owner │ Union{Missing, Textual} │ Union{Missing, String} │
│ source │ Union{Missing, Textual} │ Union{Missing, String} │
│ url │ Union{Missing, Textual} │ Union{Missing, String} │
│ geolocation_source │ Union{Missing, Textual} │ Union{Missing, String} │
│ wepp_id │ Union{Missing, Textual} │ Union{Missing, String31} │
│ year_of_capacity_data │ Union{Missing, Count} │ Union{Missing, Int64} │
│ generation_gwh_2013 │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ generation_gwh_2014 │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ generation_gwh_2015 │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ generation_gwh_2016 │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ generation_gwh_2017 │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ generation_data_source │ Union{Missing, Textual} │ Union{Missing, String} │
│ estimated_generation_gwh │ Union{Missing, Continuous} │ Union{Missing, Float64} │
└──────────────────────────┴────────────────────────────┴──────────────────────────┘
We see that a small number of features have values for all plants (i.e. for each row) present in the dataset. However, (i) several features have missing values (Union{Missing, _.type}) and (ii) we are not interested in working with all of these features. In particular, we're not intersted in the source of the information present in the dataset nor are we interested in the generation data. Hence we drop all columns which contain information's source. We define a function is_active()
that will return a TRUE
boolean value if the column name does NOT (!
) contain either of the strings "source" or "generation". Note the conversion of column names from :Symbol
to :string
since the occursing
function only accepts strings as arguments.
is_active(col) = !occursin(r"source|generation", string(col))
active_cols = [col for col in names(data) if is_active(col)]
select!(data, active_cols);
We also drop a number of other unwanted columns and take a look at our "new" dataframe.
select!(data, Not([:wepp_id, :url, :owner]))
schema(data)
┌───────────────────────┬────────────────────────────┬──────────────────────────┐
│ names │ scitypes │ types │
├───────────────────────┼────────────────────────────┼──────────────────────────┤
│ country │ Textual │ String3 │
│ country_long │ Textual │ String │
│ name │ Textual │ String │
│ gppd_idnr │ Textual │ String15 │
│ capacity_mw │ Continuous │ Float64 │
│ latitude │ Continuous │ Float64 │
│ longitude │ Continuous │ Float64 │
│ primary_fuel │ Textual │ String15 │
│ other_fuel1 │ Union{Missing, Textual} │ Union{Missing, String15} │
│ other_fuel2 │ Union{Missing, Textual} │ Union{Missing, String7} │
│ other_fuel3 │ Union{Missing, Textual} │ Union{Missing, String7} │
│ commissioning_year │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ year_of_capacity_data │ Union{Missing, Count} │ Union{Missing, Int64} │
└───────────────────────┴────────────────────────────┴──────────────────────────┘
The remaining variables have two different scientific types: Continuous, Textual Of which we can get an overview.
describe(data)
# The describe() function shows that there are several features with missing values.
13×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 Type
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ country AFG ZWE 0 String3
2 │ country_long Afghanistan Zimbabwe 0 String
3 │ name 'Muela \u200bVärtaverket 0 String
4 │ gppd_idnr ARG0000001 WRI1075863 0 String15
5 │ capacity_mw 168.993 1.0 18.3 22500.0 0 Float64
6 │ latitude 32.5014 -77.847 39.5835 71.292 0 Float64
7 │ longitude -4.1955 -179.978 -1.2744 179.389 0 Float64
8 │ primary_fuel Biomass Wind 0 String15
9 │ other_fuel1 Biomass Wind 31680 Union{Missing, String15}
10 │ other_fuel2 Biomass Wind 33340 Union{Missing, String7}
11 │ other_fuel3 Biomass Wind 33539 Union{Missing, String7}
12 │ commissioning_year 1995.49 1896.0 2005.0 2018.0 17340 Union{Missing, Float64}
13 │ year_of_capacity_data 2016.86 2000 2017.0 2018 19900 Union{Missing, Int64}
Note: the describe()
function is from the DataFrames
package (and won't work with other, non DataFrames, tables) whereas the schema()
is from the MLJ package.
Let's play around with capacity data, for which there are no missing values. We create a sub-dataframe and aggregate over certain dimensions (country and primary_fuel)
capacity = select(data, [:country, :primary_fuel, :capacity_mw]);
first(capacity, 5)
5×3 DataFrame
Row │ country primary_fuel capacity_mw
│ String3 String15 Float64
─────┼────────────────────────────────────
1 │ AFG Hydro 33.0
2 │ AFG Solar 10.0
3 │ AFG Solar 10.0
4 │ AFG Hydro 66.0
5 │ AFG Hydro 100.0
This dataframe contains several subgroups (country and technology type) and it would be interesting to get data aggregates by subgroup. To obtain a view
of the DataFrame by subgroup, we can use the groupby
function. (See the DataFrame tutorial for an introduction to the use of groupby
)
cap_gr = groupby(capacity, [:country, :primary_fuel]);
If we want to aggregate at the country-fuel-type level and calculate summary statistics at this level, we can use the combine
function on the GroupedDataFrame that we just created. This function takes the GroupedDataFrame, the symbol of the column on which to apply the measure of choice as arguments.
cap_mean = combine(cap_gr, :capacity_mw => mean)
cap_sum = combine(cap_gr, :capacity_mw => sum)
first(cap_sum, 3)
3×3 DataFrame
Row │ country primary_fuel capacity_mw_sum
│ String3 String15 Float64
─────┼────────────────────────────────────────
1 │ AFG Hydro 238.55
2 │ AFG Solar 20.0
3 │ AFG Gas 42.0
Now let's plot some of this aggregate data for a selection of countries, by country and technology type
ctry_selec = r"BEL|FRA|DEU"
tech_selec = r"Solar"
cap_sum_plot = cap_sum[occursin.(ctry_selec, cap_sum.country) .& occursin.(tech_selec, cap_sum.primary_fuel), :]
3×3 DataFrame
Row │ country primary_fuel capacity_mw_sum
│ String3 String15 Float64
─────┼────────────────────────────────────────
1 │ BEL Solar 116.2
2 │ FRA Solar 4910.42
3 │ DEU Solar 6770.56
Note the .
for element-wise comparison Before plotting, we can also sort values by decreasing order using sort!()
.
sort!(cap_sum_plot, :capacity_mw_sum, rev=true)
using Plots
Plots.bar(cap_sum_plot.country, cap_sum_plot.capacity_mw_sum, legend=false)
Now that we have the total capacity by country and technology type, let's use it to calculate the share of each technology in total capacity. To that end we first create a dataframe containing the country-level total capacity, using the same steps as above.
cap_sum_ctry_gd = groupby(capacity, [:country]);
cap_sum_ctry = combine(cap_sum_ctry_gd, :capacity_mw => sum);
The we join this dataframe with the disaggregated one; which requires that we convert the two GroupedDataFrame into DataFrames.
cap_sum = DataFrame(cap_sum);
cap_sum_ctry = DataFrame(cap_sum_ctry);
cap_share = leftjoin(cap_sum, cap_sum_ctry, on = :country, makeunique = true)
cap_share.capacity_mw_share = cap_share.capacity_mw_sum ./ cap_share.capacity_mw_sum_1;
Let's visualise our dataframe again, which now includes the capacity_mw_share
column.
Now let's analyse features which exhibit some missing values. Suppose we want to calculate the age of each plant (rounded to full years). We face two issues. First, the commissioning_year is not reported for all plants. We need to gauge the representativity of the plants for which it is available with regard to the full dataset. One way to count the missing values is
nMissings = length(findall(x -> ismissing(x), data.commissioning_year))
17340
This represents about half of our observations
nMissings_share = nMissings/size(data)[1]
0.515411824153613
Second, the commissioning year is not reported as an integer. Fractions of years are reported too. As a result, the machine type of data.commissioning_year
is Float64.
typeof(data.commissioning_year)
Vector{Union{Missing, Float64}} (alias for Array{Union{Missing, Float64}, 1})
Before calculating the average age, let's drop the missing values.
data_nmiss = dropmissing(data, :commissioning_year);
And round the year to the closest integer. We can do this using the round
function and a mapping function on the relevant DataFrame column.
map!(x -> round(x, digits=0), data_nmiss.commissioning_year, data_nmiss.commissioning_year);
# We can now calculate plant age for each plant (worth remembering that the dataset only contains active plants)
current_year = fill!(Array{Float64}(undef, size(data_nmiss)[1]), 2020);
data_nmiss[:, :plant_age] = current_year - data_nmiss[:, :commissioning_year];
Since the commissioning year is missing for about half the plants in the dataset (17340, see description of data above) and that missing values propagate, the plant age will only be available for 33643-17340 plants. Let's see what the mean and median plant ages are across the plants for which we have the data
mean_age = mean(skipmissing(data_nmiss.plant_age))
median_age = median(skipmissing(data_nmiss.plant_age))
15.0
And bring this into a frequency plot of the plant age observations
histogram(data_nmiss.plant_age, color="blue", bins=100, label="Plant Age Frequency",
normalize=:pdf, alpha=0.5, xlim=(0,130))
vline!([mean_age], linewidth=2, color="red", label="Mean Age")
vline!([median_age], linewidth=2, color="orange", label="Median Age")
We can also calculate and plot average plant age by country and technology Make sure all columns passed, other than the aggregation dimensions, are of type Float
or Int
, otherwise the function execution will fail.
age = select(data_nmiss, [:country, :primary_fuel, :plant_age])
age_mean = combine(groupby(age, [:country, :primary_fuel]), :plant_age => mean)
coal_means = age_mean[occursin.(ctry_selec, age_mean.country) .& occursin.(r"Coal", age_mean.primary_fuel), :]
gas_means = age_mean[occursin.(ctry_selec, age_mean.country) .& occursin.(r"Gas", age_mean.primary_fuel), :]
2×3 DataFrame
Row │ country primary_fuel plant_age_mean
│ String3 String15 Float64
─────┼───────────────────────────────────────
1 │ BEL Gas 16.5
2 │ DEU Gas 25.6867
fig.suptitle("Mean plant age by country and technology")
p1 = Plots.bar(coal_means.country, coal_means.plant_age_mean, ylabel="Age", title="Coal")
p2 = Plots.bar(gas_means.country, gas_means.plant_age_mean, title="Gas")
plot(p1, p2, layout=(1, 2), size=(900,600), plot_title="Mean plant age by country and technology")