Further data processing

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

More data processing

This tutorial uses the World Resources Institute Global Power Plants Dataset to explore data pre-processing in Julia. The dataset is created from multiple sources and is under continuous update, which means that there are lots of missing data, non-standard characters, etc Hence plenty of material to work with!

More tutorials on the manipulation of DataFrames can be found here And some more information can be found on this wikipage.

import MLJ: schema, std, mean, median, coerce, coerce!, scitype
using DataFrames
using UrlDownload

Import data

raw_data = urldownload("https://github.com/tlienart/DataScienceTutorialsData.jl/blob/master/data/wri_global_power_plant_db_be_022020.csv?raw=true")
data = DataFrame(raw_data);

This dataset contains information on power generation plants for a number of countries around the world. The level of disaggregation is the power plant. For each plant, there is information about its name, localisation, capacity, and many other features. The schema function enables us to get a quick overview of the variables it contains, including their machine and scentific types.

schema(data)

┌──────────────────────────┬────────────────────────────┬──────────────────────────┐
│ names                    │ scitypes                   │ types                    │
├──────────────────────────┼────────────────────────────┼──────────────────────────┤
│ country                  │ Textual                    │ String3                  │
│ country_long             │ Textual                    │ String                   │
│ name                     │ Textual                    │ String                   │
│ gppd_idnr                │ Textual                    │ String15                 │
│ capacity_mw              │ Continuous                 │ Float64                  │
│ latitude                 │ Continuous                 │ Float64                  │
│ longitude                │ Continuous                 │ Float64                  │
│ primary_fuel             │ Textual                    │ String15                 │
│ other_fuel1              │ Union{Missing, Textual}    │ Union{Missing, String15} │
│ other_fuel2              │ Union{Missing, Textual}    │ Union{Missing, String7}  │
│ other_fuel3              │ Union{Missing, Textual}    │ Union{Missing, String7}  │
│ commissioning_year       │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ owner                    │ Union{Missing, Textual}    │ Union{Missing, String}   │
│ source                   │ Union{Missing, Textual}    │ Union{Missing, String}   │
│ url                      │ Union{Missing, Textual}    │ Union{Missing, String}   │
│ geolocation_source       │ Union{Missing, Textual}    │ Union{Missing, String}   │
│ wepp_id                  │ Union{Missing, Textual}    │ Union{Missing, String31} │
│ year_of_capacity_data    │ Union{Missing, Count}      │ Union{Missing, Int64}    │
│ generation_gwh_2013      │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ generation_gwh_2014      │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ generation_gwh_2015      │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ generation_gwh_2016      │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ generation_gwh_2017      │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ generation_data_source   │ Union{Missing, Textual}    │ Union{Missing, String}   │
│ estimated_generation_gwh │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
└──────────────────────────┴────────────────────────────┴──────────────────────────┘

We see that a small number of features have values for all plants (i.e. for each row) present in the dataset. However, (i) several features have missing values (Union{Missing, _.type}) and (ii) we are not interested in working with all of these features. In particular, we're not intersted in the source of the information present in the dataset nor are we interested in the generation data. Hence we drop all columns which contain information's source. We define a function is_active() that will return a TRUE boolean value if the column name does NOT (!) contain either of the strings "source" or "generation". Note the conversion of column names from :Symbol to :string since the occursing function only accepts strings as arguments.

is_active(col) = !occursin(r"source|generation", string(col))
active_cols = [col for col in names(data) if is_active(col)]
select!(data, active_cols);

We also drop a number of other unwanted columns and take a look at our "new" dataframe.

select!(data, Not([:wepp_id, :url, :owner]))
schema(data)

┌───────────────────────┬────────────────────────────┬──────────────────────────┐
│ names                 │ scitypes                   │ types                    │
├───────────────────────┼────────────────────────────┼──────────────────────────┤
│ country               │ Textual                    │ String3                  │
│ country_long          │ Textual                    │ String                   │
│ name                  │ Textual                    │ String                   │
│ gppd_idnr             │ Textual                    │ String15                 │
│ capacity_mw           │ Continuous                 │ Float64                  │
│ latitude              │ Continuous                 │ Float64                  │
│ longitude             │ Continuous                 │ Float64                  │
│ primary_fuel          │ Textual                    │ String15                 │
│ other_fuel1           │ Union{Missing, Textual}    │ Union{Missing, String15} │
│ other_fuel2           │ Union{Missing, Textual}    │ Union{Missing, String7}  │
│ other_fuel3           │ Union{Missing, Textual}    │ Union{Missing, String7}  │
│ commissioning_year    │ Union{Missing, Continuous} │ Union{Missing, Float64}  │
│ year_of_capacity_data │ Union{Missing, Count}      │ Union{Missing, Int64}    │
└───────────────────────┴────────────────────────────┴──────────────────────────┘

The remaining variables have two different scientific types: Continuous, Textual Of which we can get an overview.

describe(data)

# The describe() function shows that there are several features with missing values.

13×7 DataFrame
 Row │ variable               mean     min          median   max                nmissing  eltype
     │ Symbol                 Union…   Any          Union…   Any                Int64     Type
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ country                         AFG                   ZWE                       0  String3
   2 │ country_long                    Afghanistan           Zimbabwe                  0  String
   3 │ name                            'Muela                \u200bVärtaverket         0  String
   4 │ gppd_idnr                       ARG0000001            WRI1075863                0  String15
   5 │ capacity_mw            168.993  1.0          18.3     22500.0                   0  Float64
   6 │ latitude               32.5014  -77.847      39.5835  71.292                    0  Float64
   7 │ longitude              -4.1955  -179.978     -1.2744  179.389                   0  Float64
   8 │ primary_fuel                    Biomass               Wind                      0  String15
   9 │ other_fuel1                     Biomass               Wind                  31680  Union{Missing, String15}
  10 │ other_fuel2                     Biomass               Wind                  33340  Union{Missing, String7}
  11 │ other_fuel3                     Biomass               Wind                  33539  Union{Missing, String7}
  12 │ commissioning_year     1995.49  1896.0       2005.0   2018.0                17340  Union{Missing, Float64}
  13 │ year_of_capacity_data  2016.86  2000         2017.0   2018                  19900  Union{Missing, Int64}

Note: the describe() function is from the DataFrames package (and won't work with other, non DataFrames, tables) whereas the schema() is from the MLJ package.

Let's play around with capacity data, for which there are no missing values. We create a sub-dataframe and aggregate over certain dimensions (country and primary_fuel)

capacity = select(data, [:country, :primary_fuel, :capacity_mw]);
first(capacity, 5)

5×3 DataFrame
 Row │ country  primary_fuel  capacity_mw
     │ String3  String15      Float64
─────┼────────────────────────────────────
   1 │ AFG      Hydro                33.0
   2 │ AFG      Solar                10.0
   3 │ AFG      Solar                10.0
   4 │ AFG      Hydro                66.0
   5 │ AFG      Hydro               100.0

This dataframe contains several subgroups (country and technology type) and it would be interesting to get data aggregates by subgroup. To obtain a view of the DataFrame by subgroup, we can use the groupby function. (See the DataFrame tutorial for an introduction to the use of groupby)

cap_gr = groupby(capacity, [:country, :primary_fuel]);

If we want to aggregate at the country-fuel-type level and calculate summary statistics at this level, we can use the combine function on the GroupedDataFrame that we just created. This function takes the GroupedDataFrame, the symbol of the column on which to apply the measure of choice as arguments.

cap_mean = combine(cap_gr, :capacity_mw => mean)
cap_sum = combine(cap_gr, :capacity_mw => sum)
first(cap_sum, 3)

3×3 DataFrame
 Row │ country  primary_fuel  capacity_mw_sum
     │ String3  String15      Float64
─────┼────────────────────────────────────────
   1 │ AFG      Hydro                  238.55
   2 │ AFG      Solar                   20.0
   3 │ AFG      Gas                     42.0

Now let's plot some of this aggregate data for a selection of countries, by country and technology type

ctry_selec = r"BEL|FRA|DEU"
tech_selec = r"Solar"

cap_sum_plot = cap_sum[occursin.(ctry_selec, cap_sum.country) .& occursin.(tech_selec, cap_sum.primary_fuel), :]

3×3 DataFrame
 Row │ country  primary_fuel  capacity_mw_sum
     │ String3  String15      Float64
─────┼────────────────────────────────────────
   1 │ BEL      Solar                  116.2
   2 │ FRA      Solar                 4910.42
   3 │ DEU      Solar                 6770.56

Note the . for element-wise comparison Before plotting, we can also sort values by decreasing order using sort!().

sort!(cap_sum_plot, :capacity_mw_sum, rev=true)

using Plots

Plots.bar(cap_sum_plot.country, cap_sum_plot.capacity_mw_sum, legend=false)

Now that we have the total capacity by country and technology type, let's use it to calculate the share of each technology in total capacity. To that end we first create a dataframe containing the country-level total capacity, using the same steps as above.

cap_sum_ctry_gd = groupby(capacity, [:country]);
cap_sum_ctry = combine(cap_sum_ctry_gd, :capacity_mw => sum);

The we join this dataframe with the disaggregated one; which requires that we convert the two GroupedDataFrame into DataFrames.

cap_sum = DataFrame(cap_sum);
cap_sum_ctry = DataFrame(cap_sum_ctry);
cap_share = leftjoin(cap_sum, cap_sum_ctry, on = :country, makeunique = true)
cap_share.capacity_mw_share = cap_share.capacity_mw_sum ./ cap_share.capacity_mw_sum_1;

Let's visualise our dataframe again, which now includes the capacity_mw_share column.

Now let's analyse features which exhibit some missing values. Suppose we want to calculate the age of each plant (rounded to full years). We face two issues. First, the commissioning_year is not reported for all plants. We need to gauge the representativity of the plants for which it is available with regard to the full dataset. One way to count the missing values is

nMissings = length(findall(x -> ismissing(x), data.commissioning_year))

This represents about half of our observations

nMissings_share = nMissings/size(data)[1]

0.515411824153613

Second, the commissioning year is not reported as an integer. Fractions of years are reported too. As a result, the machine type of data.commissioning_yearis Float64.

typeof(data.commissioning_year)

Vector{Union{Missing, Float64}} (alias for Array{Union{Missing, Float64}, 1})

Before calculating the average age, let's drop the missing values.

data_nmiss = dropmissing(data, :commissioning_year);

And round the year to the closest integer. We can do this using the round function and a mapping function on the relevant DataFrame column.

map!(x -> round(x, digits=0), data_nmiss.commissioning_year, data_nmiss.commissioning_year);

# We can now calculate plant age for each plant (worth remembering that the dataset only contains active plants)

current_year = fill!(Array{Float64}(undef, size(data_nmiss)[1]), 2020);
data_nmiss[:, :plant_age] = current_year - data_nmiss[:, :commissioning_year];

Since the commissioning year is missing for about half the plants in the dataset (17340, see description of data above) and that missing values propagate, the plant age will only be available for 33643-17340 plants. Let's see what the mean and median plant ages are across the plants for which we have the data

mean_age = mean(skipmissing(data_nmiss.plant_age))
median_age = median(skipmissing(data_nmiss.plant_age))

15.0

And bring this into a frequency plot of the plant age observations

histogram(data_nmiss.plant_age, color="blue",  bins=100, label="Plant Age Frequency",
          normalize=:pdf, alpha=0.5, xlim=(0,130))
vline!([mean_age], linewidth=2, color="red", label="Mean Age")
vline!([median_age], linewidth=2, color="orange", label="Median Age")

We can also calculate and plot average plant age by country and technology Make sure all columns passed, other than the aggregation dimensions, are of type Float or Int, otherwise the function execution will fail.

age = select(data_nmiss, [:country, :primary_fuel, :plant_age])
age_mean = combine(groupby(age, [:country, :primary_fuel]), :plant_age => mean)

coal_means = age_mean[occursin.(ctry_selec, age_mean.country) .& occursin.(r"Coal", age_mean.primary_fuel), :]
gas_means = age_mean[occursin.(ctry_selec, age_mean.country) .& occursin.(r"Gas", age_mean.primary_fuel), :]

2×3 DataFrame
 Row │ country  primary_fuel  plant_age_mean
     │ String3  String15      Float64
─────┼───────────────────────────────────────
   1 │ BEL      Gas                  16.5
   2 │ DEU      Gas                  25.6867

fig.suptitle("Mean plant age by country and technology")

p1 = Plots.bar(coal_means.country, coal_means.plant_age_mean, ylabel="Age", title="Coal")
p2 = Plots.bar(gas_means.country, gas_means.plant_age_mean, title="Gas")

plot(p1, p2, layout=(1, 2), size=(900,600), plot_title="Mean plant age by country and technology")

‎

Further data processing

More data processing