Manipulating a DataFrame

To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.

If you have questions or suggestions about this tutorial, please open an issue here.

This tutorial is loosely adapted from this pandas tutorial as well as the DataFrames.jl documentation. It is by no means meant to be a complete introduction, rather, it focuses on some key functionalities that are particularly useful in a classical machine learning context.

Basics

To start with, we will use the Boston dataset which is very simple.

using RDatasets
using DataFrames

boston = dataset("MASS", "Boston");

The dataset function returns a DataFrame object:

typeof(boston)
DataFrames.DataFrame

Accessing data

Intuitively a DataFrame is just a wrapper around a number of columns, each of which is a Vector of some type with a name"

names(boston)
14-element Vector{String}:
 "Crim"
 "Zn"
 "Indus"
 "Chas"
 "NOx"
 "Rm"
 "Age"
 "Dis"
 "Rad"
 "Tax"
 "PTRatio"
 "Black"
 "LStat"
 "MedV"

You can view the first few rows using first and specifying a number of rows:

first(boston, 4)
4×14 DataFrame
 Row │ Crim     Zn       Indus    Chas   NOx      Rm       Age      Dis      Rad    Tax    PTRatio  Black    LStat    MedV
     │ Float64  Float64  Float64  Int64  Float64  Float64  Float64  Float64  Int64  Int64  Float64  Float64  Float64  Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.00632     18.0     2.31      0    0.538    6.575     65.2   4.09        1    296     15.3   396.9      4.98     24.0
   2 │ 0.02731      0.0     7.07      0    0.469    6.421     78.9   4.9671      2    242     17.8   396.9      9.14     21.6
   3 │ 0.02729      0.0     7.07      0    0.469    7.185     61.1   4.9671      2    242     17.8   392.83     4.03     34.7
   4 │ 0.03237      0.0     2.18      0    0.458    6.998     45.8   6.0622      3    222     18.7   394.63     2.94     33.4

You can access one of those columns easily using .colname, this returns a vector that you can access like any Julia vector:

boston.Crim[1:5]
5-element Vector{Float64}:
 0.00632
 0.02731
 0.02729
 0.03237
 0.06905

You can also just access the dataframe as you would a big matrix:

boston[3, 5]
0.469

or specifying a range of rows/columns:

boston[1:5, [:Crim, :Zn]]
5×2 DataFrame
 Row │ Crim     Zn
     │ Float64  Float64
─────┼──────────────────
   1 │ 0.00632     18.0
   2 │ 0.02731      0.0
   3 │ 0.02729      0.0
   4 │ 0.03237      0.0
   5 │ 0.06905      0.0

or, similarly,

boston[1:5, 1:2]
5×2 DataFrame
 Row │ Crim     Zn
     │ Float64  Float64
─────┼──────────────────
   1 │ 0.00632     18.0
   2 │ 0.02731      0.0
   3 │ 0.02729      0.0
   4 │ 0.03237      0.0
   5 │ 0.06905      0.0

The select function is very convenient to get sub dataframes of interest:

b1 = select(boston, [:Crim, :Zn, :Indus])
first(b1, 2)
2×3 DataFrame
 Row │ Crim     Zn       Indus
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │ 0.00632     18.0     2.31
   2 │ 0.02731      0.0     7.07

The Not syntax is also very useful:

b2 = select(boston, Not(:NOx))
first(b2, 2)
2×13 DataFrame
 Row │ Crim     Zn       Indus    Chas   Rm       Age      Dis      Rad    Tax    PTRatio  Black    LStat    MedV
     │ Float64  Float64  Float64  Int64  Float64  Float64  Float64  Int64  Int64  Float64  Float64  Float64  Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.00632     18.0     2.31      0    6.575     65.2   4.09        1    296     15.3    396.9     4.98     24.0
   2 │ 0.02731      0.0     7.07      0    6.421     78.9   4.9671      2    242     17.8    396.9     9.14     21.6

Finally, if you would like to drop columns, you can use select! which will mutate the dataframe in place:

select!(b1, Not(:Crim))
first(b1, 2)
2×2 DataFrame
 Row │ Zn       Indus
     │ Float64  Float64
─────┼──────────────────
   1 │    18.0     2.31
   2 │     0.0     7.07

Describing the data

StatsBase offers a convenient describe function which you can use on a DataFrame to get an overview of the data:

using StatsBase
describe(boston, :min, :max, :mean, :median, :std)
14×6 DataFrame
 Row │ variable  min        max       mean        median     std
     │ Symbol    Real       Real      Float64     Float64    Float64
─────┼──────────────────────────────────────────────────────────────────
   1 │ Crim        0.00632   88.9762    3.61352     0.25651    8.60155
   2 │ Zn          0.0      100.0      11.3636      0.0       23.3225
   3 │ Indus       0.46      27.74     11.1368      9.69       6.86035
   4 │ Chas        0          1         0.06917     0.0        0.253994
   5 │ NOx         0.385      0.871     0.554695    0.538      0.115878
   6 │ Rm          3.561      8.78      6.28463     6.2085     0.702617
   7 │ Age         2.9      100.0      68.5749     77.5       28.1489
   8 │ Dis         1.1296    12.1265    3.79504     3.20745    2.10571
   9 │ Rad         1         24         9.54941     5.0        8.70726
  10 │ Tax       187        711       408.237     330.0      168.537
  11 │ PTRatio    12.6       22.0      18.4555     19.05       2.16495
  12 │ Black       0.32     396.9     356.674     391.44      91.2949
  13 │ LStat       1.73      37.97     12.6531     11.36       7.14106
  14 │ MedV        5.0       50.0      22.5328     21.2        9.1971

You can pass a number of symbols to the describe function to indicate which statistics to compute for each feature:

  • mean, std, min, max, median, first, last are all fairly self explanatory

  • q25, q75 are respectively for the 25th and 75th percentile,

  • eltype, nunique, nmissing can also be used

You can also pass your custom function with a pair name => function for instance:

foo(x) = sum(abs.(x)) / length(x)
d = describe(boston, :mean, :median, foo => :foo)
first(d, 3)
3×4 DataFrame
 Row │ variable  mean      median   foo
     │ Symbol    Float64   Float64  Float64
─────┼───────────────────────────────────────
   1 │ Crim       3.61352  0.25651   3.61352
   2 │ Zn        11.3636   0.0      11.3636
   3 │ Indus     11.1368   9.69     11.1368

The describe function returns a derived object with one row per feature and one column per required statistic.

Further to StatsBase, Statistics offers a range of useful functions for data analysis.

using Statistics

Converting the data

If you want to get the content of the dataframe as one big matrix, use convert:

mat = Matrix(boston)
mat[1:3, 1:3]
3×3 Matrix{Float64}:
 0.00632  18.0  2.31
 0.02731   0.0  7.07
 0.02729   0.0  7.07

Adding columns

Adding a column to a dataframe is very easy:

boston.Crim_x_Zn = boston.Crim .* boston.Zn;

that's it! Remember also that you can drop columns or make subselections with select and select!.

Missing values

Let's load a dataset with missing values

mao = dataset("gap", "mao")
describe(mao, :nmissing)
19×2 DataFrame
 Row │ variable  nmissing
     │ Symbol    Int64
─────┼────────────────────
   1 │ ID               0
   2 │ Type             0
   3 │ Gender           0
   4 │ Age            188
   5 │ AAO            296
   6 │ AAD            295
   7 │ UPDRS          301
   8 │ MAOAI2           0
   9 │ AI2Code          0
  10 │ MAOBI2           0
  11 │ BI2Code          0
  12 │ GTBEX3           0
  13 │ BEX3Code         0
  14 │ MAOAVNTR        46
  15 │ VNTRCode        46
  16 │ VNTRCod2        46
  17 │ MAOA31          36
  18 │ MAO31COD        36
  19 │ MAO31CO2        36

Lots of missing values... If you wanted to compute simple functions on columns, they may just return missing:

std(mao.Age)
missing

The skipmissing function can help counter this easily:

std(skipmissing(mao.Age))
11.551715894987558

Split-Apply-Combine

This is a shorter version of the DataFrames.jl tutorial.

iris = dataset("datasets", "iris")
first(iris, 3)
3×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa

groupby

The groupby function allows to form "sub-dataframes" corresponding to groups of rows. This can be very convenient to run specific analyses for specific groups without copying the data.

The basic usage is groupby(df, cols) where cols specifies one or several columns to use for the grouping.

Consider a simple example: in iris there is a Species column with 3 species:

unique(iris.Species)
3-element Vector{String}:
 "setosa"
 "versicolor"
 "virginica"

We can form views for each of these:

gdf = groupby(iris, :Species);

The gdf object now corresponds to views of the original dataframe for each of the 3 species; the first species is "setosa" with:

subdf_setosa = gdf[1]
describe(subdf_setosa, :min, :mean, :max)
5×4 DataFrame
 Row │ variable     min     mean    max
     │ Symbol       Any     Union…  Any
─────┼─────────────────────────────────────
   1 │ SepalLength  4.3     5.006   5.8
   2 │ SepalWidth   2.3     3.428   4.4
   3 │ PetalLength  1.0     1.462   1.9
   4 │ PetalWidth   0.1     0.246   0.6
   5 │ Species      setosa          setosa

Note that subdf_setosa is a SubDataFrame meaning that it is just a view of the parent dataframe iris; if you modify that parent dataframe then the sub dataframe is also modified.

See ?groupby for more information.

combine

The combine function allows to derive a new dataframe out of transformations of an existing one. Here's an example taken from the official doc (see ?combine):

df = DataFrame(a=1:3, b=4:6)
combine(df, :a => sum, nrow)
1×2 DataFrame
 Row │ a_sum  nrow
     │ Int64  Int64
─────┼──────────────
   1 │     6      3

what happened here is that the derived DataFrame has two columns obtained respectively by (1) computing the sum of the first column and (2) applying the nrow function on the df.

The transformation can produce one or several values, combine will try to concatenate these columns as it can, for instance:

foo(v) = v[1:2]
combine(df, :a => maximum, :b => foo)
2×2 DataFrame
 Row │ a_maximum  b_foo
     │ Int64      Int64
─────┼──────────────────
   1 │         3      4
   2 │         3      5

here the maximum value of a is copied twice so that the two columns have the same number of rows.

bar(v) = v[end-1:end]
combine(df, :a => foo, :b => bar)
2×2 DataFrame
 Row │ a_foo  b_bar
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6

combine with groupby

Combining groupby with combine is very useful. For instance you might want to compute statistics across groups for different variables:

combine(groupby(iris, :Species), :PetalLength => mean)
3×2 DataFrame
 Row │ Species     PetalLength_mean
     │ Cat…        Float64
─────┼──────────────────────────────
   1 │ setosa                 1.462
   2 │ versicolor             4.26
   3 │ virginica              5.552

let's decompose that:

  1. the groupby(iris, :Species) creates groups using the :Species column (which has values setosa, versicolor, virginica)

  2. the combine creates a derived dataframe by applying the mean function to the :PetalLength column

  3. since there are three groups, we get one column (mean of PetalLength) and three rows (one per group).

You can do this for several columns/statistics at the time and give new column names to the results:

gdf = groupby(iris, :Species)
combine(gdf, :PetalLength => mean => :MPL, :PetalLength => std => :SPL)
3×3 DataFrame
 Row │ Species     MPL      SPL
     │ Cat…        Float64  Float64
─────┼───────────────────────────────
   1 │ setosa        1.462  0.173664
   2 │ versicolor    4.26   0.469911
   3 │ virginica     5.552  0.551895

so here we assign the names :MPL and :SPL to the derived columns. If you want to apply something on all columns apart from the grouping one, using names and Not comes in handy:

combine(gdf, names(iris, Not(:Species)) .=> std)
3×5 DataFrame
 Row │ Species     SepalLength_std  SepalWidth_std  PetalLength_std  PetalWidth_std
     │ Cat…        Float64          Float64         Float64          Float64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │ setosa             0.35249         0.379064         0.173664        0.105386
   2 │ versicolor         0.516171        0.313798         0.469911        0.197753
   3 │ virginica          0.63588         0.322497         0.551895        0.27465

where

names(iris, Not(:Species))
4-element Vector{String}:
 "SepalLength"
 "SepalWidth"
 "PetalLength"
 "PetalWidth"

and note the use of . in .=> to indicate that we broadcast the function over each column.