Manipulating a DataFrame
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.If you have questions or suggestions about this tutorial, please open an issue here.
This tutorial is loosely adapted from this pandas tutorial as well as the DataFrames.jl documentation. It is by no means meant to be a complete introduction, rather, it focuses on some key functionalities that are particularly useful in a classical machine learning context.
To start with, we will use the Boston
dataset which is very simple.
using RDatasets
using DataFrames
boston = dataset("MASS", "Boston");
The dataset
function returns a DataFrame
object:
typeof(boston)
DataFrames.DataFrame
Intuitively a DataFrame is just a wrapper around a number of columns, each of which is a Vector
of some type with a name"
names(boston)
14-element Vector{String}:
"Crim"
"Zn"
"Indus"
"Chas"
"NOx"
"Rm"
"Age"
"Dis"
"Rad"
"Tax"
"PTRatio"
"Black"
"LStat"
"MedV"
You can view the first few rows using first
and specifying a number of rows:
first(boston, 4)
4×14 DataFrame
Row │ Crim Zn Indus Chas NOx Rm Age Dis Rad Tax PTRatio Black LStat MedV
│ Float64 Float64 Float64 Int64 Float64 Float64 Float64 Float64 Int64 Int64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24.0
2 │ 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
3 │ 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 │ 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
You can access one of those columns easily using .colname
, this returns a vector that you can access like any Julia vector:
boston.Crim[1:5]
5-element Vector{Float64}:
0.00632
0.02731
0.02729
0.03237
0.06905
You can also just access the dataframe as you would a big matrix:
boston[3, 5]
0.469
or specifying a range of rows/columns:
boston[1:5, [:Crim, :Zn]]
5×2 DataFrame
Row │ Crim Zn
│ Float64 Float64
─────┼──────────────────
1 │ 0.00632 18.0
2 │ 0.02731 0.0
3 │ 0.02729 0.0
4 │ 0.03237 0.0
5 │ 0.06905 0.0
or, similarly,
boston[1:5, 1:2]
5×2 DataFrame
Row │ Crim Zn
│ Float64 Float64
─────┼──────────────────
1 │ 0.00632 18.0
2 │ 0.02731 0.0
3 │ 0.02729 0.0
4 │ 0.03237 0.0
5 │ 0.06905 0.0
The select
function is very convenient to get sub dataframes of interest:
b1 = select(boston, [:Crim, :Zn, :Indus])
first(b1, 2)
2×3 DataFrame
Row │ Crim Zn Indus
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 0.00632 18.0 2.31
2 │ 0.02731 0.0 7.07
The Not
syntax is also very useful:
b2 = select(boston, Not(:NOx))
first(b2, 2)
2×13 DataFrame
Row │ Crim Zn Indus Chas Rm Age Dis Rad Tax PTRatio Black LStat MedV
│ Float64 Float64 Float64 Int64 Float64 Float64 Float64 Int64 Int64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.00632 18.0 2.31 0 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24.0
2 │ 0.02731 0.0 7.07 0 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
Finally, if you would like to drop columns, you can use select!
which will mutate the dataframe in place:
select!(b1, Not(:Crim))
first(b1, 2)
2×2 DataFrame
Row │ Zn Indus
│ Float64 Float64
─────┼──────────────────
1 │ 18.0 2.31
2 │ 0.0 7.07
StatsBase
offers a convenient describe
function which you can use on a DataFrame to get an overview of the data:
using StatsBase
describe(boston, :min, :max, :mean, :median, :std)
14×6 DataFrame
Row │ variable min max mean median std
│ Symbol Real Real Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────
1 │ Crim 0.00632 88.9762 3.61352 0.25651 8.60155
2 │ Zn 0.0 100.0 11.3636 0.0 23.3225
3 │ Indus 0.46 27.74 11.1368 9.69 6.86035
4 │ Chas 0 1 0.06917 0.0 0.253994
5 │ NOx 0.385 0.871 0.554695 0.538 0.115878
6 │ Rm 3.561 8.78 6.28463 6.2085 0.702617
7 │ Age 2.9 100.0 68.5749 77.5 28.1489
8 │ Dis 1.1296 12.1265 3.79504 3.20745 2.10571
9 │ Rad 1 24 9.54941 5.0 8.70726
10 │ Tax 187 711 408.237 330.0 168.537
11 │ PTRatio 12.6 22.0 18.4555 19.05 2.16495
12 │ Black 0.32 396.9 356.674 391.44 91.2949
13 │ LStat 1.73 37.97 12.6531 11.36 7.14106
14 │ MedV 5.0 50.0 22.5328 21.2 9.1971
You can pass a number of symbols to the describe
function to indicate which statistics to compute for each feature:
mean
,std
,min
,max
,median
,first
,last
are all fairly self explanatoryq25
,q75
are respectively for the 25th and 75th percentile,eltype
,nunique
,nmissing
can also be used
You can also pass your custom function with a pair name => function
for instance:
foo(x) = sum(abs.(x)) / length(x)
d = describe(boston, :mean, :median, foo => :foo)
first(d, 3)
3×4 DataFrame
Row │ variable mean median foo
│ Symbol Float64 Float64 Float64
─────┼───────────────────────────────────────
1 │ Crim 3.61352 0.25651 3.61352
2 │ Zn 11.3636 0.0 11.3636
3 │ Indus 11.1368 9.69 11.1368
The describe
function returns a derived object with one row per feature and one column per required statistic.
Further to StatsBase
, Statistics
offers a range of useful functions for data analysis.
using Statistics
If you want to get the content of the dataframe as one big matrix, use convert
:
mat = Matrix(boston)
mat[1:3, 1:3]
3×3 Matrix{Float64}:
0.00632 18.0 2.31
0.02731 0.0 7.07
0.02729 0.0 7.07
Adding a column to a dataframe is very easy:
boston.Crim_x_Zn = boston.Crim .* boston.Zn;
that's it! Remember also that you can drop columns or make subselections with select
and select!
.
Let's load a dataset with missing values
mao = dataset("gap", "mao")
describe(mao, :nmissing)
19×2 DataFrame
Row │ variable nmissing
│ Symbol Int64
─────┼────────────────────
1 │ ID 0
2 │ Type 0
3 │ Gender 0
4 │ Age 188
5 │ AAO 296
6 │ AAD 295
7 │ UPDRS 301
8 │ MAOAI2 0
9 │ AI2Code 0
10 │ MAOBI2 0
11 │ BI2Code 0
12 │ GTBEX3 0
13 │ BEX3Code 0
14 │ MAOAVNTR 46
15 │ VNTRCode 46
16 │ VNTRCod2 46
17 │ MAOA31 36
18 │ MAO31COD 36
19 │ MAO31CO2 36
Lots of missing values... If you wanted to compute simple functions on columns, they may just return missing
:
std(mao.Age)
missing
The skipmissing
function can help counter this easily:
std(skipmissing(mao.Age))
11.551715894987558
This is a shorter version of the DataFrames.jl tutorial.
iris = dataset("datasets", "iris")
first(iris, 3)
3×5 DataFrame
Row │ SepalLength SepalWidth PetalLength PetalWidth Species
│ Float64 Float64 Float64 Float64 Cat…
─────┼───────────────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2 setosa
2 │ 4.9 3.0 1.4 0.2 setosa
3 │ 4.7 3.2 1.3 0.2 setosa
The groupby
function allows to form "sub-dataframes" corresponding to groups of rows. This can be very convenient to run specific analyses for specific groups without copying the data.
The basic usage is groupby(df, cols)
where cols
specifies one or several columns to use for the grouping.
Consider a simple example: in iris
there is a Species
column with 3 species:
unique(iris.Species)
3-element Vector{String}:
"setosa"
"versicolor"
"virginica"
We can form views for each of these:
gdf = groupby(iris, :Species);
The gdf
object now corresponds to views of the original dataframe for each of the 3 species; the first species is "setosa"
with:
subdf_setosa = gdf[1]
describe(subdf_setosa, :min, :mean, :max)
5×4 DataFrame
Row │ variable min mean max
│ Symbol Any Union… Any
─────┼─────────────────────────────────────
1 │ SepalLength 4.3 5.006 5.8
2 │ SepalWidth 2.3 3.428 4.4
3 │ PetalLength 1.0 1.462 1.9
4 │ PetalWidth 0.1 0.246 0.6
5 │ Species setosa setosa
Note that subdf_setosa
is a SubDataFrame
meaning that it is just a view of the parent dataframe iris
; if you modify that parent dataframe then the sub dataframe is also modified.
See ?groupby
for more information.
The combine
function allows to derive a new dataframe out of transformations of an existing one. Here's an example taken from the official doc (see ?combine
):
df = DataFrame(a=1:3, b=4:6)
combine(df, :a => sum, nrow)
1×2 DataFrame
Row │ a_sum nrow
│ Int64 Int64
─────┼──────────────
1 │ 6 3
what happened here is that the derived DataFrame has two columns obtained respectively by (1) computing the sum of the first column and (2) applying the nrow
function on the df
.
The transformation can produce one or several values, combine
will try to concatenate these columns as it can, for instance:
foo(v) = v[1:2]
combine(df, :a => maximum, :b => foo)
2×2 DataFrame
Row │ a_maximum b_foo
│ Int64 Int64
─────┼──────────────────
1 │ 3 4
2 │ 3 5
here the maximum value of a
is copied twice so that the two columns have the same number of rows.
bar(v) = v[end-1:end]
combine(df, :a => foo, :b => bar)
2×2 DataFrame
Row │ a_foo b_bar
│ Int64 Int64
─────┼──────────────
1 │ 1 5
2 │ 2 6
Combining groupby
with combine
is very useful. For instance you might want to compute statistics across groups for different variables:
combine(groupby(iris, :Species), :PetalLength => mean)
3×2 DataFrame
Row │ Species PetalLength_mean
│ Cat… Float64
─────┼──────────────────────────────
1 │ setosa 1.462
2 │ versicolor 4.26
3 │ virginica 5.552
let's decompose that:
the
groupby(iris, :Species)
creates groups using the:Species
column (which has valuessetosa
,versicolor
,virginica
)the
combine
creates a derived dataframe by applying themean
function to the:PetalLength
columnsince there are three groups, we get one column (mean of
PetalLength
) and three rows (one per group).
You can do this for several columns/statistics at the time and give new column names to the results:
gdf = groupby(iris, :Species)
combine(gdf, :PetalLength => mean => :MPL, :PetalLength => std => :SPL)
3×3 DataFrame
Row │ Species MPL SPL
│ Cat… Float64 Float64
─────┼───────────────────────────────
1 │ setosa 1.462 0.173664
2 │ versicolor 4.26 0.469911
3 │ virginica 5.552 0.551895
so here we assign the names :MPL
and :SPL
to the derived columns. If you want to apply something on all columns apart from the grouping one, using names
and Not
comes in handy:
combine(gdf, names(iris, Not(:Species)) .=> std)
3×5 DataFrame
Row │ Species SepalLength_std SepalWidth_std PetalLength_std PetalWidth_std
│ Cat… Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ setosa 0.35249 0.379064 0.173664 0.105386
2 │ versicolor 0.516171 0.313798 0.469911 0.197753
3 │ virginica 0.63588 0.322497 0.551895 0.27465
where
names(iris, Not(:Species))
4-element Vector{String}:
"SepalLength"
"SepalWidth"
"PetalLength"
"PetalWidth"
and note the use of .
in .=>
to indicate that we broadcast the function over each column.