Manipulating a DataFrame
To ensure code in this tutorial runs as shown, download the tutorial project folder and follow these instructions.If you have questions or suggestions about this tutorial, please open an issue here.
This tutorial is loosely adapted from this pandas tutorial as well as the DataFrames.jl documentation. It is by no means meant to be a complete introduction, rather, it focuses on some key functionalities that are particularly useful in a classical machine learning context.
To start with, we will use the Boston dataset which is very simple.
using RDatasets
using DataFrames
boston = dataset("MASS", "Boston");
The dataset function returns a DataFrame object:
typeof(boston)DataFrames.DataFrame
Intuitively a DataFrame is just a wrapper around a number of columns, each of which is a Vector of some type with a name"
names(boston)14-element Vector{String}:
"Crim"
"Zn"
"Indus"
"Chas"
"NOx"
"Rm"
"Age"
"Dis"
"Rad"
"Tax"
"PTRatio"
"Black"
"LStat"
"MedV"
You can view the first few rows using first and specifying a number of rows:
first(boston, 4)4×14 DataFrame
Row │ Crim Zn Indus Chas NOx Rm Age Dis Rad Tax PTRatio Black LStat MedV
│ Float64 Float64 Float64 Int64 Float64 Float64 Float64 Float64 Int64 Int64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24.0
2 │ 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
3 │ 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 │ 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
You can access one of those columns easily using .colname, this returns a vector that you can access like any Julia vector:
boston.Crim[1:5]5-element Vector{Float64}:
0.00632
0.02731
0.02729
0.03237
0.06905
You can also just access the dataframe as you would a big matrix:
boston[3, 5]0.469
or specifying a range of rows/columns:
boston[1:5, [:Crim, :Zn]]5×2 DataFrame
Row │ Crim Zn
│ Float64 Float64
─────┼──────────────────
1 │ 0.00632 18.0
2 │ 0.02731 0.0
3 │ 0.02729 0.0
4 │ 0.03237 0.0
5 │ 0.06905 0.0
or, similarly,
boston[1:5, 1:2]5×2 DataFrame
Row │ Crim Zn
│ Float64 Float64
─────┼──────────────────
1 │ 0.00632 18.0
2 │ 0.02731 0.0
3 │ 0.02729 0.0
4 │ 0.03237 0.0
5 │ 0.06905 0.0
The select function is very convenient to get sub dataframes of interest:
b1 = select(boston, [:Crim, :Zn, :Indus])
first(b1, 2)2×3 DataFrame
Row │ Crim Zn Indus
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 0.00632 18.0 2.31
2 │ 0.02731 0.0 7.07
The Not syntax is also very useful:
b2 = select(boston, Not(:NOx))
first(b2, 2)2×13 DataFrame
Row │ Crim Zn Indus Chas Rm Age Dis Rad Tax PTRatio Black LStat MedV
│ Float64 Float64 Float64 Int64 Float64 Float64 Float64 Int64 Int64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.00632 18.0 2.31 0 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24.0
2 │ 0.02731 0.0 7.07 0 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
Finally, if you would like to drop columns, you can use select! which will mutate the dataframe in place:
select!(b1, Not(:Crim))
first(b1, 2)2×2 DataFrame
Row │ Zn Indus
│ Float64 Float64
─────┼──────────────────
1 │ 18.0 2.31
2 │ 0.0 7.07
StatsBase offers a convenient describe function which you can use on a DataFrame to get an overview of the data:
using StatsBase
describe(boston, :min, :max, :mean, :median, :std)14×6 DataFrame
Row │ variable min max mean median std
│ Symbol Real Real Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────
1 │ Crim 0.00632 88.9762 3.61352 0.25651 8.60155
2 │ Zn 0.0 100.0 11.3636 0.0 23.3225
3 │ Indus 0.46 27.74 11.1368 9.69 6.86035
4 │ Chas 0 1 0.06917 0.0 0.253994
5 │ NOx 0.385 0.871 0.554695 0.538 0.115878
6 │ Rm 3.561 8.78 6.28463 6.2085 0.702617
7 │ Age 2.9 100.0 68.5749 77.5 28.1489
8 │ Dis 1.1296 12.1265 3.79504 3.20745 2.10571
9 │ Rad 1 24 9.54941 5.0 8.70726
10 │ Tax 187 711 408.237 330.0 168.537
11 │ PTRatio 12.6 22.0 18.4555 19.05 2.16495
12 │ Black 0.32 396.9 356.674 391.44 91.2949
13 │ LStat 1.73 37.97 12.6531 11.36 7.14106
14 │ MedV 5.0 50.0 22.5328 21.2 9.1971
You can pass a number of symbols to the describe function to indicate which statistics to compute for each feature:
mean,std,min,max,median,first,lastare all fairly self explanatoryq25,q75are respectively for the 25th and 75th percentile,eltype,nunique,nmissingcan also be used
You can also pass your custom function with a pair name => function for instance:
foo(x) = sum(abs.(x)) / length(x)
d = describe(boston, :mean, :median, foo => :foo)
first(d, 3)3×4 DataFrame
Row │ variable mean median foo
│ Symbol Float64 Float64 Float64
─────┼───────────────────────────────────────
1 │ Crim 3.61352 0.25651 3.61352
2 │ Zn 11.3636 0.0 11.3636
3 │ Indus 11.1368 9.69 11.1368
The describe function returns a derived object with one row per feature and one column per required statistic.
Further to StatsBase, Statistics offers a range of useful functions for data analysis.
using Statistics
If you want to get the content of the dataframe as one big matrix, use convert:
mat = Matrix(boston)
mat[1:3, 1:3]3×3 Matrix{Float64}:
0.00632 18.0 2.31
0.02731 0.0 7.07
0.02729 0.0 7.07
Adding a column to a dataframe is very easy:
boston.Crim_x_Zn = boston.Crim .* boston.Zn;
that's it! Remember also that you can drop columns or make subselections with select and select!.
Let's load a dataset with missing values
mao = dataset("gap", "mao")
describe(mao, :nmissing)19×2 DataFrame
Row │ variable nmissing
│ Symbol Int64
─────┼────────────────────
1 │ ID 0
2 │ Type 0
3 │ Gender 0
4 │ Age 188
5 │ AAO 296
6 │ AAD 295
7 │ UPDRS 301
8 │ MAOAI2 0
9 │ AI2Code 0
10 │ MAOBI2 0
11 │ BI2Code 0
12 │ GTBEX3 0
13 │ BEX3Code 0
14 │ MAOAVNTR 46
15 │ VNTRCode 46
16 │ VNTRCod2 46
17 │ MAOA31 36
18 │ MAO31COD 36
19 │ MAO31CO2 36
Lots of missing values... If you wanted to compute simple functions on columns, they may just return missing:
std(mao.Age)missing
The skipmissing function can help counter this easily:
std(skipmissing(mao.Age))11.551715894987558
This is a shorter version of the DataFrames.jl tutorial.
iris = dataset("datasets", "iris")
first(iris, 3)3×5 DataFrame
Row │ SepalLength SepalWidth PetalLength PetalWidth Species
│ Float64 Float64 Float64 Float64 Cat…
─────┼───────────────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2 setosa
2 │ 4.9 3.0 1.4 0.2 setosa
3 │ 4.7 3.2 1.3 0.2 setosa
The groupby function allows to form "sub-dataframes" corresponding to groups of rows. This can be very convenient to run specific analyses for specific groups without copying the data.
The basic usage is groupby(df, cols) where cols specifies one or several columns to use for the grouping.
Consider a simple example: in iris there is a Species column with 3 species:
unique(iris.Species)3-element Vector{String}:
"setosa"
"versicolor"
"virginica"
We can form views for each of these:
gdf = groupby(iris, :Species);
The gdf object now corresponds to views of the original dataframe for each of the 3 species; the first species is "setosa" with:
subdf_setosa = gdf[1]
describe(subdf_setosa, :min, :mean, :max)5×4 DataFrame
Row │ variable min mean max
│ Symbol Any Union… Any
─────┼─────────────────────────────────────
1 │ SepalLength 4.3 5.006 5.8
2 │ SepalWidth 2.3 3.428 4.4
3 │ PetalLength 1.0 1.462 1.9
4 │ PetalWidth 0.1 0.246 0.6
5 │ Species setosa setosa
Note that subdf_setosa is a SubDataFrame meaning that it is just a view of the parent dataframe iris; if you modify that parent dataframe then the sub dataframe is also modified.
See ?groupby for more information.
The combine function allows to derive a new dataframe out of transformations of an existing one. Here's an example taken from the official doc (see ?combine):
df = DataFrame(a=1:3, b=4:6)
combine(df, :a => sum, nrow)1×2 DataFrame
Row │ a_sum nrow
│ Int64 Int64
─────┼──────────────
1 │ 6 3
what happened here is that the derived DataFrame has two columns obtained respectively by (1) computing the sum of the first column and (2) applying the nrow function on the df.
The transformation can produce one or several values, combine will try to concatenate these columns as it can, for instance:
foo(v) = v[1:2]
combine(df, :a => maximum, :b => foo)2×2 DataFrame
Row │ a_maximum b_foo
│ Int64 Int64
─────┼──────────────────
1 │ 3 4
2 │ 3 5
here the maximum value of a is copied twice so that the two columns have the same number of rows.
bar(v) = v[end-1:end]
combine(df, :a => foo, :b => bar)2×2 DataFrame
Row │ a_foo b_bar
│ Int64 Int64
─────┼──────────────
1 │ 1 5
2 │ 2 6
Combining groupby with combine is very useful. For instance you might want to compute statistics across groups for different variables:
combine(groupby(iris, :Species), :PetalLength => mean)3×2 DataFrame
Row │ Species PetalLength_mean
│ Cat… Float64
─────┼──────────────────────────────
1 │ setosa 1.462
2 │ versicolor 4.26
3 │ virginica 5.552
let's decompose that:
the
groupby(iris, :Species)creates groups using the:Speciescolumn (which has valuessetosa,versicolor,virginica)the
combinecreates a derived dataframe by applying themeanfunction to the:PetalLengthcolumnsince there are three groups, we get one column (mean of
PetalLength) and three rows (one per group).
You can do this for several columns/statistics at the time and give new column names to the results:
gdf = groupby(iris, :Species)
combine(gdf, :PetalLength => mean => :MPL, :PetalLength => std => :SPL)3×3 DataFrame
Row │ Species MPL SPL
│ Cat… Float64 Float64
─────┼───────────────────────────────
1 │ setosa 1.462 0.173664
2 │ versicolor 4.26 0.469911
3 │ virginica 5.552 0.551895
so here we assign the names :MPL and :SPL to the derived columns. If you want to apply something on all columns apart from the grouping one, using names and Not comes in handy:
combine(gdf, names(iris, Not(:Species)) .=> std)3×5 DataFrame
Row │ Species SepalLength_std SepalWidth_std PetalLength_std PetalWidth_std
│ Cat… Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────
1 │ setosa 0.35249 0.379064 0.173664 0.105386
2 │ versicolor 0.516171 0.313798 0.469911 0.197753
3 │ virginica 0.63588 0.322497 0.551895 0.27465
where
names(iris, Not(:Species))4-element Vector{String}:
"SepalLength"
"SepalWidth"
"PetalLength"
"PetalWidth"
and note the use of . in .=> to indicate that we broadcast the function over each column.