Directory Structure
The folder structure is as follows:
.
├── Imbalance.jl # entry point to package
├── generic_resample.jl # functions used in all resampling methods
├── generic_encoder.jl # used in all resampling methods that deal with categorical data
├── table_wrappers.jl # generalizes a function that operates on matrices to tables
├── class_counts.jl # used to compute number of data points to add or remove
├── common # has julia files for common docs, error strings and utils
├── distance_metrics # has distance metrics used by some resampling methods
├── oversampling_methods # all oversampling methods live here
├── undersampling_methods # all undersampling methods live here
└── extras.jl # extra functions like generating data or checking balance
The purpose of each file is documented at the top of that file. The files are listed here in the recommended reading order.
Any resampling method implemented in the oversampling_methods or undersampling_methods folder has the following structure:
├── resample_method # contains implementation and interfaces for a resampling method
│ ├── interface_mlj.jl # implements MLJ interface for the method
│ ├── interface_tables.jl # implements Tables.jl interface for the method
│ └── resample_method.jl # implements the method itself (pure functional interface)
Contribution
Reporting Problems or Seeking Support
- Do not hesitate to post a GitHub issue with your question or problem.
Adding New Resampling Methods
- Make a new folder resample_method for the method in oversampling_methods or undersampling_methods
- Implement the method in resample_method/resample_method.jl so that it operates on matrices for one minority class
- Use generic_resample.jl to generalize it to work on the whole data
- Use table_wrappers.jl to generalize the method to work on tables, and possibly use generic_encoder.jl
- Implement the MLJ interface for the method in resample_method/interface_mlj.jl
- Implement the TableTransforms interface for the method in resample_method/interface_tables.jl
- Use the rest of the files according to their description
- Testing and documentation should be done in parallel
You can skip the third step if the algorithm you are implementing does not operate in a "per-class" sense.
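To illustrate the second and third steps, here is a minimal sketch of the "per-class, then whole data" pattern. It is not the package's actual code: the names naive_oversample_per_class and naive_oversample are hypothetical, the method shown is plain random oversampling, and it assumes rows are observations. The real helpers in generic_resample.jl may look quite different.

```julia
using Random

# Hypothetical per-class method: given the matrix `X` of ONE class
# (rows are observations), return `n` extra rows drawn with replacement.
function naive_oversample_per_class(X::AbstractMatrix, n::Integer;
                                    rng::AbstractRNG = Random.default_rng())
    idx = rand(rng, 1:size(X, 1), n)
    return X[idx, :]
end

# Hypothetical generalization to the whole dataset: apply the per-class
# method to every class until each one matches the majority class count.
function naive_oversample(X::AbstractMatrix, y::AbstractVector;
                          rng::AbstractRNG = Random.default_rng())
    # Count observations per label
    counts = Dict{eltype(y), Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    majority_count = maximum(values(counts))

    X_parts = AbstractMatrix[X]
    y_parts = AbstractVector[y]
    for (label, count) in counts
        n_extra = majority_count - count
        n_extra == 0 && continue
        X_label = X[y .== label, :]   # rows belonging to this class only
        push!(X_parts, naive_oversample_per_class(X_label, n_extra; rng = rng))
        push!(y_parts, fill(label, n_extra))
    end
    return reduce(vcat, X_parts), reduce(vcat, y_parts)
end

# Usage: class 1 has a single observation and is brought up to three
X = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0]
y = [0, 0, 0, 1]
X_over, y_over = naive_oversample(X, y; rng = MersenneTwister(42))
```

Keeping the per-class function free of any label handling is what lets a single generic wrapper serve every method that works class by class.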
🔥 Hot algorithms to add
- K-Means SMOTE: Decides where exactly to generate more points using SMOTE by factoring in "within class imbalance". This may also be easily generalized to algorithms beyond SMOTE.
- CondensedNearestNeighbors: Undersamples the dataset so as to preserve the decision boundary by KNN
- BorderlineSMOTE2: A small modification of the BorderlineSMOTE1 condition
- RepeatedENNUndersampler: Simply repeats ENNUndersampler multiple times
Adding New Tutorials
- Make a new notebook with the tutorial in the examples folder found in docs/src/examples
- Run the notebook so that the output is shown below each cell
- If the notebook produces visuals then save and load them in the notebook
- Convert it to markdown by using Python to run from convert import convert_to_md; convert_to_md('<filename>')
- Set a title, description, image and links for it in the dictionary found in docs/examples.jl
- For the Colab link, you do not need to upload anything; just follow the link pattern in the file