Cross-Validation

The CrossValidation module extends scikit-learn’s splitter API with group-aware, stratified cross-validation strategies. These are needed when data contain groups (e.g. recordings from the same individual or deployment) that must not be split across folds.

StratifiedGroupKFold

class ecosound.classification.CrossValidation.StratifiedGroupKFold(*args: Any, **kwargs: Any)

Bases: _BaseKFold

Stratified K-Folds iterator variant with non-overlapping groups.

This cross-validation object is a variation of StratifiedKFold that returns stratified folds with non-overlapping groups. The folds are made by preserving the percentage of samples for each class.

The same group will not appear in two different folds, so the number of distinct groups must be at least equal to the number of folds.

The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class.
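One way to see how well a split preserves class percentages is to compare the class fractions in each test fold with those of the full label vector. The sketch below does this in plain Python for the labels and test folds shown in the Examples section of this class. The helper name class_fractions is hypothetical and is not part of ecosound or scikit-learn. Because whole groups must stay together, the per-fold fractions can only approximate the overall ones.

```python
from collections import Counter


def class_fractions(labels):
    """Return {class: fraction of samples} for a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Labels of the full data set, as used in the Examples section.
y = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Test-fold labels copied from the output printed in the Examples section.
test_labels_per_fold = [
    [0, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

overall = class_fractions(y)
per_fold = [class_fractions(fold) for fold in test_labels_per_fold]

print("overall:", overall)
for i, fractions in enumerate(per_fold):
    print(f"fold {i}:", fractions)
```

Large gaps between a fold's fractions and the overall fractions usually indicate that a few large groups dominate one class, which limits how closely any group-aware splitter can stratify.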

Parameters:
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • shuffle (bool, default=False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

  • random_state (int or RandomState instance, default=None) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls.

Examples

>>> import numpy as np
>>> from ecosound.classification.CrossValidation import StratifiedGroupKFold
>>> X = np.ones((17, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
>>> cv = StratifiedGroupKFold(n_splits=3)
>>> for train_idxs, test_idxs in cv.split(X, y, groups):
...     print("TRAIN:", groups[train_idxs])
...     print("      ", y[train_idxs])
...     print(" TEST:", groups[test_idxs])
...     print("      ", y[test_idxs])
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]
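The non-overlap guarantee can be checked directly on the output above. The following sketch is plain Python with the test-fold group labels copied from the printed folds. The helper name groups_are_disjoint is hypothetical. It confirms that no group appears in more than one test fold and that the test folds together cover every sample exactly once.

```python
# Group label of every sample, as defined in the example above.
all_groups = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]

# Test-fold group labels copied from the printed output above.
test_groups_per_fold = [
    [1, 1, 3, 3, 3, 8, 8],
    [2, 2, 6, 6, 7],
    [4, 5, 5, 5, 5],
]


def groups_are_disjoint(folds):
    """Return True if no group label occurs in more than one fold."""
    seen = set()
    for fold in folds:
        fold_groups = set(fold)
        if seen & fold_groups:
            return False
        seen |= fold_groups
    return True


disjoint = groups_are_disjoint(test_groups_per_fold)

# Every sample should land in exactly one test fold.
covers_all = (
    sorted(g for fold in test_groups_per_fold for g in fold) == sorted(all_groups)
)

print("disjoint groups:", disjoint)
print("all samples covered once:", covers_all)
```

The same check can be applied to the index arrays returned by split by mapping them through the groups array first.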

See also

StratifiedKFold

Takes class information into account to build folds which retain class distributions (for binary or multiclass classification tasks).

GroupKFold

K-fold iterator variant with non-overlapping groups.

RepeatedStratifiedGroupKFold

class ecosound.classification.CrossValidation.RepeatedStratifiedGroupKFold(*args: Any, **kwargs: Any)

Bases: _RepeatedSplits

Repeated Stratified Group K-Fold cross-validator.

Repeats Stratified K-Fold with non-overlapping groups n_repeats times, with a different randomization in each repetition.

Parameters:
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • n_repeats (int, default=10) – Number of times the cross-validator is repeated.

  • random_state (int or RandomState instance, default=None) – Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls.

Examples

>>> import numpy as np
>>> from ecosound.classification.CrossValidation import RepeatedStratifiedGroupKFold
>>> X = np.ones((17, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
>>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
...                                   random_state=36851234)
>>> for train_idxs, test_idxs in cv.split(X, y, groups):
...     print("TRAIN:", groups[train_idxs])
...     print("      ", y[train_idxs])
...     print(" TEST:", groups[test_idxs])
...     print("      ", y[test_idxs])
TRAIN: [2 2 4 5 5 5 5 8 8]
       [1 1 1 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 6 6 7]
       [0 0 1 1 1 0 0 0]
TRAIN: [1 1 3 3 3 6 6 7]
       [0 0 1 1 1 0 0 0]
 TEST: [2 2 4 5 5 5 5 8 8]
       [1 1 1 0 0 0 0 0 0]
TRAIN: [3 3 3 4 7 8 8]
       [1 1 1 1 0 0 0]
 TEST: [1 1 2 2 5 5 5 5 6 6]
       [0 0 1 1 0 0 0 0 0 0]
TRAIN: [1 1 2 2 5 5 5 5 6 6]
       [0 0 1 1 0 0 0 0 0 0]
 TEST: [3 3 3 4 7 8 8]
       [1 1 1 1 0 0 0]
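The output above contains n_splits * n_repeats = 4 splits. Assuming the splits are yielded one repetition at a time, as the complementary folds in the output suggest, they can be regrouped by repetition. The plain-Python sketch below does this for the printed test folds, then checks that each repetition covers every sample once and that the two repetitions partition the groups differently. All variable names are illustrative.

```python
n_splits, n_repeats = 2, 2

# Group label of every sample, as defined in the example above.
all_groups = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]

# Test-fold group labels copied from the printed output above, in order.
test_groups = [
    [1, 1, 3, 3, 3, 6, 6, 7],
    [2, 2, 4, 5, 5, 5, 5, 8, 8],
    [1, 1, 2, 2, 5, 5, 5, 5, 6, 6],
    [3, 3, 3, 4, 7, 8, 8],
]

# Chunk the flat list of splits into one list of folds per repetition.
repetitions = [
    test_groups[i * n_splits:(i + 1) * n_splits] for i in range(n_repeats)
]

# Within a repetition, the test folds should cover each sample exactly once.
each_repetition_covers_data = all(
    sorted(g for fold in rep for g in fold) == sorted(all_groups)
    for rep in repetitions
)

# Describe each repetition by the set of groups in each of its folds.
partitions = [{frozenset(fold) for fold in rep} for rep in repetitions]
repetitions_differ = partitions[0] != partitions[1]

print("splits in total:", len(test_groups))
print("each repetition covers the data:", each_repetition_covers_data)
print("repetitions use different partitions:", repetitions_differ)
```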

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.

See also

RepeatedStratifiedKFold

Repeats Stratified K-Fold n times.