06-data-train-test-spliting.Rmd
The train/test split must be done at the audio file level (by audio_id), not at the slice level, so that slices from the same audio file never appear in both the training and the testing datasets. This removes the risk of data leakage between slices of the same recording. The only remaining source of leakage is duplicated audio files, which may exist because the same recording can appear in both Xeno-canto and Wikiaves (or in any other data source).
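One way to look for such duplicates is to compare file checksums. The sketch below is only an illustration: the helper `find_duplicate_audio()` and the demo file names are hypothetical, and it catches byte-identical copies only, so a recording that was re-encoded or trimmed by one of the sources would go unnoticed.

```r
# Hypothetical helper: flag byte-identical audio files by MD5 checksum.
find_duplicate_audio <- function(paths) {
  checksums <- unname(tools::md5sum(paths))
  # checksums that occur more than once
  dup_sums <- unique(checksums[duplicated(checksums)])
  files <- data.frame(path = paths, md5 = checksums, stringsAsFactors = FALSE)
  files[files$md5 %in% dup_sums, ]
}

# Demo on throwaway files; the file names are made up for illustration.
demo_dir <- tempfile("audio_demo_")
dir.create(demo_dir)
writeBin(as.raw(1:10), file.path(demo_dir, "xeno-canto-0001.wav"))
writeBin(as.raw(1:10), file.path(demo_dir, "wikiaves-0001.wav")) # same bytes
writeBin(as.raw(11:20), file.path(demo_dir, "xeno-canto-0002.wav"))

duplicates <- find_duplicate_audio(list.files(demo_dir, full.names = TRUE))
basename(duplicates$path)
```

Files flagged this way could be collapsed to a single audio_id before splitting, so that both copies end up on the same side of the split.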
library(dplyr)
# fix the seed so the split is reproducible
set.seed(1)
# load the .rds file built at the data-labeling step
audio_ids <- readr::read_rds("../data_/slices_1000ms_labels_by_humans.rds") %>%
  distinct(audio_id)
# 85% for training/ 15% for testing (could be any %)
audio_ids_train <- audio_ids %>% sample_frac(0.85)
audio_ids_test <- audio_ids %>% anti_join(audio_ids_train, by = "audio_id")
# save for later
readr::write_rds(audio_ids_train, "data_/audio_ids_train.rds")
readr::write_rds(audio_ids_test, "data_/audio_ids_test.rds")
audio_ids
#> # A tibble: 3 x 1
#> audio_id
#> <chr>
#> 1 Glaucidium-minutissimum-24426.wav
#> 2 Megascops-atricapilla-1261496.wav
#> 3 Megascops-atricapilla-1393458.wav
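To show how a file-level split carries over to the slices, here is a minimal sketch on a toy table. The column `slice_id`, the toy file names, and the 75% fraction are assumptions made for the example; the real slice table comes from the data-labeling step. It assigns each slice to the partition of its parent audio file and then checks that no file ends up on both sides.

```r
library(dplyr)

# Toy slice table; `slice_id` is a hypothetical column used for illustration.
slices <- tibble::tibble(
  audio_id = rep(c("a.wav", "b.wav", "c.wav", "d.wav"), each = 3),
  slice_id = rep(1:3, times = 4)
)

set.seed(1)
audio_ids <- slices %>% distinct(audio_id)
audio_ids_train <- audio_ids %>% sample_frac(0.75)
audio_ids_test <- audio_ids %>% anti_join(audio_ids_train, by = "audio_id")

# Attach every slice to the partition of its parent audio file.
slices_train <- slices %>% semi_join(audio_ids_train, by = "audio_id")
slices_test <- slices %>% semi_join(audio_ids_test, by = "audio_id")

# Sanity checks: no file on both sides, and no slice lost.
stopifnot(length(intersect(slices_train$audio_id, slices_test$audio_id)) == 0)
stopifnot(nrow(slices_train) + nrow(slices_test) == nrow(slices))
```

Using `semi_join()` keeps the slice table's own columns untouched, which seems convenient here because the id tables carry nothing but `audio_id`.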