Here are some tips on collecting data to train a RAVE timbre transfer model. This guidebook is the product of trial and error in training our own RAVE models, combined with some fundamental machine learning best practices.
Disclaimer
- Trial and error is often necessary for getting good models.
  - No one really knows how to make deep learning models work 100% of the time!
- Things in this guidebook may not apply to other timbre transfer models.
  - For example, DDSP only works with monophonic pitched instruments.
- It’s hard to get a timbre transfer effect that works with any input.
  - Models tend to work well with input that is somewhat similar to the training data.
- Please abide by the law and AI ethics when collecting and using data.
  - Although the legalities surrounding generative AI are constantly evolving, please refrain from using recordings and the likenesses of other artists without their consent.
  - Please refrain from using royalty-protected recordings without the owner’s consent.
  - Please read and abide by our terms and conditions.
Neutone Inc.
Data collection
- Data amount
  - More than 2 hours of sound data is ideal.
  - The dataset can be a collection of many short snippets or a single long recording.
- Recording quality
  - The audio should ideally be recorded in a clean environment.
  - All the sounds should be recorded in a similar environment.
- Diversity of the data
  - Too little diversity results in models that output the same thing over and over.
    - Ex.) Training on bike exhaust recordings resulted in a model that produces the same drone sound and is unresponsive to input.
  - With too much diversity, the RAVE model tends to fail.
    - Ex.) Compiling a large collection (>1000) of drum breaks with different timbres resulted in muddy output.
  - Recordings of a single instrument generally provide just the right amount of diversity.
- Range
  - The pitch range of the input affects the model’s reactivity.
  - Models tend to break and become unstable when they are fed input with frequency content that was not in the training data.
    - Ex.) Training a model on marimba sounds resulted in a model that can’t deal with high-frequency content.
    - Pre-filters can be applied to the model after training to prevent this behaviour.
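The two-hour guideline under "Data amount" is easy to check before training. The following is a minimal sketch, not part of any RAVE tooling: it assumes the dataset is a folder of uncompressed PCM `.wav` files (the standard-library `wave` module may reject other encodings), and the function names and the `TARGET_HOURS` constant are hypothetical names chosen for illustration.

```python
import wave
from pathlib import Path

# Guideline from the list above; the constant name is an assumption.
TARGET_HOURS = 2.0


def wav_duration_seconds(path):
    """Return the duration of a single PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def dataset_duration_hours(folder):
    """Sum the duration of every .wav file under `folder`, in hours.

    The search is recursive and matches the lowercase ".wav" suffix only.
    """
    wav_paths = sorted(Path(folder).rglob("*.wav"))
    total_seconds = sum(wav_duration_seconds(p) for p in wav_paths)
    return total_seconds / 3600.0


def meets_duration_guideline(folder, target_hours=TARGET_HOURS):
    """Return True if the dataset holds more than `target_hours` of audio."""
    return dataset_duration_hours(folder) > target_hours
```

Calling `dataset_duration_hours("my_dataset")` on a hypothetical `my_dataset` folder returns the total length in hours. This only measures quantity; it says nothing about recording quality, diversity, or range.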
Preprocessing
Preprocessing your data before training may improve model quality.
- Gain normalization is necessary if the original data is relatively quiet.
  - A model trained on quiet sounds can behave erratically when it is fed loud sounds as input.
- If your data doesn’t cover a wide enough range of pitch, pitch augmentation (creating variations by pitch shifting the data) may be beneficial.
  - This is also a good way to increase the amount of data.
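The two preprocessing steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the procedure used for our models: audio is assumed to be a non-empty mono floating-point array in the range [-1, 1], the -1 dBFS target and the semitone steps are arbitrary example values, and all function names are hypothetical. The pitch shift here is done by plain resampling, which also changes the duration of the clip; a duration-preserving pitch shifter could be swapped in.

```python
import numpy as np


def peak_normalize(audio, target_dbfs=-1.0):
    """Scale `audio` so its largest absolute sample sits at `target_dbfs`."""
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio.copy()  # pure silence: nothing to scale
    target_peak = 10.0 ** (target_dbfs / 20.0)  # dBFS -> linear amplitude
    return audio * (target_peak / peak)


def pitch_shift_by_resampling(audio, semitones):
    """Shift pitch by reading the signal faster or slower.

    Positive `semitones` raises the pitch and shortens the clip;
    negative values lower the pitch and lengthen it.
    """
    ratio = 2.0 ** (semitones / 12.0)  # frequency ratio for the shift
    new_length = int(round(len(audio) / ratio))
    # Fractional positions in the original signal, read by linear interpolation.
    read_positions = np.arange(new_length) * ratio
    return np.interp(read_positions, np.arange(len(audio)), audio)


def augment_with_pitch_shifts(audio, semitone_steps=(-2, -1, 1, 2)):
    """Return the original clip followed by one shifted copy per step."""
    return [audio] + [pitch_shift_by_resampling(audio, s) for s in semitone_steps]
```

With the default steps, `augment_with_pitch_shifts(clip)` returns five clips for each input clip, which widens pitch coverage and increases the amount of data at the same time. Large shifts change the character of the sound noticeably, so small steps are the safer starting point.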
Examples of trained RAVE models
- Speech
  - Ex.) evoice, jvoice
  - Creates a cool beatboxing-type effect when fed drums as input.
- Percussion
  - Ex.) drumkit, taiko
  - Follows the rhythmic content of the input.
- Melodic
  - Ex.) violin, choir
  - Follows pitch to an extent.
  - Needs a large dataset that covers a lot of different pitches.
- Textural
  - Ex.) kora, NASA
  - Doesn’t follow pitch so well but adds interesting texture/events to the input.