Here are some tips for collecting data for training a RAVE timbre transfer model. This guidebook is the product of ‘trial and error’ while training our own RAVE models, coupled with some fundamental machine learning best practices.

Disclaimer

Trial and error is often necessary for getting good models.
- No one really knows how to make deep learning models work 100% of the time!
Things in this guidebook may not apply to other timbre transfer models.
- For example, DDSP only works with monophonic pitched instruments.
It’s hard to get a timbre transfer effect that works with any input.
- Models tend to work well with input that is somewhat similar to the training data.
Please abide by the law and AI ethics when collecting and using data.
- Although the legalities surrounding generative AI are constantly evolving, please refrain from using recordings and the likenesses of other artists without their consent.
- Please refrain from using royalty-protected recordings without the owner’s consent.
- Please read and abide by our terms and conditions.
Neutone Inc.

Data collection

Data amount
- More than 2 hours of audio is ideal.
- The dataset can be a collection of many short snippets or a single long recording.
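
To check a dataset against the 2-hour guideline, the total duration can be summed across files. The following is a minimal sketch that assumes the recordings are uncompressed PCM `.wav` files (the only kind Python's standard `wave` module reads); the function name `total_duration_hours` is made up for illustration.

```python
import wave
from pathlib import Path


def total_duration_hours(folder):
    """Sum the duration of every .wav file under `folder`, in hours."""
    total_seconds = 0.0
    for path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            # Duration of one file = number of frames / frames per second.
            total_seconds += wav_file.getnframes() / wav_file.getframerate()
    return total_seconds / 3600.0
```

For example, `total_duration_hours("my_dataset")` returning a value below 2.0 suggests that more recordings (or augmentation) may help.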
Recording quality
- The audio should ideally be recorded in a clean environment.
- All the sounds should be recorded in a similar environment.
Diversity of the data
- Too little diversity results in models that output the same thing over and over.
  - Ex.) Training on bike exhaust recordings resulted in a model that produces the same drone sound and is unresponsive to input.
- Too much diversity and the RAVE model tends to fail.
  - Ex.) Training on a large collection (>1000) of drum breaks with different timbres resulted in a muddy output.
- Compiling recordings of a single instrument should generally be just the right amount of diversity.
Range
- The pitch range of the input affects the model’s reactivity.
- Models tend to break and become unstable when they are fed input with frequency content that was not in the training data.
  - Ex.) Training a model on marimba sounds resulted in a model that can’t deal with high-frequency content.
  - After training, a pre-filter can be applied to the model’s input to prevent this behaviour.
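
One possible reading of the pre-filter advice is to low-pass the input before it reaches the model, so that frequencies missing from the training data are attenuated. The sketch below uses a simple one-pole low-pass filter in plain Python. The filter type, the function name `one_pole_lowpass`, and the cutoff value are illustrative assumptions, not the filter used by any particular RAVE tool.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """Attenuate content above `cutoff_hz` with a one-pole low-pass filter."""
    # Standard one-pole coefficient: closer to 1.0 means heavier smoothing.
    coeff = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    filtered = []
    state = 0.0
    for sample in samples:
        # Each output is a weighted mix of the new sample and the previous output.
        state = (1.0 - coeff) * sample + coeff * state
        filtered.append(state)
    return filtered
```

A one-pole filter has a gentle slope (about 6 dB per octave), so a steeper filter may be needed if the model is very sensitive to out-of-range content.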

Preprocessing

Preprocessing your data before training may be effective in improving model quality.

Gain normalization is necessary if the original data is relatively quiet.
- A model trained on quiet sounds can behave erratically when it is fed loud sounds as input.
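
As a rough illustration, gain normalization can be as simple as scaling each recording so that its loudest sample reaches a fixed peak. The sketch below assumes samples are floats in the range -1.0 to 1.0; the function name `peak_normalize` and the target of 0.9 are assumptions chosen for the example.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale `samples` so the largest absolute value equals `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent (or empty) input: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Peak normalization is only one option; loudness-based normalization may give more consistent results across recordings with very different dynamics.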
If your data doesn’t cover a wide enough range of pitch, pitch augmentation (creating variations by pitch shifting the data) may be beneficial.
- This is also a good way to increase the amount of data.
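
The idea of pitch augmentation can be sketched with plain resampling: reading through the audio faster raises the pitch, and reading slower lowers it. Note that this simple approach also changes the duration of the clip, unlike the duration-preserving pitch shifters found in audio libraries. The function name `pitch_shift_by_resampling` is hypothetical.

```python
def pitch_shift_by_resampling(samples, semitones):
    """Shift pitch by `semitones` using linear-interpolation resampling.

    The output is shorter when shifting up and longer when shifting down.
    """
    ratio = 2.0 ** (semitones / 12.0)  # 12 semitones = one octave = 2x speed
    out_length = int(len(samples) / ratio)
    last = len(samples) - 1
    shifted = []
    for i in range(out_length):
        position = i * ratio
        left = min(int(position), last)
        right = min(left + 1, last)
        frac = position - left
        # Linearly interpolate between the two nearest input samples.
        shifted.append((1.0 - frac) * samples[left] + frac * samples[right])
    return shifted
```

Applying this with a few offsets (for example -2, -1, +1, +2 semitones) to each recording yields several pitch-shifted copies that can be added to the dataset.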

Examples of trained RAVE models

Speech
- Ex.) evoice, jvoice
- Creates a cool beatboxing-type effect when fed drums as input.
Percussion
- Ex.) drumkit, taiko
- Follows the rhythmic content of the input.
Melodic
- Ex.) violin, choir
- Follows pitch to an extent.
- Needs a large dataset that covers a lot of different pitches.
Textural
- Ex.) kora, NASA
- Doesn’t follow pitch so well but adds interesting texture/events to the input.