MusicGen is a music-generation model based on a language model (LM).
Music can be generated from a text prompt or an audio prompt.
MusicGen is built on the Transformer architecture:
text -> encoder -> hidden representation -> decoder -> music
# Imports, assuming the MindSpore/mindnlp port of transformers
# (return_tensors='ms' and .asnumpy() below are MindSpore idioms).
import scipy.io.wavfile
from IPython.display import Audio
from mindnlp.transformers import AutoProcessor, MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained('facebook/musicgen-small')
Two decoding modes: greedy and sampling (sampling is reported to give noticeably better results).
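The difference between the two modes can be sketched on a toy next-token distribution (pure NumPy, no model involved; the probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distribution over a vocabulary of 4 audio codes
# (made-up numbers, just to contrast the two decoding modes).
probs = np.array([0.1, 0.2, 0.6, 0.1])

# Greedy decoding: always pick the most likely token -> deterministic output.
greedy_token = int(np.argmax(probs))

# Sampling: draw from the distribution -> varied output, which tends to
# sound better for music than the single greedy continuation.
sampled_tokens = rng.choice(len(probs), size=10, p=probs)

print(greedy_token)      # always 2, the argmax
print(sampled_tokens)    # a mix of tokens, weighted by probs
```

In `model.generate`, `do_sample=True` selects the sampling mode; omitting it (or `do_sample=False`) gives greedy decoding.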
With no prompt (unconditional generation)
%%time
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
audio_values = model.generate(**unconditional_inputs, do_sample = True, max_new_tokens = 256)
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write('musicgen_out.wav', rate = sampling_rate, data = audio_values[0,0].asnumpy())
Audio(audio_values[0].asnumpy(), rate = sampling_rate)  # play inline in a Jupyter notebook
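scipy.io.wavfile.write just needs a sample rate and a NumPy array; a self-contained round trip with a synthetic sine tone (no model required; 32 kHz is musicgen-small's EnCodec sampling rate) shows the convention:

```python
import numpy as np
import scipy.io.wavfile

sampling_rate = 32000  # musicgen-small's audio codec operates at 32 kHz
t = np.linspace(0, 1.0, sampling_rate, endpoint=False)
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)  # 1 s of A4

# Write and read back: write() takes (filename, rate, data),
# read() returns (rate, data).
scipy.io.wavfile.write('tone.wav', rate=sampling_rate, data=tone)
rate, data = scipy.io.wavfile.read('tone.wav')
print(rate, data.shape)  # 32000 (32000,)
```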
audio_length_in_s = 256/model.config.audio_encoder.frame_rate
audio_length_in_s
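The length calculation only needs the decoder's frame rate: one generated token corresponds to one codec frame. Assuming a frame rate of 50 Hz (read it from `model.config.audio_encoder.frame_rate` in practice), 256 new tokens come out to about 5 seconds:

```python
def audio_length_s(num_tokens: int, frame_rate: float = 50.0) -> float:
    """Length of generated audio in seconds: one token per codec frame."""
    return num_tokens / frame_rate

print(audio_length_s(256))  # 5.12 seconds at 50 frames/s
```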
With text prompt
processor = AutoProcessor.from_pretrained('facebook/musicgen-small')
inputs = processor(
    text = ['90s pop track with bassy drums and synth', '80s rock song with loud guitars and heavy drums'],
    padding = True,
    return_tensors = 'ms'
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens = 256)
scipy.io.wavfile.write('musicgen_out_text.wav', rate = sampling_rate, data = audio_values[0,0].asnumpy())
Audio(audio_values[0].asnumpy(), rate = sampling_rate)
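guidance_scale controls classifier-free guidance: the model scores each token both with and without the text condition, and the two score sets are mixed so that higher values follow the prompt more closely (at some cost to diversity). A minimal sketch of that mixing rule on made-up logits:

```python
import numpy as np

def cfg_mix(cond_logits, uncond_logits, guidance_scale):
    # Classifier-free guidance: push logits away from the unconditional
    # scores and toward the text-conditional ones.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

cond = np.array([1.0, 2.0, 0.5])    # logits with the text prompt
uncond = np.array([1.0, 1.0, 1.0])  # logits without any prompt

print(cfg_mix(cond, uncond, 1.0))  # scale 1 reproduces the conditional logits
print(cfg_mix(cond, uncond, 3.0))  # scale 3 amplifies the prompt's preference
```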
With audio prompt
from datasets import load_dataset

processor = AutoProcessor.from_pretrained('facebook/musicgen-small')
dataset = load_dataset('sanchit-gandhi/gtzan', split = 'train', streaming = True)
sample = next(iter(dataset))['audio']
sample['array'] = sample['array'][:len(sample['array']) // 2]  # keep the first half as the audio prompt
inputs = processor(
    audio = sample['array'],
    sampling_rate = sample['sampling_rate'],
    text = ['80s blues track with groovy saxophone'],
    padding = True,
    return_tensors = 'ms'
)
audio_values = model.generate(**inputs, do_sample = True, guidance_scale= 3, max_new_tokens = 256)
Batch generation
sample = next(iter(dataset))['audio']
sample_1 = sample['array'][:len(sample['array']) // 4]  # first quarter of the clip
sample_2 = sample['array'][:len(sample['array']) // 2]  # first half of the clip
inputs = processor(
    audio = [sample_1, sample_2],
    sampling_rate = sample['sampling_rate'],
    text = ['80s blues track with groovy saxophone', '90s rock song with loud guitars and heavy drums'],
    padding = True,
    return_tensors = 'ms'
)
audio_values = model.generate(**inputs, do_sample = True, guidance_scale = 3, max_new_tokens = 256)
audio_values = processor.batch_decode(audio_values, padding_mask = inputs.padding_mask)
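batch_decode is needed because the two audio prompts have different lengths, so the shorter output carries padding; the padding mask marks which positions are real. A toy NumPy sketch of the idea (the real processor handles this internally):

```python
import numpy as np

def strip_padding(audio_batch, padding_mask):
    # Keep only the positions the mask marks as real (1); padded
    # positions (0) were added so the batch could be one rectangular array.
    return [audio[mask.astype(bool)] for audio, mask in zip(audio_batch, padding_mask)]

batch = np.array([[0.1, 0.2, 0.3, 0.0],   # shorter sample, last slot is padding
                  [0.5, 0.4, 0.3, 0.2]])  # full-length sample
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 1]])

decoded = strip_padding(batch, mask)
print([len(a) for a in decoded])  # [3, 4]
```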