Unlocking Speech–Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning - Audio Demo

This webpage presents audio examples from SpeechCombine, an instruction-following speech language model trained without any instruction tuning, using only a single round of speech pre-training on 30k hours of data.

Overall framework

The following demos correspond to Section 4 of the original paper and show how SpeechCombine generates appropriate output on text-oriented, speech understanding, and speech generation tasks. We present the tasks in reverse order, starting with generation: that task is evaluated primarily on the expressiveness of the output, where subjective perception plays a central role, so it is best judged by listening. The other two tasks, in contrast, primarily evaluate accuracy-oriented capabilities.