Fish Audio S2 Pro Text to Speech
POST
The Fish Audio S2 Pro text-to-speech model converts text into natural speech, supporting reference voices, sampling control, segmentation, audio formats, and prosody control.
Request Headers
Enum value: application/json
Bearer authentication format: Bearer {{API key}}.
Request Body
The text to convert into speech. S2-Pro multi-speaker text can use tags such as <|speaker:0|>Hello<|speaker:1|>Hi there.
Nucleus sampling diversity control. Value range: [0, 1]
Output audio format. Available values: wav, pcm, mp3, opus
Latency tier. Available values: low, normal, balanced
Prosody control.
Normalize Chinese and English text.
Reference audio samples for zero-shot voice cloning.
MP3 bitrate, in kbps. Available values: 64, 128, 192
Output sample rate in Hz. If empty, the default value for the format is used: 48000 Hz for opus, and usually 44100 Hz for others.
Expressiveness control. Value range: [0, 1]
Text chunk size. Value range: [100, 300]
Opus bitrate, in bps. -1000 indicates automatic. Available values: -1000, 24000, 32000, 48000, 64000
Voice model ID; in multi-speaker scenarios, an array matching the speaker indices can be provided.
Maximum number of audio tokens per chunk.
Minimum number of characters before chunking. Value range: [0, 100]
Penalty coefficient for reducing audio pattern repetition.
Early stopping threshold. Value range: [0, 1]
Use previous audio chunks as context.
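The value ranges and tag syntax listed above can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own validation. The field names are not given on this page, so every JSON key used here (top_p, expressiveness, early_stop_threshold, chunk_length, min_chunk_chars, format, latency, mp3_bitrate, opus_bitrate) is a hypothetical placeholder; only the ranges, allowed values, speaker tag format, and sample-rate defaults come from the descriptions above.

```python
# NOTE: every dictionary key below is a hypothetical placeholder. The page
# documents descriptions and value ranges, not the actual JSON field names.
ALLOWED_FORMATS = {"wav", "pcm", "mp3", "opus"}
ALLOWED_LATENCY = {"low", "normal", "balanced"}
ALLOWED_MP3_KBPS = {64, 128, 192}
ALLOWED_OPUS_BPS = {-1000, 24000, 32000, 48000, 64000}


def tag_speakers(turns):
    """Join (speaker_index, text) pairs using <|speaker:N|> tags."""
    return "".join(f"<|speaker:{index}|>{text}" for index, text in turns)


def default_sample_rate(audio_format):
    """48000 Hz for opus; 44100 Hz is the usual default for other formats."""
    return 48000 if audio_format == "opus" else 44100


def check_params(params):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Parameters documented with the value range [0, 1].
    for key in ("top_p", "expressiveness", "early_stop_threshold"):
        if key in params and not 0 <= params[key] <= 1:
            problems.append(f"{key} must be within [0, 1]")

    # Text chunk size: [100, 300]; minimum characters before chunking: [0, 100].
    if "chunk_length" in params and not 100 <= params["chunk_length"] <= 300:
        problems.append("chunk_length must be within [100, 300]")
    if "min_chunk_chars" in params and not 0 <= params["min_chunk_chars"] <= 100:
        problems.append("min_chunk_chars must be within [0, 100]")

    # Parameters documented with a fixed set of available values.
    choices = {
        "format": ALLOWED_FORMATS,
        "latency": ALLOWED_LATENCY,
        "mp3_bitrate": ALLOWED_MP3_KBPS,
        "opus_bitrate": ALLOWED_OPUS_BPS,
    }
    for key, allowed in choices.items():
        if key in params and params[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")

    return problems
```

For example, tag_speakers([(0, "Hello"), (1, "Hi there")]) reproduces the multi-speaker string shown in the text description, and check_params returns an empty list for a payload whose values all fall inside the documented ranges.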
Response Information
Generated audio. Format: binary
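Because the response body is raw audio rather than JSON, a client should write the bytes to a file unchanged. The sketch below uses only the Python standard library and rests on several assumptions: the endpoint URL is not stated on this page and must be supplied by the caller, the header names Content-Type and Authorization are inferred from the "application/json" and Bearer descriptions above, and the payload keys are whatever field names the service actually defines.

```python
import json
import urllib.request


def build_tts_request(url, api_key, payload):
    """Build (but do not send) a POST request with a JSON body.

    `url` is a caller-supplied assumption; this page does not state it.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Header names inferred from the Request Headers section.
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def synthesize_to_file(url, api_key, payload, out_path):
    """Send the request and save the binary audio response as-is."""
    request = build_tts_request(url, api_key, payload)
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    # Binary mode: the body is audio data, not text.
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

The file extension passed in out_path should match the requested output format (wav, pcm, mp3, or opus), since the service returns the audio in that format.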