POST /v3/fish-audio-s2-pro-text-to-speech
Fish Audio S2 Pro Text to Speech
curl --request POST \
  --url https://api.highwayapi.ai/v3/fish-audio-s2-pro-text-to-speech \
  --header 'Authorization: Bearer <API key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "top_p": 123,
  "format": "<string>",
  "latency": "<string>",
  "prosody": {
    "speed": 123,
    "volume": 123,
    "normalize_loudness": true
  },
  "normalize": true,
  "references": [
    {
      "text": "<string>",
      "audio": "<string>"
    }
  ],
  "mp3_bitrate": 123,
  "sample_rate": 123,
  "temperature": 123,
  "chunk_length": 123,
  "opus_bitrate": 123,
  "reference_id": "<string>",
  "max_new_tokens": 123,
  "min_chunk_length": 123,
  "repetition_penalty": 123,
  "early_stop_threshold": 123,
  "condition_on_previous_chunks": true
}
'
The Fish Audio S2 Pro text-to-speech model converts text into natural speech, supporting reference voices, sampling control, segmentation, audio formats, and prosody control.

Request Headers

Content-Type
string
required
Enum value: application/json
Authorization
string
required
Bearer authentication format: Bearer {{API key}}.
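
As a minimal sketch of the two required headers, the Python helper below builds them from an API key. The function name `build_headers` is hypothetical and not part of the API; only the header names and values come from the descriptions above.

```python
def build_headers(api_key: str) -> dict:
    """Return the two required request headers (helper name is hypothetical)."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        "Content-Type": "application/json",    # the only documented value
        "Authorization": f"Bearer {api_key}",  # Bearer authentication format
    }
```

Reading the key from an environment variable rather than hard-coding it keeps credentials out of source control.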

Request Body

text
string
required
The text to convert into speech. S2-Pro multi-speaker text can use tags such as <|speaker:0|>Hello<|speaker:1|>Hi there.
top_p
number
Nucleus sampling diversity control. Value range: [0, 1]
format
string
default:"mp3"
Output audio format. Available values: wav, pcm, mp3, opus
latency
string
default:"normal"
Latency tier. Available values: low, normal, balanced
prosody
object
Prosody control.
normalize
boolean
default:true
Normalize Chinese and English text.
references
array
Reference audio samples for zero-shot voice cloning.
mp3_bitrate
integer
default:128
MP3 bitrate, in kbps. Available values: 64, 128, 192
sample_rate
integer
Output sample rate in Hz. If empty, the default value for the format is used: 48000 Hz for opus, and usually 44100 Hz for others.
temperature
number
Expressiveness control. Value range: [0, 1]
chunk_length
integer
default:300
Text chunk size. Value range: [100, 300]
opus_bitrate
integer
Opus bitrate, in bps; -1000 selects automatic. Available values: -1000, 24000, 32000, 48000, 64000
reference_id
string
Voice model ID; in multi-speaker scenarios, an array matching the speaker indices can be provided.
max_new_tokens
integer
default:1024
Maximum number of audio tokens per chunk.
min_chunk_length
integer
default:50
Minimum number of characters before chunking. Value range: [0, 100]
repetition_penalty
number
Penalty coefficient for reducing audio pattern repetition.
early_stop_threshold
number
default:1
Early stopping threshold. Value range: [0, 1]
condition_on_previous_chunks
boolean
default:true
Use previous audio chunks as context.
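
The parameters above can be assembled and checked client-side before sending. The following Python sketch assumes nothing beyond the documented value lists and ranges; the names `multi_speaker_text`, `build_payload`, `ENUMS`, and `RANGES` are hypothetical helpers for illustration, and the server may enforce additional rules not listed here.

```python
import json

# Allowed values and ranges copied from the parameter descriptions above.
ENUMS = {
    "format": {"wav", "pcm", "mp3", "opus"},
    "latency": {"low", "normal", "balanced"},
    "mp3_bitrate": {64, 128, 192},
    "opus_bitrate": {-1000, 24000, 32000, 48000, 64000},
}
RANGES = {
    "top_p": (0, 1),
    "temperature": (0, 1),
    "chunk_length": (100, 300),
    "min_chunk_length": (0, 100),
    "early_stop_threshold": (0, 1),
}


def multi_speaker_text(turns):
    """Join (speaker_index, line) pairs using the <|speaker:N|> tag syntax."""
    return "".join(f"<|speaker:{idx}|>{line}" for idx, line in turns)


def build_payload(text, **options):
    """Validate options client-side and return the JSON request body as a str."""
    if not text:
        raise ValueError("text is required")
    for name, value in options.items():
        if name in ENUMS and value not in ENUMS[name]:
            raise ValueError(f"{name} must be one of {sorted(ENUMS[name])}")
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be in [{low}, {high}]")
    return json.dumps({"text": text, **options})
```

For example, `build_payload(multi_speaker_text([(0, "Hello"), (1, "Hi there")]), format="opus", top_p=0.8)` yields a body with a two-speaker `text` field, while `build_payload("x", chunk_length=50)` raises `ValueError` because 50 is below the documented minimum of 100.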

Response Information

Generated audio. Format: binary
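
Because the response body is binary audio, it should be written to disk in binary mode rather than decoded as text. The sketch below uses only the Python standard library; the function names, the 60-second timeout, and the assumption that a successful response contains nothing but the audio bytes are illustrative choices, not documented behavior.

```python
import json
import urllib.request

# Endpoint URL taken from the curl sample above.
API_URL = "https://api.highwayapi.ai/v3/fish-audio-s2-pro-text-to-speech"


def build_request(api_key, payload):
    """Build (but do not send) the POST request for a payload dict."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def synthesize_to_file(api_key, payload, path):
    """Send the request and write the binary response body to `path`."""
    request = build_request(api_key, payload)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(path, "wb") as out:
        out.write(audio)
    return len(audio)
```

A call such as `synthesize_to_file(api_key, {"text": "Hello", "format": "mp3"}, "hello.mp3")` would save the generated audio and return the number of bytes written; the file extension should match the requested `format`.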