POST /v4beta/txt2speech
Fish Audio Text-to-Speech
curl --request POST \
  --url https://api.highwayapi.ai/v4beta/txt2speech \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "temperature": 0.9,
  "top_p": 0.9,
  "references": null,
  "reference_id": "<string>",
  "prosody": {
    "speed": 1,
    "volume": 0
  },
  "chunk_length": 200,
  "normalize": true,
  "format": "mp3",
  "sample_rate": 44100,
  "mp3_bitrate": 128,
  "opus_bitrate": 32,
  "latency": "normal"
}
'
For best results, we recommend using audio cloning to upload reference audio before using this API. This will improve speech quality and reduce latency.
Fish Audio converts text into speech. Supported audio formats:
  • WAV / PCM
    • Sample rates: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
    • Default sample rate: 44.1kHz
    • 16-bit, mono
  • MP3
    • Sample rates: 32kHz, 44.1kHz
    • Default sample rate: 44.1kHz
    • Mono
    • Bitrates: 64kbps, 128kbps (default), 192kbps
  • Opus
    • Sample rate: 48kHz
    • Default sample rate: 48kHz
    • Mono
    • Bitrates: -1000 (automatic), 24kbps, 32kbps (default), 48kbps, 64kbps
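
The format list above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the table name `SUPPORTED` and the helper `check_audio_settings` are hypothetical, and it assumes `sample_rate` is sent in Hz (so 44.1kHz becomes 44100) and bitrates in kbps.

```python
# Supported settings transcribed from the list above.
# Assumption: sample_rate is sent in Hz (44.1kHz -> 44100); bitrates are in kbps.
_PCM_RATES = {8000, 16000, 24000, 32000, 44100}

SUPPORTED = {
    "wav": {"rates": _PCM_RATES, "default_rate": 44100,
            "bitrates": None, "default_bitrate": None},
    "pcm": {"rates": _PCM_RATES, "default_rate": 44100,
            "bitrates": None, "default_bitrate": None},
    "mp3": {"rates": {32000, 44100}, "default_rate": 44100,
            "bitrates": {64, 128, 192}, "default_bitrate": 128},
    "opus": {"rates": {48000}, "default_rate": 48000,
             "bitrates": {-1000, 24, 32, 48, 64}, "default_bitrate": 32},
}


def check_audio_settings(fmt, sample_rate=None, bitrate=None):
    """Return (sample_rate, bitrate) with defaults filled in, or raise ValueError."""
    if fmt not in SUPPORTED:
        raise ValueError(f"unsupported format: {fmt!r}")
    spec = SUPPORTED[fmt]

    # Fall back to the documented default sample rate for this format.
    rate = spec["default_rate"] if sample_rate is None else sample_rate
    if rate not in spec["rates"]:
        raise ValueError(f"{fmt} does not support a sample rate of {rate} Hz")

    # WAV and PCM have no bitrate setting in the list above.
    if spec["bitrates"] is None:
        if bitrate is not None:
            raise ValueError(f"{fmt} does not take a bitrate")
        return rate, None

    chosen = spec["default_bitrate"] if bitrate is None else bitrate
    if chosen not in spec["bitrates"]:
        raise ValueError(f"{fmt} does not support a bitrate of {chosen}")
    return rate, chosen
```

For example, `check_audio_settings("mp3")` fills in the documented MP3 defaults, while `check_audio_settings("mp3", sample_rate=8000)` raises because 8kHz is listed only for WAV / PCM.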

Request Headers

Content-Type
string
required
Enum value: application/json
Authorization
string
required
Bearer authentication format: Bearer {{API Key}}.
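
As an illustration of the two required headers, here is a small sketch in Python. The helper name `build_headers` is hypothetical; only the header names and the `Bearer` prefix come from the description above.

```python
def build_headers(api_key):
    """Build the two required request headers for a given API key."""
    if not api_key:
        raise ValueError("an API key is required")
    return {
        # Bearer authentication: the literal word "Bearer", a space, then the key.
        "Authorization": f"Bearer {api_key}",
        # The only documented Content-Type value.
        "Content-Type": "application/json",
    }
```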

Request Body

text
string
required
The text to convert to speech.
temperature
number
Controls the randomness of speech generation. Higher values (for example, 1.0) make the output more random, while lower values (for example, 0.1) make it more deterministic. We recommend using 0.9 for the s1 model. Required range: 0 <= x <= 1
top_p
number
Controls diversity through nucleus sampling. Lower values (for example, 0.1) make the output more focused, while higher values (for example, 1.0) allow more diversity. We recommend using 0.9 for the s1 model. Required range: 0 <= x <= 1
references
ReferenceAudio · object[] | null
Reference audio for the voice. This requires MessagePack serialization and will override reference_voices and reference_texts.
reference_id
string | null
The reference model ID for the voice.
prosody
ProsodyControl · object
Prosody control for the voice.
chunk_length
integer
default:200
The chunk length for the voice. Required range: 100 <= x <= 300
normalize
boolean
default:true
Whether to normalize the speech. This will reduce latency, but may reduce performance when handling numbers and dates.
format
enum<string>
default:"mp3"
The output format for the speech. Allowed values: wav, pcm, mp3, opus
sample_rate
integer | null
The sample rate for the speech.
mp3_bitrate
enum<integer>
default:128
The MP3 bitrate for the speech, in kbps. Allowed values: 64, 128, 192
opus_bitrate
enum<integer>
default:32
The Opus bitrate for the speech, in kbps. Allowed values: -1000 (automatic), 24, 32, 48, 64
latency
enum<string>
default:"normal"
The latency setting for the speech. The balanced setting reduces latency but may degrade performance. Allowed values: normal, balanced
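
The ranges and allowed values listed for the body fields can be enforced before sending a request. The sketch below is one possible way to do that, not an official client: `build_payload` is a hypothetical helper, it covers only a subset of the fields, and it leaves optional fields out of the body when they are not set (whether the server treats an omitted field the same as its default is an assumption).

```python
import json


def build_payload(text, *, temperature=None, top_p=None, reference_id=None,
                  chunk_length=200, normalize=True, fmt="mp3", latency="normal"):
    """Validate a subset of the documented fields and return a JSON-ready dict."""
    if not text:
        raise ValueError("text is required")

    # temperature and top_p share the documented range 0 <= x <= 1.
    for name, value in (("temperature", temperature), ("top_p", top_p)):
        if value is not None and not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    if not 100 <= chunk_length <= 300:
        raise ValueError(f"chunk_length must be between 100 and 300, got {chunk_length}")
    if fmt not in {"wav", "pcm", "mp3", "opus"}:
        raise ValueError(f"unsupported format: {fmt!r}")
    if latency not in {"normal", "balanced"}:
        raise ValueError(f"unsupported latency setting: {latency!r}")

    payload = {
        "text": text,
        "chunk_length": chunk_length,
        "normalize": normalize,
        "format": fmt,
        "latency": latency,
    }
    # Include optional fields only when the caller set them.
    optional = {"temperature": temperature, "top_p": top_p, "reference_id": reference_id}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


if __name__ == "__main__":
    print(json.dumps(build_payload("Hello, world.", temperature=0.9, top_p=0.9), indent=2))
```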

Response Information

The API will directly return an audio stream in the format specified by the format parameter (default: mp3).
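
Because the response body is the audio itself rather than JSON, a client can write it straight to disk as it arrives. The following is a rough sketch using only the Python standard library. The function names `save_audio` and `synthesize_to_file` and the 8192-byte read size are illustrative choices, and error handling for non-2xx responses is left out.

```python
import json
import urllib.request

# Endpoint taken from the curl example on this page.
API_URL = "https://api.highwayapi.ai/v4beta/txt2speech"


def save_audio(chunks, path):
    """Write an iterable of byte chunks to path and return the total byte count."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total


def synthesize_to_file(api_key, payload, path, chunk_size=8192):
    """POST the JSON payload and stream the returned audio to path."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # read() returns b"" at end of stream, which stops the iterator.
        return save_audio(iter(lambda: response.read(chunk_size), b""), path)
```

A call such as `synthesize_to_file(api_key, {"text": "Hello", "format": "mp3"}, "hello.mp3")` would then leave an MP3 file on disk; the file extension should match the `format` value sent in the request.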