MiniMax Speech 2.8 HD Synchronous Speech Synthesis

curl --request POST \
  --url https://api.highwayapi.ai/v3/minimax-speech-2.8-hd \
  --header 'Authorization: Bearer <API Key>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "text": "<string>",
  "stream": true,
  "voice_modify": {
    "pitch": 123,
    "timbre": 123,
    "intensity": 123,
    "sound_effects": "<string>"
  },
  "audio_setting": {
    "format": "<string>",
    "bitrate": 123,
    "channel": 123,
    "force_cbr": true,
    "sample_rate": 123
  },
  "output_format": "<string>",
  "voice_setting": {
    "vol": 123,
    "pitch": 123,
    "speed": 123,
    "emotion": "<string>",
    "voice_id": "<string>",
    "latex_read": true,
    "text_normalization": true
  },
  "aigc_watermark": true,
  "language_boost": "<string>",
  "stream_options": {
    "exclude_aggregated_audio": true
  },
  "timber_weights": [
    {
      "weight": 123,
      "voice_id": "<string>"
    }
  ],
  "subtitle_enable": true,
  "continuous_sound": true,
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  }
}
'

{
  "data": {},
  "trace_id": "<string>",
  "base_resp": {},
  "extra_info": {}
}

POST /v3/minimax-speech-2.8-hd

Convert text to speech, with support for multiple voices, emotion control, speech rate adjustment, and more. The text length must be less than 10000 characters. If the text length is greater than 3000 characters, streaming output is recommended.
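
The length rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper name `plan_request` is hypothetical, and it assumes that "characters" corresponds to Python's `len()` (Unicode code points), which the description does not specify.

```python
MAX_TEXT_CHARS = 10_000    # text must be shorter than this
STREAM_HINT_CHARS = 3_000  # streaming is recommended above this


def plan_request(text: str) -> dict:
    """Return the text plus a suggested stream flag (hypothetical helper)."""
    if not text:
        raise ValueError("text is required")
    if len(text) >= MAX_TEXT_CHARS:
        raise ValueError("text must be less than 10000 characters")
    # Streaming is only a recommendation for long text, not a requirement.
    return {"text": text, "stream": len(text) > STREAM_HINT_CHARS}
```

The returned dictionary can serve as the starting point for the request body; the remaining fields are added separately.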

Request Headers

Content-Type

string

required

Enum value: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.
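
As an illustration of the two required headers, the sketch below prepares a request with Python's standard library without sending it. The function name `build_request` is hypothetical; the URL is the one shown in the curl example.

```python
import json
import urllib.request

API_URL = "https://api.highwayapi.ai/v3/minimax-speech-2.8-hd"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request with the required headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request.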

Request Body

text

string

required

The text to synthesize into speech. The length must be less than 10000 characters. If the text length is greater than 3000 characters, streaming output is recommended. Supports paragraph switching (line breaks), pause control (<#x#> markers), and filler or sound-effect tags such as (laughs) and (coughs). These tags are supported only by speech-2.8-hd and speech-2.8-turbo.
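
A small sketch of how pause markers might be inserted between text segments. It assumes that x in <#x#> is the pause duration in seconds, which the description above does not state; the helper name `join_with_pause` is hypothetical.

```python
def join_with_pause(segments: list[str], seconds: float) -> str:
    """Join text segments with a <#x#> pause marker between them."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    # ASSUMPTION: x is a duration in seconds; ":g" drops trailing zeros.
    marker = f"<#{seconds:g}#>"
    return marker.join(segments)
```

For example, `join_with_pause(["Hello.", "Welcome back."], 0.5)` yields `Hello.<#0.5#>Welcome back.`.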

stream

boolean

default:false

Controls whether to use streaming output. The default is false (streaming disabled).

voice_modify

object

pitch

integer

Pitch adjustment (deep/bright). Value range: [-100, 100]. Values closer to -100 make the voice deeper; values closer to 100 make the voice brighter.

timbre

integer

Timbre adjustment (rich/crisp). Value range: [-100, 100]. Values closer to -100 make the voice richer; values closer to 100 make the voice crisper.

intensity

integer

Intensity adjustment (powerful/soft). Value range: [-100, 100]. Values closer to -100 make the voice more forceful; values closer to 100 make the voice softer.

sound_effects

string

Sound effect setting. Only one effect can be selected per request. Available values: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic voice).
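
The voice_modify ranges and the sound-effect list can be validated before sending. This is a minimal client-side sketch; `build_voice_modify` is a hypothetical helper, and the API may apply its own checks.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def build_voice_modify(pitch=None, timbre=None, intensity=None, sound_effects=None):
    """Validate the supplied fields and return a voice_modify dict."""
    out = {}
    for name, value in (("pitch", pitch), ("timbre", timbre), ("intensity", intensity)):
        if value is None:
            continue  # omitted fields are left out of the request
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        out[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unknown sound effect: {sound_effects!r}")
        out["sound_effects"] = sound_effects
    return out
```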

audio_setting

object

format

string

default:"mp3"

The format of the generated audio. Available values: mp3, pcm, flac, wav. wav is supported only for non-streaming output.

bitrate

integer

default:128000

The bitrate of the generated audio. Available values: 32000, 64000, 128000, 256000. The default value is 128000. This parameter only takes effect for audio in mp3 format.

channel

integer

default:1

The number of channels in the generated audio. Available values: 1 (mono) and 2 (stereo). The default value is 1.

force_cbr

boolean

default:false

Controls whether the audio is encoded at a constant bitrate (CBR). When this parameter is set to true, the audio is encoded at a constant bitrate. Note: this parameter only takes effect when streaming output is enabled and the audio format is mp3.

sample_rate

integer

default:32000

The sample rate of the generated audio. Available values: 8000, 16000, 22050, 24000, 32000, 44100. The default value is 32000.
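
The audio_setting fields interact with stream: wav is limited to non-streaming output, bitrate applies only to mp3, and force_cbr applies only to streaming mp3. The sketch below encodes those rules in a hypothetical helper, `build_audio_setting`. Leaving out fields that would have no effect is a design choice of this example, not an API requirement.

```python
FORMATS = {"mp3", "pcm", "flac", "wav"}
BITRATES = {32000, 64000, 128000, 256000}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}


def build_audio_setting(stream: bool, format: str = "mp3", bitrate: int = 128000,
                        channel: int = 1, sample_rate: int = 32000,
                        force_cbr: bool = False) -> dict:
    """Validate audio options against the documented values."""
    if format not in FORMATS:
        raise ValueError(f"unsupported format: {format!r}")
    if format == "wav" and stream:
        raise ValueError("wav is supported only for non-streaming output")
    if bitrate not in BITRATES:
        raise ValueError(f"unsupported bitrate: {bitrate}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")

    setting = {"format": format, "channel": channel, "sample_rate": sample_rate}
    if format == "mp3":
        setting["bitrate"] = bitrate      # only takes effect for mp3
        if stream and force_cbr:
            setting["force_cbr"] = True   # only takes effect for streaming mp3
    return setting
```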

output_format

string

default:"hex"

Controls the form of the output result. Available values: url, hex. The default value is hex. This parameter only takes effect in non-streaming scenarios; streaming scenarios only support returning hex. The returned url is valid for 24 hours.
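
When the output is returned as hex, the string has to be decoded into bytes before it can be played or saved. A minimal sketch follows; where the hex string sits inside the response is not described here, so the caller is assumed to have extracted it already, and `hex_to_audio_bytes` is a hypothetical name.

```python
def hex_to_audio_bytes(hex_audio: str) -> bytes:
    """Decode hex-encoded audio into raw bytes.

    bytes.fromhex raises ValueError if the string is not valid hex.
    """
    return bytes.fromhex(hex_audio)
```

The result can then be written to a file opened in binary mode, for example with `open("speech.mp3", "wb")`.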

voice_setting

object

vol

number

default:1

The volume of the synthesized audio. The larger the value, the higher the volume. Value range: (0, 10]. The default value is 1.0.

pitch

integer

default:0

The pitch (intonation) of the synthesized audio. Value range: [-12, 12]. The default value is 0, which outputs the original voice.

speed

number

default:1

The speech rate of the synthesized audio. The larger the value, the faster the speech rate. Value range: [0.5, 2]. The default value is 1.0.

emotion

string

Controls the emotion of the synthesized speech. Available values: happy, sad, angry, fearful, disgusted, surprised, calm (neutral), fluent (vivid), and whisper. The model automatically matches an appropriate emotion based on the input text, so manual specification is usually not required.

voice_id

string

required

The voice ID for the synthesized audio. To set a mixed voice, set the timber_weights parameter and set this parameter to an empty value. Supports three types: system voices, cloned voices, and text-generated voices

latex_read

boolean

default:false

Controls whether to read LaTeX formulas aloud. The default is false. Only Chinese is supported. When this parameter is enabled, the language_boost parameter is set to Chinese.

text_normalization

boolean

default:false

Whether to enable Chinese and English text normalization. Enabling it can improve performance in numeric reading scenarios, but will slightly increase latency. The default value is false
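
A voice_setting object can be assembled with the documented ranges checked up front. This is a sketch under stated assumptions: `build_voice_setting` is a hypothetical helper, it treats vol as the half-open range (0, 10], and it omits emotion unless one is given so that the model can choose.

```python
from typing import Optional

EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted",
            "surprised", "calm", "fluent", "whisper"}


def build_voice_setting(voice_id: str, vol: float = 1.0, pitch: int = 0,
                        speed: float = 1.0, emotion: Optional[str] = None) -> dict:
    """Return a voice_setting dict after checking the documented ranges."""
    if not 0 < vol <= 10:  # ASSUMPTION: 0 itself is excluded
        raise ValueError("vol must be in (0, 10]")
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer in [-12, 12]")
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    setting = {"voice_id": voice_id, "vol": vol, "pitch": pitch, "speed": speed}
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion!r}")
        setting["emotion"] = emotion
    return setting
```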

aigc_watermark

boolean

default:false

Controls whether to add an audio rhythm identifier at the end of the synthesized audio. The default value is false. This parameter only takes effect for non-streaming synthesis

language_boost

string

Specifies a language or dialect for which recognition should be enhanced, intended for low-resource languages and dialects. The default value is null. Set it to auto to let the model determine the language automatically. Available values: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto

stream_options

object

exclude_aggregated_audio

boolean

default:false

Sets whether the last chunk contains the concatenated speech hex data. The default value is false, meaning the last chunk contains the complete concatenated speech hex data
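
The effect of exclude_aggregated_audio on client code can be sketched as follows. The example assumes that the caller has already extracted one hex string per streamed chunk, in arrival order; how chunks are delivered and where the hex sits inside each chunk are not described here. `assemble_stream_audio` is a hypothetical helper.

```python
def assemble_stream_audio(chunk_hex: list[str], exclude_aggregated_audio: bool) -> bytes:
    """Build the full audio from per-chunk hex strings."""
    if not chunk_hex:
        return b""
    if exclude_aggregated_audio:
        # No aggregated final chunk: concatenate every chunk ourselves.
        return bytes.fromhex("".join(chunk_hex))
    # Default behaviour: the last chunk already holds the complete audio,
    # so concatenating all chunks would duplicate it.
    return bytes.fromhex(chunk_hex[-1])
```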

timber_weights

array

Mixed voice settings. Supports mixing up to 4 voices

weight

integer

required

The weight of each voice in the synthesized audio. Must be filled in together with voice_id. Value range: [1, 100]. The higher the proportion of a single voice, the more similar the synthesized voice will be to that voice.

voice_id

string

required

The voice ID for the synthesized audio. Must be filled in together with the weight parameter. Supports three types: system voices, cloned voices, and text-generated voices
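
To request a mixed voice, timber_weights is filled in and voice_setting.voice_id is left empty. The sketch below builds both parts together; `build_mixed_voice` and the voice IDs in the usage note are hypothetical.

```python
def build_mixed_voice(weights: dict[str, int]) -> dict:
    """Map {voice_id: weight} to the request fields for a mixed voice."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("timber_weights supports mixing up to 4 voices")
    entries = []
    for voice_id, weight in weights.items():
        if not voice_id:
            raise ValueError("each entry needs a voice_id")
        is_int = isinstance(weight, int) and not isinstance(weight, bool)
        if not is_int or not 1 <= weight <= 100:
            raise ValueError("weight must be an integer in [1, 100]")
        entries.append({"voice_id": voice_id, "weight": weight})
    # voice_setting.voice_id is set to an empty value when mixing voices.
    return {"voice_setting": {"voice_id": ""}, "timber_weights": entries}
```

For example, `build_mixed_voice({"voice-a": 70, "voice-b": 30})` produces two timber_weights entries and an empty voice_setting.voice_id; other voice_setting fields can be merged in afterwards.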

subtitle_enable

boolean

default:false

Controls whether to enable the subtitle service. The default value is false. This parameter is only valid in non-streaming output scenarios, and only for the speech-2.6-hd, speech-2.6-turbo, speech-02-turbo, speech-02-hd, speech-01-turbo, and speech-01-hd models

continuous_sound

boolean

default:false

Enable this parameter to make transitions between clauses more natural. Only the speech-2.8-hd and speech-2.8-turbo models are supported

pronunciation_dict

object

tone

array

Defines phonetic annotations or pronunciation replacement rules for text or symbols that require special annotation. In Chinese text, tones are represented by numbers: first tone is 1, second tone is 2, third tone is 3, fourth tone is 4, and neutral tone is 5. Example: ["燕少飞/(yan4)(shao3)(fei1)", "omg/oh my god"]
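
Entries in tone follow the "text/pronunciation" pattern shown in the example. The helpers below are a hypothetical sketch that produces strings of that shape; the exact grammar accepted by the API is inferred from that single example.

```python
def pinyin_rule(text: str, syllables: list[tuple[str, int]]) -> str:
    """Build a rule such as '燕少飞/(yan4)(shao3)(fei1)'."""
    for _, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):  # 5 is the neutral tone
            raise ValueError("tone must be between 1 and 5")
    return text + "/" + "".join(f"({syl}{tone})" for syl, tone in syllables)


def replacement_rule(text: str, spoken: str) -> str:
    """Build a rule such as 'omg/oh my god'."""
    return f"{text}/{spoken}"


pronunciation_dict = {
    "tone": [
        pinyin_rule("燕少飞", [("yan", 4), ("shao", 3), ("fei", 1)]),
        replacement_rule("omg", "oh my god"),
    ]
}
```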

Response Information

data

object

The returned synthesis data object. It may be null, so a non-null check is required

trace_id

string

The ID of this session. Include it when asking for support or giving feedback so that the issue can be located.

base_resp

object

The status code and details of this request

extra_info

object

Additional information about the audio
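
Because data may be null, client code should check it before use. The sketch below also inspects base_resp, but the field names status_code and status_msg, and the convention that 0 means success, are assumptions that this page does not document. `extract_data` is a hypothetical helper.

```python
from typing import Optional


def extract_data(response: dict) -> Optional[dict]:
    """Return the data object, or None when the response carries no data."""
    base = response.get("base_resp") or {}
    # ASSUMPTION: base_resp exposes status_code / status_msg, with 0 = success.
    code = base.get("status_code", 0)
    if code != 0:
        raise RuntimeError(
            f"request failed: {code} {base.get('status_msg', '')} "
            f"(trace_id={response.get('trace_id')})"
        )
    return response.get("data")  # may be None, so callers must check
```

Including trace_id in the error message keeps it at hand when reporting a problem.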
