TTS Speech 2.6 HD API | MiniMax High-Quality Speech Synthesis

MiniMax Speech-2.6-hd Synchronous Speech Synthesis

curl --request POST \
  --url https://api.highwayapi.ai/v3/minimax-speech-2.6-hd \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: <content-type>' \
  --data '
{
  "text": "<string>",
  "voice_setting": {
    "speed": 123,
    "vol": 123,
    "pitch": 123,
    "voice_id": "<string>",
    "emotion": "<string>",
    "latex_read": true,
    "text_normalization": true
  },
  "audio_setting": {
    "sample_rate": 123,
    "bitrate": 123,
    "format": "<string>",
    "channel": 123
  },
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  },
  "timbre_weights": [
    {
      "voice_id": "<string>",
      "weight": 123
    }
  ],
  "stream": true,
  "stream_options": {
    "exclude_aggregated_audio": true
  },
  "language_boost": "<string>",
  "output_format": "<string>",
  "voice_modify": {
    "pitch": 123,
    "intensity": 123,
    "timbre": 123,
    "sound_effects": "<string>"
  }
}
'

{
  "audio": "<string>",
  "status": 123
}

POST

minimax-speech-2.6-hd

This API supports synchronous text-to-speech generation, with a maximum of 10,000 characters per request. It offers 100+ system voices as well as cloned voices; adjustable volume, pitch, speech rate, and output format; proportional voice mixing; custom pause intervals; and streaming output. Supported audio formats are mp3, pcm, flac, and wav. When the result is returned as a url, the url is valid for 24 hours from the time it is returned, so download the audio promptly.

Suitable for scenarios such as short sentence generation, voice chat, and online social networking. It has low latency but a text length limit of under 10,000 characters. For long text, we recommend using asynchronous speech synthesis.

Request Headers

Content-Type

string

required

Enum value: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API Key}}.
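
As a minimal sketch of how the two headers fit together, the Python snippet below builds (but does not send) a request using only the standard library. The build_request helper and the YOUR_API_KEY placeholder are illustrative assumptions; the URL and header names are taken from the curl sample above.

```python
import json
import urllib.request

API_URL = "https://api.highwayapi.ai/v3/minimax-speech-2.6-hd"


def build_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build a POST request with the two required headers (hypothetical helper)."""
    body = {"text": text, "voice_setting": {"voice_id": voice_id}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# YOUR_API_KEY is a placeholder, not a real credential.
req = build_request("YOUR_API_KEY", "Hello, world.", "male-qn-qingse")
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes the headers and body easy to inspect before any network call is made.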

Request Body

text

string

required

The text to synthesize, with a length limit of under 10,000 characters. Replace paragraph breaks with newline characters. To control pauses in the speech, insert <#x#> between text segments, where x is the pause duration in seconds (0.01-99.99, with up to two decimal places). A pause marker must sit between two text segments that can be pronounced, and multiple consecutive pause markers are not allowed.
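
The pause syntax can be illustrated with a small Python sketch. The pause and join_with_pause helpers are hypothetical names introduced here; they only format the <#x#> marker described above and are not part of the API.

```python
def pause(seconds: float) -> str:
    """Return a <#x#> pause marker, rounded to two decimal places."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{seconds:.2f}#>"


def join_with_pause(segments: list[str], seconds: float) -> str:
    """Join non-empty segments with exactly one pause marker between each pair."""
    spoken = [s for s in segments if s.strip()]
    return pause(seconds).join(spoken)


text = join_with_pause(["Hello.", "", "How are you?"], 1.5)
```

Dropping empty segments before joining keeps a marker from landing next to another marker, which the rule above forbids.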

voice_setting

object

required

speed

float

default:"1.0"

The speech rate of the generated voice. Optional. Range [0.5,2]; the default value is 1.0. The larger the value, the faster the speech rate.

vol

float

default:"1.0"

The volume of the generated voice. Optional. Range (0,10]; the default value is 1.0. The larger the value, the higher the volume.

pitch

int

default:"0"

The pitch of the generated voice. Optional. Range [-12,12]; the default value is 0, which outputs the original voice. The value must be an integer.
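
A client may want to check these three numeric ranges before sending a request. The following Python sketch does that; check_voice_numbers is a hypothetical helper, and only the ranges themselves come from the descriptions above.

```python
def check_voice_numbers(speed: float = 1.0, vol: float = 1.0, pitch: int = 0) -> list[str]:
    """Return a list of range violations (empty when all values are acceptable)."""
    problems = []
    if not 0.5 <= speed <= 2:
        problems.append("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:  # lower bound is exclusive
        problems.append("vol must be in (0, 10]")
    if isinstance(pitch, bool) or not isinstance(pitch, int) or not -12 <= pitch <= 12:
        problems.append("pitch must be an integer in [-12, 12]")
    return problems


ok = check_voice_numbers(speed=1.2, vol=2.0, pitch=-3)
bad = check_voice_numbers(vol=0, pitch=1.5)
```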

voice_id

string

The voice ID to request. One of voice_id or timbre_weights is required. Two types are supported: system voices and cloned voices. The system voice IDs are as follows:

Young male voice: male-qn-qingse
Elite young male voice: male-qn-jingying
Dominant young male voice: male-qn-badao
Male college student voice: male-qn-daxuesheng
Girl voice: female-shaonv
Mature sister voice: female-yujie
Mature female voice: female-chengshu
Sweet female voice: female-tianmei
Male presenter: presenter_male
Female presenter: presenter_female
Male audiobook 1: audiobook_male_1
Male audiobook 2: audiobook_male_2
Female audiobook 1: audiobook_female_1
Female audiobook 2: audiobook_female_2
Young male voice-beta: male-qn-qingse-jingpin
Elite young male voice-beta: male-qn-jingying-jingpin
Dominant young male voice-beta: male-qn-badao-jingpin
Male college student voice-beta: male-qn-daxuesheng-jingpin
Girl voice-beta: female-shaonv-jingpin
Mature sister voice-beta: female-yujie-jingpin
Mature female voice-beta: female-chengshu-jingpin
Sweet female voice-beta: female-tianmei-jingpin
Clever boy: clever_boy
Cute boy: cute_boy
Lovely girl: lovely_girl
Cartoon Pig Xiaoqi: cartoon_pig
Yandere younger brother: bingjiao_didi
Handsome boyfriend: junlang_nanyou
Innocent junior schoolmate: chunzhen_xuedi
Aloof senior schoolmate: lengdan_xiongzhang
Dominant young master: badao_shaoye
Sweetheart Xiaoling: tianxin_xiaoling
Playful cute girl: qiaopi_mengmei
Charming mature sister: wumei_yujie
Sweet-voiced junior schoolmate: diadia_xuemei
Elegant senior schoolmate: danya_xuejie
Santa Claus: Santa_Claus
Grinch: Grinch
Rudolph: Rudolph
Arnold: Arnold
Charming Santa: Charming_Santa
Charming Lady: Charming_Lady
Sweet Girl: Sweet_Girl
Cute Elf: Cute_Elf
Attractive Girl: Attractive_Girl
Serene Woman: Serene_Woman

emotion

string

Controls the emotion of the synthesized speech. Seven emotions are currently supported. Allowed values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]

latex_read

bool

default:"false"

Controls whether LaTeX formulas are read aloud. The default is false. Notes:

Formulas in the request must be wrapped in $$ at the beginning and end;
If a formula in the request contains "\", it must be escaped as "\\".

Example: The basic formula for derivatives is $$\\frac{d}{dx}(x^n) = nx^{n-1}$$
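
The doubled backslash in the example is what a JSON encoder produces on its own. As a sketch, assuming the request body is built in Python and serialized with json.dumps, the formula can be written with single backslashes in a raw string and the encoder handles the escaping:

```python
import json

# Raw string: single backslashes, exactly as the formula is written in LaTeX.
text = r"The basic formula for derivatives is $$\frac{d}{dx}(x^n) = nx^{n-1}$$"

body = {
    "text": text,
    "voice_setting": {"voice_id": "male-qn-qingse", "latex_read": True},
}

# json.dumps escapes each backslash, so the wire format contains \\frac.
encoded = json.dumps(body)
```

Escaping by hand is only needed when the JSON is typed literally, as in a curl command.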

text_normalization

bool

default:"false"

This parameter supports English text normalization, which can improve performance in number-reading scenarios but will slightly increase latency. If not provided, the default value is false.

audio_setting

object

sample_rate

int

default:"32000"

The sample rate of the generated voice. Optional. Allowed values: [8000, 16000, 22050, 24000, 32000, 44100]; defaults to 32000.

bitrate

int

default:"128000"

The bitrate of the generated voice. Optional. Allowed values: [32000, 64000, 128000, 256000]; the default value is 128000. This parameter only takes effect for audio in mp3 format.

format

string

default:"mp3"

The generated audio format. Defaults to mp3, range [mp3,pcm,flac,wav]. wav is only supported for non-streaming output.

channel

int

default:"1"

The number of channels in the generated audio. Options: 1 (mono, the default) or 2 (stereo).
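
The audio_setting constraints can be collected into one place. The Python sketch below is one possible way to do so; build_audio_setting is a hypothetical helper, and leaving out bitrate for non-mp3 formats is a design choice made here (the parameter only takes effect for mp3), not documented API behavior.

```python
SAMPLE_RATES = (8000, 16000, 22050, 24000, 32000, 44100)
BITRATES = (32000, 64000, 128000, 256000)
FORMATS = ("mp3", "pcm", "flac", "wav")


def build_audio_setting(sample_rate=32000, bitrate=128000, fmt="mp3",
                        channel=1, stream=False) -> dict:
    """Validate the values against the documented options and return audio_setting."""
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {SAMPLE_RATES}")
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    if fmt == "wav" and stream:
        raise ValueError("wav is only supported for non-streaming output")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    setting = {"sample_rate": sample_rate, "format": fmt, "channel": channel}
    if fmt == "mp3":
        if bitrate not in BITRATES:
            raise ValueError(f"bitrate must be one of {BITRATES}")
        setting["bitrate"] = bitrate
    return setting


default_setting = build_audio_setting()
flac_stereo = build_audio_setting(sample_rate=44100, fmt="flac", channel=2)
```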

pronunciation_dict

object

tone

list

Text and symbols that require special annotation, along with their replacement pronunciations. Use it to adjust tones or to replace the pronunciation of other characters, in the following format:

["燕少飞/(yan4)(shao3)(fei1)", "达菲/(da2)(fei1)", "omg/oh my god"]

Tones are represented by numbers: first tone (yinping) is 1, second tone (yangping) is 2, third tone (shangsheng) is 3, fourth tone (qusheng) is 4, and neutral tone is 5.
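
The entry format can be generated rather than typed by hand. In this Python sketch, pinyin and tone_entry are hypothetical helpers that only reproduce the "text/pronunciation" string format shown in the tone description.

```python
def pinyin(*syllables: tuple[str, int]) -> str:
    """Format (sound, tone) pairs as (yan4)(shao3)...; tones run from 1 to 5."""
    parts = []
    for sound, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError("tone must be 1-4, or 5 for the neutral tone")
        parts.append(f"({sound}{tone})")
    return "".join(parts)


def tone_entry(text: str, pronunciation: str) -> str:
    """Build one entry in the form text/pronunciation."""
    return f"{text}/{pronunciation}"


pronunciation_dict = {
    "tone": [
        tone_entry("燕少飞", pinyin(("yan", 4), ("shao", 3), ("fei", 1))),
        tone_entry("omg", "oh my god"),
    ]
}
```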

timbre_weights

object[]

One of timbre_weights or voice_id is required.

voice_id

string

The requested voice id. Must be filled in together with the weight parameter.

weight

int

Weight, range [1,100]. Must be filled in together with voice_id. The value must be an integer. Up to 4 voices can be mixed. The higher the proportion of a single voice, the more the synthesized voice will resemble it.
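
A voice mix can be validated against these limits before it is sent. The sketch below assumes a hypothetical build_timbre_weights helper that takes a mapping of voice_id to weight; it does not assume the weights must sum to any particular total, since the description does not say so.

```python
def build_timbre_weights(mix: dict[str, int]) -> list[dict]:
    """Turn {voice_id: weight} into the timbre_weights list, checking documented limits."""
    if not 1 <= len(mix) <= 4:
        raise ValueError("timbre_weights supports mixing 1 to 4 voices")
    for voice_id, weight in mix.items():
        if isinstance(weight, bool) or not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id} must be an integer in [1, 100]")
    return [{"voice_id": v, "weight": w} for v, w in mix.items()]


timbre_weights = build_timbre_weights({"male-qn-qingse": 70, "female-shaonv": 30})
```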

stream

boolean

default:"false"

Whether to stream. The default is false, meaning streaming is not enabled.

stream_options

object

exclude_aggregated_audio

boolean

default:"false"

When this parameter is set to true, the final chunk of a streaming response does not include the complete stitched audio hex data. The default is false, meaning the final chunk includes the complete stitched audio hex data.
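
The effect of this flag on client code can be sketched in Python. The example assumes each streamed chunk has already been parsed into a dict with the audio (hex) and status fields from the response description; the transport framing of the stream is not covered in this document, and assemble_stream is a hypothetical helper.

```python
from typing import Iterable


def assemble_stream(chunks: Iterable[dict], exclude_aggregated_audio: bool = False) -> bytes:
    """Collect streamed hex audio into bytes (chunk shape is an assumption)."""
    pieces = []
    for chunk in chunks:
        audio_hex = chunk.get("audio") or ""
        if chunk.get("status") == 2 and not exclude_aggregated_audio:
            # Final chunk already carries the complete stitched audio.
            return bytes.fromhex(audio_hex)
        pieces.append(bytes.fromhex(audio_hex))
    # No aggregated audio available: stitch the pieces together ourselves.
    return b"".join(pieces)


with_aggregate = assemble_stream([
    {"audio": "0102", "status": 1},
    {"audio": "0304", "status": 1},
    {"audio": "01020304", "status": 2},
])
without_aggregate = assemble_stream([
    {"audio": "0102", "status": 1},
    {"audio": "0304", "status": 1},
    {"audio": "", "status": 2},
], exclude_aggregated_audio=True)
```

Setting the flag to true avoids receiving the audio twice, at the cost of concatenating the chunks on the client.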

language_boost

string

default:"null"

Enhances recognition of the specified low-resource language or dialect, which can improve speech quality in that language or dialect. If the language type is unclear, choose "auto" and the model will determine the language type itself. The following values are supported:

'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'Bulgarian', 'Danish', 'Hebrew', 'Malay', 'Persian', 'Slovak', 'Swedish', 'Croatian', 'Filipino', 'Hungarian', 'Norwegian', 'Slovenian', 'Catalan', 'Nynorsk', 'Tamil', 'Afrikaans', 'auto'

output_format

string

default:"hex"

Controls the format of the returned result. Allowed values are url and hex; the default value is hex. This parameter only takes effect in non-streaming scenarios; streaming scenarios only support returning in hex format. The returned url is valid for 24 hours.

voice_modify

object

Voice effects settings. This parameter supports the following audio formats:

Non-streaming: mp3, wav, flac
Streaming: mp3

pitch

integer

Pitch adjustment (deep/bright), range [-100,100]. Values closer to -100 make the voice deeper; values closer to 100 make the voice brighter.

intensity

integer

Intensity adjustment (powerful/soft), range [-100,100]. Values closer to -100 make the voice stronger; values closer to 100 make the voice softer.

timbre

integer

Timbre adjustment (magnetic/crisp), range [-100,100]. Values closer to -100 make the voice richer; values closer to 100 make the voice crisper.

sound_effects

string

Sound effects settings. Only one can be selected per request. Optional values:

spacious_echo (spacious echo)
auditorium_echo (auditorium broadcast)
lofi_telephone (telephone distortion)
robotic (electronic voice)
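
The voice_modify fields can be assembled with a small checker. In the Python sketch below, build_voice_modify is a hypothetical helper; fields that are not supplied are omitted because this document does not state their defaults.

```python
from typing import Optional

SOUND_EFFECTS = ("spacious_echo", "auditorium_echo", "lofi_telephone", "robotic")


def build_voice_modify(pitch: Optional[int] = None, intensity: Optional[int] = None,
                       timbre: Optional[int] = None,
                       sound_effects: Optional[str] = None) -> dict:
    """Return a voice_modify object containing only the fields that were supplied."""
    result = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        result[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"sound_effects must be one of {SOUND_EFFECTS}")
        result["sound_effects"] = sound_effects  # only one effect per request
    return result


voice_modify = build_voice_modify(pitch=-20, timbre=35, sound_effects="robotic")
```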

Response Information

audio

string

The synthesized audio segment, encoded in hex, generated according to the format defined in the input (audio_setting.format) (mp3/pcm/flac). The return format depends on the definition of output_format; when stream is true, only the hex return format is supported.

status

number

Current audio stream status, returned only when stream is true. 1 indicates synthesis in progress, and 2 indicates synthesis completed.
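
Turning the hex-encoded audio field back into binary data takes one call in Python. The sketch below assumes the response has already been parsed into a dict, and that a url result (output_format set to url) appears in the same audio field; the latter is an assumption, so the hypothetical decode_audio helper refuses to hex-decode anything that looks like a URL.

```python
def decode_audio(response: dict) -> bytes:
    """Decode the hex-encoded audio field of a non-streaming response."""
    audio = response["audio"]
    if audio.startswith(("http://", "https://")):
        raise ValueError("audio looks like a url; download it instead of hex-decoding")
    return bytes.fromhex(audio)


# Toy response: "494433" is the hex encoding of the three bytes "ID3".
sample = {"audio": "494433"}
audio_bytes = decode_audio(sample)
# To save: open("speech.mp3", "wb").write(audio_bytes)
```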
