MiniMax Speech-2.5-hd-preview Asynchronous Speech Synthesis

curl --request POST \
  --url https://api.highwayapi.ai/v3/async/minimax-speech-2.5-hd-preview \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: <content-type>' \
  --data '
{
  "text": "<string>",
  "voice_setting": {
    "speed": 123,
    "vol": 123,
    "pitch": 123,
    "voice_id": "<string>",
    "emotion": "<string>",
    "text_normalization": true
  },
  "audio_setting": {
    "sample_rate": 123,
    "bitrate": 123,
    "format": "<string>",
    "channel": 123
  },
  "pronunciation_dict": {
    "tone": [
      {}
    ]
  },
  "language_boost": "<string>",
  "voice_modify": {
    "pitch": 123,
    "intensity": 123,
    "timbre": 123,
    "sound_effects": "<string>"
  }
}
'

{
  "task_id": "<string>"
}

POST /v3/async/minimax-speech-2.5-hd-preview

This API performs asynchronous text-to-speech generation. A single request can carry up to 1 million characters of text, and the complete generated audio is retrieved asynchronously. It supports 100+ system voices as well as cloned voices, with custom adjustment of pitch, speed, volume, bitrate, sample rate, and output format. The URL returned for a long-text synthesis task is valid for 24 hours from the time it is returned, so download the result within that window.

This API is suited to generating speech from long texts such as entire books. Tasks may spend a relatively long time in the queue. For short sentences, voice chat, and online social interaction, we recommend synchronous speech synthesis instead.
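
As a minimal sketch of the request described above, the following Python builds the POST request using only the standard library. The endpoint and header names come from the curl example; the helper name `build_tts_request` is hypothetical, and the assumption that `text` plus a `voice_id` is a sufficient body is not confirmed by this page. The network call is left commented out.

```python
import json
import urllib.request

# Endpoint taken from the curl example on this page.
ENDPOINT = "https://api.highwayapi.ai/v3/async/minimax-speech-2.5-hd-preview"


def build_tts_request(api_key: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an async synthesis task.

    Hypothetical helper: assumes a body with only `text` and
    `voice_setting.voice_id` is acceptable.
    """
    payload = {
        "text": text,
        "voice_setting": {"voice_id": voice_id},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request("YOUR_API_KEY", "Hello, world.", "presenter_female")
    # Sending is commented out so the sketch runs without network access:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["task_id"])
    print(req.get_method(), req.full_url)
```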

Request Headers

Content-Type

string

required

Enum value: application/json

Authorization

string

required

Bearer authentication format: Bearer {{API key}}.

Request Body

text

string

required

The text to synthesize, with a maximum length of 50,000 characters.

voice_setting

object

required

speed

number

Range [0.5, 2], default value is 1.0. The speaking speed of the generated voice. Optional. The larger the value, the faster the speed.

vol

number

Range (0, 10], default value is 1.0. The volume of the generated voice. Optional. The larger the value, the higher the volume.

pitch

number

default:0

Range [-12, 12], default value is 0. The pitch of the generated voice. Optional. A value of 0 outputs the original voice; the value must be an integer.

voice_id

string

The requested voice ID. Two types are supported: system voice IDs and cloned voice IDs. The system voice IDs are as follows:

Young Male Voice: male-qn-qingse
Elite Young Male Voice: male-qn-jingying
Domineering Young Male Voice: male-qn-badao
Male College Student Voice: male-qn-daxuesheng
Young Girl Voice: female-shaonv
Mature Sister Voice: female-yujie
Mature Female Voice: female-chengshu
Sweet Female Voice: female-tianmei
Male Presenter: presenter_male
Female Presenter: presenter_female
Male Audiobook 1: audiobook_male_1
Male Audiobook 2: audiobook_male_2
Female Audiobook 1: audiobook_female_1
Female Audiobook 2: audiobook_female_2
Young Male Voice-beta: male-qn-qingse-jingpin
Elite Young Male Voice-beta: male-qn-jingying-jingpin
Domineering Young Male Voice-beta: male-qn-badao-jingpin
Male College Student Voice-beta: male-qn-daxuesheng-jingpin
Young Girl Voice-beta: female-shaonv-jingpin
Mature Sister Voice-beta: female-yujie-jingpin
Mature Female Voice-beta: female-chengshu-jingpin
Sweet Female Voice-beta: female-tianmei-jingpin
Smart Boy: clever_boy
Cute Boy: cute_boy
Lovely Girl: lovely_girl
Cartoon Pig Xiaoqi: cartoon_pig
Yandere Younger Brother: bingjiao_didi
Handsome Boyfriend: junlang_nanyou
Innocent Junior Schoolmate: chunzhen_xuedi
Aloof Senior Schoolmate: lengdan_xiongzhang
Domineering Young Master: badao_shaoye
Sweetheart Xiaoling: tianxin_xiaoling
Playful Cute Girl: qiaopi_mengmei
Charming Mature Sister: wumei_yujie
Cutesy Junior Schoolmate: diadia_xuemei
Elegant Senior Schoolmate: danya_xuejie
Santa Claus: Santa_Claus
Grinch: Grinch
Rudolph: Rudolph
Arnold: Arnold
Charming Santa: Charming_Santa
Charming Lady: Charming_Lady
Sweet Girl: Sweet_Girl
Cute Elf: Cute_Elf
Attractive Girl: Attractive_Girl
Serene Woman: Serene_Woman

emotion

string

Controls the emotion of the synthesized speech. Seven emotions are currently supported. Allowed values: ["happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"]

text_normalization

bool

default:false

Enables English text normalization, which can improve how numbers are read aloud but may slightly increase latency. Defaults to false when omitted.
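
The documented ranges for `voice_setting` can be checked client-side before a task is queued. The sketch below is illustrative only: `make_voice_setting` is a hypothetical helper, and whether the server rejects or clamps out-of-range values is not stated on this page.

```python
from typing import Optional

# Allowed emotion values as listed in the parameter description.
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}


def make_voice_setting(
    voice_id: str,
    speed: float = 1.0,
    vol: float = 1.0,
    pitch: int = 0,
    emotion: Optional[str] = None,
    text_normalization: bool = False,
) -> dict:
    """Return a voice_setting dict after checking the documented ranges."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be in [0.5, 2]")
    if not 0 < vol <= 10:
        raise ValueError("vol must be in (0, 10]")
    # pitch must be an integer; bool is excluded because it subclasses int.
    if isinstance(pitch, bool) or not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer in [-12, 12]")
    if emotion is not None and emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")

    setting = {
        "voice_id": voice_id,
        "speed": speed,
        "vol": vol,
        "pitch": pitch,
        "text_normalization": text_normalization,
    }
    if emotion is not None:
        setting["emotion"] = emotion  # omitted entirely when not requested
    return setting
```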

audio_setting

object

sample_rate

number

default:32000

Allowed values: [8000, 16000, 22050, 24000, 32000, 44100]. The sample rate of the generated voice. Optional, defaults to 32000.

bitrate

number

default:128000

Allowed values: [32000, 64000, 128000, 256000]. The bitrate of the generated voice. Optional, default value is 128000. This parameter only takes effect for audio in mp3 format.

format

string

default:"mp3"

The generated audio format. Default is mp3. Options: mp3, pcm, flac, wav. wav is only supported for non-streaming output.

channel

number

default:1

The number of channels for the generated audio. Default is 1 (mono). Options: 1 (mono), 2 (stereo).
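
The `audio_setting` fields are all enumerations, so a small builder can reject unsupported combinations early. This is a sketch under assumptions: `make_audio_setting` is a hypothetical helper, and dropping `bitrate` for non-mp3 formats is a design choice made here because the page says the parameter only takes effect for mp3.

```python
# Enumerated values copied from the parameter descriptions above.
SAMPLE_RATES = (8000, 16000, 22050, 24000, 32000, 44100)
BITRATES = (32000, 64000, 128000, 256000)
FORMATS = ("mp3", "pcm", "flac", "wav")
CHANNELS = (1, 2)  # 1 = mono, 2 = stereo


def make_audio_setting(
    sample_rate: int = 32000,
    bitrate: int = 128000,
    fmt: str = "mp3",
    channel: int = 1,
) -> dict:
    """Return an audio_setting dict restricted to the documented values."""
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {SAMPLE_RATES}")
    if bitrate not in BITRATES:
        raise ValueError(f"bitrate must be one of {BITRATES}")
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    if channel not in CHANNELS:
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")

    setting = {"sample_rate": sample_rate, "format": fmt, "channel": channel}
    if fmt == "mp3":
        # bitrate is documented as only taking effect for mp3 output.
        setting["bitrate"] = bitrate
    return setting
```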

pronunciation_dict

object

tone

list

Replacement rules for text and symbols whose pronunciation requires special annotation. Each entry either adjusts tones or substitutes another pronunciation, in the following format: ["燕少飞/(yan4)(shao3)(fei1)", "达菲/(da2)(fei1)", "omg/oh my god"]. Tones are represented by numbers: first tone (Yinping) is 1, second tone (Yangping) is 2, third tone (Shangsheng) is 3, fourth tone (Qusheng) is 4, and neutral tone is 5.
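
The entry format above is easy to get wrong by hand, so the strings can be assembled programmatically. The helpers `tone_entry` and `alias_entry` below are hypothetical names for illustration; they only reproduce the string layout shown in the example and do not check that the pinyin itself is valid.

```python
from typing import Iterable, Tuple

VALID_TONES = (1, 2, 3, 4, 5)  # 5 is the neutral tone


def tone_entry(text: str, syllables: Iterable[Tuple[str, int]]) -> str:
    """Build an entry such as '达菲/(da2)(fei1)' from (pinyin, tone) pairs."""
    parts = []
    for pinyin, tone in syllables:
        if tone not in VALID_TONES:
            raise ValueError(f"tone must be one of {VALID_TONES}, got {tone!r}")
        parts.append(f"({pinyin}{tone})")
    if not parts:
        raise ValueError("at least one syllable is required")
    return f"{text}/{''.join(parts)}"


def alias_entry(text: str, replacement: str) -> str:
    """Build a plain replacement entry such as 'omg/oh my god'."""
    return f"{text}/{replacement}"


pronunciation_dict = {
    "tone": [
        tone_entry("燕少飞", [("yan", 4), ("shao", 3), ("fei", 1)]),
        tone_entry("达菲", [("da", 2), ("fei", 1)]),
        alias_entry("omg", "oh my god"),
    ]
}
```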

language_boost

string

default:"null"

Enhances recognition of the specified low-resource language or dialect, which can improve speech quality in that language or dialect. If the language is unclear, choose "auto" and the model will determine it automatically. The following values are supported:

'Chinese', 'Chinese,Yue', 'English', 'Arabic', 'Russian', 'Spanish', 'French', 'Portuguese', 'German', 'Turkish', 'Dutch', 'Ukrainian', 'Vietnamese', 'Indonesian', 'Japanese', 'Italian', 'Korean', 'Thai', 'Polish', 'Romanian', 'Greek', 'Czech', 'Finnish', 'Hindi', 'Bulgarian', 'Danish', 'Hebrew', 'Malay', 'Persian', 'Slovak', 'Swedish', 'Croatian', 'Filipino', 'Hungarian', 'Norwegian', 'Slovenian', 'Catalan', 'Nynorsk', 'Tamil', 'Afrikaans', 'auto'

voice_modify

object

Voice effects settings. Supported audio formats for this parameter: mp3, wav, flac.

pitch

integer

Pitch adjustment (deep/bright), range [-100,100]. Values closer to -100 make the voice deeper; values closer to 100 make the voice brighter.

intensity

integer

Intensity adjustment (powerful/soft), range [-100,100]. Values closer to -100 make the voice more forceful; values closer to 100 make the voice softer.

timbre

integer

Timbre adjustment (magnetic/crisp), range [-100,100]. Values closer to -100 make the voice richer; values closer to 100 make the voice crisper.

sound_effects

string

Sound effect settings. Only one option can be selected per request. Options:

spacious_echo (spacious echo)
auditorium_echo (auditorium broadcast)
lofi_telephone (telephone distortion)
robotic (electronic voice)
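
Note that `voice_modify.pitch` uses the range [-100, 100], unlike `voice_setting.pitch`, which uses [-12, 12]. A minimal sketch of a builder that enforces the `voice_modify` constraints follows; `make_voice_modify` is a hypothetical helper, and sending only the fields that were explicitly set is an assumption about how the API treats omitted fields.

```python
from typing import Optional

# Sound effect identifiers as listed above; only one may be selected per request.
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def make_voice_modify(
    pitch: Optional[int] = None,
    intensity: Optional[int] = None,
    timbre: Optional[int] = None,
    sound_effects: Optional[str] = None,
) -> dict:
    """Return a voice_modify dict containing only the fields that were set."""
    modify = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if value is None:
            continue
        # Each adjustment is documented as an integer in [-100, 100].
        if isinstance(value, bool) or not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer in [-100, 100]")
        modify[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unsupported sound effect: {sound_effects!r}")
        modify["sound_effects"] = sound_effects
    return modify
```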

Response Parameters

task_id

string

required

The ID of the asynchronous task. Pass this task_id to the Query Task Result API to obtain the generated result.
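
Since the response body carries only `task_id`, extracting it defensively is straightforward. The sketch below uses a hypothetical helper, `extract_task_id`; the Query Task Result API is not described on this page, so polling is not shown.

```python
import json


def extract_task_id(response_body: str) -> str:
    """Parse the JSON response body and return its task_id string."""
    data = json.loads(response_body)
    if not isinstance(data, dict):
        raise ValueError("response body is not a JSON object")
    task_id = data.get("task_id")
    if not isinstance(task_id, str) or not task_id:
        raise ValueError("response does not contain a task_id string")
    return task_id
```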
