Feature Overview
Vision-Language Models (VLMs) are multimodal large models that support both image and text inputs, with the ability to understand image content and process cross-modal information. Based on combined image and text information, the model can generate high-quality responses and is widely used in scenarios such as image recognition, content understanding, and intelligent Q&A.Typical Use Cases
- Image Content Recognition and Description: Automatically identify objects, colors, scenes, and spatial relationships in images, and generate natural language descriptions.
- Comprehensive Image-Text Understanding: Combine image and text inputs to enable context-aware multi-turn conversations and responses to complex tasks.
- Visual-Assisted Q&A: Can serve as a supplement to OCR tools by recognizing text embedded in images and answering questions.
- Future Extended Applications: Suitable for interactive scenarios such as intelligent visual assistants, robotic perception, and augmented reality.
API Usage Instructions
To call a vision-language model, use the/chat/completions endpoint, which supports mixed image and text inputs.
Image Processing Parameter
Set the image processing precision through thedetail field. The following options are supported:
high: High resolution, preserves more details, suitable for fine-grained tasks.low: Low resolution, faster processing, suitable for real-time responses.auto: The system automatically selects the appropriate mode.
Message Format Examples
URL Image Format
Base64 Image Format
Base64 Image Encoding Example Code (Python)
Multi-Image Mode
Multiple images and text can be sent together as input. For better performance and comprehension, it is recommended to use no more than two images.Supported Models
The following are the vision-language models (VLMs) currently supported by the platform:Billing Method
Image inputs for vision-language models are converted into Tokens and billed together with text:- The image Token estimation rules vary slightly by model;
- Detailed billing standards can be found on the corresponding model introduction page.
API Call Example Code
Single-Image Description
Multi-Image Comparative Analysis
FAQ and Notes
- Image resolution and clarity affect model recognition accuracy. It is recommended to use clear image sources.
- Base64 encoding results in larger payloads. It is recommended that images do not exceed 1MB.
- If you encounter any issues, please refer to the platform developer documentation or submit a support ticket.