Vision-Language Models

Feature Overview

Vision-Language Models (VLMs) are multimodal large models that accept both image and text inputs. They can interpret image content, relate it to the accompanying text, and generate responses grounded in both. Common applications include image recognition, content understanding, and intelligent Q&A.

Typical Use Cases

Image Content Recognition and Description: Automatically identify objects, colors, scenes, and spatial relationships in images, and generate natural language descriptions.
Comprehensive Image-Text Understanding: Combine image and text inputs for context-aware multi-turn conversations and complex tasks.
Visual-Assisted Q&A: Recognize text embedded in images and answer questions about it, complementing OCR tools.
Future Extended Applications: Support interactive scenarios such as intelligent visual assistants, robotic perception, and augmented reality.

API Usage Instructions

To call a vision-language model, use the /chat/completions endpoint, which supports mixed image and text inputs.
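As a rough sketch of what a raw HTTP call to this endpoint might look like without an SDK, the snippet below builds (but does not send) a request using only the Python standard library. The base URL is the one used in this page's SDK examples. The Bearer authorization header and the helper name build_chat_request are assumptions based on the usual OpenAI-compatible convention, not confirmed platform details.

```python
import json
import urllib.request

# Base URL as used in this page's SDK examples.
BASE_URL = "https://api.highwayapi.ai/openai"


def build_chat_request(api_key, model, messages):
    """Build a POST request for /chat/completions without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: OpenAI-style Bearer token authentication.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    api_key="YOUR_KEY",
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "Please describe the scene in the image."},
        ],
    }],
)
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```

Building the request separately from sending it makes the payload easy to inspect or log before any network traffic occurs.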

Image Processing Parameter

Use the detail field inside image_url to control the resolution at which an image is processed. The following options are supported:

high: High resolution, preserves more details, suitable for fine-grained tasks.
low: Low resolution, faster processing, suitable for real-time responses.
auto: The system automatically selects the appropriate mode.
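The three options above can be guarded with a small helper so that a typo in detail is caught before the request is sent. This is a minimal sketch: the helper name image_part is hypothetical, and defaulting to "auto" is this helper's choice rather than documented platform behavior.

```python
ALLOWED_DETAIL = {"high", "low", "auto"}


def image_part(url, detail="auto"):
    """Return an image_url content part with a validated detail setting."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(
            f"detail must be one of {sorted(ALLOWED_DETAIL)}, got {detail!r}"
        )
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


# Build a high-detail image part for a fine-grained task.
part = image_part("https://example.com/image.png", detail="high")
```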

Message Format Examples

URL Image Format

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "Please describe the scene in the image."
    }
  ]
}

Base64 Image Format

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "What text content is in the image?"
    }
  ]
}

Base64 Image Encoding Example Code (Python)

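import base64
import io

from PIL import Image

def image_to_base64(image_path):
    """Re-encode an image as JPEG and return it as a Base64 string."""
    with Image.open(image_path) as img:
        # JPEG cannot store an alpha channel or a palette, so convert to RGB first.
        if img.mode != "RGB":
            img = img.convert("RGB")
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

base64_image = image_to_base64("path/to/your/image.jpg")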
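If re-encoding through Pillow is not wanted, a data URL can also be built directly from the file's bytes. The sketch below uses only the standard library and guesses the MIME type from the file name. The helper names are hypothetical, and whether the platform accepts MIME types other than image/jpeg in data URLs is an assumption to verify.

```python
import base64
import mimetypes


def bytes_to_data_url(raw, filename):
    """Wrap raw image bytes in a data URL, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(path):
    """Read an image file and return it as a data URL without re-encoding it."""
    with open(path, "rb") as f:
        return bytes_to_data_url(f.read(), path)
```

The result of file_to_data_url can be used directly as the url value of an image_url content part.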

Multi-Image Mode

Multiple images and text can be sent together in a single message. For better performance and comprehension, we recommend using no more than two images.

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "Compare the common features of these two images."
    }
  ]
}
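A message like the one above can be assembled programmatically. The following sketch builds the content list from any number of image URLs and emits a warning when the two-image recommendation is exceeded. The helper name build_multi_image_message is hypothetical, and the limit is treated as advice rather than a hard cap.

```python
import warnings

RECOMMENDED_MAX_IMAGES = 2  # recommendation from this page, not an enforced limit


def build_multi_image_message(image_urls, prompt):
    """Build a user message with several images followed by one text prompt."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    if len(image_urls) > RECOMMENDED_MAX_IMAGES:
        warnings.warn(
            f"{len(image_urls)} images supplied; "
            f"no more than {RECOMMENDED_MAX_IMAGES} are recommended"
        )
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


message = build_multi_image_message(
    ["https://example.com/image1.png", "https://example.com/image2.png"],
    "Compare the common features of these two images.",
)
```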

Supported Models

The following are the vision-language models (VLMs) currently supported by the platform:

Billing Method

Image inputs for vision-language models are converted into tokens and billed together with text tokens:

The rules for estimating image tokens vary slightly by model.
Detailed billing standards are listed on each model's introduction page.
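One way to observe what a request actually consumed is to read the usage field of a non-streaming response. The sketch below assumes the platform returns the OpenAI-style usage object (prompt_tokens, completion_tokens, total_tokens) and that image tokens are counted within the prompt total; neither is confirmed by this page. The SimpleNamespace object is a stand-in with placeholder values, not real billing data.

```python
from types import SimpleNamespace


def summarize_usage(usage):
    """Format the token counts reported in a chat completion's usage field."""
    return (
        f"prompt={usage.prompt_tokens} "
        f"completion={usage.completion_tokens} "
        f"total={usage.total_tokens}"
    )


# Stand-in for `response.usage`; the numbers are placeholders for illustration only.
example_usage = SimpleNamespace(prompt_tokens=1200, completion_tokens=80, total_tokens=1280)
print(summarize_usage(example_usage))
```

With a real call, pass response.usage from a request made without stream=True.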

API Call Example Code

Single-Image Description

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.highwayapi.ai/openai")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cityscape.jpg"}},
                {"type": "text", "text": "Describe the main buildings in the image."}
            ]
        }
    ],
    stream=True
)

for chunk in response:
    # Skip chunks that carry no choices before reading the text delta.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Multi-Image Comparative Analysis

# Reuses the client created in the single-image example.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/product1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/product2.jpg"}},
                {"type": "text", "text": "Please compare the main differences between these two products."}
            ]
        }
    ],
    stream=True
)

for chunk in response:
    # Skip chunks that carry no choices before reading the text delta.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
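When the full reply is needed as a single string rather than printed incrementally, the streamed chunks can be accumulated. This is a minimal sketch: collect_stream is a hypothetical helper, and fake_chunk only imitates the shape of SDK chunks so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content may be None on some chunks
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk, used only to demonstrate the helper offline."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


stream = [fake_chunk("Two "), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("towers.")]
text = collect_stream(stream)
```

With a real request, pass the object returned by client.chat.completions.create(..., stream=True) in place of the fake list.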

FAQ and Notes

Image resolution and clarity affect recognition accuracy, so use clear source images.
Base64 encoding increases payload size by about one third. We recommend keeping images under 1MB.
If you encounter any issues, refer to the platform developer documentation or submit a support ticket.
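The size recommendation above can be checked before uploading. The sketch below computes how large the Base64 text will be for a given image size and tests the image against the 1MB recommendation. Interpreting 1MB as 1,000,000 bytes of raw image data is an assumption, and both helper names are hypothetical.

```python
MAX_IMAGE_BYTES = 1_000_000  # assumption: "1MB" read as 10^6 bytes of raw image data


def base64_length(raw_size):
    """Length in characters of the padded Base64 encoding of raw_size bytes."""
    # Base64 emits 4 characters for every group of up to 3 input bytes.
    return 4 * ((raw_size + 2) // 3)


def within_recommended_size(raw_size, limit=MAX_IMAGE_BYTES):
    """Return True if an image of raw_size bytes meets the size recommendation."""
    return raw_size <= limit


image_size = 750_000  # hypothetical image size in bytes
encoded_size = base64_length(image_size)
ok = within_recommended_size(image_size)
```

For a file on disk, os.path.getsize(path) provides raw_size.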
