Feature Overview

Vision-Language Models (VLMs) are multimodal large models that accept both image and text inputs, understand image content, and reason across the two modalities. By combining image and text information, the model generates responses grounded in both, and it is used in scenarios such as image recognition, content understanding, and intelligent Q&A.

Typical Use Cases

  • Image Content Recognition and Description: Automatically identify objects, colors, scenes, and spatial relationships in images, and generate natural language descriptions.
  • Comprehensive Image-Text Understanding: Combine image and text inputs to enable context-aware multi-turn conversations and responses to complex tasks.
  • Visual-Assisted Q&A: Recognize text embedded in images and answer questions about it, as a supplement to OCR tools.
  • Future Extended Applications: Suitable for interactive scenarios such as intelligent visual assistants, robotic perception, and augmented reality.

API Usage Instructions

To call a vision-language model, use the /chat/completions endpoint, which supports mixed image and text inputs.

Image Processing Parameter

Set the image processing resolution with the detail field inside the image_url object. The following options are supported:
  • high: High resolution, preserves more details, suitable for fine-grained tasks.
  • low: Low resolution, faster processing, suitable for real-time responses.
  • auto: The system automatically selects the appropriate mode.
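
The three options above can be validated before a request is sent. The following is a minimal sketch, not part of the platform SDK: `image_part` and `VALID_DETAIL` are hypothetical names, and omitting the field when `detail` is None is an assumption based on the message examples in this document that leave it out.

```python
# Hypothetical helper: builds one image content part and validates `detail`.
VALID_DETAIL = {"high", "low", "auto"}

def image_part(url, detail=None):
    """Return an image_url content part; omit `detail` when it is None."""
    image_url = {"url": url}
    if detail is not None:
        if detail not in VALID_DETAIL:
            raise ValueError(
                f"detail must be one of {sorted(VALID_DETAIL)}, got {detail!r}"
            )
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}

part = image_part("https://example.com/image.png", detail="high")
```

Rejecting an unknown value locally surfaces a typo such as "hi" before it reaches the API.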

Message Format Examples

URL Image Format

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "Please describe the scene in the image."
    }
  ]
}

Base64 Image Format

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "What text content is in the image?"
    }
  ]
}

Base64 Image Encoding Example Code (Python)

import base64
from PIL import Image
import io

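def image_to_base64(image_path):
    """Re-encode an image as JPEG and return it as a base64 string."""
    with Image.open(image_path) as img:
        buffered = io.BytesIO()
        # JPEG has no alpha channel or palette mode, so convert to RGB first
        img.convert("RGB").save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")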

base64_image = image_to_base64("path/to/your/image.jpg")
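
If re-encoding to JPEG is not wanted, the file bytes can be encoded as they are and wrapped in a data URL. This is a sketch under assumptions: `file_to_data_url` is a hypothetical helper, the MIME type is guessed from the file extension only, and whether the platform accepts formats other than JPEG in a data URL is not confirmed here.

```python
import base64
import mimetypes

def file_to_data_url(image_path):
    """Encode a file's raw bytes as a data URL without re-compressing it."""
    # Guess the MIME type from the extension (e.g. ".png" -> "image/png")
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {image_path!r}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be placed directly in the url field of an image_url content part.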

Multi-Image Mode

Multiple images and text can be sent together as input. For better performance and comprehension, it is recommended to use no more than two images.

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "Compare the common features of these two images."
    }
  ]
}
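
A message like the one above can be assembled programmatically. The sketch below uses hypothetical names (`build_user_message`, `RECOMMENDED_MAX_IMAGES`) and only warns when the two-image recommendation is exceeded, since the text describes it as a recommendation rather than a hard limit.

```python
import warnings

RECOMMENDED_MAX_IMAGES = 2  # taken from the recommendation above

def build_user_message(image_urls, text):
    """Build a user message with the images first and the text prompt last."""
    if len(image_urls) > RECOMMENDED_MAX_IMAGES:
        warnings.warn(
            f"{len(image_urls)} images supplied; "
            f"no more than {RECOMMENDED_MAX_IMAGES} are recommended"
        )
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

message = build_user_message(
    ["https://example.com/image1.png", "https://example.com/image2.png"],
    "Compare the common features of these two images.",
)
```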

Supported Models

The following are the vision-language models (VLMs) currently supported by the platform:

Billing Method

Image inputs for vision-language models are converted into tokens and billed together with text tokens:
  • The image token estimation rules vary slightly by model.
  • Detailed billing standards can be found on the corresponding model introduction page.
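
Because image tokens are billed with text tokens, checking the token counts of a request is one way to see what an image cost. The sketch below assumes the response carries a `usage` object with the `prompt_tokens`, `completion_tokens`, and `total_tokens` fields used by the OpenAI Python SDK; whether this platform fills it in for every model is an assumption, and `describe_usage` is a hypothetical helper.

```python
from types import SimpleNamespace

def describe_usage(usage):
    """Format the token counts of a completed (non-streaming) request."""
    return (
        f"prompt={usage.prompt_tokens}, "
        f"completion={usage.completion_tokens}, "
        f"total={usage.total_tokens}"
    )

# Stand-in for response.usage, with made-up numbers for illustration only
example_usage = SimpleNamespace(prompt_tokens=1200, completion_tokens=80, total_tokens=1280)
summary = describe_usage(example_usage)
```

With a real non-streaming call, `describe_usage(response.usage)` would be used instead of the stand-in object. Any image tokens are counted in the prompt total.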

API Call Example Code

Single-Image Description

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.highwayapi.ai/openai")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cityscape.jpg"}},
                {"type": "text", "text": "Describe the main buildings in the image."}
            ]
        }
    ],
    stream=True
)

for chunk in response:
    # A streamed chunk may carry no choices, so guard before indexing
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Multi-Image Comparative Analysis

# Reuses the client created in the single-image example
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/product1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/product2.jpg"}},
                {"type": "text", "text": "Please compare the main differences between these two products."}
            ]
        }
    ],
    stream=True
)

for chunk in response:
    # A streamed chunk may carry no choices, so guard before indexing
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

FAQ and Notes

  • Image resolution and clarity affect recognition accuracy. Use clear, well-lit source images where possible.
  • Base64 encoding makes the payload about one third larger than the raw file. It is recommended that images do not exceed 1 MB.
  • If you encounter any issues, please refer to the platform developer documentation or submit a support ticket.
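
The size recommendation can be checked before encoding. This is a minimal sketch with hypothetical names; it assumes 1 MB means 1,000,000 bytes and that the recommendation applies to the raw image file, neither of which the notes above specify.

```python
MAX_IMAGE_BYTES = 1_000_000  # assumption: 1 MB read as 10^6 bytes

def base64_length(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    # Every group of up to 3 input bytes becomes 4 output characters
    return 4 * ((n_bytes + 2) // 3)

def within_recommended_size(n_bytes):
    """True when the raw image size is within the recommended limit."""
    return n_bytes <= MAX_IMAGE_BYTES

encoded_size = base64_length(MAX_IMAGE_BYTES)
```

For a file on disk, `os.path.getsize(path)` gives the byte count to pass in.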