GLM 4.7 Flash

GLM 4.7 Flash is the speed-optimized variant in Z.ai's GLM-4.7 generation, released January 19, 2026. It delivers faster inference for high-throughput workloads while retaining the coding, tool usage, and conversational improvements introduced in GLM-4.7.

Implicit Caching, Reasoning, Tool Use

index.ts

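import { streamText } from 'ai'

const result = streamText({
  model: 'zai/glm-4.7-flash',
  prompt: 'Why is the sky blue?'
})

// Print the response as it streams in.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}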

Playground

Try out GLM 4.7 Flash by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
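
A minimal sketch of how a provider preference might be expressed in code. It assumes the AI SDK accepts gateway routing hints under `providerOptions.gateway.order`, and the slugs `'bedrock'` and `'zai'` are assumptions for illustration; copy the real slugs from the provider list on this page. `buildRequest` is a hypothetical helper, not part of any SDK.

```typescript
// Hypothetical helper: builds the options object you would pass to
// streamText() or generateText() from the 'ai' package.
type GatewayRequest = {
  model: string
  prompt: string
  providerOptions: { gateway: { order: string[] } }
}

function buildRequest(prompt: string, providerOrder: string[]): GatewayRequest {
  if (providerOrder.length === 0) {
    throw new Error('providerOrder must list at least one provider slug')
  }
  return {
    model: 'zai/glm-4.7-flash',
    prompt,
    // Assumption: the gateway tries providers in the listed order.
    providerOptions: { gateway: { order: [...providerOrder] } },
  }
}

// Assumed slugs; check the provider list for the real values.
const request = buildRequest('Why is the sky blue?', ['bedrock', 'zai'])
console.log(JSON.stringify(request.providerOptions))
```

Keeping the preference in one helper means the ordering can be changed in a single place if latency or throughput figures shift between providers.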

Provider	Context	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
Z.ai	200K	-	-	$0.07/M	$0.40/M	$0.01/M	-	01/19/2026
Amazon Bedrock	200K	9.0s	55tps	$0.07/M	$0.40/M	-	-	01/19/2026
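
To make the per-token prices concrete, here is a rough cost estimate using the Z.ai rates listed above ($0.07/M input, $0.40/M output, $0.01/M cache read). The assumption that cached prompt tokens are billed at the cache-read rate instead of the input rate is ours, and `estimateCostUsd` is a hypothetical helper; check your invoice or the gateway docs for the actual billing rules.

```typescript
// Prices in USD per million tokens, taken from the Z.ai provider row.
const PRICES = { input: 0.07, output: 0.40, cacheRead: 0.01 }

type Usage = {
  inputTokens: number       // total prompt tokens, including cached ones
  cachedInputTokens: number // prompt tokens served from cache
  outputTokens: number
}

// Assumption: cached prompt tokens are billed at the cache-read rate
// instead of the regular input rate.
function estimateCostUsd(u: Usage): number {
  const uncached = u.inputTokens - u.cachedInputTokens
  if (uncached < 0) throw new Error('cachedInputTokens exceeds inputTokens')
  return (
    (uncached * PRICES.input +
      u.cachedInputTokens * PRICES.cacheRead +
      u.outputTokens * PRICES.output) / 1_000_000
  )
}

// 1M prompt tokens (half of them cached) plus 200K output tokens.
const cost = estimateCostUsd({
  inputTokens: 1_000_000,
  cachedInputTokens: 500_000,
  outputTokens: 200_000,
})
console.log(cost.toFixed(4)) // prints "0.1200"
```

In this example output tokens account for two thirds of the total, so response length is the main cost lever at these rates.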

More models by Z.ai

Model	Context	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
zai/glm-5.2	-	0.7s	201tps	$1.40/M	$4.40/M	$0.26/M	-	06/16/2026
zai/glm-5.1	205K	0.2s	109tps	$1.30/M	$4.30/M	$0.26/M	-	04/07/2026
zai/glm-5-turbo	203K	6.4s	27tps	$1.20/M	$4.00/M	$0.24/M	-	03/15/2026
zai/glm-5	203K	0.6s	91tps	$0.80/M	$2.56/M	$0.16/M	-	02/12/2026
zai/glm-4.7-flashx	200K	-	-	$0.06/M	$0.40/M	$0.01/M	-	01/19/2026
zai/glm-4.7	205K	0.1s	987tps	$2.25/M	$2.75/M	$2.25/M	-	12/22/2025

About GLM 4.7 Flash

GLM 4.7 Flash was released January 19, 2026 as the middle tier in Z.ai's GLM-4.7 generation, sitting between the full GLM-4.7 and the ultra-fast GLM-4.7-FlashX. It inherits the 4.7 generation's gains in coding assistance, tool usage, multi-step reasoning, and natural conversational tone while trading peak capability for faster inference.

The GLM-4.7 generation focused on closing coding and tool-use gaps with competing models. GLM 4.7 Flash carries those gains forward at a cost-and-latency profile that fits high-volume coding assistance, real-time chat, and production pipelines with strict response time budgets. If the full GLM-4.7 is too slow and GLM-4.7-FlashX strips too much capability, GLM 4.7 Flash is the compromise.

Through AI Gateway, switching between GLM-4.7 tiers requires only changing the model identifier. The API surface and request format stay the same.

What To Consider When Choosing a Provider

Tier fit: GLM 4.7 Flash sits in the middle of the 4.7 generation. Test it against both GLM-4.7 (higher capability) and GLM-4.7-FlashX (higher speed) on your own tasks to find the right tradeoff.
Shared API: All GLM-4.7 variants share the same API, so you can A/B test across tiers without changing your integration.
Cost: The lower per-token cost makes GLM 4.7 Flash practical for high-volume deployments where GLM-4.7's per-request cost would be prohibitive.
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
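
Because the tiers share one API, an A/B test across them can come down to choosing a model identifier per user. The sketch below shows one possible approach: a deterministic hash assigns each user to a tier so the assignment stays stable between requests. The helper names and the even three-way split are assumptions for illustration, not a recommended experiment design.

```typescript
// Model identifiers for the three GLM-4.7 tiers named on this page.
const TIERS = ['zai/glm-4.7', 'zai/glm-4.7-flash', 'zai/glm-4.7-flashx'] as const
type TierModel = (typeof TIERS)[number]

// FNV-1a-style 32-bit hash over UTF-16 code units: small,
// dependency-free, and stable across runs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// The same user ID always maps to the same tier.
function assignTier(userId: string): TierModel {
  return TIERS[fnv1a(userId) % TIERS.length]
}

// The result goes straight into the `model` field of a request.
console.log(assignTier('user-1234'))
```

Hashing the user ID instead of picking randomly per request keeps each user on one tier, which makes quality and latency comparisons easier to attribute.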

When to Use GLM 4.7 Flash

Best For

High-volume coding assistance: Fast response times improve developer productivity across many concurrent sessions
Real-time conversational applications: The 4.7 generation's natural conversational tone, delivered within strict latency thresholds
Production API backends: High request volumes where cost per token directly impacts margins
Agentic pipelines: Most steps need good capability at speed, with the option to route complex steps to the full GLM-4.7
Interactive prototyping and development: Fast iteration cycles depend on quick model responses
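
The agentic-pipeline pattern above (fast tier by default, full GLM-4.7 for complex steps) can be sketched as a small routing function. The `Step` shape, the `complex` flag, and the step names are hypothetical; how a step gets classified as complex is left to the caller.

```typescript
type Step = { name: string; complex: boolean }

const FAST_MODEL = 'zai/glm-4.7-flash'
const FULL_MODEL = 'zai/glm-4.7'

// Hypothetical policy: the caller flags which steps are complex.
function modelForStep(step: Step): string {
  return step.complex ? FULL_MODEL : FAST_MODEL
}

// Illustrative pipeline; only the planning step goes to the full model.
const pipeline: Step[] = [
  { name: 'parse-ticket', complex: false },
  { name: 'plan-refactor', complex: true },
  { name: 'write-commit-message', complex: false },
]

for (const step of pipeline) {
  console.log(`${step.name} -> ${modelForStep(step)}`)
}
```

Keeping the routing decision in one function makes it straightforward to tighten or loosen the policy as you measure where the faster tier falls short.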

Consider Alternatives When

You need maximum capability on complex tasks: The full GLM-4.7 provides the deepest reasoning and coding quality in the 4.7 generation
You need the absolute fastest inference: GLM-4.7-FlashX offers the lowest latency in the 4.7 generation
You need vision capabilities: Evaluate GLM-4.6V or GLM-4.5V for multimodal input
You need advanced reasoning modes: GLM-5 provides multiple thinking modes and an expanded reasoning architecture

Conclusion

GLM 4.7 Flash sits between the full GLM-4.7 and GLM-4.7-FlashX: fast enough for many production latency budgets, and more capable than FlashX on heavier coding and tool-use tasks. Switch between 4.7 tiers through AI Gateway by changing the model identifier.
