Complete Gemini 3 guide covering Pro, Flash & Image models, Thinking Levels, Thought Signatures, media resolution, native image generation, and migration tactics with Python examples. Updated December 2025.
I still remember the late nights at WebSearchAPI.ai refactoring our retrieval pipeline for Gemini 1.5 Pro. We thought the 1M-token window was the endgame—until Gemini 3 landed and made reasoning control, multimodal efficiency, and agent workflows feel programmable again.
Credentials: B.Sc. Computer Science (Cambridge) • M.Sc. AI Systems (Imperial) • Google Cloud PCA • AWS SA Pro • Azure AI Engineer • CKA • TensorFlow Developer.
Gemini 3 isn't just a bigger parameter bump. Google rebuilt the stack around controllable "thinking" states, richer media ingestion, native image generation, and first-class agent tooling—all while staying compatible with the Gemini API + Vertex AI pipelines we already deployed. With the December 2025 release of Gemini 3 Flash, the family now offers frontier-class models for every workload: Pro for complex reasoning, Flash for high throughput at 3x the speed of 2.5 Pro, and Pro Image for native generation (Google, 2025).
📊 Stats Alert: The AI agents market is on track to hit $139.12B by 2033 at a 43.88% CAGR, so every percentage point of accuracy or latency savings compounds fast (MarketsandMarkets AI Agents Market, 2025).
🎯 Goal: Upgrade your Gemini stack to the 3 family without breaking existing RAG agents, while gaining Thinking Levels, Thought Signatures, media resolution tuning, native image generation, and Search grounding.
The Gemini 3 family now includes three production-ready models, each optimized for different workloads:
| Model | Model ID | Input/Output Pricing | Best For |
|---|---|---|---|
| Gemini 3 Pro | gemini-3-pro-preview | $2/$12 per 1M tokens (under 200k context); $4/$18 (over 200k) | Complex reasoning, agentic coding, multimodal analysis |
| Gemini 3 Flash | gemini-3-flash-preview | $0.50/$3 per 1M tokens | High-throughput, cost-sensitive workloads, rapid iteration |
| Gemini 3 Pro Image | gemini-3-pro-image-preview | $2 text input; $0.134 per image output | Native image generation and editing |
All models support a 1M token input context with up to 64k tokens of output, and have a knowledge cutoff of January 2025.
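If you want those pricing tiers in code, here's a minimal cost estimator derived from the table above. The PRICING dict and estimate_cost helper are illustrative, not part of any SDK, and they ignore the over-200k-context Pro tier and per-image pricing.
# Hypothetical helper derived from the pricing table above (not part of the SDK).
# Rates are USD per 1M tokens (input, output), under-200k-context tier.
PRICING = {
    "gemini-3-pro-preview": (2.00, 12.00),
    "gemini-3-flash-preview": (0.50, 3.00),
}

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Rough per-call cost; ignores the >200k Pro tier and image output pricing."""
    in_rate, out_rate = PRICING[model_id]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(f"${estimate_cost('gemini-3-flash-preview', 50_000, 2_000):.4f}")  # ~$0.0310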
Google locked the new configs behind the latest Python SDK. Anything older than 1.51.0 will throw when you pass Thinking configs, so upgrade before toggling features.
pip install -U 'google-genai>=1.51.0'
import os
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
# Choose your model based on workload
MODEL_PRO = "gemini-3-pro-preview" # Complex reasoning, agentic coding
MODEL_FLASH = "gemini-3-flash-preview" # High-throughput, cost-sensitive
MODEL_IMAGE = "gemini-3-pro-image-preview" # Native image generation
MODEL_ID = MODEL_PRO  # Default to Pro for this guide
📌 Pro Tip: Store the API key in your platform's secret manager (or Colab secret) so rotating staging vs. production keys doesn't require code changes. Flash has a free tier in the Gemini API—great for prototyping before scaling to Pro.
Gemini 3 uses dynamic thinking by default, meaning it decides when to brainstorm or dive deep. The thinking_level parameter lets you override this behavior—and Flash offers even more granular control than Pro.
| Thinking Level | Available On | Use Case |
|---|---|---|
| minimal | Flash only | Matches "no thinking" for most queries—fastest responses |
| low | Pro & Flash | Minimizes latency/cost for simple tasks like summarization |
| medium | Flash only | Balanced reasoning depth between low and high |
| high | Pro & Flash | Maximizes reasoning depth for complex analysis (default for Pro) |
💡 Expert Insight: I use minimal or low on Flash for autocomplete-like UX, then switch to high on Pro for agent subroutines (query decomposition, planner turns, compliance reviews). Flash's medium is a sweet spot for structured data extraction where you need some reasoning but not full chain-of-thought.
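To make that routing concrete, here's a tiny hypothetical dispatcher. The task labels are invented; the model/level pairings mirror the split above.
# Hypothetical router for the workload split described above (task labels invented).
def pick_model_and_level(task: str) -> tuple[str, str]:
    if task in {"autocomplete", "summarize"}:
        return MODEL_FLASH, "minimal"  # fastest path, no visible reasoning
    if task == "structured_extraction":
        return MODEL_FLASH, "medium"   # Flash-only balanced level
    return MODEL_PRO, "high"           # planner turns, compliance reviews

model, level = pick_model_and_level("structured_extraction")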
⚠️ Warning: You cannot use both thinking_level and the legacy thinking_budget in the same request. Google recommends migrating to thinking_level for more predictable performance.
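A minimal before/after sketch of that migration, assuming your 2.5-era code passed a raw token budget:
# Before (Gemini 2.5 era): raw token budget. Never combine with thinking_level.
legacy_config = types.ThinkingConfig(thinking_budget=1024)
# After (Gemini 3): preset levels only.
gemini3_config = types.ThinkingConfig(thinking_level="low")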
from IPython.display import Markdown, display
prompt = """
Find what I'm thinking of:
It moves, but doesn't walk, run, or swim.
It has no fixed shape and keeps moving when broken apart.
It has no brain but solves mazes.
"""
response = client.models.generate_content(
    model=MODEL_ID,
    contents=prompt,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_level="high",
            include_thoughts=True
        )
    )
)
for part in response.candidates[0].content.parts:
    if part.thought:
        print(f"--- THOUGHT PROCESS ({response.usage_metadata.thoughts_token_count} tokens) ---")
        display(Markdown(part.text))
    else:
        print("\n--- FINAL ANSWER ---")
        display(Markdown(part.text))
⚠️ Warning: Gemini 3 is tuned for temperature=1.0. Dropping temperature to 0.1 (my old Gemini 2.5 trick) can cause looping or over-pruning with the new thinking stack.
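If your config templates pin temperature explicitly, keep it at the tuned default rather than a low value. A minimal sketch:
# Keep temperature at the tuned default (1.0); lower values can loop with the thinking stack.
config = types.GenerateContentConfig(
    temperature=1.0,
    thinking_config=types.ThinkingConfig(thinking_level="low"),
)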
📈 Case Study: Google highlighted Opus Clip measuring a 32% speed gain plus higher precision when they let Gemini 3 handle agent tool calls end-to-end in their launch post. My replication shows similar speedups when I offload cascade planning to thinking_level="high" and only drop to low for direct responses.
media_resolution controls how many tokens Gemini allocates per image or video frame. Higher resolutions improve the model's ability to read fine text and identify small details, but increase both token usage and latency. You can set it per Part or globally via generation_config, as detailed in Google's media resolution documentation.
| Media Type | Recommended Level | Max Tokens | Notes |
|---|---|---|---|
| Images (general) | media_resolution_high | ~1120 | Best for detailed analysis |
| PDFs | media_resolution_medium | ~560 | Good balance for documents |
| Video (general) | media_resolution_low or medium | ~70/frame | Cost-effective for video understanding |
| Video (text-heavy) | media_resolution_high | ~280/frame | When reading on-screen text matters |
| Screenshots | media_resolution_low | ~70 | Sufficient for UI screenshots |
📌 New in December 2025: media_resolution_ultra_high is now available for individual parts (not globally) when you need maximum fidelity for dense technical diagrams or fine print.
import pathlib
import requests
IMG_URL = "https://storage.googleapis.com/generativeai-downloads/data/jetpack.png"
pathlib.Path("jetpack.png").write_bytes(requests.get(IMG_URL, timeout=10).content)
uploaded = client.files.upload(file="jetpack.png")
count_tokens_response = client.models.count_tokens(
    model=MODEL_ID,
    contents=[
        types.Part(
            file_data=types.FileData(
                file_uri=uploaded.uri,
                mime_type=uploaded.mime_type
            ),
            media_resolution=types.PartMediaResolution(
                level="MEDIA_RESOLUTION_HIGH"
            )
        )
    ],
)
print(
    f"High resolution token cost: {count_tokens_response.total_tokens} tokens"
)
📌 Pro Tip: Run count_tokens with all three levels (LOW, MEDIUM, HIGH) during QA. I log these to BigQuery so finance can predict the delta between standard and "inspection" runs.
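Here's a sketch of that QA sweep, reusing the uploaded file and the PartMediaResolution shape from the snippet above. Resolution is set per Part, so MEDIA_RESOLUTION_ULTRA_HIGH is valid in this loop even though it can't be set globally.
# Sweep token cost across all four per-part resolution levels.
for level in (
    "MEDIA_RESOLUTION_LOW",
    "MEDIA_RESOLUTION_MEDIUM",
    "MEDIA_RESOLUTION_HIGH",
    "MEDIA_RESOLUTION_ULTRA_HIGH",  # per-part only, per the December 2025 note
):
    resp = client.models.count_tokens(
        model=MODEL_ID,
        contents=[
            types.Part(
                file_data=types.FileData(
                    file_uri=uploaded.uri,
                    mime_type=uploaded.mime_type,
                ),
                media_resolution=types.PartMediaResolution(level=level),
            )
        ],
    )
    print(f"{level}: {resp.total_tokens} tokens")  # log to BigQuery or similar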
⚠️ Warning: If you leave media_resolution unspecified, Gemini may choose a high-cost setting for dense PDFs. Explicitly set LOW for screenshots and MEDIUM for lightweight docs to avoid surprise invoices.
Grounding wires Gemini to live Google Search results, supports every language the model speaks, and returns a structured grounding_metadata payload so you can surface citations in your UI, per Google's grounding guide.
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Who is the current Magic: The Gathering World Champion?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())]
    )
)
print(response.text)
if response.candidates[0].grounding_metadata.web_search_queries:
    print("Search queries:", response.candidates[0].grounding_metadata.web_search_queries)
📌 Pro Tip: Persist web_search_queries + grounding_chunk IDs. When a customer disputes an answer, we replay the queries to confirm Google still surfaces those sources, which keeps compliance and support teams sane.
Gemini 3 lets you enforce a response_schema and call built-in tools (Search, URL context, Code Execution) inside the same response, so JSON payloads no longer conflict with tool calls, as outlined in Google's structured output docs.
from pydantic import BaseModel
import json
class CookieRecipe(BaseModel):
    recipe_name: str
    difficulty: str
    prep_time_minutes: int
    ingredients: list[str]

response = client.models.generate_content(
    model=MODEL_ID,
    contents="Give me a miso chocolate chip cookie recipe.",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=CookieRecipe,
    ),
)
recipe = json.loads(response.text)
print(json.dumps(recipe, indent=2))
🎯 Key Takeaway: Pairing structured output with Search grounding or URL context gives you typed objects plus citations—no more brittle regex parsing after the fact.
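A minimal sketch of that pairing; the Champion schema and the prompt are invented for illustration.
# Typed JSON + Search grounding in one request (schema and prompt are illustrative).
class Champion(BaseModel):
    name: str
    year: int

grounded = client.models.generate_content(
    model=MODEL_ID,
    contents="Who is the current Magic: The Gathering World Champion? Answer as JSON.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
        response_mime_type="application/json",
        response_schema=Champion,
    ),
)
print(grounded.text)  # typed payload; grounding_metadata still carries citations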
Gemini's code execution tool still runs Python only, but it's now fully compatible with Gemini 3, which means you can ask the model to write + execute helper scripts mid-conversation before committing to an answer, as described in Google's code execution guide.
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Run Python code that counts the number of 'r' characters in 'strawberry'. Return the count.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    )
)
for part in response.candidates[0].content.parts:
    if part.executable_code:
        print("Generated Code:\n", part.executable_code.code)
    if part.code_execution_result:
        print("Execution Output:", part.code_execution_result.output)
    if part.text:
        print("Final Answer:", part.text)
⚠️ Warning: Code execution tokens count as both input and output. Capture response.usage_metadata so budgets account for the prompt, generated code, and stdout—all three are billed.
Starting with Gemini 3, Google introduced Thought Signatures—encrypted representations of the model's internal thought process. These signatures are essential for maintaining reasoning context across multi-turn conversations.
# Thought signatures are returned in the response
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Analyze this complex problem step by step...",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high")
    )
)
# The thought_signature is embedded in content parts
# Pass it back in subsequent turns to maintain the reasoning chain
for part in response.candidates[0].content.parts:
    if getattr(part, "thought_signature", None):
        # Store this for the next turn
        print("Thought signature received:", part.thought_signature[:50], "...")
⚠️ Critical Validation Rules: signatures are mandatory for function calling and image generation in multi-turn flows; replaying history without them returns HTTP 400. Return each signature exactly as received, in the same part it arrived in.
📌 Pro Tip: If you use the official Google SDKs with standard chat history management, Thought Signatures are handled automatically. Custom implementations must explicitly capture and replay these signatures.
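For custom history management, a hedged sketch: the key move is appending the model's previous Content object untouched so its thought_signature parts travel back with it.
# Custom multi-turn replay: reuse the model's Content as-is so signatures survive.
history = [
    types.Content(role="user", parts=[types.Part(text="Analyze this complex problem step by step...")]),
    response.candidates[0].content,  # carries thought_signature parts back unmodified
    types.Content(role="user", parts=[types.Part(text="Now weigh the trade-offs you found.")]),
]
follow_up = client.models.generate_content(model=MODEL_ID, contents=history)
print(follow_up.text)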
Gemini 3 Pro Image (gemini-3-pro-image-preview) introduces native image generation directly through the API—no more switching to separate image models. It supports 4K resolution output, accurate text rendering, and conversational editing.
from google.genai import types

# Generate an image
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents="Generate a photorealistic image of a robot barista making latte art in a cozy coffee shop",
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE", "TEXT"]
    )
)
# Access the generated image
for part in response.candidates[0].content.parts:
    if part.inline_data:
        # Save or display the image
        with open("generated_image.png", "wb") as f:
            f.write(part.inline_data.data)
🎯 Key Capabilities: 4K resolution output, accurate in-image text rendering, and conversational editing across turns.
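Conversational editing is just another turn: send the generated image back with an instruction. A sketch reusing the file saved above:
# Conversational edit: resend the saved image with an edit instruction.
with open("generated_image.png", "rb") as f:
    image_bytes = f.read()
edit_response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        types.Part(text="Keep the scene, but redraw the latte art as a fern leaf."),
    ],
    config=types.GenerateContentConfig(response_modalities=["IMAGE", "TEXT"]),
)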
thinking_level="minimal" for maximum throughput.count_tokens at all four resolutions (low, medium, high, ultra_high), then pin high only for rows flagged as contracts/financial statements in metadata.thoughts_token_count, thought_signature presence, media_resolution choices, and web_search_queries into the same trace so you can correlate cost regressions with reasoning depth.medium thinking level often matches Pro's low quality at lower cost.| Change | Why it matters | Action |
|---|---|---|
| thinking_budget → thinking_level | Gemini 3 enforces preset levels instead of raw budget integers. Flash adds minimal and medium options. Cannot use both params together (docs). | Map existing tiers: low (autocomplete), medium (Flash balanced), high (planner). |
| Temperature tuning | Gemini 3 default is 1.0 and handles randomness internally. Lower values may cause looping or degraded performance (docs). | Remove sub-0.5 temperatures unless you measured regressions. |
| Media resolution levels | Four levels now available: low, medium, high, ultra_high. Set per part or globally (docs). | Explicitly set level per asset type—see recommendations table above. |
| Thought Signatures | Mandatory for function calling and image generation in multi-turn flows. HTTP 400 if missing (docs). | Capture and replay thought_signature from response parts, or use official SDKs. |
| Search grounding pricing | Billing changes Jan 5, 2026: now $14 per 1k search queries (was $35/1k prompts flat rate). | Update cost projections; consider caching search results where appropriate. |
| New model selection | Flash outperforms 2.5 Pro at 3x speed. Pro Image enables native image generation. Deep Think for complex reasoning. | Evaluate Flash for high-throughput workloads; add Pro Image for generation tasks. |
| Maps grounding gap | Google Maps grounding and Computer Use still not supported on Gemini 3 (docs). | Keep 2.5 Flash for map-rich tasks and computer use until support is added. |
- Stream the usage_metadata field (tokens, thoughts, grounding counts) to the same trace ID you use for customer actions.
- Watch the thoughts_token_count / total_tokens ratio; spikes usually mean a planner stuck in high thinking for too long.
- Re-run count_tokens nightly for representative documents to reconfirm resolution deltas.
- When a request uses media_resolution_high or invokes code execution, attach a cost label so finance can apportion spend to the right product team.

⭐ Key Takeaway: Gemini 3's Thinking Levels and media controls are powerful, but they still need a trustworthy search layer. Pairing Gemini with WebSearchAPI.ai keeps your agents grounded in Google-grade results while you experiment with thoughts, tools, and structured JSON. For alternatives to Google's grounding capabilities, explore our grounding Google Search alternatives guide.
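The thinking-ratio alert from the checklist above, as a hedged sketch; the 0.6 threshold is an invented baseline, so calibrate it against your own traces.
# Thinking-ratio alert (threshold illustrative); works on any generate_content response.
def thinking_ratio(usage) -> float:
    thoughts = usage.thoughts_token_count or 0
    total = usage.total_token_count or 1
    return thoughts / total

ratio = thinking_ratio(response.usage_metadata)
if ratio > 0.6:
    print(f"ALERT: {ratio:.0%} of tokens spent thinking; planner stuck in high?")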
Ready to see it in action? Start building with WebSearchAPI.ai and get Google-grade results in minutes.
Which Gemini 3 models are available right now?
Three models are production-ready: gemini-3-pro-preview ($2/$12 per 1M tokens), gemini-3-flash-preview ($0.50/$3), and gemini-3-pro-image-preview for native image generation. Flash has a free tier in the Gemini API; Pro does not, per Google's pricing docs.
What's the difference between Gemini 3 Pro and Flash?
Flash is 3x faster than 2.5 Pro while matching its quality on most benchmarks. It's ideal for high-throughput, cost-sensitive workloads. Pro excels at complex reasoning and agentic coding. Flash also offers additional thinking levels (minimal, medium) for finer control, per Google's Flash announcement.
What are Thought Signatures and do I need them?
Thought Signatures are encrypted representations of Gemini's internal reasoning. They're mandatory for function calling and image generation in multi-turn flows—omitting them returns HTTP 400. The official SDKs handle this automatically; custom implementations must explicitly capture and replay them, per Google's docs.
What do Thinking Levels actually change?
They control reasoning depth. Pro supports low and high; Flash adds minimal (fastest, no thinking) and medium (balanced). The legacy thinking_budget still works but cannot be used alongside thinking_level, per Google's guidance.
How does media resolution impact billing?
Higher resolutions dramatically increase vision tokens. Use low (~70 tokens) for screenshots, medium (~560) for PDFs, high (~1120) for detailed images, and ultra_high for dense technical diagrams. If unspecified, Gemini picks optimal defaults based on media type, per Google's docs.
Does Google Search grounding support multilingual prompts?
Yes—grounding works across every language the model handles and always returns citation metadata you can log, according to Google's grounding guide.
Can I force JSON while still calling tools?
Gemini 3 explicitly supports Structured Output with Search, URL context, and Code Execution in the same response, so you can keep typed payloads while tooling, as confirmed in Google's structured output docs.
Is code execution limited to Python?
Yes. Gemini can write other languages, but only Python runs inside the managed sandbox today. December 2025 added support for code execution with images (return images from executed code), per Google's code execution guide.
When should I keep Gemini 2.5 around?
If you rely on Google Maps grounding or Computer Use—both are still unsupported on Gemini 3 as of December 2025. Keep 2.5 Flash for these specific use cases until Google adds support.
How do you monitor cost regressions?
Stream usage_metadata, web_search_queries, thought_signature presence, and media_resolution choices into your telemetry platform. Alert when token-per-request deltas exceed set baselines. Note that Search grounding billing changes to $14/1k queries on Jan 5, 2026.