LangChain and Claude/Gemini APIs - Deep Dive
Tool use protocol, prompt caching math, RAG architecture, agent loops, structured outputs, cost engineering, and the failure modes nobody mentions.
This is the operator's view of building with LLMs. Less framework worship, more "what actually broke at 3 am."
The Messages API mental model
Anthropic, Gemini, and OpenAI all converged on roughly the same shape: a list of messages with roles (user, assistant, sometimes system, sometimes tool). The model takes the list, returns the next message.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=1024,
    system="You are a compliance assistant.",
    messages=[
        {"role": "user", "content": "Summarize this PDF: ..."},
    ],
)
print(response.content[0].text)

Multi-turn: append the assistant response to messages, append a new user message, call again. The provider does not store conversation state. You do.
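Since the provider is stateless, the caller owns the transcript. A minimal sketch of that bookkeeping, with no network call. `add_turn` is a hypothetical helper name, and the assistant text stands in for whatever `response.content[0].text` returned:

```python
def add_turn(messages, assistant_text, next_user_text):
    """Return a new list with the assistant reply and the next user message appended."""
    return messages + [
        {"role": "assistant", "content": assistant_text},
        {"role": "user", "content": next_user_text},
    ]


history = [{"role": "user", "content": "Summarize this PDF: ..."}]
# The assistant text below is a placeholder for a real model reply.
history = add_turn(history, "Here is the summary ...", "Now list the key risks.")
# `history` is what you would pass as messages= on the next call.
```

Returning a new list instead of mutating in place makes it easier to retry a failed call with the previous transcript intact.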
Tool use in detail
Tool use (function calling) lets the model decide to invoke a function instead of responding. You define tools with JSON schema, the model returns a tool_use content block with the tool name and arguments, you execute, you send back a tool_result block, the model continues.
tools = [{
    "name": "get_invoice",
    "description": "Fetch invoice by ID",
    "input_schema": {
        "type": "object",
        "properties": {"id": {"type": "string"}},
        "required": ["id"],
    },
}]
response = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Show me invoice INV-42"}],
)
# response.content might be: [TextBlock("I'll look that up."), ToolUseBlock(name="get_invoice", input={"id": "INV-42"}, id="toolu_abc")]

The loop:
messages = [{"role": "user", "content": query}]
for _ in range(10):  # hard cap on iterations
    resp = client.messages.create(
        model=...,
        max_tokens=1024,  # required by the Messages API
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break
    # Run every requested tool and send all results back in one user message.
    tool_results = []
    for block in resp.content:
        if block.type == "tool_use":
            # execute_tool is your own dispatcher; it should return a string.
            result = execute_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "user", "content": tool_results})

Three rules:
- Always cap iterations. Models can loop.
- Always validate tool inputs against your schema before executing. Models can hallucinate types.
- Tool results should be informative but compact. Long results bloat context and cost.
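Rules two and three can be sketched as a small dispatcher. This is an illustration under stated assumptions, not a full JSON Schema validator: it checks only required keys and top-level types, and it rejects unknown fields, which is stricter than JSON Schema's default. `TOOLS`, `validate_input`, and the stubbed `get_invoice` are hypothetical names, not part of any SDK.

```python
import json


def get_invoice(id: str) -> dict:
    # Stand-in for a real database or API lookup.
    return {"id": id, "total": 120.0, "status": "paid"}


# Hypothetical registry: tool name -> (input schema, handler).
TOOLS = {
    "get_invoice": (
        {"type": "object", "properties": {"id": {"type": "string"}}, "required": ["id"]},
        get_invoice,
    ),
}

JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
              "boolean": bool, "object": dict, "array": list}


def validate_input(schema: dict, data: dict) -> list:
    """Deliberately small check: required keys and top-level types only."""
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, value in data.items():
        prop = schema.get("properties", {}).get(key)
        if prop is None:
            errors.append(f"unexpected field: {key}")
            continue
        expected = JSON_TYPES.get(prop.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{key} should be {prop['type']}")
    return errors


def execute_tool(name: str, tool_input: dict, max_chars: int = 2000) -> str:
    """Validate, run, and return a compact string for the tool_result block."""
    if name not in TOOLS:
        return f"error: unknown tool {name}"
    schema, handler = TOOLS[name]
    errors = validate_input(schema, tool_input)
    if errors:
        return "error: " + "; ".join(errors)
    # Truncate so one large result cannot blow up the context window.
    return json.dumps(handler(**tool_input))[:max_chars]
```

Returning the validation error as the tool result, rather than raising, gives the model a chance to correct its arguments on the next iteration of the loop.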
Prompt caching, the cost lever
Anthropic caches stable prefixes of your prompt for 5 minutes by default, and each cache hit refreshes that window. Cached reads cost 10 percent of the normal input price. The first call pays a one-time cache write at 25 percent above the normal input price.
response = client.messages.create(
    model="claude-sonnet-4-7",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a compliance assistant.", "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": LONG_DOCS, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": query}],
)

If you call this with the same system prompt 100 times in 5 minutes, calls 2-100 hit the cache. For a 50K-token system prompt at $3 per million input tokens:
- No cache: 100 * 50K * $3/M = $15.00
- With cache: 1 * 50K * $3.75/M (write) + 99 * 50K * $0.30/M (reads) = $0.1875 + $1.4850 = $1.6725, about $1.67
About 89 percent savings on input cost. The compliance analyzer at Binocs was unaffordable without this.
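The same arithmetic as a function, so you can plug in your own prefix size and call volume. A sketch that assumes every call after the first lands inside the cache window; `prompt_cost` is a hypothetical name, and the multipliers are the 25 percent write premium and 10 percent read price quoted above.

```python
def prompt_cost(calls, prefix_tokens, base_per_mtok, cached=True,
                write_multiplier=1.25, read_multiplier=0.10):
    """Input cost in dollars for a repeated prompt prefix."""
    per_token = base_per_mtok / 1_000_000
    if not cached:
        return calls * prefix_tokens * per_token
    write = prefix_tokens * per_token * write_multiplier  # first call writes the cache
    reads = (calls - 1) * prefix_tokens * per_token * read_multiplier
    return write + reads


no_cache = prompt_cost(100, 50_000, 3.00, cached=False)  # about 15.00
with_cache = prompt_cost(100, 50_000, 3.00)              # about 1.67
savings = 1 - with_cache / no_cache                      # about 0.89
```

The savings shrink as the call count drops: with only two calls in the window, the write premium eats a visible share of the benefit.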
Structured outputs
LLMs return text. Production needs JSON. Three patterns:
- Tool use with no actual tool: define a "respond_with" tool whose schema is your output schema, force the model to call it.
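- Native structured outputs (newer): Anthropic and other providers let you specify a JSON schema directly in the request. The parameter name varies by provider and API version (OpenAI calls it response_format), so check the current docs.
- Prompt-engineered: "respond only with valid JSON matching this schema...". Always validate, always retry on parse failure.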
For Pydantic models:
from pydantic import BaseModel

class InvoiceAnalysis(BaseModel):
    risk_score: int
    flags: list[str]
    summary: str

# Use the tool-use pattern: the tool's input schema is your output schema.
tool = {
    "name": "analyze_invoice",
    "description": "Analyze invoice and return structured results",
    "input_schema": InvoiceAnalysis.model_json_schema(),
}

# Force the model to use only this tool
resp = client.messages.create(
    model=...,
    max_tokens=1024,  # required by the Messages API
    tools=[tool],
    tool_choice={"type": "tool", "name": "analyze_invoice"},
    messages=[...],
)
# Pick out the tool_use block rather than assuming it is first.
tool_block = next(b for b in resp.content if b.type == "tool_use")
result = InvoiceAnalysis.model_validate(tool_block.input)

Pydantic catches schema violations. On failure, retry with the validation error in the prompt: "Your previous response failed validation: X. Please try again."
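The retry-on-failure step might look like the following. It is a sketch using only the standard library so it runs without an API key: `call_model` is a hypothetical callable that takes a prompt and returns raw text, and `validate_analysis` is a stand-in for `InvoiceAnalysis.model_validate`. To my knowledge Pydantic v2's `ValidationError` subclasses `ValueError`, so the same `except` clause should cover it, but confirm that against the version you run.

```python
import json


def validate_analysis(data):
    """Stand-in validator; a Pydantic model would replace this."""
    if not isinstance(data, dict) or not isinstance(data.get("risk_score"), int):
        raise ValueError("risk_score must be an integer")
    return data


def parse_with_retry(call_model, validate, prompt, max_attempts=3):
    """Call the model, validate the output, and feed errors back on failure."""
    attempt_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        try:
            # json.JSONDecodeError is a ValueError, so parse failures land here too.
            return validate(json.loads(raw))
        except (ValueError, TypeError) as exc:
            last_error = exc
            attempt_prompt = (
                f"{prompt}\n\nYour previous response failed validation: {exc}. "
                "Please try again."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Cap the attempts. A model that fails validation three times in a row usually has a prompt or schema problem that a fourth try will not fix.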
RAG architecture
Retrieval-Augmented Generation: retrieve relevant docs, stuff into context, ask the model.
Components:
- Chunking: split docs into 500-1500 token chunks with overlap. Semantic chunking (split on headings) beats fixed-size.
- Embedding: turn chunks into vectors. Use voyage-3 or text-embedding-3-large for English; nomic-embed if cost matters most.
- Vector store: Pinecone, Qdrant, Weaviate, or pgvector if you already have Postgres.
- Retrieval: similarity search returns top K (typically K=20).
- Rerank: a smaller cross-encoder model (Cohere Rerank, BAAI/bge-reranker) re-scores the top K with the query. Keep top 5.
- Generation: stuff the 5 chunks into the system prompt or user message, ask the LLM.
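The chunking and retrieval steps above, reduced to a toy that runs with the standard library. Everything here is a simplification: chunks are word windows rather than token or heading based, `embed` is a bag-of-words stand-in for a real embedding model, and the list scan replaces a vector store. The rerank step is omitted; it would re-score whatever `retrieve` returns.

```python
import math
from collections import Counter


def chunk_words(text, size=200, overlap=40):
    """Fixed-size word windows with overlap between neighbours."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def embed(text):
    """Toy bag-of-words vector. A real system calls an embedding model here."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=20):
    """Return the k (chunk_id, chunk, score) tuples most similar to the query."""
    q = embed(query)
    scored = [(i, c, cosine(q, embed(c))) for i, c in enumerate(chunks)]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]
```

Carrying the chunk ID through retrieval is the cheap part of citation support: the ID is already there when you build the prompt.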
Common failures:
- Bad chunking destroys semantic coherence. Test with real queries.
- No rerank. Vector search alone is noisy. Rerank lifts quality dramatically.
- No citation. Always ask the model to cite the chunk ID it used; lets you trace hallucinations.
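One way to wire up the citation rule: label each chunk with its ID in the prompt, then check which IDs the answer actually cites. A sketch with hypothetical helper names; the prompt wording is illustrative, not a tested template.

```python
import re


def build_grounded_prompt(question, chunks):
    """Format (chunk_id, text) pairs with visible IDs and ask for citations by ID."""
    context = "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in chunks)
    return (
        "Answer using only the sources below. After each claim, cite the "
        "source ID in square brackets. If the sources do not contain the "
        "answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


def cited_ids(answer, known_ids):
    """Split cited IDs into ones that exist and ones the model made up."""
    found = set(re.findall(r"\[([\w-]+)\]", answer))
    known = set(known_ids)
    return found & known, found - known
```

An ID in the second set is a cheap hallucination signal: the model cited a source you never gave it.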
When LangChain earns its weight
LangChain adds value when:
- You want to swap models (Anthropic to Gemini) without rewriting prompts.
- Your pipeline has 5+ steps and you benefit from declarative chains.
- You use LangSmith for tracing; the integration is excellent.
- You want pre-built retrievers (Parent Document Retriever, Multi-Vector, Self-Query) instead of implementing yourself.
It costs you:
- 4-5 layers of abstraction for simple cases.
- Frequent API churn (LCEL, the new Runnable interface, every 6 months).
- Hard-to-debug stack traces when a chain step fails.
Rule of thumb: if your spec fits in one sentence ("send prompt, get response"), skip it. If it is "agent with 8 tools, RAG over 4 sources, with conditional routing," it earns its keep.
Cost engineering
Order of operations to reduce cost:
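- Use a smaller model where you can. Haiku for triage, Sonnet for reasoning, Opus for the hardest tasks.
- Cache aggressively. System prompts, RAG context, tool definitions.
- Stream responses and cancel generation when the user stops reading. Streaming alone does not cut cost; stopping early does, because you stop paying for output tokens nobody displays.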
- Truncate context to what is actually needed. Long history is expensive.
- Use batch API for offline jobs (Anthropic batch is 50 percent cheaper).
For Gemini: Flash is cheap and fast (good for classifiers and summarizers), Pro for reasoning. Free tier has generous limits for prototyping.
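For the "truncate context" step, a minimal sketch that keeps the newest messages within a token budget. Two assumptions are baked in: message content is a plain string, and the default `count_tokens` (four characters per token) is a rough heuristic, not a tokenizer. Swap in a real token counter for anything billing-sensitive.

```python
def trim_history(messages, budget, count_tokens=lambda text: len(text) // 4):
    """Keep the most recent messages whose estimated token total fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        # Always keep the newest message, even if it alone exceeds the budget.
        if kept and used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop leading assistant turns so the list still starts with a user message.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Dropping old turns is the blunt version. Summarizing them into one short message keeps more signal, at the price of an extra model call.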
Latency engineering
User-facing apps need first-token latency under 500 ms. Tactics:
- Stream. Always. SSE for HTTP, WebSockets if you need bidirectional.
- Pre-warm: keep a long-lived client, reuse the HTTP connection.
- Reduce prompt length. Long prompts mean long time-to-first-token.
- Use smaller models for first draft, big model for refinement (cascading).
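You cannot tune first-token latency without measuring it. A small wrapper that times any iterator of text chunks, sketched with a fake stream so it runs offline. `timed_stream` is a hypothetical helper; the caller records `start` before sending the request so that connection and queueing time are included.

```python
import time


def timed_stream(chunks, stats, start, clock=time.perf_counter):
    """Yield chunks unchanged while recording time-to-first-token and total time."""
    seen_first = False
    for chunk in chunks:
        if not seen_first:
            stats["ttft_s"] = clock() - start
            seen_first = True
        yield chunk
    stats["total_s"] = clock() - start


def fake_stream():
    # Stand-in for a provider's streaming iterator.
    for word in ["Hello", " ", "world"]:
        yield word


stats = {}
start = time.perf_counter()
text = "".join(timed_stream(fake_stream(), stats, start))
```

With the Anthropic Python SDK the chunks would typically come from `stream.text_stream` inside a `with client.messages.stream(...) as stream:` block; treat that wiring as an assumption to check against the SDK version you run.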
Observability
Track:
- Tokens in/out per request.
- Latency to first token and total.
- Tool use counts and durations.
- User feedback (thumbs up/down).
- Cost per request, rolled up by feature.
Tools: LangSmith, Helicone, OpenLLMetry, or roll your own with structured logging. The numbers will surprise you.
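If you roll your own, one JSON log line per request covers most of the list above. A sketch: `log_request` and its field names are hypothetical, and prices are passed in as parameters because they change; the figures in the usage line are placeholders, not a quote of any provider's price list.

```python
import json
import logging

logger = logging.getLogger("llm")


def log_request(feature, model, input_tokens, output_tokens, latency_s,
                input_per_mtok, output_per_mtok, tool_calls=0):
    """Emit one JSON log line per request and return the record."""
    cost = (input_tokens * input_per_mtok + output_tokens * output_per_mtok) / 1_000_000
    record = {
        "feature": feature,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "tool_calls": tool_calls,
        "latency_s": round(latency_s, 3),
        "cost_usd": round(cost, 6),
    }
    logger.info(json.dumps(record))
    return record


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
record = log_request("triage", "example-model", 2000, 500, 1.2, 3.0, 15.0)
```

Tagging each record with a feature name is what makes the "rolled up by feature" view a simple group-by later.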
Failure modes
- Rate limits: backoff with jitter, queue, route to a fallback provider.
- Context window exhaustion: implement context trimming (summarize old turns).
- Hallucination: ground in retrieved facts, ask the model to cite sources, validate critical facts against a DB.
- Prompt injection: never interpolate user-controlled text into the system prompt unchecked. Keep it in the user turn, delimit it clearly, and sanitize or escape anything that must be embedded.
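The rate-limit item, backoff with jitter, can be sketched as a retry wrapper. `RateLimited` is a stand-in for whatever exception your provider's SDK raises on an HTTP 429, and `sleep` and `rng` are injectable so the function can be tested without waiting.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a provider's rate-limit error (for example, an HTTP 429)."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep, rng=random.random):
    """Retry `call` on RateLimited using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller queue or fail over
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform in [0, cap)
```

The jitter matters more than the exponent: without it, every client that hit the limit together retries together.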
Learn more
- Docs: Anthropic API Documentation (Anthropic)
- Docs: Anthropic: Prompt Caching (Anthropic)
- Docs: Anthropic: Tool Use (Anthropic)
- Docs: Gemini API Documentation (Google)
- Docs: LangChain Documentation (LangChain)
- Article: Simon Willison: Things I've learned about LLMs (Simon Willison)