Binocs - 99% LLM cost cut on deck pipeline
Full rewrite of the Binocs CIM-to-deck pipeline. How I split the work between deterministic parsers and the LLM, the layout classifier, the eval set, and why this is the pattern for every LLM-heavy pipeline.
The problem
The Binocs core product turns a Confidential Information Memorandum (CIM) into an editable structured slide deck. A CIM is what an investment banker sends to PE buyers when they put a company up for sale. It is 80 to 200 pages of company history, financials, market analysis, customer logos, growth charts, and forward projections. The buyer's analyst spends days reading it. The Binocs product reads it in minutes and outputs slides the associate can edit and put in front of a partner.
The pipeline I inherited worked. It was just expensive and slow.
The state I inherited
The original pipeline was one function. For each page, send the page image and the extracted text to GPT-4 with a prompt that said "this is page N of a CIM, extract the layout, identify the headings and bullets and tables and charts, summarize the page into a slide". The output was a JSON blob with slide structure and content.
Cost per CIM - around 12 USD. At a few hundred CIMs a month, that was thousands of dollars and growing. Latency per CIM - 8 to 15 minutes. Quality - good enough that customers were paying for it, inconsistent enough that the ops team was hand-editing every output before delivery.
The pipeline had three real problems.
- The LLM was doing layout work, which it is bad at and which is expensive in tokens.
- The LLM was doing extraction work, which a parser does for free.
- The output was non-deterministic, so the same CIM ran twice gave different slide structures.
The principle
The principle I kept coming back to - if a problem can be solved by deterministic code, solve it by deterministic code. The LLM is the most expensive tool in the toolbox, save it for the parts that need judgment.
In a CIM-to-deck pipeline, the parts that need judgment are - what is the headline of this slide, what are the 2-3 most important bullets, what does this table actually mean for the buyer. The parts that do not need judgment are - where is the heading on the page, what is the body text, what is in the table, what is the caption of this chart. The first set is LLM work. The second set is parser work.
The old pipeline gave both sets to the LLM. The new pipeline gives only the first set.
Pass one - deterministic extraction
The first pass uses pdfminer.six to walk the PDF and pull every text block with its bounding box, font, font size, and weight. Output is a flat list of blocks per page.
From the flat list, a layout classifier turns blocks into semantic elements - heading, subheading, body, bullet, caption, footer, page-number, table-cell. The classifier is pure rules, not ML. Rules I use -
- Font size > median + 4pt AND bold AND vertical position in top third = heading.
- Font size > median + 2pt AND bold = subheading.
- Body font size AND starts with bullet character or numbered list pattern = bullet.
- Position in bottom 5 percent of page AND font size <= median = footer.
- Small font + adjacent to image bounding box = caption.
For tables I use Camelot's lattice mode when the PDF has table borders, and a custom column-detection algorithm based on x-coordinate clustering when it does not. Both are deterministic. Both return the table as a 2D array of cells.
For images and charts I extract the image, then look for the surrounding text - title above, caption below, footnote in italics. Charts that are rendered as text (lines and labels embedded in the PDF, not raster images) get re-rendered as proper structured data using the underlying coordinates.
The output of pass one is a JSON document per CIM, structured as pages, each page has elements with type and content, plus relationships (this caption belongs to this image, this bullet belongs to this heading). At this point we have done zero LLM calls and have a clean machine-readable representation of the document.
Pass two - the LLM writes
Now the LLM gets called. For each slide we want to produce, we send the structured content of the source page (or pages) plus a prompt that says "given this content, write a slide title and 2-3 bullets in the Binocs voice".
The prompt is small. The input tokens are small (we are sending cleaned text, not page images, not raw PDF blocks). The output tokens are small (a title and 3 bullets). The LLM is doing what LLMs are good at - writing.
We use Claude for this because Claude is good at structured writing and stays in voice. We use structured outputs (JSON mode) so we never have to parse free-text responses. The prompt is cached - the system prompt and the voice guidelines do not change, so we pay full price once and cached price after.
The eval set
The biggest unlock for the migration was the eval set. Before I touched anything, I built an eval set of 30 CIMs across industries with hand-graded outputs - what a perfect slide deck would look like for each. Every change to the pipeline ran against the eval and scored on structural correctness (right number of slides, right headings) and qualitative correctness (right bullets, right voice).
The eval let me move fast without breaking quality. When I moved table extraction from LLM to Camelot, the eval told me table extraction quality went up 14 percent. When I changed the heading classification rule, the eval told me one specific CIM regressed and I fixed it. Without the eval, I would have been guessing.
The numbers
- Cost per CIM - 12.40 USD -> 0.09 USD (99.3 percent reduction).
- Latency per CIM - 11 minutes median -> 90 seconds median.
- Quality - structural correctness on eval set went from 78 percent to 96 percent. Voice quality stayed flat.
- Ops hand-editing time - 45 minutes per CIM -> 6 minutes per CIM.
The cost number is the headline but the latency and quality numbers are the reason ops loved the migration. Fast and consistent matters as much as cheap.
The patterns that generalize
Three patterns from this work that I now use on every LLM pipeline.
Pattern 1 - LLM as a writer, not a parser
If you are asking the LLM to extract structured data from unstructured input, you are paying LLM prices for parser work. Build the parser. Even an imperfect parser plus an LLM cleanup pass is cheaper than an LLM doing both.
Pattern 2 - Eval before optimize
Before you change anything, build an eval. The eval is your safety net. Without it, you cannot tell if a change is an improvement or a regression. With it, you can ship changes weekly.
Pattern 3 - Cache aggressively
Prompt caching on the static parts of the prompt (system instructions, voice guides, schema) is a 90 percent cost reduction on those tokens. It is free, it just requires you to put the static parts at the front of the prompt and the dynamic parts at the end.
What I would do differently
I would have built the eval set first, not third. I spent the first week porting the pipeline incrementally and using "vibes" to judge quality. Once I built the eval set in week two, every decision got faster and better. The eval was the unlock.
I would also have profiled token usage per call earlier. I assumed the LLM was the cost driver, which was correct, but I did not know which prompts were the worst offenders until I built a per-prompt cost dashboard. The deck-summary prompt was 60 percent of cost, the chart-interpretation prompt was 5 percent. Knowing that, I optimized the right one first.
What this taught me
LLMs are amazing at the things they are amazing at. They are bad at the things they are bad at. The skill is not "use the LLM for everything" or "avoid the LLM". The skill is knowing which parts of a problem are LLM-shaped and which are not, then routing accordingly.
Most teams treat the LLM as a hammer because the demo was a hammer. The senior version is to treat the LLM as a scalpel - use it precisely, in the spot it cuts cleanly, and let regular code do the rest.
Learn more
- DocsAnthropic prompt engineering docsAnthropic
- Docspdfminer.six docspdfminer.six
- DocsCamelot - PDF table extractioncamelot-py
- DocsOpenAI cookbookOpenAI
- ArticleDesigning Data-Intensive Applications - Martin Kleppmanndataintensive.net