
Shipping at the Speed of Reasoning

/ 16 min read /
#ai #agent

Original: https://steipete.me/posts/2025/shipping-at-inference-speed Translation: ChatGPT 5.2

What has changed since May

The progress in “vibe coding” this year has been incredible. Around May I was still surprised that some prompts could generate code that worked out of the box, and now that’s my baseline expectation. I’m shipping code at speeds that feel unreal. Since then I’ve burned through a lot of tokens. Time for an update.

It’s interesting how these agents work. A few weeks ago someone argued that you have to write code yourself to develop a feel for bad architecture, and that using agents creates a disconnect. I disagree completely. When you’ve spent enough time with an agent, you develop a very clear sense of how long something should take; when Codex comes back and doesn’t get it right on the first try, I’m already suspicious.

The amount of software I can now build is mostly limited by inference time and deep thinking. But honestly—most software doesn’t need deep thinking. Most apps just move data from one form to another, maybe store it somewhere, and then present it to the user somehow. The simplest form is text, so by default, whatever I want to build starts as a CLI. The agent can call it directly and verify the output—closing the loop.
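To make that closed loop concrete, here is a minimal sketch of what an agent-verifiable CLI can look like, in Go. The task type and all names are mine, purely illustrative: structured input in, plain text out, non-zero exit on bad input, so an agent can run the tool and check the result directly.

```go
// Hypothetical sketch of an agent-verifiable CLI: JSON in on stdin,
// plain text out on stdout, non-zero exit on bad input. An agent can
// run it and diff the output against what it expects.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

type Task struct {
	Name string `json:"name"`
	Done bool   `json:"done"`
}

// render turns tasks into the plain-text form an agent can verify.
func render(tasks []Task) string {
	var b strings.Builder
	for _, t := range tasks {
		status := "open"
		if t.Done {
			status = "done"
		}
		fmt.Fprintf(&b, "%s\t%s\n", t.Name, status)
	}
	return b.String()
}

func main() {
	var tasks []Task
	if err := json.NewDecoder(os.Stdin).Decode(&tasks); err != nil {
		fmt.Fprintln(os.Stderr, "invalid input:", err)
		os.Exit(1) // a failing exit code is part of the contract
	}
	fmt.Print(render(tasks))
}
```

Text in, text out, and an honest exit code: that is all the loop needs before any UI exists.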

Model shift

What really unlocked building like a factory was GPT-5. It took me a few weeks after its release to realize that—also waiting a bit for codex to catch up functionally to Claude Code, spending time learning and understanding the differences; but then I started trusting the model more and more. I hardly read code these days. I watch the streaming output, sometimes peek at critical parts, but honestly—I don’t read most of it. I do know where components live, how things are structured, and how the whole system is designed, and that’s usually enough.

The important decisions these days are language/ecosystem and dependencies. My go-to languages: TypeScript for web, Go for CLIs, Swift if I need macOS stuff or a UI. I wasn’t even considering Go a few months ago, but then I poked around and found that agents are particularly good at writing it, and its simple type system makes linting fast.

For those working on Mac or iOS stuff: you barely need Xcode anymore. I don’t even use xcodeproj files. Swift’s build infrastructure is good enough for most things now. codex knows how to run iOS apps and how to interact with the simulator. No special stuff or MCP needed.

Codex vs. Opus

As I’m writing this, codex is in the middle of a massive, multi-hour refactor that’s cleaning up the worst historical baggage from the earlier days of Opus 4.0. People on Twitter often ask me what the biggest difference between Opus and codex is, and why it matters when benchmark scores are so close. For me, benchmarks are becoming harder to trust—you have to try both to really understand. Whatever OpenAI did in post-training, codex is trained to read a lot of code before it starts writing.

Sometimes it silently reads files for 10, 15 minutes before starting to write any code. Annoying on one hand, amazing on the other, because it drastically increases the chances of fixing the right thing. Opus, in contrast, is much more proactive—great for small changes—but less suitable for larger features or refactors; it often doesn’t read the entire file, or misses parts, and then returns something incomplete or inefficient. I’ve noticed that even when codex is sometimes 4x slower than Opus on similar tasks, I end up being faster overall, because I don’t have to go back and fix the “fix” itself, which felt pretty normal when I was still using Claude Code.

Codex also let me drop a lot of the “theater” I had to do with Claude Code. I no longer use “plan mode”; instead, I just start a conversation with the model: ask a question, have it Google, explore code, make a plan together; when I’m happy with what I see, I write “build” or “write plan to docs/*.md and build this”. Plan mode felt like a workaround—something necessary for earlier models that weren’t great at following prompts, so we had to take away their editing tools. There’s also a terribly misunderstood tweet of mine that’s still circulating, which made me realize most people don’t understand plan mode is not magic.

Oracle

The jump from GPT-5/5.1 to 5.2 is huge. About a month ago I built oracle 🧿, a CLI that lets the agent run GPT-5 Pro: upload files plus a prompt, and manage sessions so answers can be retrieved later. I did this because many times when the agent got stuck, I’d have it write everything into a markdown file, then query Pro myself, which felt like a repetitive waste of time, and an opportunity to close the loop. Usage instructions are in my global AGENTS.md file, and the model sometimes triggers oracle itself when stuck. I use it many times a day. It’s a huge unlock. Pro is good at speed-scanning about 50 websites, then thinking very hard, and getting the answer right almost every time. Sometimes it’s fast, just 10 minutes, but I’ve had runs that took over an hour.

Now that GPT-5.2 is out, I need it much less. I still use Pro myself for research sometimes, but the number of times I have the model “go ask the oracle” went from multiple times a day to a few times a week. I’m not mad about it—building oracle was super fun, I learned a lot about browser automation and Windows, and finally got around to working on skills, which I’d been putting off for quite a while. It does show how much 5.2 has improved on many real-world programming tasks. It gets things right on the first try for almost anything I throw at it.

Another huge advantage is the knowledge cutoff. GPT-5.2 goes until the end of August, while Opus is stuck in mid-March—about a five-month gap. That matters a lot when you want to use the newest tools available.

A concrete example: VibeTunnel

Let me give you another example of how far the model has come. One of my early deep-dive projects was VibeTunnel. It’s a terminal multiplexer that lets you code from anywhere. Earlier this year I spent almost all my time on it, got it good enough after two months that I was coding on my phone while out with friends… then I decided I should stop, more for mental health reasons. At that point I’d tried to rewrite a core part of the multiplexer out of TypeScript, but older models couldn’t handle it. I tried Rust, Go… dear lord, even Zig. Sure, I could have done the refactor by hand, but it was a lot of manual work, so it never got done before I shelved it. Last week I dusted it off, gave codex a two-sentence prompt to convert the entire forwarding system to Zig, it ran for over 5 hours with multiple compaction cycles, and delivered a working conversion in one shot.

Why did I dust it off? My current focus is Clawdis, an AI assistant with full access to everything on all my computers, plus access to messages, email, home automation, cameras, lights, music, hell, it can even control my bed temperature. Of course it also has its own voice, a CLI for tweeting, and its own Twitter account.

Clawd can see and control my screen, and sometimes says mean things, but I also want him to be able to see my agents, and fetching the raw terminal stream is much more efficient than looking at images… we’ll see if that works out!

My workflow

I know… you’re here to learn how to build faster, and I’m just writing a marketing pitch for OpenAI. I hope Anthropic is building Opus 5 and the tide turns again. Competition is good! For now, I love Opus as a general model. My AI agents would be half as fun running on GPT-5. Opus has something special that makes it pleasant to use. I use it for most computer automation tasks, and Clawd🦞 is driven by it too.

My workflow hasn’t changed much since I last wrote about it in October.

  1. I usually work on multiple projects at the same time. Depending on complexity, between 3 and 8. Switching contexts frequently is tiring, and I really can only do it when working from home, quiet and focused. It requires juggling many mental models. Luckily, most software is boring. Building a CLI to check your food delivery status doesn’t require much thinking. Usually I focus on one big project, with a few satellite projects running on autopilot. When you’ve done enough agent engineering, you develop an intuition for what’s easy and where the model will get stuck, so often I just write a prompt, Codex runs for 30 minutes, and I get what I need. Sometimes a little tweaking or creativity is needed, but often it’s straightforward.

  2. I use codex’s queue feature heavily—as soon as I have an idea, I add it to the pipeline. I see many people trying various multi-agent orchestration systems, email or automated task management—so far I don’t see the need—usually the bottleneck is me. The way I build software is very iterative. I make something, play with it, see how it “feels”, then get new ideas to refine it. I rarely have a complete blueprint of the goal in my head. Sure, I have a rough direction, but it changes a lot as I explore the problem space. So systems that take a complete idea as input and output a result don’t work for me. I need to play with it, touch it, feel it, see it—that’s how I evolve it.

  3. I essentially never roll back or use checkpoints. If something isn’t how I like it, I ask the model to change it. Codex sometimes resets a file, but more often it just undoes or modifies those edits; I rarely need to go all the way back, instead we pivot in a different direction. Building software is like climbing a mountain. You don’t go straight up; you circle around, take turns; sometimes you wander off the path and need to backtrack a bit. Not perfect, but eventually you get where you need to be.

  4. I just commit to main. Sometimes codex thinks it’s too messy and auto-creates a worktree, then merges the changes back, but that’s rare, and I only ask it to do that in very exceptional cases. I find the extra cognitive load of having to think about different states in a project unnecessary, and I prefer linear progression. For larger tasks I leave them running while I’m distracted—like writing this, I have 4 projects undergoing refactors right now, each taking about 1-2 hours to finish. Sure, I could do it in worktrees, but that just leads to tons of merge conflicts and suboptimal refactors. Note: I usually work alone; if you’re in a bigger team, this clearly won’t work.

  5. I already mentioned how I plan a feature. I cross-reference projects constantly, especially when I know I’ve solved a problem elsewhere, I’ll tell Codex to look at ../project-folder, which is usually enough for it to infer where to look from context. This is super useful for saving prompts. I just write “look at ../vibetunnel, then do the same thing for Sparkle’s changelog” because that’s already done there, and it’ll copy things over and adapt to the new project with 99% confidence. Same way I scaffold new projects.

  6. I’ve seen many systems for people who want to reference past conversations. Another thing I never need and never use. I maintain docs in the docs folder of each project for subsystems and features, and use a script + instructions in my global AGENTS file to force the model to read docs on certain topics. The bigger the project, the more this pays off, so I don’t use it everywhere, but it’s super helpful for keeping docs up-to-date and building better context for my tasks.

  7. Speaking of context. I used to restart a session very diligently before starting a new task. With GPT-5.2, that’s no longer needed. Performance is extremely good even with a fuller context, and things often speed up because the model already has many files loaded. Obviously this only works if you serialize tasks or keep changes far enough apart that two sessions barely interfere. codex doesn’t have “this file changed” system events like Claude Code, so you need to be a bit more careful—but conversely, I feel codex is just much better at context management, and I’d say I can get 5x more work done in one codex session than in a Claude one. It’s not just the objectively larger context window; there’s something else at play. My guess is that codex internally compresses its thinking very efficiently to save tokens, while Opus is very verbose. Sometimes the model errs and its internal thinking leaks into the user-facing output, which is how I’ve seen this several times. Seriously, codex is good at word economy, which I find oddly entertaining.

  8. Prompts. I used to dictate long, elaborate prompts. With Codex, my prompts are much shorter; I often just type them, and many times I add images, especially when iterating on UI (or on text output from a CLI). If you show the model clearly what’s wrong, just a few words are enough to steer it the way you want. Yes, I’m that person who drags in a screenshot of a UI component and writes “fix padding” or “redesign it”; many times that either solves it or gets me to a pretty good place. I used to reference markdown files, but with my docs:list script, that’s no longer needed.

  9. Markdown files. Many times I’ll write “document this to docs/*.md” and just let the model pick a filename. The more intuitively you align the structure with what the model was trained on, the easier your work gets. After all, I’m not designing codebases for myself to navigate; I’m engineering them to be efficient for agents to work in. Fighting the model is often a waste of time and tokens.

Tools and infrastructure

  1. What is still hard? Choosing the right dependencies and frameworks is where I invest significant time. Is it well maintained? What about peer dependencies? Is it popular—meaning enough world knowledge for the agent to hit the ground running? Same with system design. Are we communicating over WebSockets? HTTP? What goes on the server vs. the client? How and through what channels does data flow between parts? These are often the harder things to explain to the model, and it pays off to research and think deeply about them.

  2. Since I manage many projects, I often have an agent running right in my projects folder; when I figure out a new pattern, I tell it to “find all my recent Go projects and also implement this change there + update changelogs”. Each of my projects bumps the patch version in its changelog; when I look back later, there are often already improvements waiting to be tested.

  3. Of course, I automate everything. Registering domains and modifying DNS is a skill. Writing good frontend is another. My AGENTS file has a note about my Tailscale network, so I just say “go to my Mac Studio and update xxx”.

  4. By the way, about multiple Macs. I usually work on two Macs: my MacBook Pro on the big screen, and another screen connected via Jump Desktop to my Mac Studio. Some projects run there, some here. Sometimes I edit different parts of the same project on each machine and sync via git. Simpler than worktrees, because small divergences on main are easy to reconcile. Bonus: anything that needs UI or browser automation I can move to the Studio, and its popups don’t distract me. (Yes, Playwright has a headless mode, but there are many cases where it doesn’t work.)

  5. Another advantage is that tasks run continuously there, so whenever I travel, the remote becomes my primary workstation, and tasks keep running even when I close the MacBook. I’ve tried truly async agents like Codex Web or Cursor Web before, but I missed the controllability, and ultimately the work would end up as pull requests, adding complexity to my setup. I prefer the simplicity of the terminal.

  6. I’ve played with slash commands but never found them that useful. Skills replaced some of them, and for the rest, I always type “commit/push” because it takes as long as typing /commit and always works.

  7. I used to dedicate days specifically to refactoring and cleaning up projects, but now I do it more opportunistically. Whenever a prompt starts taking too long, or I see something ugly flash by in the code stream, I handle it immediately.

  8. I’ve tried Linear or other issue trackers, but never stuck with them. Important ideas I do immediately; the rest I either remember or they’re not important. Of course I maintain public bug trackers for people using my open source code, but when I find a bug, I fix it immediately—much faster than writing it down first, then switching back to that context later.

  9. Whatever you do, start with the model and a CLI. I’d had the idea for a Chrome extension that summarizes YouTube videos in my head for a while. Last week I started building summarize, a CLI: it takes anything and converts it to markdown, then hands it to the model for summarization. I got the core right first, and once that worked well, building the extension took a day. I love it. Runs locally, supports free or paid models. Transcribes video/audio locally. Communicates with a local daemon, so it’s super fast. Go try it!

  10. My go-to model is gpt-5.2-codex high. Again, KISS. Apart from being much slower, xhigh offers little extra benefit, and I don’t want to spend time fiddling with different modes or “ultrathink”. So basically everything runs on high. GPT-5.2 and codex are close enough that switching models is pointless, so I just use this.
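The convert-then-summarize idea from point 9 boils down to a tiny pipeline: normalize every input type to markdown first, so the model only ever sees one format. A hedged sketch in Go, with stand-in functions rather than the real tool’s internals:

```go
// Rough sketch of a convert-then-summarize pipeline (hypothetical names,
// not the actual summarize tool): everything becomes markdown before the
// model is involved, so the core stays testable as plain text in/out.
package main

import (
	"fmt"
	"strings"
)

// toMarkdown stands in for the real converters (HTML, PDF, transcripts, …).
func toMarkdown(raw string) string {
	return strings.TrimSpace(raw)
}

// summarize stands in for the model call; here it just takes the lead line.
func summarize(md string) string {
	first := strings.SplitN(md, "\n", 2)[0]
	return "TL;DR: " + first
}

func main() {
	md := toMarkdown("  Agents close the loop.\nEverything else is plumbing.\n")
	fmt.Println(summarize(md))
}
```

Because both stages are plain text in, plain text out, the core can be exercised from the terminal long before any extension or daemon exists.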

My config

Here’s my ~/.codex/config.toml:

```toml
model = "gpt-5.2-codex"
model_reasoning_effort = "high"
tool_output_token_limit = 25000

# Leave room for native compaction near the 272–273k context window.
# Formula: 273000 - (tool_output_token_limit + 15000)
# With tool_output_token_limit = 25000 ⇒ 273000 - (25000 + 15000) = 233000
model_auto_compact_token_limit = 233000

[features]
ghost_commit = false
unified_exec = true
apply_patch_freeform = true
web_search_request = true
skills = true
shell_snapshot = true

[projects."/Users/steipete/Projects"]
trust_level = "trusted"
```

This lets the model read more in one go; the default is a bit small and can limit what it sees. It fails silently, which is annoying, and they’ll eventually fix it. Also, is web search still not on by default? unified_exec replaces tmux and my old runner script, and the rest is neat. And don’t fear compaction—since OpenAI switched to their new /compact endpoint, this scheme works well enough that tasks can survive multiple compactions and still complete. It slows things down, but often acts as a review pass, with the model finding bugs when it re-examines the code.

That’s it for now. I plan to write more, I have a bunch of ideas swirling around, but I’m just having too much fun building stuff. If you want more ramblings and thoughts on how to build in this new world, follow me on Twitter.