This is the living view. The current stance: agent-first workflow with a strong editor-as-escape-valve, a small set of carefully chosen models, and a deliberate bias toward running code over generating text. It will change. The changelog at the bottom records how it has changed.

The shape of the stack

The stack has three layers, each with one or two tools. I resist the temptation to add a third tool to any layer; if I cannot justify why a layer needs two, I do not have two.

The surface layer. Where I do the work. Claude Code in the terminal for any task that is “build me a thing” or “help me think about this with code.” VS Code (with the Copilot extension disabled by default) when I need to think with my hands — read a stack trace, scan a diff, edit a single file with intent. I do not use Cursor anymore; see the journal entry on the switch for the why.

The model layer. Claude Sonnet 4.5 is my default for everything that is not a long-context reasoning task. Claude Opus 4 for long-context reasoning, document synthesis, and “make this look like the work of a careful person.” I have tried GPT-5 and Gemini 2.5 Pro for specific tasks; Sonnet remains the right default for me. I do not yet have a strong opinion on open-weight models for my daily work.

The orchestration layer. The Model Context Protocol (MCP) for any tool integration. I have three MCP servers running locally: filesystem (for ad-hoc file work), a small SQLite helper, and a Playwright server for browser automation. I am not yet sold on agent frameworks — LangGraph feels heavyweight for the work I do; the actual agent loops in my code are 20-50 lines, not 2000.

What I have learned about prompting

Two things, both small:

Prompts are documents, not queries. I write the prompt for a long-running task the way I would write a design doc: scope, non-goals, the shape of the output, the constraints, an example. A good prompt for a 30-minute task looks like a good design doc. A good prompt for a 5-minute task looks like a good bug report. I rarely write a one-liner that is not part of a longer context.

Prompt caching changes the economics. When the model can cache a long-running context, I stop pre-summarizing and start pre-loading. The same prompt that would have cost me $2 in tokens now costs me $0.20 because the system prompt and the codebase are cached. The implication is that the right prompt length is as long as it needs to be, not as short as I can make it.

What I am still figuring out

  • The right amount of test coverage. The agent can write tests, but it can also write tests that pass without proving anything. I do not yet have a strong heuristic for how much to ask for.
  • The right amount of plan-then-execute. Some tasks want a 5-paragraph plan; some want a single line; some want neither, just to start writing. I am still learning which is which.
  • When to keep the agent in the loop. I do not want to be in the approval loop for every file write. I also do not want to hand the keys to a model that does not know my taste. The right cadence is “I review the plan, I do not review the diff unless the plan is wrong.”

What I do not have

A model for “I want to think with you, not have you think for me.” I have found the cursor-style inline autocomplete useful for the thinking with my hands mode, but Copilot is not quite right; Continue is closer in spirit but the UX is rough. If someone built a good “rubber duck that types” product, I would pay for it.