A month of vibe-coding at 0.01x velocity

This is not the first time I've tried a fully vibecoded project. Last year had me return multiple times to different providers to form my own opinion given the disparity of public opinion. I never walked away before with the feeling that coding is solved.

What follows is a rundown of my experience over the course of the month, followed by my takeaways.

The project, agent operation, and disclosure of preconceptions

Selected a project that was unreasonable for a limited timeframe, given my skills. A plugin for the IntelliJ IDE platform, written in Kotlin. A new domain, and a language I don't use. Small in scope, greenfield project, with no entrenched domain knowledge-checks, for the model, or for me.

As is customary across the industry to build coding harnesses the plugin is another LLM integration plugin, in the sea of LLM plugins. In few words, a plugin that works well with locally hosted models, using them as ad-hoc fill-in-the-middle capable models, and never proactively interrupting the flow of programming. See the Alternatively prompt GitHub repository for more information.

I believe that coding agents are best used as tools to generate prototypes. Attempting to solve problems that are outside the area of expertise of the operator. A similar belief concerns the practical applications of general purpose LLMs for business integrations.

The setup, and the agent operation

Ask ten people how to best operate an agent and you'll get eleven answers. Maybe you'll also receive suggestions to run agents that run agents which check agents that control agents that actually code. If AI vibe-coding ultimately is the future, I'd like to prompt the way I code,
lazily.

I've kept things simple though not optimal. And if there is an optimal, companies that are a bit too well funded (pre-IPO) should be the ones enlightening us on the one true way. ™

Best in class model.
Unless limited by the tool (planning mode), always ran GPT-5.5 xhigh (extra high reasoning).
Planning for larger features.
The project had three main project features, and several functional rewrites. Most, if not all, of these went through the general plan then execute flow. In general improvements that only consisted of a single short sentence prompt (change color, add accessibility label, fix prompt for e2e test, etc) skipped the plan phase.
No agent-specific project adjustments beforehand.
No AGENTS.md or equivalent markdown files. Whatever was in the upstream base template repository used as is without review. Later in the process, when the agent forgot how to run tests, was instructed to generate an AGENTS.md file.

The weekly limits, and the effective daily limits

On the Plus plan, as expected, the quotas become obvious quickly. Their terms of service nowadays state those quotas are based on token usage, but when full allocation and utilization are not visible throughout the interface, how do you gauge how long you can work in a session?

Progress wasn't steady every day. Some days I would be fixing an issue or two, and due to a more extensive e2e run I would be running out of requests pretty quickly. I assume in part due to the internal looping of codex to check if processes have finished. If every time it checked that a long running e2e test wasn't done, and if it sent the entire context with each request, I can only imagine how much useless computation it could chew through.

Surprising model behaviour when close to the usage limit.
I quickly learned that I should avoid any meaningful request when closing in on the 5 hour limit. While in the past I've seen model behaviour in which a response would be cut off abruptly (my experience last year when trialing JetBrains' Junie agent), with codex I felt shortchanged. Instead of producing a smaller part due the constraints, the model didn't stray from the prompt but made undesirable shortcuts. Instead of adjusting an end-to-end test as requested, it hardcoded asserts for tests to pass. Behaviour unobserved outside of this scenario.

The weekly limits most of the time would reset sooner than what was advertised in the tool.
I could draft up my own theory on why the dates and times were out of sync, but you won't hear me complain loudly, as this allowed me to work more on this project more often. Good reminder that you can have billions of dollars in funding and still make the most basic mistakes even with PhD-level intelligence at your disposal.

Model degradation during peak hours?

It is a known fact that AI labs are capacity constrained, though how that affects day to day usage remains mostly a guess. In practice what I've noticed, was that around the time US East Coast came online, model behaviour would slightly change.

When those times of the day would roll around, around my 4-5pm afternoons, GPT would be more likely to create checklists while working on a problem. This is not to say that I haven't seen this umprompted behaviour before, which I'm sure it helps on long-running tasks to keep it aligned on the task. But during those overlap times it was far more likely to take that approach, which highlights that something changes due to compute demand. In other normal usage, aside from executing on a plan, that manifestation was invisible.

Am I always getting the best in class model, or similar to Anthropic's past bugs, I was silently downgraded to another model?

Because I didn’t run this empirically, I can only subjectively say that during those periods the model was more likely to discard existing decisions. I needed to be often adamant about accessibility, and usability considerations because changes made during those hours would break established improvements.

My takeaways

At the end of this trial period I had something I would be able to release. This stands in contrast with my experience over the last year. There is much that can be attributed to better harnessing of the models, such as their encoded reliance on writing tests even when unprompted. I would guesstimate that the LLM took me around to 90%, functionally, towards how I was envisioning the project as it was being built.

What is the license and liability of this code? The current critical consensus is that GenAI code is uncopyrightable unless enough human authorship is demonstrable. If prompting the model can be considered enough is something that still needs to be settled. In terms of liability how can one assure that outputs conform to existing requirements without reviewing changes to their fullest? These questions will hopefully be answered by December this year when the Cyber Resilience Act comes in full effect.

There is no denying that vibe-coding today leads to more consistent results than before. However it might be worth considering that if we use them as prototyping tools, we could also start applying an old principle from the book Mythical Man-Month - that is - build one to be thrown away. The process of building is the process of learning the domain. What if we use them more for learning the problem space, instead of producing software that goes straight to production with all the liabilities involved?

A video demo of the plugin in action with trivial examples, backed by the self-hosted gpt-oss-20B model.

vibe-coding:
a term used to describe programming with the assistance of AI without any manual intervention or review of the generated code.
agent:
an orchestration approach in which a model is given the tools and automated oversight to complete tasks autonomously.
codex: OpenAI's coding agent tooling that runs in the terminal.
end‑to‑end tests:
a test that exercises the entire application stack from user input to its last output.
AGENTS.md:
a markdown file that documents the capabilities and usage patterns within a project for the agents.
Mythical Man‑Month:
a classic book on software engineering that famously states “adding manpower to a late software project makes it later.”