This is not the first time I've tried a fully vibecoded project. Last year had me
return multiple times to different providers to form my own opinion given the disparity
of public opinion. I never walked away before with the feeling that coding is solved
.
What follows is a rundown of my experience over the course of the month, followed by my takeaways.
The project, agent operation, and disclosure of preconceptions
Selected a project that was untenable in a reasonable timeframe given my skills. A plugin for the IntelliJ IDE platform, written in Kotlin. A new domain, and a language I don't use. Small in scope, greenfield project, with no entrenched domain knowledge-checks, for the model, or for me.
As is customary across the industry to build coding harnesses the plugin is another LLM
integration plugin, in
the sea of LLM plugins. In few words, a plugin that works well with self-hosted
models, using them as ad-hoc fill-in-the-middle capable models, and never
proactively interrupting the flow of programming.
See the Alternatively prompt
GitHub
repository for more information.
I believe that coding agents are best used as tools to generate prototypes. Attempting to solve problems that are outside the area of expertise of the operator. A similar belief concerns the practical applications of general purpose LLMs for business integrations.
The setup, and the agent operation
Ask ten people how to best operate an agent and you'll get eleven answers. Maybe
you'll also receive suggestions to run agents that run agents which check agents that
control agents that actually code. If AI vibe-coding ultimately is the future, I'd like to
prompt the way I code,
lazily.
I've kept things simple though not optimal.
And if there is an optimal, companies that
are a bit too well funded, pre-IPO, should be the ones enlightening us on the one true way. ™
-
Best in class model.
Unless limited by the tool (planning mode), always ran GPT-5.5 xhigh (extra high reasoning). -
Planning for larger features.
The project had three main project features, and several functional rewrites. Most, if not all, of these went through the general plan then execute flow. In general improvements that only consisted of a single short sentence prompt (change color
,add accessibility label
,fix prompt for e2e test
, etc) skipped the plan phase. -
No agent-specific project adjustments beforehand.
No AGENTS.md or equivalent markdown files. Whatever was in the upstream base template repository used as is without review. Later in the process, when the agent forgot how to run tests, was instructed to generate an AGENTS.md file.
The weekly limits, and the effective daily limits
On the Plus plan
, as expected, the quotas become obvious quickly.
Their terms of service nowadays state those quotas are based on token usage, but when full allocation
and utilization are not visible throughout the interface, how do you gauge how long you can work
in a session?
Progress wasn't steady every day. Some days I would be fixing an issue or two, and due to a more extensive e2e run I would be running out of requests pretty quickly. I assume in part due to the internal looping of codex to check if processes have finished. If every time it checked that a long running e2e tests wasn't done, and if it sent the entire context with each request, I can only imagine how much useless computation it could chew through.
Surprising model behaviour when close to the usage limit.
I quickly learned that I should avoid any meaningful request when closing in on the 5 hour limit.
While in the past I've seen model behaviour in which a response would be cut off abruptly (my
experience last year when trialing JetBrains' Junie agent), with codex I felt shortchanged.
Instead of producing a smaller part due the constraints, the model didn't stray from the
prompt but made undesirable shortcuts. Instead of adjusting an end-to-end test as requested,
it hardcoded asserts for tests to pass. Behaviour unobserved outside of this scenario.
The weekly limits most
of the time would reset sooner than what was advertised in the tool.
I could draft up
my own theory on why the dates and times were out of sync, but you won't hear me complain
loudly, as this allowed me to work
more on this project more often. Good reminder
that you can have billions of dollars in funding and still made the most basic mistakes
even with PhD-level inteligence
at your disposal.
Model degradation during peak hours?
It is a known fact that AI labs are capacity constrained, though how that affects day to day usage remains mostly a guess. In practice what I've noticed, was that around the time US East Coast came online, model behaviour would slightly change.
When those times of the day would roll around, around my 4-5pm afternoons, GPT would be more
likely to create checklists while working on a problem. This is not to say that I haven't
seen this umprompted behaviour before, which I'm sure it helps on long-running tasks to keep
it aligned on the task. But during those overlap times it was far more likely to take that
approach, which highlights that something changes due to compute demand. In other normal usage,
aside from executing on a plan, that manifestation
was invisible.
Am I always getting the best in class model, or similar to Anthropic's past bugs, I was silently downgraded to another model?
Since I didn’t run this empirically, I can only subjectively say that during those periods the model was more likely to discard existing decisions. I needed to be often adamant about accessibility, and usability considerations because changes made during those hours would break established improvements.
My takeaways
At the end of this trial period I had something I would be able to release. This stands in contrast with my experience over the last year. There is much that can be attributed to better harnessing of the models, such as their encoded reliance on writing tests even when unprompted. I would guesstimate that the LLM took me around to 90%, functionally, towards how I was envisioning the project as it was being built.
What is the license and liability of this code? The current critical consensus is that GenAI code is uncopyrightable unless enough human authorship is demonstrable. If prompting the model can be considered enough is something that still needs to be settled. In terms of liability how can one assure that outputs conform to existing requirements without reviewing changes to their fullest? These questions will hopefully be answered by December this year when the Cyber Resilience Act comes in full effect.
There is no denying that vibe-coding today leads to more consistent results than before. However it might be worth considering that if we use them as prototyping tools, we could also start applying an old principle from the book Mythical Man-Month, that is, build one to be thrown away. The process of building is the process of learning the domain.