@zanzi @jnkrtech Your point about abstraction ladders is something I've been turning over since I read it. The Terence Tao/Lean combination feels like a glimpse of exactly what you're describing: Lean's type system carries so much semantic weight that the LLM doesn't need to compensate with volume. The proof is short because the language is expressive enough to make it short. That's very different from what happens when you point an LLM at TypeScript.
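A toy illustration of what I mean (my own example, not from the Tao work): in Lean 4, the theorem statement itself encodes the entire claim, so the proof can collapse to a single reference to an existing lemma. The type checker, not verbosity, is doing the work.

```lean
-- The signature states the whole claim; the checker enforces it.
-- The proof body is just a pointer to a core library lemma.
theorem addComm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

An LLM emitting this has almost no room to pad: anything beyond the lemma reference either fails to type-check or is visibly redundant.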
I'm skeptical of vibe coding, and have been from the start. Generating an entire project from prompts feels to me like a path to a maintainability disaster, and I think most of the people currently excited about it haven't yet had to clean up what they made. The enthusiasm reads a lot like the dynamic typing boom of the 2000s: Ruby, Python, JavaScript, the whole wave. “We can build so fast.” True, and then ten years passed, the codebases grew, the teams changed, and people started hitting walls they hadn't anticipated. Python grew type hints. Flow and TypeScript appeared. Ruby quietly declined. The reckoning came; it just took a while.
I expect vibe coding to follow the same curve. One difference worries me, though. With dynamic typing, the code was at least written by humans who understood it at the time. The technical debt was “hard to read.” With LLM-generated code that nobody reviewed deeply, the debt is something else: code that exists for reasons nobody can reconstruct. That's a harder problem.
There's a related problem I don't think more training data will fix. LLMs converge toward the average of what they've seen, and the average code on the internet is not concise. Verbose code is the norm; terse, well-factored code is rare, and usually underdocumented, so it contributes a weak training signal at best. The result is that LLMs have internalized the habits of the median developer: defensive, repetitive, over-specified. Conciseness requires knowing what not to write, and that judgment depends on domain context and something like aesthetic sense—neither of which transfers easily through pretraining. I don't see a scaling path out of that.
My own workflow tries to avoid this. Even when I use an LLM for Fedify, I steer constantly: small outputs, immediate review, corrections before moving on. The LLM is closer to a fast typist than an autonomous collaborator. It still helps, but the judgment about what to write, what to cut, where to stop, stays with me.
Which brings me to your actual question: what should the metric be? I don't have a clean answer, but I think it has something to do with how much of the codebase a human can hold in their head and feel responsible for. LOC never measured that. Neither does “prompt to working demo.” Whatever comes next probably needs to.
And on the higher-level languages point: I think you're right, and I'd add that this might be where the more interesting craft ends up living. Not writing the implementation, but designing the abstractions well enough that the implementation, whoever or whatever produces it, stays within bounds a human can oversee. That's a different skill from what most developers have trained, but it doesn't feel like a lesser one.