@algernon @iocaine Thank you for taking the time to engage with my piece and for sharing your
concrete experience with aggressive crawling. The scale you describe, more than
3 million daily requests from ClaudoBot alone, makes the problem tangible in a
way abstract discussion doesn't.
Where we agree: AI companies don't behave ethically. I don't assume they do,
and I certainly don't expect them to voluntarily follow rules out of goodwill.
The environmental costs you mention are real and serious concerns that I share.
And your point about needing training data alongside weights for true
reproducibility is well-taken; I should have been more explicit about that.
On whether they've “scraped everything”
I overstated this point. When I said they've already scraped what they need,
the narrower claim I meant to make was that the major corporations have already
accumulated training corpora large enough that individual developers
withdrawing their code won't meaningfully degrade those models. Your traffic
numbers actually support this: if they're still crawling that aggressively,
they have the resources and infrastructure to get what they want regardless of
individual resistance.
But you raise an important nuance I hadn't fully considered: the value of fresh
human-generated content in an internet increasingly filled with synthetic
output. That's a real dynamic worth taking seriously.
On licensing strategy
I hear your skepticism about licensing, and the Anthropic case you cite is
instructive. But I think we may be drawing different conclusions from it. Yes,
the copyright claim was dismissed while the illegal sourcing claim succeeded,
but this tells me that legal framing matters. The problem isn't that
law is irrelevant; it's that current licenses don't adequately address this use
case.
I'm not suggesting a new license because I believe companies will voluntarily
comply. I'm suggesting it because it changes the legal terrain. Right now they
can argue, as you note, that training doesn't create derivative works and thus
doesn't trigger copyleft obligations. A training-specific copyleft wouldn't
eliminate violations, but it would make them explicit rather than ambiguous. It
would create clearer grounds for legal action and community pressure.
You might say this is naïve optimism about the law, but I'd point to the GPL's
history. It, too, faced the critique that corporations would simply ignore it. They
didn't always comply voluntarily, but the license created the framework for
both legal action and social norms that, over time, did shape behavior.
Imperfectly, yes, but meaningfully.
The strategic question I'm still wrestling with
Here's where I'm genuinely uncertain: even if we grant that licensing won't
stop corporate AI companies (and I largely agree it won't, at least not
immediately), what's the theory of victory for the withdrawal strategy?
My concern, and I raise it not as a gotcha but as a genuine question, is that
OpenAI and Anthropic already have their datasets. They have the resources to
continue acquiring what they need. Individual developers blocking crawlers may
slow them marginally but won't stop them. What withdrawal will do, I fear, is
starve open source AI development of high-quality training data.
The companies you're fighting have billions in funding, massive datasets, and
legal teams. Open-weight efforts like Llama or Mistral, and the broader
ecosystem of researchers trying to build non-corporate alternatives, don't. If
the F/OSS community treats AI training as inherently unethical and withdraws
its code from that use, aren't we effectively conceding the field to exactly
the corporations we oppose?
This isn't about “accepting reality” in the sense of surrender. It's about
asking: what strategy actually weakens corporate AI monopolies versus what
strategy accidentally strengthens them? I worry that withdrawal achieves the
latter.
On environmental costs and publicization
I agree that freeing model weights alone doesn't solve the environmental costs.
But I'd argue that the publicization of models does address them, though
perhaps I didn't make the connection clear enough.
Right now we have competitive redundancy: every major company training similar
models independently, duplicating compute costs. If models were required to be
open and collaborative development were the norm, we'd see less wasteful
duplication. This is one reason treating LLMs as public infrastructure rather
than private property matters: not just for access, but for efficiency.
The environmental argument actually cuts against corporate monopolization, not
for it.
A final thought
I'm not advocating negotiation with AI companies in the sense of compromise or
appeasement. I'm advocating for a different field of battle. Rather than
fighting to keep them from training on our work (a fight I don't believe we can
win), I'm suggesting we fight over the terms: demanding that what's built from our
commons remains part of the commons.
You invoke the analogy of not negotiating with fascists. I'd push back gently
on that framing, not because these corporations aren't doing real harm, but
because the historical anti-fascist struggle wasn't won through withdrawal. It
was won through building alternative power bases, through organization, through
creating the structures that could challenge and eventually supplant fascist
power.
That's what I'm trying to articulate: not surrender to a “new reality,” but the
construction of a different one, one where the productive forces of AI are
brought under collective rather than private control.
I may be wrong about the best path to get there. But I think we share the
destination.