洪 民憙 (Hong Minhee) 
@hongminhee@hollo.social · Reply to Gergely Nagy 🐁's post
@algernon @iocaine Thank you for taking the time to engage with my piece and for sharing your concrete experience with aggressive crawling. The scale you describe—3+ million daily requests from ClaudeBot alone—makes the problem tangible in a way abstract discussion doesn't.
Where we agree: AI companies don't behave ethically. I don't assume they do, and I certainly don't expect them to voluntarily follow rules out of goodwill. The environmental costs you mention are real and serious concerns that I share. And your point about needing training data alongside weights for true reproducibility is well-taken—I should have been more explicit about that.
On whether they've “scraped everything”
I overstated this point. When I said they've already scraped what they need, I was making a narrower claim than I stated: that the major corporations have already accumulated sufficient training corpora that individual developers withdrawing their code won't meaningfully degrade those models. Your traffic numbers actually support this—if they're still crawling that aggressively, it means they have the resources and infrastructure to get what they want regardless of individual resistance.
But you raise an important nuance I hadn't fully considered: the value of fresh human-generated content in an internet increasingly filled with synthetic output. That's a real dynamic worth taking seriously.
On licensing strategy
I hear your skepticism about licensing, and the Anthropic case you cite is instructive. But I think we may be drawing different conclusions from it. Yes, the copyright claim was dismissed while the illegal sourcing claim succeeded—but this tells me that legal framing matters. The problem isn't that law is irrelevant; it's that current licenses don't adequately address this use case.
I'm not suggesting a new license because I believe companies will voluntarily comply. I'm suggesting it because it changes the legal terrain. Right now, they can argue—as you note—that training doesn't create derivative works and thus doesn't trigger copyleft obligations. A training-specific copyleft wouldn't eliminate violations, but it would make them explicit rather than ambiguous. It would create clearer grounds for legal action and community pressure.
You might say this is naïve optimism about law, but I'd point to GPL's history. It also faced the critique that corporations would simply ignore it. They didn't always comply voluntarily, but the license created the framework for both legal action and social norms that, over time, did shape behavior. Imperfectly, yes, but meaningfully.
The strategic question I'm still wrestling with
Here's where I'm genuinely uncertain: even if we grant that licensing won't stop corporate AI companies (and I largely agree it won't, at least not immediately), what's the theory of victory for the withdrawal strategy?
My concern—and I raise this not as a gotcha but as a genuine question—is that OpenAI and Anthropic already have their datasets. They have the resources to continue acquiring what they need. Individual developers blocking crawlers may slow them marginally, but it won't stop them. What it will do, I fear, is starve open source AI development of high-quality training data.
The companies you're fighting have billions in funding, massive datasets, and legal teams. Open source projects like Llama or Mistral, or the broader ecosystem of researchers trying to build non-corporate alternatives, don't. If the F/OSS community treats AI training as inherently unethical and withdraws its code from that use, aren't we effectively conceding the field to exactly the corporations we oppose?
This isn't about “accepting reality” in the sense of surrender. It's about asking: what strategy actually weakens corporate AI monopolies versus what strategy accidentally strengthens them? I worry that withdrawal achieves the latter.
On environmental costs and publicization
Freeing model weights alone doesn't solve environmental costs, I agree. But I'd argue that publicization of models does address this, though perhaps I didn't make the connection clear enough.
Right now we have competitive redundancy: every major company training similar models independently, duplicating compute costs. If models were required to be open and collaborative development was the norm, we'd see less wasteful duplication. This is one reason why treating LLMs as public infrastructure rather than private property matters—not just for access, but for efficiency.
The environmental argument actually cuts against corporate monopolization, not for it.
A final thought
I'm not advocating negotiation with AI companies in the sense of compromise or appeasement. I'm advocating for a different field of battle. Rather than fighting to keep them from training (which I don't believe we can win), I'm suggesting we fight over the terms: demanding that what's built from our commons remains part of the commons.
You invoke the analogy of not negotiating with fascists. I'd push back gently on that framing—not because these corporations aren't doing real harm, but because the historical anti-fascist struggle wasn't won through withdrawal. It was won through building alternative power bases, through organization, through creating the structures that could challenge and eventually supplant fascist power.
That's what I'm trying to articulate: not surrender to a “new reality,” but the construction of a different one—one where the productive forces of AI are brought under collective rather than private control.
I may be wrong about the best path to get there. But I think we share the destination.