洪 民憙 (Hong Minhee) :nonbinary:

@hongminhee@hollo.social

Been thinking a lot about @algernon's recent post on FLOSS and LLM training. The frustration with AI companies is spot on, but I wonder if there's a different strategic path. Instead of withdrawal, what if this is our GPL moment for AI—a chance to evolve copyleft to cover training? Tried to work through the idea here: Histomat of F/OSS: We should reclaim LLMs, not reject them.

Gergely Nagy 🐁

@algernon@come-from.mad-scientist.club · Reply to 洪 民憙 (Hong Minhee) :nonbinary:'s post

@hongminhee I think you're giving the AI companies too much credit and goodwill. The technology itself may have its uses, but their total lack of respect is not the only problem. The energy required for training and inference, and the environmental impact of it, will not be addressed by freeing the models.

But... that's probably worth another blog post. Nevertheless, I'd like to address a few things here:

OpenAI and Anthropic have already scraped what they need.

They did not. I'm receiving 3+ million requests a day from Anthropic's ClaudeBot, and about 1 million a day from OpenAI: see the stats on @iocaine. If they had already scraped everything they need, they wouldn't keep up the practice this aggressively, would they?

They need new content to "improve" the models. They need new data to scrape now more than ever, because as the internet fills up with slop, legitimate human work to train on becomes ever more desirable.

Heck, I've seen scraping waves where my sites received over 100 million requests in a single day! I do not know which companies were responsible (though I have my suspicions), but they most definitely do not have all the data they need.

And I'm hosting small, tiny things, nothing of much importance. I imagine more juicy targets like Codeberg receive a whole lot more of these.

GitHub already has everyone's code.

GitHub has a lot of code, but far from everyone's. And we shouldn't give them more to exploit. Just because they already exploited us in the past 10 years doesn't mean we should "accept reality" and let them continue.

Then, you go and ponder licensing: it doesn't matter. See the beginning of my blog post:

None of the major models keep attribution properly, and their creators and the proponents of these “tools” assert that they do not need to, either: by the nature of the training, the models recycle and remix, and no substantial code is emitted from the original as-is, only small parts that are not copyrightable in themselves. As such, the outputs do not constitute derived works, and no attribution is necessary.

No matter how you word your licensing, as long as they can argue that training emits only uncopyrightable fragments through remixing and recycling, your license is irrelevant.

You can try to add a clause that explicitly allows training only if the weights are released: they will not care. Once they deem your code uncopyrightable, they can do whatever they want, just as they have been doing all along.

You assume these companies behave ethically. They do not.

See the recent-ish Anthropic v. Authors case: Anthropic was fined not because the training violated copyright, but because they sourced the books illegally. The copyright claim over the training itself was dismissed.

Why do you think applying a different license would help, when there's existing legal precedent that it does fuck all?

Also, releasing the weights is... insufficient. Important, but insufficient. To free a model, you also need the training data: it should be considered part of the model's source, because without it you cannot reproduce the model.

Good luck with that. There is no scenario where surrendering to this "new reality" plays out well.

It really is quite simple: as we do not negotiate with fascists, we do not negotiate with AI companies either.

洪 民憙 (Hong Minhee) :nonbinary:

@hongminhee@hollo.social · Reply to 洪 民憙 (Hong Minhee) :nonbinary:'s post

I think that rather than trying to stop AI companies from training LLMs on F/OSS code, we should demand that they release the models trained on it.

Not withdrawal, but reappropriation! Just like the GPL did.

I've written a post about training copyleft: "Histomat of F/OSS: We should reclaim LLMs, not reject them" (in Korean).