No subject

Wed Jun 17 12:34:00 PDT 2026

"Regina Obe" <lr at pcorp.us> writes:

> Like the only thing I can think of is to claim -- well LLMs are all trained
> on stolen data therefore their work is stolen.

Nobody has a straightforward credible argument that LLM output is
compatible with any free software license.  The standard argument you
see is a massive handwave that the large-scale commercial copying must
be fair use, because that's the only way to get to the answer people
decided they wanted before they started appearing to analyze.

The Anthropic settlement is a huge clue: Anthropic thought it
better to pay $2.5B than to risk a ruling that might say it was all
copyright infringement and that they have to delete all their stolen
training data and their models.

> But there is no proof to that and there is no way to assume the same about a
> user doing the same thing.

You are inverting the rules.  We don't reject code only when we can prove
it is copyright infringement.  We have a positive expectation of proper
licensing (e.g., the DCO).  And yes, there can be bad actors who willfully
misrepresent things.

> So anyway a lot of these pull requests look like trying to solve problems
> the way we've solved in the past. Can we provide guidance on what is good
> code to follow and what is not?

If you want to train a model only on the existing PostGIS code base,
that's something else.  But nobody is doing that.

The other thing is the social aspect.  Codeberg is suffering a DoS
which I believe is mainly AI scrapers.  So we should not accept any LLM
code unless there is good reason to believe that the entire model
creation behaved properly.  People are choosing to look the other way
and to enable bad behavior.