Programming, LLMs
The way every scammer is trying to use it to bypass the law and not pay human workers is its own phenomenon, but technically, the reason it even copies people so much is that it's just trying its best to model its training data.
If you can somehow get a usable training dataset for the one domain you want to deploy the probabilistic machine in, you could use the machine and find it not doing any of that stuff. It just behaves as a fuzzy transformer for the specific thing you trained it on.
And I think current LLMs could've gotten to their current grasp of English using public domain text. Even if it ends up speaking like a Victorian program, that's good enough.
Tech, LLMs
It's just, I think the rest of the world has a better opinion of this stuff not only because they're less politically literate, but because of the trait of LLMs where they can switch languages arbitrarily (because in the training data, words from every language around the same concept sit close together). So people are getting all the jank of American LLMs, *but in their language*, so it's "foreign", "more advanced" jank.
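The "words from every language around the same concept sit close together" bit can be shown with a toy cosine-similarity check. The vectors below are made up purely for illustration; real models learn this geometry from co-occurrence across multilingual training data.

```python
# Toy illustration of cross-lingual embedding proximity.
# These vectors are invented for the demo, not taken from any real model.
import math

embeddings = {
    # "dog" in three languages clusters together in this toy space...
    "dog":     [0.90, 0.10, 0.00],
    "perro":   [0.88, 0.12, 0.02],
    "Hund":    [0.91, 0.09, 0.01],
    # ...while an unrelated concept sits elsewhere.
    "tuesday": [0.05, 0.20, 0.95],
}

def cosine(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["dog"], embeddings["perro"]))    # near 1.0: same concept
print(cosine(embeddings["dog"], embeddings["tuesday"]))  # much lower: unrelated
```

Swapping one language's word for another near the same point in the space is, roughly, what lets the model switch languages mid-output.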
So like, I think the tech is, like, actually something when it comes to language, cause that's where all the research had been going in the decades prior. This thing is basically an offshoot of Google Translate.
Tech, LLMs
Everything else, though, is what you'd expect probabilistic machines to output for a domain. Believable (= "looks like the real thing" = "I'm comparing it with real-world data" = "I'm comparing it with what I trained it against") but amorphously corrupting and fundamentally untrue stuff. It's a machine starting from the output, not the input; it's not an intelligence machine.
Tech, LLMs
@snowyfox open translation models trained on publicly available data exist afaik? like the one built into Firefox for offline translation works pretty well. I was trying to figure out what training data they use exactly and couldn't (but I gave up pretty quickly) but there's definitely plenty around
Programming, LLMs
@noiob Mm, but it's not translation I'm talking about. I mean the use case where people feed user messages into an LLM to get some tags/judgements/guesses, or feed JSON into an LLM to have it normalised and massaged slightly
Where the English language is involved in the field values and such, some ability to deal with English is needed, and a local LLM trained on Victorian-era books could pass the test
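A minimal sketch of that tagging/normalising use case, with the model call itself left out: the prompt construction and the defensive parsing are the parts that matter regardless of which local model you run, and the example reply below is hand-written, not real model output.

```python
# Sketch of the "feed a message to an LLM, get tags back" pattern.
# The actual model call is omitted; only prompt building and reply
# parsing are shown, since those are model-agnostic.
import json

def build_tag_prompt(user_message: str) -> str:
    # Constrain the model to a fixed output shape so the reply is parseable.
    return (
        "Reply with only a JSON object of the form "
        '{"tags": [...], "sentiment": "positive"|"neutral"|"negative"}.\n'
        f"Message: {user_message}"
    )

def parse_tags(model_reply: str) -> dict:
    # Probabilistic output needs defensive parsing: the model may wrap
    # the JSON in prose, so grab the outermost {...} span and validate keys.
    start, end = model_reply.find("{"), model_reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model reply")
    data = json.loads(model_reply[start : end + 1])
    if "tags" not in data or "sentiment" not in data:
        raise ValueError("missing expected keys")
    return data

# Hand-written stand-in for a model reply, for demonstration only:
reply = 'Sure! {"tags": ["shipping", "delay"], "sentiment": "negative"}'
print(parse_tags(reply))
```

The point of the fixed-shape prompt plus validation is exactly the "believable but untrue" problem above: you can't trust the machine to emit clean JSON, so the surrounding code has to check everything it claims.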