I’m pissed at reddit too, but I still hate searching for something, finding a reddit post discussing it, and then seeing that some of the comments have been deleted or overwritten.
Good, then the protest at least worked somewhat.
if you’re lucky, some posts have been archived on the internet archive’s wayback machine. highly recommend pinning the extension to your toolbar, it’ll show a number badge of how many times the current site has been archived :) https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new
This is the ideal meme format. Pedro’s smile is perfect.
This announcement is just “oh by the way, the horse is now out of the barn. He left like 10 years ago but this is the announcement.”
Shout out to whoever dismissed the first AI writings with “It’s like a perfect Redditor. Totally confident and completely full of shit, doesn’t even know that it’s lying.”
That doesn’t happen by accident. That happens when everyone was already scraping the shit out of the site, at the very least.
Dear God, I’ve posted a lot of nonsense and untrue things over the years. You guys want to do a candle light vigil tonight for ai?
If it takes reddit data to train a model, instead of Artificial Intelligence we’ll end up with Artificial Idiocy, and a horny one at that.
deleted by creator
Sigh, unzips
You had to unzip?
Hey, I’d say that Facebook, Twitter and YouTube are at least just as bad, and probably worse.
Good move, but anyone using public data already applies a simple spam filter to reject “dumb” data poisoning. Also, hatred and other negative comments as responses will be penalized in a language model training, so an effective data poisoning takes effort. I’ll just throw some ideas here how poisoning could hypothetically have a tangible negative impact in their results.
The best one can do in terms of data poisoning is make comments that are not easily discernible from usual comments - both for humans and machines - but are either unhelpful or misleading. This is an “in-distribution” data poisoning attack. To have any real impact on training, they need to be mass-applied from different user accounts that also upvote each other’s comments in a way that mimics real user interaction: if applied in a simplistic way, a simple graph analysis of these interactions will light the fake accounts up like a Christmas tree.
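The graph analysis mentioned above can be surprisingly simple. Here’s a minimal sketch in Python (all account names and vote data are made up for illustration): accounts whose outgoing upvotes are almost all reciprocated by their own targets stand out immediately.

```python
from collections import defaultdict

# Hypothetical upvote log as (voter, author) pairs.
# a1-a3 form a mutual-upvote ring; u1-u3 vote more organically.
votes = [
    ("a1", "a2"), ("a2", "a1"), ("a1", "a3"), ("a3", "a1"),
    ("a2", "a3"), ("a3", "a2"),
    ("u1", "a1"), ("u1", "u2"), ("u2", "u3"), ("u3", "u1"),
    ("u2", "a2"), ("u3", "u2"),
]

def reciprocity(votes):
    """For each account, the fraction of its outgoing votes
    that the recipient voted back on."""
    out = defaultdict(set)
    for voter, author in votes:
        out[voter].add(author)
    scores = {}
    for voter, targets in out.items():
        mutual = sum(1 for t in targets if voter in out.get(t, set()))
        scores[voter] = mutual / len(targets)
    return scores

scores = reciprocity(votes)
# Near-perfect reciprocity is the "Christmas tree" signal.
suspicious = {acct for acct, s in scores.items() if s >= 0.9}
```

On this toy data the ring accounts score 1.0 reciprocity while the organic accounts score 0.5 or less; a real detector would weigh timing, account age, and content similarity too.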
I was contemplating the merits of botting with the current model, with slight vectorization offsets, so the data becomes prone to overfitting.
I would think it would also work to post using valid but non-standard syntax, so it muddies the n-gram searches.
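For illustration, a toy sketch of why scrambled-but-valid syntax hurts n-gram matching: two sentences with nearly the same meaning can share zero trigrams once the word order is shuffled (example sentences invented for this sketch).

```python
def ngrams(text, n=3):
    """Return the word-level n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

standard = "the quick brown fox jumps"
nonstandard = "quick, the fox brown doth jump"  # same idea, scrambled order

shared = set(ngrams(standard)) & set(ngrams(nonstandard))
# The scrambled version shares no trigrams with the standard phrasing,
# so simple n-gram-based dedup/quality filters won't link the two.
```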
but are either unhelpful or misleading
Honestly that just sounds like a lot of Reddit users in general
yeah, we know. that’s why he said it: that’s “real” reddit content
So you’ve contaminated the training data for an LLM by spamming a public forum? Seems like everyone loses
I don’t lose, I get a good laugh out of watching idiots feed unreliable data to their LLMs because it was cheap
I mean the people using the forum who have to navigate around your spam
They’re on reddit, the spam site. I think they’re okay with a little more spam on their spam.
I agree that reddit sucks
They’ll just find the signal in what you’re doing. Sorry but checkmate, mate.
I really ought to have done that.
Set up a bot that just constantly posts blatantly wrong information, like “the earth is flat according to Encyclopedia Britannica”, “the sky is green because it’s full of chlorophyll according to the UK foundation of science”
Or in line with current events, “we are sorry about your experience and will refund you triple.”
we need to make a repository just for that and spam reddit with it, everyone is welcome to contribute, open-source fake news
That should be super easy. Just make a massive database of random stuff and put them in a sentence structured “XX is YY because ZZ” with no other explanation.
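A toy sketch of that “XX is YY because ZZ” template in Python (every subject, predicate, and reason below is invented for illustration):

```python
import random

subjects = ["the moon", "glass", "coffee"]
predicates = ["a type of cheese", "a liquid metal", "sentient"]
reasons = ["of quantum flux", "the tides reverse it", "the manual says so"]

def fake_fact(rng):
    """Fill the XX-is-YY-because-ZZ template from the word lists."""
    return (f"{rng.choice(subjects)} is {rng.choice(predicates)} "
            f"because {rng.choice(reasons)}")

rng = random.Random(42)  # seeded so the output is reproducible
line = fake_fact(rng)
```

Scaling the word lists up is trivial; making the output pass the in-distribution filters discussed earlier in the thread is the hard part.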
You won’t poison the data if the bot is on there just doing the same things as the redditors.
OpenAI team after including the data: why is the model suddenly even more horny, abusive, and discriminatory?
we need a bot that deletes comments and replaces them with some faulty grammar yoda-speak.
“So much for your fucking canoe!”
after they announced it would’ve been the time to start poisoning the comments. Then it would’ve been completely justified and moral.
Honestly, keep up the good fight. Start poisoning all open sources being scraped by any type of AI.
And I use the term “ai” very, very loosely. Because what’s called ai now isn’t real ai. It’s just an automated data collection tool.
It doesn’t create anything, it plagiarizes real artists.
exactly, ”ai” right now is just a computer parrot. why settle for blurry generic versions of the art that it is digesting and shitting back out?
Nailed it. The whole essence of AI is that it can make images with a variety of colors and styles, but it’s not creative or artistic by definition. At the end of the day, it’s just a bunch of numbers and equations being translated into pixels on a screen.
(This comment pasted from NovelAI with this prompt:
Please write a reply to this internet comment: exactly, ”ai” right now is just a computer parrot. why settle for blurry generic versions of the art that it is digesting and shitting back out?)
That is not the “whole essence” of it all… You are summarizing the whole piece of tech off a single use-case (image generation).
AI is MUCH more than just a picture generator. As a software engineer, I use AI for things like debugging or quickly automating tasks.
Considering the tendency of all LLMs to confabulate, that might not be a very smart choice.