
/r/Technology

10871 points
A lawsuit claims Google has been 'secretly stealing everything ever created and shared on the internet by hundreds of millions of Americans' to train its AI (businessinsider.com)
submitted 1d ago by Lakerlion
zeptillian 655 points 1d ago
Just Google? When did all the other AI companies stop doing it?
CAttack787 81 points 1d ago
They're probably trying to get a settlement and Google is one of the richest companies.
_kehd 13 points 1d ago
It would also set a stronger precedent for lawsuits against other companies doing the same if the suit somehow doesn't fall Google's way
(additional comments not archived)
HellBlazer_NQ 42 points 1d ago
Also, just Americans..?
Lolmemsa 15 points 1d ago
It’s a lawsuit filed in America, American courts don’t care about what’s happening in other countries
andrea_ci 16 points 1d ago
r/USdefaultism
officer897177 6 points 22h ago
Never occurred to me it was a secret, just kind of assumed they were doing it.
(additional comments not archived)
(additional comments not archived)
wind_dude 2507 points 1d ago
>For years, Google harvested this data in secret, without notice or consent from anyone.

Does whoever wrote that realise that Google's core product is a search engine? And how search engines work? It was never a secret.

>This includes data taken from subscription-based websites and from websites known for pirated collections of books and creative works, the lawsuit alleges.

Yea, that's how a search index works: it indexes everything. That has been the goal from day 1 at Google. Subscription services purposely let Google and Bing through their paywalls to get indexed.
hamlet9000 516 points 1d ago
Yeah. People really don't grok that these "you can't harvest and analyze publicly accessible data on the internet" lawsuits can only end in one of two ways:

1. The LLM-creators win.

2. Search engines go bye-bye and the internet implodes.
wind_dude 38 points 1d ago
I think we could be in for a big change in how the internet operates if general LLMs take over as the means of getting information. And honestly it’s probably not good due to how Reddit/twitter/substack/etc are reacting. If everything ends up being in walled gardens, it hurts access to information, and benefits those that will pay, enhances the knowledge divide, plus walled gardens and silos create more opportunities for manipulation and censorship, etc etc.

But I think we’ll sort it out, there will be growing pains, but search has arguably been getting worse. And the internet has been becoming more and more walled gardens since Facebook.
scodagama1 34 points 1d ago
The only thing that's ridiculous to me is platforms like Reddit, Twitter and Substack, i.e. platforms that share user-generated content

Like guys, I know _technically_ you own that data because you have lawyers who drafted proper contracts and these users gave it to you for free, voluntarily

But well, don’t I smell a bit of hypocrisy here? If anything the money should go to the creators (people who post) not aggregators (platforms who publish posts next to advertisements). I agree wholeheartedly. But well, spidermen pointing at each other meme here.

It’s a completely different story with news outlets who share their own content, content they created or funded. But these guys already have a tool at their disposal - paywalls. I think it should indeed be illegal to bypass a paywall in breach of a subscription contract, but well, it already is. I just don’t see where the problem is
IndyDude11 12 points 1d ago
Why do you think you get to use these sites for free? You're the product.
(additional comments not archived)
be_dead_soon_please 39 points 1d ago
Alright, fuck it. Four responses that seem to know exactly what you are saying, and yeah I have a rudimentary grasp of context clues, but fuck it.

"Grok"?
metagameface 49 points 1d ago
Understand deeply and intuitively, pretty much. Comes from the novel Stranger in a Strange Land.

https://en.wikipedia.org/wiki/Grok
be_dead_soon_please 13 points 1d ago
Word. Thanks. I kind of like it.
(additional comments not archived)
nullSquid5 151 points 1d ago
Right? It’s not so different from me training myself on textbooks and the knowledge of others and then making a salary off of it.
Vinterslag 77 points 1d ago
wait til colleges start charging you a license subscription like they are Adobe Suite.

Someone hit up Charlie Brooker he could use that.
nayanshah 39 points 1d ago
Student loans are kind of like a subscription.
Fake_William_Shatner 5 points 1d ago
Oh, you mean paying off college loans for 50 years isn’t like a perpetual license fee?
(additional comments not archived)
Thread_water 28 points 1d ago
True but the differences that are there do matter.

One major reason for copyright is to encourage people to work on, and publish, good content. People could read/watch/consume this content if they paid for it or abided by the license, and yes, that meant they would somewhat regurgitate it in their own work, and thus in a way possibly "profit" off someone else's work. But in many cases it was still worth it to publish, as there are known limits on what one person can do with your content.

The questions we should be asking about LLMs are whether their limits are sufficient to continue to make it worth anyone's time to publish, and also how possible it is, and what it would entail, to stop LLMs learning from copyrighted works without proper compensation (which, depending on the answer to the first question, would likely be a lot more than a single license for a single person).

The last thing is that the answers to these questions could change rapidly as the tech changes, and rapid change does not mix well with laws; it's quite messy and could mean a huge shift in copyright law. I really think the lawsuits ongoing right now will have huge ramifications going into the future, and I really don't think it's as simple as "well, it's not so different than a person learning from the content and using that to profit", but we'll see.
Crypt0Nihilist 16 points 1d ago
I'm more of the opinion that if an action is legitimate for a person then it is legitimate for a person with a tool to do it.

It's important to separate training from generation. For training, the parsing of online material is the same as search engine web-crawlers. If it's ok for them to copy and process information for indexing (or indeed for us to do so when browsing since our web browsers cache copyrighted content) then it's ok for someone compiling a training dataset, in the sense that it is the same action. We need to remember that copyright exists to allow the owner to profit from their work. A work on public display is no less valuable after a person or a machine has learned from it, therefore I'd argue that it falls outside the bounds of copyright.

As for the generation stage of the process, the person who publishes the output of the AI will be as liable as if they wrote it themselves with the assistance of any other tool, or none.

People published before copyright existed and there are many reasons for people to publish which are not related to the profit motive, so people aren't going to simply stop. Also consider that people generally value an author's means of expressing themselves and want to read that, not an AI's rehashing of the underlying ideas, so the AI isn't going to compete. Authors ought to be a lot more worried about the internet and Google Search than AI because that will return a link to their actual work, whereas AI will create something new which is merely influenced by their work to a small degree and that's in line with what authors tend to want, to influence future discourse.
(additional comments not archived)
zsxking 47 points 1d ago
"Learning is now illegal" 😂
gotoline1 12 points 1d ago
Hey...don't put that idea out there.
Jani3D 8 points 1d ago
*Texas perks up ears*
Psychological-Sale64 2 points 1d ago
Might be dangerous for big people when it learns everything and predicts the future based on our values and behavior.
(additional comments not archived)
Puzzled_Vegetable83 10 points 1d ago
An interesting question: assuming you paid for those textbooks, there are (in theory) some restrictions on what you can do with respect to dissemination and fair use. Let's use the example of a textbook that you legally acquired - i.e. you paid for a paper copy.

As a human, you're not allowed to photocopy/scan that textbook and give one to all your friends to learn from. So consider training an ensemble of models in parallel. Each dataloader could load a copy of that material simultaneously. Sure, you can argue that the model is like a version of you learning, but does the ML pipeline provide something like a lock/mutex such that only a single source can read from it at one point in time?

[edit] Are you allowed to include data augmentation if that constitutes modifying/deriving from the work? Suppose you use an auxiliary model to translate the content into a new language and feed that in?

I think this would be an interesting concept and it provably scales (see physical libraries, floating licenses); though also LLM foundation datasets are so large that generally you can only train for a few epochs and it's likely that read collisions on a single sample are low(?) Nevertheless, this is a golden opportunity for publishers to milk Big Tech for exorbitant licensing fees to ingest data for learning. And also there is presumably a market for super high quality data for learning - you could employ technical writers to construct ML-ready factual data or provide exports of ebooks in easily tokenised formats. There is some recent research that suggests that if you train solely on academic texts, the models are still $1.

In general I am of the reductionist opinion that these models are similar to a person reading lots of books. Fundamentally the only difference is the speed at which it can read.
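
A minimal sketch of the lock/mutex idea raised above, assuming a purely hypothetical training setup in which a single licensed copy of a work can be read by only one worker at a time (the class name and worker count are made up for illustration):

```python
import threading

class SingleReaderCopy:
    """Hypothetical wrapper: one licensed copy of a work that only one
    training worker may read at a time, like a single physical library copy."""

    def __init__(self, pages):
        self._pages = pages
        self._lock = threading.Lock()  # the "mutex" from the comment above

    def read(self, worker_id):
        # Block until no other worker holds this copy.
        with self._lock:
            print(f"worker {worker_id} is reading the copy")
            return list(self._pages)

# Several dataloader workers sharing one copy.
book = SingleReaderCopy(pages=["page 1", "page 2"])
workers = [threading.Thread(target=book.read, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```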
theonlybutler 9 points 1d ago
It's insane people have returned to some medieval childish mindset. Keep each other down type thing.
(additional comments not archived)
amroamroamro 6 points 1d ago
schools and universities will start taking a 5% cut from all your future salaries, seeing how they "gave" you your education, it's only fair 🤣
(additional comments not archived)
(additional comments not archived)
Uristqwerty 15 points 1d ago
Another option is that analyzing publicly-visible data to extract metadata or construct a search index is allowed, but using that same model to *generate content* is infringing. Since a search index does not compete with the original works, while generated art *does*, horribly imbalancing supply and demand for entire industries.
(additional comments not archived)
Plus-Command-1997 4 points 1d ago
LLMs will cause the internet to implode anyway. There is a difference between indexing content and providing links to said content, and using that content to generate other content which replaces the original source content.

One is symbiotic in nature and the other is an anti-competitive practice.
(additional comments not archived)
greiton 4 points 1d ago
search engines existed for decades before LLMs. this has not been a long term play from the mid 90s praying that somehow LLMs would be invented and they could start being profitable in 30-50 years.
(additional comments not archived)
brunocborges 11 points 1d ago
There is a third option: a law that forbids creating LLMs from publicly available data. The law might say "only copyrighted data, with author approval, may be used".
rust_at_work 29 points 1d ago
That is the worst option. If there is at least one country where this law is not applied, that country's development would leapfrog other countries'. It could be dangerous for defence and various other reasons.
Trigger1221 5 points 1d ago
Even if all countries in the world adopted a similar law, there would still be people willing to break it.
(additional comments not archived)
Echoeversky 5 points 1d ago
Fourth option: Nuke copyright.
(additional comments not archived)
(additional comments not archived)
Nik_Tesla 49 points 1d ago
Also, "without consent"? Isn't there a whole industry around optimizing how well Google ranks you called SEO? Websites have literally been paying to get Google to index them more than their competitors for the past 30 years.
hinkitybinkity 19 points 1d ago
Also also, how does one "secretly steal" something that has at the same time been "shared on the internet"?
(additional comments not archived)
powercow 24 points 1d ago
and you can block at least google and other good guys by putting the no index tag in your code, as that one republican congressman accidentally discovered but not before claiming google had canceled him.
(additional comments not archived)
jumpup 128 points 1d ago
Though pirated books mean they technically didn't have the rights to those works; stealing from a thief's stolen stuff is not legal, and while the thief is primarily responsible for the theft, keeping illicitly gained goods is still illegal
powercow 151 points 1d ago
Indexing the web, even if it contains copyrighted material, doesn't make you a criminal. That would fall under the same rules that protect Reddit and others.

There is also no proof they used pirated material to train, it's just a guess. Perhaps they did; it would be easier to feed it everything unfiltered instead of trying to find all the little copies of books that have been reposted here and there... there are a fuck ton of them out there. But so far there isn't proof. I'm not sure training on their index would be seen as much different than indexing itself, as far as the law is concerned, but that's for the courts to decide. In both cases a computer is processing the data, it's just done differently for AI.
NuuLeaf 2 points 1d ago
There will definitely be an investigation at some point. May not be illegal now, but internet laws are still in their infancy.
(additional comments not archived)
wind_dude 32 points 1d ago
> stealing from a thief's stolen stuff is not legal

So they aren't stealing, even less so than those who share the content online originally. Traditionally Google was just providing a way to find it, and being able to find it means having to crawl it and index it, and indexing has always involved storing a copy, or at least a partial copy.

So those copies exist, and that's a good thing for search and access to information and knowledge. It even helps companies issue DMCA takedown requests for their copyrighted material.
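
As a rough illustration of what "storing a partial copy" for search looks like, here is a toy inverted index in Python; the page URLs and snippet length are invented for the example:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the pages it appears on, keeping only a short
    snippet (a partial copy) of each page for previewing results."""
    index = defaultdict(set)
    snippets = {}
    for url, text in pages.items():
        snippets[url] = text[:80]              # partial copy, not the full page
        for word in text.lower().split():
            index[word].add(url)
    return index, snippets

index, snippets = build_inverted_index({
    "example.com/a": "Copyright law and fair use on the web",
    "example.com/b": "How web crawlers index the web",
})
print(sorted(index["web"]))        # ['example.com/a', 'example.com/b']
print(snippets["example.com/a"])   # the stored snippet
```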

As it gets into AI models it gets a bit greyer... but at the end of the day there is nothing even remotely close to a resemblance of any original source in a model. If you read a stolen book, you're not breaking the law if you use the information you learned.

And it's debatable if Google used pirated books; they already have Google Books, with 40m+ books already indexed as text. Did OpenAI and Meta, and tons of others? Almost certainly. Is this illegal, it's hard to say... I would say no. Was it necessary to compete with Google? Absolutely. Is it a net benefit for humanity? Yes. For competition and lower barriers to entry, I hope Google wins the lawsuit.
Whatsapokemon 6 points 1d ago
>Is this illegal, it's hard to say...

It's not hard to say; there was the Google Books case (Authors Guild v. Google) back in 2015, which handled the legality of scanning and digitising 100% copyrighted information, then using that as a basis for some algorithm.

In that case, Google scanned and retained the _entire_ copyrighted text of many, many books, and presented direct snippets to users who searched through that material. The court ruled that this was a perfectly acceptable transformative use of the copyrighted content, even in the context of a commercial business using it in a for-profit manner.
wind_dude 2 points 1d ago
Thanks, yes, that is a good point, and it makes sense the precedent would transfer to LLMs
zefy_zef 11 points 1d ago
I hope a result of the lawsuit is that Google won't be able to solely profit from this data and that it needs to be released for use by anyone.
Voidsheep 12 points 1d ago
The data is public to begin with. They are indexing the internet, much like Microsoft, Yahoo, Yandex, Baidu and such.

You can build your own crawler to follow links in the web and grab data, or transform it into a more useful format. Takes a hell of a lot of time, but so would parsing it from the index of any search engine provider. It's really not different from opening websites and making more or less extensive notes by hand.

Google also has private data, like your emails if you use Gmail. They can use it carefully to provide other services like targeted advertising, or potentially train AI models, but they definitely shouldn't publish the underlying data, unless you want your emails to be public.

Search engine providers publishing their data crawled from the internet is also questionable. Do you mean a copy of the internet as a massive dump of their cache? Even that may pose problems, as the authors of that data would rather have you grab it from their website than Google's cache, to get things like advertising revenue.
wind_dude 6 points 1d ago
It would be awesome for innovation and humanity if Google had to open access to its index, but it won't happen. And the electricity to run the servers to handle the index would be massive. The search index is estimated to be 100 PB, and that's just what's available in search. There's no doubt they have multiple copies of each render cached from crawls, every URL they've crawled and found indexed, every XHR request/response for rendering.
(additional comments not archived)
(additional comments not archived)
(additional comments not archived)
Sweaty-Emergency-493 7 points 1d ago
And here people think TikTok is the biggest privacy concern. All web-based subscription/account-related services that exist online are a gold mine, and what follows money, like always? Power and control.
StillBurningInside 29 points 1d ago
No different than reading every book in a public library.

Going to sue me for learning something ?
magkruppe 9 points 1d ago
Well... yes. Many copyrighted materials are OK for educational purposes, but not for commercial use
Difficult_Bit_1339 5 points 1d ago
Copyright only protects a work from being copied. You're allowed to read it regardless of the copyright.
(additional comments not archived)
(additional comments not archived)
Illustrious-Self8648 3 points 1d ago
Google Search is not the core product. What rock have you been under for 20+ years? Google's core is advertising, and to do that they harvest data. They read Gmail. Anyone who says or believes the harvesting was secret is an uninformed loudmouth.
Huwbacca 3 points 1d ago
Doing that for purpose A doesn't then mean it's OK to conduct purpose B...
TacTurtle 13 points 1d ago
They are ignorant enough they probably also think privacy mode prevents ISPs and google from knowing what kind of porn they are into as well.
wind_dude 28 points 1d ago
I mean unless you visit something like $1, your ISP won't know what type of porn you're into; they'll just know you visit $1 7 times a day for an avg of 2-3 mins each time.
die-maus 9 points 1d ago
Great. Now my ISP thinks I love "Busty short Latinas"...

I didn't even get to see any "Busty short Latinas", I'm disappointed, and—quite frankly—offended.
(additional comments not archived)
(additional comments not archived)
taisui 2 points 1d ago
>Yea, that's how a search index works, indexes everything, that has been the goal from day 1 at google.

I see the issue: the instant answers that Google and Bing provide inline in their search engines are getting out of hand, and that is arguably more than just indexing.
CoffeeToCode 2 points 1d ago
> Subscription services purposely let google and bing through paywalls to get indexed.

Wait, how do they do this? Presumably it's not a simple useragent check because then it'd be even easier to access paywalled content for free.
wind_dude 4 points 1d ago
They publish the list of IPs used by the crawler, so sites whitelist those and don't throw up the paywall or things like the "disable your ad blocker" prompt.

https://www.bing.com/toolbox/bingbot.json

and

https://developers.google.com/search/apis/ipranges/googlebot.json
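
For illustration, a rough Python sketch of how a site could use those published ranges to decide whether a visitor really is Googlebot before lifting its paywall. It assumes the linked file keeps its current shape (a "prefixes" list of ipv4Prefix/ipv6Prefix entries); treat it as a sketch, not production code:

```python
import ipaddress
import json
from urllib.request import urlopen

GOOGLEBOT_RANGES = "https://developers.google.com/search/apis/ipranges/googlebot.json"

def is_googlebot_ip(client_ip: str) -> bool:
    """True if client_ip falls inside one of Google's published crawler ranges."""
    with urlopen(GOOGLEBOT_RANGES) as resp:
        prefixes = json.load(resp)["prefixes"]   # assumed file layout
    ip = ipaddress.ip_address(client_ip)
    for entry in prefixes:
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if ip in ipaddress.ip_network(cidr):
            return True
    return False

# A site might skip its paywall only for verified crawler addresses:
print(is_googlebot_ip("66.249.66.1"))   # a typical Googlebot address
print(is_googlebot_ip("203.0.113.5"))   # an ordinary visitor: False
```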
(additional comments not archived)
redial2 2 points 1d ago
Clearly never read the MapReduce white paper
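
For anyone who hasn't read it, the canonical word-count example from the MapReduce paper boils down to something like this toy single-process sketch (the real system distributes the map and reduce phases across many machines):

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) pairs, as in the paper's word-count example.
    return [(word, 1) for word in doc.lower().split()]

def reduce_phase(pairs):
    # Sum the counts emitted for each word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["the web is large", "indexing the web"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'the': 2, 'web': 2, 'is': 1, 'large': 1, 'indexing': 1}
```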
(additional comments not archived)
sgjo1 258 points 1d ago
I mean OpenAI did the same
sherbang 110 points 1d ago
When a human reads many different sources to learn so they can create their own content based on what they learn it's celebrated. When a computer does it it's stealing?
ughEverythingTaken 5 points 1d ago
It's not so much the reading it that's the problem, it's the dissemination of that information from there that is. If a person shares information without appropriately crediting the source, yes....it's "stealing". There are legal consequences to doing so.

I definitely foresee a massive legal fight about this in the not-too-distant future. A developer at a large company is going to use an AI-assisted tool to help them do their job better. In order for this AI tool to work, context has to be provided, which in this case is the intellectual property of the software you're working on. Now, if you think these AI tools aren't using that context you've provided to further train things, I have some oceanfront property in Nebraska I'd love to sell you. It is, in fact, already worse than that. GitHub is being sued by several companies for scanning private, non-open-sourced projects without consent. As everyone is jumping on the AI bandwagon and adding ChatGPT features to their own applications and passing along possibly sensitive/proprietary information without updating their terms of use, who is on the hook when, not if, that information ends up in someone else's hands?

There are so many legal minefields around AI that were not dealt with before the genie got out of the bottle. Now we're stuck with a massive mess.

And before you ask... yes, I do understand how AI works. I have a master's in machine learning. And with that background, I can tell you that I'm fighting tooth and nail not to add AI to the project I'm currently working on, because I like my job and I don't want them to get sued into bankruptcy or get in trouble with the EU for violating GDPR regulations.
sarge21 2 points 17h ago
>If a person shares information without appropriately crediting the source, yes....it's "stealing".

You shared information and credited zero sources. Did you break the law or did you come up with everything yourself?
ughEverythingTaken 2 points 15h ago
Here's a reference for a GitHub lawsuit https://www.theregister.com/2023/05/12/github_microsoft_openai_copilot/
As for the stealing things statement, most of my comment centered around software development and I was mainly referring to licensing that requires attribution to use. This is standard practice in the software development world. But if you would like a reference you can check out https://opensource.org/license/attribution-php/

As for the rest... personal experience. I have actually called a ChatGPT service and you do, in fact, need to pass it context. As mentioned, I received a master's in machine learning. Part of the requirements was actually writing an ML system (in my case doing multilingual information retrieval). Given those bona fides, I think it is acceptable to not provide a source for the statements about how AI works.

I know you are just trying to be cute and argumentative, but there are actual legal issues at play here and many people seem to want to just brush them under the rug.

What do you think is going to happen when some Google or Microsoft intellectual property magically appears in a competitors system? Who is going to liable when an AI response is taken as fact and gets someone hurt?

Don't get me wrong, I think there is massive potential behind the technology, but there isn't enough legal framework in place to risk your business on.
eeyore134 61 points 1d ago
For this argument to work the angry people need to understand how AI works, but they are being told to be angry by other people who report on AI and legislate on AI who don't know how AI works. It's just another convenient foil to get people upset and on your side about something. How long before we get Patriot Act Part 3 out of this? I say Part 3 because they're already using "Oh no, China TikTok!" to try to get Part 2 rolling.
sherbang 15 points 1d ago
I agree that people not understanding AI is part of the problem (or often not even trying to understand, but just parroting others' outrage). The other part is that people overestimate their own creativity.

How much of what people output is really original thought? Very little indeed. Our every thought is based on the accumulated experiences we've gathered from everyone around us. Language, culture, art, history, it gets all mashed up in our brains and mostly regurgitated.

Original thinking is the exception (not the norm), most of the time we're selecting our favorite insights that we've learned from others and reusing them. We take bits from here and bits from there and put them together in our own words, but even "our own words" are not our own, they're the learned sentence structure and phrases we've obtained by copying others.
(additional comments not archived)
(additional comments not archived)
cinemachick 22 points 1d ago
We assume the human paid for the books, or paid into taxes to keep the library/Internet maintained. AIs are skimming content for free, that's not fair to everyone who paid to make and house the content
(additional comments not archived)
admins_are_useless 11 points 1d ago
A computer is a tool, it cannot contextually transform as humans do (yet).

That's like saying the brush of a forger is the thing doing the forgery.
sherbang 16 points 1d ago
But isn't that exactly what these systems are doing? Accumulating as much input as possible and then recombining that input based on the context given in the prompts?

The human brain is doing the same thing. An artist's work builds on the work of all the other artists that they've learned from. They're copying some techniques from here, and some from there, and some from nature, mixing in some randomness (genetic differences and random misfirings of neurons), and outputting them into a new (hopefully) unique combination.

We don't judge artists by how they've collected their source material, we judge based on their output. Some artists' work we deem too derivative; other pieces we call inspired.

Usually forgery is a word we reserve for art that is by one artist while we tell people it's by another.
(additional comments not archived)
(additional comments not archived)
(additional comments not archived)
(additional comments not archived)
dwittherford69 337 points 1d ago
How is this on the fucking news? This is literally the job of the product that the company advertises.
omniuni 125 points 1d ago
It apparently didn't occur to someone that a search engine needs to actually know about the content it's searching for.
WTFwhatthehell 46 points 1d ago
Years ago there would be stories about how much of the net Google caches to allow search.

A single entry in a robots.txt file and Google would ignore your site... but organisations want the traffic from a search engine.
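
For illustration, Python's standard library can check that robots.txt rule the same way a polite crawler would; example.com is a placeholder domain:

```python
from urllib.robotparser import RobotFileParser

# Ask whether Googlebot may fetch a page, per the site's robots.txt.
# A robots.txt containing "User-agent: Googlebot" / "Disallow: /" opts the
# whole site out of Google's crawler.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()
print(robots.can_fetch("Googlebot", "https://example.com/private/page.html"))
```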
m3thodm4n021 21 points 1d ago
Because business insider is click bait garbage. Wish it were banned site wide.
(additional comments not archived)
peepeedog 14 points 1d ago
People are dumb, and hate big tech. Websites can opt out of being indexed. Always have been able to.
cyan2k 3 points 1d ago
Google should just go the road of malicious compliance, and remove people that sue them because of data scraping completely from their search results.
rateb_ 3 points 1d ago
People with no clue what technology is, posting in r/technology
(additional comments not archived)
(additional comments not archived)
WhatUp007 856 points 1d ago
If you don't pay for a service, you're the product. People should remember this.
pablo_pick_ass_ohhh 238 points 1d ago
All that paid content online? Yeah... that's still being scraped by Google.

So if you pay or don't pay, it doesn't matter.
Krakenspoop 52 points 1d ago
It's reading Playboy... for the articles.
(additional comments not archived)
VyvanseForBreakfast 15 points 1d ago
If it's openly accessible, it's not paid.
BoxOfDemons 8 points 1d ago
Google scrapes stuff that's behind a paywall. That's what they are referring to. Web companies give Google a free pass so they can still be indexed, even though the content is paywalled.
Bhraal 7 points 1d ago
If that is the case then it is openly accessible to Google, and the other companies are letting it happen. They want the benefits of being indexed but not the newly available downsides.

This is the unfortunate endgame of any business model that is based on a private company supplying "free" service in perpetuity. Either it goes away or the net value extraction is eventually turned around.
ShiraCheshire 2 points 1d ago
This. If a law was passed tomorrow that mandated every website move to paid subscription for access, you'd be paying for every website *and* there would be just as many ads.
(additional comments not archived)
cartsucks 15 points 1d ago
Even when you pay for a product you are often times still the product. The line has been so horribly blurred over the years.
cyanydeez 31 points 1d ago
so the service in this instance is the _internet_
(additional comments not archived)
silly_jokes 19 points 1d ago
Your data is still sold if you are paying
(additional comments not archived)
david76 18 points 1d ago
Quid pro quo. People were free to put a robots file on their site.
aebeeceebeedeebee 13 points 1d ago
Meta robots "noindex" "nofollow" tags are part of basic HTML.

Good luck to the plaintiff(s)!
formerfatboys 7 points 1d ago
If you're running a business this is a terrible thing to pretend makes sense.

Especially if your business relies on selling your users to advertisers.

You really have two customer bases to serve. Your advertisers don't care ***at all*** about your service or platform. They care about eyeballs.

Which means, you better please the hell out of those eyeballs or they'll eventually go elsewhere.

Reddit, Twitter, Facebook etc are all wrestling with that at the moment because they pretended this was a real thing.
(additional comments not archived)
rdmusic16 3 points 1d ago
Even companies should remember this. Google was (is) doing it at a mass scale.
tkingsbu 3 points 1d ago
I tell this to my kids all the time. It’s important to remember this.
(additional comments not archived)
Xalbana 4 points 1d ago
I really need to switch to Protonmail.
AcorneliusMaximus 4 points 1d ago
How do I pay for better internet search and access?
WhatUp007 6 points 1d ago
Well, that business model doesn't exist, because of advertising business models and data collection/brokerage. But you can have private, secure email by paying for it, along with cloud storage. Those typically (check the ToS on data privacy) only collect the data needed for functional purposes.
social_tech_10 5 points 1d ago
If by "better" you mean a search engine that doesn't build a profile of you based on all of your web searches (and browser activity) then I think you may want to take a look at "DuckDuckGo" as an alternative to "Google", and ditch the Chrome in favor of the DuckDuckGo Browser, or use Firefox with "uBlock Origin" and a change the "Privacy and Security" settings to block trackers and to erase all of your cookies (if not white-listed) when you exit the browser.
(additional comments not archived)
godhandkiller 2 points 1d ago
There is a really good song from the band Incendiary called "The Product is You" and the chorus says "If you don't know what the product is, the product is you"

Weird people still surprised by this
MatsugaeSea 7 points 1d ago
Is this a problem? Seems like a great deal, personally.
(additional comments not archived)
cadublin 301 points 1d ago
Why do you think they give you 15GB for your Google Drive, Photos, and Gmail? Not to mention your search history and Map Timeline? At this point Google knows me better than my family, and that's fine with me as long as they don't steal my money.
Pherllerp 65 points 1d ago
I just wish they would use that data for ME instead of selling it to assholes who sell stuff I don’t want.
UmutIsRemix 121 points 1d ago
That's exactly what Google is doing? You give them data and they improve their services for you. Maps and Google assistant are just examples of useful services, for free. Not even including Google's search engine.

Sure they are prolly assholes and want money but at the very least we get the best online inventions for everyday use cases.
OriginalCompetitive 90 points 1d ago
Sure, but other than free email, free maps, free satellite imagery, free photo storage … what has Google ever done for us?

Edit: Forgot free internet search and free AI tool.
aebeeceebeedeebee 25 points 1d ago
Google has moved literally millions of regular people into higher tax brackets with such favorites as SEO, Google Ads and YouTube creator payouts, to name a few.
CjScholeswrites 12 points 1d ago
Roads?
chandlar 2 points 1d ago
Surely this is sarcasm
(additional comments not archived)
BookooBreadCo 2 points 1d ago
I wish they'd build that data into their AI so it could give you more personal recommendations.
UmutIsRemix 3 points 1d ago
That's what I'm hoping they do for their pixel phones. My phone sometimes tells me to leave early for a meeting because I wouldn't catch the train fast enough. Impressive shit ngl
(additional comments not archived)
TheWhyOfFry 2 points 1d ago
All it means is that the asshole selling stuff you don’t want is willing to pay more for ads than someone who sells stuff you kinda want.
(additional comments not archived)
NoSympathy9787 46 points 1d ago
Pretty sure they're not using your private or encrypted data like emails and google drive to train their AI lmao.
cadublin 32 points 1d ago
Not going to argue the technical details on this, since most of us won't know for sure anyway, but I bet Google looks into Gmail data (and/or its metadata) to a certain extent. This is an article from 2017 that said $1, which implies they did so openly up to then, and God knows what they are doing now. I'm sure they could hide behind technicalities and blur the definitions of what is or isn't considered scanning users' data.
(additional comments not archived)
CocodaMonkey 2 points 1d ago
Of course they are. They have access to read all of your emails and they outright say they do it. It's been something they've said they do since they first launched gmail.
(additional comments not archived)
One-Statistician4885 3 points 1d ago
Next you're going to tell us they know our incognito history
(additional comments not archived)
(additional comments not archived)
HombreMan24 45 points 1d ago
Is it stealing if it is all publicly available on the internet? Or did Google also hack into paywalls and steal that content too?
elmonstro12345 30 points 1d ago
That's the best part about this, if people actually did not want Google to look at their data on their websites all it takes is a simple entry in robots.txt and Google will politely ignore you.

But no, they want to have it both ways - they want that sweet sweet search engine traffic but don't want Google to, you know, index their site somehow?

That is just not how it works.
Bakoro 13 points 1d ago
Insert "No take, only throw" dog comic.
(additional comments not archived)
spinur1848 54 points 1d ago
Ok, so yeah, Google absolutely did help themselves, but it wasn't exactly a secret. They did build a lot of the infrastructure that facilitates internet search and they were pretty clear that they were monetizing the data to do it.

But I think it's pretty dangerous to expect a strict copyright interpretation on this and other AI stuff, especially the stuff that was already in the public domain or already commercially licensed. Arguably, this is an extension of search, which Google has always been clear is their business.

When you expect individual data owners to be able to individually remove their data from a curated dataset, especially after a model has been trained and deployed on that data, you're seriously going to damage the stability and reproducibility of the model.

We probably need a new way to value and compensate data owners and people more generally when their data or likeness is used for these kinds of things. We also need some mechanism to protect them from unintended harms, such as deep fakes that can damage reputations.
(additional comments not archived)
Bob_Sconce 46 points 1d ago
Replace "stealing" with "using" and you have a less inflammatory, but more accurate headline
ApatheticWithoutTheA 18 points 1d ago
No bro they’re stealing it. You see… they’re taking it and not giving it back. They have stolen the internet. All of it.
Anirbanbiswas43 8 points 1d ago
Yeah I can't access anything. Everything is gone.
(additional comments not archived)
(additional comments not archived)
redditrasberry 14 points 1d ago
news flash - if you put words on the internet people may read it. I know, it seems crazy, but I heard it can happen.
Difficult_Bit_1339 4 points 1d ago
I'm stealing this comment right now
(additional comments not archived)
fellipec 30 points 1d ago
Secretly?

They are super open about it!
Douchieus 5 points 23h ago
We've been training AI for years with image captchas lol.
spisHjerner 11 points 1d ago
Okay, but is it really a secret when we've relied on Google to index and rank the contents of the internet since Google was released to the public? Of course they would be indexing and storing a version of everything on the internet. They have been the most powerful SEO choice since inception. This is how Google Ads was so awesome. Until it wasn't anymore, because all users were shown was ads, with no content discovery.

It's as if they knew the day would come where they would strategically choose to break the internet, because they assume they've paid for it. Never mind all that user data Google gathered and sold for tremendous profit.
MajorAcer 4 points 1d ago
Crazy how much proprietary information these AI companies are getting too. I work on a lot of corporate award submissions, and the number of people who will freely input their company's revenue, headcount, budget, and all kinds of other sensitive info into ChatGPT is staggering.
aManPerson 2 points 1d ago
i think the example they used in the last season of westworld showed it best. they said something like.

"by the time they came up with really extensive privacy laws about people's data, they had already shared enough of it online years ago. we already had enough data about everyone that we could model everyone and everything. it didn't matter anymore".

we might eventually get data privacy right online, but i'm sure the AI stuff won't care because they'll have big enough datasets already.
conquer69 4 points 1d ago
> stealing

Here we go again. Using or doing something without permission isn't stealing.
Blackfire01001 40 points 1d ago
Stealing implies ownership, and I can guarantee you Google owns nothing. Anything you put on the internet can be ripped off and used by other people. Also, training an AI model is the same as an artist looking at works of art and learning how to draw. It's not stealing ideas, it's interpretation by an algorithm.
meara 18 points 1d ago
This is really the crux of it. AIs are doing the same thing creative people have done for eons — looking at lots of other art, creating new derivative works and sometimes stumbling onto a distinctive new style.

The problem is their speed.

In the past, the creator of a new style could enjoy a period of notoriety and exclusivity before it was widely imitated. AIs have eliminated that. The moment you post a creative work on the internet, it can be copied and used to create millions of derivative works by dinnertime.

We are going to need a new way to reward the creators who feed new ideas to our AI overlords. :)
(additional comments not archived)
Opposite_Computer_25 3 points 1d ago
Thou shalt not make a machine in the likeness of a human mind.

The spice must flow.
MathCrank 3 points 1d ago
That’s a lot of boob searches and nba scores
ImUrFrand 3 points 1d ago
What really needs to be said at this point in time is Google is no longer a Search Engine,
its a fucking AD Server for SEO cock bags.
PurpEL 3 points 1d ago
Wow, so surprising. Next we're going to learn those at-home assistants are always listening, just like your phone. Very shocking stuff that no one saw coming
WengFu 3 points 1d ago
Collecting and monetizing user data is their business model.
trailer8k 3 points 1d ago
And Google complains about DMCA and copyright every day
Shratath 3 points 1d ago
I doubt they are stealing just from Americans, but from the whole damn world
jacobvso 3 points 1d ago
So... they only stole stuff created by Americans? .... why??
Obi-Drun-Kenobi 3 points 1d ago
The ruling class owns us. George Carlin said it best.
TendieTrades 3 points 1d ago
If it’s free you’re the product.
PandaCheese2016 3 points 23h ago
Let’s have a class action lawsuit where the class is pretty much all internet users. Lawyers will get like a hundred mil in fees and award while class members each get a Google Play gift card for $5.
OnyxsUncle 3 points 21h ago
I thought everyone knew that…except the assholes in congress…who are supposed to protect us…but actually fuck us over by selling themselves to corporations
haapuchi 3 points 20h ago
Google actually has a well-documented way to make its crawlers skip your pages. This is like complaining that I left my trash can out and the waste removal company emptied it.
AlFender74 8 points 1d ago
Do only Americans use the internet? Wonder what the rest of the world is up to? outside having fun?
(additional comments not archived)
Accomplished_Ad_8814 5 points 1d ago
Why is everyone on reddit set on blindly defending these AI developments? Like, yeah, it "just learns", "like people", but the side effect that by doing it it completely displaces the original authors is not to be overlooked. What's the incentive of publishing anything if essentially nobody except AIs and perhaps a tiny circle of personal fans will consume it? Nobody will click on your ads, or follow you, or even know you, if they can get highly personalized solutions to their specific problems (with some anonymized fragments of your contributions). And this means that people will stop publishing. So until AGI can create genuinely new content and replace us altogether, it seems that AI will just be regurgitating its inferences into oblivion. Is that something to look forward to or what am I missing?
philote_ 5 points 1d ago
Thank you for being one of the few sane voices here.
StillCraft8105 2 points 1d ago
lol people fall all over themselves to crow about how Google has OFC expropriated all of our public spaces and experiences for the highest bidder and to benefit its own shareholders

truly we have lost the information battle with these goliath vampire squid as this is analogous to drilling for oil in public national parks

free data extraction for profit and influencing public opinion ALL IN ONE SHOT

reparations!!
reparations!!
reparations!!

how does 5 trillion sound over five years, to be redistributed to every man, woman and child in the US?
(additional comments not archived)
anotherpredditor 10 points 1d ago
File this in the no shit category. We joked about this back when Gmail was invite only.
Terrible_Yak_4890 10 points 1d ago
How can you steal something that is openly shared?
aquarain 14 points 1d ago
This already went to the Supreme Court for Indexing the Internet and Google won.
flyfreeflylow 6 points 1d ago
Only Americans? If they're going to do that, it should go beyond one country to get a more rounded mix of views.
GonWithTheNen 2 points 1d ago
Unlike countries with GDPR, America's privacy laws regarding companies' data collection are abysmal. It's a much safer bet to use Americans' data, unfortunately.
andrea_ci 2 points 1d ago
unfortunately they used all sh*t from EU countries too, they even state it in the TOS
(additional comments not archived)
Bellegante 5 points 1d ago
Secretly???
RainRainThrowaway777 7 points 1d ago
If you're only stealing the thoughts of Americans, your AI is going to be stupid as fuck. Big design flaw Google.
Signageactives 8 points 1d ago
***Secretly*** stealing?

I thought everyone already knew?
NoSympathy9787 8 points 1d ago
Scraping is perfectly legal so every AI company out there is doing exactly this. Google is probably just better at it.
(additional comments not archived)
illestrated16 2 points 1d ago
Google out here googling stuff and then memorizing the results...does this work as an alternative title?
theonlybutler 2 points 1d ago
People have lost track of what the purpose of copyright is. The ultimate purpose of copyright is the creation and spread of knowledge. In order to do this, copyright must strike a fair balance between protecting creative works and allowing the public to use them.
Copyright for copyright's sake is not an end in itself.

Also, on a more technical note, AI is not breaching copyright anyway, despite what many people think on the surface. It's not storing a database of the content it reads.
lood9phee2Ri 2 points 1d ago
copyright monopoly is just plain wrong, shrug.
https://www.youtube.com/watch?v=IeTybKL1pM4
psychicsailboat 2 points 1d ago
I don’t see an issue with this at all.
You can’t take what I freely posted?
What?
sweetlemon69 2 points 1d ago
You mean like how ALL AI companies are doing today? And why it's forcing public APIs to start charging for large-volume access?

Also, it's anything public I'm assuming, nothing private.
Kassdhal88 2 points 1d ago
Secretly? It’s literally their business model.
whitefoot 2 points 1d ago
We need a system like Google Ads but for training AIs.

Basically, content owners should get paid for having their data used in training. So you sign up with a middleman company that gives you a bit of code to add to your website header. Every time an AI scans your site for training data, you get paid by the middleman. That middleman takes payment from the AI companies that want access to its list of sites to train on.
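
A toy sketch of what the site-side counter in that scheme might look like; the crawler names, payout rate, and middleman are all hypothetical:

```python
from collections import Counter

# Example AI-crawler user-agent tokens; a real middleman would maintain this list.
AI_CRAWLERS = {"GPTBot", "CCBot", "Google-Extended"}
PAYOUT_PER_HIT = 0.001  # dollars per crawl, an arbitrary illustrative rate

hits = Counter()

def record_request(site_id: str, user_agent: str) -> None:
    """Count crawls by known AI training bots so the site owner can be paid."""
    if any(bot.lower() in user_agent.lower() for bot in AI_CRAWLERS):
        hits[site_id] += 1

record_request("myblog.example", "GPTBot/1.0 (+https://openai.com/gptbot)")
record_request("myblog.example", "Mozilla/5.0 (ordinary visitor)")
print(hits["myblog.example"] * PAYOUT_PER_HIT)  # amount owed so far
```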
Neverpostagainyoufa 2 points 1d ago
Jesus Christ whoever wrote this article is a fucking moron.
RedRapunzal 2 points 1d ago
Ever read the terms for Google Docs...
(additional comments not archived)
SpicyCrabSoup 2 points 1d ago
“For years, Webster’s dictionary has stolen every word ever invented in the English language…”
FuckAllMods69420 2 points 1d ago
So has my brain. I look at things that are copyrighted and over time adopt those ideas to be my own and change my thinking. It’s not any different and those who are fighting AI are going to lose in the end.
scabbymonkey 2 points 1d ago
Wait till they train it on the game Plague. That will be the final nail in the coffin for us.
LivingEnd44 2 points 1d ago
It's only a problem if they are going to resell it for profit.

I can take a picture of your house from the street. I don't need your permission, because it is publicly viewable from the street. How is this different? If you publish something outside of a password-lock, it is effectively public. Most people understand this intuitively. Which is why they post public pictures of their dog but not of their driver's license or SSN.
mtnviewcansurvive 2 points 1d ago
the internet business model: use the information from users for profits. they do the work you get the money. sounds like pimping.
ZalmoxisRemembers 2 points 1d ago
And yet their search engine sucks more and more.
iqisoverrated 2 points 1d ago
They are as much 'stealing' this as me looking into a shop window is stealing the shop's merchandise.
FarleyFinster 2 points 1d ago
If Google doesn't even bother to show up in court and gets fined a *b*illion-with-a-*B* each and every day until they stop, they'll go broke… never.
TPproject123 2 points 1d ago
Next they’re gonna tell me water is wet.
Find_another_whey 2 points 1d ago
Secretly?
dromedarian 2 points 1d ago
This is the least shocking thing I've read all year.
processwater 2 points 1d ago
Looking at and stealing are two different things.

If your Ferrari is parked on the street, and I spend a couple hours looking at it, you can't sue me for stealing your Ferrari.
(additional comments not archived)
classiclantern 2 points 23h ago
I would like to take this opportunity to add the DUH to this thread so that when Google scrapes the comments, the word DUH becomes part of the AI model. I would also like to attach the words SWELL and SPIFFY to this thread.
dogtagnoname 2 points 23h ago
Google will overtake Facebook just you watch
DerfnamZtarg 2 points 22h ago
Stealing is an interesting term when applied to information, especially when you are an information provider and your business is to make it readily available to others. The other tricky part is that when I think of "stealing", it is about taking something that is exclusively yours and removing it from your possession. That is not what is happening here, by a long stretch. I am perfectly free to download the same information from GOOG and build my own ChatGPT, as many others have done. Few have the budget to pay for the servers' electric bill. I think the real objection is that MSFT, GOOG and a slew of others (FB, AMZN etc) are working to monetize these models. Oddly, I see no objection to financial teams using AI to build trading models, nor to the military training AI to fire guns.
DerfnamZtarg 2 points 22h ago
Whoever wrote this article does not seem to understand much about what a Search Engine does. Had they spent their time discussing the need for guardrails and use case limits on the application and use of these models they would make a better case.
saiyaniam 2 points 20h ago
Good, take it all. Make a really decent AI, we NEED it. We're fucked, we NEED help.
elister 2 points 18h ago
Wasn't this part of the plot for Ex Machina? Billionaire CEO lives out in the middle of nowhere, pays all these phone and internet companies to tap into their user lines, so it could feed their AI and create a sentient android?
MarsCitizen2 2 points 17h ago
….Secretly?? 🤣🤣🤣🤣🤣
(additional comments not archived)