Hacker News
6 hours ago by satvikpendem

This is all public data. People should not put personal data on public image hosts and sites like LinkedIn if they don't want it scraped. There is nothing private about the internet, and I wish people understood that.

3 hours ago by pera

> This is all public data

It's important to know that generally this distinction is not relevant when it comes to data subject rights like GDPR's right to erasure: If your company is processing any kind of personal data, including publicly available data, it must comply with data protection regulations.

2 hours ago by booder1

The law has in no way been able to keep up with AI. Just look at copyright. Internet data is public, and the government is incapable of changing this.

an hour ago by Anonbrit

A hidden camera can make your bedroom public. Don't do it if you don't want it to be on pay-per-view?

31 minutes ago by satvikpendem

That is indeed what Justin.tv did, to much success. But that was because Justin had consented to doing so; likewise, anyone who posts something online has consented to it being seen by anyone.

5 hours ago by malfist

What's important is that we blame the victims instead of the corporations that are abusing people's trust. The victims should have known better than to trust corporations

5 hours ago by nerdjon

Right, both things can be wrong here.

We need to better educate people on the risks of posting private information online.

But that does not absolve these corporations of criticism of how they are handling data and "protecting" people's privacy.

Especially not when those companies are using dark patterns to convince people to share more and more information with them.

5 hours ago by thinkingtoilet

If this was 2010 I would agree. This is the world we live in. If you post a picture of yourself on a lamp post on a street in a busy city, you can't be surprised if someone takes it. It's the same on the internet and everyone knows it by now.

4 hours ago by Workaccount2

I have negative sympathy for people who still aren't aware that if they aren't paying for something, they are the something to be sold. This has been the case for almost 30 years now with the majority of services on the internet, including this very website right here.

4 hours ago by gishglish

Tbh, even if they are paying for it, they’re probably still the product. Unless maybe they’re an enterprise customer who can afford magnitudes more to obtain relative privacy.

5 hours ago by blitzar

> blame the victims

If you post something publicly, you can't complain that it is public.

5 hours ago by lewhoo

But I can complain about what happens to said something. If my blog photo becomes deepfake porn, am I allowed to complain or not? What we have (with AI) is an entirely novel situation, worth at least a serious discussion.

5 hours ago by malfist

Sure, and if I put out a local lending library box in my front yard I shouldn't be annoyed by the neighbor that takes every book out of it and throws it in the trash.

Decorum and respect expectations don't disappear the moment it's technically feasible to be an asshole

4 hours ago by jeroenhd

AI and scraping companies are why we can't have nice things.

Of course privacy law doesn't necessarily agree with the idea that you can just scrape private data, but good luck getting that enforced anywhere.

5 hours ago by djoldman

Just to be clear, as with LAION, the data set doesn't contain personal data.

It contains links to personal data.

The title is like saying that sending a magnet link to a torrent of copyrighted material is itself distributing copyrighted material. Folks can argue whether that's true, but the discussion should at least be transparent.

5 hours ago by yorwba

I think the data set is generally considered to consist of the images, not the list of links for downloading the images.

That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective) but for the question "Does that mean a model was trained on my images?" it's immaterial.

4 hours ago by bearl

Links to pii are by far the worst sort of pii, yes.

“It’s not his actual money, it’s just his bank account and routing number.”

2 hours ago by djoldman

A more accurate analogy is "it's not his actual money, it's a link to a webpage or image that has his bank account and routing number."

8 hours ago by cheschire

I hope future functionality of haveibeenpwned includes a tool to search LLM models and training data for PII based on the collected and hashed results of this sort of research.
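For reference, haveibeenpwned's existing Pwned Passwords service already does something like this for passwords via a k-anonymity range query: the client hashes locally and sends only a short hash prefix, so the service never learns exactly what was searched. A hashed-PII lookup over extracted training data could plausibly work the same way. A minimal sketch, assuming a hypothetical local corpus of hashes (this is not a real HIBP API):

```python
import hashlib

# Hypothetical corpus: hashes of PII strings that researchers
# extracted from a training set (emails, SSNs, etc.).
breached_hashes = {
    hashlib.sha1(s.encode()).hexdigest().upper()
    for s in ["alice@example.com", "555-12-3456"]
}

def prefix_query(prefix: str) -> list[str]:
    """Server side of a k-anonymity lookup: return only the suffixes
    of stored hashes matching a 5-char prefix, so the server never
    sees the full hash of what the client is searching for."""
    return [h[5:] for h in breached_hashes if h.startswith(prefix)]

def was_seen(pii: str) -> bool:
    """Client side: hash locally, send only the prefix, then check
    whether our hash's suffix appears in the returned candidates."""
    digest = hashlib.sha1(pii.encode()).hexdigest().upper()
    return digest[5:] in prefix_query(digest[:5])
```

The point of the prefix split is that many unrelated hashes share any given 5-character prefix, so the query itself leaks almost nothing about the PII being checked.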

7 hours ago by croes

Hard to search in the model itself

6 hours ago by cheschire

Yep, that's why at the end of my sentence I referred to the results of research efforts like this that do the hard work of extracting the information in the first place.

7 hours ago by itsalotoffun

I WISH this mattered. I wish data breaches actually carried consequences. I wish people cared about this. But people don't care. Right up until you're targeted for ID theft, fraud or whatever else. But by then the causality feels so diluted that it's "just one of those things" that happens randomly to good people, and there's "nothing you can do". Horseshit.

7 hours ago by rypskar

We should also stop calling it ID theft. The identity is not stolen; the owner still has it. Calling it ID theft shifts responsibility from the party the fraud is actually committed against (often banks or other large entities) onto an innocent third party.

6 hours ago by herbturbo

Yes tricking a bank into thinking you are one of their customers is not the same as assuming someone else’s identity.

2 hours ago by messagebus

As always, Mitchell and Webb hit the nail precisely on the head.

https://www.youtube.com/watch?v=CS9ptA3Ya9E

4 hours ago by JohnFen

> Calling it ID theft is moving the responsibility from the one that a fraud is against (often banks or other large entities)

The victim of ID theft is the person whose ID was stolen. The damage to banks or other large entities pales in comparison to the damage to those people.

3 hours ago by rypskar

I probably didn't express myself well enough. By calling it ID theft you are blaming the person the ID belongs to, and that person has to prove they are innocent. Call it by the correct words, bank fraud, and the bank has to prove that the person the ID belongs to did it. No ID was stolen; it was only used by someone else to commit fraud. The banks don't have enough security to stop it because they have gotten away with calling it ID theft and putting the blame on the person the ID belongs to.

6 hours ago by laughingcurve

It’s not clear to me how this is a data breach at all. Did the researchers hack into some database and steal information? No?

Because afaik everything they collected was public web. So now researchers are being lambasted for having data in their sets that others released

That said, masking obvious numbers like SSNs is low-hanging fruit. Trying to scrub every piece of public information that could identify a person is insane.

7 hours ago by jelvibe25

What's the right consequence in your opinion?

7 hours ago by passwordoops

Criminal liability with a minimum of two years served for executives, plus fines amounting to 110% of total global revenue for the company that allowed the breach, would see cybersecurity taken a lot more seriously in a hurry.

6 hours ago by lifestyleguru

Would be nice to have executives finally responsible for something.

4 hours ago by bearl

Internet commerce requires databases with pii that will be breached.

Who is to blame for internet commerce?

Our legislators. Maybe specifically we can blame Al Gore, the man who invented the internet. If we had put warning labels on the internet like we did with NWA and 2 live crew, Gore’s second best achievement, we wouldn’t be a failed democracy right now.

5 hours ago by krageon

A stolen identity destroys the life of the victim, and there's going to be more than one. They (every single involved CEO) should have all of their assets seized, to be put in a fund that is used to provide free legal support to the victims. Then they should go to a low-security prison and have mandatory community service for the rest of their lives.

They probably can't be redeemed and we should recognise that, but that doesn't mean they can't spend the rest of their life being forced to be useful to society in a constructive way. Any sort of future offense (violence, theft, assault, anything really) should mean we give up on them. Then they should be humanely put down.

7 hours ago by atoav

It doesn't now, but we could collectively decide to introduce consequences of the kind that deter anybody willing to try this again.

7 hours ago by pera

Yesterday I asked if there is any LLM provider that is GDPR compliant: at the moment I believe the answer is no.

https://news.ycombinator.com/item?id=44716006

7 hours ago by thrance

Mistral's products are supposed to be at least, since they are based in the EU.

6 hours ago by pera

I am not sure if Mistral is: if you go to their GDPR page (https://help.mistral.ai/en/articles/347639-how-can-i-exercis...) and then to the erasure request section they just link to a "How can I delete my account?" page.

Unfortunately they don't provide information regarding their training sets (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...) but I think it's safe to assume it includes DataComp CommonPool.

7 hours ago by tonyhart7

so your best bet is an open-weight LLM then???

but is that a breach of GDPR???

7 hours ago by pera

There is currently no effective method for unlearning information, especially not when you don't have access to the original training datasets (as is the case with open-weight models). See:

Rethinking Machine Unlearning for Large Language Models

https://arxiv.org/html/2402.08787v6

7 hours ago by atoav

Only if it contains personal data you collected without explicit consent ("explicit" here means literally asking: "I want to use this data for that purpose, do you allow this? Y/N").

Also people who have given their consent before need to be able to revoke it at any point.

7 hours ago by xxs

> need to be able to revoke it at any point.

They also have to be able to ask how much data (if any) is being used, and how.

7 hours ago by tonyhart7

so the EU basically locked itself out of the AI space????

idk, but how can anyone do that while staying GDPR compliant???

7 hours ago by imglorp

Reader mode works on this site.
