Reddit is telling its future investors, via recent news and further details in its IPO filing, that it is currently selling, and looking to sell, its users’ data to companies wanting to train their LLMs, including Google.

This is a direct violation of the GDPR with respect to any EU-based users.

Legal Basis?

Under Art. 6 GDPR, Reddit may really only rely on p1 (f) (“p” meaning paragraph): processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.

All other options are impossible, as they have neither consent nor contracts with their user base that would allow for this. Art. 6 p1 (f) is a touchy subject and clearly requires extra provisions being made where the data subject is a child. Reddit has tons of children (meaning anyone under 18) using their site daily. See for example: https://www.reddit.com/r/teenagers/

What’s being processed

Due to the nature of Reddit, they are also processing huge amounts of data of the special categories under Article 9, such as data on sexual orientation, health information, ethnic info, union membership, etc. (basically everything in Article 9 can easily be found on Reddit).

(note users have not given explicit consent according to the requirements of consent under Art. 7 and 8, which would allow for such processing under Art. 9 p2 (a))

These are obviously just a tiny selection of the hundreds of subreddits that are concerned with these types of data, not to mention the unencrypted “private” messages, chats, etc.

My lord, is this legal?

Article 9 p2 (e) states “processing relates to personal data which are manifestly made public by the data subject;”; so they’re out of the woods, right? After all, users posted this stuff and it was made public! Sadly, this doesn’t work with processing data of children, especially with 9 p4 allowing member states to introduce additional limitations and conditions for further processing.

The real kicker comes with Article 5 p1 (b) though for 9 p2 (e). 5 p1 (b) requires the personal data be: “collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes; further processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes shall, in accordance with Article 89(1), not be considered to be incompatible with the initial purposes (‘purpose limitation’);”

Yeeeah… People have posted their stuff publicly, BUT with a clear understanding that the processing of the data ends there. Reddit may process the data insofar as they serve it to the public. That’s it. Turning around and selling the data now is a crystal-clear violation of Article 5 p1 (b). There’s no two ways about it. And as per Article 5 p2, Reddit needs to be able to prove they are in compliance with 5 p1.

They’re also processing data under Art. 10 relating to criminal convictions and offences

Processing of such data shall be carried out only under the control of official authority or where processing is authorized by Union or Member State law. Rugh-Roh. I’ll admit that this one might be reaching a bit, as it could easily be read to apply only to “official” records rather than criminal talk overall, but fuck it. Throw it on the pile.

Your rights and how they’re being violated (not in a kinky fun way)

Now let’s look at the Rights of the data subject! Those are always fun :)

Art. 12 p1 “The controller shall take appropriate measures to provide any information referred to in Articles 13 and 14 and any communication under Articles 15 to 22 and 34 relating to processing to the data subject in a concise, transparent, intelligible and easily accessible form, using clear and plain language, in particular for any information addressed specifically to a child.”

This is my favorite. Articles 13 and 14 are the provisions on informing the data subject, depending on whether their data was obtained directly from them or not. For Reddit, Article 13 mostly applies, but since people also talk about people they know, Article 14 applies as well.

Let’s check in there real quick. We’ll keep it to Article 13 for brevity. Reddit needs to:

  • give info on the contact details of the controller and the controller’s representative (in the case of selling data to Google for LLM training, that’s info for Google, not Reddit; anyone got that info via DM maybe? No? Oh shit)
  • contact details of their data protection officer (both Reddit’s and Google’s; was anyone informed of that for the LLM stuff? No?)
  • purposes of processing including the legal basis. Love this one. Anyone know that for the sale of their data to Google to train their LLM? No? Shucks.
  • Since they’re likely hinging on Art. 6 p1 (f) they need to tell you what the legitimate interest is - in our case MAD MONEY, not sure if that’ll hold up.
  • recipients or categories of recipients -> so “big evil data churning, election influencing, minority silencing, union busting, mega corp” Sweet.
  • Transfer to third countries -> likely doesn’t apply, as reddit servers are in the US (I believe, no idea if that’s true) if not, weeeeeeell…
  • right to lodge a complaint with supervisory authority (anyone got that notice?)
  • whether the data is provided due to a contract or statutory -> doesn’t apply, users give their data “freely”
  • existence of automated decision-making, INCLUDING PROFILING (as per Art. 22 -> none of the exceptions in that article apply to the current situation) - I’m sure no LLM will ever be used by the largest ad company on the planet to help profile users, noooooooo, that’s craaazy!

Article 13 p3 clearly states in relation to Article 5 p1 (b) that the data subject must be informed about the data that is being collected for further processing BEFORE such processing occurs, including all the info I just listed above from Article 13 p2.

Article 13 p4 states “Paragraphs 1, 2 and 3 shall not apply where and insofar as the data subject already has the information.” Yeeaaah… No Reddit user knew about their data being actively sold for LLM training. (Sure, LLMs might have scraped it before, but that’s an entirely different can of worms - one Reddit has a hand in, too, since under the GDPR they’re supposed to establish safeguards against such things. But Reddit directly selling now, without any upfront info… tut, tut…)

Another element of Article 12 p1 is this bit: “in a concise, transparent, intelligible and easily accessible form, using clear and plain language, in particular for any information addressed specifically to a child.” I’m sure they’ll find a cool and hip way of explaining to all the teens on Reddit what an LLM is and how it’s using their data.

Send reddit a little e-mail

For added fun, I urge anyone who still has a Reddit account and is an EU citizen to contact Reddit and make use of their rights under the GDPR to be specifically excluded from any use for LLM training, etc. This is your RIGHT under Article 12 p2, specifically the Article 21 right to object. You can contact them via “[email protected]

Let’s be really petty and assume that Reddit and Google are shit at what they do, so they likely haven’t even entered into the Data Processing Agreement required under Article 28 p3 - and if they have, I’d love for my supervisory authority to take a look at that one.

Delving into the Arcane

Let’s kick it up a tiny notch and go into the more arcane bits of the GDPR with Article 35, Data Protection Impact Assessment. I’m sure you’ll love p1:

“Where a type of processing in particular using new technologies, and taking into account the nature, scope, context and purposes of the processing, is likely to result in a high risk to the rights and freedoms of natural persons, the controller shall, prior to the processing, carry out an assessment of the impact of the envisaged processing operations on the protection of personal data. A single assessment may address a set of similar processing operations that present similar high risks.”

New technologies you say? Likely to result in a high risk, you say? Remember how most chatbots and AIs turn super racist, super quick? Or AIs being easily triggered into revealing their training data 1:1? Oh I’m sure there’s nooooo such risk with LLMs run by evil mega corp known for exploiting the shit out of exactly this kind of info for well over a decade now.
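To make the re-identification risk concrete: even if usernames were stripped before training, verbatim post text is effectively a unique key back into the public site. A toy sketch (all posts and usernames below are invented for illustration):

```python
# Toy sketch (hypothetical posts and usernames): even if usernames are
# stripped before a dataset is handed over, the verbatim post text still
# acts as a unique key back into the public site, so an exact-text
# search re-identifies the author.
public_site = {
    "my union rep told me to document everything at work": "u/example_worker",
    "i was diagnosed last spring and still have not told my family": "u/example_throwaway",
}

# "Anonymized" training set: usernames removed, text kept verbatim.
training_set = list(public_site.keys())

def reidentify(text):
    """Emulates an exact-phrase `site:reddit.com "..."` search."""
    return public_site.get(text)

# Every "anonymized" post maps straight back to its author.
recovered = [reidentify(post) for post in training_set]
```

Anything an LLM regurgitates verbatim is one exact-phrase search away from its author, which is exactly the kind of risk a DPIA is supposed to flag.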

But we needn’t even argue that point. Article 35 p3 (b) clearly states:
“processing on a large scale of special categories of data referred to in Article 9(1), or of personal data relating to criminal convictions and offences referred to in Article 10;”

Oh nooo. Remember the special categories from the start? LLM training on Reddit data sets is 100%, definitely, large-scale processing of all of that.

Any honest assessment would clearly indicate high risk, and thus Article 36 would apply: Reddit has to consult with EU supervisory authorities BEFORE starting this. Knowing Reddit, if they ever even did such an assessment, it came out as “low risk, nothing to see”.

Then there’s Article 32 of the GDPR: Security of processing. p1 (b): “the ability to ensure the ongoing confidentiality, integrity, availability and resilience of processing systems and services;” yeeeaaah, good luck with that on an LLM there, buddies.

Cool, what now?

Here’s what you do to exercise your rights and defend your data against the highway robbery and continuous violation by US Tech-Bros:

  • Find your supervisory authority (just use google, for added irony) by searching for “Data Protection supervisory authority [the state you live in]”.
  • Find their contact info; usually they have a ready-made complaint form
  • Give them the company info applicable to your state - I’ve gone ahead and fished those addresses out for you (see the end of this post)
  • Tell them about the upcoming reddit Google partnership: https://apnews.com/article/google-reddit-ai-partnership-a7f131c7cb4225307134ef21d3c6a708
  • Tell them you’re an EU citizen and a Reddit user (or a former user whose data they still hold)
  • Tell them you believe them to be in violation of Articles: 5, 6, 9, 10, 12, 13, 14, 32, 35, 36, and possibly more.
  • Link to this thread, if you like

US

Reddit, Inc.
548 Market St. #16093
San Francisco, California 94104

EU

Reddit Netherlands B.V.
Euro Business Center
Keizersgracht 62, 1015CS Amsterdam
Netherlands
[email protected]

UK

Reddit UK Limited,
5 New Street Square,
London, United Kingdom,
EC4A 3TW
[email protected]

Good luck!

  • AlteredStateBlob@kbin.social (OP) · 10 months ago

    Every post is tied to a username and email address, making it personal information, since each poster can be identified. I’m sure they’re also tracking further metrics such as IP addresses, browser fingerprints, etc. It is immaterial if we from the outside are able to identify users, it only matters if it’s possible given the data available to the processor. In this case, it is. Not to mention, there is a good chance texts and posts themselves contain plenty of personal information, such as linking to other user profiles, mentioning and discussing people, etc.

    • FaceDeer@kbin.social · 10 months ago

      If they were GDPR-compliant before, I don’t see how they’ve changed to not be GDPR-compliant now. They allow people to delete their accounts and their posts if they wish, which removes all identifying information from their system.

      Frankly, this looks like just a “I just hate Reddit! There’s gotta be something I can hit them with!” flailing attempt to me.

      • roadkill@kbin.social · 10 months ago

        They ‘allow’ people to delete their posts and accounts…

        But never actually delete anything from their databases. I’ve had years-old comments I deleted mysteriously reappear despite being gone for months.

        • FaceDeer@kbin.social · 10 months ago (edited)

          So contact them about that, then. Sue them if you’re sufficiently offended. This doesn’t change anything. If they were GDPR-compliant before they’re still GDPR-compliant, if they weren’t GDPR-compliant then they still aren’t. My point is that this AI training stuff has nothing to do with that.

          • webghost0101@sopuli.xyz · 10 months ago

            Did you read the part about a clear, informed use case for any further processing? I asked the people I know who still go there; none of them were even aware anything was going on.

    • HeartyBeast@kbin.social · 10 months ago

      True. However, I assume that Reddit is supplying Google with just the text. So yes, Reddit is collecting lots of PII, but that’s not what is going to Google, so Google can’t deduce it - unless you dox yourself in the text.

      Not trying to be deliberately argumentative, just thinking this through; much as I dislike Reddit, the case feels weak.

      • AlteredStateBlob@kbin.social (OP) · 10 months ago

        It doesn’t matter. As long as the text is supplied as-is, a simple Google search for the text plus site:reddit.com will reveal the author, keeping it identifiable. True anonymization under the GDPR almost doesn’t exist here, as it would destroy the dataset and make it unusable.

        • QuaternionsRock@lemmy.world · 10 months ago

          I deleted my first Reddit account a few years ago. When the whole API fiasco happened and I moved here, I realized that Redacted didn’t finish the job. I tried to get them to remove the rest of my stuff through a GDPR request, but they wouldn’t do shit, and they seemed to think that was acceptable under GDPR. When you delete your account, they (claim to) delete your associated email address, so they also “couldn’t” verify that it was mine.

          FWIW, HackerNews has the same policy.

        • HeartyBeast@kbin.social · 10 months ago (edited)

          It will reveal the username, not the identity of the author. If I tell you my Reddit username, what do you know about me?

          • AlteredStateBlob@kbin.social (OP) · 10 months ago

            It doesn’t matter what it tells me. Personal data is clearly defined under GDPR as data that can be used to identify a person. It is irrelevant if you or I can do it with publicly available data, reddit has the data and that is enough to qualify it as such.

            A DPA might absolutely disagree with my reading of the situation, but I would be surprised if a DPA considered usernames non-personally-identifiable information, and I know of no such ruling.

            • HeartyBeast@kbin.social · 10 months ago

              My view is that Reddit has personally identifiable data, but the data being licensed to Google isn’t personally identifiable, because the username by itself is insufficient to identify a person without the additional data that Reddit isn’t passing over.

              But I agree I may well be surprised by a DPA decision.

    • oce 🐆@jlai.lu · 10 months ago

      Isn’t it enough to remove any connection to any personal identifier before sending it? LLM training doesn’t care about your email, it cares about a certain quality of question/answer pairs, and reddit has a lot of those.

      • AlteredStateBlob@kbin.social (OP) · 10 months ago

        It is not enough, no. The LLM might reveal training data verbatim, and the original text is then a simple Google search with site:reddit.com away from identifying the user. It’s trivial, and thus not anonymized.