• rockSlayer@lemmy.world
        arrow-up
        38
        ·
        2 months ago

        Well, that’s part of the thing. Web scraping doesn’t get covered by policies. Like, they could ban your IP or any accounts you have, but web scraping itself will always be acceptable. It’s why projects like NewPipe and Invidious don’t care about YouTube cease-and-desist letters.

          • rockSlayer@lemmy.world
            arrow-up
            3
            ·
            2 months ago

            Yeah, I’ve seen those pop-ups when trying to find something out. It sucks, but it isn’t a significant barrier to web scraping.

          • AeroLemming@lemm.ee
            English
            arrow-up
            1
            ·
            2 months ago

            Doesn’t that only happen on the mobile version? Either way, it’s stupid and annoying. Google should start de-ranking sites that add barriers to content, but I know they never will.

            • werefreeatlast@lemmy.world
              arrow-up
              1
              ·
              2 months ago

              I tried that on my desktop. So long as you are not actually logged in, you cannot see the communities that are too small for a review or too adult after a review.

              • AeroLemming@lemm.ee
                English
                arrow-up
                2
                ·
                2 months ago

                Ugh, what a fucking shitshow. I know it won’t happen quickly or easily, but I’m hoping to see more people on federated platforms in the next decade or two. It’s the only way for us to take the internet back from these greedy bastards.

        • AeroLemming@lemm.ee
          English
          arrow-up
          1
          ·
          2 months ago

          Is it any different for an “API”? I don’t think there’s a very big difference between an HTTP endpoint that returns HTML and an HTTP endpoint that returns JSON.

          • krippix@feddit.de
            English
            arrow-up
            1
            ·
            2 months ago

            In what way?

            HTML definitely adds more overhead than JSON if you only care about the data.

            • AeroLemming@lemm.ee
              English
              arrow-up
              1
              ·
              2 months ago

              Legally. OC stated that NewPipe doesn’t worry about legal threats because they scrape instead of using an official API.

          • folkrav
            arrow-up
            1
            ·
            2 months ago

            Parsing HTML absolutely comes with a lot more overhead. And since many websites rely heavily on JS interactivity nowadays, the HTML you get back from a plain HTTP request often doesn’t contain the full contents you’re looking for, depending on the site.
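
To make the overhead point concrete, here is a minimal sketch using only the Python standard library. Both payloads, the `videos` key, and the `title` class name are made up for illustration and do not come from any real site or API. The JSON path is a single `json.loads` call, while the HTML path needs a small state machine just to skip the surrounding markup, and it would still miss anything a page renders later with JS.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads: the same two titles served as JSON and as HTML.
JSON_BODY = '{"videos": [{"title": "First clip"}, {"title": "Second clip"}]}'
HTML_BODY = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <ul>
    <li class="video"><span class="title">First clip</span></li>
    <li class="video"><span class="title">Second clip</span></li>
  </ul>
</body></html>
"""


def titles_from_json(body: str) -> list[str]:
    """The data is already structured, so one parse call is enough."""
    return [video["title"] for video in json.loads(body)["videos"]]


class TitleScraper(HTMLParser):
    """Collects the text inside <span class="title"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())


def titles_from_html(body: str) -> list[str]:
    """The same data has to be dug out of presentation markup."""
    scraper = TitleScraper()
    scraper.feed(body)
    return scraper.titles
```

Both functions return the same list for these sample payloads. The difference is that the scraper depends on the page's markup, so it breaks whenever the layout changes.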

      • freebread@lemm.ee
        English
        arrow-up
        9
        ·
        2 months ago

        Still waiting for the news that they took down old.reddit. Without the third party apps, that was the only way it could still be usable.

    • IphtashuFitz@lemmy.world
      English
      arrow-up
      2
      ·
      2 months ago

      We use Akamai where I work for security, CDN, etc. Their services make it largely trivial to identify traffic from bots. They can classify requests in real time as coming from known bots like Googlebot, from programming frameworks like Python and Java, from bots that impersonate Googlebot, or from virtually any other automated traffic from unknown bots.

      If Reddit was smart they’d leverage something like that to allow Google, Bing, etc. to crawl their data and block all others, or poison others with bogus data. But we’re talking about Reddit here…
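
As a rough illustration of the kind of classification described above, here is a toy sketch in Python. It is not how Akamai works, and every name in it (`classify`, the category labels, the marker lists) is hypothetical. It assumes only two signals: the User-Agent string, and a reverse DNS lookup to check whether a client claiming to be Googlebot resolves to a Google hostname. The lookup is passed in as a function so the logic can be tried without network access. A real check would also confirm the hostname with a forward lookup, which is skipped here.

```python
import socket
from typing import Callable, Optional

# Illustrative markers only: default User-Agent prefixes of common HTTP clients.
HTTP_LIBRARY_MARKERS = ("python-requests", "python-urllib", "java/")
# Hostname suffixes assumed for genuine Googlebot reverse DNS records.
GOOGLEBOT_HOST_SUFFIXES = (".googlebot.com", ".google.com")


def system_reverse_dns(ip: str) -> Optional[str]:
    """Reverse DNS via the system resolver; returns None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def classify(
    user_agent: str,
    client_ip: str,
    reverse_dns: Callable[[str], Optional[str]] = system_reverse_dns,
) -> str:
    """Sort a request into one of four hypothetical categories."""
    ua = user_agent.lower()
    if "googlebot" in ua:
        # Anyone can send this User-Agent, so check where the IP resolves.
        host = reverse_dns(client_ip)
        if host and host.endswith(GOOGLEBOT_HOST_SUFFIXES):
            return "known-bot"
        return "impersonator"
    if any(marker in ua for marker in HTTP_LIBRARY_MARKERS):
        return "http-library"
    # A User-Agent alone cannot reveal a well-disguised bot.
    return "unclassified"
```

The last branch is where a toy like this stops: spotting unknown bots that send browser-like headers takes behavioral signals that a sketch cannot reproduce.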