I would like to try building a search index for this instance (and maybe others), so I would like to crawl the site with automated spiders. With the shutdown of the Reddit API, I expect the site to come under quite substantial load, and I would of course try not to flood it with requests, so that I don't get banned or blocked for looking like a DoS attack. Could anyone provide some information on how to crawl it responsibly, such as acceptable request rates?

  • ericjmorey@lemmy.world
    1 year ago

    Many existing Fediverse services are operated by people who are opposed to having the content on their instance(s) indexed. You may run into resistance from that angle.

    • Beliriel@lemmy.worldOP
      1 year ago

      I mean, unless they make their instance private, I don’t see why you wouldn’t index them. That’s literally why Google provided so much value in its early days.

      • ericjmorey@lemmy.world
        1 year ago

        Even Google doesn’t index webpages that include “noindex” in a robots meta tag or in an X-Robots-Tag response header. You are going to run into a lot of people who don’t agree with what you are trying to do. If you start reaching out to the people running Fediverse services to let them know that you’re trying to index the data on their services, you can learn what they think of the idea.
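
For anyone who wants to honor those signals in a crawler, here is a minimal sketch in Python using only the standard library. It is not how any particular instance or search engine does it. The function names (`has_noindex`, `allowed_by_robots`, `polite_delay`), the `example-indexer` user agent, and the 5-second fallback delay are all assumptions made for illustration. The `noindex` check is also simplified: it ignores user-agent-scoped `X-Robots-Tag` values and related directives such as `none`.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html, headers):
    """Return True if the page asks not to be indexed (simplified check)."""
    # 1. X-Robots-Tag HTTP response header, e.g. "noindex, nofollow".
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            if "noindex" in [v.strip().lower() for v in value.split(",")]:
                return True
    # 2. <meta name="robots" content="noindex"> in the HTML itself.
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def _parse_robots(robots_txt):
    """Build a RobotFileParser from robots.txt text that was already fetched."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots.txt permits user_agent to fetch url."""
    return _parse_robots(robots_txt).can_fetch(user_agent, url)


def polite_delay(robots_txt, user_agent, default=5.0):
    """Seconds to wait between requests: Crawl-delay if given, else default."""
    delay = _parse_robots(robots_txt).crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A crawler built on this would fetch `robots.txt` once per instance, skip any URL where `allowed_by_robots` returns `False`, sleep for `polite_delay(...)` seconds between requests, and drop any fetched page where `has_noindex` returns `True`. None of this replaces asking the instance admins first.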