When I search for anything on Google or DuckDuckGo, more than half of the results are useless AI-generated articles.

Those articles are generated to land in the top results for a query, since search engines use algorithms to index and rank websites and pages.

If we manually curate “good” websites (newspapers, forums, encyclopedias, anything that can be considered a good source) and only index their contents, would it be possible to create a good ol’ fashioned search engine? Does one already exist?

  • mkwt@lemmy.world
    2 days ago

    That’s how Yahoo originally worked. They gave up on the “directory” because they utterly failed to keep up with the expansion of the world wide web. Google with its automatic crawlers did a much better job of listing new websites.

    • Treczoks@lemmy.world
      1 day ago

      Yep. Web indexing (or rather internet indexing) for us started with a notebook (paper and pen version) in the computer room. Later Yahoo and Altavista joined in (we still used the notebook for the good sites, browser bookmarks did not yet exist). But Google, when it still was good, wiped this all off the face of the earth.

    • snooggums@lemmy.world
      2 days ago

      Yahoo could go back to the directory approach and advertise as having less SEO garbage and AI slop.

    • ricecake@sh.itjust.works
      2 days ago

      Skipping a few generations there. Manual indexing had stopped being feasible long before Google existed. Engines like the eponymous “WebCrawler” followed the links between sites and let you search their text.
      Google came around later with enhancements to how the pages are ranked based on the relationship of links between pages.
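
      A rough sketch of the crawl-then-search idea, assuming a toy in-memory “web” instead of real HTTP fetching. The page names, `crawl`, and `search` are made up for illustration and are not how any real engine was implemented:

```python
from collections import deque

# Toy "web": URL -> (page text, outgoing links). All URLs are made up.
PAGES = {
    "a.example": ("peruvian sand art history", ["b.example", "c.example"]),
    "b.example": ("sand dunes and desert travel", ["c.example"]),
    "c.example": ("art supplies for beginners", []),
    "d.example": ("unlinked page about sand art", []),
}

def crawl(start):
    """Breadth-first walk over links; returns word -> set of URLs."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        # Index every word of the page text under this URL.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Follow links we have not visited yet.
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, query):
    """Return the URLs whose text contains every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(w, set()) for w in words]
    return set.intersection(*hits)

index = crawl("a.example")
print(sorted(search(index, "sand art")))
```

      Note that `d.example` never gets indexed because nothing links to it, which is the discovery limit of any purely link-following crawler.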

    • skisnow
      2 days ago

      There’s space in the market for a better search engine. It could be built using an AI-assisted tool for semi-automatic curation by a team of a hundred librarians that could speedily vet the AI’s suggestions for quality.

      I’m not sure the web is even that big any more, due to all the big web 2.0 social apps like Facebook and Instagram sucking everything into their uncrawlable platforms.

  • anachrohack@lemmy.world
    2 days ago

    Kagi allows you to save filters (called lenses) for later use, so if you curate your own, you can share them with other people. Would be neat if there was a Lemmy community for that.

    • JustTesting@lemmy.hogru.ch
      2 days ago

      It also allows you to boost/deprioritize/ban domains from your searches. Not seeing Pinterest show up ever again is great.
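
      For anyone wanting the same effect on their own result list, here is a minimal sketch of per-domain boost/deprioritize/ban re-ranking. This is not Kagi's API; `RULES`, `domain_of`, `rerank`, the multipliers, and the example URLs are all assumptions for illustration:

```python
from urllib.parse import urlparse

# Hypothetical per-user rules: score multiplier per domain, None = banned.
RULES = {
    "pinterest.com": None,       # ban
    "en.wikipedia.org": 2.0,     # boost
    "contentfarm.example": 0.5,  # deprioritize
}

def domain_of(url):
    """Hostname of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def rerank(results, rules):
    """results: list of (url, score). Drop banned domains, scale the rest,
    and return the list sorted by adjusted score, highest first."""
    adjusted = []
    for url, score in results:
        domain = domain_of(url)
        if domain in rules and rules[domain] is None:
            continue  # banned domain, never shown
        adjusted.append((url, score * rules.get(domain, 1.0)))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

results = [
    ("https://www.pinterest.com/pin/1", 0.9),
    ("https://contentfarm.example/top-10", 0.8),
    ("https://en.wikipedia.org/wiki/Sand_art", 0.5),
    ("https://blog.example/sand", 0.6),
]
for url, score in rerank(results, RULES):
    print(f"{score:.2f} {url}")
```

      Domains without a rule keep their original score, so the rules only need to list the sites you have an opinion about.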

  • haverholm@kbin.earth
    2 days ago

    You could try something like the huge AI blocklist for uBlock Origin. It cleans “AI” results out of Google, DDG and Bing searches.

    I’m not sure how a curated search engine would work in practice, though it’s a nice idea. It would just be an enormous task, and vulnerable to manipulation by bad actors if you crowdsource it. I’d love to be proven wrong, though.

    As @[email protected] said already, a self-hosted SearXNG instance would give you some individual curation capabilities, but it wouldn’t be the “for the greater good” project I think you might be looking for.

  • ricecake@sh.itjust.works
    2 days ago

    It’s certainly possible. Wouldn’t even be hard, since early crawlers were using radically smaller computers and the technology involved is now just available free and open source.

    If you limit it to only curated domains you’ll find issues with limited content and difficulties discovering novel information.
    If you need information on Peruvian sand art you can only find it if you’ve already added it to the index.

    What you might consider is starting with a set of “seed” sites that you trust, and fanning out from there. Use something like PageRank to rank encountered sites, and augment that ranking with distance from a known good domain. A site with a lot of link activity that’s also referenced by a site you find credible is probably better than one that’s four steps removed.
    Human review of sites as they cross some threshold of ranking is plausible, since it’s easier to look at a list of sites that seem consistently okay and check if they’re slop than to enumerate the sites that aren’t slop from scratch.
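
    The seed idea can be sketched roughly like this, assuming a toy link graph. The `decay ** distance` weighting is just one possible way to “augment” the ranking and is my own assumption, as are all the site and function names:

```python
from collections import deque

# Toy link graph: site -> sites it links to. All names are made up.
LINKS = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["b.example"],
    "b.example": ["seed.example", "c.example"],
    "c.example": ["d.example"],
    "d.example": ["c.example"],
    "farm.example": ["d.example"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict adjacency list."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            targets = outs or nodes  # dangling pages spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        rank = new
    return rank

def seed_distance(links, seeds):
    """Fewest link hops from any seed (BFS); unreachable sites are absent."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        for nxt in links.get(site, []):
            if nxt not in dist:
                dist[nxt] = dist[site] + 1
                queue.append(nxt)
    return dist

def trust_score(links, seeds, decay=0.5):
    """PageRank scaled by decay**distance; unreachable sites score 0."""
    rank = pagerank(links)
    dist = seed_distance(links, seeds)
    return {s: rank[s] * decay ** dist[s] if s in dist else 0.0 for s in links}

scores = trust_score(LINKS, ["seed.example"])
for site, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.4f} {site}")
```

    Sites that cross some score threshold would be the ones to queue for human review. A site no seed can reach, like `farm.example` here, scores zero no matter how many links it sends out.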

    One of the better ways to gain insight about which results your users find most helpful is, ironically, a non-LLM neural net. Understanding what types of queries lead to which domains should help you guide curation and rank the trustworthy sites that users pick higher.

  • scsi@lemm.ee
    2 days ago

    The tech nerds on HN really like Kagi. It’s a paid search engine that apparently provides filters and other tools to restrict your results to the types of sources you’re looking for, though I have no personal experience with it. That said, it’s infected with AI, just like Mojeek, Qwant Next, and other alternatives. A SearXNG instance might be your best bet.

    • TurtleTourParty@midwest.social
      1 day ago

      You can easily turn off the Kagi AI through settings. Blocking all of the AI-generated sites would be tedious, though. They currently have a setting that tries to filter AI images from search results; hopefully they’ll add one for AI sites in the future.

  • ironhydroxide@sh.itjust.works
    2 days ago

    I too was frustrated with the ever-decreasing quality of search results. I spun up my own SearXNG instance and use that. In my experience, it is usually better at finding what I’m looking for.