See linked posting. I’ve commented there with a link to a CLI tool in Python that allows downloading of IA collections. I’ve submitted a patch to enable specifying start and end points so that it’s easier to resume downloading a huge collection, or to allow multiple people to split up the work.

https://archive.org/details/georgeblood

https://archive.org/details/78rpm_bowling_green

F*ck the RIAA and absurdly long copyright.


EDIT: There is more than one collection of 78s on IA, so I updated the title.


The issue with these collections is that they’re absolutely HUGE. And yes, IA offers torrents for them, but as a separate torrent for every. single. album. And the torrents have all data in them – FLAC, fixed-rate MP3, VBR MP3, PDF liner notes, etc. There may be some extremely hardcore data-hoarders out there who want everything, but IMHO, as these are scratchy old 78 records, FLAC is overkill for just saving the audio in a listenable format. The George Blood collection, just the VBR MP3s, is looking to be about 6TB. With ALL data it might be over 40TB! I can’t afford that many hard drives :)


So, my approach at the moment is to save just the VBR MP3s (they seem to be done at up to 320kbps VBR) and the JPEG album cover. If I have a chance and any storage left afterwards, I can make a separate pass to get the album liner PDFs…


Tool used: https://github.com/jjjake/internetarchive


Patch to allow setting start and end item indices for downloads: https://github.com/jjjake/internetarchive/pull/605


Example usage to grab just the VBR MP3 and record label JPG for each (note the --start-idx and --end-idx arguments):

#ia download --start-idx=4001 --end-idx=8000 -a -i --format="VBR MP3" --format="JPEG" --search collection:georgeblood

I’m going to concentrate on the George Blood collection for now… I’m starting at item 1. It would be great if others started at index 50,000, 100,000, 150,000, … and others started at the end and worked backwards in similarly-sized chunks, so that it’s assured someone gets each of them.
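If it helps with coordinating, here is a rough Python sketch for carving the collection into non-overlapping chunks and printing the matching command for each one. It assumes the indices are 1-based and inclusive (consistent with the 4001 to 8000 example above, but I have not verified that against the patch), and the total item count used below is a made-up placeholder, not the real size of the collection. The helper names `chunk_ranges` and `ia_command` are just for illustration; the `--start-idx` and `--end-idx` flags only exist with the patch from the pull request applied.

```python
def chunk_ranges(total_items, chunk_size):
    """Return inclusive, 1-based (start, end) pairs covering 1..total_items."""
    if total_items < 1 or chunk_size < 1:
        raise ValueError("total_items and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_items))
        for start in range(1, total_items + 1, chunk_size)
    ]


def ia_command(start, end, collection="georgeblood"):
    """Build the download command from the post for one chunk.

    Assumes the patched `ia` tool that accepts --start-idx / --end-idx.
    """
    return (
        f"ia download --start-idx={start} --end-idx={end} -a -i "
        f'--format="VBR MP3" --format="JPEG" '
        f"--search collection:{collection}"
    )


if __name__ == "__main__":
    # HYPOTHETICAL total; replace with the real item count of the collection.
    hypothetical_total = 12000
    for start, end in chunk_ranges(hypothetical_total, 4000):
        print(ia_command(start, end))
```

Each volunteer could then claim one printed line, and someone working backwards from the end would simply take the list in reverse order.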

  • Arghblarg (OP) · 1 year ago

    Oh, it’s not my project – I already have moved my own projects off there, yeah.

    • maudefi@lemm.ee · 1 year ago

      That’s awesome! Really encouraging seeing projects and devs migrate away from closed-source and proprietary systems and features. 💪

      • Arghblarg (OP) · 1 year ago

        sourcehut, self-hosted Gogs or Forgejo are some good candidates. Gitea is popular, but there’s apparently been some drama about them going commercial without proper buy-in from their contributors. (The code lineage is AFAIK Gogs → Gitea → Forgejo).


        All the above solutions make it super-easy to mirror a github project as well, just in case it goes away :) Doing so has saved my arse more than a few times when github takes a repo down for stupid reasons.


        Mandatory plug for [email protected] :)

        Gitlab seems too heavyweight to me. I use Gogs myself on my home server. It has no PR-based code review tools à la github/gitlab, but I don’t need those in my web frontend.