It’s fairly obvious why stopping a service while backing it up makes sense. Imagine backing up Immich while it’s running. You start the backup, db is backed up, now image assets are being copied. That could take an hour. While the assets are being backed up, a new image is uploaded. The live database knows about it but the one you’ve backed up doesn’t. Then your backup process reaches the new image asset and it copies it. If you restore this backup, Immich will contain an asset that isn’t known by the database. In order to avoid scenarios like this, you’d stop Immich while the backup is running.

Now consider a system that can do instant snapshots like ZFS or LVM. Immich is running, you stop it, take a snapshot, then restart it. Then you backup Immich from the snapshot while Immich is running. This should reduce the downtime needed to the time it takes to do the snapshot. The state of Immich data in the snapshot should be equivalent to backing up a stopped Immich instance.

Now consider a case like above without stopping Immich while taking the snapshot. In theory the data you’re backing up should represent the complete state of Immich at a point in time eliminating the possibility of divergent data between databases and assets. It would however represent the state of a live Immich instance. E.g. lock files, etc. Wouldn’t restoring from such a backup be equivalent to kill -9 or pulling the cable and restarting the service? If a service can recover from a cable pull, is it reasonable to consider it should recover from restoring from a snapshot taken while live? If so, is there much point to stopping services during snapshots?

  • MangoPenguin@lemmy.blahaj.zone
    link
    fedilink
    English
    arrow-up
    11
    ·
    5 months ago

    You start the backup, db is backed up, now image assets are being copied. That could take an hour.

    For the initial backup maybe, but subsequent incrementals should only take a minute or two.

    I don’t bother stopping services, it’s too time intensive to deal with setting that up.

    I’ve yet to meet any service that can’t recover smoothly from a kill -9 equivalent, any that did sure wouldn’t be in my list of stuff I run anymore.

    • Avid AmoebaOP
      link
      fedilink
      English
      arrow-up
      3
      ·
      edit-2
      5 months ago

      It depends on the dataset. If the dataset itself is very large, just walking it to figure out what the incremental part is can take a while on spinning disks. Concrete example - Immich instance with 600GB of data, hundreds of thousands of files, sitting on a 5-disk RAIDz2 of 7200RPM disks. Just walking the directory structure and getting the ctimes takes over an hour. Suboptimal hardware, suboptimal workload. The only way I could think of speeding it up is using ZFS itself to do the backups with send/recv, thus avoiding the file operations altogether. But if I do that, I must use ZFS on the backup machine too.

      I’ve yet to meet any service that can’t recover smoothly from a kill -9 equivalent, any that did sure wouldn’t be in my list of stuff I run anymore.

      My thoughts precisely.

      • MangoPenguin@lemmy.blahaj.zone
        link
        fedilink
        English
        arrow-up
        2
        ·
        5 months ago

        Oooh yeah I can imagine RAIDz2 on top of using spinning disks would be very slow, especially with access times enabled on ZFS.

        What backup software are you using? I’ve found restic to be reasonably fast.

        • Avid AmoebaOP
          link
          fedilink
          English
          arrow-up
          1
          ·
          edit-2
          5 months ago

          Currently duplicity but rsync took similar amount of time. The incremental change is typically tens or hundreds of files, hundreds of megabytes total. They take very little to transfer.

          If I can keep the service up while it’s backing up, I don’t care much how long it takes. Snapshots really solve this well. Even if I stop the service while creating the snapshot, it’s only down for a few seconds. I might even get rid of the stopping altogether but there’s probably little point to that given how short the downtime is. I don’t have to fulfill an SLA. 😂