BackupSolution


This page has information on how I am backing up my data.


Data sets

I am using several backup strategies for different sets of data. This section provides an overview by describing the different data sets.

Regular data 
This data set contains all personal and household documents (e.g. letters, spreadsheets, invoices, contracts, bank account data, personal notes, etc.). Some of the documents were created by me; others were obtained from external sources and added to the data set for archival purposes. This data set is not too large (less than 10 GB at the time of writing) and has a relatively low growth rate. Files in the data set may change frequently.
Media files 
This data set contains music, images (except for the Photo Library), icons, travel souvenirs (sounds and small movies) and videos (except for the Movie Library). This data set is quite large (less than 200 GB at the time of writing) and has a relatively low growth rate (with spikes when new chunks are added). Files in the data set do not change much; the data set usually just grows over time. Metadata such as MP3 tags may be an exception.
Music Library 
This data set consists of the Apple Music library data, excluding the actual music media files - those are part of the "Media files" data set. The Music Library therefore contains metadata that is also stored in the music media files (e.g. title, album, artist), personalized data (ratings, playlists), plus other data obtained from the Internet (e.g. album covers). The Apple Music library data is treated separately from the music files in the "Media files" data set, and forms a data set in its own right, not because of the characteristics of the data, but because of the special nature of the process with which it must be backed up. Apple Music does not allow its library data to be stored on a file server; it can therefore only be added to the automated backup process by manual intervention. For details see the Music Library section further down on this page. The Music Library is small (less than 1 GB at the time of writing) and has a very low growth rate (because it mainly consists of metadata). Some files in the data set change frequently (the metadata library), while others remain quite static (album covers).
Photo Library 
This data set contains photos taken by one of the cameras in the household, usually while travelling. These photos are treated separately from the images in the "Media files" data set, and form a data set in their own right, because the Photo Library is stored in a different location and because it has different characteristics than the "Media files" data set. The Photo Library is very large, almost as large again as the "Media files" data set. Also, the Photo Library is relatively fast growing (each trip adds a few more GB). Files in the data set do not change much; the data set usually just grows over time. Metadata such as Exif tags may be an exception (e.g. when I perform geo-tagging).
Movie Library 
This data set contains movies. These movies are treated separately from the videos in the "Media files" data set, and form a data set in their own right, simply because of the huge size of the data set (hundreds of GB). Files in the data set practically never change; the data set usually just grows over time.
Software images 
This data set is an archive of disk images (e.g. Mac OS X install images, C64 floppy disk images), virtual environment images (e.g. Virtual Box machines, DosBox game folders) and other chunks of data that can be seen as software images (e.g. old BIOS dumps). This data set is very large (more than 100 GB at the time of writing) and has a relatively low growth rate (with spikes when new images are added). Files in the data set practically never change; the data set usually just grows over time.
Software archive 
This data set is an archive of software packages. This data set is very large (more than 100 GB at the time of writing) and has a relatively low growth rate. Files in the data set do not change; the data set just grows over time.
Linux server data 
This data set contains various databases (e.g. Mediawiki, CardDAV contacts, CalDAV calendar data, LDAP data) in the form of nightly database dumps, Git repositories, and other data (e.g. users' home directories). This data set is not too large (less than 10 GB at the time of writing) and has a low growth rate. Files in the data set change frequently because the nightly database dumps usually differ from the previous dumps by at least a few bytes.
Linux server configuration 
This data set contains the configuration of the Linux server, which is mainly the /etc/ folder. This data set is small (less than 1 GB at the time of writing) and has an extremely low growth rate. Files in the data set do not change a lot.
Games data 
This data set contains images and videos originating from my playing computer games (e.g. screenshots and videos from Elite: Dangerous). This data set is very large (more than 100 GB at the time of writing) and has a relatively low growth rate (with spikes when I am on one of my periodic gaming sprees). Files in the data set do not change a lot because I don't do image/video editing.
ISFDB cover scans 
This data set contains scans of book covers that I make while working as an editor for the ISFDB project. This data set is not too large (less than 20 GB at the time of writing) and has a relatively low growth rate. Files in the data set do not change; the data set just grows over time.
Data archive 
This data set is an archive of various chunks of data (old Windows data from the 1990s PC era, an archive of mailing lists from lists.herzbube.ch, etc.). This data set is not too large (less than 10 GB at the time of writing) and has a low growth rate. With the exception of a TrueCrypt encrypted container, files in the data set do not change; the data set just grows over time.


Backup strategies

Summary

Note: The backup type columns are explained in separate sections after this summary section.


Data set                    Size            On-site  Off-site  Snapshot  Manual  Encrypted  Mirror  Remarks
                                            backup   copy                copy    backup
Regular data                < 10 GB         n/a      x         x         -       x          -
Media files                 < 200 GB        n/a      x         x         -       x          -
Music Library               < 10 GB         n/a      (x)       (x)       x       x          -       The off-site copy and snapshot are made from an intermediate snapshot that needs to be manually updated.
Photo Library               < 100 GB        n/a      (x)       (x)       x       x          -       The off-site copy and snapshot are made from an intermediate snapshot that needs to be manually updated.
Movie Library               Hundreds of GB  n/a      No backup                               x       This data set is too large to include it in the regular backup.
Software images             > 100 GB        n/a      No backup                               x       This data set is too large to include it in the regular backup.
Software archive            > 100 GB        n/a      No backup                               x       This data set is too large to include it in the regular backup.
Linux server data           < 10 GB         x        x         x         -       x          -
Linux server configuration  < 1 GB          x        x         x         -       x          -
Games data                  > 100 GB        n/a      No backup                               x       This data set is not important enough to include it in the regular backup (especially in view of its size).
ISFDB cover scans           > 10 GB         n/a      No backup                               x       This data set is not important enough to include it in the regular backup.
Data archive                < 10 GB         n/a      x         x         -       x          -


On-site backup

On pelargir an automated on-site backup job runs every night at 01:00. The details are on the wiki page Backupninja.

The resulting backup files are rotated in several stages: Daily (7x), weekly (4x) and monthly (12x). Configuration details are on this wiki page.
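As an illustration only (the real rotation is configured as described on the wiki pages linked above; paths and the rotate helper shown here are hypothetical), the general idea of such a multi-stage rotation looks like this:

# Illustrative sketch only - not the actual Backupninja configuration.
# Keep the last 7 daily copies of a backup file by shifting numbered suffixes;
# the weekly (4x) and monthly (12x) stages work the same way.
rotate() {
    local prefix="$1" count="$2"
    # drop the oldest copy, then shift the remaining copies up by one
    rm -f "${prefix}.${count}"
    for i in $(seq "$((count - 1))" -1 1); do
        [ -e "${prefix}.${i}" ] && mv "${prefix}.${i}" "${prefix}.$((i + 1))"
    done
}

rotate /var/backups/mysql-dump.sql.gz 7
cp /var/backups/mysql-dump.sql.gz.new /var/backups/mysql-dump.sql.gz.1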


Off-site copy

Every three days an automated job creates an off-site copy of the backup files created on-site on pelargir. The same job also creates a copy of some of the other data sets that are stored on the intranet file server.
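As a sketch only (the actual cron job schedule and command are documented on the Raspberry Pi and BackupScripts wiki pages; the script name shown here is a placeholder), such a schedule could look like this in a crontab:

# Hypothetical crontab sketch; "*/3" in the day-of-month field approximates
# "every three days". The real schedule and command are on the Raspberry Pi
# and BackupScripts wiki pages.
# m  h   dom  mon  dow   command
30   2   */3  *    *     /usr/local/bin/offsite-copy.sh >>/var/log/offsite-copy.log 2>&1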


The automated off-site copy job currently runs on a Raspberry Pi - see that wiki page for documentation of the cron job schedule. The scripts are documented on the BackupScripts wiki page. The copied data is placed on an external hard disk that is attached via USB to the Raspberry Pi. Important aspects of that storage medium:

  • This is not the same hard disk as the one from which the intranet file server serves data. Storing the data on two independent physical devices guards against physical hardware failure. Unfortunately, in the case of the file server data there is no geographical separation between the two storage devices, but that can't be helped for the moment.
  • The hard disk with the backup data is not accessible from intranet desktop machines via regular file server mechanisms (e.g. Samba). This makes sure that the backup data cannot be corrupted by malware, especially ransomware.
  • Because the hard disk is accessed only once every few days, the mechanical and electrical parts of the device should not wear out for a long time, and premature disk failure is therefore rather unlikely. If the disk spins for 3 hours on every backup (a conservative estimate), and 150 backups are made per year (including some non-scheduled extra backups), the result is a ridiculous 450 hours of usage per year.


Snapshot

Every week an automated job creates a snapshot of the off-site copied data. This is achieved by feeding the data into a Time Machine-like tool. These weekly snapshots can be used to restore file versions and entire folder structures as they appeared in the past. This versioning is a safeguard not only against accidental changes by a user, but also against undetected corruption of the source data sets due to hardware/software malfunction or due to malware.

Note that the snapshot tool must support data de-duplication to prevent excessive growth of the snapshot database.
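Purely as an illustration of the technique (not necessarily the tool that is actually in use), hard-link based snapshots with rsync achieve this kind of de-duplication; the "latest"/date-stamped directory layout below is an assumption:

# Illustrative sketch only: rsync's --link-dest creates a new snapshot in which
# unchanged files are hard links into the previous snapshot, so only changed
# files consume additional disk space.
today=$(date +%Y-%m-%d)
rsync -a --delete \
      --link-dest=/mnt/backup-snapshot/latest \
      /mnt/backup-copy/ \
      "/mnt/backup-snapshot/${today}/"
ln -sfn "/mnt/backup-snapshot/${today}" /mnt/backup-snapshot/latest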


The automated snapshot job currently runs on the same Raspberry Pi as the automated off-site copy job - see that wiki page for documentation of the cron job schedule. The scripts are documented on the BackupScripts wiki page. Snapshots are placed on yet another external hard disk, different from the one on which the off-site copy of the backup data lives, but also attached to the Raspberry Pi via USB. All considerations from the previous section regarding a physically separate storage medium also apply to this hard disk.


Encrypted backup

At least annually I create an encrypted backup of all critical data that I don't want to lose even if some disaster (fire, theft) strikes my home. Unfortunately I have to do this manually, so discipline is required.

The backup data is placed on an external hard disk that I then store somewhere away from home. Because some of the source data sets contain sensitive data, the entire hard disk is encrypted so that I don't have to worry about theft or other unauthorized access.

Currently there is still sufficient space on the encrypted hard disk to make an entire clone of the snapshot data (/mnt/backup-snapshot). Should this at some point no longer be possible, the encrypted backup will instead be made from the most recent data set copies (/mnt/backup-copy).

Technical details about hard disk encryption can be found on the Disk Maintenance Wiki page. The commands to perform the encrypted backup are these:

# Open the LUKS partition.
# This requires the passphrase with which the partition is encrypted.
sudo cryptsetup luksOpen /dev/sdd1 encrypted-backup

# Mount the filesystem inside the LUKS partition.
# The block device should already be in /etc/fstab.
sudo mount /dev/mapper/encrypted-backup /mnt/encrypted-backup

# Clone the latest snapshot data
htb-mkbackupcopy.sh /mnt/backup-snapshot /mnt/encrypted-backup/ >>/mnt/encrypted-backup/encrypted-backup.log 2>&1

# Unmount the filesystem and close the LUKS partition
sudo umount /mnt/encrypted-backup
sudo cryptsetup luksClose encrypted-backup

If the backup is expected to take a long time, then the process can be started via cron. See the file /etc/cron.d/manual-backup-scripts for details.
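A sketch of what such an entry might look like (the content of /etc/cron.d/manual-backup-scripts on the server is authoritative; the date/time shown here is arbitrary):

# Hypothetical sketch of an entry in /etc/cron.d/manual-backup-scripts: run the
# encrypted backup at a scheduled time, then comment the line out again.
# Files in /etc/cron.d must specify the user (here: root) before the command;
# use the full path to htb-mkbackupcopy.sh if it is not in cron's PATH.
# m  h   dom  mon  dow   user   command
15  22   24   6    *     root   htb-mkbackupcopy.sh /mnt/backup-snapshot /mnt/encrypted-backup/ >>/mnt/encrypted-backup/encrypted-backup.log 2>&1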


Mirror

Due to their bulk (which in total amounts to several TB of data) it is impractical to include some of the data sets in the automated backup processes that run regularly (once per week or even more frequently). Specific reasons why it is impractical:

  • The regular backup processes would take much too long.
  • Versioning of bulk data might cause the snapshot database to grow to a potentially unmanageable size (e.g. it could become impossible to store the snapshot database on a single disk).

Along with the encrypted backup (see previous section) I therefore manually create a simple mirror backup of those large and not-so-important data sets that are left out from the regular backup processes. The mirror backup data is placed on yet another external hard disk that I then store somewhere away from home, in the same place as the encrypted backup.

The mirror backup differs from the encrypted backup in the following ways:

  • The mirror backup is not encrypted. This is acceptable because the source data sets do not contain sensitive data.
  • The mirror backup does not contain versioned snapshots, i.e. a new mirror backup completely overwrites the previous mirror backup. This strategy carries the risk that good backup data is overwritten by corrupted data. This risk must perforce be accepted because, as mentioned above, versioning of bulk data may cause an unmanageably large snapshot database.

The commands to perform the mirror backup are as follows. Note that the files/folders to include or exclude are based on the backup cron jobs documented on the Raspberry Pi wiki page. Also note that the Movie Library data set is mirrored in a separate directory, the reason being that the script htb-mkbackupcopy.sh uses the rsync option --delete-excluded, which would cause the first command to delete a previously mirrored Movie Library data set because it excludes the /Media folder.

# Copy everything in the "alles-andere" folder, except the Movie Library data set
# and the folders/files that are already included in the regular backup
htb-mkbackupcopy.sh -e "/Media" \
                    -e "/Backup/Snapshots" \
                    -e "/Archiv/mailman.tar.gz" \
                    -e "/Archiv/OldWindowsData" \
                    -e "/Archiv/Work" \
                    -e "/Archiv/facebook-herzbube102.zip" \
                    /mnt/fileserver/alles-andere/ \
                    /mnt/backup-mirror/fileserver/alles-andere-no-movies/ \
                    >> /mnt/backup-mirror/fileserver-alles-andere-no-movies.log 2>&1

# Copy the Movie Library data set in a separate step, because it is located in
# the /Media folder, which was excluded in the first command
htb-mkbackupcopy.sh /mnt/fileserver/alles-andere/Media/Filme/ \
                    /mnt/backup-mirror/fileserver/alles-andere-movies/ \
                    >> /mnt/backup-mirror/fileserver-alles-andere-movies.log 2>&1

If the backup is expected to take a long time, then the process can be started via cron. See the file /etc/cron.d/manual-backup-scripts for details.


Hard disk health

Hard disk health can be verified by running the smartctl tool, e.g.

smartctl -a /dev/sda

If the disk is not recognized it may be necessary to specify the device type with the -d parameter. The Devices wiki page should have a list of the hard disks connected to the Raspberry Pi and the smartctl command line needed to read their SMART status.
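For example (the exact device type depends on the USB enclosure; "sat" is just one common case):

# Explicitly specify the device type, e.g. "sat" for a SATA disk behind a
# USB-to-SATA bridge that smartctl does not auto-detect.
smartctl -a -d sat /dev/sda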

The important thing is that in the SMART attributes table the number in the VALUE column does not fall below the number in the THRESH column. The meaning of each of the SMART attributes can be looked up on Wikipedia.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   213   211   021    Pre-fail  Always       -       4325
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2362
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11624
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       186
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       47
193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       9584
194 Temperature_Celsius     0x0022   115   107   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0


Backup of desktop machines

Summary

Currently desktop machines (Mac, Windows) are not backed up automatically. Almost all of the important data is located on file servers, and programs can be re-installed. Some data, however, exists only locally on some of those desktop machines. Not all of that local data is worth backing up, but here is a list of what I find important enough to warrant the effort of a periodic manual backup:

  • Photo Library
  • Music Library (the playlists and ratings, not the music files - these are already stored on the file server)


I manage the backup of this data via the following process:

  • I keep a snapshot of the data on the file server. This snapshot is picked up by the automatic backup processes described in the previous sections.
  • All I have to do is update the snapshot on the file server from time to time by making a manual copy of the data. The next time one of the automated backup processes runs, it simply picks up the updated snapshot.
  • Snapshots of Mac data are stored on the file server as sparse bundles which internally use the HFSX file system. This makes it possible to preserve resource fork data and extended attributes.
  • Because I want the snapshot to change as little as possible, I use rsync to update it. Unfortunately a few bands of the sparse bundle change every time the disk image stored in the bundle is mounted, even if no data inside the disk image is actually modified. But that's a price I'm willing to pay for the ease of use afforded by sparse bundles.
  • The version of rsync that ships with Mac OS X (2.6.9) is ancient and doesn't properly handle resource forks and extended attributes. For this reason it is important that a modern version of rsync is installed on the system via a package manager such as Homebrew (see the sketch after this list).
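The following sketch illustrates both points; the sparse bundle size, volume name and target path are assumptions, not the actual values used:

# Hypothetical sketch: create a sparse bundle with a case-sensitive HFS+ (HFSX)
# filesystem on the mounted file server share. Size, name and path are examples.
hdiutil create -size 200g -type SPARSEBUNDLE -fs "Case-sensitive Journaled HFS+" \
    -volname "Photo Library" "/Volumes/alles-andere/Backup/Snapshots/Photo Library.sparsebundle"

# Install a modern rsync via Homebrew and make sure it is found before the
# outdated /usr/bin/rsync that ships with Mac OS X.
brew install rsync
which rsync && rsync --version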


Photo Library

Why is the photo library not stored on the file server? Currently I manage the photo library with digikam. In theory it would be possible to store digikam's internal database and the photo library on a disk image that is located on the file server, but working with the photo library would then probably become very slow - at least that was the case when I was still using iPhoto, and I can't imagine that this would be different with digikam. For this reason, the photo library must be stored locally on a desktop machine or on a fast external hard disk.


One of the following commands updates the snapshot on the file server:

# Use this command if the "alles-andere" share is not mounted on the machine.
# Important: DO NOT USE A TRAILING SLASH for the local folder to be snapshotted,
# otherwise rsync will place the CONTENT of the local folder into the root folder
# of the sparse bundle.
htb-mkbackupcopy.sh -s "//pi@raspberrypi1/alles-andere" "$HOME/Pictures/digikam" "Backup/Snapshots/Photo Library"

# Use this command if the "alles-andere" share is mounted on the machine for the current user.
# A numeric suffix possibly has to be added to the mount point (e.g. "/Volumes/alles-andere-1"),
# if some other user on the machine has mounted the share first.
htb-mkbackupcopy.sh "$HOME/Pictures/digikam" "/Volumes/alles-andere/Backup/Snapshots/Photo Library"


Music Library

As explained above, the Music Library data consists of playlists, ratings and possibly other metadata, not the music files themselves. Why is the Music Library not stored on the file server? Unfortunately Apple Music does not allow the user to choose an arbitrary storage location for the Apple Music library data. It's possible to trick Apple Music by creating a symbolic link that refers to a location on the file server, but this is a fragile solution that easily breaks for two reasons:

  • When multiple users mount the same network volume (e.g. a Samba share), Mac OS X unfortunately assigns mount points on a first-come-first-served basis. This means that mount points are not stable between reboots, and sometimes not even between login sessions without a reboot. This in turn means that the symbolic link to the Apple Music library often points at the wrong location, requiring the symlink to be manually re-created every so often with an updated target location.
  • Sometimes the symbolic link gets deleted and needs to be re-created.

For these reasons I have decided to stop fighting Apple Music and simply accept that some data must be stored locally.


One of the following commands updates the snapshot on the file server. Note: The sparse bundle name still contains "iTunes" even though iTunes has been superseded by Apple Music. I will eventually rename the sparse bundle, but for the moment it's too much effort to update all scripts and configurations involved in the backup solution.

# Use this command if the "alles-andere" share is not mounted on the machine
# Important: DO NOT USE A TRAILING SLASH for the local folder to be snapshotted,
# otherwise rsync will place the CONTENT of the local folder into the root folder
# of the sparse bundle.
htb-mkbackupcopy.sh -s "//pi@raspberrypi1/alles-andere" -d "Backup/Snapshots/iTunes Library.sparsebundle" "$HOME/Music/Music" "."

# Use this command if the "alles-andere" share is mounted on the machine for the current user.
# A numeric suffix possibly has to be added to the mount point (e.g. "/Volumes/alles-andere-1"),
# if some other user on the machine has mounted the share first.
htb-mkbackupcopy.sh -d "/Volumes/alles-andere/Backup/Snapshots/iTunes Library.sparsebundle" "$HOME/Music/Music" "."

# Use this command if the sparse bundle image is mounted on the machine for the current user.
htb-mkbackupcopy.sh "$HOME/Music/Music" "/Volumes/iTunes Library"


Backup history

My pre-2016 backup scheme required me to manually create backups and burn them to DVD. This section - although obsolete - shows how in the 13 years that I have been working on my self-hosting project I managed to create merely two of these manual backups. It is a monument both to my personal lack of discipline when it comes to making backups, and to the danger that is inherent in manual backups because they require such discipline. In 2016 a catastrophic hard disk failure on my MacMini occurred and - with the most recent backup seven (!) years out of date - made me realize how foolish I had been. I paid my dues to a data rescue service and then brought the era of manual backups in my household to an end.


Basic inventory:

osgiliath system backups:

  • 06.05.2007 (1 DVD)
  • 06.03.2009 (2 DVDs)

iPhoto library:

  • 05.03.2007 (1 DVD with albums 2001-2005, 1 DVD with albums 2006+2007)

Important data backup:

  • 08.05.2007 (2 DVDs)
  • 06.03.2009 (1 DVD)