bfsync - Documentation

SYNOPSIS

bfsync [--version] <command> <args>…

DESCRIPTION

bfsync is a file-synchronization tool which makes it possible to keep a collection of big files synchronized on many machines. There are two types of repositories: master repositories and checkouts. Master repositories only contain the history information. Usually they should be stored on a central server which all computers that view/edit this data can reach (at least some of the time). Master repositories are created using bfsync-init(1). The other repository type is a checkout. A checkout can be created using bfsync-clone(1). Besides the history information, checkouts contain the actual file contents, so a checkout can be huge (hundreds of gigabytes) whereas a master repository is usually small.

To view/edit the data in a checked out repository, the repository must be mounted using bfsyncfs(1). Once mounted, bfsync can be called with the commands given below. For instance bfsync commit/push/pull can be used to modify the local history and resynchronize the history with the master history. Other commands like bfsync get/put can be used to transfer data between checkouts.

For transfer, bfsync needs to be installed on every system that needs to be accessed. The actual transfer is done using ssh. Transfer has only been tested with ssh keys; it is highly recommended to use ssh-agent to avoid entering your password over and over again.
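For example, a typical ssh-agent session could look like this (the key path is illustrative and depends on your setup):

client1:~$ eval "$(ssh-agent)"
client1:~$ ssh-add ~/.ssh/id_rsa

Start the agent and load the key once; bfsync transfers over ssh in the same session will then not ask for the passphrase again.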

It is possible to use bfsync for backups; in that case, you use only a subset of what bfsync can do to get a versioned FUSE filesystem with deduplication. See the section BACKUP WALKTHROUGH for a description.

OPTIONS

For sub-commands like bfsync pull or bfsync commit, the options depend on the command. In these cases, options are documented on the sub-command manual page.

--version

If this option is used and no command is specified, bfsync prints out the current version.

COMMANDS

Main Commands

bfsync-check(1)

Checks local repository for completeness.

bfsync-clone(1)

Clone a bfsync repo from a bfsync master repo.

bfsync-collect(1)

Scan directory for file contents required by repository.

bfsync-commit(1)

Commit changes to the repository.

bfsync-delete-version(1)

Delete one or more versions from repository.

bfsync-diff(1)

Diff text file changes between last and current version.

bfsync-expire(1)

Mark old versions as deleted to reduce disk usage.

bfsync-gc(1)

Cleanup unused files.

bfsync-get(1)

Transfer file contents from repository.

bfsync-init(1)

Create a new bfsync master repo.

bfsync-log(1)

Display the history of the repository.

bfsync-pull(1)

Pull changes from master repository.

bfsync-push(1)

Push changes to master repository.

bfsync-put(1)

Transfer file contents to repository.

bfsync-recover(1)

Recover database to a consistent state.

bfsync-repo-files(1)

Search a directory for files that are also in the repo.

bfsync-revert(1)

Go back to a previous version.

bfsync-status(1)

Show information about uncommitted changes.

bfsync-undelete-version(1)

Undelete one or more versions from repository.

Additional Commands

bfsync-check-integrity(1)

Check repository integrity.

bfsync-config-set(1)

Modify repository configuration file.

bfsync-config-unset(1)

Remove entry from repository configuration.

bfsync-continue(1)

Continue previously interrupted command.

bfsync-copy-expire(1)

Copy expire tags from source to target repository.

bfsync-disk-usage(1)

Generate disk usage information.

bfsync-export-history(1)

Export history from a repository.

bfsync-file-log(1)

Display the history of one file.

bfsync-find-missing(1)

Show filenames of files where file contents are unavailable.

bfsync-get-repo-id(1)

Get unique bfsync repo id.

bfsync-need-continue(1)

Check whether bfsync continue needs to be run.

bfsync-need-recover(1)

Check whether bfsync recover needs to be run.

bfsync-need-upgrade(1)

Check whether bfsync upgrade needs to be run.

bfsync-new-files(1)

Show which files were added for a given version.

bfsync-sql-export(1)

Export versioned file list to postgres database.

bfsync-transfer-bench(1)

Measure transfer speed from remote host to local host.

bfsync-upgrade(1)

Upgrade repository contents from an old bfsync version.

CONFIGURATION

Every bfsync checkout has a file called "config", which can be used to set configuration variables for this checkout.

use-uid-gid 0|1

Bfsync was designed to store all file metadata, including the user id and group id of each file. These numbers only make sense if all checkouts use the same uid/gid-to-name mappings.

Since for most users we cannot assume that the uid/gid numbers are the same on every system that has a checkout, bfsync defaults to ignoring the access permissions and uid/gid numbers stored in the repository. All files will appear to belong to the user that mounted the filesystem, and access rights will also not be enforced.

To use the uid/gid numbers and enforce access rights, set use-uid-gid to 1. This is for instance useful if you want to copy data into the repository as root and preserve the ownership of the files.

get-rate-limit <get-limit-kb>

Set the maximum transfer rate in kilobytes/sec that bfsync get will use. This is helpful if your internet connection has a limited speed: that way you can ensure that bfsync does not saturate your line completely.

put-rate-limit <put-limit-kb>

Set the maximum transfer rate in kilobytes/sec that bfsync put will use.
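As an illustration, a checkout's config file could combine these keys like this; the values are examples only, the notation follows the descriptions in this section, and an existing config file of your bfsync version should be checked for the exact syntax:

use-uid-gid 1
get-rate-limit 500
put-rate-limit 200

This would enforce the stored uid/gid numbers and access rights, and limit bfsync get to 500 kilobytes/sec and bfsync put to 200 kilobytes/sec.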

default { get "<url>|<path>"; }

Set default location for get (an <url> or <path>) to be used if bfsync get is called without an argument.

default { put "<url>|<path>"; }

Set default location for put (an <url> or <path>) to be used if bfsync put is called without an argument.

default { pull "<url>|<path>"; }

Set default location for pull (an <url> or <path>) to be used if bfsync pull is called without an argument.

default { push "<url>|<path>"; }

Set default location for push (an <url> or <path>) to be used if bfsync push is called without an argument.

default { copy-expire "<url>|<path>"; }

Set default location for copy-expire (an <url> or <path>) to be used if bfsync copy-expire is called without an argument.

The configuration keys in the default group can be set simultaneously, for instance by using

default {
  get "...";
  put "...";
  push "...";
  pull "...";
  ...
}

expire { keep-most-recent <N>; }

Keep <N> most recent versions during expire.

expire { create-daily first|last; }

Tag first/last backup of the day as daily backup during expire.

expire { keep-daily <N>; }

Keep the newest <N> daily backups during expire.

expire { create-weekly <weekday>; }

Tag daily backup on <weekday> as weekly backup during expire. Possible values for <weekday> are monday, tuesday, …, sunday.

expire { keep-weekly <N>; }

Keep the newest <N> weekly backups during expire.

expire { create-monthly first|last; }

Tag first/last daily backup of the month as monthly backup during expire.

expire { keep-monthly <N>; }

Keep the newest <N> monthly backups during expire.

expire { create-yearly first|last; }

Tag first/last daily backup of the year as yearly backup during expire.

expire { keep-yearly <N>; }

Keep the newest <N> yearly backups during expire.

The configuration keys in the expire group can be set simultaneously, for instance by using

expire {
  keep-most-recent 30;
  keep-daily 45;
  keep-monthly 30;
  ...
}

sql-export { database <database>; }

Use the postgres database <database> for the sql-export command.

sql-export { user <user>; }

Use the postgres user <user> for the sql-export command.

sql-export { password <password>; }

Use the postgres password <password> for the sql-export command.

sql-export { host <host>; }

Use the postgres host <host> for the sql-export command.

sql-export { port <port>; }

Use the postgres port <port> for the sql-export command.

The configuration keys in the sql-export group can be set simultaneously, for instance by using

sql-export {
  database bfsync;
  user postgres;
  password secret;
  ...
}

SHARED MEMORY CONFIGURATION

Shared memory is used by bfsync to access the Berkeley DB database contents from different processes: the bfsync FUSE filesystem process, bfsyncfs, and the python frontend, bfsync. Under Linux, the amount of shared memory is usually limited by three system-wide kernel parameters:

/proc/sys/kernel/shmall

The maximum amount of shared memory that can be allocated.

/proc/sys/kernel/shmmax

The maximum size of a shared memory segment.

/proc/sys/kernel/shmmni

The maximum number of shared memory segments.

These limits need to be large enough to allow bfsync to allocate the required amount of shared memory. The amount of shared memory required mainly depends on the cache size. Bfsync will use somewhat more shared memory than the cache size, but setting the limits too high is usually not a problem.

Example: If you’re using three bfsync filesystems with 256 MB cache per filesystem, you can do so if shmall is 2 GB and shmmax is 512 MB. shmmni is usually not an issue, because bfsync doesn’t use many segments (about 4 per filesystem).

To display your current limits, you can use:

# Display the system wide shared memory limits.
server:~$ ipcs -lm
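On a typical Linux system, the output looks roughly like this (the values depend on your kernel configuration):

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 524288
max total shared memory (kbytes) = 2097152
min seg size (bytes) = 1

In this illustrative output, shmmax corresponds to 512 MB and shmall to 2 GB, matching the example limits above.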

To adjust shared memory settings at boot time, create a file called /etc/sysctl.d/90-bfsync-shm.conf:

# Shared memory settings for bfsync

# Maximum size of shared memory segment in bytes
# 512 MB
kernel.shmmax = 536870912

# Maximum total size of shared memory in pages (normally 4096 bytes)
# 2 GB
kernel.shmall = 524288

Note that if you have other programs that also need shared memory, you need to coordinate the settings across all programs that use shared memory. It is also not a problem if your limits are higher than needed, so if the system-wide limit for shmall is already 8 GB, there is no need to adjust it.

After creating this file, the settings will be loaded at boot time. To activate the shared memory configuration without rebooting, you can use

# Load shared memory settings (as root).
server:~$ sysctl -p /etc/sysctl.d/90-bfsync-shm.conf

MERGES

bfsync allows independent modifications of the data/history contained in different checkouts. Upon push, bfsync will check that the master history doesn’t contain new commits that are unknown to the local checkout. If two clients modify the repository independently, the first client that uses bfsync push will simply reintegrate its changes into the master history, and the new master history will be this client’s history.

However, if the second client tries a bfsync push, the push will be refused. To resolve the situation, the second client can use bfsync-pull(1). Once it is detected that merging both histories is necessary, a merge algorithm will be used. For non-conflicting changes, everything will be merged automatically.

Non-conflicting changes could be:

master history has new file F - client 2 has new file G

After merging, both files will be present in the repository

master history has new dir A, with new files in it - client 2 has new dir B, with new files in it

After merging, both directories will be part of the repository

master history has renamed file F to G - client 2 has renamed dir X to Y

After merging, both renames will be done

master history has new file X - client 2 has new file X

In this case, one of the files will be renamed to X~1; since both were added independently, it is likely that the user wants to keep both files.

However, there are situations where the merge algorithm can’t merge both histories automatically:

master history has edited file F - client 2 has edited file F

In this case, bfsync pull will ask the user to resolve the situation; it is possible to keep the master version, the local version, or both.

master history has edited file F - client 2 has deleted file F

bfsync pull will ask the user in this case; it is possible to either keep the file with changes, or delete it.

In any case, after the merge decisions are made (if any), the merge algorithm will use them to modify the local history so that it can be executed without conflicts after the master history. After this step, the modified local commits will be based on the master history. This means that then, bfsync push will succeed, and the modified changes of client 2 can be pushed to the master history.
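In terms of commands, the session on client 2 typically looks like this (a sketch; the exact questions asked depend on the conflicts found):

client2:~/big$ bfsync push

This push is refused, because the master history contains commits unknown to client 2.

client2:~/big$ bfsync pull

Pull merges both histories and asks the user to resolve any conflicting changes.

client2:~/big$ bfsync push

The modified local commits are now based on the master history, so this push succeeds.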

Note that the master history is always linear, so the history branch that was present before the merge algorithm was used will no longer be visible in the history after the pull. The merged history will simply contain the old history (before client 1 and client 2 made their changes), the changes made on client 1, an extra merge commit (if necessary to resolve merge issues), and the modified changes of client 2.

WALKTHROUGH

First, we create and setup repositories on three computers: server, client1 and client2. The server will hold the master repository (which manages the history, but nothing else). It is stored under ~/repos/big.bfsync. All computers will contain a checkout, so that the actual contents of the files can be kept there.

server:~$ mkdir repos

Create a directory on the server for the master repository.

server:~$ cd repos

Change dir.

server:~/repos$ bfsync init big.bfsync

Init master repo.

server:~/repos$ cd ~

Change dir.

server:~$ bfsync clone repos/big.bfsync

Clone repository on the server.

server:~$ mkdir big

Create mount point on the server.

server:~$ bfsyncfs big.bfsync big

Mount repository on the server.

client1:~$ bfsync clone server:repos/big.bfsync

Clone repository on client1.

client1:~$ mkdir big

Create mount point on client1.

client1:~$ bfsyncfs big.bfsync big

Mount repository on client1.

client2:~$ bfsync clone server:repos/big.bfsync

Clone repository on client2.

client2:~$ mkdir big

Create mount point on client2.

client2:~$ bfsyncfs big.bfsync big

Mount repository on client2.

As a second step, we add a music file on client1. Of course it is possible to add more files in one step; you can also use rsync, mc or a file manager to copy files into the repository. Whenever files are added or otherwise changed, we need to commit and push the changes to the server, so that it contains the canonical index of files.
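For instance, instead of the plain cp used below, a whole directory could be copied into the mounted checkout with rsync (the paths are purely illustrative):

client1:~$ rsync -av ~/music/ ~/big/music/

In the following steps we keep it simple and copy a single file with cp.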

client1:~$ cd big

Change dir.

client1:~/big$ cp ~/download/01-some-music.flac .

Copy a big file into the repository checkout.

client1:~/big$ bfsync commit

Commit the changes to the repository.

client1:~/big$ bfsync push

Push the changes to the server.

So far, we have added the file to the repository on client1, but the contents of the file are only present on client1, and not in the other repos. To change this, we can transfer the file to the server.

server:~$ cd big

Change directory.

server:~/big$ bfsync pull

Using pull is required on the server before we can transfer the file there. By pulling, the server obtains the necessary information; in other words, the server now knows that a file named 01-some-music.flac is part of the bfsync repository and should be present. Running bfsync check will report one missing file after this step.
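To verify this, the check can be run directly on the server (optional; see bfsync-check(1)):

server:~/big$ bfsync check

Check the server checkout for completeness; at this point it reports the one missing file.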

client1:~/big$ bfsync put server:big

Now the actual transfer: after this step, both client1 and server will have a copy of 01-some-music.flac.

As the last step, we’ll transfer the file to client2. Of course we could use the same commands that we used to get the file to the server, but let’s assume that client2 is behind a firewall, and that it’s not possible to ssh to client2 directly. Fortunately, besides uploading files to another host (bfsync put), it’s also possible to download data from another host (bfsync get).

client2:~$ cd big

Change directory.

client2:~/big$ bfsync pull

Update directory information.

client2:~/big$ bfsync get server:big

Get the file from the server.

BACKUP WALKTHROUGH

Since bfsync implements file-level deduplication and versioning of files, it can be used for backups. Backups typically contain lots of files (for example 5,000,000 files). Therefore you can only use a subset of the available commands for backups, since some commands do not work well when the number of files is that large.

Commands like commit, get, put, push, pull, check and gc should work fine for backup usage. Deleting old backups using expire (and, where needed, copy-expire) is also supported. However, advanced functions like merges might never be supported for backups; for typical backup scenarios this is not an issue.

The first step for backups is to set up repositories. All steps should be done as root. For this example, we assume that our backup harddisk is mounted to /backup.

server:/backup$ bfsync init master

Setup master repository

server:/backup$ bfsync clone -u -c 500 master repo

Clone repository, ensure uid/gid are stored and set cache size.

The cache size is important for backups: if it is too small, the backup will take a lot more time. However, since the cache is stored in shared memory, an overly large cache may use too much of the system memory. As a rule of thumb, 100 megabytes of cache should be used for every 1,000,000 files that are stored in the backup. More is better, if you can afford it.
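For example, under this rule of thumb the -c 500 used above (assuming the cache size is given in megabytes) provides a 500 megabyte cache, which is sized for a backup of roughly 5,000,000 files.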

server:/backup$ mkdir mnt

Create mount point for the backup repository.

server:/backup$ bfsyncfs repo mnt

Mount repository.

server:/backup$ cd /backup/mnt

Change dir.

Now that everything is initialized, we can backup some data. For this example we backup /home.

server:/backup/mnt$ rsync -axH --delete /home/ home

Copy everything from /home to the backup. This is the initial backup, so all files will be copied to the backup harddisk.

The rsync options we use here are:

-a

Copy all file attributes.

-x

Exclude everything that is not on the filesystem that /home is on.

-H

Backup hardlinks as hardlinks.

--delete

Delete files in the target directory that are not in the source directory.

server:/backup/mnt$ bfsync commit -m "initial backup"

Snapshot current state, run deduplication.

server:/backup/mnt$ bfsync push

Push changes into the master repository. This is a precaution for the case that your repository gets damaged due to disk failure. Having the metadata stored twice makes it possible to recover your repository in that case (by cloning again from the master using bfsync-clone(1) and reassembling the data files using bfsync-collect(1)).

We now have the initial full backup. One day later, we only need to back up the changes (which will be a lot faster than the initial backup), like this:

server:/backup/mnt$ rsync -axH --delete /home/ home

Copy changes from /home to the backup.

server:/backup/mnt$ bfsync commit -m "first incremental backup"

Snapshot current state, run deduplication.

server:/backup/mnt$ bfsync push

Push changes into the master repository.

Now, we’ve created the first incremental backup. This usually uses a lot less additional disk space than the initial full backup, since usually only a few files will have changed. To access an individual backup, you can use

server:/backup/mnt$ cd /backup/mnt/.bfsync/commits/2/home

Access a specific version. The version log can be viewed with bfsync-log(1).

To automate the process, a script which runs the rsync and commit steps every night can be used. Removing the contents of old backups is currently not supported, but will be available in the future.
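A minimal nightly script, assuming the paths from this walkthrough (the script name and install location are illustrative), could look like this:

#!/bin/sh
# Nightly bfsync backup - sketch, adjust paths to your setup
set -e
cd /backup/mnt
rsync -axH --delete /home/ home
bfsync commit -m "nightly backup $(date +%F)"
bfsync push

It could then be run every night from cron, for instance with an /etc/cron.d entry like:

0 3 * * * root /usr/local/sbin/bfsync-nightly.sh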

The command line for creating a backup of the root filesystem is:

server:/backup/mnt$ rsync -axH --delete / root

Copy changes from / to the backup.

If you back up more than one filesystem every day, you only need to commit once: first rsync all filesystems, then commit as the last step, as shown below.
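For example, a daily run covering both / and /home could look like this (a sketch using the directories from this walkthrough):

server:/backup/mnt$ rsync -axH --delete / root
server:/backup/mnt$ rsync -axH --delete /home/ home
server:/backup/mnt$ bfsync commit -m "daily backup"
server:/backup/mnt$ bfsync push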

UPDATING FROM AN OLD VERSION

The repository format is not (yet) stable across bfsync versions. In some cases upgrades from an old to the current bfsync version can be done automatically by running

server:/path/to/files.bfsync$ bfsync upgrade

This is the easiest way. If it doesn’t work, you need to manually convert your old repositories to the new version. There is currently no easy way to preserve the history when manually updating from an old version of bfsync, but if you only need to preserve the repository content, you can use the following steps:

  1. install the old version and the new version of bfsync in parallel on one machine

    Use different prefixes to make this happen (configure --prefix=...).

  2. create a new empty master repository and checkout

    This will become your new repository & checkout.

  3. copy all files from the old repository to the new repository

    You’ll need to mount both the old and the new bfsync repository using bfsyncfs. Copying can be done with a file manager, cp -a or rsync (a sketch follows after this list). You need to copy everything except the .bfsync directory.

  4. commit and push in the new repository

    You have a new repository now, conversion on this machine is finished.
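A possible way to perform the copy in step 3 is rsync, assuming the old checkout is mounted at /mnt/old and the new one at /mnt/new (both mount points are illustrative):

newrepo:~$ rsync -aH --exclude=/.bfsync /mnt/old/ /mnt/new/

This copies the file contents and attributes while leaving out the .bfsync directory.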

To avoid retransfer if you have the data on other machines, the following steps can be used:

  1. checkout the new master repository on a satellite system

    Now you have a new bfsync repository, but the data is still missing.

  2. use bfsync collect to get your data into the new repository without retransfer

    Since you already have a bfsync checkout on the satellite system, you can simply get the data from there, without retransfer. Since bfsync-collect(1) automatically detects whether it needs a file or not using the file contents, you can simply use

    bfsync collect /path/to/files.bfsync

    to get the data from your old checkout into the new repository.

Repeat these steps on all machines that contain checkouts of your repository. You can delete the old format checkouts after verifying with bfsync-check(1) that all files you need are there. You do not need to install old and new bfsync versions on the satellite systems. Only the new bfsync is required to perform the checkout & collect steps.
