Discussion:
blockfs: concatenating fuse module for efficient block device backups
Lennert Buytenhek
2011-04-04 10:06:31 UTC
Hi!

Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.

My first idea was to write a FUSE module to interpose between a block
device and a virtual representation of that block device in a FUSE
filesystem, relaying all I/O requests to the underlying block device,
but keeping a separate database of per-sector or per-block (or some
other granularity) mtimes. A separate tool would then compare that
database to that of a remote host, and perform an rsync-like operation
to bring the remote host in sync. While this would work, it seemed
overly complex for the job, and it would end up being a half-assed
reimplementation of rsync plus a filesystem.

So then I figured, since *NIX filesystems already track mtimes on a
per-inode basis, why not just keep the master copy of the block device
data in a set of, say, 1 MiB files, and have a FUSE module emulate a
larger file out of this, redirecting the I/Os to the component files?
This way, you don't need to track mtimes yourself (the underlying
filesystem will do that for you), and you can use standard tools like
rsync to synchronise the component files to another host.
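
Synchronisation then becomes a plain rsync of the parts tree. As an
illustration (hostname and destination path made up), a daily cron job
could do something like

rsync -a --inplace parts/ backuphost:/backup/vm-parts/

since rsync's quick check (size + mtime) will skip every component file
that hasn't been written to since the last run, which is exactly the
property this layout is meant to exploit.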

The attached FUSE module implements this idea. To get started, do
something like this:

cd /some/where

mkdir parts
mkdir parts/volume
for part in `seq 0 99`
do
dd if=/dev/zero of=parts/volume/$part bs=1024k count=1
done
echo 100 > parts/volume/num_parts
echo 1048576 > parts/volume/part_size

mkdir mount
~/bin/blockfs parts/ mount/

In mount/, a file 'volume' will then appear, which should be 100 MiB
in size, and which you can dd over, create filesystems on, loopback
mount, etc. The underlying data for the volume will be stored in
the 100 1 MiB component files in parts/, with reads and writes to
the volume file being redirected to those component files.
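
For example (mount point made up; -F because mkfs is pointed at a
regular file rather than a block device):

mkfs.ext4 -F mount/volume
mount -o loop mount/volume /mnt/test

Writes to /mnt/test then land in individual component files under
parts/volume/, updating only the mtimes of the parts actually touched.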

When a write happens, blockfs will first read the original data for
that byte range from the component file, and if it matches the new
data, it won't perform the write, to avoid updating the mtime on the
component file (see part_write() in the attached code).

With modern filesystems like ext4 and xfs, having many large files in
a single directory shouldn't incur a performance penalty like it would
on other filesystems, e.g. on ext3 without extents or htree, so this
approach should be okay.

There are some things left to be done:

- There should really be some kind of caching in blockfs -- not for
data, but for things like metadata and component file descriptors.
This is currently not done, as it doesn't seem possible to do things
like setting timers or asking the fuse main loop to poll on some
file descriptors for you (e.g. inotify fds for the component directory).

At some point I want to look into integrating the fuse main loop
with ivykis (http://libivykis.sourceforge.net/man3/ivykis.3.html)
to be able to address this.

- There should be an option for suspending I/O while a backup of the
backing store is made, so that you don't end up with a backup copy
of your block device that has had writes to it reordered. (A rough
sketch of one possible approach follows after this list.)

- Establish some kind of rule of thumb for the optimal component
file size. This probably depends on hardware specs, the method used
to synchronise the data to the slave copy or copies, write pattern
and write intensity, etc, but it should be possible to figure out a
sane default (or a set of defaults) to recommend.

- The name -- 'blockfs' doesn't really cover the functionality of this
module very well. Any better ideas?
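
As for the I/O-suspend item above, a reader/writer lock might already
be enough. A rough sketch (the gating wrapper and the suspend trigger
are made up here, not part of the attached module; the real handlers
would take the lock in shared mode):

static pthread_rwlock_t io_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Each data path takes the lock in shared mode... */
static int blockfs_write_gated(const char *path, const char *buf,
	size_t size, off_t offset, struct fuse_file_info *fi)
{
	int ret;

	pthread_rwlock_rdlock(&io_lock);
	ret = blockfs_write(path, buf, size, offset, fi);
	pthread_rwlock_unlock(&io_lock);

	return ret;
}

/* ...and whatever triggers the backup takes it exclusively for the
 * duration of the copy: */
static void suspend_io(void) { pthread_rwlock_wrlock(&io_lock); }
static void resume_io(void) { pthread_rwlock_unlock(&io_lock); }

That would only guarantee that no writes are in flight while the
backup runs; getting in-flight data onto the component files would
additionally need the fsync FIXME in the code sorted out.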

Any comments otherwise?


cheers,
Lennert


=== blockfs.c
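/*
 * blockfs: present a directory tree of fixed-size component files as
 * single large volume files, so that the backing filesystem's
 * per-inode mtimes can drive cheap incremental (rsync) backups.
 *
 * Build hint (assumes fuse 2.6+ and pkg-config; not from the
 * original posting):
 *
 *   gcc -Wall -o blockfs blockfs.c $(pkg-config --cflags --libs fuse)
 */
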
#define PACKAGE_VERSION "0.1"

#define _GNU_SOURCE
#define FUSE_USE_VERSION 26

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <fuse.h>
#include <limits.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int backing_dir_fd;
static pthread_mutex_t readdir_lock;

static int is_volume_dir(int dirfd)
{
struct stat buf;

if (fstatat(dirfd, "num_parts", &buf, 0) < 0)
return 0;

if (fstatat(dirfd, "part_size", &buf, 0) < 0)
return 0;

return 1;
}

static int open_volume_dir(const char *volume)
{
int dirfd;

dirfd = openat(backing_dir_fd, volume,
O_RDONLY | O_DIRECTORY | O_NOATIME);
if (dirfd < 0)
return -ENOENT;

if (!is_volume_dir(dirfd)) {
close(dirfd);
return -EINVAL;
}

return dirfd;
}

/* Iterate over all volume directories inside the backing dir. The
 * dup()ed fd shares its file offset with backing_dir_fd, hence the
 * rewinddir() and the readdir_lock serialising concurrent walks. */
static void
iter_volumes(void *cookie, void (*iter)(void *cookie, char *volume, int dirfd))
{
int fd;
DIR *dirp;
struct dirent entry;
struct dirent *result;

pthread_mutex_lock(&readdir_lock);

fd = dup(backing_dir_fd);
if (fd < 0)
goto out;

dirp = fdopendir(fd);
if (dirp == NULL) {
close(fd);
goto out;
}

rewinddir(dirp);

while (readdir_r(dirp, &entry, &result) == 0) {
int dirfd;

if (result == NULL)
break;

if (result->d_type != DT_DIR && result->d_type != DT_UNKNOWN)
continue;

if (result->d_name[0] == '.')
continue;

dirfd = open_volume_dir(result->d_name);
if (dirfd < 0)
continue;

iter(cookie, result->d_name, dirfd);

close(dirfd);
}

closedir(dirp);

out:
pthread_mutex_unlock(&readdir_lock);
}

static int read_property(int dirfd, char *name, unsigned int *_value)
{
int fd;
FILE *f;
unsigned int value;

fd = openat(dirfd, name, O_RDONLY | O_NOATIME);
if (fd < 0)
return -1;

f = fdopen(fd, "r");
if (f == NULL) {
close(fd);
return -1;
}

if (fscanf(f, "%u", &value) != 1) {
fclose(f);
return -1;
}

*_value = value;
fclose(f);

return 0;
}

static void count_volumes_cb(void *cookie, char *volume, int dirfd)
{
(*((unsigned int *)cookie))++;
}

static int count_volumes(void)
{
unsigned int count = 0;

iter_volumes(&count, count_volumes_cb);

return count;
}

static int mkpartname(char *buf, int buf_size, unsigned int part)
{
return snprintf(buf, buf_size, "%u", part);
}

static int open_part(int dirfd, unsigned int part, int flags)
{
char partname[32];

mkpartname(partname, sizeof(partname), part);

return openat(dirfd, partname, flags | O_NOATIME);
}

struct blockfs_getattr_info {
unsigned int part_size;
int part_size_mismatches;
struct timespec st_atim;
struct timespec st_mtim;
struct timespec st_ctim;
};

static int timespec_cmp(struct timespec *a, struct timespec *b)
{
if (a->tv_sec > b->tv_sec)
return 1;

if (a->tv_sec < b->tv_sec)
return -1;

if (a->tv_nsec > b->tv_nsec)
return 1;

if (a->tv_nsec < b->tv_nsec)
return -1;

return 0;
}

static void blockfs_getattr_cb(void *_info, unsigned int part, struct stat *buf)
{
struct blockfs_getattr_info *info = _info;

if (buf->st_size != info->part_size)
info->part_size_mismatches++;

if (timespec_cmp(&buf->st_atim, &info->st_atim) > 0)
info->st_atim = buf->st_atim;

if (timespec_cmp(&buf->st_mtim, &info->st_mtim) > 0)
info->st_mtim = buf->st_mtim;

if (timespec_cmp(&buf->st_ctim, &info->st_ctim) > 0)
info->st_ctim = buf->st_ctim;
}

static int
iter_parts_stat(int dirfd, unsigned int num_parts, void *cookie,
void (*iter)(void *cookie, unsigned int part, struct stat *buf))
{
int i;

for (i = 0; i < num_parts; i++) {
char partname[32];
struct stat buf;

mkpartname(partname, sizeof(partname), i);

if (fstatat(dirfd, partname, &buf, 0) < 0) {
perror("fstatat");
return -1;	/* the caller owns (and closes) dirfd */
}

iter(cookie, i, &buf);
}

return 0;
}

static int blockfs_getattr(const char *path, struct stat *stbuf)
{
struct stat buf;
int dirfd;
unsigned int num_parts;
unsigned int part_size;
struct blockfs_getattr_info info;

if (path[0] == 0)
return -ENOENT;

memset(stbuf, 0, sizeof(struct stat));

if (strcmp(path, "/") == 0) {
int ret;

ret = fstat(backing_dir_fd, &buf);
if (ret < 0)
return -ENOENT;

stbuf->st_mode = buf.st_mode;
stbuf->st_nlink = 2 + count_volumes();
stbuf->st_uid = buf.st_uid;
stbuf->st_gid = buf.st_gid;
stbuf->st_size = 1024;
stbuf->st_blksize = 1024;
stbuf->st_blocks = 1;
stbuf->st_atim = buf.st_atim;
stbuf->st_mtim = buf.st_mtim;
stbuf->st_ctim = buf.st_ctim;

return 0;
}

dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;

if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}

if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}

if (fstat(dirfd, &buf) < 0) {
perror("fstat");
close(dirfd);
return -EINVAL;
}

stbuf->st_mode = S_IFREG | (buf.st_mode & 0666);
stbuf->st_nlink = 1;
stbuf->st_uid = buf.st_uid;
stbuf->st_gid = buf.st_gid;
stbuf->st_size = (uint64_t)num_parts * part_size;
stbuf->st_blksize = part_size;
stbuf->st_blocks = ((uint64_t)num_parts * part_size + 511) / 512;

memset(&info, 0, sizeof(info));
info.part_size = part_size;
if (iter_parts_stat(dirfd, num_parts, &info, blockfs_getattr_cb) ||
info.part_size_mismatches) {
close(dirfd);
return -EINVAL;
}

stbuf->st_atim = info.st_atim;
stbuf->st_mtim = info.st_mtim;
stbuf->st_ctim = info.st_ctim;

close(dirfd);

return 0;
}

static int blockfs_truncate(const char *path, off_t length)
{
/* Silently fail. */

return 0;
}

static int blockfs_open(const char *path, struct fuse_file_info *fi)
{
int dirfd;
unsigned int num_parts;
unsigned int part_size;
int mode;
int i;

if (path[0] == 0)
return -ENOENT;

dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;

if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}

if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}

mode = ((fi->flags & O_ACCMODE) == O_RDONLY) ? R_OK : W_OK;

for (i = 0; i < num_parts; i++) {
char partname[32];

mkpartname(partname, sizeof(partname), i);

if (faccessat(dirfd, partname, mode, 0) < 0) {
close(dirfd);
return -EPERM;
}
}

close(dirfd);

return 0;
}

/* Read up to 'num' bytes from component file 'part' at offset 'off'.
 * A missing part file is treated as a zero-byte (short) read. */
static int
part_read(int dirfd, unsigned int part, off_t off, char *buf, int num)
{
int fd;
int ret;

fd = open_part(dirfd, part, O_RDONLY);
if (fd < 0)
return 0;

ret = pread(fd, buf, num, off);
if (ret < 0)
ret = -errno;

close(fd);

return ret;
}

static int blockfs_read(const char *path, char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
int dirfd;
ssize_t numread;	/* signed so a negative errno can be returned */
unsigned int num_parts;
unsigned int part_size;

if (path[0] == 0)
return -ENOENT;

dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;

if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}

if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}

numread = 0;
while (size) {
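/* Map the absolute volume offset onto a component file: 'part'
 * indexes the file, 'part_off' is the offset within it, and
 * 'part_end' is the absolute offset where the next part begins,
 * so a single request never crosses a part boundary. */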
unsigned int part;
off_t part_off;
off_t part_end;
size_t toread;
int ret;

part = offset / part_size;
part_off = offset % part_size;

part_end = (part + 1) * (uint64_t)part_size;

toread = size;
if (offset + toread > part_end)
toread = part_end - offset;

ret = part_read(dirfd, part, part_off, buf, toread);
if (ret < 0) {
if (numread == 0)
numread = ret;
break;
}

buf += ret;
size -= ret;
offset += ret;
numread += ret;

if (ret != toread)
break;
}

close(dirfd);

return numread;
}

/* Write 'num' bytes to component file 'part' at offset 'off', but
 * only after reading back the old contents: if nothing changed, the
 * write is skipped so that the part file keeps its old mtime. */
static int
part_write(int dirfd, unsigned int part, off_t off, const char *buf, int num)
{
char rbuf[num];
int fd;
int ret;

fd = open_part(dirfd, part, O_RDWR);
if (fd < 0)
return 0;

ret = pread(fd, rbuf, num, off);
if (ret != num || memcmp(buf, rbuf, num)) {
ret = pwrite(fd, buf, num, off);
if (ret < 0)
ret = -errno;
}

close(fd);

return ret;
}

static int blockfs_write(const char *path, const char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
int dirfd;
ssize_t numwritten;	/* signed so a negative errno can be returned */
unsigned int num_parts;
unsigned int part_size;

if (path[0] == 0)
return -ENOENT;

dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;

if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}

if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}

numwritten = 0;
while (size) {
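/* Same offset-to-part mapping as in blockfs_read() above. */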
unsigned int part;
off_t part_off;
off_t part_end;
size_t towrite;
int ret;

part = offset / part_size;
part_off = offset % part_size;

part_end = (part + 1) * (uint64_t)part_size;

towrite = size;
if (offset + towrite > part_end)
towrite = part_end - offset;

ret = part_write(dirfd, part, part_off, buf, towrite);
if (ret < 0) {
if (numwritten == 0)
numwritten = ret;
break;
}

buf += ret;
size -= ret;
offset += ret;
numwritten += ret;

if (ret != towrite)
break;
}

close(dirfd);

return numwritten;
}

static void
blockfs_statfs_part_cb(void *num, unsigned int part, struct stat *buf)
{
((uint64_t *)num)[1] += buf->st_size;
}

static void blockfs_statfs_volume_cb(void *num, char *volume, int dirfd)
{
unsigned int num_parts;

((uint64_t *)num)[0]++;
if (read_property(dirfd, "num_parts", &num_parts) >= 0)
iter_parts_stat(dirfd, num_parts, num, blockfs_statfs_part_cb);
}

static int blockfs_statfs(const char *path, struct statvfs *buf)
{
uint64_t num[2];

num[0] = 1;
num[1] = 0;
iter_volumes(num, blockfs_statfs_volume_cb);

buf->f_bsize = 4096;
buf->f_frsize = 4096;
buf->f_blocks = num[1] / 4096;
buf->f_bfree = 0;
buf->f_bavail = 0;
buf->f_files = num[0];
buf->f_ffree = 0;
buf->f_favail = 0;
buf->f_fsid = 0;
buf->f_flag = 0;
buf->f_namemax = PATH_MAX;

return 0;
}

static int blockfs_fsync(const char *path, int datasync,
struct fuse_file_info *fi)
{
/* FIXME. */
sync();

return 0;
}

struct blockfs_readdir_info {
void *buf;
fuse_fill_dir_t filler;
};

static void blockfs_readdir_cb(void *_info, char *name, int dirfd)
{
struct blockfs_readdir_info *info = _info;

info->filler(info->buf, name, NULL, 0);
}

static int blockfs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
off_t offset, struct fuse_file_info *fi)
{
struct blockfs_readdir_info info = { buf, filler };

if (strcmp(path, "/") != 0)
return -ENOENT;

filler(buf, ".", NULL, 0);
filler(buf, "..", NULL, 0);
iter_volumes(&info, blockfs_readdir_cb);

return 0;
}

static struct fuse_operations blockfs_oper = {
.getattr = blockfs_getattr,
.truncate = blockfs_truncate,
.open = blockfs_open,
.read = blockfs_read,
.write = blockfs_write,
.statfs = blockfs_statfs,
.fsync = blockfs_fsync,
.readdir = blockfs_readdir,
};


enum {
KEY_HELP,
KEY_VERSION,
};

static struct fuse_opt blockfs_opts[] = {
FUSE_OPT_KEY("-h", KEY_HELP),
FUSE_OPT_KEY("--help", KEY_HELP),
FUSE_OPT_KEY("-V", KEY_VERSION),
FUSE_OPT_KEY("--version", KEY_VERSION),
FUSE_OPT_END,
};

static void usage(const char *progname)
{
fprintf(stderr,
"Usage: %s backingdir mountpoint [options]\n"
"\n"
"General options:\n"
" -h --help print help\n"
" -V --version print version\n"
"\n", progname);
}

static char *backing_dir;

static int blockfs_opt_proc(void *data, const char *arg, int key,
struct fuse_args *outargs)
{
if (key == FUSE_OPT_KEY_NONOPT) {
if (backing_dir == NULL) {
backing_dir = strdup(arg);
return 0;
}
return 1;
}

if (key == KEY_HELP) {
usage(outargs->argv[0]);
fuse_opt_add_arg(outargs, "-ho");
fuse_main(outargs->argc, outargs->argv, &blockfs_oper, NULL);
exit(1);
}

if (key == KEY_VERSION) {
fprintf(stderr, "MERGE version: %s\n", PACKAGE_VERSION);
fuse_opt_add_arg(outargs, "--version");
fuse_main(outargs->argc, outargs->argv, &blockfs_oper, NULL);
exit(0);
}

return 1;
}

int main(int argc, char *argv[])
{
struct fuse_args args = FUSE_ARGS_INIT(argc, argv);

if (fuse_opt_parse(&args, NULL, blockfs_opts, blockfs_opt_proc) == -1)
exit(1);

if (backing_dir == NULL) {
fprintf(stderr, "missing backing dir\n");
fprintf(stderr, "see '%s -h' for usage\n", argv[0]);
exit(1);
}

backing_dir_fd = open(backing_dir, O_RDONLY | O_DIRECTORY | O_NOATIME);
if (backing_dir_fd < 0) {
perror("open");
return 1;
}

pthread_mutex_init(&readdir_lock, NULL);

return fuse_main(args.argc, args.argv, &blockfs_oper, NULL);
}
Charlie Bennett
2011-04-04 11:18:43 UTC
Post by Lennert Buytenhek
Hi!
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
My first idea was to write a FUSE module to interpose between a block
device and a virtual representation of that block device in a FUSE
filesystem,
Maybe your second idea should be to read the documentation for
dm-snapshot. If I understand your use case, it is generally how such
things are done.

http://www.mjmwired.net/kernel/Documentation/device-mapper/snapshot.txt

Should be vastly more efficient than trying to implement what appears
to be effectively a UNIONFS.


ccb
Lennert Buytenhek
2011-04-04 12:30:11 UTC
Post by Charlie Bennett
Post by Lennert Buytenhek
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
My first idea was to write a FUSE module to interpose between a block
device and a virtual representation of that block device in a FUSE
filesystem,
Maybe your second idea should be
Thanks for the reminder of why I don't participate in a lot of open
source projects anymore these days.
Post by Charlie Bennett
to read the documentation for dm-snapshot. If I understand your use
case, it is generally how such things are done.
http://www.mjmwired.net/kernel/Documentation/device-mapper/snapshot.txt
I'm aware of the DM snapshot target, but I don't see how it helps for my
use case, which is basically to find an efficient way (in terms of CPU
time and bandwidth used) of comparing a very large file on a local
system to an older version of that very large file on a remote system.
Stef Bon
2011-04-04 13:35:35 UTC
Hi,

when reading your case (blockfs) I see some similarities to FUSE fs's
that are written to access the Amazon S3 online storage. For example
s3backer, which offers access to a block device.

As far as I understand it works the same way, by only updating per block.

Let me know if it's of any use. I hope you're not thinking "well another
reason why I quit OS projects....."

Stef
Post by Lennert Buytenhek
Post by Charlie Bennett
Post by Lennert Buytenhek
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images.  I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
My first idea was to write a FUSE module to interpose between a block
device and a virtual representation of that block device in a FUSE
filesystem,
Maybe your second idea should be
Thanks for the reminder of why I don't participate in a lot of open
source projects anymore these days.
Post by Charlie Bennett
to read the documentation for dm-snapshot.  If I understand your use
case, it is generally how such things are done.
http://www.mjmwired.net/kernel/Documentation/device-mapper/snapshot.txt
I'm aware of the DM snapshot target, but I don't see how it helps for my
use case, which is basically to find an efficient way (in terms of CPU
time and bandwidth used) of comparing a very large file on a local
system to an older version of that very large file on a remote system.
Lennert Buytenhek
2011-04-04 14:45:22 UTC
Post by Stef Bon
Hi,
Hello,
Post by Stef Bon
when reading your case (blockfs) I see some simularities to FUSE fs's
that are written to access the Amazon S3 online storage. For example
s3backer, which offers access to a block device.
As far as I understand it works the same way, by only updating per block.
Yes, the ideas are pretty similar. I've looked at S3 and s3backer, but
they won't do the job for my use case. s3backer is useful mostly to
run as a separate block device, running a separate file system, to rsync
copies of your data to, but you really don't want to run your virtual
machines directly on block devices served by s3backer, as that limits
your disk I/O bandwidth to the speed of your internet connection, and
you incur the (pretty high) S3 transfer pricing for every I/O.


thanks,
Lennert
Stef Bon
2011-04-04 14:53:33 UTC
Post by Lennert Buytenhek
Post by Stef Bon
Hi,
Hello,
Post by Stef Bon
when reading your case (blockfs) I see some simularities to FUSE fs's
that are written to access the Amazon S3 online storage. For example
s3backer, which offers access to a block device.
As far as I understand it works the same way, by only updating per block.
Yes, the ideas are pretty similar.  I've looked at S3 and s3backer, but
they won't do the job for my use case.  s3backer is useful mostly to
run as a separate block device, running a separate file system, to rsync
copies of your data to, but you really don't want to run your virtual
machines directly on block devices served by s3backer, as that limits
your disk I/O bandwidth to the speed of your internet connection, and
you incur the (pretty high) S3 transfer pricing for every I/O.
No of course not. I'm not suggesting to have everything connected to
the internet directly, or the local network.

As far as I know, this is what Nikolaus Rath is suggesting: keeping
an administration of checksums both locally and on the server, and
comparing those two.

I believe that Google Docs also works that way. I was reading the
Google Docs API some time ago, and it uses these checksums as well. When
there is a difference, action should be taken.

I believe in the case of Google, the object is the document and other
files, in your case the block is the object of interest.

Stef
Post by Lennert Buytenhek
thanks,
Lennert
Lennert Buytenhek
2011-04-04 15:11:42 UTC
Post by Stef Bon
Post by Lennert Buytenhek
Post by Stef Bon
when reading your case (blockfs) I see some simularities to FUSE fs's
that are written to access the Amazon S3 online storage. For example
s3backer, which offers access to a block device.
As far as I understand it works the same way, by only updating per block.
Yes, the ideas are pretty similar.  I've looked at S3 and s3backer, but
they won't do the job for my use case.  s3backer is useful mostly to
run as a separate block device, running a separate file system, to rsync
copies of your data to, but you really don't want to run your virtual
machines directly on block devices served by s3backer, as that limits
your disk I/O bandwidth to the speed of your internet connection, and
you incur the (pretty high) S3 transfer pricing for every I/O.
No of course not. I'm not suggesting to have everything connected to
the internet directly, or the local network.
Sorry if I was unclear -- I didn't want to claim that you suggested that. :)
Post by Stef Bon
As far as I know, this is what Nikolaus Rath is suggesting: keeping
an administration of checksums both locally and on the server, and
comparing those two.
I believe that Google Docs also works that way. I was reading the
Google Docs API some time ago, and it uses these checksums as well. When
there is a difference, action should be taken.
I believe in the case of Google, the object is the document and other
files, in your case the block is the object of interest.
See my reply to Nikolaus -- in my case, the block devices can be large,
while write intensity is typically pretty low. Maintaining checksums
would be an option, but I'd rather just maintain the per-piece mtimes
(by having the files in a filesystem) and let rsync deal with
checksumming once it considers a part to have changed.


thanks,
Lennert
Nikolaus Rath
2011-04-04 14:05:33 UTC
Post by Lennert Buytenhek
I'm aware of the DM snapshot target, but I don't see how it helps for my
use case, which is basically to find an efficient way (in terms of CPU
time and bandwidth used) of comparing a very large file on a local
system to an older version of that very large file on a remote system.
The way to do that is to compute blockwise checksums on both systems and
then transfer and compare those (as rsync does).

Now, your file system avoids having to read the file and compute the
checksums at compare time. However, if your description is correct, your
file system instead does the comparison on every write: instead of
just overwriting the existing data, it is first read and compared to the
new data.

Effectively, this may be much more costly than reading the entire file
at compare time: the file may be updated several times between
comparisons, and my guess is that it really rarely happens that a
program will rewrite a block without any changes at all.

You could improve things by avoiding the comparison and always writing
the new data. But this still leaves you with the problem that you are
slowing down the more common operation (updating the file) to speed up
the less common operation (synchronizing it).


Best,

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
Lennert Buytenhek
2011-04-04 15:05:13 UTC
Post by Nikolaus Rath
Post by Lennert Buytenhek
I'm aware of the DM snapshot target, but I don't see how it helps for my
use case, which is basically to find an efficient way (in terms of CPU
time and bandwidth used) of comparing a very large file on a local
system to an older version of that very large file on a remote system.
The way to do that is to compute blockwise checksums on both systems and
then transfer and compare those (as rsync does).
It's all about what the typical access pattern is. If the block device
is large (say, 100G), and only infrequently sees writes, then the
decision to go with checksumming all of that 100G every day for the daily
backup just so that you don't have to stick a hook into the block write
path might not be justified.
Post by Nikolaus Rath
Now, your file system avoids having to read the file and compute the
checksums at compare time. However, if your description is correct, your
file system instead does the comparison on every write: instead of
just overwriting the existing data, it is first read and compared to the
new data.
That is merely an optimisation (an optimisation of which I'm not even
sure how effective it really is), and the system still works fine without
it.
Post by Nikolaus Rath
Effectively, this may be much more costly than reading the entire file
on compare time: the file may be updated several times between
comparisons, and my guess is that it really rarely happens that a
program will rewrite a block without any changes at all.
You could improve things by avoiding the comparison and always writing
the new data. But this still leaves you with the problem that you are
slowing down the more common operation (updating the file) to speed up
the less common operation (synchronizing it).
Except that the operation of synchronising consists of huge numbers of
little operations, the majority of which, assuming low write intensity,
you can probably avoid in most cases by "slowing down the more common
operation".

Updating a disk block maybe costs you 0.1ms of CPU time, while computing
block checksums for a 100G disk image will easily take half an hour of
wall clock time, so if you do daily backups but do less than a couple
million writes per day (at 0.1ms each, two million writes add up to only
some 200 seconds of CPU time), you still save time and power by having
that hook in the write path.

And yes, for very I/O intensive applications it might not be justified.


thanks,
Lennert
Nikolaus Rath
2011-04-05 01:08:52 UTC
Post by Lennert Buytenhek
Post by Nikolaus Rath
Now, your file system avoids having to read the file and compute the
checksums at compare time. However, if your description is correct, your
file system instead does the comparison on every write: instead of
just overwriting the existing data, it is first read and compared to the
new data.
That is merely an optimisation (an optimisation of which I'm not even
sure how effective it really is), and the system still works fine without
it.
Yeah, what I'm trying to say is: I believe it will most certainly work
better without it.
Post by Lennert Buytenhek
Post by Nikolaus Rath
Effectively, this may be much more costly than reading the entire file
on compare time: the file may be updated several times between
comparisons, and my guess is that it really rarely happens that a
program will rewrite a block without any changes at all.
You could improve things by avoiding the comparison and always writing
the new data. But this still leaves you with the problem that you are
slowing down the more common operation (updating the file) to speed up
the less common operation (synchronizing it).
Except that the operation of synchronising consists of bignums of
little operations, and assuming low write intensity, the majority of
which you can probably avoid in most cases by "slowing down the more
common operation".
Updating a disk block maybe costs you 0.1ms of CPU time, while computing
block checksums for a 100G disk image will easily take half an hour of
wall clock time, so if you do daily backups but do less than a couple
million writes per day, you still save time and power by having that
hook in the write path.
And yes, for very I/O intensive applications it might not be justified.
Well, you mentioned virtual machine images as your use case. If the
machine is running, you have *lots* of updates. I wouldn't be surprised
if the additional path through fuse would slow write operations down by
a factor of two. So I agree that your idea works fine for low I/O, but I
don't think that VM images qualify as such.


Best,

-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
Lennert Buytenhek
2011-04-05 08:40:38 UTC
Post by Nikolaus Rath
Post by Lennert Buytenhek
Post by Nikolaus Rath
Now, your file system avoids having to read the file and compute the
checksums at compare time. However, if your description is correct, your
file system instead does the comparison on every write: instead of
just overwriting the existing data, it is first read and compared to the
new data.
That is merely an optimisation (an optimisation of which I'm not even
sure how effective it really is), and the system still works fine without
it.
Yeah, what I'm trying to say is: I believe it will most certainly work
better without it.
Your point is taken. I'll add some counters to see how often doing the
comparison saves doing a write.
Post by Nikolaus Rath
Post by Lennert Buytenhek
Post by Nikolaus Rath
Effectively, this may be much more costly than reading the entire file
on compare time: the file may be updated several times between
comparisons, and my guess is that it really rarely happens that a
program will rewrite a block without any changes at all.
You could improve things by avoiding the comparison and always writing
the new data. But this still leaves you with the problem that you are
slowing down the more common operation (updating the file) to speed up
the less common operation (synchronizing it).
Except that the operation of synchronising consists of bignums of
little operations, and assuming low write intensity, the majority of
which you can probably avoid in most cases by "slowing down the more
common operation".
Updating a disk block maybe costs you 0.1ms of CPU time, while computing
block checksums for a 100G disk image will easily take half an hour of
wall clock time, so if you do daily backups but do less than a couple
million writes per day, you still save time and power by having that
hook in the write path.
And yes, for very I/O intensive applications it might not be justified.
Well, you mentioned virtual machine images as your use case. If the
machine is running, you have *lots* of updates.
Well, most virtual machines are typically idle, at least over here.
And even if disk I/O is going on, that typically seems to be concentrated
in some areas of the disk.
Post by Nikolaus Rath
I wouldn't be surprised if the additional path through fuse would
slow write operations down by a factor of two. So I agree that your
idea works fine for low IO, but I don't think that VM images qualify
as such.
I guess that more statistics are needed, and some more testing to come
up with some guidelines to see for which workloads this is a win.


thanks,
Lennert
Charles Bennett
2011-04-05 10:48:02 UTC
Post by Lennert Buytenhek
Post by Charlie Bennett
Post by Lennert Buytenhek
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
My first idea was to write a FUSE module to interpose between a block
device and a virtual representation of that block device in a FUSE
filesystem,
Maybe your second idea should be
Thanks for the reminder of why I don't participate in a lot of open
source projects anymore these days.
Post by Charlie Bennett
to read the documentation for dm-snapshot. If I understand your use
case, it is generally how such things are done.
http://www.mjmwired.net/kernel/Documentation/device-mapper/snapshot.txt
I'm aware of the DM snapshot target, but I don't see how it helps for my
use case, which is basically to find an efficient way (in terms of CPU
time and bandwidth used) of comparing a very large file on a local
system to an older version of that very large file on a remote system.
It wasn't my intention to come off like a dick. In my day job working
for a major storage vendor I see a lot of people spending huge sums of
money trying to solve these kinds of deduplication problems. They are
not at all easy to solve.

The essence of my comment should have been more like: is the filesystem
the appropriate place for attempting to solve this particular version of
this problem? Or are there less compute-intensive block-level solutions
near at hand?

Looks like you're "bought into" a filesystem solution. In which case
you came to the right place.

ccb
Mark Ruijter
2011-04-04 14:18:05 UTC
Hi Lennert,

Please take a look at lessfs (www.lessfs.com). It is a
deduplicating+encrypting+compressing filesystem with replication.
I think that it does everything you want, and even more.

That said, I may not be completely unbiased.

Mark Ruijter
Post by Lennert Buytenhek
Hi!
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. [...]
[rest of the original post and the full blockfs.c listing trimmed]
Lennert Buytenhek
2011-04-04 20:16:32 UTC
Hoi Mark,

Maybe I've just not been looking in the right place, but is there some
architectural overview document of lessfs somewhere?


thanks,
Lennert
Post by Mark Ruijter
Hi Lennert,
Please take a look at lessfs (www.lessfs.com). It is a
deduplicating+encrypting+compressing filesystem with replication.
I think that it does everything you want, and even more.
That said, I may not be completely unbiased.
Mark Ruijter
Post by Lennert Buytenhek
Hi!
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
My first idea was to write a FUSE module to interpose between a block
device and a virtual representation of that block device in a FUSE
filesystem, relaying all I/O requests to the underlying block device,
but keeping a separate database of per-sector or per-block (or some
other granularity) mtimes. A separate tool would then compare that
database to that of a remote host, and perform an rsync-like operation
to bring the remote host in sync. While this would work, it seemed
overly complex for the job, and it would end up being a half-assed
reimplementation of rsync plus a filesystem.
So then I figured, since *NIX filesystems already track mtimes on a
per-inode basis, why not just keep the master copy of the block device
data in a set of, say, 1 MiB files, and have a FUSE module emulate a
larger file out of this, redirecting the I/Os to the component files?
This way, you don't need to track mtimes yourself (the underlying
filesystem will do that for you), and you can use standard tools like
rsync to synchronise the component files to another host.
The attached FUSE module implements this idea. To get started, do
something like this
cd /some/where
mkdir parts
mkdir parts/volume
for part in `seq 0 99`
do
dd if=/dev/zero of=parts/volume/$data bs=1024k count=1
done
echo 100 > parts/volume/num_parts
echo 1048576 > parts/volume/part_size
mkdir mount
~/bin/blockfs parts/ mount/
In mount/, a file 'volume' will then appear, which should be 100MiB
in size, and which you can dd over, create filesystems on, loopback
mount, etc. The underlying data for the volume will be stored in
the 100 1 MiB component files in parts/, with reads and writes to
the volume file being redirected to those component files.
When a write happens, blockfs will first read the original data for
that byte range from the component file, and if that matches, it won't
perform the write, to avoid updating the mtime on the component file.
With modern filesystems like ext4 and xfs, having many large files in
a single directory shouldn't incur a performance penalty like it would
on other filesystems, e.g. on ext3 without extents or htree, so this
approach should be okay.
- There should really be some kind of caching in blockfs -- not for
data, but for things like metadata and component file descriptors.
This is currently not done, as it doesn't seem possible to do things
like setting timers or asking the fuse main loop to poll on some
file descriptors for you (e.g. inotify fds for the component directory).
At some point I want to look into integrating the fuse main loop
with ivykis (http://libivykis.sourceforge.net/man3/ivykis.3.html)
to be able to address this.
- There should be an option for suspending I/O while a backup of the
backing store is made, so that you don't end up with a backup copy
of your block device that has had writes to it reordered.
- Establish some kind of rule of thumb for what the optimal component
file size is. This probably depends on hardware specs, method of
synchronisation the data to the slave cop{y,ies}, write pattern and
write intensity, etc, but it should be possible to figure out a sane
default (or a set of defaults) to recommend.
- The name -- 'blockfs' doesn't really cover the functionality of this
module very well. Any better ideas?
Any comments otherwise?
cheers,
Lennert
=== blockfs.c
#define PACKAGE_VERSION "0.1"
#define _GNU_SOURCE
#define FUSE_USE_VERSION 26
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <fuse.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
static int backing_dir_fd;
static pthread_mutex_t readdir_lock;
static int is_volume_dir(int dirfd)
{
struct stat buf;
if (fstatat(dirfd, "num_parts", &buf, 0) < 0)
return 0;
if (fstatat(dirfd, "part_size", &buf, 0) < 0)
return 0;
return 1;
}
static int open_volume_dir(const char *volume)
{
int dirfd;
dirfd = openat(backing_dir_fd, volume,
O_RDONLY | O_DIRECTORY | O_NOATIME);
if (dirfd < 0)
return -ENOENT;
if (!is_volume_dir(dirfd)) {
close(dirfd);
return -EINVAL;
}
return dirfd;
}
static void
iter_volumes(void *cookie, void (*iter)(void *cookie, char *volume, int dirfd))
{
int fd;
DIR *dirp;
struct dirent entry;
struct dirent *result;
pthread_mutex_lock(&readdir_lock);
fd = dup(backing_dir_fd);
if (fd < 0)
goto out;
dirp = fdopendir(fd);
if (dirp == NULL) {
close(fd);
goto out;
}
rewinddir(dirp);
while (readdir_r(dirp, &entry, &result) == 0) {
int dirfd;
if (result == NULL)
break;
if (result->d_type != DT_DIR && result->d_type != DT_UNKNOWN)
continue;
if (result->d_name[0] == '.')
continue;
dirfd = open_volume_dir(result->d_name);
if (dirfd < 0)
continue;
iter(cookie, result->d_name, dirfd);
close(dirfd);
}
closedir(dirp);
pthread_mutex_unlock(&readdir_lock);
}
static int read_property(int dirfd, char *name, unsigned int *_value)
{
int fd;
FILE *f;
unsigned int value;
fd = openat(dirfd, name, O_RDONLY | O_NOATIME);
if (fd < 0)
return -1;
f = fdopen(fd, "r");
if (f == NULL) {
close(fd);
return -1;
}
if (fscanf(f, "%u", &value) != 1) {
fclose(f);
return -1;
}
*_value = value;
fclose(f);
return 0;
}
static void count_volumes_cb(void *cookie, char *volume, int dirfd)
{
(*((unsigned int *)cookie))++;
}
static int count_volumes(void)
{
unsigned int count = 0;
iter_volumes(&count, count_volumes_cb);
return count;
}
static int mkpartname(char *buf, int buf_size, unsigned int part)
{
return snprintf(buf, buf_size, "%u", part);
}
static int open_part(int dirfd, unsigned int part, int flags)
{
char partname[32];
mkpartname(partname, sizeof(partname), part);
return openat(dirfd, partname, flags | O_NOATIME);
}
struct blockfs_getattr_info {
unsigned int part_size;
int part_size_mismatches;
struct timespec st_atim;
struct timespec st_mtim;
struct timespec st_ctim;
};
static int timespec_cmp(struct timespec *a, struct timespec *b)
{
if (a->tv_sec > b->tv_sec)
return 1;
if (a->tv_sec < b->tv_sec)
return -1;
if (a->tv_nsec > b->tv_nsec)
return 1;
if (a->tv_nsec < b->tv_nsec)
return -1;
return 0;
}
static void blockfs_getattr_cb(void *_info, unsigned int part, struct stat *buf)
{
struct blockfs_getattr_info *info = _info;
if (buf->st_size != info->part_size)
info->part_size_mismatches++;
if (timespec_cmp(&buf->st_atim, &info->st_atim) > 0)
info->st_atim = buf->st_atim;
if (timespec_cmp(&buf->st_mtim, &info->st_mtim) > 0)
info->st_mtim = buf->st_mtim;
if (timespec_cmp(&buf->st_ctim, &info->st_ctim) > 0)
info->st_ctim = buf->st_ctim;
}
static int
iter_parts_stat(int dirfd, unsigned int num_parts, void *cookie,
void (*iter)(void *cookie, unsigned int part, struct stat *buf))
{
int i;
for (i = 0; i < num_parts; i++) {
char partname[32];
struct stat buf;
mkpartname(partname, sizeof(partname), i);
if (fstatat(dirfd, partname, &buf, 0) < 0) {
perror("fstatat");
/* dirfd belongs to the caller, which also closes it. */
return -1;
}
iter(cookie, i, &buf);
}
return 0;
}
static int blockfs_getattr(const char *path, struct stat *stbuf)
{
struct stat buf;
int dirfd;
unsigned int num_parts;
unsigned int part_size;
struct blockfs_getattr_info info;
if (path[0] == 0)
return -ENOENT;
memset(stbuf, 0, sizeof(struct stat));
if (strcmp(path, "/") == 0) {
int ret;
ret = fstat(backing_dir_fd, &buf);
if (ret < 0)
return -ENOENT;
stbuf->st_mode = buf.st_mode;
stbuf->st_nlink = 2 + count_volumes();
stbuf->st_uid = buf.st_uid;
stbuf->st_gid = buf.st_gid;
stbuf->st_size = 1024;
stbuf->st_blksize = 1024;
stbuf->st_blocks = 1;
stbuf->st_atim = buf.st_atim;
stbuf->st_mtim = buf.st_mtim;
stbuf->st_ctim = buf.st_ctim;
return 0;
}
dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;
if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}
if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}
if (fstat(dirfd, &buf) < 0) {
perror("fstat");
close(dirfd);
return -EINVAL;
}
stbuf->st_mode = S_IFREG | (buf.st_mode & 0666);
stbuf->st_nlink = 1;
stbuf->st_uid = buf.st_uid;
stbuf->st_gid = buf.st_gid;
stbuf->st_size = (uint64_t)num_parts * part_size;
stbuf->st_blksize = part_size;
stbuf->st_blocks = ((uint64_t)num_parts * part_size + 511) / 512;
memset(&info, 0, sizeof(info));
info.part_size = part_size;
if (iter_parts_stat(dirfd, num_parts, &info, blockfs_getattr_cb) ||
info.part_size_mismatches) {
close(dirfd);
return -EINVAL;
}
stbuf->st_atim = info.st_atim;
stbuf->st_mtim = info.st_mtim;
stbuf->st_ctim = info.st_ctim;
close(dirfd);
return 0;
}
static int blockfs_truncate(const char *path, off_t length)
{
/* Silently fail. */
return 0;
}
static int blockfs_open(const char *path, struct fuse_file_info *fi)
{
int dirfd;
unsigned int num_parts;
unsigned int part_size;
int mode;
int i;
if (path[0] == 0)
return -ENOENT;
dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;
if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}
if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}
mode = ((fi->flags & O_ACCMODE) == O_RDONLY) ? R_OK : W_OK;
for (i = 0; i < num_parts; i++) {
char partname[32];
mkpartname(partname, sizeof(partname), i);
if (faccessat(dirfd, partname, mode, 0) < 0) {
close(dirfd);
return -EACCES;
}
}
close(dirfd);
return 0;
}
static int
part_read(int dirfd, unsigned int part, off_t off, char *buf, int num)
{
int fd;
int ret;
fd = open_part(dirfd, part, O_RDONLY);
if (fd < 0)
return 0;
ret = pread(fd, buf, num, off);
if (ret < 0)
ret = -errno;
close(fd);
return ret;
}
static int blockfs_read(const char *path, char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
int dirfd;
size_t numread;
unsigned int num_parts;
unsigned int part_size;
if (path[0] == 0)
return -ENOENT;
dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;
if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}
if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}
numread = 0;
while (size) {
unsigned int part;
off_t part_off;
off_t part_end;
size_t toread;
int ret;
part = offset / part_size;
part_off = offset % part_size;
part_end = (part + 1) * (uint64_t)part_size;
toread = size;
if (offset + toread > part_end)
toread = part_end - offset;
ret = part_read(dirfd, part, part_off, buf, toread);
if (ret < 0) {
if (numread == 0)
numread = ret;
break;
}
buf += ret;
size -= ret;
offset += ret;
numread += ret;
if (ret != toread)
break;
}
close(dirfd);
return numread;
}
static int
part_write(int dirfd, unsigned int part, off_t off, const char *buf, int num)
{
char rbuf[num];
int fd;
int ret;
fd = open_part(dirfd, part, O_RDWR);
if (fd < 0)
return 0;
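/* Read-compare-write: skip writes that wouldn't change anything,
so the part file's mtime only moves when its contents do. */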
ret = pread(fd, rbuf, num, off);
if (ret != num || memcmp(buf, rbuf, num)) {
ret = pwrite(fd, buf, num, off);
if (ret < 0)
ret = -errno;
}
close(fd);
return ret;
}
static int blockfs_write(const char *path, const char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
int dirfd;
size_t numwritten;
unsigned int num_parts;
unsigned int part_size;
if (path[0] == 0)
return -ENOENT;
dirfd = open_volume_dir(path + 1);
if (dirfd < 0)
return -ENOENT;
if (read_property(dirfd, "num_parts", &num_parts) < 0) {
close(dirfd);
return -EINVAL;
}
if (read_property(dirfd, "part_size", &part_size) < 0) {
close(dirfd);
return -EINVAL;
}
numwritten = 0;
while (size) {
unsigned int part;
off_t part_off;
off_t part_end;
size_t towrite;
int ret;
part = offset / part_size;
part_off = offset % part_size;
part_end = (part + 1) * (uint64_t)part_size;
towrite = size;
if (offset + towrite > part_end)
towrite = part_end - offset;
ret = part_write(dirfd, part, part_off, buf, towrite);
if (ret < 0) {
if (numwritten == 0)
numwritten = ret;
break;
}
buf += ret;
size -= ret;
offset += ret;
numwritten += ret;
if (ret != towrite)
break;
}
close(dirfd);
return numwritten;
}
static void
blockfs_statfs_part_cb(void *num, unsigned int part, struct stat *buf)
{
((uint64_t *)num)[1] += buf->st_size;
}
static void blockfs_statfs_volume_cb(void *num, char *volume, int dirfd)
{
unsigned int num_parts;
((uint64_t *)num)[0]++;
if (read_property(dirfd, "num_parts", &num_parts) >= 0)
iter_parts_stat(dirfd, num_parts, num, blockfs_statfs_part_cb);
}
static int blockfs_statfs(const char *path, struct statvfs *buf)
{
uint64_t num[2];
num[0] = 1;
num[1] = 0;
iter_volumes(num, blockfs_statfs_volume_cb);
buf->f_bsize = 4096;
buf->f_frsize = 4096;
buf->f_blocks = num[1] / 4096;
buf->f_bfree = 0;
buf->f_bavail = 0;
buf->f_files = num[0];
buf->f_ffree = 0;
buf->f_favail = 0;
buf->f_fsid = 0;
buf->f_flag = 0;
buf->f_namemax = PATH_MAX;
return 0;
}
static int blockfs_fsync(const char *path, int datasync,
struct fuse_file_info *fi)
{
/* FIXME. */
sync();
return 0;
}
struct blockfs_readdir_info {
void *buf;
fuse_fill_dir_t filler;
};
static void blockfs_readdir_cb(void *_info, char *name, int dirfd)
{
struct blockfs_readdir_info *info = _info;
info->filler(info->buf, name, NULL, 0);
}
static int blockfs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
off_t offset, struct fuse_file_info *fi)
{
struct blockfs_readdir_info info = { buf, filler };
if (strcmp(path, "/") != 0)
return -ENOENT;
filler(buf, ".", NULL, 0);
filler(buf, "..", NULL, 0);
iter_volumes(&info, blockfs_readdir_cb);
return 0;
}
static struct fuse_operations blockfs_oper = {
.getattr = blockfs_getattr,
.truncate = blockfs_truncate,
.open = blockfs_open,
.read = blockfs_read,
.write = blockfs_write,
.statfs = blockfs_statfs,
.fsync = blockfs_fsync,
.readdir = blockfs_readdir,
};
enum {
KEY_HELP,
KEY_VERSION,
};
static struct fuse_opt blockfs_opts[] = {
FUSE_OPT_KEY("-h", KEY_HELP),
FUSE_OPT_KEY("--help", KEY_HELP),
FUSE_OPT_KEY("-V", KEY_VERSION),
FUSE_OPT_KEY("--version", KEY_VERSION),
FUSE_OPT_END
};
static void usage(const char *progname)
{
fprintf(stderr,
"Usage: %s backingdir mountpoint [options]\n"
"\n"
"General options:\n"
" -h --help print help\n"
" -V --version print version\n"
"\n", progname);
}
static char *backing_dir;
static int blockfs_opt_proc(void *data, const char *arg, int key,
struct fuse_args *outargs)
{
if (key == FUSE_OPT_KEY_NONOPT) {
if (backing_dir == NULL) {
backing_dir = strdup(arg);
return 0;
}
return 1;
}
if (key == KEY_HELP) {
usage(outargs->argv[0]);
fuse_opt_add_arg(outargs, "-ho");
fuse_main(outargs->argc, outargs->argv, &blockfs_oper, NULL);
exit(1);
}
if (key == KEY_VERSION) {
fprintf(stderr, "MERGE version: %s\n", PACKAGE_VERSION);
fuse_opt_add_arg(outargs, "--version");
fuse_main(outargs->argc, outargs->argv, &blockfs_oper, NULL);
exit(0);
}
return 1;
}
int main(int argc, char *argv[])
{
struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
if (fuse_opt_parse(&args, NULL, blockfs_opts, blockfs_opt_proc) == -1)
exit(1);
if (backing_dir == NULL) {
fprintf(stderr, "missing backing dir\n");
fprintf(stderr, "see '%s -h' for usage\n", argv[0]);
exit(1);
}
backing_dir_fd = open(backing_dir, O_RDONLY | O_DIRECTORY | O_NOATIME);
if (backing_dir_fd < 0) {
perror("open");
return 1;
}
pthread_mutex_init(&readdir_lock, NULL);
return fuse_main(args.argc, args.argv, &blockfs_oper, NULL);
}
Sven Utcke
2011-04-05 09:01:27 UTC
Permalink
Hi Mark,
Post by Lennert Buytenhek
Maybe I've just not been looking in the right place, but is there
some architecural overview document of lessfs somewhere?
me too :-)

Your sales pitch sounds pretty nice, but the web page seems to be
missing essentially all information of the "what is this, what does
it do, what does it need" kind.

Sven
--
_ ___ ___ ___ The dCache File System
__| |/ __|| __|/ __| An archive file-system for PB of data
/ _` | (__ | _| \__ \ http://www.desy.de/~utcke/Data/
\__,_|\___||_| |___/ http://www.dr-utcke.de/
Mark Ruijter
2011-04-05 12:42:42 UTC
Permalink
Hi Lennert,

This is something that still leaves room for improvement. ;-)
The basics are not difficult though:

Lessfs uses a number of B+-tree databases to store its metadata.

(dbu) blockusage -> hash : inuse
(dbb) fileblock -> inode-blocknr : hash
(dbdta) blockdata -> hash : (compressed) data
(dbp) metadata -> inode : struct stat
(dbdirent) dirent -> directory_inode : inode
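
Roughly, reading block N of an inode chains through those databases
like this (a sketch with made-up types; db_get() is a hypothetical
stand-in for the actual database library calls):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical flat lookup helper, standing in for the real db calls. */
int db_get(void *db, const void *key, size_t klen, void *val, size_t vlen);
extern void *dbb, *dbdta;

static int read_block(uint64_t inode, uint64_t blocknr,
		      char *out, size_t len)
{
	struct { uint64_t inode; uint64_t blocknr; } key = { inode, blocknr };
	unsigned char hash[24];	/* MHASH_TIGER192 gives a 24-byte hash */

	/* dbb: inode-blocknr -> hash */
	if (db_get(dbb, &key, sizeof(key), hash, sizeof(hash)) < 0)
		return -1;	/* never written: treat as a hole */

	/* dbdta: hash -> (compressed) data */
	return db_get(dbdta, hash, sizeof(hash), out, len);
}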

The best way to see how this works is to compile lessfs, mount it and
copy a file to it.
Then unmount and type ./listdb /etc/lessfs.cfg

This will dump the databases so that you can see what's going on.

Replication works like this:
The master writes all database actions into a replication log file.
The replication log file is sent to the slave. The slave reads the
instructions in the replication log and executes them against its
databases.
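
Conceptually each record in that log is just the database operation
replayed verbatim; think of something like this (illustrative only,
not the real on-disk format):

#include <stdint.h>

struct repl_record {
	uint8_t db;		/* which database: dbu, dbb, dbdta, ... */
	uint8_t op;		/* insert / update / delete */
	uint32_t key_len;
	uint32_t val_len;
	/* key_len key bytes followed by val_len value bytes */
};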

I hope this helps a little.

Regards,

Mark.

-----
***@saturn:/usr/src/lessfs-1.1.1# cp Makefile /fuse/
***@saturn:/usr/src/lessfs-1.1.1# umount /fuse/
***@saturn:/usr/src/lessfs-1.1.1# ./listdb /etc/lessfs.cfg
(null) (3745): info:
The selected data store is file_io.(null) (3745): info:
Lessfs transaction support is enabled.(null) (3745): info:
Hash MHASH_TIGER192 has been selected(null) (3745): info:
Lessfs uses a 24 bytes long hash.(null) (3745): info:
Automatic defragmentation is enabled.(null) (3745): info:
cache 8192 data blocks

dbp
NFI : 9
ddstat->filename /
->inode 1 -> size 4096 -> real_size 0 time 1301987110
->filename / is a directory
ddstat->filename .
->inode 2 -> size 4096 -> real_size 0 time 1301987110
->filename . is a directory
ddstat->filename ..
->inode 3 -> size 4096 -> real_size 0 time 1301987110
->filename .. is a directory
ddstat->filename .lessfs
->inode 4 -> size 4096 -> real_size 0 time 1301987110
->filename .lessfs is a directory
ddstat->filename .
->inode 5 -> size 4096 -> real_size 0 time 1301987110
->filename . is a directory
ddstat->filename ..
->inode 6 -> size 4096 -> real_size 0 time 1301987110
->filename .. is a directory
ddstat->filename lessfs_stats
->inode 7 -> size 0 -> real_size 0 time 1301987110
ddstat->filename Makefile
->inode 8 -> size 101030 -> real_size 18708 time 1301987127


dbu
FBA533FC4B73A11554D43754A5B9FDF5611A6CC575ED8992 : 1
offset : 0
size : 0
round size : 0

061936B61995F36C7901A876B2E4592B9591A04A61B53BA3 : 1301987129
offset : 0
size : 0
round size : 0


nextoffset = 18944

F65C7F296773A41D4F232B9BABA5D736B71D0F64E6A19E8A : 131072
offset : 0
size : 0
round size : 0

20C8741C8118000A7207E62A21B87FCB2DD239A393363FD8 : 1
offset : 0
size : 18708
round size : 18944



dbb
8-0 : 20C8741C8118000A7207E62A21B87FCB2DD239A393363FD8


dbdirent
1:1
1:2
1:3
1:4
1:8
4:4
4:5
4:6
4:7


dbs (symlinks)


dbl (hardlinks)
Post by Lennert Buytenhek
Hi Mark,
Post by Lennert Buytenhek
Maybe I've just not been looking in the right place, but is there
some architecural overview document of lessfs somewhere?
me too :-)
Your sales pitch sounds pretty nice, but the web page seems to be
missing essentially all information of the "what is this, what does
it do, what does it need" kind.
Sven
Goswin von Brederlow
2011-04-05 07:42:59 UTC
Permalink
Post by Lennert Buytenhek
Hi!
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
If I understand you right you have large images of which only small
chunks are changed. Reading the complete image takes too long so you
want some smarter way to read only the chunks that have changed (just so we
are talking about the same thing)
Post by Lennert Buytenhek
...
When a write happens, blockfs will first read the original data for
that byte range from the component file, and if that matches, it won't
perform the write, to avoid updating the mtime on the component file.
I would think that is a waste of time. How often do you have writes
matching the data?
Post by Lennert Buytenhek
With modern filesystems like ext4 and xfs, having many large files in
a single directory shouldn't incur a performance penalty like it would
on other filesystems, e.g. on ext3 without extents or htree, so this
approach should be okay.
Implementation-wise you could have the blocks as actual files on some
underlying filesystem.

Or you could have the image stored on an underlying block device (or
image file) and keep the metadata for the blocks separate on a
filesystem. Your filesystem would then provide both the block device as
one big file as well as virtual files for each chunk. A read/write to
either the one big file or individual chunks should be supported. The
virtual machine would read/write from the one big file. The backup would
read individual chunks and restore would write individual chunks. That
would probably just work with rsync and the in-place option.

I would use the latter. You can easily drop the fs and use the
image/device as-is, or add the fs to an existing image file. Simpler to
change between using and not using it. With a real device (or loopback
mounting the image) you can also use dm-snapshot as part of a snapshot
feature in your fs.
Post by Lennert Buytenhek
- There should really be some kind of caching in blockfs -- not for
data, but for things like metadata and component file descriptors.
Going with my second design choice you would only have 2 FDs open. One
for the image and one for metadata. How often you sync the metadata is
then up to you. You could just mmap() the metadata and msync() the first
time each block is changed after a backup and otherwise let the kernel
decide when to commit changes. If you lose some update due to a crash the
block will still have a newer mtime than the backup even if it isn't the
most recent mtime. This would require some feedback from the backup
process though, I would just combine this with the snapshot feature. If
the mtime of a chunk changes the first time after a snapshot you msync()
it, otherwise you just alter the data in memory and let the kernel flush
it out over time.
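
A minimal sketch of that bookkeeping (the layout and the names are
made up, and page_size would come from sysconf(_SC_PAGESIZE)):

#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

static uint32_t *mtimes;	/* mmap()ed metadata: one mtime per chunk */
static unsigned char *dirty_since_backup;	/* in-memory only */
static uintptr_t page_size;

static void chunk_written(unsigned int chunk)
{
	mtimes[chunk] = (uint32_t)time(NULL);
	if (!dirty_since_backup[chunk]) {
		/* First write since the last backup: push the new mtime
		 * to disk now so a crash can't hide the change. */
		uintptr_t a = (uintptr_t)&mtimes[chunk] & ~(page_size - 1);

		msync((void *)a, page_size, MS_ASYNC);
		dirty_since_backup[chunk] = 1;
	}
	/* Otherwise just dirty the page and let the kernel flush it. */
}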
Post by Lennert Buytenhek
This is currently not done, as it doesn't seem possible to do things
like setting timers or asking the fuse main loop to poll on some
file descriptors for you (e.g. inotify fds for the component directory).
There recently was a patch for having a timer for e.g. flushing caches
to disk in your fs.
Post by Lennert Buytenhek
At some point I want to look into integrating the fuse main loop
with ivykis (http://libivykis.sourceforge.net/man3/ivykis.3.html)
to be able to address this.
- There should be an option for suspending I/O while a backup of the
backing store is made, so that you don't end up with a backup copy
of your block device that has had writes to it reordered.
I found that extended attributes make for a simple interface for
this. Setting an extended attribute would create a snapshot and a new
directory in your fs containing all the chunks from that time. Removing
the extended attribute could remove the snapshot.
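
In blockfs terms that would be a couple of extra fuse callbacks,
something like this (the 'user.snapshot' attribute name and the two
snapshot helpers are made up):

static int blockfs_setxattr(const char *path, const char *name,
			    const char *value, size_t size, int flags)
{
	if (strcmp(name, "user.snapshot") != 0)
		return -ENOTSUP;
	return create_snapshot(path, value, size);	/* hypothetical */
}

static int blockfs_removexattr(const char *path, const char *name)
{
	if (strcmp(name, "user.snapshot") != 0)
		return -ENOTSUP;
	return remove_snapshot(path);			/* hypothetical */
}

hooked up via .setxattr/.removexattr in blockfs_oper, so that
'setfattr -n user.snapshot -v monday mount/volume' takes a snapshot
and 'setfattr -x user.snapshot mount/volume' drops it again.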
Post by Lennert Buytenhek
- Establish some kind of rule of thumb for what the optimal component
file size is. This probably depends on hardware specs, method of
synchronisation the data to the slave cop{y,ies}, write pattern and
write intensity, etc, but it should be possible to figure out a sane
default (or a set of defaults) to recommend.
At a minimum you need a 4-byte mtime per block, so at page granularity
(4 bytes per 4 KiB page) you would have ~0.1% overhead. If you only
msync() metadata the first time after a backup there might be very
little overhead I/O-wise.

The overhead for rsync and the backup filesystem would be insane
though. I probably wouldn't want anything smaller than 1MB blocks. I
would probably try out going as high as 1GB blocks.

But as you say it highly depends on the use case. 1MB chunks sounds like
a sane default though.
Post by Lennert Buytenhek
...
static int
part_read(int dirfd, unsigned int part, off_t off, char *buf, int num)
{
int fd;
int ret;
fd = open_part(dirfd, part, O_RDONLY);
This is a total killer. If you do use individual files per chunk I would
reduce the number of chunks to something small enough that you can keep
them all open. Or at least keep a cache of the last 1000 chunks used.

Using a single image and a metadata file avoids this completely. Only 2
FDs needed.
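
If you stay with a file per chunk, the cache can be dead simple, e.g.
(sketch; slots must be initialised to part = -1, fd = -1 at mount
time, and a real version would also key on the open flags and take a
lock):

#define FD_CACHE_SIZE 1000

static struct cached_fd {
	int part;		/* -1 = slot empty */
	int fd;
	unsigned long last_used;
} fd_cache[FD_CACHE_SIZE];
static unsigned long use_counter;

static int get_part_fd(int dirfd, unsigned int part, int flags)
{
	int i;
	int victim = 0;

	for (i = 0; i < FD_CACHE_SIZE; i++) {
		if (fd_cache[i].part == (int)part) {
			fd_cache[i].last_used = ++use_counter;
			return fd_cache[i].fd;
		}
		if (fd_cache[i].last_used < fd_cache[victim].last_used)
			victim = i;
	}

	/* Miss: evict the least recently used slot. */
	if (fd_cache[victim].fd >= 0)
		close(fd_cache[victim].fd);
	fd_cache[victim].fd = open_part(dirfd, part, flags);
	fd_cache[victim].part = part;
	fd_cache[victim].last_used = ++use_counter;

	return fd_cache[victim].fd;
}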
Post by Lennert Buytenhek
if (fd < 0)
return 0;
ret = pread(fd, buf, num, off);
You should check out the splice operations for this to avoid needless
copies of the data.
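
On Linux that means pumping the data through a pipe instead of a user
buffer, along these lines (sketch; whether libfuse can consume the
other end of the pipe is a separate question):

#define _GNU_SOURCE
#include <fcntl.h>

/* Move 'num' bytes from the part file into a pipe; the kernel moves
 * page references around instead of copying the data. */
static ssize_t part_splice_read(int part_fd, off_t off,
				int pipe_wfd, size_t num)
{
	loff_t offset = off;

	return splice(part_fd, &offset, pipe_wfd, NULL, num, SPLICE_F_MOVE);
}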
Post by Lennert Buytenhek
if (ret < 0)
ret = -errno;
close(fd);
return ret;
}
static int blockfs_read(const char *path, char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
Horribly complex.
Post by Lennert Buytenhek
}
static int
part_write(int dirfd, unsigned int part, off_t off, const char *buf, int num)
{
char rbuf[num];
int fd;
int ret;
fd = open_part(dirfd, part, O_RDWR);
if (fd < 0)
return 0;
ret = pread(fd, rbuf, num, off);
if (ret != num || memcmp(buf, rbuf, num)) {
Benchmark this. I highly doubt this gains you anything.
Post by Lennert Buytenhek
ret = pwrite(fd, buf, num, off);
Same thing as read, splice this.
Post by Lennert Buytenhek
if (ret < 0)
ret = -errno;
}
close(fd);
return ret;
}
static int blockfs_write(const char *path, const char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
This code looks awfully familiar. Wait, that is nearly what read
does. Maybe you could write a common helper function that gets
part_read/part_write as an argument.
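
Sketch of what I mean, reusing the loop from blockfs_read() (note
that part_write's const buf then needs casting to fit):

typedef int (*part_io_fn)(int dirfd, unsigned int part, off_t off,
			  char *buf, int num);

static int blockfs_io(int dirfd, unsigned int part_size, char *buf,
		      size_t size, off_t offset, part_io_fn io)
{
	size_t done = 0;

	while (size) {
		unsigned int part = offset / part_size;
		off_t part_off = offset % part_size;
		off_t part_end = (part + 1) * (uint64_t)part_size;
		size_t chunk = size;
		int ret;

		if (offset + chunk > part_end)
			chunk = part_end - offset;
		ret = io(dirfd, part, part_off, buf, chunk);
		if (ret < 0)
			return done ? (int)done : ret;
		buf += ret;
		size -= ret;
		offset += ret;
		done += ret;
		if ((size_t)ret != chunk)
			break;
	}

	return done;
}

blockfs_read() then passes part_read, and blockfs_write() passes
(part_io_fn)part_write with its buf cast to char *.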
Post by Lennert Buytenhek
static int blockfs_fsync(const char *path, int datasync,
struct fuse_file_info *fi)
{
/* FIXME. */
sync();
Very, very bad. You should sync all FDs. sync() might block here for
literally hours, e.g. if the user is copying data to a USB drive at the
time.

Look at fsync, msync, sync_file_range (if you have linux).
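
With a per-chunk fd cache like the one sketched above, fsync could
just walk the cached descriptors instead (sketch):

static int blockfs_fsync(const char *path, int datasync,
			 struct fuse_file_info *fi)
{
	int i;

	/* Sync our own descriptors only, not every device in the box. */
	for (i = 0; i < FD_CACHE_SIZE; i++) {
		int fd = fd_cache[i].fd;

		if (fd_cache[i].part == -1)
			continue;
		if ((datasync ? fdatasync(fd) : fsync(fd)) < 0)
			return -errno;
	}

	return 0;
}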

Doing this with concurrent asynchronous IO would avoid the need for
threads and mutexes, so AIO is worth a look.

And most important, I think, would be looking at splice (if you have
Linux). That should save you most of the CPU time overhead.


Overall though I think your idea has merits. So keep at it.

Best regards,
Goswin
Lennert Buytenhek
2011-04-05 09:44:29 UTC
Permalink
Post by Goswin von Brederlow
Post by Lennert Buytenhek
Recently, I ran into the problem of needing an efficient way of doing
incremental backups of virtual machine images. I looked around a bit,
finding the patches that people have written to make rsync work on
block devices, but none of what was out there really suited my needs,
so I ended up hacking up something myself.
If I understand you right you have large images of which only small
chunks are changed. Reading the complete image takes too long so you
want some smarter way to read only the chunks that have changed (just
so we are talking about the same thing)
Yes, and yes.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
When a write happens, blockfs will first read the original data for
that byte range from the component file, and if that matches, it won't
perform the write, to avoid updating the mtime on the component file.
I would think that is a waste of time. How often do you have writes
matching the data?
I don't know. :) I'll find out.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
With modern filesystems like ext4 and xfs, having many large files in
a single directory shouldn't incur a performance penalty like it would
on other filesystems, e.g. on ext3 without extents or htree, so this
approach should be okay.
Implementation wise you could have the blocks as actual files on some
underlying filesystem.
Or you could have the image stored on an underlying block device (or
image file) and keep the metadata for the blocks separate on a
filesystem.
I considered both options (I touch on that briefly in my initial
email)..
Post by Goswin von Brederlow
Your filesystem would then provide both the block device as
one big file as well as virtual files for each chunk. A read/write to
either the one big file or individual chunks should be supported. The
virtual machine would read/write from the one big file. The backup would
read individual chunks and restore would write individual chunks. That
would probably just work with rsync and the in-place option.
..but this is a pretty interesting twist. I need to think about this
some more.
Post by Goswin von Brederlow
I would use the latter. You can easily drop the fs and use the
image/device as-is, or add the fs to an existing image file. Simpler to
change between using and not using it.
Actually, I'm not sure if I would like that to be simple -- if one
accidentally uses the backing file instead of the fuse-provided file,
one ends up losing mtime updates, and the backup copies can get into
an inconsistent state.
Post by Goswin von Brederlow
With a real device (or loopback mounting the image) you can also use
dm-snapshot as part of a snapshot feature in your fs.
Right. (I'd have to snapshot the mtime database as well at that point.)
Post by Goswin von Brederlow
Post by Lennert Buytenhek
- There should really be some kind of caching in blockfs -- not for
data, but for things like metadata and component file descriptors.
Going with my second design choice you would only have 2 FDs open. One
for the image and one for metadata. How often you sync the metadata is
then up to you. You could just mmap() the metadata and msync() the first
time each block is changed after a backup and otherwise let the kernel
decide when to commit changes. If you lose some update due to a crash the
block will still have a newer mtime than the backup even if it isn't the
most recent mtime. This would require some feedback from the backup
process though, I would just combine this with the snapshot feature. If
the mtime of a chunk changes the first time after a snapshot you msync()
it, otherwise you just alter the data in memory and let the kernel flush
it out over time.
OK, this I get. So the main difference here is being able to get rid
of the intermediate filesystem.

As I said, I'll need to think about this a bit more.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
- There should be an option for suspending I/O while a backup of the
backing store is made, so that you don't end up with a backup copy
of your block device that has had writes to it reordered.
I found that extended attributes make for a simple interface for
this. Setting an extended attribute would create a snapshot and a new
directory in your fs containing all the chunks from that time. Removing
the extended attribute could remove the snapshot.
Makes sense, I'll play with that.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
- Establish some kind of rule of thumb for what the optimal component
file size is. This probably depends on hardware specs, method of
synchronisation the data to the slave cop{y,ies}, write pattern and
write intensity, etc, but it should be possible to figure out a sane
default (or a set of defaults) to recommend.
At a minimum you need a 4-byte mtime per block, so at page granularity
(4 bytes per 4 KiB page) you would have ~0.1% overhead. If you only
msync() metadata the first time after a backup there might be very
little overhead I/O-wise.
The overhead for rsync and the backup filesystem would be insane
though. I probably wouldn't want anything smaller than 1MB blocks. I
would probably try out going as high as 1GB blocks.
But as you say it highly depends on the use case. 1MB chunks sounds like
a sane default though.
ACK.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
static int
part_read(int dirfd, unsigned int part, off_t off, char *buf, int num)
{
int fd;
int ret;
fd = open_part(dirfd, part, O_RDONLY);
This is a total killer. If you do use individual files per chunk I would
reduce the number of chunks to something small enough that you can keep
them all open. Or at least keep a cache of the last 1000 chunks used.
I was going to do something along the lines of the latter, but I'll go
and play with all the suggested approaches to see which works better.
Post by Goswin von Brederlow
Using a single image and a metadata file avoids this completely. Only 2
FDs needed.
Yes, it would.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
if (fd < 0)
return 0;
ret = pread(fd, buf, num, off);
You should check out the splice operations for this to avoid needless
copies of the data.
libfuse doesn't support splicing data to/from /dev/fuse, though -- or
does it? I can't find any mention of 'splice' in the fuse 2.8.5 tarball.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
if (ret < 0)
ret = -errno;
close(fd);
return ret;
}
static int blockfs_read(const char *path, char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
Horribly complex.
Hm, I don't quite agree with that, but OK.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
static int blockfs_write(const char *path, const char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
This code looks awfully familiar. Wait, that is nearly what read
does. Maybe you could write a common helper function that gets
part_read/part_write as an argument.
I didn't do this at first because of the constness of 'buf' for the
write case and the non-constness of 'buf' for the read case. I guess
that can be worked around, though.
Post by Goswin von Brederlow
Post by Lennert Buytenhek
static int blockfs_fsync(const char *path, int datasync,
struct fuse_file_info *fi)
{
/* FIXME. */
sync();
Very, very bad.
I know, hence the 'FIXME'. :-)

It's still better than not doing any syncing at all..


thanks,
Lennert
Goswin von Brederlow
2011-04-06 15:11:07 UTC
Permalink
Post by Lennert Buytenhek
Post by Goswin von Brederlow
I would use the latter. You can easily drop the fs and use the
image/device as-is, or add the fs to an existing image file. Simpler to
change between using and not using it.
Actually, I'm not sure if I would like that to be simple -- if one
accidentally uses the backing file instead of the fuse-provided file,
one ends up losing mtime updates, and the backup copies can get into
an inconsistent state.
I'm not quite sure how you would do it accidentally. If it is a file
(not a real block device) you could also check the mtime of the file
against one recorded in your metadata and then reset all metadata mtimes
to that of the image. For non-accidental use I would add a
--reset-mtime option that sets the mtime of all blocks to now. Rsync
would then read all blocks for the next update and sync any changes.
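
The option itself would be little more than this (sketch, reusing
mkpartname() from your posted code):

#include <sys/stat.h>

/* Touch every part file so the next rsync run re-reads all blocks. */
static int reset_mtimes(int dirfd, unsigned int num_parts)
{
	unsigned int i;

	for (i = 0; i < num_parts; i++) {
		char partname[32];
		struct timespec ts[2] = {
			{ 0, UTIME_OMIT },	/* leave atime alone */
			{ 0, UTIME_NOW },	/* set mtime to now */
		};

		mkpartname(partname, sizeof(partname), i);
		if (utimensat(dirfd, partname, ts, 0) < 0)
			return -1;
	}

	return 0;
}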
Post by Lennert Buytenhek
Post by Goswin von Brederlow
Post by Lennert Buytenhek
ret = pread(fd, buf, num, off);
You should check out the splice operations for this to avoid needless
copies of the data.
libfuse doesn't support splicing data to/from /dev/fuse, though -- or
does it? I can't find any mention of 'splice' in the fuse 2.8.5 tarball.
Check the list archive. There was a mail about new callbacks recently.
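
The idea there is a read callback that hands libfuse an fd-backed
buffer to splice from, instead of filling a memory buffer. Against
the proposed API it would look roughly like this (sketch: single-part
reads only, and the part fd is leaked here; a real version would keep
it cached):

static int blockfs_read_buf(const char *path, struct fuse_bufvec **bufp,
			    size_t size, off_t off,
			    struct fuse_file_info *fi)
{
	struct fuse_bufvec *buf;
	int dirfd;
	unsigned int part_size;

	dirfd = open_volume_dir(path + 1);
	if (dirfd < 0)
		return -ENOENT;
	if (read_property(dirfd, "part_size", &part_size) < 0) {
		close(dirfd);
		return -EINVAL;
	}

	buf = malloc(sizeof(*buf));
	if (buf == NULL) {
		close(dirfd);
		return -ENOMEM;
	}
	*buf = FUSE_BUFVEC_INIT(size);
	buf->buf[0].flags = FUSE_BUF_IS_FD | FUSE_BUF_FD_SEEK;
	buf->buf[0].fd = open_part(dirfd, off / part_size, O_RDONLY);
	buf->buf[0].pos = off % part_size;
	close(dirfd);

	*bufp = buf;
	return 0;
}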
Post by Lennert Buytenhek
Post by Goswin von Brederlow
Post by Lennert Buytenhek
static int blockfs_write(const char *path, const char *buf, size_t size,
off_t offset, struct fuse_file_info *fi)
{
This code looks awfully familiar. Wait, that is nearly what read
does. Maybe you could write a common helper function that gets
part_read/part_write as an argument.
I didn't do this at first because of the constness of 'buf' for the
write case and the non-constness of 'buf' for the read case. I guess
that can be worked around, though.
Yeah, you will have to cast the const (or non-const) of buf away in
the common helper. In C you can't express that the constness of buf
is tied to the constness of the part_read/part_write argument.

Best regards,
Goswin
