[fuse-devel] clean mountpoint umount on daemon SIGKILL

Discussion:

Anatol Pomozov

2012-12-13 00:14:30 UTC

Hi, Miklos.

On our servers we have many fuse filesystems that are run by different users.
We also have some kind of monitoring software that tracks fuse daemon
processes and kills one if it misbehaves. Sometimes (e.g. in case of a
deadlock) the only way to kill the daemon is to send SIGKILL to it.

Unfortunately SIGKILL produces another issue - the mountpoint is left in
an inconsistent state. libfuse calls umount() in its uninitialization logic
and SIGKILL does not give it any chance to run umount(). But having a clean
unmount even in case of SIGKILL would be really nice to have in fuse.

Miklos, is there any way to cleanly umount the filesystem in case of
SIGKILL? Maybe it can be done in the kernel in fuse_dev_release()? This
function corresponds to close() of /dev/fuse - the kernel always closes
descriptors when the process (i.e. the fuse daemon) dies.
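
The descriptor-closing behaviour this idea relies on can be observed from user space. The sketch below uses a pipe as a stand-in for /dev/fuse (an assumption made only so the example runs without FUSE); read_after_sigkill is a hypothetical helper name.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that holds the write end of a pipe and then blocks,
 * kills it with SIGKILL, and returns what read() reports on the read end.
 * 0 means end-of-file, i.e. the kernel closed the child's descriptor.
 * Returns -1 on setup failure.
 */
long read_after_sigkill(void)
{
    int fds[2];
    char byte;
    pid_t pid;
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep only the write end open and never clean up. */
        close(fds[0]);
        for (;;)
            pause();
    }

    /* Parent: drop its own copy of the write end, then kill the child. */
    close(fds[1]);
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);

    n = read(fds[0], &byte, 1);   /* EOF once the last writer is gone */
    close(fds[0]);
    return (long)n;
}
```

The child never calls close() on its write end, yet the parent sees end-of-file after the kill. The release callback of a device file is reached the same way, once the last reference to the open file is dropped.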

Nikolaus Rath

2012-12-13 07:28:27 UTC

Post by Anatol Pomozov
Unfortunately SIGKILL produces another issue - the mountpoint is left in
an inconsistent state. libfuse calls umount() in its uninitialization logic
and SIGKILL does not give it any chance to run umount(). But having a clean
unmount even in case of SIGKILL would be really nice to have in fuse.

This is generally not a good idea. Imagine if you run a tool like rsync.
If the source mountpoint suddenly becomes empty, rsync would end up
deleting everything in the destination. If the mountpoint returns an I/O
error instead (as it currently does), rsync can detect the problem and
will instead refuse to do anything.

Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C

Maxim V. Patlasov

2012-12-13 08:18:13 UTC

Anatol,

Post by Nikolaus Rath

FUSE reconnect can be implemented relatively easily. The idea is to keep
kernel fuse queueing requests while user-space is dead and, when it
restarts, to connect it to the existing kernel fuse_conn. With this
feature implemented, you could SIGKILL a deadlocked fuse daemon, then
start it again and umount the filesystem cleanly. Would this scheme be
helpful for you?

Thanks,
Maxim

Anatol Pomozov

2013-02-25 16:59:20 UTC

Hi

On Thu, Dec 13, 2012 at 12:18 AM, Maxim V. Patlasov

Post by Maxim V. Patlasov
Anatol,
FUSE reconnect can be implemented relatively easily. The idea is to keep
kernel fuse queueing requests while user-space is dead and, when it restarts,
to connect it to the existing kernel fuse_conn. With this feature
implemented, you could SIGKILL a deadlocked fuse daemon, then start it again
and umount the filesystem cleanly. Would this scheme be helpful for you?

One question about the reconnection - what are you going to do with
open file descriptors? When the daemon crashes they become invalid and
should be closed, and thus you break filesystem clients anyway.
Otherwise reconnection sounds interesting. In fact it can be useful
for a regular clean shutdown if we want to do a hot filesystem upgrade.

But the reconnection makes sense only if the system has some kind
of "process supervisor": something that tracks process status and
restarts it on crash, e.g. a systemd service. Otherwise we still have
the same issue - on abnormal daemon exit the user has an inconsistent
mountpoint and has to do something with it. The only difference is
that the crashed filesystem returns an ENOTCONN error now, and with your
proposal it will hang (in uninterruptible sleep!).

As for our server setup, we do have a "process supervisor". But in our
case a crash does not always lead to a restart, e.g. the process may be
rescheduled on a different machine. So we still need some kind
of post-exit cleanup, but we want to keep the supervisor code
fuse-unaware. Kernel autocleanup on daemon death seems the best option
for us.
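
For contrast, here is a rough sketch of what a fuse-aware supervisor would have to do if the kernel does not clean up. It assumes the fusermount helper is on PATH; died_abnormally, unmount_with_fusermount and supervise are hypothetical names, not taken from any real supervisor.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns nonzero if a wait status describes death by signal (SIGKILL, SIGSEGV, ...). */
int died_abnormally(int status)
{
    return WIFSIGNALED(status);
}

/* Best-effort unmount via the fusermount helper; returns its exit code or -1. */
int unmount_with_fusermount(const char *mountpoint)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        execlp("fusermount", "fusermount", "-u", mountpoint, (char *)NULL);
        _exit(127);   /* exec failed, e.g. fusermount is not installed */
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Waits for the daemon and unmounts its mountpoint if it was killed by a signal. */
int supervise(pid_t daemon_pid, const char *mountpoint)
{
    int status;

    if (waitpid(daemon_pid, &status, 0) == -1)
        return -1;
    if (died_abnormally(status)) {
        fprintf(stderr, "daemon %ld killed by signal %d, unmounting %s\n",
                (long)daemon_pid, WTERMSIG(status), mountpoint);
        return unmount_with_fusermount(mountpoint);
    }
    return 0;
}
```

The supervisor has to know the mountpoint and the unmount helper, which is exactly the fuse-specific knowledge the message above wants to keep out of it.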

This can also replace the recently added "auto_unmount" feature. The
option enables a user-space cleanup mechanism, but having kernel cleanup
on daemon shutdown is more reliable.

Anyway, I have working code for kernel autocleanup and I'll post it
here for comments.

Anatol Pomozov

2013-02-25 18:11:08 UTC

To clean up its mountpoint a fuse application registers a signal hook that calls
the 'fusermount' tool. But in case of abnormal exit (SIGSEGV, SIGKILL) the application
has no chance to call fusermount and the mountpoint is left in an inconsistent
state (it returns an ENOTCONN error).
There is a recently added option, "auto_unmount", but it relies on a
user-space daemon and is not very reliable (it can also be killed with SIGKILL).

Instead we implement unmount on '/dev/fuse' file close. With it there is no
need to use 'auto_unmount' or call 'fusermount' on shutdown, but we keep them for
compatibility with old kernels.

The current implementation unmounts the original mountpoint and all bind mounts. So
it differs from the original implementation, which called 'fusermount' only on
the original mount.

Note that both fusermount and kernel-style mount cleanup unmount the filesystem
only in the current process's namespace. If the daemon changed its filesystem
namespace then those mountpoints are left untouched.

Tested: ran a fuse filesystem and tried to kill it in different ways:
SIGTERM, SIGKILL, "umount dir". Checked that it also works in case of bind mounts.

Google-Bug-Id: 7718269

Change-Id: I0838b40e1e3c9328c76674d5043b7a700b9053b7
Signed-off-by: Anatol Pomozov <***@gmail.com>
---
fs/fuse/dev.c | 34 ++++++++++++++++++++++++++++++++++
fs/namespace.c | 10 ++++++++--
include/linux/mount.h | 1 +
3 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e9bdec0..4f592b7 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -19,6 +19,10 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/swap.h>
 #include <linux/splice.h>
+#include <linux/nsproxy.h>
+#include <linux/mount.h>
+
+#include "../mount.h"
 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
@@ -2082,10 +2086,34 @@ void fuse_abort_conn(struct fuse_conn *fc)
 }
 EXPORT_SYMBOL_GPL(fuse_abort_conn);
 
+static void fuse_umount(struct super_block *sb)
+{
+	struct nsproxy *nsp = task_nsproxy(current);
+	struct mnt_namespace *ns = nsp->mnt_ns;
+	struct mount *mnt, *tmp;
+
+	list_for_each_entry_safe(mnt, tmp, &ns->list, mnt_list) {
+		struct vfsmount *vfsmnt = &mnt->mnt;
+		if (vfsmnt->mnt_sb == sb) {
+			/* in case of mount binds there can be more than one
+			 * mountpoint that corresponds to sb
+			 */
+			mntget(vfsmnt);
+			do_umount(vfsmnt, 0);
+			mntput(vfsmnt);
+
+			/* TODO: better debug message? */
+			pr_debug("fuse: mountpoint (%d,%d) automatically unmounted\n",
+				MAJOR(sb->s_dev), MINOR(sb->s_dev));
+		}
+	}
+}
+
 int fuse_dev_release(struct inode *inode, struct file *file)
 {
 	struct fuse_conn *fc = fuse_get_conn(file);
 	if (fc) {
+		struct super_block *sb = fc->sb;
 		spin_lock(&fc->lock);
 		fc->connected = 0;
 		fc->blocked = 0;
@@ -2094,6 +2122,12 @@ int fuse_dev_release(struct inode *inode, struct file *file)
 		wake_up_all(&fc->blocked_waitq);
 		spin_unlock(&fc->lock);
 		fuse_conn_put(fc);
+
+		/* super block might already be NULL if we killed this fs by
+		 * "umount"
+		 */
+		if (sb)
+			fuse_umount(sb);
 	}
 
 	return 0;
diff --git a/fs/namespace.c b/fs/namespace.c
index 55605c5..d7496d8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1147,7 +1147,7 @@ void umount_tree(struct mount *mnt, int propagate, struct list_head *kill)
 
 static void shrink_submounts(struct mount *mnt, struct list_head *umounts);
 
-static int do_umount(struct mount *mnt, int flags)
+static int __do_umount(struct mount *mnt, int flags)
 {
 	struct super_block *sb = mnt->mnt.mnt_sb;
 	int retval;
@@ -1237,6 +1237,12 @@ static int do_umount(struct mount *mnt, int flags)
 	return retval;
 }
 
+int do_umount(struct vfsmount *mnt, int flags)
+{
+	return __do_umount(real_mount(mnt), flags);
+}
+EXPORT_SYMBOL(do_umount);
+
 /*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
@@ -1272,7 +1278,7 @@ SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
 	if (!ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN))
 		goto dput_and_out;
 
-	retval = do_umount(mnt, flags);
+	retval = __do_umount(mnt, flags);
 dput_and_out:
 	/* we mustn't call path_put() as that would clear mnt_expiry_mark */
 	dput(path.dentry);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..333c1e8 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -65,6 +65,7 @@ extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern int do_umount(struct vfsmount *mnt, int flags);
 
 struct file_system_type;
 extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,

--
1.8.1.3

Anatol Pomozov

2013-02-25 18:14:31 UTC

Hi,

I sent it as a reply to "clean mountpoint umount on daemon SIGKILL"
mail thread. The change is RFC and needs discussion.

Anatol Pomozov

2013-04-27 14:47:45 UTC

Hi

On Thu, Dec 13, 2012 at 12:18 AM, Maxim V. Patlasov

Post by Maxim V. Patlasov
Anatol,

Post by Nikolaus Rath

FUSE reconnect can be implemented relatively easily. The idea is to keep
kernel fuse queueing requests while user-space is dead and, when it restarts,
to connect it to the existing kernel fuse_conn. With this feature
implemented, you could SIGKILL a deadlocked fuse daemon, then start it again
and umount the filesystem cleanly. Would this scheme be helpful for you?

The more I think about fuse transparent reconnect, the more I like it. In the
future it will allow us to implement things like failover (start a new
daemon in place of the dead/killed one) and hot-swap daemon upgrades.

Maxim, have you tried to implement it? Do you see any issues? In
particular, is it possible to restore daemon state to make filesystem
clients believe it is the same connection?

Maxim V. Patlasov

2013-04-29 13:42:32 UTC

Hi Anatol,

Post by Anatol Pomozov
Hi
On Thu, Dec 13, 2012 at 12:18 AM, Maxim V. Patlasov

Post by Maxim V. Patlasov
Anatol,

Post by Nikolaus Rath

FUSE reconnect can be implemented relatively easily. The idea is to keep
kernel fuse queueing requests while user-space is dead and, when it restarts,
to connect it to the existing kernel fuse_conn. With this feature
implemented, you could SIGKILL a deadlocked fuse daemon, then start it again
and umount the filesystem cleanly. Would this scheme be helpful for you?

Yes, there are two patches developed by Pavel Emelyanov: one to show
open files in fusectl, and another to reconnect fuse daemon to an
existing fuse-connection. I can post them as 'rfc' if you're interested.

Post by Anatol Pomozov
Do you see any issues, in
particular is it possible to restore daemon state to make filesystem
client believe it is the same connection?

It depends on the fuse daemon. If it's simple enough, re-opening the files
listed in fusectl on restart would work. But any transient userspace
state not derivable from the list of open files will be a problem. In
our case, the fuse daemon keeps track of the last write request that was
flushed on the data server (i.e. we sync storage less often than we send
writes). So after a restart the daemon won't be able to tell whether the
data servers are in a consistent state or not.

Thanks,
Maxim

Anatol Pomozov

2013-03-20 19:48:30 UTC

Post by Nikolaus Rath

It turns out that our users have the same concerns about autoumount-on-SIGKILL.

Our build system (the user) distinguishes "normal" fs errors (like
ENOENT) from abnormal ones (ENOTCONN). In case of ENOTCONN the build system
knows that the filesystem is broken and there is nothing it can do.
So the build tool aborts current compilations and shuts itself down. Only
after all users have exited can the mountpoint be cleaned up. If we did
autoumount, the build system would not know that the fs is broken and would keep
compiling. This might produce incorrect output.
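
The distinction described above can be sketched in a few lines of C. This is only an illustration, not the build system's actual code; the enum fs_health and the function probe_path are hypothetical names.

```c
#include <errno.h>
#include <sys/stat.h>

enum fs_health {
    FS_OK,            /* path is reachable */
    FS_MISSING,       /* ordinary error: ENOENT */
    FS_DISCONNECTED,  /* fuse daemon is gone: ENOTCONN */
    FS_OTHER_ERROR    /* anything else */
};

/* Classifies the result of stat() on path into the categories above. */
enum fs_health probe_path(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return FS_OK;
    switch (errno) {
    case ENOENT:
        return FS_MISSING;
    case ENOTCONN:
        return FS_DISCONNECTED;
    default:
        return FS_OTHER_ERROR;
    }
}
```

A caller would treat FS_MISSING as a normal result and abort on FS_DISCONNECTED. After an automatic unmount the same probe would see only the underlying directory, so the abnormal case could no longer be detected this way.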