Discussion:
Strange hangups
Franco Broi
2005-03-16 03:48:10 UTC
Permalink
Hi

I've recently upgraded all my FUSE filesystems to fuse-2.2; previously we
were running fuse-2.1-pre0 and before that fuse-1.9. The upgrade to 2.2
required a few code changes to my filesystem, mainly to use the new
filehandles, and I was able to throw away much of the code which kept
track of open files.

Before the upgrade the filesystem had been running pretty flawlessly for
many months.

We've been running with fuse-2.2 for about a month and have just started
seeing a few problems.

1) FUSE filesystems fail and disappear without a trace. I've only seen
this a couple of times. There aren't any messages in the systems logs
and I can't find any trace of a core file. I unlimit the coredumpsize
before I start the fuse process.

2) FUSE filesystems hang, can't strace or attach with the debugger. In
dmesg we see this strange message: init_special_inode: bogus imode (0)
but I'm not sure if this is directly related to the hangups.
Only 1 thread of the filesystem hangs, other processes seem to continue
to work.

Here is one such thread, the kernel is 2.4.26.

m1 1178 1 0 Mar09 ?
00:00:06 /usr/local/bin/squirrel /m1 -o
allow_other,large_read,fsname=metadata/m1

$ cat status
Name: squirrel
State: D (disk sleep)

$ ls -l fd
lrwx------ 1 m1 64 Mar 16 11:34 3 -> /tmp/.fuse_devPVZN4A/fuse (deleted)
lr-x------ 1 m1 64 Mar 16 11:34 4 -> pipe:[2617]
l-wx------ 1 m1 64 Mar 16 11:34 5 -> pipe:[2617]
lrwx------ 1 m1 64 Mar 16 11:34 6 -> /tmp/tmpfPscGmA
(deleted)
lr-x------ 1 m1 64 Mar 16 11:34 7 -> /data29/m1/367/va4/.mgiva_tc2:0.0
$ ls -l /data29/m1/367/va4/.mgiva_tc2:0.0
-rw-r--r-- 1 m1 1261 Mar 15
11:50 /data29/m1/367/va4/.mgiva_tc2:0.0

$ file /data29/m1/367/va4/.mgiva_tc2:0.0
/data29/m1/367/va4/.mgiva_tc2:0.0: ASCII text

The process that got stuck while using the filesystem was a find command
and I've seen the same thing happen with a perl script that uses readdir
to search for files.

m1 9835 1 0 Mar14 ? 00:00:08 find /m1 -type f -a
( ( -name core ) -o ( -name core.* ) etc.......

I've still got the hanging thread if anyone has any suggestions.

Thanks.
Miklos Szeredi
2005-03-16 10:16:31 UTC
Permalink
Post by Franco Broi
I've recently upgraded all my FUSE filesystems to fuse-2.2; previously we
were running fuse-2.1-pre0 and before that fuse-1.9. The upgrade to 2.2
required a few code changes to my filesystem, mainly to use the new
filehandles, and I was able to throw away much of the code which kept
track of open files.
Before the upgrade the filesystem had been running pretty flawlessly for
many months.
We've been running with fuse-2.2 for about a month and have just started
seeing a few problems.
There was a nasty bug in the kernel part of 2.2 which is fixed in
2.2.1. There haven't been any reports of this causing trouble, maybe
you are the first. Can you try upgrading, and see if the problems
persist?
Post by Franco Broi
1) FUSE filesystems fail and disappear without a trace. I've only seen
this a couple of times. There aren't any messages in the systems logs
and I can't find any trace of a core file. I unlimit the coredumpsize
before I start the fuse process.
Is it unmounted too? That means a clean exit, and is probably the
filesystem's fault.
Post by Franco Broi
2) FUSE filesystems hang, can't strace or attach with the debugger.
Can you do 'Alt-SysRQ-t' or 'echo t > /proc/sysrq-trigger'? And
search for the hanging process in dmesg (or in /var/log/syslog)?
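Before attempting the dump it is worth probing whether magic sysrq is available at all; a small sketch using only the standard procfs paths mentioned here (the dump itself, 'echo t > /proc/sysrq-trigger', needs root):

```shell
# Probe for magic-sysrq support; the actual task dump needs root.
if [ -e /proc/sysrq-trigger ]; then
    # 0 disables sysrq, 1 enables it; the policy file is world-readable.
    echo "sysrq available, current policy: $(cat /proc/sys/kernel/sysrq)"
else
    echo "no magic sysrq compiled into this kernel"
fi
```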
Post by Franco Broi
In dmesg we see this strange message: init_special_inode: bogus
imode (0) but I'm not sure if this is directly related to the
hangups.
Maybe it is. This message cannot possibly be printed, so something is
seriously wrong.
Post by Franco Broi
Only 1 thread of the filesystem hangs, other processes seem to
continue to work.
OK, it would be very nice to see where it's hanging. Either with
sysrq-t or, if there's no magic sysrq compiled into your kernel, the
WCHAN field of 'ps axHl'.

Thanks,
Miklos
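For reference, the per-thread WCHAN listing asked for above can be produced like this (a sketch; the explicit column format is a procps convention and the exact flags may vary by version):

```shell
# 'H' lists each thread separately; WCHAN names the kernel function a
# sleeping task is blocked in ('down' = stuck on a 2.4 semaphore).
ps -o pid,stat,wchan:25,comm axH
```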
Franco Broi
2005-03-16 11:54:47 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
I've recently upgraded all my FUSE filesystems to fuse-2.2; previously we
were running fuse-2.1-pre0 and before that fuse-1.9. The upgrade to 2.2
required a few code changes to my filesystem, mainly to use the new
filehandles, and I was able to throw away much of the code which kept
track of open files.
Before the upgrade the filesystem had been running pretty flawlessly for
many months.
We've been running with fuse-2.2 for about a month and have just started
seeing a few problems.
There was a nasty bug in the kernel part of 2.2 which is fixed in
2.2.1. There haven't been any reports of this causing trouble, maybe
you are the first. Can you try upgrading, and see if the problems
persist?
I'll upgrade one machine tomorrow and try to recreate the problem.
Post by Miklos Szeredi
Post by Franco Broi
1) FUSE filesystems fail and disappear without a trace. I've only seen
this a couple of times. There aren't any messages in the systems logs
and I can't find any trace of a core file. I unlimit the coredumpsize
before I start the fuse process.
Is it unmounted too? That means a clean exit, and is probably the
filesystem's fault.
No, "Transport endpoint is not connected", fusermount -u cleans it up.
Post by Miklos Szeredi
Post by Franco Broi
2) FUSE filesystems hang, can't strace or attach with the debugger.
Can you do 'Alt-SysRQ-t' or 'echo t > /proc/sysrq-trigger'? And
search for the hanging process in dmesg (or in /var/log/syslog)?
Post by Franco Broi
In dmesg we see this strange message: init_special_inode: bogus
imode (0) but I'm not sure if this is directly related to the
hangups.
Maybe it is. This message cannot possibly be printed, so something is
seriously wrong.
Post by Franco Broi
Only 1 thread of the filesystem hangs, other processes seem to
continue to work.
OK, it would be very nice to see where it's hanging. Either with
sysrq-t or, if there's no magic sysrq compiled into your kernel, the
WCHAN field of 'ps axHl'.
I don't have magic sysrq in the kernel, or at least I'm assuming I don't
as the echo did nothing.

PID WCHAN COMMAND
1178 down squirrel
Miklos Szeredi
2005-03-16 12:05:50 UTC
Permalink
Post by Franco Broi
Post by Miklos Szeredi
There was a nasty bug in the kernel part of 2.2 which is fixed in
2.2.1. There haven't been any reports of this causing trouble, maybe
you are the first. Can you try upgrading, and see if the problems
persist?
I'll upgrade one machine tomorrow and try to recreate the problem.
Thanks. Upgrade should be painless, since it contains only small bug
fixes.
Post by Franco Broi
I don't have magic sysrq in the kernel, or at least I'm assuming I don't
as the echo did nothing.
If /proc/sysrq-trigger exists, then it should work. Are you sure
there's nothing in dmesg?
Post by Franco Broi
PID WCHAN COMMAND
1178 down squirrel
It's sleeping on a semaphore, but there's no way to find out which...

Miklos
Franco Broi
2005-03-16 12:12:06 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
Post by Miklos Szeredi
There was a nasty bug in the kernel part of 2.2 which is fixed in
2.2.1. There haven't been any reports of this causing trouble, maybe
you are the first. Can you try upgrading, and see if the problems
persist?
I'll upgrade one machine tomorrow and try to recreate the problem.
Thanks. Upgrade should be painless, since it contains only small bug
fixes.
Post by Franco Broi
I don't have magic sysrq in the kernel, or at least I'm assuming I don't
as the echo did nothing.
If /proc/sysrq-trigger exists, then it should work. Are you sure
there's nothing in dmesg?
Not a sausage.
Franco Broi
2005-03-16 12:24:08 UTC
Permalink
Post by Miklos Szeredi
If /proc/sysrq-trigger exists, then it should work. Are you sure
there's nothing in dmesg?
Didn't have it turned on....


squirrel D 0001A1A7 2404 1178 7527 1179 (NOTLB)
Call Trace: [<c0107cb2>] [<c0107e4c>] [<c01576c6>] [<c0153cf1>]
[<c0154019>]
[<c0154369>] [<c01501ff>] [<c01092df>]
Miklos Szeredi
2005-03-16 12:28:05 UTC
Permalink
Post by Franco Broi
Didn't have it turned on....
squirrel D 0001A1A7 2404 1178 7527 1179 (NOTLB)
Call Trace: [<c0107cb2>] [<c0107e4c>] [<c01576c6>] [<c0153cf1>]
[<c0154019>]
[<c0154369>] [<c01501ff>] [<c01092df>]
Ahh, good!

Now please run it through ksymoops; hopefully it will make more sense.

Thanks,
Miklos
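What ksymoops does with a raw trace like the one above is essentially a nearest-symbol lookup in the kernel's System.map. A minimal sketch of that idea, using a tiny hypothetical map excerpt whose base addresses are back-computed from the decoded trace later in this thread (a real map lives in /boot/System.map-<kernel-version>):

```shell
# Hypothetical System.map excerpt; base addresses chosen to be
# consistent with the decoded trace (e.g. __down+82 at c0107cb2).
cat > /tmp/System.map.demo <<'EOF'
c0107c30 T __down
c0107e44 T __down_failed
c01536e0 T link_path_walk
EOF

# Resolve a raw trace address to "symbol+offset", like ksymoops does:
# scan the map and keep the highest symbol address <= the target.
resolve() {
    addr=$((0x$1)); sym=unknown; base=0
    while read b t s; do
        v=$((0x$b))
        if [ "$v" -le "$addr" ] && [ "$v" -ge "$base" ]; then
            base=$v; sym=$s
        fi
    done < /tmp/System.map.demo
    printf '%s <%s+%x>\n' "$1" "$sym" $((addr - base))
}

resolve c0107cb2    # -> c0107cb2 <__down+82>
resolve c0153cf1    # -> c0153cf1 <link_path_walk+611>
```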
Franco Broi
2005-03-16 12:36:05 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
Didn't have it turned on....
squirrel D 0001A1A7 2404 1178 7527 1179 (NOTLB)
Call Trace: [<c0107cb2>] [<c0107e4c>] [<c01576c6>] [<c0153cf1>]
[<c0154019>]
[<c0154369>] [<c01501ff>] [<c01092df>]
Ahh, good!
Now please run it through ksymoops; hopefully it will make more sense.
squirrel D 0001A1A7 2404 1178 7527 1179 (NOTLB)
Using defaults from ksymoops -t elf32-i386 -a i386
Call Trace: [<c0107cb2>] [<c0107e4c>] [<c01576c6>] [<c0153cf1>]
[<c0154019>]
[<c0154369>] [<c01501ff>] [<c01092df>]
Warning (Oops_read): Code line not seen, dumping what data is available

Proc; squirrel
EIP; 0001a1a7 Before first symbol <=====
Trace; c0107cb2 <__down+82/d0>
Trace; c0107e4c <__down_failed+8/c>
Trace; c01576c6 <.text.lock.namei+35/49f>
Trace; c0153cf1 <link_path_walk+611/730>
Trace; c0154019 <path_lookup+39/40>
Trace; c0154369 <__user_walk+49/60>
Trace; c01501ff <sys_lstat64+1f/90>
Trace; c01092df <system_call+33/38>
Miklos Szeredi
2005-03-16 12:55:29 UTC
Permalink
Post by Franco Broi
Trace; c0107cb2 <__down+82/d0>
Trace; c0107e4c <__down_failed+8/c>
Trace; c01576c6 <.text.lock.namei+35/49f>
Trace; c0153cf1 <link_path_walk+611/730>
Trace; c0154019 <path_lookup+39/40>
Trace; c0154369 <__user_walk+49/60>
Trace; c01501ff <sys_lstat64+1f/90>
Trace; c01092df <system_call+33/38>
Thanks, I'll try to make some sense of this.

Miklos
Miklos Szeredi
2005-03-16 13:36:07 UTC
Permalink
Post by Franco Broi
Trace; c0107cb2 <__down+82/d0>
Trace; c0107e4c <__down_failed+8/c>
Trace; c01576c6 <.text.lock.namei+35/49f>
Trace; c0153cf1 <link_path_walk+611/730>
Trace; c0154019 <path_lookup+39/40>
Trace; c0154369 <__user_walk+49/60>
Trace; c01501ff <sys_lstat64+1f/90>
Trace; c01092df <system_call+33/38>
It's probably sleeping on the inode 'i_sem' semaphore. This means
that some other thread is keeping this semaphore locked.

It could possibly be a deadlock, if the same semaphore is held by the
requester. Isn't there another thread sleeping (not the filesystem
itself, but some other process)? It should be sleeping in
request_wait_answer or request_wait_answer_nonint. If there is, just
kill it and the stuck thread should also get unstuck.

This is just a guess however, it could be anything.

Thanks,
Miklos
Franco Broi
2005-03-17 00:30:18 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
Trace; c0107cb2 <__down+82/d0>
Trace; c0107e4c <__down_failed+8/c>
Trace; c01576c6 <.text.lock.namei+35/49f>
Trace; c0153cf1 <link_path_walk+611/730>
Trace; c0154019 <path_lookup+39/40>
Trace; c0154369 <__user_walk+49/60>
Trace; c01501ff <sys_lstat64+1f/90>
Trace; c01092df <system_call+33/38>
It's probably sleeping on the inode 'i_sem' semaphore. This means
that some other thread is keeping this semaphore locked.
It could possibly be a deadlock, if the same semaphore is held by the
requester. Isn't there another thread sleeping (not the filesystem
itself, but some other process)? It should be sleeping in
request_wait_answer or request_wait_answer_nonint. If there is, just
kill it and the stuck thread should also get unstuck.
This is just a guess however, it could be anything.
This is a trace from the find process, it can't be killed.

Proc; find
EIP; f61b2670 <_end+35e40aa0/3849b490> <=====
Trace; c0107cb2 <__down+82/d0>
Trace; c0107e4c <__down_failed+8/c>
Trace; f9cf981c <[fuse]fuse_readdir+0/d4>
Trace; c01597b1 <.text.lock.readdir+5/a4>
Trace; c015962b <sys_getdents64+5b/c0>
Trace; c01594c0 <filldir64+0/110>
Trace; c015835d <sys_fcntl64+5d/c0>
Trace; c01092df <system_call+33/38>
Miklos Szeredi
2005-03-17 07:17:49 UTC
Permalink
Post by Franco Broi
This is a trace from the find process, it can't be killed.
Not even with "kill -9"?

Otherwise the trace is interesting: it shows that it's not actually a
deadlock, since the find is not in request_wait_answer, but is
sleeping on the request semaphore, which it shouldn't do if requests
are otherwise being handled.

So whichever way I look at it, this must be memory corruption, which
could be explained with the bug in 2.2.

The interesting thing is that find and squirrel are both contending
for the same inode, which means that the filesystem is touching its
own files. This is not strictly forbidden, but usually not a good
idea, because it's prone to deadlock. Do you think that this could be
the case, or am I totally off the track?

Thanks,
Miklos
Franco Broi
2005-03-17 07:51:24 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
This is a trace from the find process, it can't be killed.
Not even with "kill -9"?
No. Probably have to reboot.
Post by Miklos Szeredi
Otherwise the trace is interesting: it shows that it's not actually a
deadlock, since the find is not in request_wait_answer, but is
sleeping on the request semaphore, which it shouldn't do if requests
are otherwise being handled.
So whichever way I look at it, this must be memory corruption, which
could be explained with the bug in 2.2.
I've been trying to reproduce the problem before I do the upgrade but it
hasn't failed yet. I also tried to reproduce the error using fusexmp;
that too didn't fail, but it did allocate nearly 40MB of memory and it
didn't look like it was going to release it before I unmounted the fs. I
know this is probably normal but I don't ever remember seeing a FUSE fs
use quite this much memory before.
Post by Miklos Szeredi
The interesting thing is that find and squirrel are both contending
for the same inode, which means that the filesystem is touching its
own files. This is not strictly forbidden, but usually not a good
idea, because it's prone to deadlock. Do you think that this could be
the case, or am I totally off the track?
You mean that squirrel is accessing files using the FUSE mount point? I
don't think so.

How do you know it's the same file? The find process fd directory only
shows a directory as being open:


lrwx------ 1 m1 64 Mar 17 15:48 0 -> /dev/pts/2 (deleted)
l-wx------ 1 m1 64 Mar 17 15:48 1 -> pipe:[20426705]
l-wx------ 1 m1 64 Mar 17 15:48 2 -> pipe:[20426705]
lr-x------ 1 m1 64 Mar 17 15:48 3 -> /home/m1/
lr-x------ 1 m1 64 Mar 17 15:48 4 -> /m1/406/test/


Whereas the squirrel process seems to be accessing a text file:

lr-x------ 1 m1 64 Mar 17 15:50 7 -> /data29/m1/367/va4/.mgiva_tc2:0.0
Miklos Szeredi
2005-03-17 09:56:25 UTC
Permalink
Post by Franco Broi
I've been trying to reproduce the problem before I do the upgrade but it
hasn't failed yet.
If it's the bug I think it is, then it should be very hard to trigger,
and may require heavy FUSE filesystem activity while mounting other
FUSE filesystems. I'm not really familiar with the exact workings of
the memory allocator, so there may be other more subtle failure modes.
Post by Franco Broi
I also tried to reproduce the error using fusexmp; that too didn't
fail, but it did allocate nearly 40MB of memory and it didn't look
like it was going to release it before I unmounted the fs. I know
this is probably normal but I don't ever remember seeing a FUSE fs
use quite this much memory before.
Is the 40MB in the VSZ or the RSS column? If it's the virtual size,
that is quite normal. Real memory usage should not go up that much.
Post by Franco Broi
You mean that squirrel is accessing files using the FUSE mount point? I
don't think so.
Can it happen in theory?
Post by Franco Broi
How do you know it's the same file? The find process fd directory only
There's only circumstantial evidence:

- The find process is doing a readdir operation (and is blocking in
it for some yet unknown reason), during which it has the
directory's inode semaphore locked.

- The squirrel thread is blocking on some directory's inode semaphore
while it looks up a file for lstat().

Putting these together makes it highly probable that it's one and the
same directory, which means that squirrel is accessing a directory
under a FUSE mount.

But even if this can happen, it shouldn't cause such problems, and
killing the original requester would break the deadlock.

Miklos
Franco Broi
2005-03-17 11:50:51 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
I've been trying to reproduce the problem before I do the upgrade but it
hasn't failed yet.
If it's the bug I think it is, then it should be very hard to trigger,
and may require heavy FUSE filesystem activity while mounting other
FUSE filesystems. I'm not really familiar with the exact workings of
the memory allocator, so there may be other more subtle failure modes.
Our FUSE filesystems stay permanently mounted so it doesn't sound like
this is our problem, but I'll do a test upgrade anyway.

I've seen other problems on a 2.6 system but I was also running NFSv4 so
I'm reluctant to blame FUSE at this point.
Post by Miklos Szeredi
Post by Franco Broi
You mean that squirrel is accessing files using the FUSE mount point? I
don't think so.
Can it happen in theory?
No, I'm pretty sure not. The FUSE mount point is in the / directory and
the real partitions are NFS mounted. The squirrel system gives the user
access to all data without them knowing on which partition the data
resides; it doesn't have any sort of directory structure itself.
Miklos Szeredi
2005-03-17 12:14:19 UTC
Permalink
Post by Franco Broi
Our FUSE filesystems stay permanently mounted so it doesn't sound like
this is our problem, but I'll do a test upgrade anyway.
Upgrading is a good idea, even if this is another problem.
Post by Franco Broi
I've seen other problems on a 2.6 system but I was also running NFSv4 so
I'm reluctant to blame FUSE at this point.
I wouldn't rule out a FUSE bug yet, you have helped to find quite a
number of them :)

Can you try with NFSv3? NFSv4 is still marked experimental.
Post by Franco Broi
Post by Miklos Szeredi
Post by Franco Broi
You mean that squirrel is accessing files using the FUSE mount point? I
don't think so.
Can it happen in theory?
No, I'm pretty sure not. The FUSE mount point is in the / directory and
the real partitions are NFS mounted. The squirrel system gives the user
access to all data without them knowing on which partition the data
resides; it doesn't have any sort of directory structure itself.
Symlink pointing out of the NFS directory?

Thanks,
Miklos
Franco Broi
2005-03-17 12:34:00 UTC
Permalink
Post by Miklos Szeredi
I wouldn't rule out a FUSE bug yet, you have helped to find quite a
number of them :)
I have been absolutely amazed at how few bugs there have been. We've
been basically using it in a full-blown production environment since the
1.9 days, and it's been brilliant. We are about to increase our disk
capacity to 110TB - one huge FUSE filesystem - is this a record?
Post by Miklos Szeredi
Can you try with NFSv3? NFSv4 is still marked experimental.
All our machines are running v3 for FUSE; I only ran v4 on this
particular machine to test performance - wow, random access is nothing
short of fantastic! I was using a stock kernel (2.6.11) and didn't have
all the latest patches, so I wasn't too surprised when some weird things
started to happen. The other 2.6 machines have been fine so far.
Post by Miklos Szeredi
Post by Franco Broi
Post by Miklos Szeredi
Can it happen in theory?
No, I'm pretty sure not. The FUSE mount point is in the / directory and
the real partitions are NFS mounted. The squirrel system gives the user
access to all data without them knowing on which partition the data
resides; it doesn't have any sort of directory structure itself.
Symlink pointing out of the NFS directory?
I'd thought of that, but most of our users wouldn't know a symlink if it
bit them; they're only Geophysicists after all!

As a test I ran fusexmp and did a find from the root directory to
include the FUSE mount point. It ran for over 3 hours and had recursed
several times before I stopped it, worked fine.
Miklos Szeredi
2005-03-17 14:05:52 UTC
Permalink
Post by Franco Broi
Post by Miklos Szeredi
I wouldn't rule out a FUSE bug yet, your have helped to find quite a
number of them :)
I have been absolutely amazed at how few bugs there have been. We've
been basically using it in a full-blown production environment since the
1.9 days, and it's been brilliant. We are about to increase our disk
capacity to 110TB - one huge FUSE filesystem - is this a record?
Probably it is. I'm more used to filesystems of 1/1000 this size :)
Post by Franco Broi
All our machines are running v3 for FUSE; I only ran v4 on this
particular machine to test performance - wow, random access is nothing
short of fantastic! I was using a stock kernel (2.6.11) and didn't have
all the latest patches, so I wasn't too surprised when some weird things
started to happen. The other 2.6 machines have been fine so far.
Ahh, OK. The others are running FUSE 2.2 too?

Miklos
Franco Broi
2005-03-18 00:29:55 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
All our machines are running v3 for FUSE; I only ran v4 on this
particular machine to test performance - wow, random access is nothing
short of fantastic! I was using a stock kernel (2.6.11) and didn't have
all the latest patches, so I wasn't too surprised when some weird things
started to happen. The other 2.6 machines have been fine so far.
Ahh, OK. The others are running FUSE 2.2 too?
Yes.
Franco Broi
2005-03-22 02:16:42 UTC
Permalink
Upgraded all our machines to FUSE-2.2.1 yesterday, today I've had a
failure, but it's of the disappearing filesystem type.


fuse init (API version 5.1)
fuse distribution version: 2.2.1


Filesystem 1K-blocks Used Available Use% Mounted on
metadata/3d 49492105856 43593505920 5898599936 89% /3d
metadata/l1 49492105856 43593506656 5898599200 89% /l1
df: `/m1': Transport endpoint is not connected

I can't find a core file anywhere. There were 2 processes using the
filesystem at the time it failed and they were killed stone dead.

I've only seen this happen starting with FUSE-2.2.

If my code had called exit (which it doesn't) the filesystem should have
unmounted cleanly, right? What could have happened to cause the 2
processes using the filesystem to die as if killed with -9?
Miklos Szeredi
2005-03-22 10:07:20 UTC
Permalink
Post by Franco Broi
Upgraded all our machines to FUSE-2.2.1 yesterday, today I've had a
failure, but it's of the disappearing filesystem type.
Does coredumping otherwise work? Can you check with killing a running
filesystem with 'kill -SEGV'?
Post by Franco Broi
fuse init (API version 5.1)
fuse distribution version: 2.2.1
Filesystem 1K-blocks Used Available Use% Mounted on
metadata/3d 49492105856 43593505920 5898599936 89% /3d
metadata/l1 49492105856 43593506656 5898599200 89% /l1
df: `/m1': Transport endpoint is not connected
I can't find a core file anywhere. There were 2 processes using the
filesystem at the time it failed and they were killed stone dead.
I've only seen this happen starting with FUSE-2.2.
If my code had called exit (which it doesn't) the filesystem should have
unmounted cleanly, right?
Yes. It would have lazy umounted the filesystem.
Post by Franco Broi
What could have happened to cause the 2 processes using the
filesystem to die as if killed with -9?
No idea.

Anything interesting in dmesg output? Out of memory? Oops?

How easy is it to reproduce?

Thanks,
Miklos
Franco Broi
2005-03-23 00:42:10 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
Upgraded all our machines to FUSE-2.2.1 yesterday, today I've had a
failure, but it's of the disappearing filesystem type.
Does coredumping otherwise work? Can you check with killing a running
filesystem with 'kill -SEGV'?
No it doesn't. Where would the core file go if not in the directory
where the process is started?
Post by Miklos Szeredi
Anything interesting in dmesg output? Out of memory? Oops?
No, nothing.
Post by Miklos Szeredi
How easy is it to reproduce?
I haven't managed to reproduce the problem at all. It takes random user
production use on 28 machines each running 3 FUSE systems, and then it
seems fairly random and totally unpredictable. There were no new
occurrences of the problem today.
Franco Broi
2005-03-23 01:21:54 UTC
Permalink
More on this:

If I run the filesystem as a foreground process I can kill it and get a
core dump.

I've noticed that on startup the filesystem has 3 threads, and what
seems odd to me is that they don't all share the same parent, i.e.
thread 3 has thread 2 as the parent. Is this normal?

franco 11198 9415 0 09:19 pts/0
00:00:00 /home/franco/fuse-2.2.1/example/.libs/lt-fusexmp -f fred
franco 11217 11198 0 09:19 pts/0
00:00:00 /home/franco/fuse-2.2.1/example/.libs/lt-fusexmp -f fred
franco 11218 11217 0 09:19 pts/0
00:00:00 /home/franco/fuse-2.2.1/example/.libs/lt-fusexmp -f fred
Post by Franco Broi
Post by Miklos Szeredi
Post by Franco Broi
Upgraded all our machines to FUSE-2.2.1 yesterday, today I've had a
failure, but it's of the disappearing filesystem type.
Does coredumping otherwise work? Can you check with killing a running
filesystem with 'kill -SEGV'?
No it doesn't. Where would the core file go if not in the directory
where the process is started?
Post by Miklos Szeredi
Anything interesting in dmesg output? Out of memory? Oops?
No, nothing.
Post by Miklos Szeredi
How easy is it to reproduce?
I haven't managed to reproduce the problem at all. It takes random user
production use on 28 machines each running 3 FUSE systems, and then it
seems fairly random and totally unpredictable. There were no new
occurrences of the problem today.
Terje Oseberg
2005-03-23 02:12:12 UTC
Permalink
Post by Franco Broi
I've noticed that on startup the filesystem has 3 threads, and what
seems odd to me is that they don't all share the same parent, i.e.
thread 3 has thread 2 as the parent. Is this normal?
franco 11198 9415 0 09:19 pts/0 00:00:00 /home/franco/fuse-2.2.1/example/.libs/lt-fusexmp -f fred
franco 11217 11198 0 09:19 pts/0 00:00:00 /home/franco/fuse-2.2.1/example/.libs/lt-fusexmp -f fred
franco 11218 11217 0 09:19 pts/0 00:00:00 /home/franco/fuse-2.2.1/example/.libs/lt-fusexmp -f fred
This is exactly what I got after I upgraded my libc to fix a bug in
the native POSIX thread library (NPTL). It appeared as if there is a
monitor process (11198) that is monitoring the main thread (11217),
then an extra thread created by fusexmp (11218). When you load the
process with parallel filesystem requests you will get more threads
whose parent is 11217.

Terje Oseberg
Miklos Szeredi
2005-03-23 07:18:50 UTC
Permalink
Post by Franco Broi
Post by Miklos Szeredi
Post by Franco Broi
Upgraded all our machines to FUSE-2.2.1 yesterday, today I've had a
failure, but it's of the disappearing filesystem type.
Does coredumping otherwise work? Can you check with killing a running
filesystem with 'kill -SEGV'?
No it doesn't. Where would the core file go if not in the directory
where the process is started?
Since 2.1 daemon() is used to put the filesystem in the background,
which changes the CWD to '/'. You might find the corefile in there.
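A quick way to check what would happen on the next crash, without waiting for one (a sketch; the core_pattern sysctl may not exist on every 2.4 kernel, hence the fallback):

```shell
# daemon() does chdir("/"), so a backgrounded filesystem would dump
# core in '/' - and only if the core size limit and permissions allow.
ulimit -c unlimited            # raise the core size limit for this shell
ulimit -c                      # confirm the new limit
# On newer kernels this sysctl can redirect core files elsewhere:
cat /proc/sys/kernel/core_pattern 2>/dev/null || true
```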
Post by Franco Broi
Post by Miklos Szeredi
How easy is it to reproduce?
I haven't managed to reproduce the problem at all. It takes random user
production use on 28 machines each running 3 FUSE systems, and then it
seems fairly random and totally unpredictable. There were no new
occurrences of the problem today.
Keep on sending the reports if it crashes. The backtrace from
the corefile would probably help as well.
Post by Franco Broi
I've noticed that on startup the filesystem has 3 threads, and what
seems odd to me is that they don't all share the same parent, i.e.
thread 3 has thread 2 as the parent. Is this normal?
As you can see in fuse_mt.c it's random which thread starts a new
worker thread. So maybe this is followed by newer versions of
libpthread. This is linux 2.4, isn't it? On 2.6 you would get the
same PID and PPID for all threads.

Thanks,
Miklos
Franco Broi
2005-03-23 07:34:05 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
Post by Miklos Szeredi
Post by Franco Broi
Upgraded all our machines to FUSE-2.2.1 yesterday, today I've had a
failure, but it's of the disappearing filesystem type.
Does coredumping otherwise work? Can you check with killing a running
filesystem with 'kill -SEGV'?
No it doesn't. Where would the core file go if not in the directory
where the process is started?
Since 2.1 daemon() is used to put the filesystem in the background,
which changes the CWD to '/'. You might find the corefile in there.
But does the process have permission to write to / ?
Post by Miklos Szeredi
As you can see in fuse_mt.c it's random which thread starts a new
worker thread. So maybe this is followed by newer versions of
libpthread. This is linux 2.4, isn't it? On 2.6 you would get the
same PID and PPID for all threads.
I thought that might be the case, although I don't understand why 3
threads appear immediately, before any sort of access is made to the
filesystem.

On 2.6 I only see a single thread with ps but I see there are entries in
the task directory.
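Those task-directory entries can be inspected directly; a small sketch run against the current shell, valid for any Linux 2.6+ /proc:

```shell
# On 2.6 all threads of a process share one PID in default ps output,
# but each thread still gets its own entry under /proc/<pid>/task.
pid=$$
ls /proc/$pid/task                    # one directory per thread
head -3 /proc/$pid/task/$pid/status   # includes the Name: and State: lines
```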
Miklos Szeredi
2005-03-23 12:25:21 UTC
Permalink
Post by Franco Broi
Post by Miklos Szeredi
Since 2.1 daemon() is used to put the filesystem in the background,
which changes the CWD to '/'. You might find the corefile in there.
But does the process have permission to write to / ?
If not, then no corefile will be written. So the easiest solution is
to start the filesystem with 'yourfs -f mountpoint &'.
Post by Franco Broi
I thought that might be the case, although I don't understand why 3
threads appear immediately, before any sort of access is made to the
filesystem.
Probably the 'INIT' request always provokes starting a new thread
(this is only since 2.2).
Post by Franco Broi
On 2.6 I only see a single thread with ps but I see there are entries in
the task directory.
Yes, that's how it should work.

Miklos
Franco Broi
2005-03-23 12:54:56 UTC
Permalink
Post by Miklos Szeredi
Post by Franco Broi
Post by Miklos Szeredi
Since 2.1 daemon() is used to put the filesystem in the background,
which changes the CWD to '/'. You might find the corefile in there.
But does the process have permission to write to / ?
If not, then no corefile will be written. So the easiest solution is
to start the filesystem with 'yourfs -f mountpoint &'.
OK will do. I should have done it last time you suggested it as a way of
allowing output to a log file.

I'll try and get most machines running this way tomorrow and hopefully
we'll soon have a core file to give us some clues as to what's going on.

Thanks.
Franco Broi
2005-03-23 14:14:14 UTC
Permalink
Post by Terje Oseberg
This is exactly what I got after I upgraded my libc to fix a bug in
the native POSIX thread library (NPTL). It appeared as if there is a
monitor process (11198) that is monitoring the main thread (11217),
then an extra thread created by fusexmp (11218). When you load the
process with parallel filesystem requests you will get more threads
whose parent is 11217.
OK, that's interesting because I've just realised that 8 of my machines
show the multiple threads and the other 20 just single threads. They are
all running Red Hat 8, but the 8 machines are slightly newer. The uptime
format is also different, suggesting that something got updated on the 8
new machines at some point.
