Discussion:
device namespaces
riya khanna
2014-09-24 04:34:46 UTC
(Please pardon multiple emails, artifact of merging all separate
conversations)

Thanks for your feedback!

Letting the kernel know about what devices a container could access (based
on device cgroups) and having devtmpfs in the kernel create device nodes
for a container that map to corresponding CUSE nodes is what I thought of.
For example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual
framebuffer (based on real fb0 SCREENINFO properties) for this process
provided permissions allow this operation. To view the framebuffer, the
CUSE based virtual device would talk to the actual hardware. Since
namespaces would have different views of the underlying devices, "sysfs" has
to be made aware of this as well.
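
For readers unfamiliar with the notation: "29:0" in the example is a major:minor pair, and major 29 is the character major Linux assigns to framebuffers, so 29:0 corresponds to /dev/fb0. A minimal sketch of parsing such a string and packing it into a device number follows. The /proc/<pid>/devices file is only the interface proposed in this message, not an existing one, and `parse_devnum` is a hypothetical helper name.

```python
import os

def parse_devnum(spec):
    """Parse a "major:minor" string such as "29:0" into (major, minor)."""
    major_s, minor_s = spec.strip().split(":")
    major, minor = int(major_s), int(minor_s)
    if major < 0 or minor < 0:
        raise ValueError("device numbers must be non-negative")
    return major, minor

# Pack the pair into a single dev_t value and unpack it again.
major, minor = parse_devnum("29:0")
dev = os.makedev(major, minor)
assert (os.major(dev), os.minor(dev)) == (29, 0)
```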

Please let me know your inputs. Thanks again!

-Riya
Hi,
I'm a newbie trying to come up with a fuse/cuse-based solution to
device namespace virtualization.
Fwiw I find the thought of allowing use of cuse from a container (well,
an unprivileged container at least) more than a little bit frightening
from a security perspective. If a process does an ioctl on a cuse-based
device then the process implementing the device can get a very broad
ability to read and write in the initiator's address space.
The cuse or fuse process would best run with the permissions of the
container. Even for an unprivileged container it could connect to
bind-mounts of say /dev/null etc for any passthrough access.
If the device were to show up automagically in devtmpfs and a process on
the host could be tricked into opening the device, then that sounds like a
great vector for an attack. Just something to keep in mind.
Yup. You'd like to think that having the devices be owned by uid 100000
would be a clue, but a script might not notice. The fs should only be
mounted in the container's fs, but that can of course be reached through
/proc/pid/root. Now an unpriv user shouldn't be able to chroot into
there without starting a new user namespace - leaving the victim no
longer privileged and so no more harmful than the user was to begin with.
I don't think it matters if the user is unprivileged if you're using
cuse to implement the devices. In order for it to work the unprivileged
user would need read/write access to /dev/cuse, and once it has that
there seem to be no restrictions on what cuse functionality it can
make use of.
When the user creates a device cuse calls device_add() for the new
device, which is going to create a node in devtmpfs which is owned by
global root. At that point I see nothing that would stop a process in
the host from opening the file and doing ioctls. It looks like it would
even be possible to use cuse to claim a well-known major/minor pair for
your device if it wasn't already claimed (e.g. the driver was a module
and not loaded).
I didn't spend a lot of time looking at the code, so it's possible I
missed something, but if I didn't then giving unprivileged users access
to /dev/cuse seems like a very bad idea.
Ok, agreed. The original author mainly mentioned fuse. I thought fuse
couldn't create device nodes though.
Yeah, but since he did mention cuse I thought I'd throw out a warning.
With fuse it is technically possible to have device nodes, but it's
usually prevented for unprivileged users by the suid helper (fusermount)
adding MS_NODEV to the mountflags. With my patches for fuse in user
namespaces the kernel will add nodev for any userns mount, and from a
security perspective I don't see any way around that.
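
As a rough illustration of the nodev point, here is a small sketch that checks whether a mount entry carries the nodev option. It assumes the input is a line in the /proc/self/mounts format (device, mount point, type, options, ...); the sample lines are made up for illustration, and `mount_has_nodev` is a hypothetical helper name.

```python
def mount_has_nodev(mounts_line):
    """Return True if a /proc/self/mounts-style line lists the nodev option."""
    fields = mounts_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least 4 whitespace-separated fields")
    # The fourth field is the comma-separated mount option list.
    options = fields[3].split(",")
    return "nodev" in options

# Illustrative entries only, not taken from a real system.
fuse_line = "myfs /mnt/myfs fuse.myfs rw,nosuid,nodev,relatime,user_id=1000,group_id=1000 0 0"
dev_line = "devtmpfs /dev devtmpfs rw,nosuid,size=1024k,mode=755 0 0"
assert mount_has_nodev(fuse_line)
assert not mount_has_nodev(dev_line)
```

Device nodes on a mount flagged nodev are not interpreted as devices by the kernel, which is why the flag matters for fuse mounts made by unprivileged users.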
Seth
_______________________________________________
lxc-devel mailing list
http://lists.linuxcontainers.org/listinfo/lxc-devel
Eric W. Biederman
2014-09-24 05:04:30 UTC
The solution hugely depends on what you are trying to do with it.

The situation today is that device nodes are slowly fading out. In
another 20 years linux may not have any device nodes at all.

Therefore the question becomes what are you trying to support.

If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
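
A minimal sketch of what per-device bind mounts could look like, assuming a container whose /dev lives under a host-visible directory. The function only builds the `mount --bind` command lines and does not run them; the container path and the helper name `bind_mount_plan` are hypothetical.

```python
import os

def bind_mount_plan(host_dev, container_dev, allowed):
    """Return one `mount --bind` argv list per allowed device node."""
    plan = []
    for name in allowed:
        src = os.path.join(host_dev, name)
        # The target must already exist as an empty file before mounting.
        dst = os.path.join(container_dev, name)
        plan.append(["mount", "--bind", src, dst])
    return plan

# Hypothetical container rootfs path, for illustration only.
for argv in bind_mount_plan("/dev", "/var/lib/lxc/c1/rootfs/dev", ["null", "zero", "fb0"]):
    print(" ".join(argv))
```

Any device left out of the allowed list simply has no node inside the container, which is the filtering effect described above.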

If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.

There are a few cases where it is desirable to emulate what devpts
does for allowing arbitrary users to create virtual devices in the
kernel. Loop devices in particular.

Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).

The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire I think it is entirely reasonable to device type by device type
add support for multiplexing that device type to the kernel, or
potentially just use fuse or cuse to implement your multiplexer in
userspace but that has the potential to be unusably slow.

Eric
riya khanna
2014-09-24 05:32:27 UTC
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)? Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility. I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Serge Hallyn
2014-09-24 16:37:40 UTC
Isolation is provided by the devices cgroup. You want something more
than isolation.
Eric W. Biederman
2014-09-24 17:43:12 UTC
Post by Serge Hallyn
Isolation is provided by the devices cgroup. You want something more
than isolation.
Post by riya khanna
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)?
Namespaces fundamentally provide for using the same ``global'' name
in different contexts. This allows them to be used for isolation
and process migration (because you can take the same name from
machine to machine).

Unless someone cares about device numbers at a namespace level
the work is done.

The mount namespace exists to deal with file names.
The devices cgroup will limit which devices you can access (although
I can't ever imagine a case where the mount namespace would be
insufficient).
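
For reference, the cgroup v1 devices controller takes rules of the form "type major:minor access" written to a cgroup's devices.allow or devices.deny file. A small sketch that formats such a rule follows; `device_rule` is a hypothetical helper, and the cgroup path in the comment is an assumption about where the hierarchy is mounted.

```python
def device_rule(kind, major, minor, access):
    """Format a cgroup v1 devices rule such as "c 29:0 rw".

    kind   -- "c" (char), "b" (block) or "a" (all)
    access -- any non-empty combination of "r", "w" and "m" (mknod)
    """
    if kind not in ("a", "b", "c"):
        raise ValueError("kind must be 'a', 'b' or 'c'")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be built from 'r', 'w', 'm'")
    return "%s %d:%d %s" % (kind, major, minor, access)

# Writing this string to e.g. /sys/fs/cgroup/devices/<group>/devices.allow
# (path assumed) would grant read/write access to the first framebuffer.
print(device_rule("c", 29, 0, "rw"))
```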
Post by Serge Hallyn
Post by riya khanna
Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility.
I don't see how. If they are mounts that propagate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a directory tree that propagates into a container and works.
Post by Serge Hallyn
Post by riya khanna
I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
Is the problem you actually care about multiplexing devices?

I think there is quite a bit of room to talk about how to safely
and effectively use devices in containers. So let's make that the
discussion. No one actually wants device number namespaces and talking
about them only muddies the waters.

Eric
Jude Nelson
2014-09-24 18:06:45 UTC
Whoops, meant to CC fuse-devel

---------- Forwarded message ----------
From: Jude Nelson <***@gmail.com>
Date: Wed, Sep 24, 2014 at 2:03 PM
Subject: Re: [fuse-devel] Using devices in Containers (was: [lxc-devel]
device namespaces)
To: "Eric W. Biederman" <***@xmission.com>


What if you had a FUSE filesystem mounted on your container's /dev that
kept track of the device nodes in the root context's /dev, but applied some
filters to show only the device nodes you want the contained processes to
see?

This is something I've been working on for a few weeks, and I'm almost
ready to put something on github. It doesn't depend on a particular
hotplug service, nor does it depend on cgroups. Would you be interested?
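
To make the filtering idea concrete, here is a minimal sketch of the decision such a filesystem would have to make for each directory listing: given the node names present in the host's /dev and a list of allowed glob patterns, return the names a contained process should see. This is not the implementation described in this message; the pattern-based policy and the name `visible_nodes` are assumptions for illustration.

```python
from fnmatch import fnmatch

def visible_nodes(host_nodes, allow_patterns):
    """Return the sorted subset of host_nodes matching any allowed pattern."""
    return sorted(
        name for name in host_nodes
        if any(fnmatch(name, pattern) for pattern in allow_patterns)
    )

host = ["null", "zero", "sda", "fb0", "tty1"]
print(visible_nodes(host, ["null", "zero", "fb*"]))
```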

-Jude
_______________________________________________
fuse-devel mailing list
https://lists.sourceforge.net/lists/listinfo/fuse-devel
Riya Khanna
2014-09-24 19:30:30 UTC
Post by Eric W. Biederman
I don't see how. If they are mounts that propagate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a directory tree that propagates into a container and works.
Device-by-device bind mounts can grant/revoke access to real individual devices as and when needed. However, revoking the access to real devices could break the applications if there’s no transparent mechanism to back up the propagated (but now revoked) device bind mounts that could fool the apps into believing that they are working with real devices. Frame buffer is one such example, where safe multiplexing could be applied.
Post by Eric W. Biederman
Is the problem you actually care about multiplexing devices?
The problem I care about is access to real devices, such as input, fb, loop, etc. as and when needed, thereby having native I/O performance - either through secure multiplexing or exclusive ownership, whatever makes sense according to the device type.
Post by Eric W. Biederman
I think there is quite a bit of room to talk about how to safely
and effectively use devices in containers. So let's make that the
discussion. No one actually wants device number namespaces and talking
about them only muddies the waters.
I cannot agree more. Let’s restrict the discussion to it.

Thanks,
Riya
Eric W. Biederman
2014-09-24 22:38:03 UTC
Post by Riya Khanna
I guess policy-based multiplexing (or exclusive ownership) is the
usage. What kind of devices (loop, fb, etc.) this is needed for
depends on the usage. If there are multiple FBs, then each container
could potentially own one. One may want to provide exclusive ownership
of input devices to one container at a time to avoid information
leakage. Like we saw at LPC last year, this applies to sensors (gps,
accelerometer, etc.) on mobile devices as well.
Allowing multiplexing of those devices seems reasonable.

Where the discussion ran into problems last time was that people did not
want to use any of the existing Linux solutions for multiplexing those
kinds of things and wanted to invent something new.

Inventing something new is fine if the extra code maintenance can be
justified, or if the invention is just a better solution for all users and
new code can just start using that in general.

The old solution to your problem of multiplexing devices is
allocating a virtual terminal and sending signals to coordinate
cooperatively sharing those resources.

If you want some sort of preemptive multitasking, that requires
a bit more effort, and work in the device abstractions.
You may be able to share concepts and library code but I don't believe
there is something you can just paint on top of devices and make it
happen. Certainly in the bad old days of X terminal switching the
cooperation was necessary so that when a video card was yanked from an
application writing directly to that video card the application would
need to restore the video card to a known state so the next application
would have a chance of making sense of it. Furthermore most devices
are not safe to let unprivileged users access their control registers
directly.

All of which boils down to the simple fact that for each type of device you
would like to share it is necessary to update the subsystem to support
arbitrary numbers of virtual devices that you can talk to.

The macvlan driver in the networking stack is a rough example of what I
expect you would like. Something that takes one real physical device
and turns it into N virtual devices each of which runs at effectively
full speed. Along with some kind of new master interface for
controlling when the multiplexing takes place.

I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.
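
As a toy illustration of the "one real device, N virtual devices" idea in software, the sketch below keeps a private buffer per client and copies only the active client's contents to a shared "real" buffer. It models the concept only: it is not macvlan, not a kernel or fuse interface, and the class and method names are hypothetical.

```python
class FramebufferMux:
    """Toy multiplexer: N virtual framebuffers share one 'real' buffer."""

    def __init__(self, size, clients):
        self.real = bytearray(size)
        self.virtual = {c: bytearray(size) for c in clients}
        self.active = None  # no client owns the real buffer yet

    def write(self, client, offset, data):
        """Write into the client's virtual buffer; mirror it if active."""
        buf = self.virtual[client]
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write outside framebuffer")
        buf[offset:offset + len(data)] = data
        if client == self.active:
            self.real[offset:offset + len(data)] = data

    def switch_to(self, client):
        """Master interface: make `client` the owner of the real buffer."""
        if client not in self.virtual:
            raise KeyError(client)
        self.active = client
        self.real[:] = self.virtual[client]

mux = FramebufferMux(4, ["a", "b"])
mux.write("a", 0, b"\x01\x02")
mux.write("b", 0, b"\x09")      # inactive client: real buffer untouched
mux.switch_to("a")
assert bytes(mux.real) == b"\x01\x02\x00\x00"
```

Inactive clients keep writing to their own buffers without noticing, which is the transparency property asked for earlier in the thread; the cost is the extra copy on every switch.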

Now I suspect part of doing this right will be getting proper video
drivers on Android. I assume that Android is the platform you care
about.

Eric
riya khanna
2014-09-25 00:25:09 UTC
Post by Eric W. Biederman
I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.
What kind of existing multiplexers could be used? Is there one for fb? We
have evdev abstractions for input in place already.

riya khanna
2014-09-25 15:40:10 UTC
Is there a plan or work-in-progress to add namespace tags to other
classes in sysfs similar to net? Does it make sense to add namespace
tags to kobjects?

-Riya
Post by riya khanna
Post by Eric W. Biederman
Post by Riya Khanna
Post by Eric W. Biederman
Post by Serge Hallyn
Isolation is provided by the devices cgroup. You want something more
than isolation.
Post by riya khanna
My use case for having device namespaces is device isolation. Isn't what
namespaces are there for (as I understand)?
Namespaces fundamentally provide for using the same ``global'' name
in different contexts. This allows them to be used for isolation
and process migration (because you can take the same name from
machine to machine).
Unless someone cares about device numbers at a namespace level
the work is done.
The mount namespace provides exsits to deal with file names.
The devices cgroup will limit which devices you can access (although
I can't ever imagine a case where the mout namespace would be
insufficient).
Post by Serge Hallyn
Post by riya khanna
Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility.
I don't see how. If they are mounts that propogate into the container
and are controlled from outside you can do whatever you want. (I am
imagining device by device bind mounts here). It should be trivial
to have a a directory tree that propogates into a container and works.
Device-by-device bind mounts can grant/revoke access to real
individual devices as and when needed. However, revoking the access to
real devices could break the applications if there’s no transparent
mechanism to back up the propagated (but now revoked) device bind
mounts that could fool the apps into believing that they are working
with real devices. Frame buffer is one such example, where safe
multiplexing could be applied.
Post by Eric W. Biederman
Post by Serge Hallyn
Post by riya khanna
I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
Is the problem you actually care about multiplexing devices?
The problem I care about is access to real devices, such as input, fb,
loop, etc. as and when needed, thereby having native I/O performance -
either through secure multiplexing or exclusive ownership, whatever
makes sense according to the device type.
I guess policy-based multiplexing (or exclusive ownership) is the
usage. What kind of devices (loop, fb, etc.) this is needed for
depends on the usage. If there are multiple FBs, then each container
could potentially own one. One may want to provide exclusive ownership
of input devices to one container at a time to avoid information
leakage. Like we saw at LPC last year, this applies to sensors (gps,
accelerometer, etc.) on mobile devices as well.
Allowing multiplexing of those devices seems reasonable.
Where the discussion ran into problems last time was that people did not
want to use any of the existing linux solutions for multiplexing those
kinds of things and wanted to invent something new.
Inventing something new is fine if the extra code maintenance can be
justified, or if the invention is just a better solution for all users and
new code can just start using that in general.
The old solution to your problem of multiplexing devices is
allocating a virtual terminal and sending signals to coordinate
cooperatively sharing those resources.
If you want some sort of preemptive multitasking, that requires
a bit more effort, and work in the device abstractions.
You may be able to share concepts and library code but I don't believe
there is something you can just paint on top of devices and make it
happen. Certainly in the bad old days of X terminal switching the
cooperation was necessary so that when a video card was yanked from an
application writing directly to that video card the application would
need to restore the video card to a known state so the next application
would have a chance of making sense of it. Furthermore most devices
are not safe to let unprivileged users access their control registers
directly.
All of which boils down to the simple fact that for each type of device you
would like to share it is necessary to update the subsystem to support
arbitrary numbers of virtual devices that you can talk to.
The macvlan driver in the networking stack is a rough example of what I
expect you would like. Something that takes one real physical device
and turns it into N virtual devices each of which runs at effectively
full speed. Along with some kind of new master interface for
controlling when the multiplexing takes place.
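The shape described above (one real device, N virtual devices, plus a master interface that decides when the multiplexing takes place) can be modelled with a toy userspace sketch. This is only an illustration of the idea, not macvlan and not kernel code; the class `FramebufferMux` and its methods are invented for the example, and a byte array stands in for the real framebuffer.

```python
class FramebufferMux:
    """Toy model: N virtual framebuffers sharing one 'real' buffer."""

    def __init__(self, size, n_virtual):
        self.real = bytearray(size)  # stands in for the hardware buffer
        self.virtual = [bytearray(size) for _ in range(n_virtual)]
        self.active = 0  # index of the virtual device currently shown

    def write(self, idx, offset, data):
        """Write to virtual device idx; reaches the real buffer only if active."""
        buf = self.virtual[idx]
        end = offset + len(data)
        if offset < 0 or end > len(buf):
            raise ValueError("write outside the framebuffer")
        buf[offset:end] = data
        if idx == self.active:
            self.real[offset:end] = data

    def switch(self, idx):
        """Master interface: make virtual device idx the visible one."""
        if not 0 <= idx < len(self.virtual):
            raise IndexError("no such virtual device")
        self.active = idx
        self.real[:] = self.virtual[idx]  # restore its saved contents
```

Inactive clients keep writing into their own shadow buffer without noticing, which is the "transparent" behaviour asked for earlier in the thread; the cost is one full copy on every switch.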
I think we do most of this in software today and arguably for a lot of
devices the overhead is small enough that a software solution is fine.
So perhaps all you need is a fuse interface to the existing software
multiplexers so that weird legacy code can be made to run.
What kind of existing multiplexers could be used? Is there one for fb? We
have evdev abstractions for input in place already.
Post by Eric W. Biederman
Now I suspect part of doing this right will be getting proper video
drivers on Android. I assume that Android is the platform you care
about.
Eric
Eric W. Biederman
2014-09-25 18:09:43 UTC
Post by riya khanna
Is there a plan or work-in-progress to add namespace tags to other
classes in sysfs similar to net? Does it make sense to add namespace
tags to kobjects?
Currently there is a general nack from gregkh on such work.

Given that sysfs is almost never a fast path I suspect it makes most
sense to filter sysfs in some way (aka bind mounts or fuse) and present
the results to the container.
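A minimal sketch of the filtering step, assuming the container manager already knows which device names the container may see. It only decides which `/sys/class/<class>/<device>` entries would be exposed; actually presenting them (by bind mounts or a fuse filesystem) is left out. The function name and the allow-list are assumptions made for this example.

```python
import posixpath


def visible_sysfs_entries(entries, allowed_devices):
    """Keep only the sysfs entries whose last path component is allowed."""
    return [path for path in entries
            if posixpath.basename(path.rstrip("/")) in allowed_devices]
```

For example, with entries for `fb0`, `fb1` and `event3` and an allow-list containing only `fb0`, just the `fb0` entry is kept.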

At the point where this is something we are using a lot, its usefulness
has been demonstrated, and a kernel-level solution appears to be better,
it would be worth reopening the discussion.

Eric
Eric W. Biederman
2014-09-25 18:21:50 UTC
What kind of existing multiplexers could be used? Is there one for fb? We have
evdev abstractions for input in place already.
We have X and Wayland/Weston and pulse audio and doubtless more that I
am not aware of.

For video a lot of work is going into compositing and handling
multiple contexts in the hardware so there may already be support in the
kernel.

Fundamentally these are all pieces of hardware that we allow multiple
userspace applications to read from or modify.
Therefore there is existing multiplexing somewhere.

I won't claim all of the existing multiplexing methods are good and
should be used as is, but they definitely should be used as a starting
point.
Riya Khanna
2014-09-24 19:07:31 UTC
I guess policy-based multiplexing (or exclusive ownership) is the usage. What kind of devices (loop, fb, etc.) this is needed for depends on the usage. If there are multiple FBs, then each container could potentially own one. One may want to provide exclusive ownership of input devices to one container at a time to avoid information leakage. Like we saw at LPC last year, this applies to sensors (gps, accelerometer, etc.) on mobile devices as well.
Post by Serge Hallyn
Isolation is provided by the devices cgroup. You want something more
than isolation.
Post by riya khanna
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)? Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility. I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.
Post by riya khanna
(Please pardon multiple emails, artifact of merging all separate
conversations)
Thanks for your feedback!
Letting the kernel know about what devices a container could access
(based on device cgroups) and having devtmpfs in the kernel create device
nodes for a container that map to corresponding CUSE nodes is what I
thought of. For example, "echo 29:0 > /proc/<pid>/devices" would prepare a
virtual framebuffer (based on real fb0 SCREENINFO properties) for this
process provided permissions allow this operation. To view the framebuffer,
the CUSE based virtual device would talk to the actual hardware. Since
namespaces would have different view of the underlying devices, "sysfs" has
to be made aware of this as well.
Please let me know your inputs. Thanks again!
The solution hugely depends on what you are trying to do with it.
The situation today is that device nodes are slowly fading out. In
another 20 years linux may not have any device nodes at all.
Therefore the question becomes what are you trying to support.
If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.
There are a few cases where it is desirable to emulate what devpts
does for allowing arbitrary users to create virtual devices in the
kernel. Loop devices in particular.
Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).
The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire I think it is entirely reasonable to add multiplexing support to
the kernel device type by device type, or potentially just use fuse or
cuse to implement your multiplexer in userspace, but that has the
potential to be unusably slow.
Eric
Serge Hallyn
2014-09-24 16:38:16 UTC
Post by Eric W. Biederman
(Please pardon multiple emails, artifact of merging all separate conversations)
Thanks for your feedback!
Letting the kernel know about what devices a container could access (based on
device cgroups) and having devtmpfs in the kernel create device nodes for a
container that map to corresponding CUSE nodes is what I thought of. For
example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual framebuffer
(based on real fb0 SCREENINFO properties) for this process provided permissions
allow this operation. To view the framebuffer, the CUSE based virtual device
would talk to the actual hardware. Since namespaces would have different view of
the underlying devices, "sysfs" has to be made aware of this as well.
Please let me know your inputs. Thanks again!
The solution hugely depends on what you are trying to do with it.
The situation today is that device nodes are slowly fading out. In
another 20 years linux may not have any device nodes at all.
Therefore the question becomes what are you trying to support.
If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.
There are a few cases where it is desirable to emulate what devpts
does for allowing arbitrary users to create virtual devices in the
kernel. Loop devices in particular.
Ultimately given the existence of device hotplug I don't see any call
for being able to create device nodes with well known device numbers
(fundamentally what a device namespace would be about).
The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire I think it is entirely reasonable to add multiplexing support to
the kernel device type by device type, or potentially just use fuse or
cuse to implement your multiplexer in userspace, but that has the
potential to be unusably slow.
It would be helpful to have a list of devices that may want that
multiplexing. Is it really just loop and graphics drivers?