Interesting. What are the legitimate use cases to not treat /proc as readonly, and what are the legitimate use cases for mounting, and especially bind-mounting, random filesystems inside /proc?
Like, my first impulse is "Why do we allow this?" And I guess, sure, the answer is "root is allowed to do this, because root is never not allowed". And sure, I very much dislike my computer telling me "Nay, I cannot do that", which is why I no longer have Windows at home.
But there is some stuff that seemingly doesn't have any legitimate use case on a server. And even if protections from that stuff keep me from fixing some situations, I can still nuke and rebuild it in an hour or so.
> What are the legitimate use cases to not treat /proc as readonly,
Only some parts of proc are "read only." /proc/sys is filled with writable controls.
> my first impulse is "Why do we allow this?"
The user is allowed to do whatever they like with their machine. It's the reason I use Linux. It never puts me in a position where "system policies" or other default "security theater" nonsense disadvantages me on my own hardware.
If you're that concerned you can easily add a policy framework, like SELinux, or others, which would prevent this from happening or raise an exception if it does.
> that seemingly doesn't have any legitimate use case on a server.
There are dozens of other ways to achieve this same effect that rely on mechanisms that have legitimate use cases. In particular if you are root you will not struggle to find ways to hide processes. In this case you can just observe "/proc/mounts" to see that something perfidious is occurring.
> I can still nuke and rebuild it in an hour or so.
As long as there is no important data at rest within the server. This isn't always the case.
From systemd.exec(5):
> ProcSubset=
> Takes one of "all" (the default) and "pid". If "pid", all files and directories not directly associated with process management and introspection are made invisible in the /proc/ file system configured for the unit's processes. This controls the "subset=" mount option of the "procfs" instance for the unit. For further details see The /proc Filesystem². Note that Linux exposes various kernel APIs via /proc/, which are made unavailable with this setting. Since these APIs are used frequently this option is useful only in a few, specific cases, and is not suitable for most non-trivial programs.
2. The /proc Filesystem: <https://docs.kernel.org/filesystems/proc.html#mount-options>
I can answer the writing to /proc one. It is sometimes useful to hotpatch running programs with /proc/pid/mem.
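To make the mechanism concrete, here is a minimal sketch, assuming Linux and a process that is allowed to open its own /proc/self/mem for writing. It only overwrites bytes in a buffer the process itself owns, so it shows the write path rather than real hotpatching. The helper name poke_own_memory is made up for illustration; patching another process would go through /proc/<pid>/mem and is subject to a ptrace-style permission check on the target.

```python
import ctypes


def poke_own_memory(buf, new_bytes):
    """Overwrite the start of a ctypes buffer by writing through /proc/self/mem.

    `buf` is any ctypes object that owns memory (e.g. create_string_buffer).
    The file offset in /proc/<pid>/mem is the virtual address in that process.
    """
    if len(new_bytes) > ctypes.sizeof(buf):
        raise ValueError("new_bytes does not fit in the target buffer")
    address = ctypes.addressof(buf)
    # Unbuffered, so the write hits the target address immediately.
    with open("/proc/self/mem", "r+b", buffering=0) as mem:
        mem.seek(address)
        mem.write(new_bytes)


if __name__ == "__main__":
    message = ctypes.create_string_buffer(b"hello")
    poke_own_memory(message, b"J")
    print(message.value)
```

The same seek-then-write pattern is what a debugger-like tool would use against another PID, which is why a switch to forbid it on hardened hosts comes up further down the thread.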
And that's what I'm getting at, and where I'd like the community to improve in discussions. In what context do you need it, and how much, and what would your alternatives be?
Because the contexts Linux is used in, and the threat levels that come with them, are vastly different.
For example, I'm aware that the industrial and embedded world does wild things at times. Because it's hard to establish redundancy and replaceability there. Because the system is attached to a $750k lathe. However, that thing is not networked, and physical access is controlled by people with guns. Do whatever you need to keep this thing running, as horrid as it may be.
On the other hand, I have a fleet of load balancers and their job is to accept traffic from all the criminals in this world, and then some legitimate users as well. I can reset them to base Linux and have them back operational in 10 minutes or so. Anything modifying loaded code in memory on these systems, outside of some very specific situations like service startup, is terrifying and entirely unnecessary.
So I would be very happy with a switch to turn that off, even though some other use cases wouldn't need it or wouldn't be able to use it at all.
/proc is still read-only; the article uses "mounts", a generic mechanism available for any area of the filesystem. This is allowed because the mounting logic lives "above" the filesystem-specific logic, and there is no specific exception for the /proc filesystem.
One can imagine having special code which checks for mounts on /proc, but this turns ugly pretty quickly. Disallow all mounts on /proc? Not going to work, the PC I am typing this comment on has a mount at /proc/sys/fs/binfmt_misc. Maybe just disallow bind mounts only? This breaks the "create chroot/container and bind-mount host fs" use case, plus an attacker can mount tmpfs anyway.
You have to design some sort of rules for where mounting is allowed and where not, and then ensure they are correct and up-to-date. This is a non-trivial amount of work -- and for what? A method which is super obvious to detect (mount table entries!) and can only fool beginner defenders?
Instead, Linux provides a generic "LSM" hook mechanism that can restrict any operation, including mounts. If someone thinks such mount restrictions are a good idea, they are welcome to write a kmod (kernel module) to do so, or configure one of the existing ones to reject those operations. But I expect that by the time people are knowledgeable enough to write kmods, they are knowledgeable enough to come up with a better protection against a rogue root user.
SELinux could prevent this, if that is what you wanted.
One way to detect when this has occurred is to use fstatfs() [0]: it will tell you whether an fd truly belongs to a procfs, or whether it belongs to some other kind of filesystem. Of course, you still have the issue of one part of a procfs being bind-mounted onto another part of a procfs (or onto a different procfs). For that, one mitigation is to double-check the st_dev of every file down the chain. But it still doesn't guarantee that you haven't been led into a recursive bind-mount: for that you have to use tricks like checking rename() error codes, or some of the fancy new openat2() flags, to test the absence of a mount-point boundary.
Personally, my stance is to just blindly trust whatever /proc tells you, except in really unusual security models. Users can only end up hurting themselves by putting nonsensical things in there. (The same way no one should bother checking that /dev/null isn't a regular file, even though root can easily make it one.)
[0] https://man7.org/linux/man-pages/man2/statfs.2.html
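The two checks described above (filesystem magic via fstatfs(), then st_dev along the path) could look roughly like the sketch below. It assumes Linux on a common ABI such as x86-64 or aarch64, where f_type is the first, machine-word-sized field of struct statfs; the helper names fs_magic, is_procfs and same_device_chain are made up for illustration. It is not race-free and does nothing about the recursive bind-mount case, which still needs the openat2()-style tricks mentioned above.

```python
import ctypes
import os

PROC_SUPER_MAGIC = 0x9FA0  # value from <linux/magic.h>

_libc = ctypes.CDLL(None, use_errno=True)
_libc.fstatfs.argtypes = [ctypes.c_int, ctypes.c_void_p]
_libc.fstatfs.restype = ctypes.c_int


def fs_magic(fd):
    """Return the f_type magic that fstatfs(2) reports for an open fd."""
    # Assumption: f_type is the first field and a machine word. The buffer
    # is deliberately larger than struct statfs on any Linux ABI.
    buf = ctypes.create_string_buffer(512)
    if _libc.fstatfs(fd, buf) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Mask to 32 bits so magics with the high bit set compare cleanly.
    return ctypes.c_long.from_buffer(buf).value & 0xFFFFFFFF


def is_procfs(path):
    """True if `path` lives on a filesystem whose magic is procfs."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return fs_magic(fd) == PROC_SUPER_MAGIC
    finally:
        os.close(fd)


def same_device_chain(root, relative):
    """True if every component of root/relative shares root's st_dev.

    A different st_dev somewhere along the path means another filesystem
    (or another procfs instance) has been mounted over that component.
    """
    expected = os.stat(root).st_dev
    current = root
    for part in relative.split("/"):
        if not part:
            continue
        current = os.path.join(current, part)
        if os.lstat(current).st_dev != expected:
            return False
    return True


if __name__ == "__main__":
    print("procfs:", is_procfs("/proc"))
    print("same device:", same_device_chain("/proc", "self/status"))
```

Note that bind-mounting one part of the same procfs over another passes both checks, which is exactly the gap pointed out above.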
The Unix philosophy is "everything is a file". But maintaining that abstraction indirectly leads to edge cases like this where devs may be thinking in terms of a FILESYSTEM but some of the files are, in reality, a SYSCALL API/RPC INTERFACE. This makes every new filesystem feature a potential security risk. Is it worth the abstraction? I think so.
Taking it one step further:
Concealing Namespaces Within a File Descriptor | https://tmpout.sh/3/06.html
One of the solutions listed for discovering this is to inspect /proc/mounts and look for these types of mounts. Couldn't you use the same trick on /proc/mounts itself?
/proc/mounts is a symlink to /proc/self/mounts. And /proc/self is a symlink to /proc/$PID. Those symlinks could be trivially bind-mounted, but you don't have to use them.
Instead, you can do something like "cat /proc/$$/mounts" - that will check the mounts file directly, from a location like /proc/14102/mounts. That location is much harder to bind-mount over, as there are many processes and process IDs are hard to predict.
That said, to bind-mount, malware must have root; and if malware has root, it can do much better than playing silly tricks with bind-mounting over /proc - it can load a kernel rootkit to hide itself very effectively. That's why all investigation of potentially compromised machines must be done from a known-clean OS.
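The same idea as a small script, as a rough sketch: read the mount table through the numeric PID path rather than the symlinks, and list everything mounted below /proc. The function names and the one-entry allowlist are assumptions for illustration (binfmt_misc is the legitimate example mentioned elsewhere in this thread); inside containers the runtime often masks further /proc paths, so expect more entries there.

```python
import os
import re

# Assumption: mounts under /proc that are considered normal on this host.
ALLOWED_PROC_MOUNTS = ("/proc/sys/fs/binfmt_misc",)


def _unescape(field):
    """Decode the octal escapes (e.g. \\040 for space) used in mount tables."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def mounts_under_proc(mounts_text, allowed=ALLOWED_PROC_MOUNTS):
    """Return (mount_point, fstype) for every mount strictly below /proc."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a well-formed "device mountpoint fstype ..." line
        mount_point = _unescape(fields[1])
        fstype = fields[2]
        if mount_point.startswith("/proc/") and mount_point not in allowed:
            found.append((mount_point, fstype))
    return found


def read_own_mounts():
    """Read the mount table via /proc/<pid>/mounts, bypassing the symlinks."""
    with open(f"/proc/{os.getpid()}/mounts") as handle:
        return handle.read()


if __name__ == "__main__":
    for mount_point, fstype in mounts_under_proc(read_own_mounts()):
        print(f"mounted over /proc: {mount_point} ({fstype})")
```

A directory bind-mounted over /proc/<some pid> shows up here with the filesystem type of wherever it came from (ext4, tmpfs, ...), which is what gives the trick away.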
I had the same thought, but I suspect that would break the mount. It might make the entire /proc/mounts directory appear empty but none of the mounts would work.
> Couldn't you use the same trick on /proc/mounts itself?
Unfortunately, yes. 'bind' allows mount to bind over individual files and not just directories. Which is a little bit insane.
In any case, if you are suspicious, 'umount /proc/mounts'. It either returns ok or it returns "umount: /proc/mounts: not mounted."
Could you use a union filesystem like overlayfs (overlay2) for more advanced PID hiding?