Linux Kernel PWN | 02 CVE-2009-1897

Introduction

Last week, Prof. Yap assigned us an ungraded experiment in the CS5439 course on the security implications of compiler optimizations. The task was to build a binary with different compiler optimization options and observe the results.

On my Ubuntu 20.04 (64-bit) machine, I compiled the program at each level from -O0 to -O3:

gcc -O0 -o prog-o0 L3-util.c L3-main.c
gcc -O1 -o prog-o1 L3-util.c L3-main.c
gcc -O2 -o prog-o2 L3-util.c L3-main.c
gcc -O3 -o prog-o3 L3-util.c L3-main.c

The outputs, in the same order:

➜  lab3 ls
L3-main.c  L3-util.c  prog-o0  prog-o1  prog-o2  prog-o3
➜  lab3 ./prog-o0
3
[1]    2687 segmentation fault (core dumped)  ./prog-o0
➜  lab3 ./prog-o1
3
NULL deref
➜  lab3 ./prog-o2
3
A = 42
➜  lab3 ./prog-o3
3
A = 42

It is apparent that different optimizations result in different runtime outcomes.

Analysis of The Weird Results

What have we observed?

In the -O0 case, the process received SIGSEGV because it dereferenced a NULL pointer, as expected.
In the -O1 case, x = p->b[1] caused no problem, but the if (!p) check fired and oops() was invoked.
In the -O2 and -O3 cases, neither a segmentation fault nor oops() occurred. The program ran and exited happily.

The following command could be used to disassemble one specific function:

gdb -batch -ex 'file ./prog' -ex 'disassemble f'

With this command we disassembled all f() functions from prog-o0 to prog-o3:

The disassembly explains all of these surprising results. -O1 removed x = p->b[1], while -O2 and -O3 kept only the last line of f(), *p = 42, removing both the unused variable x and the NULL pointer check if (!p). The check can go because p has already been dereferenced before it, so the compiler is allowed to assume that p is not NULL.

Lastly, if we add the -fno-delete-null-pointer-checks option when compiling, the -O2 and -O3 builds behave the same as the -O1 build.
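The lab sources (L3-util.c and L3-main.c) are not reproduced in this post, so the following is only my guess at the shape of f(), rebuilt from the statements quoted above. The struct layout and the names struct rec, f_deref_first and demo_deref_first are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical layout; the lab's real struct is not shown in this post. */
struct rec {
    int a;
    int b[2];
};

/* Dereference first, check second. If p were NULL, the load below would be
 * undefined behaviour, so an optimizing compiler may assume p != NULL from
 * that point on and drop the check as dead code. */
int f_deref_first(struct rec *p)
{
    int x = p->b[1];   /* unused load: may itself be optimized away */
    (void)x;

    if (!p)            /* candidate for removal at -O2 and above */
        return -1;

    p->a = 42;
    return 0;
}

/* Only exercises the well-defined path: a valid, non-NULL pointer. */
void demo_deref_first(void)
{
    struct rec r = { 0, { 1, 2 } };

    assert(f_deref_first(&r) == 0);
    assert(r.a == 42);
}
```

Building this at -O0 and -O2, with and without -fno-delete-null-pointer-checks, and comparing the disassembly of f_deref_first should show whether the check survives on a given compiler.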

From Experiment to Real Vulnerability

In the same experiment, Prof. Yap pointed out that it is an abstraction and simplification of CVE-2009-1897, a vulnerability in a Linux kernel driver that allows local privilege escalation. After some searching, I found an exploit by Bradley Spengler. Most interestingly, Spengler wrote extensive comments criticizing Linus and other kernel developers for neglecting the potential impact of vulnerabilities and for fixing them silently.

What’s more, Spengler quoted a paragraph from The Childhood of a Leader by Jean-Paul Sartre at the end of his exploit. Honestly, I had rarely read an exploit like this before! It motivated me to reproduce CVE-2009-1897 and write this post.

I recommend reading Spengler’s exploit, from which we can learn a lot about the state of Linux kernel security 13 years ago. I have quoted some of his comments in Appendix A.

Technical Details of CVE-2009-1897

CVE-2009-1897 is a vulnerability in drivers/net/tun.c, which affects kernel versions between 2.6.30-rc1 and 2.6.31-rc3. You can find the vulnerable function tun_chr_poll here. Let’s go deep into the code details.

If we open the /dev/net/tun device, the kernel calls tun_chr_open, which allocates a tfile and sets tfile->tun to NULL:

static int tun_chr_open(struct inode *inode, struct file * file) {
	struct tun_file *tfile;
	// ...
	tfile = kmalloc(sizeof(*tfile), GFP_KERNEL);
	// ...
	tfile->tun = NULL;
	// ...
}

Now, what if we poll on /dev/net/tun immediately after opening it? Similarly, the tun_chr_poll function will be invoked:

static unsigned int tun_chr_poll(struct file *file, poll_table * wait) {
	struct tun_file *tfile = file->private_data;
	struct tun_struct *tun = __tun_get(tfile);
	struct sock *sk = tun->sk;
	unsigned int mask = 0;
	if (!tun)
		return POLLERR;
	// ...
	if (sock_writeable(sk) ||
	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
	     sock_writeable(sk)))
		mask |= POLLOUT | POLLWRNORM;
	// ...
}

Because tfile->tun is still NULL, the initialization sk = tun->sk dereferences NULL, which causes a crash. So far the dereference looks like a plain bug. However, just as GCC did in our small experiment in the introduction, it removes the NULL pointer check if (!tun) when the kernel is compiled. Under this condition, if the attacker can map and control the memory starting at address zero (NULL), the crash is avoided and tun_chr_poll does not return POLLERR.

The patch is simple; it just moves the assignment after the NULL pointer check:

-	struct sock *sk = tun->sk;
+	struct sock *sk;
 	unsigned int mask = 0;
 	if (!tun)
 		return POLLERR;
+	sk = tun->sk;
+
 	DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);

The -fno-delete-null-pointer-checks option was later added to the kernel’s Makefile as well.
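To make the effect of the reordering concrete, here is a small userland model of the patched control flow. It is only a sketch: struct sock_model, struct tun_model, poll_model and the placeholder condition on sndbuf are names and logic I made up, not kernel code. Only POLLERR and POLLOUT are real constants, taken from <poll.h>.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Made-up stand-ins for struct sock and struct tun_struct. */
struct sock_model { int sndbuf; };
struct tun_model  { struct sock_model *sk; };

/* Patched ordering: tun is only dereferenced after the NULL check, so the
 * compiler has no earlier dereference from which to infer tun != NULL. */
unsigned int poll_model(struct tun_model *tun)
{
    struct sock_model *sk;
    unsigned int mask = 0;

    if (!tun)
        return (unsigned int)POLLERR;

    sk = tun->sk;
    if (sk && sk->sndbuf > 0)   /* arbitrary placeholder condition */
        mask |= (unsigned int)POLLOUT;
    return mask;
}

void demo_poll_model(void)
{
    struct sock_model sk = { 1 };
    struct tun_model tun = { &sk };

    assert(poll_model(NULL) == (unsigned int)POLLERR);
    assert(poll_model(&tun) == (unsigned int)POLLOUT);
}
```

Unlike the vulnerable version, calling poll_model(NULL) is well defined here, so the early return has to stay no matter how aggressively the compiler optimizes.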

More information about CVE-2009-1897 is available at Red Hat Bugzilla, LKML and LWN.

Spengler successfully proved that this bug was actually a vulnerability. As he said, he managed to exploit this ‘unexploitable’ vulnerability and got arbitrary code execution. We will analyse his exploit in the next section.

Analysis of Spengler’s Exploit

Before we go into details, remember that it was 2009. Exploitation conditions and mitigations were not as complex as today’s. For example, you did not have to consider KASLR, KPTI or SMEP/SMAP, and ret2usr still worked. What’s more, unprivileged processes could read symbol addresses from /proc/kallsyms.

We have learnt that there is a NULL dereference bug in tun_chr_poll. But in the 2.6.30 kernel mmap_min_addr is set to 65536 by default, so an unprivileged process cannot map memory at the NULL address. Still, assuming that we are able to map memory at NULL, how can we exploit this vulnerability?

The key idea is to fake a file operation that the tun device driver does not implement, and then invoke the corresponding system call on it from userland. As the source code shows, the tun driver does not implement the mmap file operation, so the mmap entry of tun’s file_operations (index 11, counting from zero; we will call it tun_mmap_fop) is NULL, which is exactly the address mapped into our exploit process.

However, NULL is not enough. If tun_mmap_fop is NULL, then when we mmap the opened tun device the kernel treats the NULL file operation entry as unimplemented, and the control flow never reaches our payload at NULL. How can we deal with this problem?

Well, tun_chr_poll itself helps here. Notice the test_and_set_bit operation in this function, which sets the specified bit to 1:

// prototype:
// bool test_and_set_bit(long nr, volatile unsigned long * addr);
	if (sock_writeable(sk) ||
	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
	     sock_writeable(sk)))

And SOCK_ASYNC_NOSPACE is 0. So if sock_writeable(sk) returns false, the lowest bit of tun->sk->sk_socket->flags is set to 1. If we make &tun->sk->sk_socket->flags point to tun_mmap_fop, its value changes from 0 to 1. Then, when we call mmap on /dev/net/tun, our payload at address 0x01 is invoked as the mmap file operation and runs in kernel context. Because no mitigation prevented ret2usr at that time, it is convenient to put a jmp instruction at 0x01 that jumps to our userland payload.

One last question remains: to make sock_writeable(sk) return false, sk->sk_wmem_alloc must not be less than sk->sk_sndbuf >> 1:
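The trick of flipping a NULL slot into a non-NULL one can be modelled in userland. This is a sketch under assumptions: test_and_set_bit_model imitates the kernel helper with a GCC/Clang atomic builtin, and mmap_dispatch_model is a made-up function that only mirrors the "NULL means unimplemented" decision (the kernel answers an mmap on a file without an mmap operation with ENODEV). Nothing here touches a real file_operations table or calls a handler.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Userland model of test_and_set_bit(): atomically set bit nr of the
 * bitmap at addr and return the bit's previous value (GCC/Clang builtin). */
int test_and_set_bit_model(long nr, volatile unsigned long *addr)
{
    const long bits = (long)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 1UL << (nr % bits);
    unsigned long old = __atomic_fetch_or(&addr[nr / bits], mask,
                                          __ATOMIC_SEQ_CST);
    return (old & mask) != 0;
}

/* Model of the dispatch decision only: a zero slot means "not implemented".
 * The handler is never actually called here. */
int mmap_dispatch_model(unsigned long slot)
{
    return slot == 0 ? -ENODEV : 0;
}

void demo_slot_flip(void)
{
    unsigned long slot = 0;   /* stands in for tun's NULL mmap entry */

    assert(mmap_dispatch_model(slot) == -ENODEV);
    assert(test_and_set_bit_model(0, &slot) == 0);  /* bit was clear */
    assert(slot == 1);                              /* NULL became 0x1 */
    assert(mmap_dispatch_model(slot) == 0);         /* now looks implemented */
}
```

The point of the model is that a single set bit is enough: the slot no longer compares equal to NULL, and its new value, 0x1, still lies inside the page the attacker mapped at zero.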

static inline int sock_writeable(const struct sock *sk) {
	return atomic_read(&sk->sk_wmem_alloc) < (sk->sk_sndbuf >> 1);
}

Since tfile->tun is still NULL after open, tun->sk->sk_wmem_alloc and tun->sk->sk_sndbuf are both located in the memory starting at NULL, which we control. For convenience we set both of them to 0, which makes sock_writeable(sk) return false.

Now focus on tun->sk->sk_socket->flags. tun->sk->sk_socket also lies in memory we control, and flags is at offset 8 in the socket structure in our environment, so we set tun->sk->sk_socket to tun_mmap_fop - offset_of_flags.

In short, tun_chr_poll sets tun_mmap_fop to 1, so that the mmap file operation of the tun device takes effect.

You can find Spengler’s exploit here. His approach is as follows:
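A quick userland check of that comparison, as a sketch: struct sock_model and sock_writeable_model are made-up stand-ins that keep only the two fields involved and use plain ints instead of the kernel's atomic_t.

```c
#include <assert.h>

/* Made-up stand-in that keeps only the two fields sock_writeable() reads. */
struct sock_model {
    int sk_wmem_alloc;
    int sk_sndbuf;
};

/* Same comparison as the kernel helper quoted above. */
int sock_writeable_model(const struct sock_model *sk)
{
    return sk->sk_wmem_alloc < (sk->sk_sndbuf >> 1);
}

void demo_sock_writeable(void)
{
    struct sock_model zeroed = { 0, 0 };      /* what the fake sock holds */
    struct sock_model normal = { 0, 4096 };   /* arbitrary healthy values */

    assert(!sock_writeable_model(&zeroed));   /* 0 < 0 is false */
    assert(sock_writeable_model(&normal));    /* 0 < 2048 is true */
}
```

With both fields zeroed, the first operand of the || in tun_chr_poll is false, so evaluation continues into the test_and_set_bit call.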
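The address arithmetic can be written out as a small worked example. It is a sketch under assumptions: pointers are 8 bytes (x86-64), the mmap entry sits at zero-based index 11 of file_operations, and flags is at offset 8, as stated for this environment. The tun_fops value is the one printed by the exploit run later in this post; the function and macro names are mine.

```c
#include <assert.h>
#include <stdint.h>

#define MMAP_SLOT_INDEX 11u  /* assumed zero-based index of .mmap */
#define POINTER_SIZE     8u  /* x86-64 */
#define FLAGS_OFFSET     8u  /* offset of flags in struct socket, per the post */

/* Address of the mmap entry inside the tun_fops table. */
uint64_t mmap_slot_addr(uint64_t tun_fops)
{
    return tun_fops + (uint64_t)MMAP_SLOT_INDEX * POINTER_SIZE;
}

/* Value to store in the fake sk->sk_socket so that
 * &sk->sk_socket->flags lands exactly on the mmap entry. */
uint64_t fake_sk_socket(uint64_t tun_fops)
{
    return mmap_slot_addr(tun_fops) - FLAGS_OFFSET;
}

void demo_addresses(void)
{
    const uint64_t tun_fops = 0xffffffffa01c8340ULL;  /* from the run below */

    assert(mmap_slot_addr(tun_fops) == 0xffffffffa01c8398ULL);
    assert(fake_sk_socket(tun_fops) == 0xffffffffa01c8390ULL);
    assert(fake_sk_socket(tun_fops) + FLAGS_OFFSET == mmap_slot_addr(tun_fops));
}
```

The computed slot address matches the line "*0xffffffffa01c8398 |= 1" in the exploit output, which is a reassuring cross-check of the assumed index and pointer size.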

Somehow bypass the mmap_min_addr limitation and map the memory starting at NULL.
Get the needed kernel symbol addresses from /proc/kallsyms, including commit_creds, init_cred, tun_fops and other necessary symbols.
Craft a malicious sock structure at NULL.
open and poll the /dev/net/tun device to increase the value of tun’s mmap file operation pointer by 1.
Put a jmp QWORD PTR own_the_kernel instruction at address 0x01.
mmap /dev/net/tun to trigger the whole exploit.
The fake tun_mmap_fop file operation jumps to the own_the_kernel function.
The own_the_kernel function calls commit_creds(init_cred) to escalate to root privilege and returns.
The rest of the userland exploit spawns a shell.

After this manipulation, the layout of the related memory regions and structures looks like the figure below:
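Of these steps, the symbol lookup is the easiest to illustrate safely. The sketch below is not taken from Spengler's code: match_kallsyms_line and resolve_symbol are hypothetical helpers that parse the usual "address type name" line format of /proc/kallsyms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one kallsyms-style line: "<hex address> <type> <name> [module]".
 * Returns 1 and stores the address if the symbol name matches, else 0. */
int match_kallsyms_line(const char *line, const char *name,
                        unsigned long long *addr)
{
    unsigned long long a;
    char type;
    char sym[256];

    if (sscanf(line, "%llx %c %255s", &a, &type, sym) != 3)
        return 0;
    if (strcmp(sym, name) != 0)
        return 0;
    *addr = a;
    return 1;
}

/* Scan a kallsyms-formatted file for name; returns 0 if it is not found
 * or the file cannot be opened. */
unsigned long long resolve_symbol(const char *path, const char *name)
{
    char line[512];
    unsigned long long addr = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f))
        if (match_kallsyms_line(line, name, &addr))
            break;
    fclose(f);
    return addr;
}

void demo_kallsyms(void)
{
    unsigned long long a = 0;

    /* Sample line: the address is from the run in this post, the type
     * letter is a guess. */
    assert(match_kallsyms_line("ffffffff8107f090 T commit_creds\n",
                               "commit_creds", &a));
    assert(a == 0xffffffff8107f090ULL);
}
```

On the 2009 kernel used here, resolve_symbol("/proc/kallsyms", "commit_creds") would return a usable address for any user. Current kernels usually show zeroed addresses to unprivileged readers, so the same call typically yields 0 there.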

Finally, mmap the tun device to run our exploit in kernel context:

fd = open("/dev/net/tun", O_RDONLY);
mem = mmap(NULL, 0x1000, PROT_READ, MAP_PRIVATE, fd, 0);

P.S. Spengler utilized two techniques to bypass the limitation of mmap_min_addr. This post will not discuss these techniques. Two related web pages are recommended:

Reproduction of CVE-2009-1897

Now that we have learnt about the vulnerability and its exploitation, let’s prepare a vulnerable environment and exploit this vulnerability!

Preparation of Environment

Given that this vulnerability is 13 years old and it is hard to find a Vagrant box or Linux distribution image for it, I decided to download Ubuntu 9.10 (64-bit, server) and create a virtual machine from it. Its kernel version is 2.6.31, and I found that this vulnerability had already been patched there.

It is hard to find a 2.6.30 kernel as well, so my plan was to download the source code of the current kernel, revert the fix in drivers/net/tun.c, comment out the -fno-delete-null-pointer-checks option in the Makefile, compile the source code and replace the current kernel with my new vulnerable kernel.

You can find apt sources for Ubuntu 9.10 here, which I gathered from StackExchange. I followed one reply from StackExchange to recompile the kernel:

# get source code
apt-get source linux-image-$(uname -r)
# install dependencies
sudo apt-get build-dep linux-image-$(uname -r)

Then save the patch as cve-2009-1897.diff in the linux-2.6.31 directory and reverse it with:

cd linux-2.6.31
patch -R drivers/net/tun.c ./cve-2009-1897.diff -o drivers/net/tun.c.bak
mv drivers/net/tun.c.bak drivers/net/tun.c

Comment out -fno-delete-null-pointer-checks in KBUILD_CFLAGS.

Now we can compile the kernel:

fakeroot debian/rules clean
fakeroot debian/rules binary

After compilation, just install the new kernel, e.g.:

cd ..
dpkg -i linux-image-2.6.31-23-generic_2.6.31-23.75_amd64.deb linux-headers-2.6.31-23-generic_2.6.31-23.75_amd64.deb linux-headers-2.6.31-23_2.6.31-23.75_all.deb

Now reboot and enjoy pwning the vulnerable environment. We will use Spengler’s exploit to PWN the kernel in the next section.

PWN :)

Currently I am not interested in bypassing mmap_min_addr, so let’s just set it to 0 as root before exploiting. Besides, it seems that in this newer kernel an unprivileged user cannot read or write the /dev/net/tun device, so let’s grant read/write permission to the unprivileged user rambo. The commands are:

# disable mmap_min_addr
echo 0 > /proc/sys/vm/mmap_min_addr
# add R/W priv for others
chmod o+r,o+w /dev/net/tun
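After changing mmap_min_addr, it is handy to confirm that the zero page is really mappable before running anything else. The helper below is a sketch of mine, not part of Spengler's exploit: can_map_zero_page is a made-up name, and it only tries an anonymous MAP_FIXED mapping at address 0 and undoes it again.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to map one anonymous page at address 0.
 * Returns 1 if the kernel allowed it (the page is unmapped again),
 * 0 if mmap() failed, with errno left as mmap() set it. */
int can_map_zero_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    assert(page > 0);
    p = mmap((void *)0, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    /* On success p is address 0 and compares equal to NULL, so
     * MAP_FAILED is the only valid failure test here. */
    if (p == MAP_FAILED)
        return 0;

    munmap(p, (size_t)page);
    return 1;
}
```

As the unprivileged user, I would expect this to return 1 once mmap_min_addr is 0 and 0 with the default setting, where the failure is typically reported as EPERM.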

BTW, I don’t care about LSMs, and they didn’t affect my experiment either.

Then, as an unprivileged user, compile the exploit (I modified Spengler’s to make it run in my environment) and run it:

cc -fno-stack-protector -o exploit exploit.c
./exploit

The whole exploitation process looks like this:

rambo@ubuntu:~/cve-2008-1897$ ./exploit
 [+] MAPPED ZERO PAGE!
 [+] Resolved tun_fops to 0xffffffffa01c8340
 [+] Resolved selinux_enforcing to 0xffffffff819bed88
 [+] Resolved apparmor_enabled to 0xffffffff817fb804
 [+] Resolved nf_unregister_hooks to 0xffffffff814602f0
 [+] Resolved security_ops to 0xffffffff819bd510
 [+] Resolved default_security_ops to 0xffffffff817b9560
 [+] Resolved sel_read_enforce to 0xffffffff812327f0
 [+] Resolved audit_enabled to 0xffffffff8197d324
 [+] Resolved commit_creds to 0xffffffff8107f090
 [+] Resolved init_cred to 0xffffffff817a8c20
 [+] *0xffffffffa01c8398 |= 1
 [+] b00m!
 [+] Disabled security of : LSM
 [+] Got root!
 [+] BAM! About to launch your rootshell!...but first some chit-chat...
           ,        ,
          /(_,    ,_)\
          \ _/    \_ /
          //        \\
          \\ (@)(@) //
           \'="=="='/
       ,===/        \===,
      ",===\        /===,"
      " ,==='------'===, "
       "                "
Do you know the deadliest catch?
yes
That's right! MAN is the deadliest catch of all!
WAIIIIIIIIIITTTT....do you hear it?
You hear it! You do too! It's not just me!  It's here, it's here I say!!
I must face this....
What's this? Something stirs within the beast's belly! Something unexpected....
# whoami
root

P.S. Spengler shared his exploiting demos for 32-bit OS (Fedora) and 64-bit OS (Ubuntu) on YouTube. You could watch them if you like:

Summary

This is my reproduction of CVE-2009-1897. It was an incredibly awesome trip on which I learnt a lot, not only about kernel PWN techniques but also about the security-related history of the Linux kernel.

Finally, I hope to read some of Sartre’s novels someday :)

Appendix A: Spengler’s Comments

As of the writing of this, the above fix exists for this vulnerability, but it’s unlikely to make it into any -stable release (at least, not until after this exploit is released) because as we say in Linux kernel development circles, there are no vulnerabilities, just DoS bugs and silent fixes. When noone seems to care to classify bugs properly or put any real effort into determining the impact of a vulnerability (leading to everything being called DoSes with no justification), then even the maintainers don’t know what should be included in the “stable” kernels, leaving users vulnerable and attackers with beautiful, 100% reliable vulnerabilities like this one to exploit.

It’s at these times that I take comfort in the words of security expert Linus Torvalds, who steers the good ship Linux into safer seas. As we read at http://article.gmane.org/gmane.linux.kernel/848718 he’s been blessed with the foresight to claim that “we could probably drop PAE some day,” calling upon his own insight that “I’d also not ever enable it on any machine I have. PAE does add overhead, and the NX bit isn’t that important to me.”

……

So much sadness! And it’s funny, none of them emailed me to ask anything. Probably because they choose to operate in secrecy, depending upon the spies (it’s not cool or ethical to spy, guys :P) they have in some public channels I frequent (which is the only way they found out about the videos – I hadn’t posted links to them anywhere else). I’m amazed they come to the conclusions they came to, as if I didn’t just release a very similar exploit in 2007 that attacked the kernel via a null ptr dereference and then disabled SELinux & LSM. The fact that pulseaudio was used and was discussed publicly in Julien’s blog regarding its use for bypassing mmap_min_addr “protection” surely didn’t ring any bells. The fact that I’m throwing around kernel addresses and suggesting that at least one of the addresses is being written to clearly shows this is not a kernel problem at all – good call Linus. I’m glad this arm-chair security expert discussion goes on in private; it’d be pretty embarrassing if it were public :)

……

PS (to vendorsec, etc): though you will never thank me (or sgrakkyu, or Julien), you’re welcome for all this free security research, which could have been sold in private instead. The industry isn’t what it was like in 2000, people don’t publish things anymore: they make money off them. Not seeing exploits published doesn’t mean you’re doing a good job anymore. Have you noticed that the complex exploits that have been released are released unpossibly quickly after the vulnerability is finally fixed? There’s a reason for that. If the vulnerable code in this case had happened to have gone into the 2.6.29 kernel instead of 2.6.30 (which won’t be used for vendor kernels) I likely wouldn’t have published. I have no use for exploits, but a good laugh is only worth so much. My suggestion to you is to hire a couple of sgrakkyus or Juliens instead of old guys who have never written an exploit, since other than Stealth, I don’t know of anyone skilled in the industry that you actually employ. A second suggestion: as you are companies promoting open source, free software, it would be nice if the justifications for your vulnerability classifications were more transparent (or made available at all). The old game of calling everything for which a public exploit doesn’t exist a DoS for no reason is getting very tired, and it’s not fooling anyone. Third, the official policy of intentionally omitting security relevant information in modifications to the Linux codebase is a disgrace and a disservice to yourselves, to other vendors, and to your customers. It demonstrates a lack of integrity and trust in your own products, though I know you have no intention of changing this policy as you’re currently enjoying a reputation for security that is ill-gotten and has no basis in reality. Truthfully representing the seriousness of vulnerabilities in your software would tarnish that image, and that’s not good for business. 
You’re praised when you cover things up, and yet Microsoft is the one with the bad reputation. If you were to follow the suggestions above, then maybe your security wouldn’t be the laughing stock of the industry.