Librelp buffer overflow fix (CVE-2018-1000140) - a collaboration between Adiscon and Semmle

This is a joint blog post, from Adiscon and Semmle, about the finding and fixing of CVE-2018-1000140, a security vulnerability in librelp. This was a collaborative effort by:

Kevin Backhouse, Semmle, Security Researcher.
Rainer Gerhards, Adiscon, Founder and President.
Bas van Schaik, Semmle, Head of Product.

We have published this post on Rainer’s blog and our blog.

Bas originally found the vulnerability (using CodeQL) and Rainer fixed it. Kev developed the proof-of-concept exploit.

In this blog post, we explain the cause of the bug, which is related to a subtle gotcha in the behavior of snprintf, and how it was found by a default query. We also demonstrate a working exploit (in a docker container, so that you can safely download it and try it for yourself). As a bonus, we give a short tutorial on how to set up rsyslog with TLS for secure communication between the client and server.

Severity and mitigation

The vulnerability (CVE-2018-1000140) is in librelp versions 1.1.1 up to 1.2.14.

Librelp is a component used heavily inside rsyslog. It is also used in some other projects, but we use the term “rsyslog” as a proxy for any affected projects, as it is the prime user of librelp (to the best of our knowledge). However, if you use librelp in any project, we strongly recommend that you upgrade to version 1.2.15 or newer.

The vulnerability only affects rsyslog installations that use TLS for secure RELP communication between the client and the server. In its default configuration, rsyslog does not do any client-server communication at all, so the vulnerability only affects more advanced installations. Furthermore, to trigger the vulnerability, an attacker needs to create a malicious certificate that has been signed by a certificate authority that is trusted by the rsyslog server.

The ‘snprintf’ gotcha

The vulnerability is in the following block of code (lines 1195-1211 of tcp.c):

/* first search through the dNSName subject alt names */
iAltName = 0;
while(!bFoundPositiveMatch) { /* loop broken below */
  szAltNameLen = sizeof(szAltName);
  gnuRet = gnutls_x509_crt_get_subject_alt_name(cert, iAltName,
                                                szAltName, &szAltNameLen, NULL);
  if(gnuRet < 0)
    break;
  else if(gnuRet == GNUTLS_SAN_DNSNAME) {
    pThis->pEngine->dbgprint("librelp: subject alt dnsName: '%s'\n", szAltName);
    iAllNames += snprintf(allNames+iAllNames, sizeof(allNames)-iAllNames,
                          "DNSname: %s; ", szAltName);
    relpTcpChkOnePeerName(pThis, szAltName, &bFoundPositiveMatch);
    /* do NOT break, because there may be multiple dNSName's! */
  }
  ++iAltName;
}

It is caused by a subtle gotcha in the behavior of snprintf. You have to read this excerpt of the man page for snprintf quite carefully to spot the problem:

The functions snprintf() and vsnprintf() do not write more than size bytes (including the terminating null byte (‘\0’)). If the output was truncated due to this limit then the return value is the number of characters (excluding the terminating null byte) which would have been written to the final string if enough space had been available. Thus, a return value of size or more means that the output was truncated.

The Red Hat Security Blog has a great post about this, written by Florian Weimer in March 2014. The crucial point is that snprintf returns the length of the string that it would have written if the destination buffer was big enough, not the length of the string that it actually wrote. So the correct way to use snprintf is something like this:

int n = snprintf(buf, sizeof(buf), "%s", str);
if (n < 0 || (size_t)n >= sizeof(buf)) {
   printf("Buffer too small\n");
   return -1;
}
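
To see the gotcha in isolation, here is a minimal sketch. The helper name format_into is our own invention for illustration and is not part of librelp. It simply records both what snprintf returned and how many characters actually ended up in the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of librelp): formats `str` into `buf`,
 * stores the number of characters that really landed in `buf` in
 * `*stored`, and returns snprintf's own return value unchanged. */
int format_into(char *buf, size_t size, const char *str, size_t *stored)
{
    assert(buf != NULL && stored != NULL && size > 0);
    int n = snprintf(buf, size, "%s", str);
    /* snprintf always null-terminates when size > 0, so strlen is safe. */
    *stored = (n < 0) ? 0 : strlen(buf);
    return n;
}
```

With an 8-byte buffer and the 12-character string "hello, world", the return value is 12 even though only 7 characters (plus the null terminator) were stored.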

Alternatively, you could use it like this to allocate a buffer that is exactly the right size:

int n = snprintf(NULL, 0, "%s", str);
if (n < 0) {
  printf("snprintf error\n");
  return -1;
}
size_t size = (size_t)n + 1; // Add space for null terminator.
buf = malloc(size);
if (!buf) {
  printf("out of memory\n");
  return -1;
}
snprintf(buf, size, "%s", str);

This second use-case is probably the reason why snprintf is designed the way it is. However, given that the type of its size parameter is size_t, it still seems like a bizarre design choice that its return type is int, rather than size_t.

Interestingly, some codebases use their own implementations of snprintf, which don’t necessarily behave in the same way. For example, there’s one in curl and another in sqlite. Other codebases use wrapper functions to work around inconsistencies in the implementation of snprintf on different platforms, like this one in flac.

How the ‘snprintf’ gotcha can cause a buffer overflow

The cruel irony of the snprintf gotcha is that it can cause the exact buffer overflow that the programmer was diligently trying to avoid by using snprintf, rather than sprintf. This is exactly what happened on line 1205 of tcp.c:

iAllNames += snprintf(allNames+iAllNames, sizeof(allNames)-iAllNames,
                      "DNSname: %s; ", szAltName);

Suppose szAltName is long enough that the formatted string does not fit in the space that remains in allNames. Then snprintf does not write beyond the end of the buffer. However, due to the surprising behavior described above, it still returns the length of the string that it would have written if the buffer was big enough. This means that after the += operation, iAllNames can exceed sizeof(allNames). On the next iteration of the loop, sizeof(allNames)-iAllNames is computed in the unsigned type size_t, so instead of becoming negative it wraps around, and snprintf is called with an extremely large size parameter. This means that subsequent iterations of the loop write beyond the end of the buffer.
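
The wraparound is easy to reproduce on a small scale. The sketch below is not librelp code: the 16-byte buffer and the function name size_for_next_call are assumptions for illustration. It performs one append in the same style as the vulnerable loop and returns the size argument that the next snprintf call would receive. It stops there, so it never writes out of bounds itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { DEMO_CAPACITY = 16 };  /* assumed size; the real buffer is far larger */

/* Hypothetical demo function: one append, then the next size argument. */
size_t size_for_next_call(const char *name)
{
    char allNames[DEMO_CAPACITY];
    int iAllNames = 0;

    assert(name != NULL);
    /* Same bookkeeping as the vulnerable line: trust the return value. */
    iAllNames += snprintf(allNames + iAllNames,
                          sizeof(allNames) - (size_t)iAllNames,
                          "DNSname: %s; ", name);

    /* Unsigned subtraction: wraps around if iAllNames > sizeof(allNames). */
    return sizeof(allNames) - (size_t)iAllNames;
}
```

For a 1-character name, the 12-character entry fits and 4 bytes remain. For a 20-character name, the entry would need 31 characters, so the function returns (size_t)-15, which is SIZE_MAX - 14.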

As a side note, a dynamic test utilizing Undefined Behavior Sanitizer (UBSAN) would have detected that problem. Unfortunately there is no such test case inside the rsyslog test suite. That’s the usual problem with dynamic testing; it always depends on the test cases. A strength of static analysis is that such situations can be detected even if they are overlooked by developers and QA engineers. A more elaborate discussion of this topic can be found in Rainer’s blog post on the benefits of static analysis.

An interesting feature of this bug is that there is a gap in the buffer overflow. Suppose that there are 10 bytes left in the buffer and that the length of szAltName is 100 bytes. Then only the first 10 bytes of the string are written to the buffer. On the next iteration of the loop, the next string is written to a starting offset 90 bytes after the end of the buffer. An attacker can utilize this gap to only overwrite a very specific area of the stack. It also means that they can avoid overwriting the stack canary.
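
The arithmetic behind the gap can be written down directly. This is only a sketch: the function name gap_past_end is hypothetical, and the numbers in the usage note are the ones from the example in the previous paragraph.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the buffer capacity, the offset before an
 * append, and the length that snprintf reports for the entry, returns how
 * many bytes past the end of the buffer the next write would start. */
size_t gap_past_end(size_t capacity, size_t offset, size_t reported_len)
{
    assert(offset <= capacity);
    size_t next_offset = offset + reported_len;  /* effect of the += */
    return next_offset > capacity ? next_offset - capacity : 0;
}
```

With 10 bytes left in a 32768-byte buffer and a reported length of 100, gap_past_end(32768, 32758, 100) evaluates to 90.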

Finding and fixing the bug

The bug was found by Bas, after a conversation with Rainer on Twitter. Rainer had just heard about CodeQL and LGTM.com, and asked Bas if he could upload rsyslog. At the time, CodeQL’s support for C/C++ projects was still in beta, so Bas needed to upload it manually. Bas uploaded it to a non-public LGTM instance first to make sure that the results looked reasonable, and noticed this result. It was found by our “Potentially overflowing call to snprintf” query, which is designed to find this exact pattern. Bas consulted with Kev, who agreed that it looked like a genuine vulnerability. Kev then immediately contacted Rainer, so that he could fix the bug before the result went public.

Rainer fixed the bug one day after Kev reported it to him and released librelp version 1.2.15, which includes the fix, two days after that.
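
For readers who want to avoid this pattern in their own code, here is a minimal sketch of a checked append. It is not taken from the librelp patch; the function name append_name and its interface are our own assumptions for illustration. The key idea is to compare the return value of snprintf against the remaining space before adding it to the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: appends "DNSname: <name>; " to buf at *used.
 * Returns 0 on success. Returns -1 on error or truncation, in which case
 * the partial entry is dropped and *used is left unchanged. */
int append_name(char *buf, size_t cap, size_t *used, const char *name)
{
    assert(buf != NULL && used != NULL && name != NULL && *used < cap);

    size_t room = cap - *used;  /* cannot wrap: *used < cap is asserted */
    int n = snprintf(buf + *used, room, "DNSname: %s; ", name);
    if (n < 0 || (size_t)n >= room) {
        buf[*used] = '\0';      /* keep only the entries that fit */
        return -1;
    }
    *used += (size_t)n;
    return 0;
}
```

Because the offset only ever advances by the number of characters that were really written, it always stays below the capacity, so the subtraction that computes the remaining room can never wrap around.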

Using rsyslog with secure TLS communication

In the next section, we demonstrate a proof-of-concept exploit. But first, we need to explain how to set up rsyslog with TLS-secured communication between a RELP client and server, because the vulnerability is in the code that checks the certificate of the client. The instructions here are quite similar to this tutorial, except that we use openssl, rather than GnuTLS, to generate the certificates.

In its default configuration, rsyslog is just used for logging on a single machine. In a more advanced configuration, you might want to have multiple client machines sending log messages to a central server. If so, you can use TLS to securely send the log messages to the server. To enable TLS communication, every machine in the cluster needs a public/private key pair. Additionally, all of these keys need to be signed by a Certificate Authority (CA), which has its own public/private key pair. Every machine in the cluster needs to know the public key of the CA, so that they can verify each other’s keys: when two members of the cluster connect to each other, they send each other certificates signed by the CA to prove to each other that they are legitimate members of the cluster. In this tutorial, we generate public/private key pairs for the CA, client, and server. Then we start a server rsyslog and a client rsyslog in two separate docker containers and connect them to each other.

An important word of warning about this tutorial:

To make the tutorial easy to run, we have set it up so that all the certificates are generated by the docker build step. This means that all the private keys are stored in the same docker image. Don’t do this in a real installation! You should generate all the certificates separately so that each machine only knows its own private key.

To make the tutorial as simple as possible to try out, we have uploaded the necessary scripts to GitHub. You can download the scripts and build the docker image as follows:

git clone https://github.com/Semmle/SecurityExploits.git
cd SecurityExploits/rsyslog/CVE-2018-1000140_snprintf_librelp
docker build . -t rsyslog-demo

Besides installing all the necessary software packages, the Dockerfile also runs the following setup steps:

Download the source code for rsyslog, librelp and related components and check out the version with the vulnerability.
Build and install rsyslog from source, by calling this build script.
Call this script to generate certificates for the client, server, and CA.
Call this script to generate a malicious certificate. We demonstrate this in the next section.

Generating certificates with openssl is normally an interactive process, in which it asks you questions about your name, email address, organization, and so on. To make the process fully automatic, we have instead used configuration files. For example, ca.config contains the configuration for the certificate authority and server.config contains the configuration for the server. Hopefully it is reasonably clear which lines of the configuration files you need to change if you are setting up your own rsyslog installation.

Using docker, we can simulate the scenario where there are two machines, a client and a server, both running rsyslog. First we need to create a virtual network, so that the client can connect to the server:

docker network create -d bridge --subnet 172.25.0.0/16 rsyslog-demo-network

Now, in two separate terminals, we start two docker containers: one for the server and one for the client. In terminal 1, we start a container for the server like this:

docker run --network=rsyslog-demo-network --ip=172.25.0.10 -h rsyslog-server -i -t rsyslog-demo

In terminal 2, we start a container for the client like this:

docker run --network=rsyslog-demo-network --ip=172.25.0.20 -h rsyslog-client -i -t rsyslog-demo

The docker image contains rsyslog configuration files for the client and server. These configuration files reference the TLS certificates which were generated during the building of the docker image. We can now start the client and server. In terminal 1, we start the rsyslog server:

sudo rsyslogd -f benevolent/rsyslog-server.conf

(Note: the docker image is configured so that the sudo password is “x”.)

In terminal 2, we start the rsyslog client:

sudo rsyslogd -f benevolent/rsyslog-client.conf

There is now a secure TCP connection between the client and server.

Proof-of-concept exploit

The purpose of the vulnerable block of code is to search through the subject alternative names in the certificate, looking for a name that it recognizes. For example, we have specified in the configuration file for the rsyslog server that it should only allow a client with the name client.wholesomecomputing.com. Subject alternative names are usually used to enable one certificate to cover multiple subdomains. For example, the certificate for example.com might also include www.example.com and help.example.com as subject alternative names. In the context of rsyslog, you might use the subject alternative names to generate a single certificate that covers all the machines in the cluster (although this would be less secure than generating a separate certificate for each machine).

To trigger the vulnerability, we need a certificate with a large number of very long subject alternative names. The size of the stack buffer, allNames, is 32768 bytes. Alternative names that exceed 1024 bytes are rejected by gnutls_x509_crt_get_subject_alt_name, so the config file for our malicious certificate contains 33 alternative names, most of which are just over 1000 bytes long. The 32nd name is slightly shorter to adjust the gap, so that the 33rd name overwrites the return address on the stack.
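
A rough calculation shows why about 33 names are needed. The sketch below assumes, purely for illustration, that every name has the same length; the lengths in the real malicious certificate may differ, and the function name names_until_past_end is hypothetical. Each entry costs the name plus the 11 characters of "DNSname: " and "; ".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts how many entries of the form
 * "DNSname: <name>; " it takes before the running offset, advanced the way
 * the vulnerable loop advances it, ends up past the end of the buffer. */
unsigned names_until_past_end(size_t capacity, size_t name_len)
{
    assert(name_len <= 1024);  /* longer names are rejected, per the text */
    const size_t entry_len = strlen("DNSname: ") + name_len + strlen("; ");
    size_t offset = 0;
    unsigned count = 0;
    while (offset <= capacity) {
        offset += entry_len;
        ++count;
    }
    return count;
}
```

With an assumed name length of 1020 bytes, each entry is 1031 bytes, so the offset passes the end of the 32768-byte buffer on the 32nd name (32 × 1031 = 32992), and a 33rd name would be the first one written out of bounds.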

To run the exploit PoC, start the two docker containers as before. In the client container, start rsyslog with the malicious configuration:

sudo rsyslogd -f malicious/rsyslog-client.conf

The rsyslog server runs in the background, so you need to use ps to see that it has crashed. Alternatively, you can attach gdb to it before you start the malicious client, to see the effect of the exploit in slow motion. If you want to do this, then you need to add a few extra command line flags when you start the docker container for the server:

docker run --network=rsyslog-demo-network --ip=172.25.0.10 -h rsyslog-server --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -i -t rsyslog-demo

These extra arguments disable the default security settings in docker which prevent gdb from attaching to another process. Without them, you are likely to encounter an error message like this. This solution was suggested here.

Timeline

2018-03-19: Kev reported the vulnerability to Rainer.
2018-03-20: Rainer fixed the bug. (The fix was cleverly camouflaged as a refactoring.)
2018-03-22: Rainer released librelp version 1.2.15, which included the fix for the vulnerability.
2018-03-23: iwantacve.org assigned CVE-2018-1000140.
2018-03-26: Rainer published a security advisory.
2018-06-19: Blog post and exploit PoC published on Rainer’s blog and our blog.

Note: Post originally published on LGTM.com on June 19, 2018