August 27, 2020

Now you C me, now you don't: An introduction to the hidden attack surface of interpreted languages

Bas Alberts

Attack surface is a layer cake. We commonly define software attack surface as anything that reacts to or is influenced by attacker-controlled input. When developing higher-level interpreted language applications it is tempting to make the assumption that the lower-level code in the runtime system or libraries of the language itself is sound.

This is empirically false.

It is often the case that there exists relatively fragile C/C++-based attack surface right below the memory-managed luxuries of the higher-level language. Such issues may exist in the core implementation of the language itself as well as in the third-party language ecosystem that exposes C/C++-based library functionality to the higher-level language.

Such third-party surface is commonly exposed through an explicit Foreign Function Interface (FFI) or some other form of API-translation wrapper that facilitates passing higher-level objects to and from the lower-level code. Often referred to as native modules, extensions, or some sort of play on the FFI acronym, such attack surface exposes all the memory mismanagement vulnerabilities associated with C/C++ code in the security context of the higher-level application.

In this series of posts we’re going to explore some historical and current examples of the sometimes overlooked C/C++ attack surface in the context of higher-level language applications. In this first installment we will provide historical context for the low-level attack surface of interpreted languages and demonstrate that many of its vulnerability themes span languages. In future installments we will present new attacks against modern interpreted language ecosystems and examine which characteristics tend to make a bug in these surfaces an actually exploitable vulnerability.

This series is aimed mostly at developers who want to make informed decisions about how, why, and where they expose attack surface to potentially malicious input; as such, it will include explanations of software exploitation theory where required.

For the purposes of this discussion we broadly define software exploitation as follows: leveraging input as influence to pivot a target process from its intended state space into unintended state space.

Setting the plate

Context is everything when it comes to deciding whether something is “just” a bug or can be considered a vulnerability. To be considered a vulnerability, a bug should further the agenda of an attacker in some way, shape, or form. This is itself highly context dependent, especially when dealing with core language issues. How and where an affected API is exposed to attacker input mostly dictates whether or not it can be considered a vulnerability.

In the context of interpreted languages there exist two main attack scenarios. In the first scenario, the attacker has the ability to run their own programs on the target interpreter and generally their aim is to subvert the security guarantees of the interpreter itself to entice the process hosting the interpreter into crossing some sort of security boundary.

Examples of this include vulnerabilities in JavaScript interpreters employed by web browsers. Variations on this theme include scenarios in which the attacker has achieved the ability to run arbitrary interpreted code in the first phase of an attack, but the interpreter itself enforces some sort of restriction that limits the attacker’s ability to further their attack, e.g. a restrictive PHP configuration that disallows the use of command execution functions.

When an attacker has full access to the targeted interpreter, it is common to look for vulnerabilities in the core interpreter implementation. There exists a long history of e.g. memory mismanagement bugs in various JavaScript interpreters that have led to exploitable client-side vulnerabilities in web browsers. Since the attacker has full control of the interpreter state, any bugs in the interpreter itself are potentially beneficial from an attacker perspective.

In the second scenario, there exists some logic implemented in the interpreted language to which the attacker can provide input, but they cannot interact with the interpreter directly. In these scenarios the scope of the attacker’s surface reach is limited by the APIs to which they can actually pass data, either directly or indirectly.

When we only control input into a target process implemented in the higher-level language, we are limited to the logic we can influence with that input, which often limits attacker options. In these scenarios, digging a bit deeper to expose any lower-level surfaces handling the input can uncover vulnerabilities that are invisible from a higher-level logic perspective.

From an attacker perspective, we essentially aim to increase the vertical of our attack surface. If you can’t go wide, go deep!

Getting past the icing

There exists a long and fascinating history of interpreted language vulnerabilities that were exploitable at a lower level. A full history is beyond the scope of this post, but we will dive into the details of some interesting examples that highlight how high level bugs can become low-level vulnerabilities, as often an understanding of the old can inspire the new.

Historical Example: when Perl formatting goes rogue

The case of format string vulnerabilities as they pertained to Perl code anno 2005 is an interesting one. To fully appreciate the issue as it existed we have to first quickly recap the basics of format string exploitation in C programs.

C Format String bugs in 5 minutes

While many people consider format string bugs to be largely eradicated due to their ease of detection, they still tend to pop up in unexpected places from time to time.

Format string bugs are also interesting to consider in the context of interpreted languages interacting with lower-level code, as attacker-influenced format strings may be pre-processed at the higher level and then end up being passed down into lower-level formatting functions directly. Such deferred formatting issues are not uncommon and have a particular habit of creeping in when there is a strong logical separation between the source of a format string and its destination.

In a nutshell, format string bugs are a class of bugs in which an attacker provides their own format string data into a formatting function, e.g. printf(attacker_controlled). They can then abuse the handling of their controlled format specifiers to achieve read and write primitives against the targeted process space.

Practical exploitation of such vulnerabilities mostly relies on the ability to abuse the %n and %hn class of format specifiers. These formatters direct a formatting function to write the current running count of printed characters to an int (%n) or a short (%hn) respectively, e.g. printf("ABCD%n", &count) would write a value of 4 to the integer count by way of a pointer argument.

Likewise, in a scenario where the output of the formatting function is visible to an attacker, they may dump memory contents by simply providing formatters that expect to print the value of a variable, e.g. printf("%x%x%x%x"). This ability to “eat” the stack also becomes important when lining up the intended target pointer values with their %n counterparts.

If an attacker is able to provide controlled data into the call stack of a formatting function, which is commonly achieved through the malicious format string itself, and barring any compiler mitigations, they can combine control over the written-character counter with control over the pointer values to which %n/%hn write. This results in the ability to write attacker-controlled values into attacker-specified memory locations.

By using tricks like setting the precision/width on format specifiers to set the written character counter to specific values and, where supported, direct parameter access indexes, even small format string inputs can become powerful exploitation primitives for an attacker.

Direct Parameter Access (DPA) in C formatting functions allows you to specify the index of the argument to use for your formatter. For example, printf("%2$s %1$s\n", "first", "second") would print “second first”, since the first string formatter specified argument 2 (2$) and the second string formatter specified argument 1 (1$). Likewise, from an attacker perspective, using DPA allows you to directly offset to a stack location that contains your desired target pointer value for a given %n/%hn write. Understanding the purpose of DPA is important as we pivot into our historic Perl example.

The ghosts of Perl formatting past (CVE-2005-3962)

It is not uncommon for interpreted languages to provide their own formatting functions; Perl, specifically, provides formatting support through its Perl_sv_vcatpvfn function at the lower level. This low-level C API provides much of the core formatting support for the higher-level Perl API. Its formatting syntax is somewhat similar to C’s in that it supports the concept of direct parameter access, which in Perl land is referred to as an “exact format index”, as well as the %n class of formatters.

Understanding Perl’s built-in formatting support became interesting when you considered that there existed Perl-based remote service applications that were blatantly vulnerable to format string bugs. However, since there was no way to exploit those bugs directly at the Perl level, the security research community had not put much effort into attempting to exploit such issues, and they were generally considered to be “just a bug.”

Around 2005 I performed some deeper research into Perl format string exploitation after a request from the author of CVE-2005-3962 (Jack Louis) to establish exploitability. He had encountered some observable crashes in the Perl interpreter when testing one of his Perl format string bug findings in Webmin.

Turns out that yes, you could indeed exploit Perl-based format string bugs through the C-level implementation of Perl’s formatting support in Perl_sv_vcatpvfn.

Arguments to Perl’s format strings were stored in an array of argument structure pointers (called svargs), and when presented with an exact format index for a format specifier (e.g. %1$n), this index was used to retrieve the appropriate argument structure pointer from the argument array. When retrieving the associated argument structure pointer, Perl would ensure the provided index did not exceed the upper bound of the array according to the number of arguments available to the format string. This argument count was maintained as a signed integer, svmax. For example, if a format string was passed one argument, svmax would have a value of 1, and the exact format index was checked to not exceed 1. In the case of an attacker-supplied format string there exist no arguments, and svmax is 0.

However, the exact format index was also maintained as a signed 32-bit integer, and its value was completely controlled by the attacker-supplied format string. This meant that you could set this argument array index to a negative value, which would pass the signed upper-bounds check against svmax.

With this realization, exploitation became fairly straightforward, especially anno 2005. One could simply index below the svargs array to any pointer that pointed at attacker-controlled data. This attacker-controlled data would be interpreted as an argument structure containing a pointer to a value field. Combined with the familiar %n formatter, this resulted in a controlled write to a controlled location. Using such a write primitive, it is then possible to rewrite the contents of any writable process memory, which can be leveraged into full process control in a variety of ways.

This is a good example of a relatively simple peek below the covers of the Perl formatting implementation turning a bug into a vulnerability. Combined with the format string vulnerabilities in Webmin this resulted in a full Remote Code Execution (RCE) exploit against Webmin.

Our takeaway is that anytime something looks like “just a bug” at the higher language level, it is beneficial to assess the lower-level handling of the faulty input, as there may exist straightforward paths to escalate it into full exploitation, even when there is a previously established consensus that such an issue is not practically exploitable.

The unlimited potential of the PHP interpreter

The PHP interpreter in its many incarnations has an interesting history from an attacker perspective. It is commonly attacked both from a full interpreter control perspective, in which an attacker has the ability to execute arbitrary PHP code, as well as from a remote API input perspective, in which the attacker can provide malicious input to potentially vulnerable PHP APIs.

One of the more interesting examples of PHP interpreter exploitation is the unserialize class of attacks. It is interesting for our discussion because attackers have attacked PHP’s unserialization API at both the PHP logic level as well as at the core interpreter level.

It is commonly understood that unserializing untrusted user supplied data is a bad idea. Arbitrary object inflation in the context of the remote application may, depending on which classes are available and allowed in the application namespace, lead to relatively straightforward arbitrary PHP execution. This theme transcends language boundaries and we see the same concepts actively exploited in pretty much any language and application framework that supports deserialization.

After an attacker achieves arbitrary PHP execution, they may find themselves limited by a restricted interpreter configuration, at which point they will commonly explore methods of lifting those restrictions. A historically popular option is to abuse bugs in the PHP interpreter itself. A recent example of such an attack can be found in https://bugs.php.net/bug.php?id=76047 in which a Use After Free (UAF) vulnerability in the debug_backtrace() function can be leveraged to take full control of the PHP interpreter itself and lift any configuration restrictions.

At times, even when presented with a controlled PHP unserialization primitive, an attacker may not be able to pivot this to arbitrary PHP execution due to a lack of available class insights or other restrictions on the application namespace. This is where diving under the surface into the lower code layers again becomes a viable strategy.

There exists a significant history of memory mismanagement issues in PHP’s unserialization API and it is a popular target for fuzzing and research into interpreter vulnerabilities in general.

By leveraging such memory mismanagement issues at the interpreter implementation level of the unserialization API, a determined attacker is able to pivot an otherwise unexploitable vulnerability into a fully exploitable vulnerability. There exist many examples of practical and real world attacks leveraging this surface for remote exploitation of PHP applications.

A relatively recent and related example is Ruslan Habalov’s excellent write-up of how they leveraged a combination of low-level PHP interpreter bugs and high-level PHP API interaction into full RCE against a high-profile real-world target.

The case of PHP unserialization’s hybrid attack surface at both the higher- and lower-level implementation serves as another good example of the vertical attack surface of interpreted languages and how it can be abused in concert by a determined attacker.

Putting the C in Boa Constrictor: CVE-2014-1912

Our third and final example for the purposes of this post is the case of CVE-2014-1912. This vulnerability existed in Python’s socket.recvfrom_into function.

Introduced in Python 2.5, socket.recvfrom_into’s intended use is to receive data into a specified Python bytearray. However, it lacked an explicit check ensuring that the destination buffer for the received data was actually large enough to hold the specified amount of incoming data.

e.g. socket.recvfrom_into(bytearray(256), 512) would trigger memory corruption.

It was patched with the following fix:

diff -r e6358103fe4f Modules/socketmodule.c
--- a/Modules/socketmodule.c	Wed Jan 08 20:44:37 2014 -0800
+++ b/Modules/socketmodule.c	Sun Jan 12 13:21:19 2014 -0800
@@ -2877,6 +2877,14 @@
         recvlen = buflen;
     }

+    /* Check if the buffer is large enough */
+    if (buflen < recvlen) {
+        PyBuffer_Release(&pbuf);
+        PyErr_SetString(PyExc_ValueError,
+                        "buffer too small for requested bytes");
+        return NULL;
+    }
+
     readlen = sock_recvfrom_guts(s, buf, recvlen, flags, &addr);
     if (readlen < 0) {
         PyBuffer_Release(&pbuf);

To actually be vulnerable to exploitation, an application would have to explicitly attempt to read more data into a bytearray than was allocated for it via a size argument that is larger than the destination bytearray. If no size argument is provided, the function will default to the size of the bytearray itself and no memory corruption occurs.

If your development background is in languages where you’re responsible for your own memory management, that might sound familiar. You might even assume that no one in their right mind would commit such a folly. Obviously you’re not supposed to read more data than your destination buffers can hold, right?

But that’s exactly why this is an interesting case. It’s interesting not because the vulnerability was widespread in the real world, but because developers in memory-managed languages tend to implicitly trust the language implementation to keep them safe. There is a cognitive dissonance that may occur when issues such as CVE-2014-1912 are present in the language.

A Python developer might fully expect to be able to go s.recvfrom_into(bytearray(256), 512) without their Python interpreter being subjected to memory corruption. And indeed if you try this post-patch it now behaves as one might expect:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: nbytes is greater than the length of the buffer
>>>

So essentially the vulnerability was that this function was not memory safe in a language where memory safety is largely assumed. But to a C programmer CVE-2014-1912 mostly reads like “yes, that’s how that works, no?”

The lesson here is that safe memory management semantics are never a given, even in higher-level languages that advertise memory safety. When dealing with APIs that explicitly operate on statically sized mutable buffers, it never hurts to ensure that your sizes match your buffers, even in cases where it is presumably safe not to by virtue of the language itself.

From an attacker perspective, it is useful to audit for situations in which APIs that are commonly assumed to be memory safe are in fact not memory safe at all.

Conclusion

In this first installment of our series on the hidden C/C++ attack surface we’ve explored several practical examples of how the illusion of memory safety can trick developers into playing fast and loose with the inputs they accept into their applications.

Variations of this theme can and do exist in any interpreter API that is accessible from a higher level but implemented at its core in a memory-unsafe language. Whether or not these issues are practically exploitable is often a function of how much leeway a developer affords an attacker with regards to their input.

Exploitation is routinely frustrated by developers who put strict requirements on their input types, sizes, and value ranges. For example, when receiving an integer value, explicitly bounding the range of that value to what makes sense in the application context, as opposed to leaving it open to the full value range of the variable type itself, is a defensive programming habit that will serve you well.

In the next installment of this series, we will dive deeper into the modern C/C++ attack surface of interpreted languages with a focus on the third party library ecosystem of popular interpreted language frameworks and present some new attacks in this space.