GHSL-2021-1037_GHSL-2021-1038: Improper sanitization of data URLs and style attributes in lxml HTML Sanitizer - CVE-2021-43818

Coordinated Disclosure Timeline

2021-11-10: Report sent to gh@behnel.de
2021-11-11: Issues are acknowledged
2021-12-12: Fix is released

Summary

The lxml HTML sanitizer fails to properly sanitize data URLs and style attributes, which may lead to Cross-Site Scripting (XSS).

Product

lxml

Tested Version

Latest at the time of reporting

Details

Issue 1: Improper sanitization of inline style attributes (`GHSL-2021-1037`)

The code responsible for cleaning inline style attributes looks like this:

if not self.inline_style:
    for el in _find_styled_elements(doc):
        old = el.get('style')
        new = _css_javascript_re.sub('', old)
        new = _css_import_re.sub('', new)
        if self._has_sneaky_javascript(new):
            # Something tricky is going on...
            del el.attrib['style']
        elif new != old:
            el.set('style', new)

This code uses the following regexps to remove import statements and expression calls:

_css_javascript_re = re.compile(r'expression\s*\(.*?\)', re.S|re.I)
_css_import_re = re.compile(r'@\s*import', re.I)

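However, each substitution is applied in a single pass, so removing a match can itself reintroduce a dangerous expression. In the following payload, deleting the inner `@import` joins the leading `@` and the trailing `import` into a new `@import`:

<div style="@@importimport url('chrome://communicator/skin/');"></div>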

This issue has lower priority since CSS-based XSS vectors do not normally work in modern browsers.

Impact

This issue may lead to Cross-Site Scripting (XSS).

Issue 2: Improper sanitization of data URL images (`GHSL-2021-1038`)

When lxml rewrites links, it uses the following regexps to identify possibly malicious schemes:

_is_image_dataurl = re.compile(
    r'^data:image/.+;base64', re.I).search
_is_possibly_malicious_scheme = re.compile(
    r'(?:javascript|jscript|livescript|vbscript|data|about|mocha):',
    re.I).search
def _is_javascript_scheme(s):
    if _is_image_dataurl(s):
        return None
    return _is_possibly_malicious_scheme(s)

Because `_is_image_dataurl` accepts any base64 data URL whose media type starts with `image/`, it also accepts `data:image/svg+xml;base64,` URLs, and an SVG image can embed JavaScript code:

<a href="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAwIiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlhTUyIpOzwvc2NyaXB0Pjwvc3ZnPg==">asdf</a>

Right-clicking the link and opening it in a new tab triggers execution of the JavaScript code.
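
The check can be exercised in isolation. The sketch below copies the regexps and `_is_javascript_scheme` from the excerpt above and builds a small SVG payload for illustration (not the one shown in the advisory). `is_safe_image_dataurl` and `_SAFE_IMAGE_TYPES` are hypothetical names showing one possible stricter check, an allowlist of raster image subtypes. They are assumptions and not the upstream fix.

```python
import base64
import re

# Regexps and helper as quoted in the advisory.
_is_image_dataurl = re.compile(
    r'^data:image/.+;base64', re.I).search
_is_possibly_malicious_scheme = re.compile(
    r'(?:javascript|jscript|livescript|vbscript|data|about|mocha):',
    re.I).search


def _is_javascript_scheme(s):
    if _is_image_dataurl(s):
        return None
    return _is_possibly_malicious_scheme(s)


# Illustrative payload: an SVG document carrying a script element.
svg = (b'<svg xmlns="http://www.w3.org/2000/svg">'
       b'<script>alert("XSS")</script></svg>')
svg_url = 'data:image/svg+xml;base64,' + base64.b64encode(svg).decode('ascii')

# The SVG data URL is treated as a harmless image (None means "not malicious"),
# while a non-image data URL is still flagged.
svg_allowed = _is_javascript_scheme(svg_url) is None
html_flagged = _is_javascript_scheme('data:text/html,<b>hi</b>') is not None

# Hypothetical stricter check: allow only selected raster subtypes.
_SAFE_IMAGE_TYPES = {'png', 'jpeg', 'gif'}  # assumption for illustration
_image_dataurl_type = re.compile(
    r'^data:image/([^;,]+);base64,', re.I).match


def is_safe_image_dataurl(s):
    m = _image_dataurl_type(s)
    return bool(m) and m.group(1).lower() in _SAFE_IMAGE_TYPES


print(svg_allowed, html_flagged, is_safe_image_dataurl(svg_url))
```

An allowlist is used here rather than a blocklist of `svg` so that other scriptable or unknown image subtypes are rejected by default.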

Impact

This issue may lead to Cross-Site Scripting (XSS).

CVE

CVE-2021-43818

Resources

https://github.com/lxml/lxml/security/advisories/GHSA-55x5-fj6c-h6m8

Credit

These issues were discovered and reported by GitHub Security Lab team member @pwntester (Alvaro Muñoz).

Contact

You can contact the GHSL team at securitylab@github.com. Please include a reference to GHSL-2021-1037 or GHSL-2021-1038 in any communication regarding these issues.
