January 12, 2022

GHSL-2021-1037_GHSL-2021-1038: Improper sanitization of data URLs and style attributes in lxml HTML Sanitizer - CVE-2021-43818

Alvaro Munoz

Coordinated Disclosure Timeline

Summary

The lxml HTML sanitizer does not properly sanitize data URLs and inline style attributes, which may allow Cross-Site Scripting (XSS).

Product

lxml

Tested Version

Latest at the time of reporting

Details

Issue 1: Improper sanitization of inline style attributes (GHSL-2021-1037)

The code responsible for cleaning inline style attributes looks like this:

if not self.inline_style:
    for el in _find_styled_elements(doc):
        old = el.get('style')
        new = _css_javascript_re.sub('', old)
        new = _css_import_re.sub('', new)
        if self._has_sneaky_javascript(new):
            # Something tricky is going on...
            del el.attrib['style']
        elif new != old:
            el.set('style', new)

This code uses the following regexps to remove import statements and expression calls:

_css_javascript_re = re.compile(r'expression\s*\(.*?\)', re.S|re.I)
_css_import_re = re.compile(r'@\s*import', re.I)

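However, each substitution runs only once, so removing a match can join the surrounding text into a new dangerous expression. In the following example, removing the inner @import leaves an outer @import behind: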

<div style="@@importimport url('chrome://communicator/skin/');"></div>

This issue has lower priority because CSS-based XSS vectors do not normally work in modern browsers.
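
The reintroduction described above can be reproduced with the two regexps alone. The following is a minimal sketch, not lxml's code: strip_once and strip_until_stable are hypothetical helper names, and the loop-until-stable approach is only one possible mitigation, not necessarily the fix adopted upstream.

```python
import re

# The two patterns quoted in this advisory.
_css_javascript_re = re.compile(r'expression\s*\(.*?\)', re.S | re.I)
_css_import_re = re.compile(r'@\s*import', re.I)


def strip_once(style):
    """Apply each substitution a single time, as the quoted cleaner code does."""
    new = _css_javascript_re.sub('', style)
    return _css_import_re.sub('', new)


def strip_until_stable(style):
    """Hypothetical mitigation: repeat the substitutions until nothing changes."""
    while True:
        new = strip_once(style)
        if new == style:
            return new
        style = new


payload = "@@importimport url('chrome://communicator/skin/');"

# A single pass removes the inner "@import" and splices "@" + "import" together.
print(strip_once(payload))

# Repeating the pass removes the reassembled "@import" as well.
print(strip_until_stable(payload))
```

Run as written, the first call prints the reassembled @import rule, while the second leaves only the url(...) remainder.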

Impact

This issue may lead to Cross-Site Scripting (XSS).

Issue 2: Improper sanitization of data URL images (GHSL-2021-1038)

When lxml rewrites links, it uses the following regexps to identify possibly malicious schemes:

_is_image_dataurl = re.compile(
    r'^data:image/.+;base64', re.I).search
_is_possibly_malicious_scheme = re.compile(
    r'(?:javascript|jscript|livescript|vbscript|data|about|mocha):',
    re.I).search
def _is_javascript_scheme(s):
    if _is_image_dataurl(s):
        return None
    return _is_possibly_malicious_scheme(s)

Because _is_image_dataurl (the regexp r'^data:image/.+;base64') accepts any data URL whose MIME type starts with image/, it is possible to use data:image/svg+xml;base64, URLs that embed JavaScript code within the SVG image:

<a href="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAwIiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlhTUyIpOzwvc2NyaXB0Pjwvc3ZnPg==">asdf</a>

Right-clicking the link and opening it in a new tab triggers execution of the JavaScript code.

Impact

This issue may lead to Cross-Site Scripting (XSS).

CVE

CVE-2021-43818

Resources

Credit

These issues were discovered and reported by GitHub Security Lab team member @pwntester (Alvaro Muñoz).

Contact

You can contact the GHSL team at securitylab@github.com. Please include a reference to GHSL-2021-1037 or GHSL-2021-1038 in any communication regarding these issues.