August 18, 2022

GHSL-2022-021: Regular Expression Denial of Service (ReDoS) in Apache Tika - CVE-2022-30126, CVE-2022-33879

GitHub Security Lab

Coordinated Disclosure Timeline


Apache Tika up to version 1.28.1 is vulnerable to Regular Expression Denial of Service (ReDoS) in the way it handles standard references in text files. Specially crafted files may cause catastrophic backtracking, taking exponential time to complete.


Apache Tika

Tested Version



Issue: Regular Expression Denial of Service (ReDoS) in (GHSL-2022-021)

Apache Tika uses the following regular expression to match uppercase standard headers when extracting standard references from a text file using StandardsText:

private static final String REGEX_HEADER =

Note the nested repetition at (\\d+\\.?)*. Because \\. is optional, a long run of digits can be consumed either by the inner \\d+ or by further iterations of the outer *, and the regex engine has no way to tell which applies. When no full match follows the digits, the engine backtracks through every possible split, which takes exponential time [1].
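The size of that search space can be illustrated by counting. The following is a minimal sketch and not code from Tika: the class name SplitCountSketch and the method countSplits are hypothetical, and only the dot-free splits of a run of n digits across iterations of (\\d+\\.?)* are counted. A backtracking engine may have to try each of these splits before it can report a failed match.

```java
class SplitCountSketch {
    // Counts the ways a run of n digits can be divided into consecutive
    // non-empty chunks, where each chunk is one iteration of (\d+\.?)
    // that matched digits only (no dot).
    static long countSplits(int n) {
        long[] ways = new long[n + 1];
        ways[0] = 1; // nothing left to consume: exactly one way (stop iterating)
        for (int len = 1; len <= n; len++) {
            // The first chunk takes `first` digits; the rest is split recursively.
            for (int first = 1; first <= len; first++) {
                ways[len] += ways[len - first];
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        // The count doubles with every extra digit: 2^(n-1) for n >= 1.
        System.out.println(countSplits(10)); // prints 512
        System.out.println(countSplits(20)); // prints 524288
        System.out.println(countSplits(30)); // prints 536870912
    }
}
```

Each additional digit doubles the number of candidate splits, which is consistent with the exponential growth described for this pattern.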

This regex is used in StandardsText.findHeaders, which in turn is called in StandardsText.extractStandardReferences. The latter is used by the StandardsExtractingContentHandler class, which means that parsing a text file containing a crafted payload with this handler can trigger the ReDoS vulnerability.

To demonstrate the exploitation, the provided StandardsExtractionExample can be used as follows:

$ cat /tmp/test/test.txt

$ cd tika-examples

$ mvn compile exec:java -Dexec.mainClass="org.apache.tika.example.StandardsExtractionExample" -Dexec.args="/tmp/test"

The time needed to parse the file grows exponentially with the length of the sequence of 9s in the input file test.txt.
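The growth can also be observed in isolation, without Tika. The sketch below is an illustration under stated assumptions: the pattern NESTED contains only the nested sub-expression quoted in this advisory followed by a literal "!" that the input never supplies, so it is not Tika's full REGEX_HEADER; the payload (a run of 9s) and the class name RedosTimingSketch are likewise hypothetical. FLAT is a rewrite that accepts the same strings without the ambiguity, included for comparison only; it is not the fix that Apache Tika shipped.

```java
import java.util.regex.Pattern;

class RedosTimingSketch {
    // Assumption: just the nested repetition from the advisory, plus a
    // trailing literal that forces the match to fail on a digits-only input.
    static final Pattern NESTED = Pattern.compile("(\\d+\\.?)*!");

    // Same accepted strings, but every digit run is delimited by a dot,
    // so there is only one way to consume a given run of digits.
    static final Pattern FLAT = Pattern.compile("(?:\\d+(?:\\.\\d+)*\\.?)?!");

    // Builds a run of n '9' characters (StringBuilder keeps this JDK 8 compatible).
    static String payload(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('9');
        }
        return sb.toString();
    }

    // Wall-clock time of a single full-match attempt, in nanoseconds.
    static long timeNanos(Pattern p, String input) {
        long start = System.nanoTime();
        p.matcher(input).matches();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int n = 16; n <= 26; n += 2) {
            String input = payload(n);
            System.out.printf("n=%d nested=%d us flat=%d us%n",
                    n, timeNanos(NESTED, input) / 1000, timeNanos(FLAT, input) / 1000);
        }
    }
}
```

On JDK 8 or earlier, the nested timings would be expected to climb steeply as n increases while the flat timings stay small. On newer JDKs the difference may not appear because of the mitigations in the regex engine.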

Note that JDK 9 introduced important mitigations for this problem, so in order to reproduce the issue with the above example, it must be run with JDK 8 or earlier.


This issue may lead to a denial of service by resource consumption.


[1] [2]



This issue was discovered and reported by the CodeQL team members @atorralba (Tony Torralba) and @joefarebrother (Joseph Farebrother).
The incomplete fix was discovered and reported by the CodeQL team member @atorralba (Tony Torralba) and @jarlob (Jaroslav Lobačevski) from GitHub Security Lab.


You can contact the GHSL team at, please include a reference to GHSL-2022-021 in any communication regarding this issue.