Coordinated Disclosure Timeline

Summary

textacy contains a regular expression that is vulnerable to ReDoS (Regular Expression Denial of Service).

Product

textacy

Tested Version

0.11.0

Details

ReDoS

ReDoS, or Regular Expression Denial of Service, is a vulnerability affecting inefficient regular expressions which can perform extremely badly when run on a crafted input string.

Vulnerability

The vulnerable regular expression is here.

To see that the regular expression is vulnerable, copy-paste it into a separate file as shown below:

import re

URL_REGEX = re.compile(
    r"(?:^|(?<![\w/.]))"
    # protocol identifier
    # r"(?:(?:https?|ftp)://)"  <-- alt?
    r"(?:(?:https?://|ftp://|www\d{0,3}\.))"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"(?:$|(?![\w?!+&/]))",
    flags=re.IGNORECASE
)

match = URL_REGEX.findall("http://0.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.")

This vulnerability was found by the CodeQL ReDoS query for Python, which was still experimental when it found this bug in 2021, but is now included in the standard suite of queries used by code scanning.

Impact

This issue may lead to a denial of service.

Credit

This issue was discovered by GitHub team members @erik-krogh (Erik Krogh Kristensen) and @yoff (Rasmus Petersen).

Contact

You can contact the GHSL team at securitylab@github.com, please include a reference to GHSL-2021-109 in any communication regarding this issue.