I'm working on a function that returns an HTTP response from https://pypi.org/simple/ when Python's pip installer requests a package. When I push my code to GitHub, the CodeQL checks warn of a risk of server-side request forgery (SSRF) and ask me to add validation checks for the "user-defined input" (which is pip, in this case).

I have already made many attempts at validating the URL to satisfy this SSRF warning, but GitHub CodeQL has not accepted any of them so far. How can I rewrite the following to satisfy GitHub CodeQL's requirements for guarding against SSRF?

The relevant block of code:

import requests
from fastapi import APIRouter, Response

pypi = APIRouter(prefix="/pypi", tags=["bootstrap"])

@pypi.get("/{package}/", response_class=Response)
def get_pypi_package_downloads_list(package: str) -> Response:
    """
    Obtain list of all package downloads from PyPI via the simple API (PEP 503).
    """
    url = f"https://pypi.org/simple/{package}"
    full_path_response = requests.get(url)

The following is a non-exhaustive overview of the approaches I've tried in order to satisfy the SSRF warning. None of them has worked.

# Attempt 1
# Check that it's a PyPI URL
url = f"https://pypi.org/simple/{package}"
if "pypi" in url:
    full_path_response = requests.get(url)
else:
    raise ValueError("This is not a valid package")


# Attempt 2
# Validate that package name is alphanumeric (allow _ and -)
if package.replace("_", "").replace("-", "").isalnum():  
    url = f"https://pypi.org/simple/{package}"
    full_path_response = requests.get(url)
else:
    raise ValueError("This is not a valid package")


# Attempt 3
# Check that it's a valid connection
with requests.get(f"https://pypi.org/simple/{package}") as http_response:
    if http_response.status_code == 200:
        full_path_response = http_response
    else:
        raise ValueError("This is not a valid package")


# Attempt 4
# Tried using RegEx matching to validate package name
import re

if re.match(r"^[a-z0-9\_\-]+$", package):
    full_path_response = requests.get(f"https://pypi.org/simple/{package}")
else:
    raise ValueError("This is not a valid package")


# Attempt 5
# Use urllib.parse.urlparse to parse and validate the url
from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    parsed_url = urlparse(url)
    if parsed_url.scheme == "https" and parsed_url.hostname == "pypi.org":
        return True
    else:
        return False

def validate_package(package: str) -> bool:
    if package.replace("_", "").replace("-", "").isalnum():
        return True
    else:
        return False

# Validate package and URL
if validate_package(package) and validate_url(f"https://pypi.org/simple/{package}"):
    full_path_response = requests.get(
        f"https://pypi.org/simple/{package}"
    )  # Get response from PyPI
else:
    raise ValueError("This is not a valid package")


# Attempt 6
# Using a Pydantic model
from pydantic import BaseModel, HttpUrl, ValidationError

class UrlValidator(BaseModel):
    url: HttpUrl

def validate(url: str):
    try:
        UrlValidator(url=url)
    except ValidationError:
        log.error(f"{url} was not a valid URL")
        return False
    else:
        log.info(f"{url} was a valid URL")
        return True

# Attempt at URL validation to satisfy GitHub CodeQL requirements
url = f"https://pypi.org/simple/{package}"
if validate(url):
    full_path_response = requests.get(url)


# Attempt 7
# Encoding string before injection
import re
from urllib.parse import quote_plus

def _validate_package_name(package: str) -> bool:
    # Check that it only contains lowercase alphanumerics, "_", or "-" (length is not checked here)
    if re.match(r"^[a-z0-9\-\_]+$", package):
        return True
    else:
        return False

def _get_full_path_response(package: str) -> requests.Response:
    # Sanitise string
    package_clean = quote_plus(package)
    print(f"Cleaned package: {package_clean}")

    # Validation checks
    if _validate_package_name(package_clean):
        url = f"https://pypi.org/simple/{package_clean}"
        print(f"URL: {url}")
        return requests.get(url)
    else:
        raise ValueError(f"{package_clean} is not a valid package name")
full_path_response = _get_full_path_response(package)


# Attempt 8
# The nuclear option of maintaining a list of approved packages
approved_packages: list = [pkg.lower() for pkg in approved_packages]  # List of package names from running `conda env list`

# Validate package and URL
if package.lower() in approved_packages:
    url = f"https://pypi.org/simple/{package}"
    full_path_response = requests.get(url)
else:
    raise ValueError(f"{package} is not a valid package name")

2 Answers


CodeQL has a concept of sanitizers, which can be used to untaint input such as the package string in your case.

The proper solution both for preventing security issues and for getting rid of the SSRF warning is to URL-encode the package string before inserting it into the URL. This makes it impossible to inject special characters like slashes (for sub-paths) or a question mark (for query parameters).

Unfortunately, CodeQL doesn't currently seem to recognize URL-encoding as a sanitizer. However, it should be fairly easy to define one yourself. You can use the HTML-escaping sanitizer as an example.

As an alternative, download and cache the entire index under https://pypi.org/simple/. It's only around 27 MB. This fixes the security warnings, and you can avoid making external HTTP requests all the time.

Clarification: When using quote() as a sanitizer, you must consider the context. URL-encoding does not prevent SSRF attacks against the host component of the URL -- that requires a whitelist of constant strings (which CodeQL already recognizes). Also note that quote() leaves forward slashes unencoded by default. To change this, pass an empty string as the safe parameter.
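
A minimal sketch of the encoding step described above. The helper name build_simple_url and the constant PYPI_SIMPLE are hypothetical, not taken from the question. The host and base path are a constant string, and only the package value is percent-encoded with safe="" so that slashes and question marks cannot start a new path segment or a query string. Whether a particular CodeQL version treats this as a sanitizer is a separate matter.

```python
from urllib.parse import quote

# Constant scheme, host and base path: never built from user input.
PYPI_SIMPLE = "https://pypi.org/simple/"


def build_simple_url(package: str) -> str:
    """Hypothetical helper: percent-encode the package name, including '/'."""
    # safe="" overrides the default safe="/", so slashes are encoded too.
    return PYPI_SIMPLE + quote(package, safe="")


print(build_simple_url("requests"))      # https://pypi.org/simple/requests
print(build_simple_url("../admin?x=1"))  # https://pypi.org/simple/..%2Fadmin%3Fx%3D1
```

Dots are not encoded by quote(), so a value that is exactly ".." passes through unchanged and is worth rejecting separately.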

  • Thank you so much for your reply! This is the first one I've gotten since posting this question (including on other forums) at the start of this week! I have two questions, which I'll write as separate comments below, given the lack of paragraphs in the comments function! Commented May 22, 2024 at 14:21
  • 1. For URL encoding, are quote() and quote_plus() the ones I would want to implement? Looking at the docs, they seem to do the job you've described of replacing special characters with their escape variants instead, thereby sanitising the string? Commented May 22, 2024 at 14:22
  • 2. CodeQL not recognising URL-encoding as a sanitiser is quite unfortunate. Thank you for pointing me to the GitHub repo! When you say "define one [my]self", do you mean forking it, cloning it, and then writing my own variant of the HTML-escaping sanitiser? My team seems to be using a version labelled github/codeql-action/autobuild@v3, which was set in .github/workflows/codeql.yml in the repository. I'm still quite new to the field, so thank you in advance for your patience! Commented May 22, 2024 at 14:28
  • Ad 1: Yes, I mean quote(). Ad 2: I believe it's possible to customize CodeQL, so you don't necessarily have to replace the entire tool with your own version. But I'm not a CodeQL expert, so that's something you need to look up or discuss with your team. Commented May 22, 2024 at 16:21
  • Brilliant, thanks again for your reply. With regards to CodeQL, I've managed to make contact with their team, so I'll see how it goes from there. As for using quote(), would url = f"https://pypi.org/simple/{quote(package)}" be enough to mitigate the SSRF risk? Commented May 23, 2024 at 12:41

It turns out that this was a false positive due to the py/partial-ssrf check being incorrectly written. This issue has recently been reported as resolved, and the fix will be implemented in GitHub CodeQL v2.18.0.

From the dialogue, partial-SSRF can be guarded against as follows:

At that point, validating your input using mechanisms like isalnum() (as stated in the query help) or regex matches should help you avoid this alert.
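
A rough sketch of the validation described in that quote, assuming a deliberately strict pattern similar to the one in the question's attempts. The names validated_simple_url and _PACKAGE_RE are hypothetical. The pattern is narrower than what PyPI actually permits (real project names may also contain dots), so treat the character set as an assumption to adjust. The function only builds the URL, and the caller would pass the result to requests.get().

```python
import re

# Assumed allowlist: ASCII letters, digits, "_" and "-" only.
_PACKAGE_RE = re.compile(r"[A-Za-z0-9_-]+")


def validated_simple_url(package: str) -> str:
    """Hypothetical helper: reject anything outside the allowlist, then build the URL."""
    # fullmatch requires the whole string to match, so a trailing newline is rejected.
    if not _PACKAGE_RE.fullmatch(package):
        raise ValueError(f"{package!r} is not a valid package name")
    return f"https://pypi.org/simple/{package}"


for name in ("numpy", "scikit-learn", "a/b", "numpy\n"):
    try:
        print(validated_simple_url(name))
    except ValueError as err:
        print("rejected:", err)
```

re.fullmatch is used here rather than re.match with ^...$ because $ also matches just before a trailing newline, which would let a value such as "numpy\n" through.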
