I'm trying to download a protected PDF from the New York State Courts NYSCEF website using Python. The URL looks like this:

https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw==

When I try to use requests.get() or even navigate to the page with Selenium, I either get:

  • A 403 Forbidden response (via requests)
  • Or a blank page with no <embed> tag (via Selenium)

Here’s what I’ve tried:

Using requests:

import requests

url = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=..."
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://iapps.courts.state.ny.us/nyscef/"
}
response = requests.get(url, headers=headers)
print(response.status_code)  # Always 403

And using SeleniumBase:

from seleniumbase import SB

with SB(headless=False) as sb:
    sb.open(url)
    sb.wait(5)
    try:
        embed = sb.find_element("embed")
        print(embed.get_attribute("src"))
    except Exception as e:
        print("❌ No embed tag found", e)

Nothing works.

Full code for reference:

from seleniumbase import SB
import requests
import os
import time

def download_pdf_with_selenium_and_requests():
    # Target document URL
    doc_url = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="

    # Setup download directory
    download_dir = os.path.join(os.getcwd(), "downloads")
    os.makedirs(download_dir, exist_ok=True)
    filename = os.path.join(download_dir, "NYSCEF_Document.pdf")

    with SB(headless=True) as sb:
        # Step 1: Navigate to the document page (using browser session)
        sb.open(doc_url)
        time.sleep(5)  # Wait for any redirects/cookies to be set

        # Step 2: Grab the actual PDF <embed src>
        try:
            embed = sb.find_element("embed")
            pdf_url = embed.get_attribute("src")
            print(f"Found PDF URL: {pdf_url}")
        except Exception as e:
            print(f"No <embed> tag found: {e}")
            return

        # Step 3: Extract cookies from Selenium session
        selenium_cookies = sb.driver.get_cookies()
        session = requests.Session()
        for cookie in selenium_cookies:
            session.cookies.set(cookie['name'], cookie['value'])

        # Step 4: Download PDF using requests with cookies
        headers = {
            "User-Agent": "Mozilla/5.0",
            "Referer": doc_url
        }

        response = session.get(pdf_url, headers=headers)
        if response.status_code == 200 and "application/pdf" in response.headers.get("Content-Type", ""):
            with open(filename, "wb") as f:
                f.write(response.content)
            print(f"PDF saved as: {filename}")
        else:
            print(f"PDF download failed. Status: {response.status_code}")
            print(f"Content-Type: {response.headers.get('Content-Type')}")
            print(f"Final URL: {response.url}")

if __name__ == "__main__":
    download_pdf_with_selenium_and_requests()

Response:

No <embed> tag found: Message: 
 Element {embed} was not present after 10 seconds!
  • The error you receive from the requests.get() call clearly tells you that this file is not freely available. If you believe that you have the right to access this file, then you need to contact the owners of the website. Commented Aug 4 at 10:39
  • Maybe first keep the Selenium browser open and check in DevTools whether the page has the same HTML with <embed>, or whether there is an <iframe>, which would need driver.switch_to.frame(). Commented Aug 4 at 10:59
  • Maybe you should configure Selenium/the browser to download the PDF instead of displaying it, then open the page with the link and click it instead of using requests for the download. Commented Aug 4 at 11:01
  • I'm surprised that even if I add all the headers that my browser sends, I still get an HTTP 403 (using only the requests package). If you manage to send exactly the same request as your browser would, the receiver has no way to tell them apart, does it? I will investigate this further, it seems interesting. Commented Aug 4 at 12:04
  • @Jeyekomon the server can absolutely tell if you are not using a real browser (via fingerprinting and JavaScript evaluation). Commented Aug 4 at 13:37
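
The comment above about having the browser download the PDF instead of displaying it could be tried roughly as follows, using plain Selenium with Chrome. This is only a sketch: `download.default_directory`, `download.prompt_for_download` and `plugins.always_open_pdf_externally` are Chrome preference keys, while the helper names `build_pdf_download_prefs` and `open_and_download` are made up for illustration. It does nothing about bot detection, so the site may still block the automated browser.

```python
import os


def build_pdf_download_prefs(download_dir):
    """Return Chrome preferences that save PDFs to disk instead of opening the viewer."""
    return {
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
    }


def open_and_download(url, download_dir="downloads", wait_seconds=10):
    """Open `url` in Chrome configured to download PDFs (hypothetical helper)."""
    # Imported here so the prefs helper above stays usable without Selenium installed.
    import time
    from selenium import webdriver

    os.makedirs(download_dir, exist_ok=True)
    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", build_pdf_download_prefs(download_dir))
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait for the download to finish
    finally:
        driver.quit()
    # Whatever ended up in the download folder, if anything
    return os.listdir(download_dir)
```

Calling `open_and_download(doc_url)` with the document URL from the question would then list whatever was saved in the downloads folder.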

2 Answers

With SeleniumBase, you can do the following to download that file to the ./downloaded_files/ folder:

from seleniumbase import SB

with SB(uc=True, test=True, external_pdf=True) as sb:
    url = "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="
    sb.activate_cdp_mode(url)
    sb.sleep(10)

4 Comments

Yes, this works. This actually downloaded the file. If it's not too much trouble, could you tell me what SeleniumBase does here that downloads the file, when my original code was getting a 403?
SeleniumBase has CDP Mode, which is a special mode that can bypass bot detection. That website has special protections in place to prevent downloads by non-humans, which is why you need to evade bot detection.
Does SeleniumBase also work on hCaptcha? Because I saw it can navigate Cloudflare human verification.
It doesn’t work on hCaptcha.

PDFs are by nature binary files and thus have no native protection, unless the source requires a release code to allow the download or the viewer includes a DRM method. Alternatively, the PDF itself can be scrambled (encrypted), so that an open password is needed before a PDF reader can unscramble and display it.

This is why many sites include a challenge mechanism, such as a regional IP block, cookies, CAPTCHAs, or a timer-based limit.

Likewise, some sites or PDF files have in the past even tested whether the viewer is Adobe's DRM reader.

Since a PDF must be downloaded before it can be viewed, the security is often added through cookies that test for a browser. Also, the URL here is not a /file.pdf path but a query URL that tells the server which document to return. Thus, without the expected User-Agent or cookies, or a true filename, a PDF reader cannot download the file.

The given example is a public file, and its response includes:

Content-Disposition: inline; filename="2025_1214_Citizens_Bank_N_A_v_Jose_L_Benitez_SR_et_al_NOTICE_OF_PENDENCY_6.pdf"

This indicates that the file is not protected at all, but the server does expect a browser signature (User-Agent) before it serves the file with an "inline" disposition.

Thus, to download and view it easily you just need to identify as a browser. NOTE: In this case cookies and other automatic challenge bypasses are not needed, because the file is NOT protected. The same approach may not work for more complex cases.

curl -Lo file.pdf -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" "https://iapps.courts.state.ny.us/nyscef/ViewDocument?docIndex=cdHe_PLUS_DaUdFKcTLzBtSo6zw=="
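
A rough Python counterpart to the curl command above, using only the standard library. It sends the same User-Agent string and takes the output filename from the Content-Disposition header instead of hard-coding file.pdf. The function names are illustrative, and there is no guarantee the server will accept the request: the question reports a 403 from requests with a shorter User-Agent.

```python
import os
import urllib.request
from email.message import Message

# Same User-Agent string as in the curl command
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0"


def filename_from_content_disposition(header_value, default="file.pdf"):
    """Extract the filename parameter from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    name = msg.get_filename()
    # Drop any directory part so a server-supplied name cannot escape the cwd
    name = os.path.basename(name) if name else ""
    return name or default


def download(url, user_agent=USER_AGENT):
    """Fetch `url` with a browser-like User-Agent and save the body to disk."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:  # follows redirects, like curl -L
        name = filename_from_content_disposition(
            response.headers.get("Content-Disposition", "")
        )
        with open(name, "wb") as f:
            f.write(response.read())
    return name
```

`download()` raises `urllib.error.HTTPError` on a 403, so a blocked request fails loudly rather than saving an error page as a PDF.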
