Solve Computer Science

Posted on Jun 19

YouTube channel mirror on Jekyll - part 2

#youtube #python #ai #chatgpt

🧩 The problem

Continuing from the previous post, we'll now turn the shell script into a Python one. We want a Python integration that pulls feeds from YouTube directly, so the first step is to translate the original command-line options into Python API calls. This lets yt-dlp download only the videos we need instead of scanning the whole channel every time, which might result in IP throttling or temporary bans.

yt-dlp is also capable of working with text files in two ways:

to know which files have already been downloaded by saving their ids
to store the options instead of manually passing them via the shell or API

We're not going to use these two features today.
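For reference only, the first feature corresponds to the `download_archive` key in the Python API. Below is a minimal sketch, assuming the archive file holds one `<extractor> <video id>` entry per line. The file name `archive.txt` and the helpers `read_archive_ids` and `with_archive` are hypothetical and are not used in the rest of this post.

```python
import pathlib
import tempfile


def read_archive_ids(archive_path: pathlib.Path) -> set[str]:
    """Return the video ids recorded in a yt-dlp style download archive.

    Assumption: each non-empty line looks like "<extractor> <video id>".
    """
    ids: set[str] = set()
    if not archive_path.exists():
        return ids
    for line in archive_path.read_text(encoding='utf-8').splitlines():
        parts = line.split()
        if len(parts) == 2:
            ids.add(parts[1])
    return ids


def with_archive(ydl_opts: dict, archive_path: pathlib.Path) -> dict:
    """Return a copy of the options with the archive file enabled."""
    return {**ydl_opts, 'download_archive': str(archive_path)}


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        archive = pathlib.Path(tmp, 'archive.txt')
        # Hypothetical ids, written by hand just for the demo
        archive.write_text('youtube abc123\nyoutube def456\n', encoding='utf-8')
        print(sorted(read_archive_ids(archive)))
        print(with_archive({'verbose': True}, archive))
```

The dictionary returned by `with_archive` could then be passed to `YoutubeDL` so that ids already listed in the file are skipped on the next run.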

⚠️ Warning

⚠️⚠️ Before continuing, please only mirror content you have permission to... ⚠️⚠️

✅ The solution

🤖 Vibe coding

I'm lazy, so I asked GPT o4-mini (via Duck.ai) how to do it. I trimmed some of the output to make it more straightforward to read.

original shell command

yt-dlp "${URL}" \
    --verbose \
    --fixup detect_or_warn \
    --prefer-ffmpeg \
    --sub-langs "en,it" \
    --write-subs \
    --embed-subs \
    --write-auto-sub \
    --prefer-free-formats \
    --no-call-home \
    --no-overwrites \
    --recode webm \
    --add-metadata \
    --write-thumbnail \
    --convert-thumbnails png \
    --exec "echo \"%(title)s\" > ${DST_DIR}/%(id)s/title.txt" \
    --exec "echo \"%(description)s\" > ${DST_DIR}/%(id)s/description.txt" \
    --output "${DST_DIR}/%(id)s/%(id)s.%(ext)s"

Starting from this command, and setting URL='my youtube channel url', please transform the shell command to the Python calls

o4-mini

Here’s how you can do the same download in Python using yt-dlp’s API. You define essentially the same options in a dict, register a small progress hook to emit your “exec” behavior, and then call .download().

import os
from yt_dlp import YoutubeDL

# ========== configuration ==========

URL     = 'https://www.youtube.com/c/YourChannelNameHere'
DST_DIR = '/path/to/your/download/folder'

# Make sure the DST_DIR exists
os.makedirs(DST_DIR, exist_ok=True)

# ========== progress hook to write title & description ==========
def write_meta_hook(d):
    """
    This hook is called multiple times; we care about when download
    of a single video is finished so we can write title/description.
    """
    if d['status'] == 'finished':
        info = d.get('info_dict', {})
        video_id    = info.get('id')
        title       = info.get('title', '')
        description = info.get('description', '')

        # Create a per-video subfolder, mirror --output structure
        vid_dir = os.path.join(DST_DIR, video_id)
        os.makedirs(vid_dir, exist_ok=True)

        # write out title.txt and description.txt
        with open(os.path.join(vid_dir, 'title.txt'), 'w', encoding='utf-8') as f:
            f.write(title)
        with open(os.path.join(vid_dir, 'description.txt'), 'w', encoding='utf-8') as f:
            f.write(description)

# ========== yt-dlp options ==========
ydl_opts = {
    # general
    'verbose':             True,
    'no_overwrites':       True,
    'nocallhome':          True,
    'add_metadata':        True,
    'fixup':               'detect_or_warn',
    'prefer_ffmpeg':       True,
    'recode_video':        'webm',

    # subtitles
    'sub_langs':           ['en', 'it'],
    'writesubtitles':      True,   # --write-subs
    'writeautomaticsub':   True,   # --write-auto-sub
    'embedsubtitles':      True,

    # formats
    'prefer_free_formats': True,

    # thumbnails
    'writethumbnail':      True,
    'convert_thumbnails':  'png',

    # output template
    'outtmpl': os.path.join(DST_DIR, '%(id)s', '%(id)s.%(ext)s'),

    # post hooks
    'progress_hooks': [write_meta_hook],
}

# ========== run the download ==========
with YoutubeDL(ydl_opts) as ydl:
    ydl.download([URL])

🔎 Clues

I then double-checked whether all the options in the generated code actually exist. Surprise! Not all of them do.

If you open the yt-dlp source code and look at the YoutubeDL class, you'll find all the supported API options.
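This check can be roughly automated. The sketch below is a heuristic, not part of yt-dlp: it assumes the options are documented as `name: description` lines in a docstring, and it flags any key of an options dict that is not found there. `SAMPLE_DOC` is a made-up excerpt standing in for the real docstring, and `documented_options` and `unknown_options` are hypothetical helpers.

```python
import re

# Made-up excerpt that imitates a "name: description" option list.
SAMPLE_DOC = """
    Available options:

    verbose:           Print additional info to stdout.
    writesubtitles:    Write the video subtitles to a file.
    writethumbnail:    Write the thumbnail image to a file.
"""


def documented_options(doc: str) -> set[str]:
    """Collect every word that starts a line and is followed by a colon."""
    return set(re.findall(r'^[ \t]*(\w+):', doc, flags=re.MULTILINE))


def unknown_options(opts: dict, doc: str) -> list[str]:
    """Return the option names that the docstring does not mention."""
    known = documented_options(doc)
    return sorted(name for name in opts if name not in known)


if __name__ == '__main__':
    candidate = {'verbose': True, 'embedsubtitles': True}
    print(unknown_options(candidate, SAMPLE_DOC))  # ['embedsubtitles']
```

With yt-dlp installed, passing `YoutubeDL.__doc__` instead of `SAMPLE_DOC` should run the same check against the real list, as long as the docstring keeps this layout. Wrapped description lines that happen to contain `word:` can produce false matches, so treat the result as a hint.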

🆘 Trying to help the bot help me

Me

i'll tell you the missing options from the yt-dlp code. you need to fix them:

'recode_video', 'embedsubtitles', 'convert_thumbnails',

o4-mini

The three fixed option names are:

recode_video: your --recode webm remux into WebM

embed_subtitles: your --embed-subs

convert_thumbnails: your --convert-thumbnails png

It turns out these "fixes" are wrong too: two of the three names did not even change.

😌 Script that saved the day

There's an interesting utility in the yt-dlp repository called cli_to_api.py that translates shell options into Python ones. One of the maintainers suggests using that script for this use case. Here's how to run it:

clone the repository

  git clone https://github.com/yt-dlp/yt-dlp

install locally

  cd yt-dlp
  python3 -m venv .venv
  . .venv/bin/activate
  pip install .

run the script

  cd devscripts
  python -m cli_to_api "--verbose"

This is the output for the --verbose option

The arguments passed translate to:

[debug] Override config: ['--verbose']
{'verbose': True}

Combining these with the CLI defaults gives:

[debug] Override config: ['--verbose']
{'extract_flat': 'discard_in_playlist',
 'fragment_retries': 10,
 'ignoreerrors': 'only_download',
 'postprocessors': [{'key': 'FFmpegConcat',
                     'only_multi_video': True,
                     'when': 'playlist'}],
 'retries': 10,
 'verbose': True}

You can, of course, pass multiple arguments, each one in quotes:

python -m cli_to_api "--output" "DST_DIR/%(id)s/%(id)s.%(ext)s"

Which results in:

The arguments passed translate to:

{'outtmpl': {'default': 'DST_DIR/%(id)s/%(id)s.%(ext)s'}}

Combining these with the CLI defaults gives:

{'extract_flat': 'discard_in_playlist',
 'fragment_retries': 10,
 'ignoreerrors': 'only_download',
 'outtmpl': {'default': 'DST_DIR/%(id)s/%(id)s.%(ext)s'},
 'postprocessors': [{'key': 'FFmpegConcat',
                     'only_multi_video': True,
                     'when': 'playlist'}],
 'retries': 10}
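The two dictionaries in each output differ only by the CLI defaults. That relationship can be sketched in plain Python. `only_changed` is a hypothetical helper written for illustration, not the code inside cli_to_api.py, and `CLI_DEFAULTS` is copied from the output above.

```python
# Defaults as printed by the script in the "Combining these with the
# CLI defaults" section above.
CLI_DEFAULTS = {
    'extract_flat': 'discard_in_playlist',
    'fragment_retries': 10,
    'ignoreerrors': 'only_download',
    'postprocessors': [{'key': 'FFmpegConcat',
                        'only_multi_video': True,
                        'when': 'playlist'}],
    'retries': 10,
}


def only_changed(combined: dict, defaults: dict) -> dict:
    """Keep only the keys whose value differs from the defaults."""
    missing = object()  # marks keys that the defaults do not contain
    return {key: value for key, value in combined.items()
            if defaults.get(key, missing) != value}


if __name__ == '__main__':
    combined = {**CLI_DEFAULTS, 'verbose': True}
    print(only_changed(combined, CLI_DEFAULTS))  # {'verbose': True}
```

The first dictionary is the direct translation of the arguments you passed. The second one also includes what the command-line tool sets by default.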

🎯 Result

The write_meta_hook provided by GPT seems correct enough. However, to keep things simple, I decided to keep the original shell exec options as Exec postprocessors, swapping echo for a cat heredoc. I also prefer pathlib to os for managing paths.
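For comparison, this is roughly what the file-writing part of the hook might look like with pathlib. It is only a sketch: `write_meta` is a hypothetical helper, the `info` dict imitates the fields the hook reads (`id`, `title`, `description`), and it is not what the final script uses.

```python
import pathlib
import tempfile


def write_meta(dst_dir: str, info: dict) -> pathlib.Path:
    """Write title.txt and description.txt under dst_dir/<video id>/."""
    vid_dir = pathlib.Path(dst_dir, info['id'])
    vid_dir.mkdir(parents=True, exist_ok=True)
    # Fall back to an empty string when a field is missing or None
    (vid_dir / 'title.txt').write_text(info.get('title') or '', encoding='utf-8')
    (vid_dir / 'description.txt').write_text(info.get('description') or '', encoding='utf-8')
    return vid_dir


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up metadata, just to exercise the helper
        out = write_meta(tmp, {'id': 'abc123', 'title': 'Demo', 'description': 'Just a test'})
        print((out / 'title.txt').read_text(encoding='utf-8'))
```

Unlike the shell version, this variant does not append a trailing newline to the files.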

Here's the complete result:

import pathlib
import sys
from yt_dlp import YoutubeDL

DST_DIR = '/srv/http/videos'

ydl_opts: dict = {
    'verbose':             True,
    'overwrites':          False,  # --no-overwrites
    'call_home':           False,
    'add_metadata':        True,
    'fixup':               'detect_or_warn',
    'prefer_ffmpeg':       True,
    'subtitleslangs':      ['en', 'it'],
    'writesubtitles':      True,
    'writeautomaticsub':   True,
    'prefer_free_formats': True,
    'writethumbnail':      True,
    'final_ext':           'webm',
    'outtmpl': {
        'default': str(pathlib.Path(DST_DIR, '%(id)s', '%(id)s.%(ext)s'))
    },
    'postprocessors': [
        {
            # --convert-thumbnails png
            'format': 'png',
            'key': 'FFmpegThumbnailsConvertor',
            'when': 'before_dl'
        },
        {
            # --recode webm
            'key': 'FFmpegVideoConvertor',
            'preferedformat': 'webm'
        },
        {
            'exec_cmd': ["cat > " + str(pathlib.Path(DST_DIR, '%(id)s', 'title.txt')) + " << 'EOF'\n"
                         '%(title)s\n'
                         'EOF'],
            'key': 'Exec',
            'when': 'after_move'
        },
        {
            'exec_cmd': ["cat > " + str(pathlib.Path(DST_DIR, '%(id)s', 'description.txt')) + " << 'EOF'\n"
                         '%(description)s\n'
                         'EOF'],
            'key': 'Exec',
            'when': 'after_move'
        },
    ],
}

if __name__ == '__main__':
    # RSS feed code here: TODO
    # This will replace the single URL passed to the `download` method.

    url: str = sys.argv[1]

    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])

The script works exactly the same as the one using the shell:

python -m mirror_yt "video or channel URL"
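To see what the two Exec entries actually hand over to yt-dlp, the command string can be built and printed on its own. This is a small sketch: `heredoc_cmd` is a hypothetical helper that mirrors the string concatenation in the script, and it uses `PurePosixPath` so the output is the same on every platform.

```python
import pathlib


def heredoc_cmd(dst_dir: str, filename: str, field: str) -> str:
    """Build the `cat` heredoc command used by the Exec postprocessors.

    The %(...)s placeholders are left untouched: yt-dlp fills them in later.
    """
    target = pathlib.PurePosixPath(dst_dir, '%(id)s', filename)
    return f"cat > {target} << 'EOF'\n%({field})s\nEOF"


if __name__ == '__main__':
    print(heredoc_cmd('/srv/http/videos', 'title.txt', 'title'))
    print(heredoc_cmd('/srv/http/videos', 'description.txt', 'description'))
```

Quoting the `'EOF'` delimiter stops the shell from expanding `$` or backticks in the text. A title or description that contains a line consisting only of `EOF` would still end the heredoc early.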

🎉 Conclusion

As you've seen, this is yet more evidence that vibe coding does not always work.

Next time we'll integrate the YouTube RSS feeds into the script, as I did in the first post of this series.

You can comment here and check my YouTube channel.
