
Erry Kostala

How I used NLP and LLM to supercharge my Japanese learning

Introduction

I'm a software engineer and I have been studying Japanese for about a year and a half. In my free time, I've made some tools to help me study. In this post, I want to talk about these tools.

Japanese 101

First of all, I imagine that my audience is mainly software engineers and other people interested in the technology aspect, and that not many of you speak Japanese or are studying it. So I need to explain a few basic things about the Japanese language that will help you understand the reasoning behind the code I wrote. I'm not a linguist, so I'll just explain things to the best of my knowledge.

Japanese has three writing systems, two of which (Hiragana and Katakana) are syllabaries, and one of which (Kanji) consists of pictographic characters borrowed from Chinese. Each Kanji has a meaning (what the character means) as well as one or more readings (what sounds correspond to the character). In English, the letter 'a' doesn't mean anything on its own, but in Japanese, the kanji 本 means "book", "origin", and a few other things depending on the context.

If you see that kanji and you have never seen it before, you will not be able to read it, and you also won't know what it means. So when you are learning, you care about both the reading and the meaning of each word.

Typically, hiragana is used to replace kanji to show the reading of new words. So the sentence 本(book)を(object marking particle) 読む(read) ("to read a book") could be written as ほん(book)を(object marking particle)よむ(read). The advantage of this way of writing is that you can use your existing knowledge of Hiragana and Katakana (only 92 characters, which you need to learn very early on) to learn the reading of new kanji.

These characters are often added on top of kanji (this is called furigana) to aid in reading; here is an example image.

an image showing furigana

Now that I have covered this, I want to cover a few things about learning Japanese.

Tools for learning Japanese

A popular study application used by learners of Japanese (but also by students of other subjects) is Anki. It's an application with which you can make flashcards and study a selection of them each day. The order and frequency in which flashcards are presented is chosen by a spaced repetition system (SRS) algorithm, ensuring you study as efficiently as possible and retain knowledge longer.

Typically, I make my Anki flashcards using three fields:

Sentence:
The original sentence

Reading:
The sentence annotated with furigana

Meaning:
The English meaning

For example, the image below is a flash card for the sentence I mentioned earlier

Image description

And here's what the flash card will look like when studying and when checking the answer

Image description

Image description

As you can see, by adding furigana in brackets [] next to the kanji, I can annotate the reading. So by entering 本[ほん] the reading ほん will appear above the kanji 本.
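
This bracket format is also easy to work with programmatically. As a quick illustration (not part of my app, just a sketch), here's a small Python snippet that uses a regular expression to strip the readings back out or collect them on their own:

import re

# Matches the Anki furigana notation: an optional space, the kanji, and the
# reading in square brackets, e.g. " 本[ほん]" or " 読[よ]"
FURIGANA = re.compile(r" ?(\S+?)\[(.+?)\]")

def strip_furigana(reading_field):
    """Remove the bracketed readings, recovering the plain sentence."""
    return FURIGANA.sub(r"\1", reading_field)

def readings_only(reading_field):
    """Collect just the readings from an annotated sentence."""
    return [match.group(2) for match in FURIGANA.finditer(reading_field)]

print(strip_furigana("本[ほん]を 読[よ]む"))  # 本を読む
print(readings_only("本[ほん]を 読[よ]む"))   # ['ほん', 'よ']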

Automating the flash card making process (ChatGPT stuff begins!)

Making flash cards, especially inputting the reading and meaning, is a time-consuming process. I wanted to build an application that allowed me to paste or type in a Japanese sentence (from manga, anime, news, etc.) and have it generate the flash card for me.

Enter ChatGPT.

I made a simple Node.js backend which receives a Japanese sentence and returns three fields:

{"sentence": "本を読む",
"reading": "本[ほん]を 読[よ]む",
"meaning": "to read a book"
} # The space is necessary to correctly insert the furigana inside Anki

The backend uses ChatGPT with the following code

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Anki card format
const AnkiCard = z.object({
  sentence: z.string(),
  reading: z.string(),
  meaning: z.string(),
});

// System prompt
const STARTING_PROMPT = `You will receive a japanese sentence. You are to return ONLY RAW PLAINTEXT JSON of the following:
1. ** sentence**: Present each sentence with kanji as typically used, always inserting kanji where applicable even if omitted by the user.
2. **reading**: Display the sentence with furigana formatting compatible with Anki, by adding readings in brackets next to the kanji.
Ensure a single regular full-width space ALWAYS precedes each kanji. Even if the kanji is at the start of the sentence, the space should still be applied.
For example, "わたしは 食[た]べます". or at the start of a sentence: " 食[た]べます"
3. **meaning **: Provide an English translation of each sentence, including necessary explanations to accurately convey the meaning.
Direct translation isn't required, but the essence of the message should be clear.
Your responses will automatically generate the required information for effective Anki Deck cards for each sentence without user confirmation or additional prompts.
You are adept at handling sentences across various contexts, supporting users from beginner to advanced levels.
You provide RAW TEXT JSON only, as the text will be parsed by an app!`;

const SYSTEM_MESSAGE = {
  "role": "system",
  "content": STARTING_PROMPT
};

const existingConversation = [
  SYSTEM_MESSAGE,
];

// Sentence to convert (the user's input)
existingConversation.push({
  "role": "user",
  "content": text
});

// ANKI_MAKER_MODEL is the model name, configured elsewhere
const response = await openai.chat.completions.create({
  model: ANKI_MAKER_MODEL,
  messages: existingConversation,
  response_format: zodResponseFormat(AnkiCard, "anki-card"),
});

// The model replies with JSON text; parse it for the response handler below
const parsed = JSON.parse(response.choices[0].message.content);

Once I send my prompt and the user's sentence to ChatGPT, the zodResponseFormat helper (part of the OpenAI SDK's Zod support) makes sure the data comes back as structured JSON rather than free-form text. Then I simply return it to the user:

   res.json({
    prompt: text,
    reply: {
      reading: parsed.reading,
      sentence: parsed.sentence,
      meaning: parsed.meaning,
    }
  });
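
Calling an endpoint like this is just a plain HTTP POST. Here's a hedged sketch in Python; the /generate path, the port, and the "text" payload key are placeholders for illustration, the real routes are in the repo linked at the end:

import requests

# Hypothetical URL and payload shape; check the backend repo for the real route
response = requests.post(
    "http://localhost:3000/generate",
    json={"text": "本を読む"},
    timeout=60,
)
card = response.json()["reply"]
print(card["sentence"], card["reading"], card["meaning"])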

And here it is working in real time:

Image description

After that, I built a very basic frontend for it

Image description

And finally I added a feature to download a CSV of the sentences, which can then be dragged and dropped into Anki.
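
The export itself is nothing fancy. As a rough sketch (in Python here rather than the Node frontend, and with a made-up cards list), Anki will happily import a plain CSV where each row is one card and the column order matches the note's fields:

import csv

# Hypothetical list of generated cards
cards = [
    {"sentence": "本を読む", "reading": "本[ほん]を 読[よ]む", "meaning": "to read a book"},
]

with open("cards.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for card in cards:
        # Column order must match the Sentence / Reading / Meaning fields
        writer.writerow([card["sentence"], card["reading"], card["meaning"]])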

Automating further (The Python/NLP stuff!)

The process of creating flash cards consists of reading text, spotting words I don't know, and then using my API above to create the cards.

As a learning project, I wanted to automate the first two parts of the process as well. I generally don't think that's something one should do, because reading and finding new words on your own is an important part of learning. Nevertheless, if I ever wanted to be lazy, I needed a way to automate the whole process up to the creation of the flashcard, leaving me with just the task of studying from the card.

The project could be broken down into the following steps.

  1. Find some Japanese sentences. I could scrape news websites for easy access to some real Japanese.
  2. Find sentences that contain words I don't already have in my Anki deck. This is important, because I don't want to create thousands of duplicated cards every time I run the script, nor do I want to create thousands of cards with no learning value to me.
  3. Use my API to create the Anki-formatted flash cards (this is simple, just an API call)
  4. Add the cards to Anki (again this can be done with just an API call)

The challenging parts were 1 and 2; 3 and 4 were just API calls.

Scraping the news

Turns out this was pretty simple. Given a news article, all I had to do was to use Beautiful Soup to extract the text.

import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

# USER_AGENT is a browser-style User-Agent string defined in the project's config
def scrape_news_article(url):
    """
    Scrapes a Japanese article for its title and content.
    Returns a dictionary with 'title' and 'content'.
    """
    headers = {
        "User-Agent": USER_AGENT,
    }
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        logger.error(f"Error fetching the article: {e}")
        return None

    soup = BeautifulSoup(response.content, "html.parser")

    # Find the title
    title_tag = soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else "No title found"

    # Find the content
    content_tag = soup.find("div", id="js-article-body")
    content = content_tag.get_text(strip=True) if content_tag else "No content found"

    return {"title": title, "content": content}
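
One detail not shown above: the article body comes back as one long string, so before it can be compared against my deck it has to be split into individual sentences. A simple split on the Japanese full stop 「。」 is good enough for news text; something like this (an illustration of the idea, not the exact code from the repo):

def split_into_sentences(text):
    """Split Japanese text into sentences on the full stop 「。」."""
    return [part + "。" for part in text.split("。") if part.strip()]

article = scrape_news_article("https://example.com/some-japanese-article")  # hypothetical URL
if article:
    sentences = split_into_sentences(article["content"])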

Getting my existing list of Anki cards

Using the AnkiConnect add-on, I got a list of every single card in the deck.

# ANKI_SERVER is the AnkiConnect endpoint (http://localhost:8765 by default) and
# strip_tags() removes HTML markup from a field value; both are defined elsewhere in the project.
def get_anki_sentences(deck_name=None):
    """
    Fetches sentences from Anki via AnkiConnect API.
    Optionally filters by deck name.
    Returns a list of sentences (strings).
    """
    # Find note IDs in deck
    query = {"action": "findNotes", "version": 6, "params": {}}
    if deck_name:
        query["params"]["query"] = f"deck:{deck_name}"
    else:
        query["params"]["query"] = ""

    response = requests.post(ANKI_SERVER, json=query, timeout=60)
    note_ids = response.json().get("result", [])
    if not note_ids:
        logger.error(f"No notes found in deck '{deck_name}'")
        return []

    logger.info(f"Found {len(note_ids)} notes in deck '{deck_name}'")
    # Fetch note info
    notes_query = {"action": "notesInfo", "version": 6, "params": {"notes": note_ids}}

    notes_response = requests.post(ANKI_SERVER, json=notes_query, timeout=60)
    notes = notes_response.json().get("result", [])
    # Extract sentences (assume field named 'Sentence' or use first field)

    sentences = []
    for note in notes:
        fields = note.get("fields", {})
        if "Sentence" in fields:
            value = fields["Sentence"]["value"]
            value = strip_tags(value)
            sentences.append(value)
        elif fields:
            first_field = next(iter(fields.values()))
            value = first_field["value"]
            value = strip_tags(value)
            sentences.append(value)
    return sentences

Extracting the words

As I mentioned, I only care about sentences that contain words that don't already exist in my deck. In very simple terms, I had two sets: existing_words (every word from every sentence in my deck) and new_words (every word from every sentence in the article). Then all I had to do was subtract the sets.
But wait a second, how do you get those sets of words in the first place?

Even in English, it's not as simple as splitting the sentence by spaces and punctuation such as commas. For example, "I eat an apple" would be split into ["I", "eat", "an", "apple"], and "I ate an apple" into ["I", "ate", "an", "apple"]. This isn't good enough, because "eat" and "ate" are the same word, just conjugated differently; if I simply split on whitespace, I would end up with a lot of cards containing the same words in different conjugations. Additionally, articles such as "an" are words, but clearly not words worth learning on their own.

In Japanese, it's even more complicated because you can't just split by spaces. Japanese simply doesn't use spaces between words (although text aimed specifically at young children or foreign learners often does, 'real' Japanese doesn't).

It's clear I'd need to do something clever. Enter NLP.

Using a python module called fugashi, I was able to do the following:

  • Get the base form of each word - 'ate' and 'eat' would both count as 'eat'
  • Ignore particles, prepositions, and other words that aren't worth caring about, focusing only on verbs, nouns, and adjectives.
  • Ignore numbers (10, 20, 300, 3411, etc are all words, but not something worth learning on its own with anki), proper nouns (I don't need to learn the name of some random Japanese politician from Fukuoka, sorry), and foreign words and symbols.
  • Get a 'neat' list of words from both news articles and my Anki deck and return only sentences containing 'new' words!
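
To make the base-form and part-of-speech ideas concrete, here's roughly what fugashi gives you for a short sentence (assuming it's installed with the unidic-lite dictionary; the exact output depends on the dictionary used):

from fugashi import Tagger

tagger = Tagger()
for word in tagger("本を読んだ"):
    # surface = the word as written, pos1 = part of speech, lemma = base form
    print(word.surface, word.feature.pos1, word.feature.lemma)

# Roughly:
# 本    名詞   本
# を    助詞   を
# 読ん  動詞   読む
# だ    助動詞  だ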

This is the code to extract the words I cared about:

import logging

from fugashi import Tagger

logger = logging.getLogger(__name__)

def extract_vocab(sentences):
    """
    Extracts a set of vocabulary words from a list of sentences, filtering by
    part of speech. Excludes words that are numbers.

    This function uses a morphological tagger to analyze each sentence and
    extracts words whose part of speech is either noun ("名詞"), verb ("動詞"), or
    adjective ("形容詞"). For each matching word, the lemma (base form) is added
    to the vocabulary set if available; otherwise, the surface form is used.

    Args: sentences (Iterable[str]): A list or iterable of sentences to process.

    Returns: set: A set of unique vocabulary words (lemmas or surface forms)
    matching the allowed parts of speech.
    """
    # Nouns, verbs, and adjectives
    allowed_pos = ("名詞", "動詞", "形容詞")
    # Exclude numbers and proper nouns
    excluded_pos2 = ("数", "数詞", "固有名詞")
    # Exclude foreign words and symbols
    excluded_goshu = ("外", "記号")
    tagger = Tagger()
    vocab = set()
    for sentence in sentences:
        for word in tagger(sentence):
            pos = word.feature.pos1
            if (
                pos in allowed_pos
                and word.feature.pos2 not in excluded_pos2
                and word.feature.goshu not in excluded_goshu
            ):
                logger.debug(f"Word: {word.surface}, feature {word.feature}")
                # Use lemma if available (for base form comparison)
                vocab.add(word.feature.lemma or word.surface)
    return vocab


And I can use it like so to get the sentences that contain new words:
def filter_sentences_by_new_words(new_sentence_list, existing_sentence_list):
    """
    Filters sentences from new_sentence_list that contain words not present in
    existing_sentence_list. Also returns the new words found in each sentence.

    Args: new_sentence_list (list of str): List of sentences to filter.
    existing_sentence_list (list of str): List of sentences representing known
    vocabulary.

    Returns: list of tuples: Each tuple contains (sentence, set of new words)
    where the set contains words in the sentence not found in the vocabulary
    extracted from existing_sentence_list.
    """
    known_vocab = extract_vocab(existing_sentence_list)
    results = []
    for sentence in new_sentence_list:
        sentence_vocab = extract_vocab([sentence])
        new_words = sentence_vocab - known_vocab
        if new_words:
            results.append((sentence, new_words))
    return results


Putting it all together

With all this code in place, creating the flash cards was just a POST request to my API, and adding them to Anki another request to AnkiConnect. And what do you know, it worked!

Image description
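
For completeness, the "add the cards to Anki" step is a single POST to AnkiConnect's addNote action. A minimal sketch, assuming a deck and a note type with the Sentence / Reading / Meaning fields (the names here are placeholders for whatever you use in your own collection):

import requests

ANKI_SERVER = "http://localhost:8765"  # AnkiConnect's default address

def add_card(sentence, reading, meaning, deck="Japanese", model="Japanese Sentences"):
    """Add one note via AnkiConnect's addNote action."""
    payload = {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,    # placeholder deck name
                "modelName": model,  # placeholder note type with the three fields
                "fields": {
                    "Sentence": sentence,
                    "Reading": reading,
                    "Meaning": meaning,
                },
                "options": {"allowDuplicate": False},
            }
        },
    }
    return requests.post(ANKI_SERVER, json=payload, timeout=60).json()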

If you want to see the code behind this, check out https://github.com/errietta/card-miner/tree/main and https://github.com/errietta/ankimaker-backend/tree/main, as well as a video I made about the whole process:

https://www.youtube.com/watch?v=qFLuKbm0hZY

I hope you have enjoyed this deep dive into languages, both human and computer, and that you got something useful out of it!
