Jason St-Cyr

Posted on Mar 31 • Originally published at jasonstcyr.com on Mar 25

Using OpenAI to Suggest Tags From a Taxonomy List

#generativeai #llm #openai #sanity

To be able to search and list products by tags, the data in a database needs those tags associated with every product (or piece of content). Sometimes the data set is just too large to curate manually. I wanted auto-tagging, but not just whatever tag the AI could think up. I have a specific taxonomy that I want respected, and I want all the pieces of content and products processed and updated with the most applicable tags from my taxonomy list.

I started looking at some of my article content and playing around with tag-based listings of articles. I wanted to be able to separate some of my fantasy fiction content from my developer tutorials and videos, or at least be able to tag them that way. This would prove out the model of auto-tagging to a specified list of tags.

In this article I'll break down what I learned about doing this with OpenAI (I also investigated Gemini, but that will be another article). If you just want to see the code: 🧑‍💻Full example script on GitHub

The Tech Stack Used:

OpenAI 4.88.0 for OpenAI client
React 18.0.25 and TypeScript 5.7 for app development
Next Sanity 9.9.5 for Sanity client
npm 11.2 for package management
ts-node 10.9.2 for running scripts
dotenv 16.4.7 for loading environment variables

Getting Started with an OpenAI Account and the Client

My first attempt was using OpenAI. There isn't a free tier to do this type of work, so I needed to put at least $5 onto a credit card to start making requests. Once I did, the requests cost about half a cent each (I made 16 requests for 8 cents during my first test run).

In my example, I'm using TypeScript in a Next.js application, but you can probably use similar code in your application. The following steps will get you connecting to OpenAI.

1. Install OpenAI package

The first step was to get the OpenAI package installed in the application. I used npm for my installation.

npm install openai

2. Import OpenAI and Initialize Client

Next up was getting the OpenAI client initialized. This requires an API key, which is where the OpenAI platform comes in: until you have an account on the OpenAI platform (https://platform.openai.com/), you can't create an API key. Once you have your API key, you can initialize your client and make calls. In my code example here, I'm putting my API key in an environment file.

import OpenAI from 'openai';

const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY, // Make sure to add this to your .env file
});
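If the key is missing from the environment file, the first sign of trouble tends to be a failed request later on. As a minimal sketch, a small guard can fail fast instead. The `requireEnv` helper and `Env` type below are hypothetical names for illustration and are not part of the original script.

```typescript
// Hypothetical helper (not part of the original script): read a required
// setting from an environment-like object and fail fast if it is missing.
type Env = Record<string, string | undefined>;

function requireEnv(name: string, env: Env): string {
    const value = env[name]?.trim();
    if (!value) {
        throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
}

// Intended usage in a Node script, after dotenv has loaded the .env file:
//   import 'dotenv/config';
//   const openai = new OpenAI({ apiKey: requireEnv('OPENAI_API_KEY', process.env) });
```

Passing the environment in as a parameter keeps the helper easy to test without touching real credentials.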

3. Add Credits to Your OpenAI Account

Until you add a credit card and at least $5 in credits to your account, any request you make to the OpenAI platform will come back with an error message. You'll need to do this in the Billing section of the OpenAI Platform: https://platform.openai.com/settings/organization/billing/overview

Using OpenAI as Your Tagging Recommendation Engine

Now that the OpenAI client is imported into your application and your account has some credits to work with, it's time to start using the OpenAI API to do some tagging.

1. Define Your Taxonomy

At some point in your script, you need to define the collection of tags that you want to restrict OpenAI's suggestions to. In my test case, these related to the website articles I was testing with, but whatever collection you use is what needs to be specified here in the script.

// Define the set of tags to check
const selectedTags = ["fiction", "fantasy", "developer", "tutorial", "video", "event"];

2. Retrieve Data to be Tagged

The next step is to retrieve the data from your data source. In my scenario, this was executing a query against Sanity, but you might have a different data source you need to pull from. The key point is to get the fields from your data source that you want to pass to OpenAI to read as context. This is what it is going to use to determine which of your tags would apply to the piece of content.

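// Create a client to connect to your data source
import { createClient } from 'next-sanity';

// Connection settings come from the environment; these variable names are assumptions
const projectId = process.env.SANITY_PROJECT_ID;
const dataset = process.env.SANITY_DATASET;
const token = process.env.SANITY_API_TOKEN;

const sanityClient = createClient({
    projectId,
    dataset,
    token, // A write token is needed to patch documents later
    apiVersion: '2024-03-19',
    useCdn: false,
});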

// Fetch all posts from the data source, pulling back the data needed
const query = `*[_type == "post"] {
      _id,
      title,
      body,
      tags
}`;

const posts = await sanityClient.fetch(query);
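One thing to watch for: if the `body` field is stored as Portable Text (an array of blocks, which is common in Sanity but an assumption about this schema), it needs to be flattened to a plain string before it goes into a prompt. The sketch below shows one way that might look. The `toPlainText` and `truncate` helpers, the simplified types, and the 8000-character cap are all illustrative assumptions rather than code from the original script.

```typescript
// Minimal shapes for the parts of Portable Text this sketch reads.
// These types are illustrative assumptions, not Sanity's official typings.
interface PortableTextSpan { _type: string; text?: string; }
interface PortableTextBlock { _type: string; children?: PortableTextSpan[]; }

// Flatten Portable Text blocks into plain text, one paragraph per block.
// Non-text blocks (images, embeds, and so on) are skipped.
function toPlainText(blocks: PortableTextBlock[] = []): string {
    return blocks
        .filter(block => block._type === 'block' && Array.isArray(block.children))
        .map(block => (block.children ?? []).map(child => child.text ?? '').join(''))
        .join('\n\n');
}

// Hypothetical guard: cap the text length so very long posts don't inflate
// the request size. The 8000-character default is an arbitrary assumption.
function truncate(text: string, maxChars = 8000): string {
    return text.length > maxChars ? text.slice(0, maxChars) : text;
}

// Intended usage with a fetched post:
//   const content = truncate(toPlainText(post.body));
```

If your body field is already a plain string, this step can be skipped entirely.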

3. Use OpenAI to Check Tags Against Your Content

For each data item you are processing, you now need to connect to OpenAI and check if any of your tags are applicable to the content. There are five parts of this function that are important:

Specify the model to use. I used "gpt-4o" because that was what was suggested and it worked pretty well.
Messages. These are the instructions to the LLM to give it some context as to what task it should do and the specific prompt we want it to run, along with the data we have.
Temperature. This parameter on the request controls how creative the LLM should be. I used a setting of 0.3 to try to keep the model from being too random.
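Max Tokens. This parameter caps how many tokens OpenAI can generate in its response. Keeping this at a small number (100) stops a runaway response from adding a bunch of filler content, but note that it is a hard limit that truncates the output rather than a hint to the model. The instruction to return only a short comma-separated list comes from the prompt.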
Tag extraction. The data on the response comes back as a string and we need to extract it from the response and then split it up to get the array of strings, one for each tag that the LLM suggested.

// Ask OpenAI which of the target tags apply to a piece of content
async function checkApplicableTags(title: string, content: string, targetTags: string[]): Promise<string[]> {
    try {
        const response = await openai.chat.completions.create({
            model: "gpt-4o",
            messages: [
                {
                    role: "system",
                    content: "You are a helpful AI assistant that specializes in content analysis. Your task is to determine which of the specified tags are applicable to the provided article content."
                },
                {
                    role: "user",
                    content: `Given the following article:\n\nTitle: ${title}\n\nContent: ${content}\n\nPlease analyze the content and determine which of the following tags are applicable: ${targetTags.join(', ')}. Provide only the applicable tags in a comma-separated list with no additional text or explanation.`
                }
            ],
            temperature: 0.3,
            max_tokens: 100,
        });

        // Extract applicable tags from the response
        const applicableTags = response.choices[0]?.message.content?.trim() || '';
        return applicableTags.split(',').map(tag => tag.trim()).filter(tag => tag.length > 0);
    } catch (error) {
        console.error('Error checking applicable tags from OpenAI:', error);
        return [];
    }
}
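The function above trusts whatever strings the model returns. Since the goal is to respect a fixed taxonomy, it may be worth filtering the suggestions against the allowed list before saving anything. Here is a minimal sketch of that idea, assuming a hypothetical `filterToTaxonomy` helper that is not part of the original script.

```typescript
// Hypothetical helper: keep only suggestions that match the taxonomy,
// comparing case-insensitively and returning the taxonomy's own spelling.
function filterToTaxonomy(suggested: string[], taxonomy: string[]): string[] {
    const lowered = taxonomy.map(tag => tag.trim().toLowerCase());
    const result: string[] = [];
    for (const raw of suggested) {
        // Strip whitespace, surrounding quotes, and a trailing period before matching
        const key = raw.trim().replace(/^["']+|["'.]+$/g, '').toLowerCase();
        const index = lowered.indexOf(key);
        // Skip anything outside the taxonomy and avoid duplicates
        if (index !== -1 && result.indexOf(taxonomy[index]) === -1) {
            result.push(taxonomy[index]);
        }
    }
    return result;
}

// Intended usage with the function above:
//   const suggested = await checkApplicableTags(post.title, content, selectedTags);
//   const validTags = filterToTaxonomy(suggested, selectedTags);
```

This way an invented or misspelled tag is dropped quietly instead of ending up in your data.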

Saving Your Progress

At this point you will want to save the results into whatever your target system is. In my scenario, my script is going to save back to Sanity to update the records I already moved over. However, I also want to put this in a migration script that will process the data as it goes between WordPress and Sanity. The key is to make sure you capture the results from OpenAI and then update the correct records with the results.

// Save the updated tags back to the Sanity document
// (updatedTags is assumed to hold the final tag list computed for this post)
if (updatedTags.length > 0) {
     console.log(`Saving updated tags for "${post.title}" to Sanity...`);
     try {
          await sanityClient.patch(post._id) // Document ID
               .set({ tags: updatedTags }) // Update the tags field
               .commit(); // Commit the changes
          console.log(`Successfully updated tags for "${post.title}".`);
     } catch (error) {
          console.error(`Error updating tags for "${post.title}":`, error);
     }
}
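The snippet above relies on an `updatedTags` array that isn't shown being built. One way it might be produced is by merging the tags already on the post with the suggestions from OpenAI, without duplicates. The `mergeTags` helper below is a hypothetical sketch, and it assumes the `tags` field is a simple array of strings (it would need adjusting if your tags are stored as references).

```typescript
// Hypothetical helper: combine existing tags with suggested ones,
// keeping the original order and dropping duplicates.
function mergeTags(existing: string[] | null | undefined, suggested: string[]): string[] {
    const merged: string[] = [];
    for (const tag of (existing ?? []).concat(suggested)) {
        if (merged.indexOf(tag) === -1) {
            merged.push(tag);
        }
    }
    return merged;
}

// Intended usage inside the loop over posts:
//   const updatedTags = mergeTags(post.tags, suggestedTags);
```

Merging rather than overwriting means any tags that were curated by hand survive the automated pass.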

This was a very simple scenario, but I found it was very easy to get OpenAI into an application to do some tag evaluation! I do wonder whether batching, or some other way of reducing the number of requests, would help. For a large set of data, this one-request-per-post approach is going to generate a lot of requests to the OpenAI endpoint, which ultimately means costs going up.
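To put a rough number on that, the figures from my first test run (16 requests for 8 cents) can be projected onto a larger data set. This is only a back-of-the-envelope sketch: the `estimateCostDollars` helper is hypothetical, and real costs will vary with the model's pricing and how long each post is.

```typescript
// Observed in the first test run: 16 requests cost about 8 cents.
const observedRequests = 16;
const observedCostCents = 8;
const centsPerRequest = observedCostCents / observedRequests; // 0.5 cents

// Project the cost in dollars of tagging `postCount` posts, one request each.
// Assumes every post costs roughly the same as the ones in the test run.
function estimateCostDollars(postCount: number, perRequestCents: number = centsPerRequest): number {
    return (postCount * perRequestCents) / 100;
}

// At the observed rate, 10,000 posts works out to about $50:
//   estimateCostDollars(10000)
```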

🤖 AI Disclosure

AI-Generated Code: Most of the code samples above were generated by AI using Cursor IDE and a variety of models. While they were reviewed, edited, and tested by myself, I did not artisanally create this logic. They are part of working code that is in my GitHub repo.
