AI and the WikiVerse have a complicated and developing relationship. Here, I investigate possible uses of AI to assist with imports of unstructured data from Mix’n’Match into Wikidata.
Approach
At the time of writing, Mix’n’Match contains ~162M entries. Many of them have a more-or-less helpful short description, usually taken from the respective third-party source. I already use some old-fashioned but reliable parsing to extract information like birth and death dates from descriptions such as this: Born/died: 1869, Arenys de Mar (Espagne) - 1939, Paris (France).
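For illustration, a minimal sketch of this kind of rule-based extraction (the regex, function name, and output shape are mine, not the actual Mix’n’Match code):

import re

# Matches descriptions like
# "Born/died: 1869, Arenys de Mar (Espagne) - 1939, Paris (France)"
PATTERN = re.compile(r"Born/died:\s*(\d{4}),\s*(.+?)\s*-\s*(\d{4}),\s*(.+)")

def parse_born_died(description):
    m = PATTERN.match(description)
    if not m:
        return None
    birth_year, birth_place, death_year, death_place = m.groups()
    return {
        "P569": f"+{birth_year}-00-00T00:00:00Z/9",  # date of birth, year precision
        "P570": f"+{death_year}-00-00T00:00:00Z/9",  # date of death, year precision
        "birth_place_text": birth_place,  # still needs matching to a Wikidata item
        "death_place_text": death_place,
    }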
Years are fine, but more complex dates (including localized month names, and day formats like “January 1st, 1901”) already require more involved parsers. And we know which Paris is meant in the above description, but how can a parser be sure it’s not Paris (Texas)?
So I took the unmatched entries from catalog 6410 (MnM entry IDs, names, and descriptions only) and asked an AI to generate me some Wikidata statements (here I use Claude 3.5 Sonnet, but ChatGPT yields similar results):

The following is a list of entries with IDs, names, and descriptions. Please create appropriate Wikidata statements per entry. Try to infer nationality and gender where possible. Use Wikidata item IDs instead of text where possible.
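For illustration, a minimal sketch of how such a request might look via the Anthropic Python SDK (the model version string, batching, and function name are my assumptions, not my actual setup):

import anthropic  # pip install anthropic

PROMPT = """The following is a list of entries with IDs, names, and descriptions.
Please create appropriate Wikidata statements per entry.
Try to infer nationality and gender where possible.
Use Wikidata item IDs instead of text where possible."""

def generate_statements(entries):
    # entries: list of (mnm_id, name, description) tuples from the catalog
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    listing = "\n".join(f"{i}\t{name}\t{desc}" for i, name, desc in entries)
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model version string
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT + "\n\n" + listing}],
    )
    return message.content[0].text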
Results
This yields a list of statements, grouped by entry (input data here: Born/died: 1945, Newark (New Jersey, États-Unis)):
For Barbara Kruger (ID: 168457784):
CREATE
LAST|Len|"Barbara Kruger"
LAST|P31|Q5
LAST|P21|Q6581072
LAST|P27|Q30
LAST|P569|+1945-00-00T00:00:00Z/9
LAST|P19|Q18127
Analysis
- I find it insanely funny and interesting that Claude gives me QuickStatements format
- Claude will only parse about a dozen or so entries before stopping, and I need to manually say “continue”
- P31 (instance of: human) and P21 (gender: female) were inferred by Claude, probably from the name
- P27 was correctly identified as United States, even though the description gives it in French (“États-Unis”)
- P569 (birth date) is in the correct Wikidata-internal format
- P19 (place of birth), however, is not Newark, New Jersey: Q18127 is the item for record label
Some modification of the prompt gives me the parsed values as both Wikidata items and plain text. I wrote a simple parser for that output and loaded the results into a database table.
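A minimal sketch of such a parser, assuming one “property: text value (QID)” pair per line of AI output (the exact output format varies with the prompt; the table schema and file name are illustrative):

import re
import sqlite3

# Assumed output format, one statement per line, e.g.
#   place of birth: Newark, New Jersey (Q18127)
LINE = re.compile(r"^(?P<prop>[^:]+):\s*(?P<text>.+?)(?:\s*\((?P<qid>Q\d+)\))?$")

def load_statements(entry_id, ai_output, db_path="mnm_ai.sqlite"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS ai_statements "
        "(entry_id TEXT, property TEXT, text_value TEXT, item_id TEXT)"
    )
    for line in ai_output.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip headers, blank lines, chatter
        con.execute(
            "INSERT INTO ai_statements VALUES (?,?,?,?)",
            (entry_id, m["prop"].strip(), m["text"].strip(), m["qid"]),
        )
    con.commit()
    con.close()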
More investigation shows that Claude parses the descriptions correctly and identifies the relevant properties (eg place of birth: Newark, New Jersey). But, with the exception of basic items such as Q5 (human), it often fails to identify the correct item, and silently hallucinates a different one.
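Since the property and text value are reliable but the item ID is not, such hallucinations could be flagged automatically, eg by fetching the item’s English label via the standard wbgetentities API and comparing it to the text value. A sketch (helper name is mine):

import requests

API = "https://www.wikidata.org/w/api.php"

def label_matches(qid, expected_text):
    # Fetch the English label of the item and compare it with the
    # plain-text value the AI reported alongside the item ID.
    r = requests.get(API, params={
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels",
        "languages": "en",
        "format": "json",
    })
    entity = r.json().get("entities", {}).get(qid, {})
    label = entity.get("labels", {}).get("en", {}).get("value", "")
    return label.lower() == expected_text.lower()

# Example: the label of Q18127 ("record label") does not match
# "Newark, New Jersey", so the AI-supplied item ID is suspect.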
Conclusion
At this point in time, AI appears to be quite useful for extracting information from Mix’n’Match-style descriptions and presenting it as Wikidata statement parts, namely a property and a standardized text value (eg country of citizenship: United States). However, finding the matching Wikidata item often eludes the AI, and yields faulty item IDs.
This can still be quite useful. Just telling apart humans from, eg, companies in the same MnM catalog would be a help. Because the output of the AI is standardized, I can easily find the correct item for citizenship myself. Likewise, having the birthplace as text helps, and the reformatted birthplace (“Newark, New Jersey”) is easier to find. Claude even annotates that it determined the gender from the name, which could easily be translated into a Wikidata qualifier.
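A sketch of how those standardized text values could be resolved to candidate items, using the standard wbsearchentities API (helper name is mine):

import requests

API = "https://www.wikidata.org/w/api.php"

def search_item(text, language="en"):
    # Look up candidate items for a standardized text value,
    # eg "Newark, New Jersey" or "United States".
    r = requests.get(API, params={
        "action": "wbsearchentities",
        "search": text,
        "language": language,
        "type": "item",
        "format": "json",
    })
    return [hit["id"] for hit in r.json().get("search", [])]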
In summary, AI can yield useful analysis of MnM entry descriptions. Given the cost and throughput limitations of commercial AI models, it would be helpful if the WMF were to set up a free AI model for Toolforge-internal use. Even if it were less powerful, getting a structured key-value list from an unstructured MnM entry would be immensely helpful, as it would reduce the need for bespoke code per catalog, and potentially yield more information per entry. A standardized “AI-generated” reference for such AI-based statements would also help.