
Inference, continued

A while ago I wrote about WD-infernal, an API to infer some information about a Wikidata item, which then needs to be checked by a user (somehow). The idea was to offer standardized inference to multiple tools and Wikidata user scripts.

I have now added two new functionalities:

1. referee, which follows external IDs of an item, and external links from its associated Wikipedia pages, to try and find potential references. The checks are quite subtle, for example trying multiple date formats to find a reference for a date, and inferring the language of an external web page to use the appropriate label of a statement (eg for gender:male, use the German label for “male” on a German page). I had written backend code for a little script many years ago, but now I am bringing this functionality to the async/multithread/Rust age.

2. cross_categories, a functionality that I had also written before, in a (now defunct) tool called CrossCats. For a given Wikidata item about a Wikipedia category, it will find all articles in that category on all associated wikis, aggregate and count the items for these articles, and return the ones that are in a given language Wikipedia, but not yet in that category tree.

As with the previous functionality, these are API endpoints returning JSON. They do not have a UI yet. I will eventually change the aforementioned referee script to use this endpoint. I might also work on a CrossCats interface, unless someone else beats me to it.
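
For illustration, calling one of these endpoints is just an HTTP GET returning JSON. A minimal sketch in Rust; the host, endpoint path, and parameter name below are hypothetical placeholders, not the actual routes:

use serde_json::Value;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical endpoint and parameter names, for illustration only;
    // check the WD-infernal documentation for the real routes.
    let url = "https://wd-infernal.example.org/referee?item=Q42";
    let suggestions: Value = reqwest::get(url).await?.json().await?;
    // Whatever comes back still needs to be checked by a human before editing.
    println!("{}", serde_json::to_string_pretty(&suggestions)?);
    Ok(())
}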

REST in Rust

The new Wikibase REST API brings standardized and simplified querying and editing of items, properties, statements etc. to Wikibase installations, first and foremost Wikidata. Last year, Wikimedia Sverige was entertaining the idea of a grant application to Wikimedia Deutschland. Part of the proposal was for me to write a Rust crate (i.e. library) for easier access to the new REST API. The project eventually didn’t go ahead, but I had started writing the code anyway, out of personal interest. To make it more challenging, I tried to apply industry-level coding standards to the project; most of my tools are not written with this in mind, as usually I am the only one working on them.
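
For context, this is roughly what a raw call to the REST API looks like; the crate is meant to wrap this kind of request. A minimal sketch (the version segment in the path may need adjusting to whatever is currently deployed):

use serde_json::Value;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Raw Wikibase REST API call on Wikidata; the "v1" segment may differ
    // depending on the deployed API version.
    let url = "https://www.wikidata.org/w/rest.php/wikibase/v1/entities/items/Q42/labels/en";
    let label: Value = reqwest::get(url).await?.json().await?;
    println!("{label}"); // the English label, e.g. "Douglas Adams"
    Ok(())
}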

grcov code coverage (“functions” does not work properly with the testing mode used)

Now, I have an (almost) feature-complete result:

What about those coding standards? Well:

  • 248 unit tests in the codebase, running in 0.2sec (on my machine), which also run on every GitHub push
  • many tests mock an HTTP server to allow local testing, including simulated editing (see the sketch after this list)
  • grcov reports >97% code coverage
  • tarpaulin reports ~90% code coverage (but doesn’t see some traits being used when they are)
  • cyclomatic complexity mean is 2.1 for the entire codebase, <7 everywhere (according to lizard)
  • function line length is <40 lines for the entire codebase
  • maintainability index median is 126.3 (original) or 73.9 (VS Code)
  • scripts and instructions for calculating the code metrics are part of the repo, as is an analysis.tab file
  • passes miri for all of my code (tests using some external code are deactivated for miri)
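
As an illustration of the mocked HTTP server approach mentioned above, a test along these lines can be written with the wiremock crate (a simplified sketch, not an actual test from the crate; the path is illustrative):

use wiremock::{Mock, MockServer, ResponseTemplate};
use wiremock::matchers::{method, path};

#[tokio::test]
async fn fetches_item_from_mock_server() {
    // Start a local mock server instead of talking to wikidata.org.
    let server = MockServer::start().await;
    Mock::given(method("GET"))
        .and(path("/w/rest.php/wikibase/v1/entities/items/Q42"))
        .respond_with(ResponseTemplate::new(200)
            .set_body_json(serde_json::json!({"id": "Q42", "labels": {"en": "Douglas Adams"}})))
        .mount(&server)
        .await;

    // Point the client at the mock server's base URL instead of the live site.
    let url = format!("{}/w/rest.php/wikibase/v1/entities/items/Q42", server.uri());
    let json: serde_json::Value = reqwest::get(url).await.unwrap().json().await.unwrap();
    assert_eq!(json["labels"]["en"], "Douglas Adams");
}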

I will certainly make it a point to use the crate in both new and existing projects. I hope some of you will join me, and please feed back comments and issues on the issue tracker. Pull requests welcome!

So many Wikidata items have a “described at URL” (P973) statement, where we do not have a property to use an ID, or the source does not use IDs. I was wondering if some URL domains have accumulated in larger numbers for P973, which would make them candidates for new properties. So I listed the (unique) URLs from all such statements, and counted the domain names from those. Results:

Domain name Occurrences
www.sciencedirect.com 499,443
link.springer.com 151,935
digicol.dpm.org.cn 127,400
www.tandfonline.com 102,998
journals.sagepub.com 75,685
clevelandart.org 67,668
onlinelibrary.wiley.com 59,064
ieeexplore.ieee.org 48,151
iopscience.iop.org 47,917
www.mdpi.com 44,406
www.cambridge.org 36,292
linkinghub.elsevier.com 34,884
academic.oup.com 34,047
www.ssrn.com 28,785
www.nber.org 28,776
www.jstage.jst.go.jp 24,556
link.aps.org 23,120
parismuseescollections.paris.fr 19,194
journals.lww.com 18,481
data.collectienederland.nl 17,959
www.kci.go.kr 17,668
www.dbpia.co.kr 17,616
pubs.acs.org 17,389
dati.beniculturali.it 17,034
journals.ashs.org 16,875
muse.jhu.edu 15,841
www.emerald.com 14,996
pubs.aip.org 13,749
www.vanbommelvandam.nl 13,653
www.degruyter.com 12,675
xlink.rsc.org 12,022
www.inderscience.com 11,460
www.digiporta.net 11,442
datos-geodesia.ign.es 11,246
www.wga.hu 11,173
www.victorianresearch.org 10,610
www.indianjournals.com 10,364
kokoelmat.fng.fi 10,150

These are the domain names with >10K occurrences. Full list here. I hope this list will lead to some new property proposals!
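
For the record, the domain counting step itself is straightforward. A minimal sketch, assuming the unique URLs are already in a text file, one per line (not the exact pipeline I used):

use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() {
    // Read one URL per line from stdin and count how often each host occurs.
    let mut counts: HashMap<String, u64> = HashMap::new();
    for line in io::stdin().lock().lines().map_while(Result::ok) {
        if let Ok(parsed) = url::Url::parse(line.trim()) {
            if let Some(host) = parsed.host_str() {
                *counts.entry(host.to_string()).or_insert(0) += 1;
            }
        }
    }
    // Print domains sorted by descending count.
    let mut sorted: Vec<_> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));
    for (host, n) in sorted {
        println!("{host}\t{n}");
    }
}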

Infer-nal

After I recently wrote a small on-Wiki tool that can suggest statements to add to a Wikidata item, I thought that something like this might be useful in other tools as well. So, using the same concept and technology (Rust/axum) from my Authority Control API, I wrote WD-Infernal, an API that takes data, such as location coordinates or people’s names, and tries to infer Wikidata statements from that.

The idea of statement inference is obviously not novel, but AFAIK it is only implemented in individual codebases such as bots. Implemented as an API, any tool, bot etc. can use the inferred statements. I hope that feedback will improve the existing functions, and add new ones to this API.

In the greater scheme of things, my hope is to create more APIs to offer functionality around Wikidata, and to inspire others to do the same. An ecosystem of APIs would both simplify and standardize recurring tasks of (semi-)automatic Wikidata editing.

Using AI to add to Wikidata

AI and the WikiVerse have a complicated and developing relationship. Here, I investigate possible uses of AI to assist with imports of unstructured data from Mix’n’Match into Wikidata.

Approach

At the time of writing, Mix’n’match contains ~162M entries. Many of them have a more-or-less helpful short description, usually taken from the respective third-party source. I do use some old-fashioned but reliable parsing to extract information like birth and death dates from descriptions such as this: Born/died: 1869, Arenys de Mar (Espagne) - 1939, Paris (France).
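
A minimal sketch of that kind of parsing, using a regex to pull out the two years (not the actual Mix’n’match code):

use regex::Regex;

/// Extracts (birth year, death year) from descriptions like
/// "Born/died: 1869, Arenys de Mar (Espagne) - 1939, Paris (France)".
fn parse_born_died(desc: &str) -> Option<(u32, u32)> {
    let re = Regex::new(r"Born/died:\s*(\d{4}).*?-\s*(\d{4})").ok()?;
    let caps = re.captures(desc)?;
    Some((caps[1].parse().ok()?, caps[2].parse().ok()?))
}

fn main() {
    let desc = "Born/died: 1869, Arenys de Mar (Espagne) - 1939, Paris (France)";
    assert_eq!(parse_born_died(desc), Some((1869, 1939)));
}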

Years are fine, but more complex dates (including localized names of months, and days like “January 1st, 1901”) already require more involved parsers. And we know which Paris is meant in the above description, but how can a parser be sure it’s not Paris (Texas)?

So I took the unmatched entries from catalog 6410 (MnM entry IDs, names, and descriptions only) and asked an AI to generate some Wikidata statements for me (here I use Claude Sonnet 3.5, but ChatGPT yields similar results): The following is a list of entries with IDs, names, and descriptions. Please create appropriate Wikidata statements per entry. Try to infer nationality and gender where possible. Use Wikidata item IDs instead of text where possible.

Results

This yields a list of statements, grouped by entry (input data here: Born/died: 1945, Newark (New Jersey, États-Unis)):

For Barbara Kruger (ID: 168457784):
CREATE
LAST|Len|"Barbara Kruger"
LAST|P31|Q5
LAST|P21|Q6581072
LAST|P27|Q30
LAST|P569|+1945-00-00T00:00:00Z/9
LAST|P19|Q18127

Analysis

  • I find it insanely funny and interesting that Claude gives me QuickStatements format
  • Claude will only parse about a dozen or so entries before stopping, and I need to manually say “continue”
  • P31 (human) and P21 (female) were inferred by Claude, probably from the name
  • P27 was correctly identified as United States, even though it’s in French
  • P569 (birth date) is in the correct Wikidata-internal format
  • P19 (place of birth), however, is not Newark, New Jersey, but record label

Some modification of the prompt gives me the parsed values in both Wikidata items and plain text. I wrote a simple parser for that output and loaded it into a database table.
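
For illustration, parsing QuickStatements-style lines like the ones shown above boils down to splitting on the pipe character; a minimal sketch (not the actual parser, whose input format looks a bit different):

/// Splits a QuickStatements-style line like "LAST|P569|+1945-00-00T00:00:00Z/9"
/// into (entity, property, value).
fn parse_qs_line(line: &str) -> Option<(&str, &str, &str)> {
    let mut parts = line.splitn(3, '|');
    Some((parts.next()?, parts.next()?, parts.next()?))
}

fn main() {
    let line = "LAST|P569|+1945-00-00T00:00:00Z/9";
    assert_eq!(
        parse_qs_line(line),
        Some(("LAST", "P569", "+1945-00-00T00:00:00Z/9"))
    );
}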

More investigation shows that Claude parses the descriptions correctly, and identifies the relevant properties (eg place of birth: Newark, New Jersey). But, with the exception of basic items such as Q5 (human), it often fails to identify the correct item, and silently hallucinates a different one.

Conclusion

At this point in time, AI appears to be quite useful for extracting information from Mix’n’Match-style descriptions and presenting it as Wikidata statement parts, namely a property and a standardized text value (eg country of citizenship: United States). However, finding the matching Wikidata item for this often eludes the AI, and yields faulty item IDs.

This can still be quite useful. Just telling apart humans and eg companies in the same MnM catalog would be a help. Because the output of the AI is standardized, I can easily find the correct item for citizenship myself. Finding the text for the birthplace helps too, and the reformatted birthplace (“Newark, New Jersey”) is easier to look up. Claude even annotates that it determined the gender from the name, which could easily be translated into a Wikidata qualifier.

In summary, AI can yield useful analysis of MnM entry descriptions. Given the cost and throughput limitations of commercial AI models, it would be helpful if the WMF would set up a free AI model for Toolforge-internal use. Even if it were less powerful, getting a structured key-value list from an unstructured MnM entry would be immensely helpful, as it would reduce the need for bespoke code per catalog, and potentially yield more information per entry. Also, a standardized “AI-generated” reference for such AI-based statements would help.

Mix’n’match stats

Just a fun little statistic for Mix’n’match. This is how many entries were matched in MnM, per year.

Note: This includes Wikidata imports (eg where a property already exists, matches from Wikidata are imported when the catalog is created).

2013 1,905
2014 86,562
2015 572,467
2016 1,667,843
2017 2,570,586
2018 5,166,435
2019 4,002,785
2020 7,203,921
2021 4,930,444
2022 5,430,432
2023 3,055,016
2024 (so far) 2,241,990
2024 (full year estimate) 3,409,693

That is a total of 36,930,386 matches so far.
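
The full-year estimate for 2024 appears to be a simple linear extrapolation of the year-to-date count; assuming roughly 240 days of the year had passed at the time of the snapshot: 2,241,990 × 365 / 240 ≈ 3,409,693.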

Artworks: At least, let’s use what we already have

Wikimedia Commons has a lot of artworks, but it is difficult to find and query them; you can get to them if you know exactly what you want, but otherwise they collect digital dust. Wikidata has many artworks that can be queried, but is missing many that are already on Commons.

If only there were some way to get a list of Commons artworks, check if they are on Wikidata, and otherwise create them there?

Luckily, there is a list of Commons artworks that have no (known) Wikidata item, currently counting >7.8 million files. However, some of these are actually used on Wikidata, just as P18 (image) values and not as “official” items.

This list is nice and all, but clearly overwhelming. Also, while there is a QuickStatements gadget on some files, the created items are quite sparse, and there is the issue of determining whether the artwork already exists on Wikidata.

So I did my thing, and created three Mix’n’match catalogs: one for artworks, one for artists (where they cannot be found on Wikidata automatically), and one for institutions (again, only if no Wikidata item could be found). If an artist (or institution) is later matched by hand, the preliminary mappings to the artist catalog will be replaced by the “proper” Wikidata creator properties.

The Commons artwork scraper uses many parts of the Commons description page, preparing a nice item for creation (this will still need some manual looking after!). This example, a painting by Toulouse-Lautrec, does not have a Wikidata item (at the time of writing this blog entry); neither does this one.
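
For illustration, much of that description-page information is also available in machine-readable form via the MediaWiki API’s extmetadata; a minimal sketch (not the scraper’s actual code, and the file name is only an example):

use serde_json::Value;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ask the Commons API for the extended metadata (artist, date, license, ...)
    // of a single file. The file name here is only an example.
    let url = "https://commons.wikimedia.org/w/api.php?action=query&format=json&prop=imageinfo&iiprop=extmetadata&titles=File:Example.jpg";
    let json: Value = reqwest::get(url).await?.json().await?;
    if let Some(pages) = json["query"]["pages"].as_object() {
        for page in pages.values() {
            let meta = &page["imageinfo"][0]["extmetadata"];
            println!("Artist: {}", meta["Artist"]["value"]);
            println!("Date:   {}", meta["DateTimeOriginal"]["value"]);
        }
    }
    Ok(())
}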

I have a background job going through all 7.8 million files (minus the ones already used as an image on Wikidata), but that will take a while.

Merge and diff

Originally, I wanted to blog about adding new properties (taxon data specifically: NCBI, GBIF, and iNaturalist) to my AC2WD tool (originally described here). If you have the user script installed on Wikidata, AC2WD will automatically show up on relevant taxon items.

But then I realized that the underlying tech might be useful to others, if exposed as an API. The tool checks at least one, and likely multiple, external IDs (eg GND, NCBI taxon) for useful information. Instead of trying to patch an existing item, I build new ones in memory, one for each external ID. Then, I merge all the new items into one. Finally, I merge that new item into the existing Wikidata item. Each merge gives me a “diff”, a JSON structure that can be sent to the wbeditentity action in the Wikidata API. For the first merges, of all the new items into one, I ignore the diffs (because none of these items exists yet, there is no point in keeping them) and keep the merged item instead. In the last step, I ignore the resulting item itself, but keep the diff, which can then be applied to Wikidata. This is what the user script does; it retrieves the diff from the AC2WD API and applies it on-wiki. So I am now exposing the merge/diff functionality in the API.

Why does this matter? Because many edits to Wikidata, especially automated ones, are additions: labels, statements, or references to statements. But if you want to add a new statement, you have to check whether such a statement already exists. If it does, you need to check the references you have: which ones are already in the statement, and which should be added? This is tedious and error-prone to do by hand. But now, you can just take your input data, create the item you want in memory, send it together with the Wikidata item in question, and apply the returned diff with wbeditentity. You can even use the same code to create a new item (with “new=item”).
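
For example, applying such a diff then boils down to a single wbeditentity call; a minimal sketch (OAuth and CSRF token handling omitted, not production code):

use reqwest::Client;
use serde_json::Value;

/// Applies a merge "diff" (the JSON returned by the merge step) to an existing
/// item via the wbeditentity module. OAuth and CSRF token handling are left
/// out; this is a sketch of the call shape only.
async fn apply_diff(client: &Client, item_id: &str, diff_json: &str, csrf_token: &str)
    -> Result<Value, reqwest::Error>
{
    let params = [
        ("action", "wbeditentity"),
        ("id", item_id),        // or ("new", "item") to create a new item instead
        ("data", diff_json),
        ("token", csrf_token),
        ("format", "json"),
    ];
    client
        .post("https://www.wikidata.org/w/api.php")
        .form(&params)
        .send()
        .await?
        .json()
        .await
}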

Statements are considered the same if they have the same property, value, and qualifiers. If they are the same, references will be added if they do not already exist (excluding eg “retrieved” (P813) timestamps). A label in the new item will become an alias in the merged one, unless it is already the label or an alias. All you have to do is generate an item that has the information you want to be in the Wikidata item, and the AC2WD merge API will do the rest. And if you write in Rust, you can use the merge functionality directly, without going through the API.
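
Conceptually, the statement merge works like this (a sketch with deliberately simplified types; this is not the actual AC2WD/Wikibase data model):

struct Statement {
    property: String,
    value: String,
    qualifiers: Vec<(String, String)>,
    references: Vec<String>,
}

impl Statement {
    // Two statements count as "the same" if property, value, and qualifiers
    // match; references are deliberately not part of the identity.
    fn same_claim(&self, other: &Statement) -> bool {
        self.property == other.property
            && self.value == other.value
            && self.qualifiers == other.qualifiers
    }
}

// Merge `new` statements into `existing`: add statements that are missing,
// and add missing references to statements that already exist.
// (Labels would be handled analogously: new labels become aliases unless
// already present.)
fn merge_statements(existing: &mut Vec<Statement>, new: Vec<Statement>) {
    for n in new {
        match existing.iter().position(|e| e.same_claim(&n)) {
            Some(idx) => {
                for r in n.references {
                    if !existing[idx].references.contains(&r) {
                        existing[idx].references.push(r);
                    }
                }
            }
            None => existing.push(n),
        }
    }
}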

I see this as a niche between hand-rolled code to edit Wikidata, and using QuickStatements to offload your edits. The merge function is a bit opinionated at the moment (no deletions of statements etc, no changing values), but I can add some desired functionality if you let me know.

A quick comparison

Visually comparing artworks (example)

Over the years, Mix’n’match has helped to connect many (millions?) of third-party entries to Wikidata. Some entries can be identified and matched in a fully automated fashion (eg people with birth and death dates), but the majority of entries require human oversight. For some entries that works nicely, but others are hard to disambiguate from text alone. Maps (example) for entries with coordinates can help, but pictures also speak a proverbial thousand words.

To aid the matching of entries, I wrote a new functionality called quick compare that shows a random, automatically matched entry from a catalog, as well as the matched Wikidata item. For both, it shows the usual basic information, but also (where available) an image, and a location on a map.

The external image will have to be extracted from the catalog first, which requires a manual “image pattern” to be created. This is reasonably easy to do in most cases; please let me know if you want this for any specific catalog.

Comparing castles by image and location (example)

The image itself is never copied, but inserted as an <img src="" /> element, hot-loading it from the external source. This is only ever done in the “quick compare” page. While hotloading from external pages is sometimes frowned upon, the low volume and specific context in which this is done here should qualify as fair use (images are displayed no larger than 300×250px). In the end, it saves the user the click on the containing web page, and it saves the site the loading of all associated HTML/JS/etc. files.

I hope this will ease the matching of certain types of entries, including (but not limited to) artworks and buildings.

Get into the Flow

Unix philosophy contains the notion that each program should perform a single function (and perform that function exceptionally well), and then be used together with other single-function programs to form a powerful “toolbox”, with tools connected via the geek-famous pipe (“|”).

A ToolFlow workflow, filtering and re-joining data rows

The Wikimedia ecosystem has lots of tools that perform a specific function, but there is little in terms of interconnection between them. Some tool programmers have added other tools as (optional) inputs (eg PetScan) or outputs (eg PagePile), but that is the extent of it. There is PAWS, a Wikimedia-centric Jupyter notebook, but it does require a certain amount of coding, which excludes many volunteers.

So I finally got around to implementing the “missing pipe system” for Wikimedia tools, which I call ToolFlow. I also started a help page explaining some concepts. The basic idea is that a user can create a new workflow (or fork an existing one). A workflow would usually start with one or more adapters that represent the output of some tool (eg PetScan, SPARQL query, Quarry) for a specific query. The adapter queries the tool and represents the tool output in a standardized internal format (JSONL) that is written into a file on the server. These files can then be filtered and combined to form sub- and supersets. Intermediate files are cleaned up automatically, but the output of steps (ie nodes) that have no further steps after them is kept, and can only be cleared by re-running the entire workflow. Output can also be exported via a generator; at the moment, the only generator is a Wikipage editor, which will create a wikitext table on a wiki page of your choice from an output file.
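
To give a flavor of what such a filter step does internally, here is a minimal JSONL filter sketch (the field name is made up for illustration and is not ToolFlow’s actual schema):

use std::io::{self, BufRead, Write};
use serde_json::Value;

// Toy filter step in the spirit of the JSONL intermediate files: read one JSON
// object per line from stdin, keep only rows whose "namespace" field is 0,
// and write the surviving rows to stdout unchanged.
fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut out = io::stdout().lock();
    for line in stdin.lock().lines() {
        let line = line?;
        if let Ok(row) = serde_json::from_str::<Value>(&line) {
            if row["namespace"].as_i64() == Some(0) {
                writeln!(out, "{line}")?;
            }
        }
    }
    Ok(())
}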

Only the owner (=creator) of a workflow can edit or run it, but the owner can also set a scheduler (think UNIX cronjob) for the workflow to be run regularly every day, week, or month. ToolFlow remembers your OAuth details, so it can edit wiki pages regularly, updating a wikitext table with the current results of your workflow.

I have created some demo workflows:

Now, this is a very complex piece of software, spanning two code repositories (one for the HTML/JS/PHP front-end, and one for the Rust back-end). Many things can go wrong here, and many tools, filters etc can be added. Please use the issue trackers for problems and suggestions. I would especially like some more suggestions for tools to use as input. And despite my best efforts, the interface is still somewhat complicated to understand, so please feel free to improve the help page.