Hello, over the coming months, the following changes will be made to the
databases and the infrastructure. They might affect you if you rely on them
in your tools or queries. This list is ordered by how soon each change will
happen.
We understand that updating your tools and systems can be time-consuming,
which is why we are giving advance notice. I truly apologize for the
inconvenience, but many of these changes are needed to keep the site running
smoothly.
Image table redesign
Around fourteen years after the creation of T28741
<https://phabricator.wikimedia.org/T28741>, we are implementing the changes
described therein. Currently, the latest version of each image has a row in
the image table, and any older versions of that file have rows in the
oldimage table. These two tables (image and oldimage) will be dropped in
around two months. They will be replaced by two main tables: file and
filerevision. Every file will have a row in the file table describing its
name and type. Every version of the file (current and old) will have a row
in filerevision describing the file-specific information such as its size
or the hash of the file, similar to the existing distinction between pages
and revisions. Another improvement is that every file and file revision
will get a unique auto-increment ID, which simplifies many operations and
queries. You can check T28741
<https://phabricator.wikimedia.org/T28741> for more information. The new
tables are already accessible in wikireplicas but the data hasn’t been
fully migrated yet.
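
To illustrate the file/filerevision split described above, here is a minimal
sketch using an in-memory SQLite database as a stand-in for the replicas. The
column names (file_id, file_name, file_type, fr_id, fr_file, fr_size, fr_sha1,
fr_timestamp) are assumptions made for illustration; please check T28741 for
the actual schema before adapting your queries.

```python
import sqlite3

# Toy stand-ins for the new tables. All column names here are assumptions
# for illustration only; see T28741 for the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (
    file_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    file_name TEXT NOT NULL,
    file_type TEXT NOT NULL
);
CREATE TABLE filerevision (
    fr_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    fr_file      INTEGER NOT NULL REFERENCES file(file_id),
    fr_size      INTEGER NOT NULL,
    fr_sha1      TEXT NOT NULL,
    fr_timestamp TEXT NOT NULL
);
""")

# One file with two versions: an old one and the current one.
conn.execute(
    "INSERT INTO file (file_name, file_type) VALUES ('Example.jpg', 'image')"
)
conn.executemany(
    "INSERT INTO filerevision (fr_file, fr_size, fr_sha1, fr_timestamp) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1000, "aaa", "20200101000000"), (1, 2000, "bbb", "20240101000000")],
)


def file_history(conn, name):
    """Return (fr_id, size, sha1, timestamp) for every version, newest first."""
    return conn.execute(
        """
        SELECT fr.fr_id, fr.fr_size, fr.fr_sha1, fr.fr_timestamp
        FROM file f
        JOIN filerevision fr ON fr.fr_file = f.file_id
        WHERE f.file_name = ?
        ORDER BY fr.fr_timestamp DESC
        """,
        (name,),
    ).fetchall()


history = file_history(conn, "Example.jpg")
```

With this layout, current and old versions come from a single table, so a
file's full history needs one join instead of separate queries against image
and oldimage.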
Term store split out of wikidata’s database
Wikidata’s database has been growing too fast, and we need to move the term
store (tables starting with wbt_) to a dedicated cluster to allow growth
and improve Wikidata’s performance by utilizing cache locality. The new
section will be called x3 and you will be able to access it in wikireplicas.
However, you won’t be able to join these tables with the rest of Wikidata’s
database (such as the page table), since the two will reside on physically
separate servers. The upside is that most of your queries to Wikidata’s
database (and the term store) will become faster. We are aiming for the
switch to happen in three months’ time. You can follow the work in
T351820 <https://phabricator.wikimedia.org/T351820>.
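
Since cross-section joins will no longer be possible, tools that currently
join term store tables with other Wikidata tables will need to do the join in
application code. The sketch below shows one way to do that, with two
in-memory SQLite databases standing in for the two sections. The
wbt_demo_terms table and its columns are hypothetical simplifications (the
real wbt_ tables are more normalized), and the function name is made up for
this example.

```python
import sqlite3

# Two in-memory databases stand in for the two physically separate sections.
main_db = sqlite3.connect(":memory:")   # rest of Wikidata's database
terms_db = sqlite3.connect(":memory:")  # term store (x3)

main_db.execute("CREATE TABLE page (page_id INTEGER, page_title TEXT)")
main_db.executemany(
    "INSERT INTO page VALUES (?, ?)", [(10, "Q42"), (11, "Q64")]
)

# Hypothetical, simplified term table used only for this illustration.
terms_db.execute("CREATE TABLE wbt_demo_terms (item_id INTEGER, label TEXT)")
terms_db.executemany(
    "INSERT INTO wbt_demo_terms VALUES (?, ?)",
    [(42, "Douglas Adams"), (64, "Berlin"), (1, "Universe")],
)


def labels_for_pages(main_db, terms_db, page_ids):
    """Application-side join: query each section separately, then combine."""
    if not page_ids:
        return {}
    # Step 1: resolve page IDs to item titles on the main section.
    marks = ",".join("?" * len(page_ids))
    rows = main_db.execute(
        f"SELECT page_title FROM page WHERE page_id IN ({marks})",
        list(page_ids),
    ).fetchall()
    item_ids = [int(title[1:]) for (title,) in rows]  # "Q42" -> 42
    if not item_ids:
        return {}
    # Step 2: fetch the matching terms from the term store section.
    marks = ",".join("?" * len(item_ids))
    term_rows = terms_db.execute(
        f"SELECT item_id, label FROM wbt_demo_terms WHERE item_id IN ({marks})",
        item_ids,
    ).fetchall()
    return {f"Q{item_id}": label for item_id, label in term_rows}


result = labels_for_pages(main_db, terms_db, [10, 11])
```

For large ID lists you would probably want to send the second query in
batches rather than as one long IN clause.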
Additionally, the wbt_type table will be dropped and the mapping will be
hard-coded instead. See gerrit:1110810
<https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1110810>
for more details. This helped us simplify a lot of Wikibase code (example
<https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1110720>).
Categorylinks normalization
Categorylinks is the next table in the series of links tables being
normalized via the linktarget table (parent ticket
<https://phabricator.wikimedia.org/T300222>, RFC
<https://phabricator.wikimedia.org/T222224>). As with the templatelinks and
pagelinks tables, cl_to will be dropped and the new field cl_target_id will
point to lt_id in the linktarget table instead. We will also drop the
cl_collation field and replace it with cl_collation_id, which will point to
the collation_id field in a new table called collation. We are aiming to
finish this by the end of the next quarter (end of June 2025), but that
depends on how fast the migration script can run, which is outside of our
control. You can follow the work in
T299951 <https://phabricator.wikimedia.org/T299951>. It’s worth noting that
after this migration is done, we will start working on the imagelinks table.
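
For queries that currently filter on cl_to, the change means joining through
linktarget instead. Here is a rough sketch against an in-memory SQLite
database. The fields cl_target_id, lt_id, cl_collation_id and collation_id
come from the description above; lt_namespace and lt_title are the existing
linktarget columns, while collation_name and the sample data are assumptions
for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE linktarget (
    lt_id INTEGER PRIMARY KEY, lt_namespace INTEGER, lt_title TEXT
);
-- collation_name is an assumed column name for this sketch.
CREATE TABLE collation (
    collation_id INTEGER PRIMARY KEY, collation_name TEXT
);
CREATE TABLE categorylinks (
    cl_from INTEGER, cl_target_id INTEGER, cl_collation_id INTEGER
);
INSERT INTO linktarget VALUES (1, 14, 'Physics'), (2, 14, 'Chemistry');
INSERT INTO collation VALUES (1, 'uppercase');
INSERT INTO categorylinks VALUES (100, 1, 1), (101, 1, 1), (102, 2, 1);
""")

NS_CATEGORY = 14  # namespace number of the Category namespace


def pages_in_category(conn, category_title):
    """Return the page IDs in a category, resolved via linktarget."""
    rows = conn.execute(
        """
        SELECT cl.cl_from
        FROM categorylinks cl
        JOIN linktarget lt ON lt.lt_id = cl.cl_target_id
        WHERE lt.lt_namespace = ? AND lt.lt_title = ?
        ORDER BY cl.cl_from
        """,
        (NS_CATEGORY, category_title),
    ).fetchall()
    return [page_id for (page_id,) in rows]


members = pages_in_category(conn, "Physics")
```

The old equivalent would have been a plain WHERE cl_to = 'Physics' on
categorylinks; after the migration the title lives only in linktarget.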
Thank you
--
*Amir Sarabadani (he/him)*
Staff Database Architect
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello all!
The public WDQS Split Graph endpoints have been available for ~6 months, so
it is time to look at what has been happening and at the next steps.
We don’t see strong adoption of the new endpoints (~20 req/min for
query-scholarly [1]). However, we’ve identified almost 90% of the current
requests that would require migration to the split endpoints. The large
majority (~80%) are generated by a tool that is unfinished and has been
abandoned by its author. Those queries are already broken or have no value
and will never be migrated. Unsurprisingly, Scholia is a major user of the
scholarly subgraph and has not migrated yet.
While we want to move forward, we also want to limit disruption and give
more time to the projects that need it. To ease the transition, we’ve
created a new endpoint (query-legacy-full.wikidata.org) which contains the
full Wikidata graph, but is limited in terms of performance and
availability [2]. This new endpoint can be used in place of the current
query.wikidata.org by the few projects that need the additional migration
time. It will be available until December 2025.
The next big step is to drop support for the full Wikidata graph on
query.wikidata.org [3]. This should happen around April 10. After that
step, requests to query.wikidata.org that require the full graph will fail
or return invalid results unless they are rewritten to use SPARQL
federation [4]. You can ask for help rewriting your queries [5].
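
As a rough illustration of what such a rewrite looks like, the sketch below
builds a query that keeps one part of the pattern local and wraps the other
part in a standard SPARQL 1.1 SERVICE clause. The helper name federate and
the service IRI used in the example are assumptions made for this sketch;
please take the correct federation target from the guide in [4].

```python
def federate(local_pattern, remote_pattern, service_iri, select="*"):
    """Build a SPARQL query whose remote part runs via SERVICE federation.

    This helper is hypothetical and only assembles the query text; it does
    not send anything to an endpoint.
    """
    return (
        f"SELECT {select} WHERE {{\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{service_iri}> {{\n"
        f"    {remote_pattern}\n"
        f"  }}\n"
        f"}}"
    )


# Example: humans come from the main graph, their articles from the
# scholarly subgraph. The service IRI below is an assumption.
query = federate(
    local_pattern="?author wdt:P31 wd:Q5 .",
    remote_pattern="?article wdt:P50 ?author .",
    service_iri="https://query-scholarly.wikidata.org/sparql",
    select="?article ?author",
)
print(query)
```

Which part of the pattern stays local and which goes into the SERVICE block
depends on the endpoint you send the query to, so a real rewrite may need the
two patterns swapped.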
In related news, Peter [6] has been exploring the performance of various
alternative RDF backends [7]. This is going to be invaluable when we work
on replacing Blazegraph!
Have fun!
Guillaume
[1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&re…
[2] https://phabricator.wikimedia.org/T384422
[3] https://phabricator.wikimedia.org/T388134
[4] https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_spli…
[5] https://www.wikidata.org/wiki/Wikidata:Request_a_query
[6] https://www.wikidata.org/wiki/User:Peter_F._Patel-Schneider
[7] https://www.wikidata.org/wiki/Wikidata:Scaling_Wikidata/Benchmarking
--
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>