Reduce noisy wikidata injected rows to recentchanges
Open, Needs Triage · Public

Description

Many wikis have many times more Wikidata-injected rows in the rc table than edit rows. For example, arwiki has 2.4 times more Wikidata rows in the rc table than edits (many other large wikis are not much better: ukwiki is at 3.2 times, nowiki at 7 times, etc.). The list of the worst offenders can be found in P78713. I did a quick investigation, and many of these rows shouldn't be injected in the first place:

Regarding the rows coming from Wikibase, I looked at arwiki, which has around 760K rows from Wikidata (2.3 times the actual number of edits on the wiki). I found a couple of issues:

  • O aspect and aliases. Somehow many pages are subscribing to the O aspect, and as a result they are getting all alias changes for that item, even in languages that have nothing to do with them. That alone is responsible for ~20% of Wikidata-injected rows.
  • Statement modifier collapse is a problem. C.P1, ..., collapse to C in general after a certain number of statements (I think 20, not sure). This is causing a lot of random edits on statements being injected into rc. We probably should bump the threshold a bit.
  • Changes to qualifiers and references. They also trigger a lot of injections while not changing anything on the client page (and I don't think client wikis actually load the references; at least it's quite rare, AFAIK). That alone is responsible for another ~20% of injected rows.
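The "statement modifier collapse" mentioned above can be sketched roughly as follows. This is a hypothetical Python illustration, not Wikibase's actual PHP implementation; the function name, the `entity#aspect` string format, and the default limit of 33 (the production value quoted later in this thread) are assumptions for the sake of the example.

```python
def collapse_usages(usages, limit=33):
    """Collapse fine-grained statement usages (e.g. 'C.P17', 'C.P31')
    into the general 'C' aspect once more than `limit` distinct
    properties are tracked for a single entity."""
    by_entity = {}
    for usage in usages:
        entity, _, aspect = usage.partition("#")
        by_entity.setdefault(entity, set()).add(aspect)
    result = []
    for entity, aspects in by_entity.items():
        modifiers = {a for a in aspects if a.startswith("C.")}
        if len(modifiers) > limit:
            # Too many tracked properties: fall back to "any statement",
            # so *every* statement edit on this entity reaches the client.
            aspects = (aspects - modifiers) | {"C"}
        result.extend(f"{entity}#{a}" for a in sorted(aspects))
    return sorted(result)
```

The point of the sketch is the trade-off discussed below: a lower limit keeps the usage-tracking table small, but a collapsed `C` aspect means unrelated statement edits get injected into recentchanges.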

This is contributing drastically to the large set of slow queries on recentchanges tables (plus queries being killed for being too slow, meaning user-facing impact), and it also makes recent changes so noisy that people don't check them anymore.

Related Objects

Event Timeline

Just to make the severity of the situation a bit clearer: if it continues like this, I'll have to disable Wikidata injection on several large wikis.

Statement modifier collapse is a problem. C.P1, ..., collapse to C in general after a certain number of statements (I think 20, not sure). This is causing a lot of random edits on statements being injected into rc. We probably should bump the threshold a bit.

The limit ($wgWBClientSettings['entityUsageModifierLimits']['C']) is currently 33 properties on all wikis, after we removed the last wiki-specific lower value (10 on cebwiki) three years ago. This was extensively discussed, including with the DBAs, in T188730, and we intentionally kept the limit low enough so that the wbc_entity_usage table would not blow up. If you’ve now come to the conclusion that we should rather grow the usage tracking table somewhat in order to reduce the number of recentchanges rows, we¹ can probably make that change, but I don’t like this tone as if it should obviously have been done years ago.

¹ (And by “we”, I mean “probably the Wikidata Integration in Wikimedia projects team”, but I wanted to leave my two cents here anyways.)


The statement modifier collapse is a small part of this problem. It's something worth considering and revisiting after several years (33 was set in November 2018), but my tone of "it should obviously have been done years ago" is about other problems, such as the O aspect, not this one.


I'm even saying "We probably should bump the threshold a bit."

Hi Amir, thanks for bringing this to our attention. @seanleong-WMDE mentioned you pinged him about this too. Thank you. I'll reach out to discuss how best to proceed. I understand the problem is critical; however, disabling Wikidata changes in large wikis will undo all the work our team has done this year. That would be quite catastrophic. Anyway, reaching out right away so that we can tackle this. Thanks again for the alert.

An idea that was brought up at Wikimania (by Johnathan) is this: do the reparse, and if the result is a noop, avoid injecting the row. It has a lot of complexities (what if the page gets reparsed in the meantime? Not all changes are shown in the HTML, etc.), but if tackled (I don't think they are too hard to take care of), it'd bring the number of injected rows down to a pretty low number.
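The idea could be sketched like this (a minimal Python sketch; the `Page` class and the `render` callback are hypothetical stand-ins for MediaWiki's parser cache and parser, not real APIs):

```python
class Page:
    """Hypothetical client page with its cached rendered HTML."""
    def __init__(self, cached_html):
        self.cached_html = cached_html
        self.rc_rows = []

def handle_wikidata_change(page, change, render):
    """Reparse first, inject an rc row only if the output changed.
    `render` stands in for the parser; `change` is the incoming
    Wikidata change description."""
    new_html = render(page)  # reparse with the new entity data
    if new_html == page.cached_html:
        # Invisible on this client page (e.g. an unused qualifier
        # or an alias in another language): skip the rc row.
        return False
    page.cached_html = new_html
    page.rc_rows.append(change)  # visible change: record it
    return True
```

As the rest of the thread discusses, the hard part is not this comparison itself but the race conditions around when the reparse happens.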

What @Ladsgroup is saying I believe is really worth looking into as I believe it would solve all the issues where we currently inject changes that don't really have an effect on the article.

An idea that was brought up in Wikimania (by Johnathan) is this: Do the reparse and if the result is noop, then avoid injecting the row. It has a lot of complexities (what if it gets reparsed in the mean time? Not all changes are shown in HTML, etc.) but if tackled (I don't think they are too hard to take care of), it'd bring down the number of injected rows to a pretty low number.

That sounds like a good idea to me (I was actually thinking in a similar direction when we talked about this the other day, but apparently it was too loud around us and @Ladsgroup couldn’t hear me properly ^^). We’d still want to keep the fine-grained usage tracking we currently have, to reduce the number of reparses we have to do in the first place, but if we only inject recentchanges rows when the reparse changed the HTML, that should capture the essence of “changes editors on the client wiki care about”. (For instance, an infobox might be configured to show data from Wikidata only if the statement has at least one reference that’s not “imported from Wikimedia project”; in that case, adding or removing such a reference shouldn’t have an effect on the HTML and would be invisible, whereas adding or removing a “proper” reference would change the HTML and show up in watchlists.) There will be some “false positives” due to pages that change for other reasons (randomness, {{REVISIONID}}, etc.), but those should be relatively rare (and no worse than the status quo).

I think the biggest complexity would be around race conditions. E.g. if a template gets edited and refreshlinks hasn't gotten to the page yet (sometimes it takes weeks), the reparse could change the HTML and cause a false positive. OTOH, if a template edit or a direct edit happens before the job gets there, the reparse could become a noop and skip the rc row insertion when it shouldn't. The most reliable way could be to set up some listeners during parsing and then, during the rc injection job, inject the row if the changed parts were accessed (regardless of whether the HTML changed or not), but setting up listeners could be a decent amount of work.
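The listener variant described above could look roughly like this (a hypothetical Python sketch; the class and function names are made up, and real entity parts would be aspects like labels or per-property statements):

```python
class AccessTracker:
    """Record which entity parts a page actually reads while being
    parsed, so rc injection can be decided from recorded accesses
    rather than from an HTML diff (sidestepping the race conditions
    described above)."""
    def __init__(self):
        self.accessed = set()

    def get(self, entity_data, part):
        self.accessed.add(part)  # "listener" fires on every read
        return entity_data.get(part)

def should_inject(tracker, changed_parts):
    # Inject the rc row iff the page read any part that changed,
    # regardless of whether the rendered HTML ended up different.
    return bool(tracker.accessed & set(changed_parts))
```

The design choice here is deliberate: even a visually no-op change is injected if the page accessed the changed part, which is more conservative but immune to the template-edit races.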

Hi Amir, could you please share approximately how much load you think this new approach could reduce from the database? Thank you


That would be the ultimate solution. I don't have a good measure to say how much, but it'd fix everything (then again, it's a decent chunk of work and is complex).

OK. We'll discuss with you and prioritize once we've made the first two changes.

ldelench_wmf subscribed.

Hi @Ifeatu_Nnaobi_WMDE ! Can you let us know if this work looks feasible to complete by the end of September? This will help us assess overall risk for T400696, which we're aiming to complete by then.


Hi Lauren @ldelench_wmf, please see our team's tickets related to this problem: https://phabricator.wikimedia.org/T401284. We've now worked on the first two tickets - Turn off wikidata qualifiers and references, and Implement a more granular alias tracking - and these changes should reduce the load on the database by around 30% once they have been reviewed and merged. There's a third ticket - Implement a new usage type for qualifiers and references - that will let editors see the important notifications they need again without sending load to the database; we will work on it in the next few weeks. All of this is to say that we are making good progress on the emergency interventions for the database.

@Ladsgroup, please confirm if the work on the first two tickets sufficiently tackles the issues mentioned in this ticket (for now). As discussed, we won't have the capacity to do more in the next two quarters.

Based on my measurements, it should definitely make a significant impact, bring us to a healthier position, and buy us time until a better solution is in place. @ldelench_wmf, to make it clear: this is not the only problem of the rc table, and alone it wouldn't be enough; other non-Wikidata-related work also needs to happen.

Unsolicited idea: would it be feasible to also provide Lua functionality for fetching just the Wikidata identifiers of an item? When I analyzed Wikidata usage on English Wikipedia a few years back (which isn't representative of how other wikis use it, but still captures some of the trends), my conclusion was that a large amount of usage was for fetching identifiers (taxonbar, authority control, etc.). While these templates were fetching the whole item, because that was the only reasonable approach (and so triggering lots of unrelated property updates in RecentChanges), they only wanted the identifiers part: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion

Given that it's just a few templates, presumably if the functionality existed to just fetch identifiers (e.g., mw.wikibase.getIdentifiers), folks could work with template developers on the major wikis to update their code pretty easily and to great impact.

Wikibase already tracks fine-grained statement usage (per-property), regardless of how the Lua code accesses the statements (we install special metatables in the data returned by mw.wikibase.getEntity() to track which statements are being used). What might be happening in those cases is that they exceed the entityUsageModifierLimits (in production, up to 33 different property IDs per entity are tracked, beyond that it gets collapsed into “uses any and all statements”).
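A rough Python analogue of the metatable trick described above (the real implementation uses Lua metatables inside Scribunto; this class and its field names are illustrative only):

```python
class TrackingStatements:
    """Wrap an entity's statements so that every property access is
    recorded as a fine-grained usage ('C.P17' etc.), analogous to the
    metatables Wikibase installs on mw.wikibase.getEntity() data."""
    def __init__(self, statements, usages):
        self._statements = statements
        self._usages = usages  # shared set collecting tracked usages

    def __getitem__(self, prop):
        self._usages.add(f"C.{prop}")  # record per-property usage
        return self._statements[prop]
```

Because tracking is recorded on access, a template that only reads identifier properties already produces only those fine-grained usages; the noise appears only once the number of distinct properties exceeds entityUsageModifierLimits and collapses to the general C aspect.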


Oh interesting @Lucas_Werkmeister_WMDE - I was not aware but very cool functionality! Yeah, so then the question I guess is whether this collapse happens a lot and how much RC noise would be saved by allowing it to collapse into "any and all identifiers" instead.

Also, looking into taxonbar, I don't think that was a good example by me actually as it is more specific about what it's requesting.

It will take a while for the impact to fully show (most importantly, one month must pass for the old rows to be purged), but you can already clearly see a reduction in the rc table rows added by Wikidata.

Compare what I did in June (P78713) with today (P83255). For example, frwiki has had a 32% reduction. Overall, if you sum up all rows injected on all wikis, the total for June is 17592444 and for today is 13512415 (a 23% reduction in absolute numbers). The ratio of wb rows to mw.edit rows has gone down from 69% to 50% (I'm excluding commons since it's not getting wb injection and would distort the ratios). I'll run the numbers again in a month.

How did I extract these numbers?

ladsgroup@stat1009:~$ curl https://noc.wikimedia.org/conf/dblists/wikidataclient.dblist | grep -v '#' | xargs -I{} bash -c "analytics-mysql {} -e \"select '{}', rc_source, count(*) from recentchanges where rc_source in ('mw.edit', 'wb') group by rc_source ;\"" | grep -v rc_source > rc_rows_sep

And these two ad-hoc scripts: P83257 and P83258
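The headline reduction quoted above can be re-checked from the two totals alone (a quick sanity check using the numbers in this comment, not a substitute for the P83257/P83258 scripts):

```python
# Totals of wb-injected rc rows across all client wikis, as quoted above.
june_total = 17592444
today_total = 13512415

reduction = 1 - today_total / june_total
print(f"{reduction:.1%}")  # prints 23.2%
```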