User Details
- User Since
- Oct 1 2018, 2:19 PM
- Availability
- Available
- IRC Nick
- isaacj
- LDAP User
- Isaac Johnson
- MediaWiki User
- Isaac (WMF)
Mon, Dec 1
Mentors have a link on their Mentor Dashboard that will navigate them to a filtered view of RecentChanges, limited to their mentees:
Mentors have two different mentee filters available in Recent Changes:
Oh I love this @KStoller-WMF -- thank you for the correction!
Wed, Nov 26
Updates:
- I updated the code for extracting passages to also grab past questions asked via Growth's mentorship module (i.e. sections on user talk pages that match the format Question from... (<date>)) and questions asked via help-me templates on user talk pages (presence of a help-me-* template); a rough extraction sketch follows this list. I'm working on generating their embeddings so they can be added to the question-bank corpus in the prototype.
- Moyan shared her code for her previous experiment with providing feedback to new editors via AI: https://github.com/phoebexxxx/newcomer-llms-user-study/tree/main
- The core functionality is a nearest-neighbor index over several core content policies for RAG purposes, combined with an instruction to the agent (gpt-4o-mini) to rephrase the participant's question for better retrieval. I have a working nearest-neighbor index, but I think that "please rephrase this question for..." step is a key piece of functionality to explore as we prototype workflows with an LLM.
- I spoke with @Trizek-WMF about his experiences/thoughts around mentorship. My summary below:
- Answers are often quite slow with 1:1 mentorship (I've been seeing this too in the data).
- Lots and lots of repeat questions (I've been seeing this too in the data).
- A number of editors think their mentor is a bot or AI. On one hand, that makes me think having a bot respond to newcomer questions (one idea we have) could exacerbate this; on the other, it's a reminder to emphasize that they also have a human mentor who can provide more context/support/etc. It also might be an opportunity to more clearly set expectations for the newcomer.
- Sometimes newcomers seem to think their mentors are responsible when things don't go well for them. That's hard to do much about, but it makes me wonder whether there aren't ways to help mentors better track their mentees so they can step in earlier (if needed) -- e.g., alerts when a mentee is reverted, or a form of RecentChanges that is automatically filtered to their mentees. I don't think the latter exists, but it should be possible to build: rc_actor is a field in RecentChanges, so the hard part is deploying a table that has the actor IDs for a mentor's mentees.
- Because it takes a while for mentors to respond or mentees to return for the answer, pages have often been archived. While DiscussionTools should fix this issue, in reality the "This topic could not be found on this page, but it does exist on the following page:..." message might be missed (or perhaps just confusing for a newcomer?).
- Different wikis definitely have different systems/norms around mentorship. French Wikipedia for instance doesn't really use help-me templates but does have a Teahouse equivalent (Forum des nouveaux).
- Mentorship is not recognized within spaces like Admin bids (in the same way that e.g., experience patrolling is). That's partly cultural but might also be a function of how hard it is to summarize one's impact via mentorship. This is an opportunity for making available more statistics about positive outcomes from mentorship.
- Some potential issues with answers: many experienced editors use wikitext but newcomers are on VE; rules evolve and so old answers may not always be right; rules evolve and so documentation may be behind; many "rules" aren't written down in a formal way.
- He thought a bot that can help handle the repetitive questions very quickly would be welcomed by many folks as it would relieve pressure on quick responses and handle the less interesting inquiries.
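For reference, here's roughly the kind of extraction described in the first bullet above. It's a hedged sketch, not the pipeline code: it assumes the section headings literally follow the "Question from ... (<date>)" convention used by the Growth mentorship module and that help-me requests show up as transcluded templates whose names start with "help me"; the regex and function names are illustrative.

```python
# Illustrative sketch (not the production pipeline): pull candidate newcomer
# questions out of one user talk page's wikitext.
import re
import mwparserfromhell

QUESTION_HEADING = re.compile(r"^Question from .+\(.+\)\s*$", re.IGNORECASE)

def extract_newcomer_questions(wikitext):
    """Return (mentorship_sections, has_help_me) for one user talk page."""
    code = mwparserfromhell.parse(wikitext)

    # Level-2 sections whose heading matches "Question from <user> (<date>)"
    mentorship_sections = []
    for section in code.get_sections(levels=[2], include_headings=True):
        headings = section.filter_headings()
        if headings and QUESTION_HEADING.match(headings[0].title.strip_code().strip()):
            mentorship_sections.append(section.strip_code().strip())

    # Any transcluded template whose name starts with "help me" (help-me-* family)
    has_help_me = any(
        str(t.name).strip().lower().startswith("help me")
        for t in code.filter_templates()
    )
    return mentorship_sections, has_help_me
```

The extracted sections would then be embedded and added to the question-bank corpus, using the same setup as the search prototype described in the Nov 20 update below.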
Fri, Nov 21
This is very exciting @JMonton-WMF ! I think @fkaelin is the best person to answer both of your questions.
Thu, Nov 20
Updates:
- I built a prototype for natural-language search overtop the Help/Policy namespaces on English Wikipedia. There's a backend API that I can use for testing against an eventual dataset of queries (if we have explicit "correct" answers) and a UI for exploration. The API/UI show the natural-language search results alongside what is returned by our existing keyword search as guided by these entrypoints curated by editors.
- The nearest-neighbor searches are brute-forced (as opposed to using an approximate index), so each query takes a second or two. I'm using the Qwen3-Embedding-0.6B model for embeddings; anecdotally it showed a strong improvement over the much smaller standard sentence-transformers models (a minimal sketch of the brute-force setup is below). I suspect adding a reranking model would help even more, but that would require storing the text too (not just the embeddings) and slow things down a good bit further.
- This is actually the third iteration -- the first one covered all of the Help/Wikipedia namespaces but was way too messy with all the admin noticeboards etc. The second used only top-level pages (no subpages) to remove all that discussion, but that was too coarse: I lost some important Q&A archives, and even though the results were higher quality, they mixed together very different contexts -- e.g., policies, help documentation, Q&A. So in this current iteration, I have explicitly separated out the different sources so that in theory they could be separately contextualized for an end-user -- e.g., here are similar questions, here is relevant policy, here's some how-to, etc.
Tagging you @Trokhymovych as I think you mentioned having (similar?) issues with a Qwen re-ranking model as well that seemed to relate to the torch version?
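For anyone curious what the brute-force setup looks like, here's a minimal sketch. It assumes the Qwen3-Embedding-0.6B model is loaded via sentence-transformers with normalized embeddings (so cosine similarity is just a dot product); the placeholder corpus and variable names are mine, not the prototype's actual code.

```python
# Minimal brute-force nearest-neighbor sketch over pre-embedded Help/Policy passages.
# Normalized embeddings mean cosine similarity reduces to a dot product.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Placeholder corpus; in the prototype this is passages from the Help/Policy namespaces.
passages = [
    "Wikipedia:Verifiability -- articles must be based on reliable, published sources.",
    "Help:Referencing for beginners -- how to add citations with the VisualEditor.",
]
passage_embs = model.encode(passages, normalize_embeddings=True)

def search(query, top_k=5):
    # Qwen3-Embedding ships a query prompt in its sentence-transformers config;
    # drop prompt_name if the model you use doesn't define one.
    query_emb = model.encode([query], prompt_name="query", normalize_embeddings=True)
    scores = passage_embs @ query_emb[0]          # brute-force cosine similarity
    top = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), passages[i]) for i in top]

print(search("how do I add a source to an article?"))
```

A reranker would slot in after this top-k step, which is why it would need the passage text stored and not just the embeddings.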
Tue, Nov 18
Fri, Nov 14
Just dropping a few quick thoughts in case they're helpful if work gets picked up in this space -- they've been sitting in my head for a bit and I'm happy to finally have somewhere to put them:
- Relevant lit:
- How Grounded is Wikipedia? A Study on Structured Evidential Support and Retrieval -- this is the closest analog to this task in that part of their work was checking Wikipedia claims against their sources. They report that 30% of claims failed verification (for biographies) using GPT-4o-mini as the model for claim extraction + verification. Useful deep-dive into how to find relevant evidence for a claim (they find re-ranking via LLMs to be important as a final step after a basic retrieval stage).
- Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models -- fact-checking Wikipedia with itself; so again not exactly the same, but some very solid data pipelines to consider from these folks regarding finding statements relevant to a claim.
- Improving Wikipedia verifiability with AI -- this was about the broader challenge of finding "better" citations for a claim, but contains some useful ideas etc.
- WikiCheck: An end-to-end open source Automatic Fact-Checking API based on Wikipedia -- this was about fact-checking external claims with Wikipedia, so the opposite question to what's posed here, but with some shared methods.
- A German newspaper also did a deep analysis of a sample of articles on dewiki. You can see annotations from a bunch of dewiki editors after they went through them (table), as well as Signpost coverage. They found 20% of pages with outdated info and ~15% with incorrect info (not just outdated).
- Considerations:
- This is a classification task, and fine-tuned smaller models still tend to do comparably to or better than LLMs if there is good data available for fine-tuning. There are the {{failed_verification}} templates as mentioned above that could be used for this. But because this is a relatively generic task (not particularly wiki-specific) and Wikipedia is so central to the fact-checking sphere (it often appears in their datasets), my understanding is that more generic fine-tuned models should be pretty appropriate for our context even without training on our own very Wikipedia-specific examples. So we wouldn't necessarily need a huge dataset of those.
- I've played around a bit in this space and my experience was that the data pre-processing is just as important as, if not more important than, choosing the right model. For example:
- There's the question of extracting the claim with its appropriate context from Wikipedia. This is a lot easier if it's an Edit Check, so in-context in VisualEditor, where we can capture specific instances of e.g., a new sentence added + a citation. But if you're applying a model to existing content (Suggested Edit), it's a bit trickier to capture the specific claim with enough context to understand it but not so much that you're really fact-checking multiple claims. Folks vary between using basic heuristics -- e.g., grabbing just the sentence, the whole paragraph, etc. -- and using LLMs to extract the specific claim and adapt to varying levels of context. The latter is probably more effective -- see Dense X Retrieval: What Retrieval Granularity Should We Use?.
- There's the question of fetching the text of the source being cited -- with AI etc. destroying the internet, we're seeing a lot more paywalled content, and we'd have to make sure that we don't return a ton of false positives just because the external website blocked the request to some degree (a rough fetch-and-sanity-check sketch follows this list). Relying on Internet Archive links can potentially help with this, but I've heard that websites have started to block them as well (e.g., Reddit block).
- There's the question of cleaning the source HTML and extracting just the relevant text, not all the boilerplate, menus, etc. This generally isn't a big deal in this context because you just need to find one statement that supports (or contradicts) the claim, so some noisy text is tolerable, but it can slow things down if you're also processing a bunch of it with LLMs. Models with longer context windows also reduce the importance of this.
- There's the question of ranking the potential evidence for what's most relevant so that not all of it has to be checked. Mostly I see folks recommend a basic similarity-based ranking followed by more complex re-ranking of the top few candidates with an LLM.
- Suggested first steps:
- If you all want to pick this up, I'd start with building a small-ish dataset of positive and negative examples (even just starting with 20 of each would probably be okay though 50 of each would be better).
- If it's for Edit Check, I'd grab a random sample of recent content adds and manually check them. If you're having trouble finding failed-verification examples, I'd narrow down to those that were reverted on the assumption that there'd be more failed-verifications in those. For each negative example, I'd grab a positive example from the same article.
- If it's for a Suggested Edit, I'd grab some of the {{failed_verification}} sentences and a few claims with citations in those articles that don't have that template applied (so you have a semi-balanced dataset of positive and negative claims).
- Once you have the set of claims + citations, I'd then scrape the sources of all of those and see how effective that is. That should already give you a good sense qualitatively of the scale of the challenge. And then run that small dataset through a few LLMs or existing fine-tuned language models to see how they do. That should hopefully be reasonably quick and give a decent idea of what level of accuracy you can expect with a basic setup.
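Since the source-fetching step above is where I'd expect the most silent failures, here's a hedged sketch of what a basic fetch-and-sanity-check could look like. The thresholds and the notion of a "suspicious" response are illustrative guesses, not validated values; a real pipeline would want retries, archive fallbacks, and proper text extraction.

```python
# Illustrative sketch: fetch a cited URL and flag responses that are likely
# blocks/paywalls so they aren't counted as "failed verification" downstream.
import requests

SUSPICIOUS_STATUS = {401, 402, 403, 429}   # auth/paywall/rate-limit style responses
MIN_TEXT_BYTES = 2000                      # arbitrary guess: very short pages are suspect

def fetch_source(url, timeout=15):
    """Return (status, text_or_None). status is 'ok', 'blocked', or 'error'."""
    headers = {"User-Agent": "claim-verification-prototype (research; contact on-wiki)"}
    try:
        resp = requests.get(url, headers=headers, timeout=timeout)
    except requests.RequestException:
        return "error", None
    if resp.status_code in SUSPICIOUS_STATUS:
        return "blocked", None
    if resp.status_code != 200 or len(resp.text) < MIN_TEXT_BYTES:
        # Tiny or non-200 responses are more likely a block/interstitial than real content.
        return "blocked", None
    return "ok", resp.text

status, html = fetch_source("https://example.com/some-cited-article")
print(status)
```

Only the "ok" responses would move on to text extraction and evidence ranking; "blocked"/"error" ones should probably be set aside rather than scored.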
Updates:
- I started to look into the {{Help me}} template (notebook + ping @MGerlach as the person who flagged this pathway to me). The code is hacky because we don't have a nice content diff dataset for talk pages, so I had to find Help me sections post-hoc and then try to guess who added the request etc. (a simplified API-based sketch follows this list). Still, there were at least 1700 instances on English Wikipedia of editors whose account was <= 10 days old using the template, so this could be a good dataset to mine for more newcomer questions. These almost exclusively happen on the newcomer's user page (usage on article talk pages is much more likely to come from more experienced editors).
- I met with Moyan and Tiziano (external researchers) to discuss some ideas about where this could go. We're going to meet again in early December, and they're both excited about the space. Looking ahead, we will work to expand the qualitative coding I'm currently doing of Newcomer Homepage questions (and I think I'll add in the Help Me questions from newer users). This has already revealed quite a bit, but we'd then choose one potential space for intervention, build out a prototype, and evaluate it. Some of the potential intervention ideas (please chime in if you have others) that have already come out of our discussions:
- Natural-language search of Policy/Help namespaces. This is what I came into the project thinking about and will very likely still pursue because it should be effective: these namespaces are relatively constrained in size, not super dynamic, contain a fair bit of jargon, and have many massive/diverse pages that challenge the utility of keyword search. This is also great for prototyping because it's almost purely back-end and easy to incorporate into tooling to test out if we get to that point. Plus it aligns nicely with other Semantic Search work that's happening.
- Same as above but over FAQ / Question spaces only. Essentially, rather than directly providing the answer, this would help editors find similar questions and see how other editors responded (with answers, requests for clarification, cautions about breaking policies, etc.).
- LLM agent to help editors rewrite their questions so they are easier to answer. This could support better Search as well, but also ensure there's enough context for an editor to answer directly as opposed to having to first ask for a follow-up (with all the newcomer drop-off that occurs the longer the conversation goes). I like this as a nicely-constrained and principled use of AI that doesn't get in between the interactions between editors (it just tries to ease things from the sidelines). Some similarities to the ideas proposed by Cristian Danescu (meta), but harder to prototype because you need it installed by newcomers, so that either requires essentially a full Product deployment or a very limited field study at edit-a-thons where you could individually install it for folks.
- "I'm just a human" auto-responder for mentors. This is kinda a combination of the above two ideas but with more interesting prototyping opportunities. Essentially the idea would be that when a mentee asks a question on their mentor's talk page, if the mentor has opted in, a bot would automatically collect that question, query an AI agent, and post a quick-follow up depending on the level of context provided. Probably always included is some boilerplate language about how editors are people and might not be active in this moment so please be patient and check back. If the question has enough info, maybe the response includes a few relevant links from on-wiki documentation / question banks based on the Search prototype. Maybe if the question is lacking context, the bot asks the editor to clarify. Maybe the AI even tries to answer the question. This could be configurable as well -- e.g., an editor could opt in to just the Search links but no answer or just the Clarification component but not the others.
- Tool for helping newcomers keep track of the questions they've asked. It'd be great to be able to track whether the question was answered etc., but that gets a lot trickier because questions get moved around as pages get archived. Easiest would be to just retain the original section link and allow the DiscussionTools extension to handle discovery of the section even if it's been moved. The improved Thank/Reply functionality would then help the editor figure out how to follow up.
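For anyone who wants to poke at the {{Help me}} pathway mentioned above, here's a simplified sketch using the public MediaWiki API rather than the dumps. It only checks the talk-page owner's registration date against today (the real analysis checked account age at the time the request was added), and the namespace/limit parameters are just reasonable defaults -- treat it as a starting point, not the notebook's actual logic.

```python
# Simplified sketch: find user talk pages currently transcluding Template:Help me
# and flag ones whose owner registered recently. Uses the public MediaWiki API.
import requests
from datetime import datetime, timezone, timedelta

API = "https://en.wikipedia.org/w/api.php"
SESSION = requests.Session()
SESSION.headers["User-Agent"] = "help-me-exploration-sketch (research)"

def help_me_talk_pages(limit=50):
    """Yield titles of User talk pages that embed Template:Help me."""
    params = {
        "action": "query", "format": "json", "list": "embeddedin",
        "eititle": "Template:Help me", "einamespace": 3, "eilimit": limit,
    }
    data = SESSION.get(API, params=params).json()
    for page in data["query"]["embeddedin"]:
        yield page["title"]

def registration_date(username):
    """Return the account registration datetime (or None if unavailable)."""
    params = {
        "action": "query", "format": "json", "list": "users",
        "ususers": username, "usprop": "registration",
    }
    data = SESSION.get(API, params=params).json()
    reg = data["query"]["users"][0].get("registration")
    return datetime.fromisoformat(reg.replace("Z", "+00:00")) if reg else None

cutoff = datetime.now(timezone.utc) - timedelta(days=10)
for title in help_me_talk_pages():
    owner = title.split(":", 1)[1].split("/")[0]   # "User talk:Name/sub" -> "Name"
    reg = registration_date(owner)
    if reg and reg >= cutoff:
        print(f"Recently registered account with a help request: {title}")
```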
Wed, Nov 12
PhD Symposium wrapped up! I was not able to attend but reportedly the day went smoothly.
Fri, Nov 7
Thanks for calling that out @KStoller-WMF ! In my informal conversations with lots of experienced editors, exhaustion is definitely a factor though the motivation/desire to help still exists. I definitely came into this wondering how to help mentees but the more I do it, I think the "how do we help mentors get more enjoyment out of the process" question is also crucial.
Thu, Nov 6
Oooh good find! Just repasting link with question-mark included for easier discovery: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Find_articles_that_have_recently_had_a_specific_word_added?
Oct 31 2025
Weekly update:
- No major update, as other urgent work took up most of my time. I did have a good discussion with Moyan Zhou of UMN about the role of AI in mentorship, though, which sparked some thoughts about how AI could potentially help newcomers rephrase their questions and help mentors dig up links etc. to make it easier to respond, while largely staying out of the middle of the relationship, where that human connection is important.
Oct 28 2025
Hey folks - a few things to share with the broader group:
- Thanks for sending along your notebooks for review! We're working our way through them and appreciate the patience. Given that the deadline for final submission is November 3rd, please submit within the next day if you are intending to, so there is time for feedback.
- The final application requests a project timeline. My suggestion: break down week-by-week what would be required to turn your notebook into a fully-functioning app that organizers could use. Please include time for making improvements and learning new skills if you need it. Overall, though, we're generally more interested in your notebook and your responses to the other questions, so don't sweat it if you're not really sure how to do this.
- Some common themes across the notebooks:
- We value independent thought, so feel free to try out ideas. That said, please explain the reasoning behind your choices -- e.g., why did you use a certain threshold for "good" vs. "bad"; why did you choose the signals you went with; why did you choose the articles you went with; etc. There is no right answer, so it's much more important that we understand how you got to the code you produced.
- Please please please check the outputs of your functions. Are the results correct and what you expected? GenAI can be helpful with code but it also can steer you very wrong.
- Remember to take a look at your public link (https://public-paws.wmcloud.org/User:<your-username>/Micro-Task-Generator.ipynb) before submitting as sometimes code/comments can be hard to read in the public form (good use of Markdown cells often helps with this).
- Feel free to delete the example code that we provided but do make sure to answer each of the TODOs.
Oct 27 2025
Oct 23 2025
so the way I see it, it means we may not need a new classifier for chat agents.
@Mayakp.wiki that's fine by me then -- I'm mainly concerned with a shifting definition rather than the exact boundaries, and it sounds like you are planning to stick with ChatGPT/Perplexity as Search Engines as opposed to moving them to their own category at a later point. Thanks!
Another quick thought: I was curious how many mentees actually had an email address (necessary to get a notification that their mentor had responded if they've logged out of Wikipedia) and it's up around 91%, so that's not necessarily a major issue here as far as mentee drop-off. EDIT: after adding an authentication check, it's only 67% of mentees, so perhaps a larger factor.
Weekly update:
- I started coding up some questions from Wikipedia Teahouse. After 5 of them, I'm going to pause though. They tend to be far more detailed/advanced and I think out-of-scope for my goals at the moment. These are questions that almost certainly do need the level of detail/context that an editor can provide in their reply (i.e. bad fit for just surfacing documentation). The Teahouse folks are also largely doing a good job of responding pretty quickly -- e.g., 3 of the 5 questions got responses in ~10 minutes. It's telling that in all three of those cases, the conversation was much more in-depth than the usual question + single response (9, 5, and 6 responses) and actually saw the question-asker continue to engage. For the other two (2 hours and 13 hours to first response), the question-asker never re-engaged.
- Mentees essentially never thank their mentor (despite occasionally using this feature to thank others) and often don't respond to their initial thread if the question isn't answered in the first ~10 minutes. We may want to nudge mentees to thank their mentor when their response is helpful. The more accessible Thanks link on talk pages (details) should be a big help when it's deployed to English Wikipedia but perhaps there's a good place to nudge mentees to use this functionality when they appreciate a mentor response (as it's still slightly hidden).
- I talked with a number of folks at WikiConference North America about this work, which led to some interesting ideas:
- Mentorship has (at least) two goals: giving the question-asker specific feedback on what they should do next (competency) and advising on broader norms within Wikipedia (relatedness). The former is what I think we might address better via improved Search over documentation, while the latter is what is still important to preserve as a human interaction.
- At some point, it might be valuable to consider what data could help mentors in assessing their work. This would have to be done carefully because folks are doing this out of their own goodwill and you don't want to transform it into another chore or just plain work -- i.e. it shouldn't feel like grading or surveillance. That said, statistics on mentee survival/success, response times (maybe too surveillance-y?), or other outcome-related data might help in surfacing particularly successful mentors or identifying areas for improvement. So maybe, e.g., a public top-list of the best mentors by engagement/outcomes, and then folks can privately view their own statistics about response time etc.
- In relation to discussions around the progression system (T395678), mentors might eventually be folks who could "sign off" on someone achieving a basic level of skills. This could be purely for feedback purposes or to help build the confidence of new editors, a way for a mentee to "graduate" out of the mentorship program if mentors feel they have too many folks on their plate, or perhaps even tied to receiving some sort of user access level if that's deemed helpful.
- I'm less convinced about this but leaving it here as a thought: we may want to institute some sort of back-up similar to how Help me templates work. E.g., if a mentee isn't receiving a response within some timeframe, other editors could be pinged. Perhaps more appropriate would be doing that if e.g., the mentor has not edited in the last 24 hours? Generally I see some issues with slow responses on Growth Homepage mentor questions (especially as compared to Teahouse) though it's been pretty rare that a mentor doesn't respond in e.g., 24 hours so I don't really think this is a problem that needs to be solved.
- Next steps: I'm realizing that there are a lot of approaches to getting feedback (many listed in the description of this task) but it might be helpful to describe them in a bit more organized way -- e.g., whether it pings an individual, a small group, or a large group; how easy to use; how discoverable; etc. This will also help me in deciding whether I want to continue coding up the Newcomer Homepage Mentor questions or switch to a third source.
Just commenting here too as a duplicate of T406531#11303547: I personally would leave in place the expectation that referrers start with http or https. My read is that that behavior is largely coming from bots who are improperly mocking up a referrer. I don't see nearly the volume that Krinkle saw in January and the majority of it is being labeled as automated (query below). Given that it's also not acceptable behavior per the specs, I'd lean towards us enforcing the expectation of having legitimate referers as a further check against bot data.
Thanks for the ping! A few thoughts but don't let this block the work if you all want to proceed:
- I personally would leave in place the expectation that referrers start with http or https. I can go comment on T383088 too but my read is that that behavior is largely coming from bots who are improperly mocking up a referrer. I don't see nearly the volume that Krinkle saw in January and the majority of it is being labeled as automated (query below). He noted that it's not actually acceptable behavior but believed it to be an issue caused by some privacy extensions perhaps. It's a judgment call but I'd prefer that we expect legitimate referers as a further check against bot data.
- +1 to changing IPs to unknown -- seems reasonable and I assume not a major impact on our data so consistency is less important.
- I'd be cautious about adding ChatGPT and Perplexity into our Search Engine definition. This is a broader philosophical thing, so I don't think there's any right answer, but my thoughts: I don't think "Search Engines" are actually well-defined anymore, and by all means, Google is very chat-agenty these days, so the boundaries are getting more and more blurred. That said, if we plan to establish a "chat agent" referer class, then please don't temporarily put ChatGPT/Perplexity into the Search Engine data, as it'll just cause confusion as to why there's a temporary blip in the data.
Oct 21 2025
Just quickly chiming in on words/references:
- My mwedittypes library can do this. There's also a UI/API for it if you're curious to see what it looks like: https://wiki-topic.toolforge.org/diff-tagging. That's all hosted on Cloud Services and not actively maintained, though, so let me know before using it in anything live -- but it's fine for exploring, prototyping, etc.
- The references are relatively straightforward. The default wikitext-based approach (what's happening in the UI above) just counts the <ref> tags that are present in the wikitext. That means it will miss some things -- e.g., see PAWS:references-wikitext-vs-html.ipynb -- but is probably good enough for the use-case of analytics. You can also do it via HTML, which will be far more accurate and is implemented in the Python library (just not exposed via the UI/API). The other difference is that on the HTML side, I distinguish between references (i.e. new sources in the reflist at the bottom) and citations (i.e. in-line usages of those references). The wikitext one is really counting citations, though I think we could adjust it to capture both if desired.
- Words are more complex. Two things:
- What is considered "text" in an article: the library currently strips out references, templates, images, lists, categories, and a few other things (code). Essentially aiming for gathering the core text in the article. This could be over-written though if you all are interested in a different set of elements. Using HTML here also brings some additional flexibility -- e.g., if you wanted to count words in infoboxes but not clean-up templates for instance.
- How do you count "words" in text: once you have the core text, there is still the challenge of counting up words. For whitespace-delimited languages like English, that's pretty trivial (split on whitespace, and because you don't care about specifics, you don't have to worry too much about cleaning up punctuation or stuff like that). For non-whitespace-delimited languages like Chinese or Thai, it's a lot trickier. We do have another library (mwtokenizer) for doing this, but it gives you the sorts of tokens you might hear discussed in the context of LLMs -- i.e. they aren't guaranteed to be true words but are instead common sequences of characters, so sometimes full words but sometimes just chunks of words. For the moment, mwedittypes just falls back to reporting how many characters were changed, but I've been meaning to incorporate the mwtokenizer logic, so happy to talk about that if you're interested. (A very rough sketch of the wikitext-based counting follows this list.)
- Of note, if you go with mwedittypes, you'd get some other elements for free -- e.g., how many images were added, how many clean-up templates were removed (HTML only), if an infobox was added (HTML only), and presumably any other element that you might want to report on.
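To make the wikitext-based counting concrete, here's a very rough sketch of the kind of thing described above. It is not mwedittypes' actual implementation: the regexes only approximate the stripping the library does (no handling of nested templates, lists, etc.), and the word count only makes sense for whitespace-delimited languages.

```python
# Rough approximation of wikitext-based citation and word counting.
# Not the mwedittypes implementation -- just the idea in a few regexes.
import re

REF_OPEN = re.compile(r"<ref[\s>/]", re.IGNORECASE)          # paired and self-closing refs
REF_BLOCK = re.compile(r"<ref[^>/]*>.*?</ref>", re.IGNORECASE | re.DOTALL)
TEMPLATE = re.compile(r"\{\{.*?\}\}", re.DOTALL)             # naive: ignores nesting

def count_citations(wikitext):
    """Count in-line <ref> usages (what the wikitext approach in the UI reports)."""
    return len(REF_OPEN.findall(wikitext))

def count_words(wikitext):
    """Very naive 'core text' word count for whitespace-delimited languages."""
    text = REF_BLOCK.sub(" ", wikitext)
    text = TEMPLATE.sub(" ", text)
    text = re.sub(r"\[\[Category:[^\]]*\]\]", " ", text, flags=re.IGNORECASE)
    return len(text.split())

sample = "Paris is the capital of France.<ref>Some source</ref> {{Infobox city}}"
print(count_citations(sample), count_words(sample))
```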
Oct 10 2025
I'm getting started with the first task and need a few clarifications. The question is very open-ended; there are a number of relevant articles on climate change -- some based on recent happenings, some on basic concepts, some describing past events, etc.
- As we are aiming for a list of newcomers, should I only target pages that need minor changes?
- Is it okay to think from a perspective of adding new information, fact checking, removing biased content etc.?
- Do I link such articles to my answers?
I'm a little stumped as to how to approach this. Could anyone provide insights? I just need to understand what I am working with here, thanks!
@shreya-bhagat thanks for this thoughtful question! The intent is for it to be open-ended -- there won't be any one right solution here. The questions you ask make me think that you're thinking carefully about it, which is the most important part. Feel free to choose articles whose content you're more familiar with too (it doesn't have to just be the globally "most important" content). Whichever route you go, just explain why you chose the set that you did. You'll then use that set of articles for your remaining analyses.
A general reflection too: it's really powerful to go through these editor journeys via the questions they're asking mentors and their Contribution history and to try to figure out what was going on. Many journeys are quite short (unfortunately), with plenty of misconceptions evident from their actions, but it's really interesting to see them try creating user pages, getting help, making edits, etc. And then it's very heartwarming when you see an editor figure it out and keep editing!
Weekly update:
- I began considering what it would mean to extract nice structured datasets of Q&A from pages -- e.g., WP:Teahouse Archives -- but paused that effort as I realized that a) it was non-trivial, and b) I wasn't fully sure yet what I would want to extract, so better to return to the question after spending some more time with the data. I want to process the HTML, and that's also a consideration: HTML will make it much easier to extract e.g., policy/help links and nice clean text from the conversations (a rough sketch of what that might look like follows this update). Because I'm really only interested in the final question + answers, it's okay that HTML largely locks me into working with the current snapshot as opposed to the full history of the conversation. There are some existing parsers for wikitext + talk pages if I decide to change direction -- they likely wouldn't work exactly for my needs but might have some of the logic around e.g., extracting timestamps, usernames, etc.
- I pivoted instead to starting some qualitative coding of editor Q&A. I began with newcomer questions via the Newcomer Homepage mentor module largely because there were some discussions happening about the impact of that module that I thought might benefit from more data. I grabbed 100 random mentor questions from English Wikipedia (query below) and have gotten through 15 of them (thanks to @TAndic for helping me think through my codebook). Still very small sample but some early takeaways:
- 5 did not really receive responses (2 mentors seemed to be generally inactive at that time, 1 was a case of the question simply being ignored, 2 were cases of the question being off-topic/unintelligible and eventually reverted).
- Of the 10 with responses: 2 mentor responses came within ~20 minutes; 5 responses came in 12-20 hours; 2 took 1.5 days, and 1 came a month later.
- The mentor responses were largely helpful/kind -- sometimes directly answering the question, sometimes asking for clarification. Mentees almost never responded back or thanked them, though. More common actually was the mentee making a follow-up in a new section (twice) or on their own talk page (once). Only twice did they actually follow up on the original question.
- Of the questions where the intention was clearer, 7 were about editing existing articles and 4 were about creating new articles. Most questions were generic (e.g., "how do I create an article?") and probably would have benefited from some follow-up questions/answers. The needs were pretty diverse (general workflow, questions about policies, questions about wikitext/syntax, help with approving articles, etc.)
- There were reasonable COI concerns in 4 of the questions. On the flip side, several of the newcomers were clearly acting in good faith and just trying to figure things out. For many it was unclear (generic question and not enough other activity to judge).
- The outcomes for these 15 aren't great though a few mentees made it through:
- No contributions for a month after question and then returned to edit occasionally
- Asked again about their draft article on different talk page and on Commons for some reason, but then stopped editing
- Made two more edits to their draft article about a month later, but it was eventually declined for notability reasons and they never edited again
- Kept editing but most of it was reverted for lack of sources. Eventually blocked.
- Never edited beyond the question
- Never edited beyond the question
- Never edited beyond the question
- Made edit but was reverted. Then made more policy-conforming edit and hasn't edited since. Likely COI though.
- Never made the edit they asked about or edited again. The page is still broken 2 years later from their initial attempts
- Never edited beyond the question
- Never edited beyond the question
- Figured it out and kept editing
- Figured it out and kept editing
- Fixed a typo, asked a follow-up in the wrong place, and then stopped
- Unclear what was going on with mentee but eventually they dropped off
- Next steps for me will be to pull some samples from other sources to diversify my sample. Once I have a better sense of what's out there, I'll return to the question of whether to try to more automatically extract some of this or continue in a more manual fashion.
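If the Teahouse-archive extraction idea from the first bullet gets picked back up, a starting point might look something like the sketch below. It's hedged: it assumes the Parsoid HTML served by the Wikimedia REST API (where wikilinks carry rel="mw:WikiLink" and hrefs like "./Wikipedia:Verifiability"), and the page title is just an example, not a specific target.

```python
# Rough sketch: pull section headings and policy/help links out of a Teahouse
# page's Parsoid HTML (via the Wikimedia REST API). Illustrative only.
import requests
from bs4 import BeautifulSoup

def fetch_parsoid_html(title):
    url = f"https://en.wikipedia.org/api/rest_v1/page/html/{requests.utils.quote(title, safe='')}"
    resp = requests.get(url, headers={"User-Agent": "teahouse-extraction-sketch (research)"})
    resp.raise_for_status()
    return resp.text

def extract_questions_and_links(html):
    """Return a list of (section heading, [policy/help links in that section])."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Parsoid wraps each section in a <section> tag with its heading as a direct child.
    for section in soup.find_all("section"):
        heading = section.find("h2", recursive=False)
        if heading is None:
            continue
        links = [
            a.get("href", "")[2:]
            for a in section.find_all("a", attrs={"rel": "mw:WikiLink"})
            if a.get("href", "").startswith(("./Wikipedia:", "./Help:"))
        ]
        results.append((heading.get_text(strip=True), links))
    return results

html = fetch_parsoid_html("Wikipedia:Teahouse")  # an archive page would work the same way
for heading, links in extract_questions_and_links(html)[:5]:
    print(heading, links)
```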
Oct 9 2025
Echoing the welcome to everyone -- it's great to see so many applicants and all the conversation!
Oct 7 2025
FYI chiming in as I've done a bit of this in the past but am happy to pass it off to others!
Oct 3 2025
Sep 29 2025
Sep 26 2025
Adding a comment for documentation purposes:
FYI if you all ever want to support rev ID for this model, it's pretty simple to implement. A few choices:
- Map any given revision ID to its current page ID and then go from there. On one hand, this might be confusing if e.g., someone submits a revision ID from 10 years ago but gets a prediction based on content from today. On the other hand, this model is the rare model where the topic really is a concept that exists outside of the page and we're just using the page to predict it. So we don't expect it to change with every revision (if it did, that would be more of a bug than a feature). So I think it'd be reasonable to just map any revision ID to the current data. It's also way simpler.
- Actually support arbitrary revisions of pages as the source of features. For this, you'd need to extract the wikilinks manually from the page. Some code below. It's slower because now you're fetching page HTML, but still probably pretty performant.
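A hedged sketch of both options (not necessarily the exact snippet referenced above): option 1 maps a revision ID to its current page via the Action API, and option 2 pulls the wikilinks out of that specific revision's Parsoid HTML, where links carry rel="mw:WikiLink". The endpoint choices and parsing details are assumptions to adapt as needed.

```python
# Sketch of two ways to support revision IDs for the model's features.
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "revid-support-sketch (research)"}

def revid_to_current_page(revid, wiki="en.wikipedia.org"):
    """Option 1: map a revision ID to its current page (page ID + title)."""
    params = {"action": "query", "format": "json", "revids": revid}
    data = requests.get(f"https://{wiki}/w/api.php", params=params, headers=HEADERS).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["pageid"], page["title"]

def wikilinks_for_revision(title, revid, wiki="en.wikipedia.org"):
    """Option 2: extract wikilinks from a specific revision's Parsoid HTML."""
    url = f"https://{wiki}/api/rest_v1/page/html/{requests.utils.quote(title, safe='')}/{revid}"
    html = requests.get(url, headers=HEADERS).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", attrs={"rel": "mw:WikiLink"}):
        href = a.get("href", "")
        if href.startswith("./"):
            links.add(href[2:].split("#")[0])  # "./Paris#History" -> "Paris"
    return sorted(links)
```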
Sep 19 2025
Obviously let's keep this focused on tone-check, but just a reminder of our related use-case for a potential time topic filter, where you'd do something like articletopic:time>1960 (T397375).
Sep 18 2025
A few updates:
- Notebook for collecting Citoid params for each language (where available): https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Citations/t374554-gather-citoid-params.ipynb
- Notebook for building representative article HTML dataset for every language to explore Citoid coverage etc.: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Citations/t374554-html-random-dataset.ipynb
- Some descriptive stats on a few common Citoid parameters and how well-structured references are across the different Wikipedia languages: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Citations/t374554-citoid-coverage.ipynb
- Exploration of one important parameter (date) and its non-Citoid year backup: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Citations/t374554-date-vs-year-enwiki.ipynb
Sep 17 2025
Hey @Alexey_Skripnik -- glad you discovered the dataset! What you're running into is not the anonymity threshold but actually the geographic filtering that happens. You can see more details in T348504 but the relevant country policy that's being enforced here: https://foundation.wikimedia.org/wiki/Legal:Wikimedia_Foundation_Country_and_Territory_Protection_List. What that means unfortunately is that we would need to apply differential privacy to this dataset to obscure the counts, so it would be a much larger lift than just another round of aggregation. We also would need to recalculate the raw data as the filter is applied upfront. I think that's unlikely to be prioritized on our end but I'll watch for opportunities to push for the more long-term solution of switching the data pipeline over to using differential privacy.
Sep 15 2025
Just a heads-up that there is some data available about common user-agent components, though I'm not sure if it meets the needs of this task as it's more structured/split as opposed to the raw string:
- https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/browser/
- You can visualize the above via https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os
- And I believe this is the relevant code if interested about implementation details: https://gerrit.wikimedia.org/g/analytics/refinery/+/edfea882db211d861d52833bedf3f4a62d522317/hql/browser/general/browser_general_iceberg.hql
Sep 10 2025
What might be happening in those cases is that they exceed the entityUsageModifierLimits (in production, up to 33 different property IDs per entity are tracked, beyond that it gets collapsed into “uses any and all statements”).
Oh interesting @Lucas_Werkmeister_WMDE - I was not aware but very cool functionality! Yeah, so then the question I guess is whether this collapse happens a lot and how much RC noise would be saved by allowing it to collapse into "any and all identifiers" instead.
Unsolicited idea: would it be feasible to also provide Lua functionality to fetch just the Wikidata identifiers for an item? When I analyzed Wikidata usage on English Wikipedia a few years back (which isn't representative of how wikis in general use Wikidata but still captures some of the trends), my conclusion was that a large amount of usage was for fetching identifiers (taxonbar, authority control, etc.). While these templates were fetching the whole item because that was the only reasonable approach (and so triggering lots of unrelated property updates in RecentChanges), they only wanted the identifiers part: https://meta.wikimedia.org/wiki/Research:External_Reuse_of_Wikimedia_Content/Wikidata_Transclusion
Sep 4 2025
Marking this stalled for now. We have talked with a number of Product teams but don't have a clear team who is positioned to take the lead on pushing these changes through just yet. This seems to reflect the general challenge that many teams would benefit from these changes and are doing work in this space, but no one team owns the topic model and the changes are based on community feedback (as opposed to a Product OKR). I'll continue to look for further stakeholders/opportunities.
Aug 13 2025
Aug 12 2025
Resolving this. A question remains of what this would look like next year. There's a September 5th deadline for EACL/ACL: https://www.aclweb.org/portal/content/eaclacl-2026-joint-call-workshops but EMNLP/AACL will release a call later in the fall.
More fine-grained takeaways:
Finished! I'll record some of my takeaways below and then close this task out.
Update:
- We accepted 11 submissions. Camera-readies are due in late August but Sheridan handles that so I don't have to do any validation (as with ACL workshops).
- Next question to tackle will be how to have folks present at the conference and what the program looks like. My co-chair will be taking the lead on that though as he will be in-person and knows the community better than me.
Aug 6 2025
... You can do that by adding a fake parameter to the page, and it can be anything as the site is static. It just needs to be a new URL, e.g. https://research.wikimedia.org/report.html?r=12
Ahhh thanks for the tip! I'd been trying to force a refresh but that wasn't having any effect. Now I know!
Thanks @DDeSouza! Confirmed that I can see it live (weirdly only on Chrome desktop or if I switch my IP address via a VPN, but I won't pretend to understand how the internet caches things, and presumably most people don't have this problem), so now resolving the task. Communication to wiki-research-l: https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/RRISCJJQ3SWUKT6YJ7JW5GGCKDRKD52D/
Aug 5 2025
Looks good to me - thanks @DDeSouza ! Ready to publish now.
Jul 31 2025
is this task still valid?
Yes - the articlecountry model is on LiftWing but the two dependencies listed in this task are static and have no official way of updating beyond re-running my Jupyter notebooks. Additionally, T387041: Generate Airflow DAG for creating article-country SQLite DB lists a third dependency that is also still valid. I don't know if this would fall under REng or ML at this point though.
I can vouch for Aaron - the username was provided to me over Slack by his account and matches his wiki account.
Jul 29 2025
Thanks @DDeSouza ! We're still waiting on the final sign-off for the Forward but I'll let you know when that happens. Let me know when the other sections are ready and I'll review.
Jul 28 2025
I was thinking of adding an access_method column, which would then be populated with desktop and mobile web (just like in the webrequest table) to keep it looking as similar to the webrequest table as possible. What are y'all's thoughts about that?
Oh yep, that would work for me and is probably even simpler! It captures things at the current-URL stage, but that should still generally be reflective of which UI the person was using when they clicked the link.
Thanks @Milimetric ! One thought: would it be easier to just record whether the previous URL is mobile or desktop, as opposed to the four values mobile_to_mobile, mobile_to_desktop, desktop_to_desktop, desktop_to_mobile? The thinking is that the language switch presumably always either honors the previous URL, or, if it does change, it's not because the user requested it but because e.g., the page automatically redirected to desktop or something like that. So from an analysis perspective, you probably care most about which type of page the person started on (because that tells you which type of UI they were using when they triggered the switch), but where they ended up doesn't tell you anything additional. Hopefully that simplifies the code a little bit and is easier to query as well (a tiny classification sketch below). @CMyrick-WMF I welcome your thoughts as well!
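To illustrate the simpler single-column version, here's a tiny hedged sketch of classifying the previous URL as mobile or desktop. It relies only on the fact that mobile Wikimedia URLs use a ".m." host segment (e.g., en.m.wikipedia.org); the function name and the "unknown" fallback are my own placeholders.

```python
# Tiny sketch: classify a referring Wikimedia URL as mobile or desktop
# based on the ".m." host segment (e.g., en.m.wikipedia.org).
from urllib.parse import urlparse

def access_method_of(url):
    host = urlparse(url).hostname or ""
    if not host:
        return "unknown"
    return "mobile web" if ".m." in f".{host}." else "desktop"

assert access_method_of("https://en.m.wikipedia.org/wiki/Paris") == "mobile web"
assert access_method_of("https://en.wikipedia.org/wiki/Paris") == "desktop"
```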
Jul 18 2025
Also, every section but the Forward is now stable and can be safely copied over. The Forward won't be stable until the 28th at the earliest. So I think a goal could be to have everything but that section ready for review on the 28th; then we can hopefully review it while also providing you with the final section, and be ready to go on the 30th.
Adding another ask to be done by July 30th: @DDeSouza could you have https://wikiworkshop.org/2025/ redirected to https://meta.wikimedia.org/wiki/Wiki_Workshop_2025? That way we can link to a more canonical URL in the report but have it go to the right place. I assume that just means duplicating the current redirect at https://wikiworkshop.org/ to also be at https://wikiworkshop.org/2025/.
Jul 17 2025
Thanks @Aklapper!
(Meh, the GitLab account and the Phab account use different email addresses and the Phab account is linked to some personal account instead, but I assume that is because no WMF SUL account has been provided by WMF to Kaylea, for unknown but unfortunate reasons?)
Thanks @DDeSouza ! The draft doc we're working from is here (internal only) and I've marked the sections that are stable with [ready] so they can be copied into a draft MR if you want to get a head start. I'll also let you know when the rest of the sections are finalized. Feel free to leave me questions here or in the doc if there's any oddities that you notice though I'll be out next week so likely won't be able to respond until July 28th.
@brennen: this is for work that will be documented in T399696: GitLab Private Repository Request for: research/npov-workstream-research so if you get a chance to take a look at that task as well and approve or send us any follow-up questions, that'd be much appreciated.
Chiming in to vouch for Kaylea -- she's contracting with us on the Research team and shared this task with me internally on slack
Seeing the updates now -- thanks all and no harm done! Good reminder to us to manually verify the live changes too
Jul 16 2025
@DDeSouza flagging that we're almost ready to start moving forward with the next Research Report. We're aiming to have the content ready to deploy on July 30th, though most of it will be ready for you to put into an MR before then; it's just a matter of waiting for a few final confirmations before deploying. I'll follow up shortly with some more concrete asks and try to make most of them this week so you have at least a week to prepare, but I figured I'd give you a heads-up now that we have a clearer timeline for the work.
Thanks @DDeSouza - looks good to me to release (I don't see the changes publicly - not sure if you're holding until confirmation or somehow the merge was unsuccessful). And thanks for including Miriam's title update as well.
Jul 15 2025
Jul 11 2025
Update:
- 18 submissions to the Symposium that we're gathering reviewers for. This is a nice load: most of them appear to be good quality, and last year they accepted 20, which was evidently far too many, so this should allow us to accept a reasonable proportion without overloading the day. They had 98 submissions last year, and I suspect a large part of the difference is the location (South Korea) and reduced travel from the US.
Updates:
- Proceedings accepted
- Schedule largely finalized (both keynotes) but still looking for a local Wikimedian for a conversation/AMA. Turns out that Europe+August=Vacation is still alive and well :)
Jul 10 2025
Jul 9 2025
Jul 8 2025
Jul 7 2025
Jul 2 2025
Resolving this epic! A quick summary of the very large amount of work that was accomplished:
- Background report: https://meta.wikimedia.org/wiki/Research:Develop_a_working_definition_for_moderation_activity_and_moderators
- Focus on decentralized aspects of moderation: https://meta.wikimedia.org/wiki/Research:Crowdsourced_Content_Moderation
- Code: https://gitlab.wikimedia.org/repos/research/who-are-moderators
- Patrolling data dashboard (internal): https://superset.wikimedia.org/superset/dashboard/605/ (documentation)
- This work is now feeding into FY25-26 Annual Plan's WE1.3 key result, which focuses on increasing moderation actions. It has also helped inform adjacent areas of work such as the NPOV workstreams.
- It identified the importance of HTML diff data and laid the groundwork for building that dataset (T380874), though prioritization is pending the direction taken by the Moderator Tools team. It would also have benefited from productionized edit diffs (T351225) but we worked around that dependency.
