JL-Bot
This page has archives. Sections older than 14 days may be auto-archived by Lowercase sigmabot III.
Not sure this is configured correctly
On this page:
It looks like a couple of topics are repeated, and one (the DYK section) isn't listed at all. How can I fix this? --evrik (talk) 16:25, 7 June 2022 (UTC)
- @Evrik: Do you have examples of things that are repeated/not listed? Headbomb {t · c · p · b} 22:30, 7 June 2022 (UTC)
- For the DYK section, there was an extra equals sign on the end of the line. I removed that and re-ran. The DYK section is now present. I'm not seeing any duplicate topics. Let me know what you are questioning and I will take a look at it. -- JLaTondre (talk) 01:06, 8 June 2022 (UTC)
- @JLaTondre:, thank you for the fix. @Headbomb:, "Picture of the day pictures" seems to duplicate "Featured pictures", and "Featured lists" mirrors "Main page featured lists"; those are the two with the most overlap. --evrik (talk) 01:22, 8 June 2022 (UTC)
- That looks good. How would I code that on the page? --evrik (talk) 03:08, 8 June 2022 (UTC)
- The bot currently outputs by category. There is overlap in the categories (for example, I would assume all "Main page featured lists" are "Featured lists"), but they are not the same (not all featured content appears on the main page). So what is the ask here?
- When outputting a featured type (ex. "Featured lists"), provide an option to add a Wikipedia icon if it appeared on the main page?
- It would then be up to the specifier to only include the one category type and the option; or
- The bot would only display the larger category if the option is set even if both categories are specified
- Provide an option to consolidate by page type (article, list, picture, sound) where it would show a different icon for each recognized type? So for example, the new section could be "Recognized articles" and you would get a different icon and date (where applicable) for each of "Featured article", "Former featured article", "DYK", etc.
Well, TFAs should be either current or former FAs. So a condensed option would, IMO, require that both FA and FFA are covered. And then the TFAs could be 'merged' into FA/FFA. Likewise for TFLs, which would be merged into both FL and FFL sections. So if you have
|content-featured-articles |content-former-featured-articles |content-mainpage-featured
the output would be as is, but if you had something like
|content-featured-articles |content-former-featured-articles |content-mainpage-featured=condensed (or something equivalent)
then the output would be merged as above (or similar, depending on whether or not icons were desired) Headbomb {t · c · p · b} 12:38, 8 June 2022 (UTC)
- @JLaTondre: if you run the bot on Portal:Scouting/Recognized_content I can do a full mockup. Headbomb {t · c · p · b} 01:55, 12 June 2022 (UTC)
- Done. -- JLaTondre (talk) 15:40, 12 June 2022 (UTC)
- Mockup. POTD icons will be supported once Template_talk:Icon#POTD_support is enacted. Headbomb {t · c · p · b} 16:32, 12 June 2022 (UTC)
- @JLaTondre and Headbomb: Thank you both! --evrik (talk) 02:33, 13 June 2022 (UTC)
FM captions
Is there a reason why the "caption" option actually displays the media's title rather than its caption? The titles are so rarely helpful, while the captions would definitely be! MeegsC (talk) 17:38, 7 August 2022 (UTC)
- Primarily performance, but also the lack of standard formats. Captions are on the pages that use the images, and it would add significant time to go pull them. The task already takes most of a day to run. Images can also appear on multiple pages with significantly different captions. Captions are not always in a standard format, which makes pulling them from the page text problematic. -- JLaTondre (talk) 12:00, 9 August 2022 (UTC)
- Could we not use the captions that are in the picture file, and default to the title only if the picture file doesn't have an English caption? Most file captions are better than the title! MeegsC (talk) 13:14, 9 August 2022 (UTC)
- The description field? Yes, that might work. I will look into it. -- JLaTondre (talk) 16:36, 13 August 2022 (UTC)
Highlight journal= from different character set
If, for example, you have |journal=Аcta Вaltico‑Slavica
, where А and В come from the Cyrillic alphabet and the others from the Latin alphabet, it would be useful in the compilation to highlight this sort of thing, i.e. when an entry has characters from two different alphabets. If it's from a single alphabet, no highlighting is needed.
Journal1 | Type2 | Target1 | Type2 | Citations | Articles | Citations/article | Search |
---|---|---|---|---|---|---|---|
Аcta Вaltico‑Slavica | ? | Acta Baltico-Slavica | ? | 1 | 1 | 1.000 |
In general, there could be a color scheme like
- Red = Latin
- Orange = Arabic
- DarkKhaki = Chinese
- Green = Cyrillic
- Blue = Greek
- Indigo = Hebrew
- Violet = Japanese
- DeepSkyBlue = Other1
- MediumPurple = Other2 (only used when Other1 is already used)
- DeepPink = Other3 (only used when Other2 is already used, might not be needed)
Would this be difficult to implement? Headbomb {t · c · p · b} 05:30, 27 August 2025 (UTC)
- Yes, that is doable. Perl, which is what that part of the citation processing is written in, makes it easy to check language scripts. Perl can recognize all the ones listed at perlunicode#Scripts (all the ones you are requesting are on that list). For Chinese, it would really be detecting for Han script - which in my understanding is used for several Eastern languages. It will probably be a couple of weeks before I can complete it. -- JLaTondre (talk) 23:14, 27 August 2025 (UTC)
- Yes, if it's the Han alphabet, then that's the character set that should be highlighted. The point is to detect names that have multiple character sets in them, which should be rare, and usually limited to cases like
|journal=The Journal of Things = Το ημερολόγιο των πραγμάτων
- It's probably simpler to collect them and have them all reported on their own WP:JCW/Multiscript subpage, with that highlighting only in effect on that page. Headbomb {t · c · p · b} 00:59, 28 August 2025 (UTC)
- A separate page is easier. I can have a separate script for that vs. integrating into the regular output. -- JLaTondre (talk) 23:37, 28 August 2025 (UTC)
- Should it report cases where there is a language template? For example, what should it do with
Sidirotrohia ({{langx|el|Σιδηροτροχιά}})
which will produce Sidirotrohia (Σιδηροτροχιά)
(after the change discussed below)? There are also cases where people enter titles in multiple languages without the use of a template. Should it only report a mismatch when it happens within a single word? -- JLaTondre (talk) 23:53, 28 August 2025 (UTC)
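As an illustration of the detection step discussed above, here is a minimal Python sketch. It is not the bot's Perl code: Python's standard library has no equivalent of Perl's script properties, so the sketch assumes that the first word of a character's Unicode name is a good enough proxy for its script. The names `KNOWN_SCRIPTS`, `script_of`, `scripts_in` and `is_multiscript` are hypothetical.

```python
import unicodedata

# Assumption: the first word of a character's Unicode name approximates its
# script. This stands in for Perl's script properties; it is not the bot's code.
KNOWN_SCRIPTS = {"LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW",
                 "CJK", "HIRAGANA", "KATAKANA", "HANGUL", "ARMENIAN"}


def script_of(ch):
    """Return a rough script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation and spaces carry no script here
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word if first_word in KNOWN_SCRIPTS else "OTHER"


def scripts_in(name):
    """Return the set of script labels used by the letters of a name."""
    return {s for s in map(script_of, name) if s is not None}


def is_multiscript(name):
    """True when the letters of a name come from more than one script."""
    return len(scripts_in(name)) > 1


# The first and sixth letters below are Cyrillic (U+0410 and U+0412).
print(is_multiscript("\u0410cta \u0412altico\u2011Slavica"))
print(is_multiscript("Acta Baltico-Slavica"))
```

Under this approximation, Han characters are labelled `CJK`, which matches the point above that the check is for the Han script rather than for Chinese specifically.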
I have the code to detect multiple scripts completed. It is returning 2,325 cases in the last dump. The majority are of the format of a single non-Latin script followed by Latin script (or vice versa). For example:
- 한국한문학연구 (Korean Literature Research)
- 한국전자통신학회 논문지 = the Journal of the Korea Institute of Electronic Communication Sciences
- 한국언어문화 [Journal of Korean Language and Culture]
- ЕтноАнтропоЗум / EthnoAnthropoZoom
- Езиков свят - Orbis Linguarum
- Military History Studies (军事历史研究)
- Linguistic Sciences 语言科学
- Acta Historica: Труды По Историческим И Обществоведческим Наукам
- 7iber | حبر
Should I exclude cases where it is a single non-Latin script + space or punctuation + a Latin script (and the reverse order)? It seems like these are valid cases and you are more interested in ones like Artanіya which has a Cyrillic і in the middle of Latin characters? -- JLaTondre (talk) 14:06, 13 September 2025 (UTC)
- I think Script 1 [Separator] Script 2 can be excluded without much loss, so long as each script has 3+ letters in it. This way it excludes Езиков свят - Orbis Linguarum, but includes Acta Whatever А. Headbomb {t · c · p · b} 15:06, 13 September 2025 (UTC)
Code is complete, except for saving to Wikipedia. I uploaded the first 200 results at Wikipedia:JCW/Multiscript1. Please review and let me know what you think. If all is good, I will automate the saving (including a legend for the colors). -- JLaTondre (talk) 15:24, 20 September 2025 (UTC)
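The exclusion rule proposed above can be sketched roughly as follows. This is an illustrative Python approximation, not the bot's Perl implementation: scripts are guessed from the first word of each character's Unicode name, and `script_runs` and `should_report` are hypothetical names. Non-letters are treated as separators and do not break a run of the same script.

```python
import unicodedata


def script_of(ch):
    # Assumption: first word of the Unicode name approximates the script.
    return unicodedata.name(ch, "").split(" ")[0] if ch.isalpha() else None


def script_runs(name):
    """Return [script, letter_count, separated_from_previous] for each run."""
    runs = []
    gap = False  # True once a non-letter has been seen since the last letter
    for ch in name:
        script = script_of(ch)
        if script is None:
            gap = True
            continue
        if runs and runs[-1][0] == script:
            runs[-1][1] += 1  # same script continues, even across spaces
        else:
            runs.append([script, 1, gap])
        gap = False
    return runs


def should_report(name, min_letters=3):
    """Report mixed-script names unless they are Script 1 [sep] Script 2."""
    runs = script_runs(name)
    if len({run[0] for run in runs}) < 2:
        return False  # single script: nothing to report
    if len(runs) == 2 and runs[1][2] and all(r[1] >= min_letters for r in runs):
        return False  # two separated runs, each with 3+ letters: excluded
    return True


print(should_report("Езиков свят - Orbis Linguarum"))  # excluded case
print(should_report("Acta Whatever \u0410"))           # trailing Cyrillic A
print(should_report("Artan\u0456ya"))                  # Cyrillic i inside Latin
```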
- Looks good as far as I can tell. I might have refinements down the road, but I'll need to actually work with it for a bit to know. Headbomb {t · c · p · b} 16:12, 20 September 2025 (UTC)
- Saving has been implemented and the results have been uploaded. Take a look and see if everything looks good. The color legend in {{JCW-bottom-scripts}} is filled in by the bot along with database dump date. That way, if there are ever changes in the colors, the legend will be automatically updated by the code changes. If you want any changes to the legend, let me know and I will incorporate it into the bot. I will leave it to you to add the multiscript pages to {{JCW-Main}} so they are listed where you want them.
- The 20th dump completed today. The entire processing, including the multiscript, should happen tonight. So that will verify the whole chain works. -- JLaTondre (talk) 19:02, 21 September 2025 (UTC)
- Awesome. I've cleared so many issues it'll be good to have a new dump. It won't reflect what I fixed after the 20th, but still. Headbomb {t · c · p · b} 20:38, 21 September 2025 (UTC)
- BTW, I moved them to .../Maintenance/Multiscript[1-4]. Headbomb {t · c · p · b} 20:48, 21 September 2025 (UTC)
- I notice Hiragana = Violet, Katakana = Violet, is Kanji not a distinct set of characters? Headbomb {t · c · p · b} 21:11, 21 September 2025 (UTC)
- I updated the code for the new location & verified it worked. -- JLaTondre (talk) 21:56, 21 September 2025 (UTC)
Could you add..
Unbalanced brackets? For example, any entry like Quart. J. Math (Oxford.
Thinking <>, [], (), {}, ‹›, «», ⟨⟩, ≪≫
Thinking they could be added as a second line to WP:JCW/Invalid. Headbomb {t · c · p · b} 15:07, 9 September 2025 (UTC)
- I will do after the above items. -- JLaTondre (talk) 22:27, 9 September 2025 (UTC)
- I implemented a new regex matching option for Patterns using a {{JCW-regex}} option via the config. This allowed me to relatively easily add the unbalanced brackets|parenthesis|etc. case and an all punctuation case (to catch the "." citations mentioned above). Instead of adding special processing, this more general implementation allows for other uses down the road.
- The results are uploading to WP:JCW/PAT now. For the all punctuation case, there are more than 5 articles for the period case and for a "<>" case. That makes the results a little less useful since the articles don't get listed. I can just clean up those cases myself (since it is easy enough for me to manually pull the article list). I figure once the current ones are fixed, there are likely to be fewer than 5 per dump.
- I still need to create the {{JCW-regex}} page so it renders on the config page. -- JLaTondre (talk) 19:35, 20 September 2025 (UTC)
- Sounds great! Headbomb {t · c · p · b} 22:34, 20 September 2025 (UTC)
- {{JCW-regex}} has been created. I did it as a regular template. I know {{JCW-pattern}} uses a Lua module. If that is better, I will leave it for you to convert as I'm not familiar with those. As the 20th dump completed today and will be processed tonight, I will clean up the punctuation-only cases after those results post. -- JLaTondre (talk) 19:07, 21 September 2025 (UTC)
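For readers who want to see what the two checks amount to, here is a small Python sketch. It deliberately uses a stack rather than a regex, so it is not the {{JCW-regex}} pattern the bot is configured with; the function names are hypothetical, and the bracket pairs are the ones listed in the request above.

```python
import unicodedata

# Bracket pairs taken from the request above.
PAIRS = {"<": ">", "[": "]", "(": ")", "{": "}",
         "‹": "›", "«": "»", "⟨": "⟩", "≪": "≫"}
CLOSERS = set(PAIRS.values())


def has_unbalanced_brackets(name):
    """True if any bracket is left open, closed early, or closed by the wrong type."""
    stack = []
    for ch in name:
        if ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer is expected
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return True
    return bool(stack)  # anything still open is unbalanced


def is_all_punctuation(name):
    """True if the name, ignoring whitespace, is only punctuation or symbols."""
    stripped = "".join(name.split())
    return bool(stripped) and all(
        unicodedata.category(ch)[0] in "PS" for ch in stripped)


print(has_unbalanced_brackets("Quart. J. Math (Oxford."))
print(is_all_punctuation("."), is_all_punctuation("<>"))
```

One caveat with this sketch: a title that uses "<" or ">" as a mathematical sign would also be flagged, so results would still need a human look.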
Bug
In WP:JCW/Multiscript1, there's a Latin-only entry "Das kulturelle Erbe von Arzach". If you go to the article, you have |journal=Das kulturelle Erbe von Arzach | Արցախի մշակութային ժառանգությունը: Armenische Geschichte und deren Spuren in Berg-Karabach | Հայոց պատմությունը և դրա հետքերը Լեռնային Ղարաբաղում
The bot should be able to recognize {{!}} Headbomb {t · c · p · b} 22:41, 21 September 2025 (UTC)
- If you edit the Multiscript1 page, the full annotated name is: Das kulturelle Erbe von Arzach | Արցախի մշակութային ժառանգությունը: Armenische Geschichte und deren Spuren in Berg-Karabach | Հայոց պատմությունը և դրա հետքերը Լեռնային Ղարաբաղում. It needs the |'s escaped when placed into the template so it doesn't break the template. I will fix that. -- JLaTondre (talk) 01:12, 22 September 2025 (UTC)
- Right, I dug too deep. At least looking silly isn't as dangerous as Durin's Bane. Headbomb {t · c · p · b} 02:22, 22 September 2025 (UTC)
- Fixed. Let me know if you see anything else wrong. -- JLaTondre (talk) 23:09, 22 September 2025 (UTC)
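The kind of escaping described above can be sketched in a few lines of Python. This is only an illustration under the assumption that a plain string replacement is enough; the bot's actual fix is in Perl and is not shown on this page, and `escape_pipes` is a hypothetical name. {{!}} is the MediaWiki magic word that renders as a literal pipe.

```python
def escape_pipes(value):
    """Replace literal pipes so the value can sit inside a template parameter."""
    return value.replace("|", "{{!}}")


name = "Das kulturelle Erbe von Arzach | Armenische Geschichte"
print(escape_pipes(name))
```

A value that already contains template calls with their own pipes would need more careful handling than this.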
- The only thing I see is a nitpick about brackets sometimes being the wrong color, like in Dragon Ball 大全集 2: Story Guide [Dragon Ball Complete Works 2: Story Guide] or Hōsō Kenkyū to Chōsa [NHK Monthly Report on Broadcast Research] 放送研究と調査 or Lj. Maksimović, M. Ricl, ΤΗ ΠΡΟΣΦΙΛΕΣΤΑΤΗ ΚΑΙ ΠΑΝΤΑ ΑΡΙΣΤΗ ΜΑΚΕΔΟΝΙΑΡΧΙΣΣΗ: Students and Colleagues for Professor Fanoula Papazoglou (International Conference, Belgrade, October 17–18, 2017) . Headbomb {t · c · p · b} 02:07, 23 September 2025 (UTC)
- Punctuation and numbers are valid in multiple scripts. Perl classifies them as Common script. The code is leaving them uncolored if between other scripts and grouping them with a script if between two ranges of the same script. Many times that makes sense, but not always. The other option is to never color them. Should I change it to that? -- JLaTondre (talk) 10:26, 23 September 2025 (UTC)
- Yeah, probably best to leave brackets, numbers, and other 'neutral' marks in black. Headbomb {t · c · p · b} 11:56, 23 September 2025 (UTC)
- Change made & updated results uploaded. -- JLaTondre (talk) 22:22, 24 September 2025 (UTC)
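The "never color neutral characters" behaviour agreed above could look something like the following Python sketch. It is not the bot's code: the `<span>` markup is an assumed output format, scripts are approximated from Unicode character names, and every script outside the listed ones is simplified to DeepSkyBlue. The colors are those from the scheme proposed earlier in this thread.

```python
import unicodedata
from itertools import groupby

# Colors from the scheme proposed earlier in this thread.
COLORS = {"LATIN": "Red", "ARABIC": "Orange", "CJK": "DarkKhaki",
          "CYRILLIC": "Green", "GREEK": "Blue", "HEBREW": "Indigo",
          "HIRAGANA": "Violet", "KATAKANA": "Violet"}


def script_of(ch):
    # Assumption: first word of the Unicode name approximates the script.
    return unicodedata.name(ch, "").split(" ")[0] if ch.isalpha() else None


def colorize(name):
    """Wrap each run of same-script letters in a colored span."""
    parts = []
    for script, group in groupby(name, key=script_of):
        text = "".join(group)
        if script is None:
            parts.append(text)  # brackets, digits, spaces stay uncolored
        else:
            color = COLORS.get(script, "DeepSkyBlue")  # simplified fallback
            parts.append(f'<span style="color:{color}">{text}</span>')
    return "".join(parts)


print(colorize("Ab [\u042f] 2"))
```

Because spaces are neutral here, each word becomes its own span; a real implementation might merge adjacent runs of the same script.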