Koha Hackfest 2025 in Marseille
I'm currently sitting in a TGV doing 300 km/h from Marseille to Paris, traveling back home from the Koha Hackfest, hosted by BibLibre.
Results
This year I did a lot of QA, which means reviewing patches, running their test plans, verifying that everything works, and finally signing off the patches and marking the bugs as "Passed QA". The process is documented in the wiki. According to the scoreboard I QA'ed 8 bugs (the second-highest number!). After the third or fourth time I did not even have to look up all the steps anymore.
I moderated a short panel on ElasticSearch, because I had found some weird behaviors on which I needed feedback from the experts. This resulted in a bunch of new "bugs" (Koha speak for issues, in this case a mix of actual bugs and feature requests): 39494, 39548, 39549, 39551, 39552.
I did a rather detailed review of 37020 - bulkmarcimport gets killed when inserting large files. The problem here is that the current code uses MARC::Batch, which does some horrible regex "parsing" of XML to implement a stream parser, so it can handle large files without using ALL the RAM (see the postscript at the end of this post for the gory details). But a recent change added a check step which validates the records and puts the valid ones onto a Perl array, which now again takes up ALL the RAM. I reviewed the two proposed patches, but I think we should use XML::LibXML::Reader directly, which should result in cleaner, faster, less RAM-hungry and correct code.
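For illustration, here's a minimal sketch of the direction I have in mind (not the actual patch; I'm assuming a plain MARCXML collection file called records.xml and MARC::File::XML to inflate the records):

use XML::LibXML::Reader;
use MARC::Record;
use MARC::File::XML;

my $reader = XML::LibXML::Reader->new( location => 'records.xml' )
    or die "cannot open records.xml";

# nextElement() streams through the file, so only the current
# record is held in memory, no matter how big the file is
while ( $reader->nextElement('record') ) {
    my $record = MARC::Record->new_from_xml( $reader->readOuterXml );
    # validate / import $record here, instead of first collecting
    # everything into one huge array
}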
I also participated in various other discussions and hope to have provided some helpful ideas & feedback from my still semi-external Koha perspective and semi-extensive knowledge of other environments and projects (I have been doing this "web dev" stuff for quite some time now...).
After helping Clemens set up L10N on his KTD setup, I submitted a doc patch to KTD explaining the SKIP_L10N setup and hopefully making the general L10N setup a bit clearer. I generally try to improve the docs whenever I hit a problem and was able to fix it. Give it a try next time, it's very rewarding!
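For reference, the relevant bit looks roughly like this in koha-testing-docker's .env file (the exact wording and semantics are what the doc patch is about, so please check the KTD docs for the authoritative version):

# skip the L10N / translation setup for a faster container startup
SKIP_L10N=1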
I could also provide some Perl help to various other attendees. But I still failed most of the questions of Joubu's Perl quiz. My excuse is that I trained my brain on writing only good/sane/nice Perl, so I forgot how to parse all the weird corner cases...
Social
But the Hackfest is not only about hacking, there's also the "fest" part (or party?). I really enjoyed hanging out with the other attendees on the terrace during lunch in the sun. The food was as usual excellent and not too unhealthy (depending, of course, on how much cheese one is able to stack onto one's plate). The evenings at various bars and restaurants were fun and entertaining (even though I did manage to go to bed early enough this year, and hardly had any alcohol).
I did not do any sightseeing or even just walk around Marseille this year. I blame the fact that our hotel was very near to the venue and most of the after-hack locations. And I didn't bring my swimming trunks, so I was not motivated to go to the beach (but I ticked that off last year...)
I had a lot of nice chats with old and new friends on topics ranging from the obvious (A.I., the sorry state of the world, Koha, Perl) to the obscure (US garbage collection trucks, the lifetime of ropes for hand-pulled elevators up to Greek monasteries, sweet potato heritage of Aotearoa, chicken egg sizes, anarcho-syndicalism, ...)
Thanks
Thanks to BibLibre and Paul Poulain for organizing the event, and to all the attendees for making it such a wonderful 4 days!
Postscript: The horrors of MARC::Batch
So, how does MARC::Batch handle importing huge XML files without using too much RAM?
By breaking the first rule of XML handling: It "parses" the XML via regex!
This is actually implemented in MARC::File::XML, namely here. If you have a strong stomach I'll wait for you to take a look at that code.
Here are some "highlights":
## get a chunk of xml for a record
local $/ = 'record>';
my $xml = <$fh>;
This sets the input record separator $/ (usually newline \n, i.e. what tells Perl what it should consider a line) to the string record>, so basically something which looks like an XML tag ending in record. It does NOT include the starting < because the code wants to ignore XML namespaces (this way it matches <record> as well as e.g. <marc:record>).
Then it uses <$fh> to read the next "line" from the file, which isn't a line in any usual sense, but all the bytes up to (and including) the next occurrence of record>.
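You can watch this happen with a tiny self-contained demo (toy data, reading from an in-memory string instead of a huge file):

my $data = '<record>one</record><record>two</record>';
open my $fh, '<', \$data or die $!;
local $/ = 'record>';
my $xml = <$fh>;
# $xml is now '<record>', i.e. everything up to (and including)
# the first occurrence of 'record>': just the opening tag!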
## do we have enough?
$xml .= <$fh> if $xml !~ m!</([^:]+:){0,1}record>$!;
It continues reading until it finds something that looks like a closing </record> tag (which might contain a namespace). Then some more "cleanup" happens, and finally the XML chunk is returned.
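Feeding the toy demo from above through this check shows why it is needed even without any namespaces:

# $xml is '<record>' after the first read; there is no closing tag
# yet, so the regex does not match and we append the next chunk
$xml .= <$fh> if $xml !~ m!</([^:]+:){0,1}record>$!;
# $xml is now '<record>one</record>', one complete record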
Obviously this works, as it is used by thousands of libraries around the world on millions of records all the time.
But still: Uff!