For the Netherlands, Roos Bakker was selected as the PhD student to represent the Dutch consortium in the PhD session at the CLARIN2025 Annual Conference in Vienna. In her PhD, she investigates how Natural Language Processing methods can be used for knowledge graph construction and ontology learning, focusing on the legal and safety domains in Dutch and English. Her work ties in well with CLARIN’s mission to support research with language resources: the Dutch and English datasets will enrich CLARIN’s language resources, and the resulting methods for extraction and evaluation will complement the tools already available in the infrastructure.
Carole Tiberius, CLARIN National Coordinator for the Netherlands
This conversation was led by Laura Gusan
Roos Bakker is a PhD candidate at Leiden University and an AI Research Scientist at the Dutch research institute TNO, where she focuses on Natural Language Processing within neurosymbolic AI applications, such as knowledge graphs and ontologies. Her background is in Linguistics and Artificial Intelligence (Utrecht University). During the CLARIN2025 Annual Conference, Roos received the award for Best PhD Poster.
Can you briefly describe your PhD project and what motivated you to undertake it?
I was motivated to pursue a PhD to deepen my knowledge of both the field and research methodologies. At the same time, I strongly value applied research and the connection to societal impact that working at a research institute offers. Combining both environments allows me to bridge fundamental research with practical applications.
My PhD focuses on the automatic extraction of structured knowledge from text, with a particular emphasis on Dutch-language sources. A vast amount of knowledge is stored in textual form, which makes it difficult for machines to process because natural language is inherently complex and ambiguous. At the same time, the volume of text is far too large for humans to analyse systematically. My research, therefore, explores how we can formalise and structure this textual knowledge so it becomes accessible for computational use.
How, and at what stage of your PhD, did you first encounter CLARIN?
I first encountered CLARIN relatively late in my PhD, about a year ago. My supervisor connected me with a colleague who was already working with CLARIN, and through that contact, I was invited to attend the conference.
At that stage of my research, I was looking for more realistic and representative datasets. Many datasets in my domain consist of relatively simplistic or toy examples, which makes it difficult to properly evaluate methods in settings that reflect real-world complexity. CLARIN provided access to resources and a community that focuses on high-quality language data, which aligned well with the needs of my research.
Can you introduce the poster you presented during the CLARIN Annual Conference?
At the CLARIN2025 Annual Conference, I presented a poster that summarises my PhD research on the automatic extraction and evaluation of structured knowledge from text. The work focuses on how Natural Language Processing techniques, including fine-tuned language models and generative large language models, can support the creation of knowledge graphs and ontologies.
The motivation behind the poster stems from the challenge that large amounts of knowledge are stored in textual data, which is difficult to process computationally, while manually building semantic models is labour-intensive. My research investigates how these processes can be partially automated. The poster presented case studies and datasets in the legal and safety domains and highlighted my work on ontology learning and automatic evaluation metrics for knowledge graphs and ontologies. It directly reflects my doctoral research by bringing together results from multiple chapters and showing how they contribute to making textual knowledge more accessible for computational use.
One of the case studies focuses on the legal domain, where laws are described in lengthy texts with complex terminology. Legal practitioners read these texts and construct their own internal interpretation of the relevant knowledge for a case. Our goal is to make this process more transparent by automatically extracting the core knowledge from legal texts and representing it in a formal model, combining a knowledge graph with an ontology as its schema. These structured representations can support practitioners and make explicit which knowledge and interpretations of the law are used. Our results show that such knowledge can be extracted with reasonable reliability, and our current work focuses on integrating the extracted knowledge into question-answering systems for end users.
Which CLARIN tools or CLARIN-deposited resources have you used and/or contributed to so far in your PhD research?
I have not yet directly used CLARIN tools or CLARIN-deposited resources, mainly because I became familiar with CLARIN relatively late in my PhD. My experience is still exploratory and mostly based on conference exposure and input from colleagues. From a PhD perspective, guidance on where to find resources and how to integrate them into research workflows would be helpful. I have, however, created several annotated Dutch and English datasets for knowledge graph extraction and ontology learning, which could be valuable to share through CLARIN in the future.
How has using CLARIN influenced your PhD research or skill set?
CLARIN has mainly increased my awareness of the importance of sharing datasets and ensuring reproducibility.
Do you see your experience with CLARIN as relevant for your future academic or non-academic career?
Yes. CLARIN supports best practices in language resource sharing and reproducibility, which are highly relevant to my work. I expect to use and recommend it more actively in future research and collaborations.
From your perspective as a PhD researcher, are there types of data, languages, tools, or workflows that are especially important for early-career researchers?
Maybe not so much a type of data or workflow, but an overview of datasets, benchmarks, and the scores reported on them would really help. The source I used for that went offline, and I have been searching for an alternative ever since.
Do you think your national CLARIN node should be better integrated into doctoral training and supervision?
Yes. I was not aware it existed until late in my PhD, so I would appreciate it if supervisors and research groups could introduce it centrally in relevant disciplines at some point during the first year.