Hi there 👋
I'm Stéphan Tulkens! I'm a computational linguistics/AI person. I am currently working as a data scientist at dataprovider.com, where I mainly work on small classifiers that operate on large volumes of web text. I live in beautiful Groningen, together with my partner and twin boy and girl.
I got my PhD at CLiPS at the University of Antwerp under the watchful eyes of Walter Daelemans (Computational Linguistics) and Dominiek Sandra (Psycholinguistics). The topic of my PhD was the way people process orthography during reading. You can find a copy here. Before that I studied computational linguistics (MA), philosophy (BA) and software engineering (BA).
My goal is always to make things as fast and small as possible. I like it when simple models work well, and I love it when simple models get close in accuracy to big models. I do not believe absolute accuracy is a metric to be chased, and I think we should always be mindful of what a model computes or learns from the data.
I’m currently working on 🏃‍♂️ :
- reach: a library for loading and working with word embeddings.
- piecelearn: a library that trains a subword tokenizer and embeddings on the same corpus, giving you open vocabulary embeddings.
- unitoken: a library for easy pre-tokenization.
My research interests 🤖 :
- Tokenizers, specifically subword tokenizers.
- Embeddings, specifically static embeddings (so old-fashioned! 💀), and how to combine these in meaningful ways.
- String similarity, and how to compute it without using dynamic programming.





