Prelude
In case you would like to directly jump into the code, here's the link to the project.
Introduction
While Long Short-Term Memory (LSTM) networks have been outpaced by cutting-edge Transformer models like BERT, BART, and T5, they still offer a competitive solution, especially when resources are limited. For students or hobbyists without access to expensive GPUs, LSTMs remain a viable and practical choice.
Thanks to their ability to handle moderately long-term dependencies, LSTMs are well-suited for various NLP tasks such as sentiment classification (e.g., labeling text as positive, neutral, or negative) and beyond. During my Accelerated Natural Language Processing course as part of my Master’s, one task in particular piqued my interest: Named Entity Recognition (NER).
NER is a great candidate for a starter project as it touches nearly every essential step in a modern training pipeline:
- Preprocessing dataset and embeddings.
- Designing the model structure, along with its input and output format.
- Training the model and saving weights at key checkpoints.
- Evaluating performance on a held-out test set.
- Testing it against real-world input.
So, like any curious AI hobbyist, I decided to build one myself!
I won’t go line-by-line through the code (it’s documented enough for you to explore comfortably), but I’ll walk you through the bigger picture—what the system does, how it works, and what the next steps might look like.
Glossary
Named Entity Recognition
Named Entity Recognition (NER) is the process of using a trained machine learning or statistical model to identify and tag named entities in a sentence—such as people, organizations, locations, and more. This is commonly done using the BIO tagging scheme, where:
- B = Beginning of an entity
- I = Inside (continuation) of an entity
- O = Outside (not part of any entity)
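For instance, here is how a short sentence would be labelled under this scheme (the sentence itself is just an illustration):

```python
tokens = ["Sundar", "Pichai", "works", "at", "Google", "in", "California"]
tags   = ["B-PER",  "I-PER",  "O",     "O",  "B-ORG",  "O",  "B-LOC"]
```

Note how the two-token name is marked with a `B-PER` followed by an `I-PER`, which is exactly the kind of continuation pattern the model has to learn.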
While large language models (LLMs) have made traditional NER pipelines less prominent in high-resource settings, NER remains highly relevant in resource-constrained environments, privacy-sensitive apps, or low-latency settings, and often serves as a valuable pre-processing step for downstream NLP tasks.
BiDirectional LSTM and CRF
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) that incorporate memory gates to selectively retain important sequential information. These gates help LSTMs overcome the vanishing gradient problem, making them well-suited for learning long-range dependencies in text.
A bidirectional LSTM (BiLSTM) reads the input sequence in both forward and backward directions, allowing it to capture context from both the left and right of each token. This richer understanding is especially valuable for tasks like NER, where the meaning of a word can depend on its surrounding words.
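As a concrete reference point, here is a minimal PyTorch sketch of such a tagger. The layer names and sizes are illustrative, not the exact module this project ships:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM token classifier: embeddings -> BiLSTM -> per-tag scores."""
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True concatenates forward and backward hidden states,
        # so the output dimension is 2 * hidden_dim.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):                 # (batch, seq_len)
        embeds = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embeds)           # (batch, seq_len, 2*hidden_dim)
        return self.fc(lstm_out)                  # (batch, seq_len, num_tags)
```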
While an LSTM alone can be used to generate BIO tags (it is essentially yet another classification task after all), adding a Conditional Random Field (CRF) layer on top improves the consistency of the output. The CRF considers the dependencies between predicted tags, enforcing constraints like:
- An I- tag should not follow an O or a mismatched B- tag.
- Tag sequences like B-ORG I-PER are discouraged if not statistically supported.
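One common way to stack a CRF on top is the pytorch-crf package. The snippet below continues from the `BiLSTMTagger` sketch above; the package choice and the `token_ids`/`gold_tags`/`mask` tensors are assumptions for illustration, not necessarily what this project uses:

```python
from torchcrf import CRF  # assumption: pip install pytorch-crf

crf = CRF(num_tags, batch_first=True)

# Training: the CRF scores whole tag sequences, so the loss is the
# negative log-likelihood of the gold tags given the BiLSTM emissions.
emissions = model(token_ids)                   # (batch, seq_len, num_tags)
loss = -crf(emissions, gold_tags, mask=mask)   # mask marks real (non-pad) tokens

# Inference: Viterbi decoding returns the single best tag sequence,
# respecting learned transition scores (e.g. O -> I-PER is penalized).
best_paths = crf.decode(emissions, mask=mask)  # list of tag-id lists
```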
Embeddings
Since neural networks cannot operate directly on raw text—and unlike Transformer models, cannot generate contextualized embeddings on their own—we rely on pre-trained static word embeddings like GloVe. For this project, we use the 100-dimensional GloVe vectors.
GloVe captures semantic meaning from global word co-occurrence statistics: words that appear in similar neighborhoods end up with similar vectors. The catch is that a word like bank will have a single embedding that blends all of its senses, since the representation is not contextualized.
While this limitation makes GloVe less precise than Transformer-based embeddings, its pre-trained and lightweight nature makes it perfect for quick experimentation in resource-constrained settings like this one.
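Loading the vectors boils down to filling an embedding matrix row by row. A rough sketch, assuming the standard glove.6B.100d.txt file and a `word_to_idx` vocabulary mapping (both hypothetical names here):

```python
import numpy as np
import torch

def load_glove(path, word_to_idx, embed_dim=100):
    """Build an embedding matrix aligned with the vocabulary.
    Words missing from GloVe keep a small random vector."""
    matrix = np.random.normal(scale=0.1, size=(len(word_to_idx), embed_dim))
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in word_to_idx:
                matrix[word_to_idx[word]] = np.asarray(vec, dtype=np.float32)
    return torch.tensor(matrix, dtype=torch.float32)

# e.g. model.embedding.weight.data.copy_(load_glove("glove.6B.100d.txt", word_to_idx))
```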
Project
Structure
The project is organized into cohesive modules (I hope so!), including utility functions, the model definition, training and validation loop wrappers, and saved model weights from my training runs. You can download the dataset and embeddings to train the model yourself, or evaluate the pre-trained weights to see how it performs. All core functionality is orchestrated from the Jupyter notebook `main.ipynb`.
Setup
The setup is straightforward. Start with a fresh Python environment (I recommend conda, but venv will work just as well). Install the dependencies listed in `requirements.txt`. If you're using a GPU for training or inference, make sure to adjust the PyTorch + CUDA version in the requirements to match your specific hardware (the boring part).
Run Through
The provided Jupyter notebook follows the standard order of an NLP training pipeline:
- Downloads the dataset from publicly available sources.
- Loads the data from disk into Python arrays.
- Constructs a PyTorch Dataset and DataLoader for training, validation, and testing.
- Trains the model on the training set and validates it on the validation set, using early stopping and checkpointing (a condensed sketch follows this list).
- Evaluates the model’s performance on a held-out test set, using standard measures of precision, recall and F1.
- Demonstrates how to tag a real-world sentence using the trained model.
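To make the training step concrete, here is a condensed sketch of the early-stopping and checkpointing logic. Variable names, hyperparameters, and the `loss_fn` (e.g. the CRF negative log-likelihood from the earlier sketch) are illustrative, not the notebook's exact code:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
best_val_loss, patience, bad_epochs = float("inf"), 3, 0

for epoch in range(50):
    model.train()
    for token_ids, gold_tags, mask in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(token_ids), gold_tags, mask)
        loss.backward()
        optimizer.step()

    # Validation pass decides checkpointing and early stopping.
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(t), g, m).item()
                       for t, g, m in val_loader) / len(val_loader)

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping: no improvement for 3 epochs
            break
```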
Discussion
Results
Below are the classification metrics generated using the `scikit-learn` package for each tag in the dataset:
| Tag | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| B-LOC | 0.76 | 0.85 | 0.80 | 1668 |
| B-MISC | 0.72 | 0.63 | 0.67 | 702 |
| B-ORG | 0.74 | 0.70 | 0.72 | 1661 |
| B-PER | 0.87 | 0.80 | 0.83 | 1617 |
| I-LOC | 0.75 | 0.33 | 0.46 | 257 |
| I-MISC | 0.67 | 0.41 | 0.51 | 216 |
| I-ORG | 0.66 | 0.32 | 0.43 | 835 |
| I-PER | 0.81 | 0.72 | 0.76 | 1156 |
| O | 0.97 | 0.99 | 0.98 | 38323 |
Overall Accuracy: 0.94
Macro Average F1: 0.69
Weighted Average F1: 0.93
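For reference, a report like this comes from flattening the gold and predicted tag sequences and handing them to scikit-learn's `classification_report`; a minimal sketch with illustrative variable names:

```python
from sklearn.metrics import classification_report

# Flatten per-sentence tag sequences into one token-level list per side.
y_true = [tag for sent in gold_sequences for tag in sent]
y_pred = [tag for sent in predicted_sequences for tag in sent]
print(classification_report(y_true, y_pred, digits=2))
```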
Explanation
The model achieves high overall accuracy, largely due to the overwhelming presence of the `O` tag (non-entity tokens), which dominates the dataset. While entity-specific tags like `B-LOC`, `B-PER`, and `B-ORG` perform quite well, the lower recall on `I-` tags (especially `I-LOC` and `I-ORG`) suggests the model occasionally struggles to correctly continue multi-token entities.
Despite this, the model offers solid performance for a lightweight, resource-efficient system!
Improvements
There are a few simple ways to improve this system:
- Increase the LSTM hidden size.
- Use a higher-dimensional GloVe embedding.
- Use contextual embeddings like BERT.
- Expand or fine-tune on a domain-specific dataset.
Each of these comes with tradeoffs, mainly in computation, memory, and training time.
Conclusion
This was my first fully self-directed NLP project from scratch, and it taught me far more than I expected. From sourcing and preprocessing the dataset, to designing the model architecture, and building the training and evaluation pipeline, it was a refreshing departure from university coursework, where datasets are usually pre-cleaned and boilerplate code is provided.
This project gave me a hands-on appreciation for the full ML workflow and all the real-world messiness that comes with it. I'm looking forward to building on this foundation with more advanced models and use cases, with a holistic view into deployment and serving as well!
Do let me know if you have any ideas or suggestions!