Investigating the Impact of Sentiment Label Imbalance on Model Performance in Bangla Sentiment Analysis

Research Problem

Sentiment analysis models often suffer from degraded performance when trained on datasets with imbalanced class distributions, where certain sentiment labels (e.g., Positive, Negative, or Neutral) dominate. In the context of the Bangla Sentiment Dataset, which includes diverse sources like newspapers, social media, and blogs, such imbalances are likely due to the real-world nature of the data (e.g., social media may skew Negative due to criticism). This issue is particularly pronounced in low-resource languages like Bangla, where limited labeled data exacerbates the challenge, yet little research has explored effective mitigation strategies for Bangla sentiment analysis.

Research Aim

To investigate the impact of class imbalance on the performance of sentiment classification models using the Bangla Sentiment Dataset and to evaluate the effectiveness of techniques such as oversampling, undersampling, and weighted loss functions in improving model accuracy and robustness across diverse Bangla text sources.
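As a rough illustration of the resampling techniques named above, the sketch below implements random oversampling and random undersampling using only the Python standard library. The function names `random_oversample` and `random_undersample` and the toy data are hypothetical and are not taken from this repository's code.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(texts, labels):
    """Group texts by their sentiment label."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return groups


def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = max(len(items) for items in groups.values())
    out_texts, out_labels = [], []
    for label, items in groups.items():
        # Draw extra samples with replacement from the same class
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def random_undersample(texts, labels, seed=42):
    """Drop majority-class samples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = min(len(items) for items in groups.values())
    out_texts, out_labels = [], []
    for label, items in groups.items():
        # Keep a random subset of each class, without replacement
        for text in rng.sample(items, target):
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy data (hypothetical): 0 = Negative, 1 = Positive, 2 = Neutral
texts = ["n1", "n2", "n3", "n4", "p1", "p2", "u1"]
labels = [0, 0, 0, 0, 1, 1, 2]

_, over_labels = random_oversample(texts, labels)
_, under_labels = random_undersample(texts, labels)
print(Counter(over_labels))   # class counts after oversampling
print(Counter(under_labels))  # class counts after undersampling
```

Resampling of this kind is normally applied to the training split only, so that duplicated or discarded samples do not distort the evaluation data.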

Research Question

How does class imbalance in the Bangla Sentiment Dataset affect sentiment classification performance, and can techniques like oversampling (e.g., SMOTE), undersampling, or weighted loss functions mitigate these effects to improve model accuracy and generalization?
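One common way to set up a weighted loss is to weight each class inversely to its frequency. The sketch below computes such weights as n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name `balanced_class_weights` and the label counts are illustrative assumptions, not figures from this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution: 0 = Negative, 1 = Positive, 2 = Neutral
labels = [0] * 60 + [1] * 30 + [2] * 10
weights = balanced_class_weights(labels)
for label in sorted(weights):
    # Rarer classes receive larger weights
    print(label, round(weights[label], 3))
```

Weights like these could then be handed to a loss function that accepts per-class weights, for example the `weight` argument of `torch.nn.CrossEntropyLoss` if a PyTorch model is used.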

Installation

To set up the project, clone the repository and run the folder_structure_setup.sh script to create the necessary directories and files.

To create a virtual environment:

python3 -m venv venv

To activate the Python virtual environment, navigate to the project directory in your terminal and run the following command:

source venv/bin/activate
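If it is unclear whether the environment was activated, a quick check from Python is possible. This is a small sketch and not part of the repository; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```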

To install the project in editable mode together with the packages listed in requirements.txt, run the following command:

pip install -e .

Usage

First, run the setup script to initialize the project:

python3 setup.py

Dataset

The Bangla Sentiment Dataset is a curated collection of sentiment-rich textual data in Bangla, focused on recent and trending topics. This dataset has been compiled from diverse sources, including Bangladeshi online newspapers, social media platforms, and blogs, ensuring a wide spectrum of language styles and sentiment expressions.

Key Features

Focus on Recent Topics

The dataset emphasizes contemporary issues, trending discussions, and popular topics in Bangladeshi society. This includes sentiments on political developments, social movements, entertainment, cultural events, and other recent happenings.

Source Variety

Online Newspapers: Articles, editorials, headlines, and reader comments provide structured and semi-formal sentiment data.
Social Media: Posts, tweets, and comments reflect informal, conversational language with high emotional expressiveness.
Blogs: Opinion pieces and discussions offer detailed and context-rich sentiment content.

Sentiment Labels

Each entry in the dataset is annotated with one of the following sentiment categories:

Positive (1): Texts expressing happiness, agreement, or optimism.
Negative (0): Texts reflecting criticism, disagreement, or pessimism.
Neutral (2): Texts presenting balanced or factual statements with minimal emotional bias.
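To make the label encoding concrete, the sketch below maps the integer labels to their names and reports the class distribution along with an imbalance ratio (largest class count divided by smallest). `LABEL_NAMES`, `label_distribution`, and the example labels are hypothetical; the actual column names and class counts of the dataset are not stated here.

```python
from collections import Counter

# Label encoding as described above
LABEL_NAMES = {0: "Negative", 1: "Positive", 2: "Neutral"}


def label_distribution(labels):
    """Return ({name: (count, share)}, imbalance_ratio) for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    dist = {LABEL_NAMES[k]: (v, v / total) for k, v in counts.items()}
    # Ratio of the most frequent class to the least frequent class
    ratio = max(counts.values()) / min(counts.values())
    return dist, ratio


# Made-up labels for illustration only
example_labels = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
dist, ratio = label_distribution(example_labels)
for name, (count, share) in dist.items():
    print(f"{name}: {count} ({share:.0%})")
print("imbalance ratio:", ratio)
```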

Linguistic and Stylistic Diversity

The dataset captures a range of Bangla language variations, including:

Formal and informal Bangla usage
Regional dialects
Transliterated Bangla (Banglish) commonly used on social media
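Because transliterated Bangla is written in Latin characters, a simple script check can help separate it from text in Bengali script during preprocessing. The sketch below relies on the Unicode Bengali block (U+0980 to U+09FF); the function name `script_of` and the 0.5 threshold are illustrative assumptions, and mixed-script text may need a finer rule.

```python
def script_of(text, threshold=0.5):
    """Classify text as 'bengali', 'latin', or 'other' by its alphabetic characters."""
    # str.isalpha() keeps base letters; combining vowel signs are not counted
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "other"
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    latin = sum(1 for ch in letters if ch.isascii())
    if bengali / len(letters) >= threshold:
        return "bengali"
    if latin / len(letters) >= threshold:
        return "latin"
    return "other"


samples = ["আমি ভালো আছি", "ami bhalo achi"]
for sample in samples:
    print(sample, "->", script_of(sample))
```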

Real-World Context

The inclusion of recent topics ensures that the dataset is relevant for analyzing public sentiment around current events and trends. This makes it particularly useful for real-time sentiment analysis applications.

This dataset provides an invaluable resource for researchers and practitioners aiming to explore sentiment analysis in Bangla, with a special emphasis on modern-day relevance and real-world applicability.

Directory Structure

├── data-source
├── src
│   ├── __init__.py
│   ├── components
│   │   ├── __init__.py
│   │   ├── data_ingestion.py
│   │   ├── data_transformation.py
│   │   ├── model_monitoring.py
│   │   └── model_trainer.py
│   ├── exception.py
│   ├── logger.py
│   ├── pipelines
│   │   ├── __init__.py
│   │   ├── prediction_pipeline.py
│   │   └── training_pipeline.py
│   └── utils.py
├── .gitignore
├── main.py
├── app.py
├── EDA.ipynb
├── README.md
├── requirements.txt
├── folder_structure_setup.sh
├── test-logging-integration.py
└── test-request.py

coder7475/sentiment_analysis_bangla
