Investigating the Impact of Sentiment Label Imbalance on Model Performance in Bangla Sentiment Analysis
Sentiment analysis models often suffer from degraded performance when trained on datasets with imbalanced class distributions, where certain sentiment labels (e.g., Positive, Negative, or Neutral) dominate. In the context of the Bangla Sentiment Dataset, which includes diverse sources like newspapers, social media, and blogs, such imbalances are likely due to the real-world nature of the data (e.g., social media may skew Negative due to criticism). This issue is particularly pronounced in low-resource languages like Bangla, where limited labeled data exacerbates the challenge, yet little research has explored effective mitigation strategies for Bangla sentiment analysis.
To investigate the impact of class imbalance on the performance of sentiment classification models using the Bangla Sentiment Dataset and to evaluate the effectiveness of techniques such as oversampling, undersampling, and weighted loss functions in improving model accuracy and robustness across diverse Bangla text sources.
How does class imbalance in the Bangla Sentiment Dataset affect sentiment classification performance, and can techniques like oversampling (e.g., SMOTE), undersampling, or weighted loss functions mitigate these effects to improve model accuracy and generalization?
To set up the project, clone the repository and run the folder_structure_setup.sh
script to create the necessary directories and files.
To install virtual environment:
python3 -m venv venv
To activate the Python virtual environment, navigate to the project directory in your terminal and run the following command:
source venv/bin/activate
To install all packages from requirements.txt, run the following command:
pip install -e .
First run the the setup python file to initialize:
python3 setup.py
The Bangla Sentiment Dataset is a curated collection of sentiment-rich textual data in Bangla, focused on recent and trending topics. This dataset has been compiled from diverse sources, including Bangladeshi online newspapers, social media platforms, and blogs, ensuring a wide spectrum of language styles and sentiment expressions.
The dataset emphasizes contemporary issues, trending discussions, and popular topics in Bangladeshi society. This includes sentiments on political developments, social movements, entertainment, cultural events, and other recent happenings.
- Online Newspapers: Articles, editorials, headlines, and reader comments provide structured and semi-formal sentiment data.
- Social Media: Posts, tweets, and comments reflect informal, conversational language with high emotional expressiveness.
- Blogs: Opinion pieces and discussions offer detailed and context-rich sentiment content.
Each entry in the dataset is annotated with one of the following sentiment categories:
- Positive (1): Texts expressing happiness, agreement, or optimism.
- Negative (0): Texts reflecting criticism, disagreement, or pessimism.
- Neutral (2): Texts presenting balanced or factual statements with minimal emotional bias.
The dataset captures a range of Bangla language variations, including:
- Formal and informal Bangla usage
- Regional dialects
- Transliterated Bangla (Banglish) commonly used on social media
The inclusion of recent topics ensures that the dataset is relevant for analyzing public sentiment around current events and trends. This makes it particularly useful for real-time sentiment analysis applications.
This dataset provides an invaluable resource for researchers and practitioners aiming to explore sentiment analysis in Bangla, with a special emphasis on modern-day relevance and real-world applicability.
├── data-source
├── src
│ ├── __init__.py
│ ├── components
│ │ ├── __init__.py
│ │ ├── data_ingestion.py
│ │ ├── data_transformation.py
│ │ ├── model_monitoring.py
│ │ └── model_trainer.py
│ ├── exception.py
│ ├── logger.py
│ ├── pipelines
│ │ ├── __init__.py
│ │ ├── prediction_pipeline.py
│ │ └── training_pipeline.py
│ └── utils.py
├── .gitignore
├── main.py
├── app.py
├── EDA.ipynb
├── README.md
├── requirements.txt
├── folder_structure_setup.sh
├── test-logging-integration.py
└── test-request.py