The toughest “basic” questions
If you’re a data science fresher gearing up for interviews, this roadmap of questions (and mini‑hints) will test your conceptual clarity in ML, NLP, Statistics, and Classification.
Machine Learning Questions (Tricky but Basic)
Why prefer cross‑validation over a simple train–test split?
Context: Cross-validation (e.g., k-fold) reduces variance in performance estimates by averaging across multiple splits. It also helps detect data leakage or unstable models.
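A quick sketch of the idea with scikit-learn (the built-in iris dataset is used purely for illustration): `cross_val_score` returns one score per fold, so you get a mean and a spread instead of a single, possibly lucky, number.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Five held-out estimates instead of one: the spread shows how much
# a single train-test split could mislead you.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```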
How does increasing the number of hidden layers in a neural network impact performance?
Hint: More layers can learn more complex features, but may cause vanishing gradients or overfitting, and typically require more data and regularization.
Why is feature scaling important for KNN and SVM, but not for decision trees?
Explanation: Distance-based (KNN) and margin-based (SVM) algorithms are sensitive to feature magnitudes. Tree splits use per-feature thresholds, which are unaffected by scale.
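A minimal sketch of why this matters for KNN, on synthetic data where the second feature is pure noise measured on a huge scale (training accuracy is used only for brevity; exact numbers will vary with the seed):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Second feature is irrelevant noise, but ~1000x larger in magnitude.
X = np.c_[rng.normal(0, 1, 200), rng.normal(0, 1000, 200)]
y = (X[:, 0] > 0).astype(int)  # the label depends only on the small-scale feature

knn_raw = KNeighborsClassifier().fit(X, y)       # distances dominated by the noise feature
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X, y)
print(knn_raw.score(X, y), knn_scaled.score(X, y))  # scaled version scores markedly higher
```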
What's the difference between underfitting and overfitting? Can a model be both simultaneously?
Note: Underfitting = high bias, poor train and test accuracy. Overfitting = high variance, good train but poor test accuracy. In complex scenarios, a model can underfit some regions of the data while overfitting others.
Why might you sometimes prefer simpler models (like logistic regression) over deep networks?
Considerations: Interpretability, faster training and inference, smaller data requirements, and lower risk of overfitting.
How does the learning rate impact gradient descent? What if it's too high or too low?
Impact: Too high → divergence or oscillation. Too low → painfully slow convergence or getting stuck in local minima.
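You can see these regimes with a toy gradient descent on f(w) = w², whose minimum is at w = 0:

```python
# Toy gradient descent on f(w) = w**2, whose gradient is 2w.
def gradient_descent(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(gradient_descent(lr=1.1))    # too high: |w| grows every step -> diverges
print(gradient_descent(lr=0.001))  # too low: after 20 steps, barely moved toward 0
print(gradient_descent(lr=0.1))    # reasonable: converges quickly toward 0
```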
Why do deep learning models typically require large datasets?
Reason: Millions of parameters need sufficient examples to avoid overfitting and learn generalizable patterns.
What happens if you initialize all weights to zero in a neural network?
Consequence: The symmetry problem—every neuron in a layer computes the same output and receives the same update, so no meaningful representation is learned.
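A small numpy sketch of the symmetry problem. Constant initialization is used here so the effect is visible; zero initialization is the extreme case where the first-layer gradients are also exactly zero.

```python
import numpy as np

# Tiny 2-layer regression net where every weight starts at the same value.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

W1 = np.full((3, 5), 0.5)   # identical columns -> identical hidden units
W2 = np.full((5, 1), 0.5)

h = np.tanh(X @ W1)                               # all 5 hidden columns are the same
err = h @ W2 - y
grad_W1 = X.T @ ((err @ W2.T) * (1 - h**2))       # backprop through tanh

# Every hidden unit is identical, and so is its gradient -> they stay identical forever.
print(np.allclose(h[:, :1], h), np.allclose(grad_W1[:, :1], grad_W1))  # True True
```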
Why is dropout used in deep learning, and how does it help prevent overfitting?
Mechanism: Randomly "drops" neurons during training, forcing the network to build redundant representations and improving generalization.
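A minimal "inverted dropout" sketch in numpy, assuming the common convention of rescaling the surviving activations during training so that inference needs no adjustment:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, seed=None):
    if not training:
        return activations  # inference: use the full network as-is
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p_drop   # keep each unit with prob 1 - p_drop
    return activations * mask / (1.0 - p_drop)       # rescale so expected activation is unchanged

a = np.ones((1, 8))
print(dropout(a, seed=0))  # roughly half the units zeroed, survivors scaled to 2.0
```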
How does batch normalization improve training stability in deep networks?
Benefits: Normalizes layer inputs to reduce internal covariate shift, allows higher learning rates, and acts as a mild regularizer.
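The training-time computation is short enough to sketch in numpy (the running mean/variance used at inference time is omitted here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch, then apply learnable scale/shift.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=50.0, scale=10.0, size=(32, 4))
out = batch_norm(x)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```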
NLP Questions
If a chatbot keeps misinterpreting queries, what are possible causes?
Examples: Poor tokenization, out-of-vocabulary words, lack of contextual embeddings, ambiguous intent detection.
How does an attention mechanism help transformers?
Function: Computes relevance scores between tokens, allowing the model to focus on the most important parts of the input sequence.
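The core computation is scaled dot-product attention, softmax(QKᵀ/√d_k)·V; a minimal self-attention sketch in numpy:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # relevance-weighted mix of values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))   # 4 tokens, 8-dim embeddings (self-attention)
print(attention(Q, K, V).shape)       # (4, 8): one context vector per token
```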
Why is one-hot encoding not ideal for large vocabularies?
Drawbacks: Extremely high dimensionality, sparse vectors, and no notion of semantic similarity between words.
How does BERT differ from Word2Vec?
Difference: BERT is a bidirectional, context-aware transformer pre-trained with masked language modeling; Word2Vec learns static word vectors via a shallow neural network, so each word gets one vector regardless of context.
Why is NER difficult in multilingual models?
Challenges: Varying entity formats, shared subword vocabularies, language-specific name/entity patterns, and data scarcity in low-resource languages.
Why might TF-IDF fail to capture sentence meaning?
Limitations: It ignores word order, context, and polysemy—each token is treated independently and weighted the same way in every context.
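A short demonstration with scikit-learn: two sentences with opposite meanings get identical TF-IDF vectors, because only the bag of words matters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same words, opposite meanings -> identical TF-IDF representations.
docs = ["man bites dog", "dog bites man"]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print((tfidf[0] == tfidf[1]).all())  # True: word order is lost
```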
Statistics & Probability Questions
Why does correlation not imply causation? Give an example.
Example: Ice cream sales and drowning rates correlate (both rise in summer), but one doesn't cause the other.
With extreme outliers, why might the median be better than the mean?
Reason: The median is robust to extreme values, reflecting the "middle" of the data without being skewed.
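A tiny numeric example (hypothetical salaries in $k):

```python
import numpy as np

# One extreme outlier drags the mean far from the bulk of the data;
# the median barely notices.
salaries = np.array([40, 42, 45, 48, 50, 1000])  # one executive outlier
print(np.mean(salaries))    # ~204.2 -- representative of nobody
print(np.median(salaries))  # 46.5  -- still describes a typical salary
```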
What is Simpson's paradox, and how can it mislead?
Definition: A trend that appears in every subgroup reverses when the groups are combined—beware of aggregation bias.
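An illustration in pandas, with numbers patterned on the classic kidney-stone study: treatment A wins within each subgroup yet loses on the pooled totals, because A was mostly assigned the harder cases.

```python
import pandas as pd

df = pd.DataFrame({
    "group":     ["small", "small", "large", "large"],
    "treatment": ["A", "B", "A", "B"],
    "success":   [81, 234, 192, 55],
    "total":     [87, 270, 263, 80],
})
df["rate"] = df["success"] / df["total"]
print(df)  # A beats B in both groups (93% vs 87%, 73% vs 69%)

# Pooling reverses the conclusion because A handled far more "large" (hard) cases.
pooled = df.groupby("treatment")[["success", "total"]].sum()
print(pooled["success"] / pooled["total"])  # pooled: A 78% < B 83%
```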
Why check for multicollinearity in regression?
Issue: Highly correlated predictors inflate the variance of coefficient estimates, making them unstable and hard to interpret.
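A quick check using statsmodels' variance inflation factor on synthetic data where one predictor is nearly a copy of another:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # almost a duplicate of x1
x3 = rng.normal(size=200)                   # independent predictor
X = np.column_stack([x1, x2, x3])

for i in range(X.shape[1]):
    print(f"VIF x{i+1}: {variance_inflation_factor(X, i):.1f}")
# x1 and x2 come out far above the common rule-of-thumb threshold of ~10;
# x3 stays near 1.
```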
What is heteroscedasticity, and why is it problematic in regression?
Problem: Non-constant error variance violates OLS assumptions—leads to inefficient estimates and invalid inference.
Classification Questions
Why is accuracy not always a good metric?
Scenario: In imbalanced datasets (e.g., fraud detection), a naive classifier can achieve high accuracy by always predicting the majority class.
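A sketch with scikit-learn's DummyClassifier on a synthetic 99:1 problem (the feature values are irrelevant here, which is exactly the point):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives ("fraud")
X = rng.normal(size=(10_000, 3))             # features carry no signal

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print(accuracy_score(y, pred))  # ~0.99 -- looks great
print(recall_score(y, pred))    # 0.0   -- catches zero fraud cases
```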
Precision vs. recall—when should you prioritize which?
Guideline: Prioritize precision when false positives are costly (spam filtering). Prioritize recall when false negatives are critical (disease screening).
Why doesn't more data always improve classification performance?
Reasons: Noisy or irrelevant data, label errors, or a model already at its capacity limit—garbage in, garbage out.
Why use softmax instead of sigmoid for multi-class classification?
Reason: Softmax outputs a normalized probability distribution over classes (summing to 1), while sigmoid treats each class independently.
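A quick numpy comparison of the two activations on the same logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits), softmax(logits).sum())   # one coupled distribution, sums to 1.0
print(sigmoid(logits), sigmoid(logits).sum())   # each in (0, 1), but the sum is not 1
```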
What happens if logistic regression is trained on highly correlated features?
Effect: Multicollinearity causes unstable coefficients and inflated standard errors—consider regularization (L1/L2) or feature selection.
Conclusion
This collection covers core yet tricky questions that probe your understanding of:
- Machine Learning: model evaluation, optimization, regularization
- NLP: text representation, contextual models, language understanding
- Statistics: inference pitfalls, robust measures, regression assumptions
- Classification: metrics, probability interpretations, real‑world trade‑offs
_Prepare sample answers, illustrate with diagrams or mini code snippets, and back your explanations with real-world examples._
_Good luck with your interviews!_