Useful-Cheatsheets/machine-learning-interveiw.md at main · HusseinLezzaik/Useful-Cheatsheets

Interviews Q&A

Don't use accuracy on an imbalanced dataset; it is not a good performance metric for these problems. Instead, use precision, recall, F-score, the confusion matrix, or ROC curves
Collect more data in order to balance your data
Augment the dataset with synthetic data
Resample your data
Use cost-sensitive learning. Giving minority-class errors a higher cost (for example, through class weights) re-weighs the results of a model trained on imbalanced data, so that the rare class is not ignored
Check different algorithms. Decision trees often handle imbalanced classes well, and they are a good fit for structured (tabular) data.
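
To make the first and fifth points concrete, here is a minimal sketch in plain Python. The toy labels and the helper names (`balanced_class_weights`, `accuracy`, `recall`) are assumptions made for illustration. The weighting formula, `n_samples / (n_classes * class_count)`, is the common "balanced" heuristic; it is one possible way to set misclassification costs, not the only one.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))        # 0.95, looks good
print(recall(y_true, y_pred))          # 0.0, finds no positives at all
print(balanced_class_weights(y_true))  # the rare class gets the larger weight
```

The always-negative model scores 95% accuracy while missing every positive, which is why accuracy alone is misleading here. The class weights could then be passed to any learner that accepts per-class or per-sample weights.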

Patterns: there must be patterns to learn
Complex: the patterns are complex
Existing data: it's possible to collect
Predictive: it's a predictive problem, so ask what the answer/solution would look like

Most models can perform well even without fine-tuning, and you can then push their performance further
Tabular: XGBoost/LightGBM/RF
Time series: XGBoost/LightGBM/RF
Image: ResNet18/EffNet
Text: DistilRoBERTa
Audio: ResNet/EffNet

Supervised Classification algorithm
You need labeled data and want to classify an unlabeled point into the class of its closest labeled points (thus the name nearest neighbor)
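
A minimal sketch of that idea, assuming Euclidean distance and a majority vote among the `k` closest labeled points. The function name `knn_predict`, the toy points, and the default `k=3` are illustrative choices, not part of the original notes.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy labeled data: two well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # a
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # b
```

There is no training step beyond storing the labeled points; all the work happens at prediction time.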

Recall: also known as the true positive rate, the number of positives your model correctly identifies compared to the actual number of positives in the data: TP / (TP + FN)
Precision: also known as positive predictive value, the number of correct positives your model claims compared to the total number of positives it claims: TP / (TP + FP)
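
As a worked example, the sketch below computes both metrics, plus the F1 score (their harmonic mean), from raw counts. The counts are made up for illustration: the model flags 10 cases as positive, 8 of them correctly, and misses 4 real positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)

print(round(precision, 3))  # 0.8
print(round(recall, 3))     # 0.667
print(round(f1, 3))         # 0.727
```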

Type I: false positive, claiming something has happened when it hasn't, telling a man he is pregnant
Type II: false negative, claiming nothing is happening when something is, telling a pregnant woman she isn't carrying a baby
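
The two error types correspond to two cells of the confusion matrix. A minimal sketch that tallies them from binary labels; the function name `confusion_counts` and the toy labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1  # Type I error: claimed positive, actually negative
        elif truth == positive:
            counts["fn"] += 1  # Type II error: claimed negative, actually positive
        else:
            counts["tn"] += 1
    return counts


# Toy labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
# {'tp': 2, 'fp': 1, 'fn': 2, 'tn': 3}
```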