- 0. Machine Learning Algorithms
- Linear Model
- Logistic Regression, Support Vector Machine
- Tools) scikit-learn, Vowpal Wabbit
- Tree-based
- Decision Tree, Random Forest, GBDT
- Tools) scikit-learn, XGBoost, LightGBM
- KNN-based methods
- Tools) scikit-learn
- Neural Networks
- Tools) TensorFlow, Keras, MXNet, PyTorch, Lasagne
- 1. Data Check
- Check List
- Target Metric
- Domain Knowledge
- + How the data was generated
- 2. Data Pre-processing : Based on what the data check showed, adjust the pre-processing to fit the data
- Feature checking
- Types : Categorical/Numerical/Text, etc.
- Date > use gaps (time differences between dates)
- Coordinates > use latitudes and longitudes
- Anonymized > try to decode / check correlations
- Scaling > Standardization
- Missing Values (NaNs) > isnull indicator feature, native NaN handling (in XGBoost), or a sentinel fill such as -999, -1, etc.
- Outliers > just keep them in mind
- Feature generation
- Label Encoding > replace category symbols with integer codes
- Frequency Encoding > replace each category with its frequency in the data
- Feature Extraction
- Text > Vector
- Bag of Words
- Embeddings (e.g. word2vec)
- Text pre-processing
- Considerations : lowercasing/lemmatization/stemming/stopwords
- Image > Vector
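A minimal sketch of a few of the pre-processing steps above, assuming pandas and a hypothetical toy table (the column names `city`, `price`, `signup`, and `purchase` are made up for illustration and do not come from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name here is an assumption.
df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "a", "b"],
    "price": [10.0, np.nan, 12.5, 8.0, np.nan, 11.0],
    "signup": pd.to_datetime([
        "2020-01-01", "2020-01-03", "2020-01-10",
        "2020-02-01", "2020-02-02", "2020-03-01",
    ]),
    "purchase": pd.to_datetime([
        "2020-01-05", "2020-01-03", "2020-01-20",
        "2020-02-02", "2020-02-09", "2020-03-31",
    ]),
})

# Label encoding: integer codes in order of first appearance.
codes, uniques = pd.factorize(df["city"])
df["city_label"] = codes

# Frequency encoding: share of rows in which each category occurs.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Missing values: keep an isnull indicator, then fill with a sentinel.
df["price_isnull"] = df["price"].isnull().astype(int)
df["price_filled"] = df["price"].fillna(-999)

# Date feature: gap between two dates, in days.
df["days_to_purchase"] = (df["purchase"] - df["signup"]).dt.days

print(df[["city_label", "city_freq", "price_isnull",
          "price_filled", "days_to_purchase"]])
```

A sentinel such as -999 tends to suit tree-based models, which can isolate it with a split; for linear models or neural networks it would distort scaling, so the indicator column plus a mean or median fill may be the safer choice there.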
- 3. EDA
- Tool : Visualization
- Using histograms, plots
- Explore feature relations
- 4. Building Model + Validation
- Building Model
- Metrics Optimization
- Regression metrics
- MSE
- RMSE
- R^2
- MAE
- MSPE
- MAPE
- RMSLE
- Classification metrics
- Accuracy
- Logarithmic Loss
- Validation
- Tools
- Holdout
- LOO
- K-fold
- Splitting Data
- Random, Row-wise : rows must be independent
- Time-wise > Moving Window Validation
- By ID
- Combined
- Problems
- Too limited data
- Too diverse / inconsistent data
- Problems
- Handling values that are not present in the train data
- Infer from the rows that share the same value in that feature
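The splitting strategies in the Validation list can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not a prescribed setup: the 12-row array, the `ids` grouping, and the fold counts are all hypothetical, and the rows are assumed to be already sorted by time for the time-wise case.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit

# Hypothetical data: 12 rows, assumed to be sorted by time.
X = np.arange(12).reshape(-1, 1)
# Hypothetical ID per row (e.g. a user ID): 4 IDs with 3 rows each.
ids = np.repeat([0, 1, 2, 3], 3)

# Random, row-wise split: only valid when rows are independent.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
random_folds = [(tr.tolist(), va.tolist()) for tr, va in kfold.split(X)]

# Split by ID: all rows of one ID fall on the same side of the split.
group_kfold = GroupKFold(n_splits=2)
id_folds = [(tr.tolist(), va.tolist())
            for tr, va in group_kfold.split(X, groups=ids)]

# Time-wise split with a moving window: train on the 3 rows that
# directly precede each validation block, never on later rows.
moving_window = TimeSeriesSplit(n_splits=3, max_train_size=3)
time_folds = [(tr.tolist(), va.tolist())
              for tr, va in moving_window.split(X)]

for train_idx, valid_idx in time_folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Without `max_train_size`, `TimeSeriesSplit` uses an expanding window instead, where each fold trains on all rows before the validation block.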
About Metrics Optimization
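As a reference point, the regression metrics listed in section 4 can be written out in NumPy. This is a sketch using the standard textbook definitions; the function names are hypothetical helpers, MSPE and MAPE are expressed as percentages and assume no target equals zero, and RMSLE assumes non-negative targets and predictions.

```python
import numpy as np

def mse(y, p):
    """Mean squared error."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return np.mean((y - p) ** 2)

def rmse(y, p):
    """Root mean squared error."""
    return np.sqrt(mse(y, p))

def mae(y, p):
    """Mean absolute error."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return np.mean(np.abs(y - p))

def r2(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return 1.0 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)

def mspe(y, p):
    """Mean squared percentage error (assumes no zero targets)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return 100.0 * np.mean(((y - p) / y) ** 2)

def mape(y, p):
    """Mean absolute percentage error (assumes no zero targets)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return 100.0 * np.mean(np.abs((y - p) / y))

def rmsle(y, p):
    """Root mean squared logarithmic error (assumes y, p >= 0)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return np.sqrt(np.mean((np.log1p(p) - np.log1p(y)) ** 2))

# Hypothetical targets and predictions, for illustration only.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 3.0, 2.0]
for metric in (mse, rmse, r2, mae, mspe, mape, rmsle):
    print(metric.__name__, metric(y_true, y_pred))
```

The percentage-based and logarithmic metrics weight relative rather than absolute errors, so the same model can rank differently under MSE and under MAPE or RMSLE.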