
(February 2018) Coursera: How to Win a Data Science Competition: Learn from Top Kagglers, Week 1 to Week 3 Summary

  • 0. Machine Learning Algorithms
    • Linear Model
      • Logistic Regression, Support Vector Machine
      • Tools) scikit-learn, Vowpal Wabbit
    • Tree-based
      • Decision Tree, Random Forest, GBDT
      • Tools) scikit-learn, XGBoost, LightGBM
    • KNN-based methods
      • Tools) scikit-learn
    • Neural Networks
      • Tools) TensorFlow, Keras, MXNet, PyTorch, Lasagne
 
  • 1. Data Check
    • Check List
      • Target Metric
      • Domain Knowledge
      • How the data was generated
 
  • 2. Data Pre-processing: adjust the data slightly, based on what the data check revealed
    • Feature checking
      • Types: Categorical/Numerical/Text, etc.
        • Date > use gaps (differences between dates)
        • Coordinates > use latitudes and longitudes
        • Anonymized > try to decode them / check correlations
      • Scaling > Standardization
      • Missing Values (NaNs) > use an isnull indicator (XGBoost can handle NaNs directly) / fill with -999, -1, etc.
      • Outliers > just keep them in mind
    • Feature generation
      • Label Encoding > replace each category symbol with an integer code
      • Frequency Encoding > replace each category with how often it occurs
    • Feature Extraction
      • Text > Vector
        • Bag of Words
        • Embeddings (e.g. word2vec)
      • Text pre-processing
        • Considerations: lowercasing / lemmatization / stemming / stopwords
      • Image > Vector
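
The pre-processing steps in section 2 can be sketched in plain Python. This is only a minimal illustration, not the course's code: the function names, the sample columns, and the choice of -999 as the sentinel are assumptions made for the example, and in practice a library such as scikit-learn or pandas would normally do this work.

```python
import math
from collections import Counter

def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]

def fill_missing(values, sentinel=-999):
    """Replace None/NaN with a sentinel and return an is-null indicator."""
    is_null = [v is None or (isinstance(v, float) and math.isnan(v))
               for v in values]
    filled = [sentinel if null else v for v, null in zip(values, is_null)]
    return filled, [int(null) for null in is_null]

def standardize(values):
    """Scale to zero mean and unit variance (assumes a non-constant column)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def bag_of_words(docs):
    """Lowercase, split on whitespace, and count each token per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical sample data, for illustration only.
cities = ["seoul", "busan", "seoul", "daegu"]
codes, mapping = label_encode(cities)
freqs = frequency_encode(cities)
filled, is_null = fill_missing([1.5, None, float("nan"), 3.0])
scaled = standardize([1.0, 2.0, 3.0])
vocab, vectors = bag_of_words(["The cat sat", "the cat saw the dog"])
```

Note that tree-based models can usually work with label or frequency codes as they are, while linear models and neural networks tend to need scaled inputs, which is why standardization sits in the same checklist.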
 
  • 3. EDA
    • Tool : Visualization
      • Using histograms, plots
    • Explore feature relations
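
As a rough sketch of the two EDA ideas above, the snippet below bins one feature into a text histogram and measures the linear relation between two features with the Pearson correlation. The helper names and the sample values are assumptions for illustration; a plotting library would normally be used for the real visualization.

```python
import math
from collections import Counter

def text_histogram(values, bins=5):
    """Count values in equal-width bins; returns (bin_start, count) pairs."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    # The maximum value is clamped into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    return [(lo + i * width, counts.get(i, 0)) for i in range(bins)]

def pearson(x, y):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical feature values, for illustration only.
hist = text_histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=4)
for start, count in hist:
    print(f"{start:6.2f} | {'#' * count}")

corr_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
corr_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```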

 

  • 4. Building Model + Validation
    • Building Model
      • Metrics Optimization
        • Regression metrics
          • MSE
          • RMSE
          • R^2
          • MAE
          • MSPE
          • MAPE
          • RMSLE
        • Classification metrics
          • Accuracy
          • Logarithmic Loss
    • Validation
      • Tools
        • Holdout
        • LOO
        • K-fold
      • Splitting Data
        • Random, row-wise: rows must be independent
        • Time-wise > moving window validation
        • by Id
        • Combined
    • Problems
      • Too limited data
      • Too diverse / inconsistent data
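
The validation schemes and splitting strategies above can be sketched as index generators. This is a minimal sketch under assumed names and parameters (`holdout_split`, `k_fold`, `moving_window`, `split_by_id`, a 20% validation share), not the course's implementation. scikit-learn's `model_selection` module provides ready-made splitters for the same ideas.

```python
import random

def holdout_split(n, test_fraction=0.2, seed=0):
    """Random row-wise holdout: shuffle the indices once and cut."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = n - int(n * test_fraction)
    return idx[:cut], idx[cut:]

def k_fold(n, k=5, seed=0):
    """K-fold: each fold is the validation set once. k == n gives LOO."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def moving_window(n, train_size, valid_size):
    """Time-wise: train on a window, validate on the rows right after it."""
    start = 0
    while start + train_size + valid_size <= n:
        split = start + train_size
        yield list(range(start, split)), list(range(split, split + valid_size))
        start += valid_size

def split_by_id(ids, test_fraction=0.2, seed=0):
    """Split by id so that no id appears on both sides."""
    unique = sorted(set(ids))
    random.Random(seed).shuffle(unique)
    cut = len(unique) - int(len(unique) * test_fraction)
    train_ids = set(unique[:cut])
    train = [i for i, g in enumerate(ids) if g in train_ids]
    valid = [i for i, g in enumerate(ids) if g not in train_ids]
    return train, valid

# Hypothetical sizes and ids, for illustration only.
train_idx, valid_idx = holdout_split(10)
folds = list(k_fold(10, k=5))
loo = list(k_fold(4, k=4))
windows = list(moving_window(10, train_size=4, valid_size=2))
user_ids = ["a", "a", "b", "c", "c", "d", "e"]
id_train, id_valid = split_by_id(user_ids)
```

The moving window assumes the rows are already sorted by time, so every validation row comes after every training row of the same split.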

 

 

  • Problems
  1. Handling values that are not present in the training data
    1. Infer from other data that shares the same feature with the value
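
One possible reading of this note, shown here only as an assumption, is to encode a categorical column by frequency over the train and test features together. A category that appears only in the test set then receives the same code as training categories with the same frequency, so the model can treat it like them. The function name and sample columns are hypothetical.

```python
from collections import Counter

def frequency_encode_train_test(train_col, test_col):
    """Frequency-encode a column using counts from train and test combined.

    Only feature values are used (no targets), so a category seen only in
    the test set still gets a meaningful code.
    """
    counts = Counter(train_col) + Counter(test_col)
    total = len(train_col) + len(test_col)

    def encode(col):
        return [counts[v] / total for v in col]

    return encode(train_col), encode(test_col)

# Hypothetical columns: "d" never appears in the training data.
train_col = ["a", "a", "b", "c"]
test_col = ["b", "d", "d"]
train_enc, test_enc = frequency_encode_train_test(train_col, test_col)
```

Here "d" occurs twice overall, exactly like "a" and "b", so all three share the code 2/7 and predictions for "d" are inferred from rows of "a" and "b".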
 
 
About Metrics Optimization
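
For reference, the regression and classification metrics listed in section 4 can each be written in a few lines. This is a minimal sketch using the standard definitions. The function names, the percentage scaling of MSPE and MAPE, and the clipping constant in the log loss are assumptions of this example, and the log loss shown is the binary case only.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

def mspe(y_true, y_pred):
    """Mean squared percentage error, in percent (targets must be non-zero)."""
    return 100 / len(y_true) * sum(((t - p) / t) ** 2
                                   for t, p in zip(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    return 100 / len(y_true) * sum(abs((t - p) / t)
                                   for t, p in zip(y_true, y_pred))

def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + value); values must be greater than -1."""
    return math.sqrt(sum((math.log1p(t) - math.log1p(p)) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical targets and predictions, for illustration only.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 6]
```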