Intro and Intermediate Machine Learning

Tools

  • scikit-learn
  • seaborn
  • optuna/scikit-optimize

1. Handling Missing Values

  1. Drop rows or columns
  2. Imputation
    • sklearn.impute
      • sklearn.impute.SimpleImputer: mean, median, most_frequent or constant
      • sklearn.impute.IterativeImputer: models each missing feature as a function of the others (experimental; requires from sklearn.experimental import enable_iterative_imputer)
      • sklearn.impute.KNNImputer
    • hints
      • sklearn's imputers return NumPy arrays, so the column names are lost and have to be restored afterwards
      • sometimes imputation performs worse than dropping the column or leaving the NaN values in place, perhaps because of noise or a poor choice of imputation method.
  3. Extension to Imputation
    • add an extra feature that flags which values were missing, e.g. Missing_Feature_Flag with True/False
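
The imputation steps above can be sketched as follows. This is a minimal illustration, not a recommended recipe: the toy DataFrame, its column names, and the "_was_missing" suffix are assumptions made up for the example. It fills missing values with the median, restores the column names and index that the imputer drops, and adds one True/False flag per column that had missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are invented for illustration.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# SimpleImputer returns a NumPy array, so the DataFrame is rebuilt by hand
# to get the column names and the index back.
imputer = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(
    imputer.fit_transform(X),
    columns=X.columns,
    index=X.index,
)

# Extension to imputation: record which values were originally missing.
for col in X.columns:
    if X[col].isna().any():
        X_imputed[col + "_was_missing"] = X[col].isna()
```

SimpleImputer also accepts add_indicator=True, which appends similar indicator columns to the returned array; the manual loop is used here only to keep the column names explicit.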

2. Handling Categorical Variables

  1. Drop categorical features for models that cannot handle them, or keep them for models that can.
  2. Encoder: keep in mind that the set of categorical values in train and valid may differ
    • Ordinal variables: Ordinal Encoding
      • sklearn.preprocessing.OrdinalEncoder(categories=order_list)
    • Nominal variables:
      • Investigating cardinality:
        • Low cardinality/small number of categorical values: One-Hot Encoding/Label Encoding
        • High cardinality/large number of categorical values: Target Encoding
      • sklearn.preprocessing
        • sklearn.preprocessing.OneHotEncoder
          • handle_unknown='ignore', sparse_output=False
          • returns a NumPy array, which needs to be converted back to a pandas DataFrame
          • the encoder drops the DataFrame index, so restore it when rebuilding the DataFrame
          • pd.get_dummies, which adds the dummy variables as DataFrame columns directly, may work better than OneHotEncoder
        • sklearn.preprocessing.LabelEncoder (intended for encoding the target y rather than the features X)
        • sklearn.preprocessing.TargetEncoder
          • fit(X, y).transform(X) does not equal fit_transform(X, y): fit_transform encodes the training data with a cross-fitting scheme, while transform(X) uses the encodings learned from the full training data.
      • category_encoders.MEstimateEncoder

3. Pipeline

Pipelines are a simple way to keep your data preprocessing and modeling code organized.

  • transform Target: TransformedTargetRegressor
  • transform Features: ColumnTransformer
  • Pipeline(steps, memory)
    • Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
    • memory: Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer.
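
One way these pieces might fit together is sketched below. The data, the column names, the choice of LinearRegression, and the log transform of the target are all assumptions for illustration. ColumnTransformer preprocesses the features, Pipeline chains preprocessing with the final estimator, and TransformedTargetRegressor transforms the target before fitting and inverts the transform at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric and one categorical feature.
X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "city": ["A", "B", "A", "C", "B", "A"],
})
y = np.array([100.0, 180.0, 150.0, 300.0, 140.0, 210.0])

# Feature transforms, applied per column group.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Intermediate steps are transformers; the last step only needs fit.
pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

# Target transform: fit on log1p(y), predict on the original scale.
model = TransformedTargetRegressor(
    regressor=pipeline, func=np.log1p, inverse_func=np.expm1
)
model.fit(X, y)
preds = model.predict(X)
```

Because the whole chain is a single estimator, it can be passed to cross-validation utilities as one object, so the preprocessing is refit on each training fold.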

4. Cross-Validation

  • For small datasets, use cross-validation, since a single validation split gives a noisy estimate of model quality.
  • For larger datasets, a single validation set is usually sufficient.

Three types of scorers:
- Clustering
- Classification
- Regression

Scorers follow a larger-is-better convention by default, so loss functions are negated (for example, neg_mean_absolute_error).
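
A small sketch of the sign convention, assuming a synthetic regression dataset and a LinearRegression model chosen only for illustration: cross_val_score with scoring="neg_mean_absolute_error" returns negative values, which are flipped back to get the usual MAE per fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to have something to score.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scorer returns the negated MAE so that larger is still better.
neg_mae = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_absolute_error"
)

# Flip the sign to report the error in its usual form.
mae_per_fold = -neg_mae
print("MAE per fold:", mae_per_fold)
print("Mean MAE:", mae_per_fold.mean())
```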

5. Gradient Boosting

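  1. initialize the ensemble with a single model
  2. generate predictions with the current ensemble and calculate the loss
  3. train a new model to reduce that loss (for squared error, fit it to the residuals of the current ensemble)
  4. add the new model to the ensemble, and … repeat!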
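
The loop above can be sketched from scratch for the squared-error case. This is a simplified illustration and not how XGBoost is implemented: the synthetic data, the tree depth, and the hyperparameter values are assumptions, and shallow sklearn decision trees stand in for the weak learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 50

# Step 1: initialize the ensemble with a single (constant) model.
prediction = np.full_like(y, y.mean())
trees = []
losses = []

for _ in range(n_estimators):
    # Step 2: compute the residuals of the current ensemble and its loss.
    residuals = y - prediction
    losses.append(np.mean(residuals ** 2))

    # Step 3: fit a new model to the residuals, which for squared error
    # are the negative gradient of the loss (up to a constant factor).
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)

    # Step 4: add the new model to the ensemble, scaled by the learning rate.
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {losses[0]:.4f}, after: {final_mse:.4f}")
```

A smaller learning_rate makes each tree contribute less, which typically calls for more trees but tends to overfit less.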

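XGBRegressor: an implementation of GBDT (Gradient Boosting Decision Tree)
- n_estimators: number of decision trees
- learning_rate: each new model's prediction is multiplied by learning_rate before it is added to the ensemble
- n_jobs: number of parallel threads
- fit:
  - early_stopping_rounds: stop training when the validation score has not improved for this many rounds (recent XGBoost releases expect this on the constructor rather than in fit)
  - eval_set: validation data used to monitor the score for early stopping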

6. Data Leakage

  1. target leakage: the training data contains information that only becomes available after the target is determined, in time or chronological order.
    • in this case, you can predict the feature at the current time step with a linear regression on the preceding time steps.
  2. train-test contamination: the training data contains information from the test data, for example when preprocessing is fitted before the train/validation split.
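
A minimal sketch of train-test contamination, assuming random synthetic data and StandardScaler as the preprocessing step (both chosen only for illustration): the leaky variant computes its statistics on all rows, while the clean variant fits on the training rows only and merely applies the fitted scaler to the validation rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_valid = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: the mean and variance are computed on train and validation rows,
# so validation information influences how the training data is scaled.
leaky_scaler = StandardScaler().fit(X)

# Clean: fit on the training rows only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
```

Putting the preprocessing inside a Pipeline achieves the same thing automatically during cross-validation, because the pipeline is refit on each training fold.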
