Tools
- scikit-learn
- seaborn
- optuna/scikit-optimize
1. Handling Missing Values
- Drop rows or columns
- Imputation
- sklearn.impute
- sklearn.impute.SimpleImputer: strategy = mean, median, most_frequent, or constant
- sklearn.impute.IterativeImputer: models each feature with missing values as a function of the other features (experimental; requires `from sklearn.experimental import enable_iterative_imputer`)
- sklearn.impute.KNNImputer
- Hints
- sklearn's imputers return NumPy arrays, so column names (and the index) are lost; restore them when converting back to a DataFrame, as in the sketch below
- imputation can sometimes perform worse than dropping the column or leaving the NaNs, e.g., because it injects noise or the chosen imputation strategy does not fit the data
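A minimal sketch of this workflow, assuming X_train and X_valid are numeric pandas DataFrames (the names are illustrative):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fit on train, apply to valid; wrap the NumPy output back into
# DataFrames so the column names and index are preserved.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
X_valid_imp = pd.DataFrame(imputer.transform(X_valid),
                           columns=X_valid.columns, index=X_valid.index)
```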
- Extension to Imputation
- add an extra True/False flag feature (e.g., Missing_Feature_Flag) marking which values were originally missing, so the model can learn from the missingness itself
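A sketch of this extension, with illustrative column names; SimpleImputer(add_indicator=True) achieves the same in one step:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

X_plus = X_train.copy()
cols_with_missing = [c for c in X_plus.columns if X_plus[c].isnull().any()]
for col in cols_with_missing:
    # Flag which rows were originally missing before imputing
    X_plus[col + "_was_missing"] = X_plus[col].isnull()

imputer = SimpleImputer()  # mean imputation by default
X_plus = pd.DataFrame(imputer.fit_transform(X_plus),
                      columns=X_plus.columns, index=X_plus.index)
```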
2. Handling Categorical Variables
- Drop categorical features for models that cannot handle them, or keep them for models that can.
- Encoders: keep in mind that the sets of category values in the train and validation data may differ
- Ordinal variables: Ordinal Encoding
- sklearn.preprocessing.OrdinalEncoder(categories=order_list)
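A minimal sketch, assuming an ordinal column named quality with a known order (both are illustrative):

```python
from sklearn.preprocessing import OrdinalEncoder

# One ordered list per encoded column
order_list = [["low", "medium", "high"]]
encoder = OrdinalEncoder(categories=order_list)
X_train[["quality"]] = encoder.fit_transform(X_train[["quality"]])
X_valid[["quality"]] = encoder.transform(X_valid[["quality"]])
```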
- Nominal variables:
- Investigating cardinality:
- Low cardinality/small number of categorical values: One-Hot Encoding/Label Encoding
- High cardinality/large number of categorical values: Target Encoding
- sklearn.preprocessing
- sklearn.preprocessing.OneHotEncoder
- handle_unknown='ignore', sparse_output=False
- returns a NumPy array, so convert it back to a pandas DataFrame
- the encoder also drops the original index; reattach it manually (see the sketch at the end of this section)
- pd.get_dummies, which adds dummy variables as DataFrame columns directly, can be more convenient than OneHotEncoder because it keeps column names and the index
- sklearn.preprocessing.LabelEncoder
- sklearn.preprocessing.TargetEncoder
- fit(X, y).transform(X) does not equal fit_transform(X, y): fit_transform uses an internal cross-fitting scheme to reduce target leakage in the encodings, while transform(X) uses the encoding fitted on the full training set.
- category_encoders.MEstimateEncoder
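A sketch of both encodings (low_card_cols, high_card_cols and the DataFrames are assumptions; TargetEncoder requires scikit-learn >= 1.3):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, TargetEncoder

# One-hot encode the low-cardinality columns; wrap the NumPy output
# back into a DataFrame and reattach the dropped index.
oh = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
oh_train = pd.DataFrame(oh.fit_transform(X_train[low_card_cols]),
                        columns=oh.get_feature_names_out(low_card_cols),
                        index=X_train.index)
X_train_oh = pd.concat([X_train.drop(low_card_cols, axis=1), oh_train], axis=1)

# Target-encode the high-cardinality columns; fit_transform cross-fits
# internally, so it differs from fit(X, y).transform(X).
te = TargetEncoder()
X_train_te = te.fit_transform(X_train[high_card_cols], y_train)
X_valid_te = te.transform(X_valid[high_card_cols])
```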
3. Pipeline
Pipelines are a simple way to keep your data preprocessing and modeling code organized.
- transform the target: TransformedTargetRegressor
- transform the features: ColumnTransformer
- Pipeline(steps, memory)
- Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
- memory: Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer.
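A sketch of a full preprocessing-plus-model pipeline (the column lists and the model choice are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer routes each column group to its own transformer
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), numerical_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
model_pipeline.fit(X_train, y_train)
preds = model_pipeline.predict(X_valid)
```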
4. Cross-Validation
- For small datasets, cross-validation is needed.
- For larger datasets, a single validation set is sufficient.
Three types of scorers:
- Clustering
- Classification
- Regression
Scorers follow a larger-is-better convention by default, so loss functions must be negated (e.g., scoring='neg_mean_absolute_error'), as in the sketch below.
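For example, 5-fold cross-validation with a negated loss (model_pipeline is the pipeline sketched in the previous section):

```python
from sklearn.model_selection import cross_val_score

# cross_val_score returns negated MAE (larger is better),
# so multiply by -1 to recover the usual loss.
scores = -1 * cross_val_score(model_pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```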
5. Gradient Boosting
- initialize the ensemble with a single model
- generate predictions and calculate the loss
- train a new model to fit the loss (i.e., to correct the current ensemble's errors)
- add the new model to the ensemble, and … repeat!
XGBRegressor: an implementation of GBDT (Gradient-Boosted Decision Trees)
- n_estimators: number of Decision Trees
- learning_rate: each tree's predictions are scaled by learning_rate before being added to the ensemble, so the next model fits Loss(prediction * learning_rate, target value); smaller values typically need a larger n_estimators
- n_jobs
- fit:
- early_stopping_rounds
- eval_set
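A sketch of the fitting loop with early stopping; note that the exact placement of early_stopping_rounds depends on the XGBoost version (it moved from fit() to the constructor in recent releases):

```python
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                     early_stopping_rounds=5)
# Stops adding trees once the validation score has not improved
# for 5 consecutive rounds.
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          verbose=False)
preds = model.predict(X_valid)
```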
6. Data Leakage
- target leakage: the training data contains information that only becomes available after the target is determined (later in chronological order), so it will not exist at prediction time.
- in this case, one remedy is to reconstruct such a feature from the past only, e.g., predict its value at the current time step with a linear regression on earlier time steps.
- train-test contamination: the training data contains information about the test data, e.g., when preprocessing is fitted on the full dataset before splitting; see the sketch below.
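One way to avoid contamination is to fit all preprocessing inside a pipeline that is refit per fold, so the imputer never sees the validation data (a minimal sketch):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([("impute", SimpleImputer()),
                 ("model", LinearRegression())])
# The imputer is refit on each training fold only
scores = cross_val_score(pipe, X, y, cv=5)
```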