Tools
- scikit-learn
- seaborn
- optuna/scikit-optimize
1. Handling Missing Values
- Drop rows or columns
- Imputation
- sklearn.impute
- sklearn.impute.SimpleImputer: strategy = mean, median, most_frequent, or constant
- sklearn.impute.IterativeImputer: models each feature with missing values as a function of the other features (experimental; requires `from sklearn.experimental import enable_iterative_imputer`)
- sklearn.impute.KNNImputer
- Hints
- sklearn's imputers return NumPy arrays, so column names (and the index) are lost; restore them when converting back to a DataFrame, as in the sketch below
- imputation can sometimes perform worse than dropping the column or leaving the NaNs, e.g., because it injects noise or the chosen imputation strategy does not fit the data
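A minimal sketch of this workflow, assuming X_train and X_valid are numeric pandas DataFrames (the names are illustrative):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Fit on train, apply to valid; wrap the NumPy output back into
# DataFrames so the column names and index are preserved.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
X_valid_imp = pd.DataFrame(imputer.transform(X_valid),
                           columns=X_valid.columns, index=X_valid.index)
```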
- Extension to Imputation
- add an extra True/False flag feature (e.g., Missing_Feature_Flag) marking which values were originally missing, so the model can learn from the missingness itself
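A sketch of this extension, with illustrative column names; SimpleImputer(add_indicator=True) achieves the same in one step:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

X_plus = X_train.copy()
cols_with_missing = [c for c in X_plus.columns if X_plus[c].isnull().any()]
for col in cols_with_missing:
    # Flag which rows were originally missing before imputing
    X_plus[col + "_was_missing"] = X_plus[col].isnull()

imputer = SimpleImputer()  # mean imputation by default
X_plus = pd.DataFrame(imputer.fit_transform(X_plus),
                      columns=X_plus.columns, index=X_plus.index)
```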
2. Handling Categorical Variables
- Drop categorical features for models that cannot handle them, or keep them for models that can.
- Encoders: keep in mind that the sets of category values in the train and validation data may differ
- Ordinal variables: Ordinal Encoding
- sklearn.preprocessing.OrdinalEncoder(categories=order_list)
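A minimal sketch, assuming an ordinal column named quality with a known order (both are illustrative):

```python
from sklearn.preprocessing import OrdinalEncoder

# One ordered list per encoded column
order_list = [["low", "medium", "high"]]
encoder = OrdinalEncoder(categories=order_list)
X_train[["quality"]] = encoder.fit_transform(X_train[["quality"]])
X_valid[["quality"]] = encoder.transform(X_valid[["quality"]])
```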
- Nominal variables:
- Investigating cardinality:
- Low cardinality/small number of categorical values: One-Hot Encoding/Label Encoding
- High cardinality/large number of categorical values: Target Encoding
- sklearn.preprocessing
- sklearn.preprocessing.OneHotEncoder
- handle_unknown='ignore', sparse_output=False
- returns a NumPy array, so convert it back to a pandas DataFrame
- the encoder also drops the original index; reattach it manually (see the sketch at the end of this section)
- pd.get_dummies, which adds dummy variables as DataFrame columns directly, can be more convenient than OneHotEncoder because it keeps column names and the index
- sklearn.preprocessing.LabelEncoder
- sklearn.preprocessing.TargetEncoder
- fit(X, y).transform(X) does not equal fit_transform(X, y): fit_transform uses an internal cross-fitting scheme to reduce target leakage in the encodings, while transform(X) uses the encoding fitted on the full training set.
- category_encoders.MEstimateEncoder
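A sketch of both encodings (low_card_cols, high_card_cols and the DataFrames are assumptions; TargetEncoder requires scikit-learn >= 1.3):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, TargetEncoder

# One-hot encode the low-cardinality columns; wrap the NumPy output
# back into a DataFrame and reattach the dropped index.
oh = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
oh_train = pd.DataFrame(oh.fit_transform(X_train[low_card_cols]),
                        columns=oh.get_feature_names_out(low_card_cols),
                        index=X_train.index)
X_train_oh = pd.concat([X_train.drop(low_card_cols, axis=1), oh_train], axis=1)

# Target-encode the high-cardinality columns; fit_transform cross-fits
# internally, so it differs from fit(X, y).transform(X).
te = TargetEncoder()
X_train_te = te.fit_transform(X_train[high_card_cols], y_train)
X_valid_te = te.transform(X_valid[high_card_cols])
```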
3. Pipeline
Pipelines are a simple way to keep your data preprocessing and modeling code organized.
- transform the target: TransformedTargetRegressor
- transform the features: ColumnTransformer
- Pipeline(steps, memory)
- Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
- memory: Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer.
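A sketch of a full preprocessing-plus-model pipeline (the column lists and the model choice are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# ColumnTransformer routes each column group to its own transformer
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), numerical_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

model_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
model_pipeline.fit(X_train, y_train)
preds = model_pipeline.predict(X_valid)
```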
4. Cross-Validation
- For small datasets, cross-validation is needed.
- For larger datasets, a single validation set is sufficient.
Three types of scorers:
- Clustering
- Classification
- Regression
Scorers follow a larger-is-better convention by default, so loss functions must be negated (e.g., scoring='neg_mean_absolute_error'), as in the sketch below.
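For example, 5-fold cross-validation with a negated loss (model_pipeline is the pipeline sketched in the previous section):

```python
from sklearn.model_selection import cross_val_score

# cross_val_score returns negated MAE (larger is better),
# so multiply by -1 to recover the usual loss.
scores = -1 * cross_val_score(model_pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```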
5. Gradient Boosting
- initialize the ensemble with a single model
- generate predictions and calculate the loss
- train a new model to fit the loss (i.e., to correct the current ensemble's errors)
- add the new model to the ensemble, and … repeat!
XGBRegressor: an implementation of GBDT (Gradient-Boosted Decision Trees)
- n_estimators: number of Decision Trees
- learning_rate: each tree's predictions are scaled by learning_rate before being added to the ensemble, so the next model fits Loss(prediction * learning_rate, target value); smaller values typically need a larger n_estimators
- n_jobs
- fit:
- early_stopping_rounds
- eval_set
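A sketch of the fitting loop with early stopping; note that the exact placement of early_stopping_rounds depends on the XGBoost version (it moved from fit() to the constructor in recent releases):

```python
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                     early_stopping_rounds=5)
# Stops adding trees once the validation score has not improved
# for 5 consecutive rounds.
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          verbose=False)
preds = model.predict(X_valid)
```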
6. Data Leakage
- target leakage: the training data contains information that only becomes available after the target is determined (later in chronological order), so it will not exist at prediction time.
- in this case, one remedy is to reconstruct such a feature from the past only, e.g., predict its value at the current time step with a linear regression on earlier time steps.
- train-test contamination: the training data contains information about the test data, e.g., when preprocessing is fitted on the full dataset before splitting; see the sketch below.
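One way to avoid contamination is to fit all preprocessing inside a pipeline that is refit per fold, so the imputer never sees the validation data (a minimal sketch):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipe = Pipeline([("impute", SimpleImputer()),
                 ("model", LinearRegression())])
# The imputer is refit on each training fold only
scores = cross_val_score(pipe, X, y, cv=5)
```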