5 Time Feature Processing

https://www.kaggle.com/code/vanpatangan/orders-forecasting-challenge

1 Cyclical transformation and one-hot encoding of time features

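import numpy as np
import pandas as pd

# all_df is assumed to be the combined train/test DataFrame with a 'date' column
date_cols = ['date']
for _col in date_cols:
    # Parse dates; unparseable values become NaT and are filled with -1 below
    dt = pd.to_datetime(all_df[_col], errors='coerce')
    all_df[_col + "_year"] = dt.dt.year.fillna(-1)
    all_df[_col + "_month"] = dt.dt.month.fillna(-1)
    all_df[_col + "_day"] = dt.dt.day.fillna(-1)
    # isocalendar() returns a nullable UInt32 column, so cast before filling with -1
    all_df[_col + "_week"] = dt.dt.isocalendar().week.astype("float64").fillna(-1)
    all_df[_col + "_day_of_week"] = dt.dt.dayofweek.fillna(-1)
    all_df[_col + "_day_of_year"] = dt.dt.dayofyear.fillna(-1)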
    
    for m in range(1,13):
        all_df[f'month_{m}']=(all_df[_col + "_month"]==m)
    
    for d in range(7):
        all_df[f'dayofweek_{d}']=(all_df[_col + "_day_of_week"]==d)

    # Cyclical encoding: project each periodic value onto the unit circle so that
    # the end of a cycle lands next to its start (e.g. December next to January).
    # The year is not periodic, so it gets no sin/cos pair.
    # Periods of 30 and 365 approximate the length of a month and of a year.
    for part, period in [("month", 12), ("day", 30), ("day_of_week", 7), ("day_of_year", 365)]:
        angle = 2 * np.pi * all_df[f"{_col}_{part}"] / period
        all_df[f"{_col}_{part}_sin"] = np.sin(angle)
        all_df[f"{_col}_{part}_cos"] = np.cos(angle)
    
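    # Label Saturday/Sunday rows that have no named holiday as 'weekend'
    is_weekend = all_df[_col + "_day_of_week"].isin([5, 6])
    all_df.loc[is_weekend & (all_df['holiday_name'] == 'no_holiday'), 'holiday_name'] = 'weekend'

    # Drop the raw date column now that its features have been extracted
    all_df.drop(_col, axis=1, inplace=True)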
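
To illustrate why sine/cosine encoding suits periodic features, here is a minimal sketch on toy data. The helper names `cyclical_encode` and `distance` are hypothetical and not part of the original notebook. It shows that, once months are mapped onto the unit circle, December ends up close to January while June lies on the opposite side.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values: pd.Series, period: int) -> pd.DataFrame:
    """Map periodic values onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({"sin": np.sin(angle), "cos": np.cos(angle)}, index=values.index)


# Toy data: three months labelled by name
months = pd.Series([1, 6, 12], index=["jan", "jun", "dec"])
encoded = cyclical_encode(months, period=12)


def distance(a: str, b: str) -> float:
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(encoded.loc[a] - encoded.loc[b]))


# Compare how far December and June are from January in the encoded space
print("dec-jan:", round(distance("dec", "jan"), 3))
print("jun-jan:", round(distance("jun", "jan"), 3))
```

With the raw integers, December (12) and January (1) would be 11 units apart even though they are neighbours in the calendar; the encoded pair avoids that jump.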

2 EDA

Get an overall picture of the data:

import pandas as pd
from IPython.display import display

def check(df):
    """
    Generate a concise per-column summary of a DataFrame.
    """
    # Duplicate rows are a property of the whole frame, so count them once
    n_duplicates = df.duplicated().sum()

    # Build one summary row per column
    summary = [
        [col, df[col].dtype, df[col].count(), df[col].nunique(),
         df[col].isnull().sum(), n_duplicates]
        for col in df.columns]

    # Create a DataFrame from the list of lists
    df_check = pd.DataFrame(summary, columns=["column", "dtype", "instances", "unique", "sum_null", "duplicates"])

    return df_check

print("Training Data Summary")
display(check(train_df))
print("Test Data Summary")
display(check(test_df))

3 Parameter selection

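import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Define parameter distributions for randomized search.
# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low to high - 1.
param_dist = {
    'eta': uniform(0.01, 0.1),              # learning rate in [0.01, 0.11]
    'max_depth': randint(3, 10),            # 3 to 9
    'subsample': uniform(0.7, 0.3),         # [0.7, 1.0]
    'colsample_bytree': uniform(0.7, 0.3),  # [0.7, 1.0]
    'n_estimators': randint(100, 500)       # 100 to 499
}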

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    xgb.XGBRegressor(objective='reg:squarederror', n_jobs=-1),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='neg_mean_absolute_percentage_error',
    n_jobs=-1,
    random_state=42,
    verbose=0
)

# Fit RandomizedSearchCV
random_search.fit(X_train_scaled, y_train)

# Get the best model
best_model = random_search.best_estimator_

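# Perform cross-validation on the best model.
# These folds reuse data the search has already seen, so the estimate is optimistic.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(best_model, X_train_scaled, y_train, 
                            cv=kf, scoring='neg_mean_absolute_percentage_error')
# The scorer returns negative MAPE, so flip the sign
cv_mape = -cv_scores.mean()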
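
The ranges in `param_dist` are easy to misread, because `scipy.stats.uniform` takes `(loc, scale)` rather than `(low, high)`, and `randint` excludes its upper bound. The following sketch is an illustrative check, not part of the original notebook: it draws samples from three of the distributions above and prints the observed range of each. The `distributions` and `observed` names are assumptions introduced for this example.

```python
from scipy.stats import randint, uniform

# Same distributions as in the search space above
distributions = {
    "eta": uniform(0.01, 0.1),       # loc=0.01, scale=0.1
    "subsample": uniform(0.7, 0.3),  # loc=0.7, scale=0.3
    "max_depth": randint(3, 10),     # low=3, high=10 (exclusive)
}

# Draw many samples and record the observed minimum and maximum per parameter
observed = {}
for name, dist in distributions.items():
    samples = dist.rvs(size=10_000, random_state=0)
    observed[name] = (samples.min(), samples.max())
    print(f"{name}: min={samples.min():.3f}, max={samples.max():.3f}")
```

So `uniform(0.7, 0.3)` keeps `subsample` within [0.7, 1.0], which is the valid upper limit for a sampling fraction, and `randint(3, 10)` never yields a depth of 10.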
