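Product Operations Analysis

This article analyzes one product category (cat litter) on 1688 along several dimensions.

It covers:

1. Outlook for the category

2. Sales and sales-volume analysis on the Alibaba merchant platform (1688) and Taobao, using scraped data

3. Analysis and mining of the scraped data for a rough look at what drives sales (using only first-level scraped fields: title, starting price, whether the shop is star-rated, promotion conditions, and attraction labels)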

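I. Category outlook

Estimating the domestic market size of a category always yields a rough figure, however it is computed.

① The classic Fermi estimate. It is very rough as well, but better than nothing.

For background, see 费米估算 | 产品面试中的估算问题解法 | 人人都是产品经理 (an article on Fermi estimation for product-interview estimation questions).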
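As a rough illustration of the Fermi approach, the sketch below multiplies a chain of guessed factors. Only the cat count (69.8 million, the figure quoted in section 1.1) comes from this article; the penetration rate, monthly consumption and unit price are hypothetical placeholders, not measured values, so the result only shows the mechanics.

```python
def fermi_market_size(n_cats, penetration, kg_per_cat_per_month, price_per_kg):
    """Annual market size in yuan as a product of rough factors."""
    return n_cats * penetration * kg_per_cat_per_month * 12 * price_per_kg

# Only n_cats comes from the article; the other three inputs are made-up placeholders.
estimate = fermi_market_size(
    n_cats=69_800_000,
    penetration=0.5,           # hypothetical share of cats using purchased litter
    kg_per_cat_per_month=4,    # hypothetical consumption
    price_per_kg=3,            # hypothetical price in yuan
)
print(f"about {estimate / 1e8:.1f} x 10^8 yuan per year")
```

Changing any single guess by a factor of two changes the result by the same factor, which is why the method is only good for an order of magnitude.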

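② Scrape sales figures from the mainstream online platforms and tally the bulk of the top sellers. This may miss some listings with tiny sales, which does not matter. It does raise three problems: 1. Many platforms have aggressive anti-scraping measures, and JD.com does not even display sales figures. 2. It misses offline channels, WeChat sellers (微商), private-traffic channels and so on; there are too many channels to count them all. 3. Large platforms sell a lot and small platforms sell little in total, and the ratio between them is not linear, so when complete coverage is known to be impossible, an accurate estimate is hard.

③ The COPY method: take the figures straight from an authoritative source. When a problem cannot be solved, the best approach is to go around it.

Source: 《2023年-2024中国宠物行业白皮书》正式发布：市场规模达到2793亿元，犬猫数量实现“双升”_派读大数据 (release of the 2023-2024 China Pet Industry White Paper: market size reaches 279.3 billion yuan, and dog and cat numbers both rise)

The report shows that demand penetration for cat litter is fairly high. A cat kept in a city home will not go to the toilet on its own, so the owner has to buy litter or the whole place will smell.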

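1.1 Pet market outlook in brief:

1. The pet market has grown year over year in recent years, but growth has slowed over the past two years;

2. Average spending per pet fell slightly: the number of pets rose a little, while average spending dropped by less than 1%;

3. The numbers of pet cats and dogs both rose in 2023 compared with 2022; the number of pet cats was 69.8 million, up 6.8% from 2022;

4. Owners born in the 1990s and 1980s remain the main force among pet owners, while the share of those born in the 1970s or earlier is declining;

5. Food is the largest spending item: staple food and snacks account for more than half. Medical care (medicine, vaccines, treatment fees) comes next at nearly 30%. Supplies and services hold smaller shares, 12.5% and 6.8% respectively, but have considerable room to grow.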

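1.2 The cat litter segment

In the cat litter segment, tofu litter and bentonite litter rose sharply compared with 2022: tofu litter by 7 percentage points and bentonite litter by 6.9 percentage points. Owners' preference for mixed litter fell by 3.4 percentage points.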

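II. Scraping the data

Take the 1688 platform as an example. On the web version there are these three options:

The scraper is based on selenium. Mind the anti-scraping measures and add sleep calls. When a captcha appears, it is best to pause and wait for manual input, because anti-scraping rules change too quickly nowadays.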
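A minimal sketch of the "sleep and wait for a human" idea, independent of any browser driver. The function names `polite_sleep` and `wait_for_manual_captcha` and the default delays are hypothetical choices for illustration, not part of the original scraper.

```python
import random
import time

def polite_sleep(base=2.0, jitter=1.5, sleeper=time.sleep):
    """Pause for base plus a random extra delay so requests are not evenly spaced."""
    delay = base + random.uniform(0, jitter)
    sleeper(delay)
    return delay

def wait_for_manual_captcha(prompt=input):
    """Block until the operator confirms the captcha was solved in the browser."""
    prompt("Captcha detected: solve it in the browser, then press Enter...")
```

A scraping loop would call `polite_sleep()` after each page load and `wait_for_manual_captcha()` whenever the page it received looks like a verification page; how to detect that page depends on the site and is left out here.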

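A default search automatically selects the industrial-products entry.

On the web version, much of the data is not aligned: some listings show several offer-desc-item elements.

The fields collected from 1688 are:

Title, link (kept for reference), product promotion tag (1688爆品 "1688 hot item", 严选 "curated selection", 新品 "new item"), manufacturer advantages (先采后付 "buy first, pay later", 48H发货 "ships within 48 hours", etc.), promotion tags (recently favorited by many buyers, number of add-to-carts, etc.), the starting price shown on the page, and the transaction volume.

The final data looks like this:

Merchant information was not collected, because that requires opening each detail page, and frequent visits could easily get the scraper banned.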

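Field descriptions:

标题: the visible product title

链接 (the 标题链接 column): the link, kept so the product details can be checked at any time

price_num: the starting price shown on the page

tag_1, tag_2, tag_3: a listing carries one of three tags or none (1688爆品, 严选, 新品, none), which took three fields;

advan_1 and advan_2 (advan is short for advantage): values include 先采后付, 48H发货, 近期上新 (recently listed), 代发包邮 (dropshipping with free shipping), 深度验商 (in-depth merchant verification), etc. They are ordered, and a listing may fill both fields, one, or neither;

advocacy is the promotion text: a repurchase rate such as 复购率21.9% (the values span a wide range), a count of add-to-carts, 近期多人加购 (recently added to cart by many), 近期多次收藏 (recently favorited many times), or none;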
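Because each listing exposes a different number of labels, the scraped lists have to be forced into a fixed number of columns. The sketch below shows one way to do that; `pad_fields` is a hypothetical helper and the rows are made-up examples, not the scraped data.

```python
def pad_fields(values, width, fill=None):
    """Return exactly `width` items: truncate extras, pad missing ones with `fill`."""
    values = list(values)[:width]
    return values + [fill] * (width - len(values))

# Hypothetical scraped rows: each listing shows a different number of advantage labels.
raw_rows = [
    ["先采后付", "48H发货"],
    ["代发包邮"],
    [],
]
rows = [dict(zip(["advan_1", "advan_2"], pad_fields(r, 2))) for r in raw_rows]
print(rows)
```

Padding with `None` keeps the missing labels visible as empty cells, which is what the later `notna()` checks rely on.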

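As the figure below shows, some products carry a lot of promotion information and others very little.

Next comes the most troublesome stage, data cleaning: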

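III. Cleaning

3.1 Basic cleaning

1. First check whether the data is correct

Check whether any product carries a combination such as 严选 + 爆品; none does.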

import numpy as np
import pandas as pd

path = 'C:/Users/Administrator/downloads/maosha_1688.xlsx'
df = pd.read_excel(path
                   # ,dtype={'tag_1':str,'tag_2':str,'tag_3':str,'price_num':np.float32}
                   )
pd.set_option('display.max_columns', 10) # number of columns to display
df.drop("标题链接",axis=1,inplace=True) # drop the title-link column
# 1. Check that the counts of 1688爆品, 新品, 严选 and empty tags look right; quite a few are empty
index_1 = df['tag_1'].notna()
index_2 = df['tag_2'].notna()
index_3 = df['tag_3'].notna()
print(len(df[index_1]),len(df[index_2]),len(df[index_3]),df.shape[0])

2. Merge the three tags

# 2. Merge the three fields into a new one
# Row-by-row alternative, kept for reference:
# def get_it(data,cols):
#     for c in cols:
#         if pd.notnull(data[c]):
#             return data[c]
#     return None
# cols = ['tag_1','tag_2','tag_3']
# df['new'] = df.apply(lambda x:get_it(x,cols),axis=1)
df['com_tag']=df[['tag_1','tag_2','tag_3']].bfill(axis=1).iloc[:,0] # the simple way: first non-null tag in each row
# a row has at most one tag, so the merged count must equal the sum of the three counts
assert df['com_tag'].notna().sum() == df['tag_1'].notna().sum() + df['tag_2'].notna().sum() + df['tag_3'].notna().sum()
df.drop(['tag_1','tag_2','tag_3'],axis=1,inplace=True)
print(df.head(10))

3. Some products have no sales figure; drop them

4. Some rows are duplicated; deduplicate

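# 3. Drop rows without an actual sales figure (the sales text is stored in the location column)
df.dropna(subset=['location'],inplace=True)
print(df.shape)
# 4. Drop rows that repeat title, starting price and sales figure: 351 duplicates; 1200 rows in total before dropping the empty ones
df.drop_duplicates(subset=['标题','price_num','location'],inplace=True)
print('清洗后shape',df.shape)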

Before deduplication: (817, 7)
After deduplication: (466, 7)

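5. Clean the sales-amount field

On 1688 this field mostly reads as a transaction amount in yuan or in units of 10,000 yuan (万元), and sometimes as a number of units sold. It later turned out that some listings show amounts of only a few yuan or a few dozen yuan, so the patterns need careful inspection.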

# 5. Clean the sales-amount field
def clean_it(row):
    # row is one DataFrame row; row['location'] holds the raw sales text
    text = row['location'].split('+')[0]
    if '销' in row['location']:
        # units sold: multiply by the starting price to approximate the amount
        return int(text.replace('销',''))*row['price_num']
    else:
        temp = text.replace('成交','')
        if '万' in temp:
            # amount given in units of 10,000 yuan
            return pd.to_numeric(temp.replace('万',''))*10000
        if '元' in temp:
            return pd.to_numeric(temp.replace('元',''))
        # plain number left over; raises if an unexpected format slips through
        return pd.to_numeric(temp)

df['money_sales'] = df.apply(clean_it,axis=1)

df.drop('location',inplace=True,axis=1)
# df.to_excel('d:/1234.xlsx',index=False)
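The string rules handled by `clean_it` can also be written as a single regular expression, which makes unrecognized formats easy to spot. This is a sketch under assumptions: the sample strings below are guesses at the formats described above ("销…+件", "成交…万+元", "成交…元"), and `parse_sales` is a hypothetical name.

```python
import re

_PATTERN = re.compile(r"(销|成交)\s*(\d+(?:\.\d+)?)(万?)")

def parse_sales(text, price):
    """Return an estimated sales amount in yuan, or None if the text is unrecognized."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    kind, number, wan = m.groups()
    value = float(number) * (10_000 if wan else 1)
    # "销" counts units, so multiply by the starting price; "成交" is already an amount.
    return value * price if kind == "销" else value

# Assumed example strings, not copied from the scraped data.
samples = [("销300+件", 9.9), ("成交1.2万+元", 9.9), ("成交50元", 9.9), ("暂无", 9.9)]
for text, price in samples:
    print(text, parse_sales(text, price))
```

Returning `None` for text that matches no rule lets those rows be listed and inspected instead of failing in the middle of `apply`.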

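Here is a screenshot of the data after this initial cleaning. Ignore Excel's markers on the numeric cells, since Excel is not used for the analysis at all.

3.2 Basic plots

A quick look at the share of sales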

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='dark',font_scale=1.5,font='STSong') # style, font size and font (STSong, a Song typeface)
df["money_sales"] = pd.to_numeric(df["money_sales"]).astype(int) # truncate to whole yuan
df_sort=df.sort_values(by='money_sales',ascending=False)
df_sort.reset_index(drop=True,inplace=True)
df2 = df_sort.iloc[:50,:] # top 50 rows (.loc[:50] would return 51)
sns.barplot(x=df2.index,y=df2.money_sales,palette='YlOrRd_r',errorbar=None)
plt.xticks([])
plt.title('销量前50') # "Top 50 by sales"
plt.show()

def compute_it(data,num):
    a = data[:num].money_sales.sum()  # sales of the top num products
    b = data.money_sales.sum()        # sales of all products
    print('前%s名销售额' % num, a)
    print('总计销售额', b)
    print("占比", str(round((a / b) * 100, 3)) + '%')
    plt.ticklabel_format(style='plain',axis='y') # no scientific notation on the y axis
    plt.bar(x=["前%s名销售额" % num, "总计"], height=[a, b])
    # annotate each bar with its value and its share of the total
    for i,j in enumerate([a, b]):
        plt.text(i, j + 0.01, str(round(j, 2)), ha='center', va='bottom')
        plt.text(i, j *0.8, f'{j/b:.0%}', ha='center', va='bottom',color='white')
    plt.show()
compute_it(df_sort ,40)

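The top 40 products account for 80% of total sales, and roughly the top 75 account for 90%. The statistics are per product (the links were not opened to group products by shop), and the displayed figures are all rounded down, so larger amounts lose larger remainders. The real number of head merchants is therefore even smaller.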
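The "how many top products reach 80% or 90%" question can be answered directly from a cumulative sum rather than by trying values of `num` one at a time. A minimal sketch on toy numbers (not the scraped data); `items_needed_for_share` is a hypothetical helper name.

```python
def items_needed_for_share(sales, target):
    """Smallest number of top-selling items whose combined sales reach `target` (0-1)."""
    ordered = sorted(sales, reverse=True)
    total = sum(ordered)
    running = 0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= target * total:
            return count
    return len(ordered)

# Toy figures, not the scraped data.
toy_sales = [500, 310, 100, 50, 20, 10, 5, 5]
for target in (0.8, 0.9):
    print(target, items_needed_for_share(toy_sales, target))
```

On the real data the same call would take `df_sort['money_sales']` as input.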

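3.3 Importance of several features

6. Clean and encode the two advan fields

A first look shows only a handful of distinct values, so an encoder is a reasonable choice.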

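from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# fill missing labels with '无' (none) so each column holds only strings; the encoder is re-fit per column
df['advan_3'] = le.fit_transform(df['advan_1'].fillna('无'))
df['advan_4'] = le.fit_transform(df['advan_2'].fillna('无'))
df['com_tag_new'] = le.fit_transform(df['com_tag'].fillna('无'))
df.drop(['advan_1','advan_2','com_tag'],inplace=True,axis=1)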

The fields are encoded because the next step is to analyze feature importance.
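Label encoding maps each distinct value to an integer in sorted order, which implies an ordering between categories that has no real meaning. The pure-Python sketch below mimics that mapping on toy labels and shows one-hot columns as an alternative; it is an illustration of the two encodings, not the code used in this analysis.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {v: i for i, v in enumerate(classes)}
    return [mapping[v] for v in values], mapping

def one_hot(values):
    """One 0/1 column per distinct value, avoiding an implied order between categories."""
    classes = sorted(set(values))
    return [[int(v == c) for c in classes] for v in values], classes

toy = ["48H发货", "先采后付", "无", "先采后付"]  # toy labels; "无" stands for a missing value
codes, mapping = label_encode(toy)
matrix, columns = one_hot(toy)
print(codes, mapping)
print(columns, matrix)
```

Tree models can still split on integer codes, so label encoding is workable here, but the importance of such a feature should be read with the arbitrary ordering in mind.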

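7. The promotion column advocacy

Products that show only add-to-cart or favorite counts are few, and their sales are low, so they are simply encoded as 0.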

# rows whose advocacy text mentions add-to-cart (加购) or favorites (收藏)
index = df['advocacy'].str.contains('加购|收藏',na=False)
df[index]

def clean_again(se):
    # keep the repurchase rate (复购率) as a number; everything else becomes 0
    if "复购率" in se:
        return float(se.replace('复购率',"").replace('%',""))
    else:
        return 0.0
df['advocacy_new'] = df['advocacy'].astype(str).apply(clean_again)
df.drop('advocacy',axis=1,inplace=True)
df.reset_index(drop=True,inplace=True)
df

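At this point the basic cleaning is done for everything except the title column.

Tokenizing the title column would produce a fairly high-dimensional sparse matrix, which easily leads to overfitting when examining feature importance, so the other features are examined first.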
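To make the "high-dimensional sparse matrix" concern concrete, the sketch below builds a binary title-by-word matrix from a few toy tokenized titles and reports how many columns it has and how much of it is non-zero. The titles are hypothetical; `bag_of_words` is an illustrative helper, not part of the analysis.

```python
def bag_of_words(token_lists):
    """Build a binary title-by-word matrix from already tokenized titles."""
    vocab = sorted({w for tokens in token_lists for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocab)
        for w in tokens:
            row[index[w]] = 1
        matrix.append(row)
    return vocab, matrix

# Toy tokenized titles (hypothetical, not from the scraped data).
titles = [["豆腐", "猫砂", "除臭"], ["膨润土", "猫砂", "批发"], ["混合", "猫砂"]]
vocab, matrix = bag_of_words(titles)
density = sum(map(sum, matrix)) / (len(matrix) * len(vocab))
print(len(vocab), "columns for", len(titles), "titles; density", round(density, 2))
```

With a few hundred titles and roughly a thousand distinct words, there would be more feature columns than rows, which is the overfitting risk mentioned above.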

from sklearn.ensemble import RandomForestRegressor

model_rf = RandomForestRegressor(max_depth=4
,min_samples_leaf=2
,n_jobs=-1
,max_features=4
,n_estimators=1000) # use plenty of trees
x_train = df[['price_num','advan_3','advan_4','com_tag_new','advocacy_new']]
y_train = df['money_sales']
model_rf.fit(x_train,y_train)
# sort the importances
impos = dict(zip(model_rf.feature_names_in_,model_rf.feature_importances_))
impos_sort = sorted(impos.items(),key=lambda x:x[1],reverse=True)
# switch to column names that are easier to read on the chart
map_dict={'price_num':'起始价'            # starting price
         ,'advan_3':'厂家优势1'           # manufacturer advantage 1
         ,'advan_4':'厂家优势2'           # manufacturer advantage 2
         ,'com_tag_new':'爆、严选、新品否'  # hot / curated / new tag
         ,'advocacy_new':'复购收藏情况'    # repurchase rate / favorites
         }
new_cols = [map_dict[i[0]] for i in impos_sort]
new_values = [i[1] for i in impos_sort]
plt.figure(figsize=(10,6))
sns.barplot(x=new_cols
           ,y=new_values)
# annotate each bar with its importance
for i,j in enumerate(new_values):
    plt.text(i,j+0.01,round(j,2),ha='center',va='bottom')
plt.ylim((0,0.5))
plt.title('特征重要性--MSE减少') # "Feature importance: reduction in MSE"
plt.show()

Next, try GBDT, which fits the data more easily but also overfits more easily.

from sklearn.ensemble import GradientBoostingRegressor

model_gbdt = GradientBoostingRegressor(max_depth=4
                                       ,min_samples_leaf=2
                                       ,n_estimators=1000
                                       ,learning_rate=0.1
                                       ,subsample=0.8
                                       ,max_features=3
                                       ,n_iter_no_change=20 # stop early when the validation score stops improving
                                       )
model_gbdt.fit(x_train,y_train)
impos_gbdt = dict(zip(model_gbdt.feature_names_in_,model_gbdt.feature_importances_))
impos_sort_gbdt = sorted(impos_gbdt.items(),key=lambda x:x[1],reverse=True)
sns.set(font_scale=1.5,font='STSong')
plt.figure(figsize=(10,6))
new_cols_boosting = [map_dict[i[0]] for i in impos_sort_gbdt]
new_values_boosting = [i[1] for i in impos_sort_gbdt]
sns.barplot(x=new_cols_boosting
           ,y=new_values_boosting)
# annotate each bar with its importance
for i,j in enumerate(new_values_boosting):
    plt.text(i,j+0.01,round(j,2),ha='center',va='bottom')
plt.ylim((0,0.8))
plt.title('GBDT特征重要性') # "GBDT feature importance"
plt.show()

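Summary: the starting price matters most, and the promotion tag (1688爆品, 严选, 新品) matters least. The repurchase rate also has a sizable effect, and so do the manufacturer advantages, so those fields are worth filling in.

3.4 Processing and analyzing the title column

It is generally assumed that the more closely title keywords match the actual product, user needs, and user search habits, the higher the exposure and, following the purchase funnel, the higher the final transaction amount tends to be. This holds only after excluding paid ranking promotion and other weighting factors.

Because recommendation algorithms are increasingly complex and closely guarded, inferring keyword SEO from the merchant side is a difficult bottom-up exercise. It cannot be extremely precise, but iterative optimization can keep raising exposure.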

# Tokenize the titles
import jieba
title_list = df['标题'].values.tolist()
title_s=[]
for line in title_list:
    title_cut = jieba.lcut(line)
    title_s.append(title_cut)

# Drop words that carry little meaning (stop words)
with open('d:/停用词表.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f} # a set makes the membership test fast
title_clean = []
for inner_list in title_s:
    line_clean = []
    for single_word in inner_list:
        if single_word not in stopwords:
            line_clean.append(single_word)
    title_clean.append(line_clean)

# Flatten into a single list
final_list=[]
for line in title_clean:
    for word in line:
        final_list.append(word)

print('分词结果总数',len(final_list)) # total number of tokens
print('每个标题关键词均数',len(final_list)/df.shape[0]) # average tokens per title

In total there are 6,795 tokens, an average of 14.58 per product; after deduplication there are 999 distinct words.

# Word frequencies
df_word = pd.DataFrame(data={'words':final_list})
word_count = df_word.words.value_counts().reset_index()
word_count.columns = ['word', 'count']
word_count

For some reason many people like to make a word cloud, so this article reluctantly follows the crowd:

# keep the 100 most frequent words
word_count = word_count.sort_values(by='count',ascending=False)
word_count = word_count.head(100)

from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.figure(figsize=(10,8))
w_c = WordCloud(font_path="d:/STXINWEI.ttf" # a font with Chinese glyphs is required
                , background_color="white"
                , max_font_size=100
                , margin=5)
# fit_words expects a {word: frequency} mapping
wc = w_c.fit_words({
    x[0]:x[1] for x in word_count.values
})
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Relate keywords to sales
# A product's sales are counted toward a keyword whenever its title contains that keyword

sales_list=[]
for single_word in word_count.word.values:
    # regex=False: treat the keyword as plain text so characters such as + or ( do not break the match
    sales_list.append(df[df['标题'].str.contains(single_word,regex=False)]['money_sales'].sum())

word_count['sales'] = sales_list
# sort by sales
word_count = word_count.sort_values(by='sales',ascending=True)
print(word_count)
# plot the 30 keywords with the highest sales
plt.figure(figsize=(10,20))
need = word_count.tail(30)
plt.barh(need.word.values,need.sales.values)
for i,j in enumerate(need.sales.values):
    plt.text(j,i,j,ha='left',va='center')

plt.show()

As the chart shows, products whose titles contain these keywords have the highest sales overall.
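One caveat when reading such a chart: a product's sales are credited to every keyword in its title, so the keyword totals overlap and add up to more than overall sales. A small sketch on toy listings (hypothetical titles and amounts) makes the effect visible; `sales_by_keyword` is an illustrative helper.

```python
def sales_by_keyword(items, keywords):
    """Credit each item's sales to every keyword found in its title (plain substring match)."""
    return {k: sum(s for title, s in items if k in title) for k in keywords}

# Toy listings (hypothetical titles and amounts).
items = [("豆腐猫砂除臭", 100), ("膨润土猫砂", 60), ("豆腐混合猫砂", 40)]
totals = sales_by_keyword(items, ["猫砂", "豆腐", "膨润土"])
print(totals)
print("sum of keyword totals:", sum(totals.values()), "overall:", sum(s for _, s in items))
```

A keyword that appears in almost every title, such as the category name itself, will always rank near the top, so the more informative comparison is between keywords of similar frequency.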

3.5 Pricing strategy

Different starting prices affect how often buyers enter the shop, which indirectly affects total sales.

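# Arbitrary equal-width binning does not work well
df['price_num'] = df['price_num'].astype(float)
df['price_num_bin'] = pd.cut(df['price_num'],bins=10)
# total sales per price bin (a dict key can be given only once, so only one aggregate per column fits here)
df.groupby('price_num_bin',observed=False).agg({'money_sales':'sum'}).plot(kind='bar')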

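Looking at the price distribution with 10 equal-width bins first gives a poor result.

It helps to look at what prices sellers actually charge. Broadly, listings with lower prices have higher total sales.

Manual binning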

# bin edges chosen by hand around common price points
df['price_num_bin'] = pd.cut(df['price_num'],bins=[4.9,9.9,12,16,20,28,39,56,76,88,120,160,190,240,270])
# total sales per price bin
df.groupby('price_num_bin',observed=False).agg({'money_sales':'sum'}).plot(kind='bar',figsize=(8,8))
plt.yticks(ticks=range(0,2500000,100000),labels=range(0,2500000,100000))
plt.show()
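Equal-width bins fail on skewed prices because most listings fall into the first bin. Besides hand-picked edges, quantile-based edges are a common alternative; this is a different technique from the manual binning used here, sketched with the standard library on toy prices (not the scraped data). The helper names are hypothetical.

```python
import bisect
import statistics

def quantile_edges(prices, n_bins):
    """Interior cut points that split `prices` into `n_bins` roughly equal-sized groups."""
    return statistics.quantiles(prices, n=n_bins, method="inclusive")

def assign_bin(price, edges):
    """Index of the bin a price falls into, given sorted interior edges."""
    return bisect.bisect_left(edges, price)

# Toy starting prices, skewed toward the low end.
prices = [5, 6, 7, 8, 9, 10, 12, 15, 40, 200]
edges = quantile_edges(prices, 4)
counts = [0] * 4
for p in prices:
    counts[assign_bin(p, edges)] += 1
print(edges, counts)
```

Quantile edges give each bin a similar number of listings, whereas hand-picked edges, as used above, align with price points that buyers and sellers recognize; which is preferable depends on what the chart is meant to show.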

3.6 Market situation

Distribution of sales amounts across the many sellers

plt.rcParams['axes.facecolor']='lightgrey' # set before plotting so it applies to the new axes
df['money_sales_bin'] = pd.cut(df['money_sales'],bins=[1000,5000,10000,30000,80000,150000,500000,700000])
# number of products in each sales bracket
df.groupby('money_sales_bin',observed=False).size().plot(kind='bar',alpha=.6,grid=False,color='firebrick',legend=False)
plt.show()

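Products with low sales make up a large share. Because of how the intervals are drawn, a single head seller may be worth dozens of the others.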