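Multi-Dimensional Visual Analysis of Cybersecurity Attack Data

Introduction

This project applies several data-processing and visualization techniques to analyze a dataset of cybersecurity attack events. The data is first loaded and preprocessed with pandas: checking for missing values, dropping redundant columns, and converting timestamps to datetime format for the later time-series analysis.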

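Analysis Steps

Preparation: load numpy, pandas, seaborn, and matplotlib for scientific computing and visualization, and walk the input directory with the os module to show where the dataset is located.
Initial data exploration: read and clean the cybersecurity_attacks.csv dataset, count the occurrences of each attack type, and plot the distribution as a bar chart to show the frequency differences between attack types.
Feature analysis: examine three selected features (Anomaly Scores, Source Port, Packet Length), drawing one box plot per feature to show its distribution under each attack type.
Network protocol share: count how often the three protocols UDP, ICMP, and TCP occur in the traffic, and show their relative shares in a pie chart.
Time-series analysis: derive year and month columns from the attack timestamps, build a pivot table, and use a heatmap to show how the number of attack events varies by year and month; then refine the view to weekdays to look for periodic patterns across months and days of the week.
Payload Data text mining: join the contents of the Payload Data column into a single string and generate a word cloud that surfaces high-frequency words in the payloads, as a starting point for spotting possible attack patterns and keywords.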

Code Walkthrough and Implementation

Importing the Required Libraries

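import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization
import matplotlib.pyplot as plt # plotting
import warnings # warning control
import calendar # calendar utilities

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory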

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv('/kaggle/input/cyber-security-attacks/cybersecurity_attacks.csv')

data.head()

	Timestamp	Source IP Address	Destination IP Address	Source Port	Destination Port	Protocol	Packet Length	Packet Type	Traffic Type	Payload Data	...	Action Taken	Severity Level	User Information	Device Information	Network Segment	Geo-location Data	Proxy Information	Firewall Logs	IDS/IPS Alerts	Log Source
0	2023-05-30 06:33:58	103.216.15.12	84.9.164.252	31225	17616	ICMP	503	Data	HTTP	Qui natus odio asperiores nam. Optio nobis ius...	...	Logged	Low	Reyansh Dugal	Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...	Segment A	Jamshedpur, Sikkim	150.9.97.135	Log Data	NaN	Server
1	2020-08-26 07:08:30	78.199.217.198	66.191.137.154	17245	48166	ICMP	1174	Data	HTTP	Aperiam quos modi officiis veritatis rem. Omni...	...	Blocked	Low	Sumer Rana	Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...	Segment B	Bilaspur, Nagaland	NaN	Log Data	NaN	Firewall
2	2022-11-13 08:23:25	63.79.210.48	198.219.82.17	16811	53600	UDP	306	Control	HTTP	Perferendis sapiente vitae soluta. Hic delectu...	...	Ignored	Low	Himmat Karpe	Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...	Segment C	Bokaro, Rajasthan	114.133.48.179	Log Data	Alert Data	Firewall
3	2023-07-02 10:38:46	163.42.196.10	101.228.192.255	20018	32534	UDP	385	Data	HTTP	Totam maxime beatae expedita explicabo porro l...	...	Blocked	Medium	Fateh Kibe	Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ...	Segment B	Jaunpur, Rajasthan	NaN	NaN	Alert Data	Firewall
4	2023-07-16 13:11:07	71.166.185.76	189.243.174.238	6131	26646	TCP	1462	Data	DNS	Odit nesciunt dolorem nisi iste iusto. Animi v...	...	Blocked	Low	Dhanush Chad	Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...	Segment C	Anantapur, Tripura	149.6.110.119	NaN	Alert Data	Firewall

5 rows × 25 columns

data.shape

(40000, 25)

data.info()

data.columns

Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
       'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
       'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
       'Anomaly Scores', 'Alerts/Warnings', 'Attack Type', 'Attack Signature',
       'Action Taken', 'Severity Level', 'User Information',
       'Device Information', 'Network Segment', 'Geo-location Data',
       'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'],
      dtype='object')

data.isnull().sum()

Dropping Irrelevant Columns

columns_to_drop = ['Malware Indicators', 'Alerts/Warnings', 'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts']
data = data.drop(columns_to_drop, axis=1)

data.isnull().sum()
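
The raw counts from data.isnull().sum() are easier to act on as a share of rows. The following is a minimal sketch on a small synthetic frame; the values and the 50% threshold are assumptions for illustration, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    'Protocol': ['TCP', 'UDP', 'ICMP', 'TCP'],
    'Proxy Information': ['150.9.97.135', np.nan, '114.133.48.179', np.nan],
    'IDS/IPS Alerts': [np.nan, np.nan, 'Alert Data', 'Alert Data'],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns whose missing share reaches a hypothetical 50% threshold.
threshold = 0.5
mostly_missing = missing_share[missing_share >= threshold].index.tolist()
print(mostly_missing)
```

A list like mostly_missing could then be reviewed by hand before deciding which columns to drop and which to impute.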

data.head()

	Timestamp	Source IP Address	Destination IP Address	Source Port	Destination Port	Protocol	Packet Length	Packet Type	Traffic Type	Payload Data	Anomaly Scores	Attack Type	Attack Signature	Action Taken	Severity Level	User Information	Device Information	Network Segment	Geo-location Data	Log Source
0	2023-05-30 06:33:58	103.216.15.12	84.9.164.252	31225	17616	ICMP	503	Data	HTTP	Qui natus odio asperiores nam. Optio nobis ius...	28.67	Malware	Known Pattern B	Logged	Low	Reyansh Dugal	Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...	Segment A	Jamshedpur, Sikkim	Server
1	2020-08-26 07:08:30	78.199.217.198	66.191.137.154	17245	48166	ICMP	1174	Data	HTTP	Aperiam quos modi officiis veritatis rem. Omni...	51.50	Malware	Known Pattern A	Blocked	Low	Sumer Rana	Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...	Segment B	Bilaspur, Nagaland	Firewall
2	2022-11-13 08:23:25	63.79.210.48	198.219.82.17	16811	53600	UDP	306	Control	HTTP	Perferendis sapiente vitae soluta. Hic delectu...	87.42	DDoS	Known Pattern B	Ignored	Low	Himmat Karpe	Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...	Segment C	Bokaro, Rajasthan	Firewall
3	2023-07-02 10:38:46	163.42.196.10	101.228.192.255	20018	32534	UDP	385	Data	HTTP	Totam maxime beatae expedita explicabo porro l...	15.79	Malware	Known Pattern B	Blocked	Medium	Fateh Kibe	Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ...	Segment B	Jaunpur, Rajasthan	Firewall
4	2023-07-16 13:11:07	71.166.185.76	189.243.174.238	6131	26646	TCP	1462	Data	DNS	Odit nesciunt dolorem nisi iste iusto. Animi v...	0.52	DDoS	Known Pattern B	Blocked	Low	Dhanush Chad	Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...	Segment C	Anantapur, Tripura	Firewall

data.duplicated().sum()

# Convert the 'Timestamp' column to datetime objects so the .dt accessors used later work
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
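
If some timestamps might be malformed, pd.to_datetime raises by default. A minimal sketch of a more defensive conversion uses errors='coerce' and then counts the values that failed to parse. The sample strings below are synthetic, and nothing in the source says the real column contains bad values.

```python
import pandas as pd

# Synthetic sample: two valid timestamps and one deliberately malformed value.
raw = pd.Series(['2023-05-30 06:33:58', '2020-08-26 07:08:30', 'not a timestamp'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

# Count how many entries could not be parsed.
n_bad = int(parsed.isna().sum())
print(parsed)
print(f'unparseable timestamps: {n_bad}')
```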

Bar Chart of the Attack Type Distribution

Plot the distribution of attack types in the dataset as a bar chart.

# Count the number of records for each attack type
attack_counts = data['Attack Type'].value_counts()

# Choose a color palette with one color per attack type
colors = sns.color_palette('viridis', len(attack_counts))

# Create the figure and draw the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=attack_counts.index, y=attack_counts, palette=colors)

# Add a count label above each bar
for i, count in enumerate(attack_counts):
    plt.text(i, count + 0.1, str(count), ha='center', va='bottom', fontsize=10, fontweight='bold')

# Add the x-axis label, y-axis label, and title
plt.xlabel('Attack Type', fontsize=14, fontweight='bold')
plt.ylabel('Count', fontsize=14, fontweight='bold')
plt.title('Distribution of Attack Types', fontsize=16)

# Rotate the x-axis tick labels for readability
plt.xticks(rotation=45, ha='right')

# Adjust the layout and show the chart
plt.tight_layout()
plt.show()
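
Alongside the bar chart, the same counts can be expressed as percentages, which makes near-equal classes easier to compare. This is a small sketch on synthetic labels; the column name matches the dataset, but the rows are invented for illustration.

```python
import pandas as pd

# Synthetic attack labels, for illustration only.
toy = pd.DataFrame({
    'Attack Type': ['DDoS', 'Malware', 'DDoS', 'Intrusion', 'DDoS', 'Malware'],
})

# Absolute counts and percentage shares, aligned on the attack type.
counts = toy['Attack Type'].value_counts()
share = toy['Attack Type'].value_counts(normalize=True).mul(100).round(1)
summary = pd.DataFrame({'count': counts, 'share_pct': share})
print(summary)
```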

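Box Plots of Anomaly Scores, Source Port, and Packet Length

Plot the distributions of anomaly score, source port, and packet length, using one box plot per feature to show how it is distributed within each attack type.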

# Columns to plot
columns = ['Anomaly Scores', 'Source Port', 'Packet Length']

# Draw one box plot per column
for col in columns:
    # Create a new 12 x 6 inch figure
    plt.figure(figsize=(12, 6))
    # Draw the box plot without outlier markers (fliers)
    sns.boxplot(data=data, x=col, y='Attack Type', showfliers=False)
    # Set the title; the column name is used as-is to keep its original capitalization
    plt.title(f'Distribution of {col} by Attack Type')
    # Set the x-axis and y-axis labels
    plt.xlabel(col, fontsize=14, fontweight='bold')
    plt.ylabel('Attack Type', fontsize=14, fontweight='bold')
    # Show the figure
    plt.show()
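
The numbers behind a box plot (quartiles and interquartile range) can also be tabulated directly, which helps when the boxes for different attack types look almost identical. A minimal sketch on synthetic values, with hypothetical output column names q1, median, q3, and iqr:

```python
import pandas as pd

# Synthetic packet lengths per attack type, for illustration only.
toy = pd.DataFrame({
    'Attack Type': ['DDoS', 'DDoS', 'DDoS', 'Malware', 'Malware', 'Malware'],
    'Packet Length': [100, 200, 300, 400, 600, 800],
})

# Quartiles per attack type: the values a box plot draws as the box and its center line.
stats = (
    toy.groupby('Attack Type')['Packet Length']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
stats.columns = ['q1', 'median', 'q3']

# Interquartile range: the height of the box.
stats['iqr'] = stats['q3'] - stats['q1']
print(stats)
```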

Pie Chart of the Network Traffic Protocol Distribution

# Count the occurrences of each protocol; value_counts() sorts by frequency, largest first
sizes = data['Protocol'].value_counts()
# Take the labels from the counts' index so that each label matches its slice
labels = sizes.index
# Explode effect: pull out the first (most frequent) slice
explode = [0.1] + [0] * (len(sizes) - 1)
# Custom color scheme, one color per protocol
colors = sns.color_palette('pastel')[0:len(labels)]
# Use seaborn's "whitegrid" style
sns.set(style="whitegrid")
# Create the pie chart
plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, colors=colors, explode=explode, autopct='%1.1f%%', startangle=90)
# Equal aspect ratio so the pie is drawn as a circle
plt.axis('equal')
# Set the title
plt.title('Distribution of Network Traffic Protocols', fontsize=16, fontweight='bold')
# Show the pie chart
plt.show()
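
One detail worth checking in any pie chart built from value_counts(): the result is sorted by frequency, not by a fixed protocol order, so labels are safest when read from the index of the counts. A small sketch with synthetic protocol values shows the ordering:

```python
import pandas as pd

# Synthetic protocol column, for illustration only.
toy = pd.Series(['TCP', 'TCP', 'TCP', 'UDP', 'ICMP', 'ICMP'], name='Protocol')

# value_counts() orders by count (descending), not by any fixed protocol order.
sizes = toy.value_counts()

# Reading the labels from the index keeps each label paired with its own count.
labels = sizes.index.tolist()
pairs = list(zip(labels, sizes.tolist()))
print(pairs)
```

With this toy input the most frequent protocol comes first, so a hard-coded label list such as ['UDP', 'ICMP', 'TCP'] would attach the wrong names to the slices.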

# Derive two new columns, 'Year' and 'Month', from the 'Timestamp' column
# 'Year' holds the year of each timestamp
# 'Month' holds the month name of each timestamp
data['Year'] = data['Timestamp'].dt.year  # year of the timestamp
data['Month'] = data['Timestamp'].dt.month_name()  # month name of the timestamp
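
Month names stored as plain strings sort alphabetically (April, August, December, ...), which scrambles the calendar order in tables and heatmaps. One possible remedy, sketched here on synthetic dates, is an ordered categorical built from calendar.month_name. This assumes the default English locale, so that the names match those returned by month_name().

```python
import calendar
import pandas as pd

# Synthetic timestamps, for illustration only.
ts = pd.to_datetime(pd.Series(['2023-04-02', '2023-01-15', '2023-08-09', '2023-01-20']))

# calendar.month_name[0] is an empty string, so skip it.
month_order = list(calendar.month_name)[1:]

# An ordered categorical keeps January..December order instead of alphabetical order.
months = pd.Categorical(ts.dt.month_name(), categories=month_order, ordered=True)

# sort=False preserves the category order; unobserved months appear with a count of 0.
counts = pd.Series(months).value_counts(sort=False)
print(counts)
```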

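Build a pivot table with the year as the row index and the month as the column index, counting the number of records for each month of each year.

"""

Parameters:
- data: the input data, expected to contain the 'Year', 'Month', and 'Timestamp' columns.
- values: the column to aggregate; here the 'Timestamp' column.
- index: the row index of the pivot table; here the 'Year' column.
- columns: the column index of the pivot table; here the 'Month' column.
- aggfunc: the aggregation applied to the values column; here 'count'.
- fill_value: the value used for missing cells; here 0.

Returns:
- pivot_table: the resulting pivot table, showing the record count for each month of each year.
"""
# Build the pivot table
pivot_table = pd.pivot_table(data, values='Timestamp', index='Year', columns='Month', aggfunc='count', fill_value=0)

# Print the pivot table
print("Pivot Table - Year vs Month:")
print(pivot_table)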
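
For a pure count table, pd.crosstab is a shorter alternative to pd.pivot_table with aggfunc='count'. The sketch below builds both on a few synthetic timestamps and compares the resulting cells; it is an illustration of the equivalence on this toy input, not a claim about the full dataset.

```python
import pandas as pd

# Synthetic timestamps, for illustration only.
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2023-07-02', '2023-07-16', '2023-05-30', '2022-11-13', '2020-08-26',
    ]),
})
toy['Year'] = toy['Timestamp'].dt.year
toy['Month'] = toy['Timestamp'].dt.month_name()

# Count table via pivot_table, as in the walkthrough.
pivot = pd.pivot_table(toy, values='Timestamp', index='Year', columns='Month',
                       aggfunc='count', fill_value=0)

# The same count table via crosstab, which needs no values column.
cross = pd.crosstab(toy['Year'], toy['Month'])

print(cross)
same_cells = pivot.to_numpy().tolist() == cross.to_numpy().tolist()
print(same_cells)
```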

Time-Series Analysis: Heatmap

# Set the figure size for the heatmap
plt.figure(figsize=(12, 8))
# Draw the heatmap: annotate cells with integer counts, use the "viridis" colormap, and separate cells with 0.5-wide lines
sns.heatmap(pivot_table, annot=True, fmt="d", cmap="viridis", linewidths=.5)
# Add the title and axis labels
plt.title('Count of Records: Year vs Month')
plt.xlabel('Month')
plt.ylabel('Year')

# Show the heatmap
plt.show()

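Pivot Table of Month vs Weekday

# The 'Weekday' column is not created earlier, so derive it from the timestamp first
data['Weekday'] = data['Timestamp'].dt.day_name()

pivot_table_month_weekday = pd.pivot_table(data, values='Timestamp', index='Month', columns='Weekday', aggfunc='count', fill_value=0)
print("Pivot Table - Month vs Weekday:")
print(pivot_table_month_weekday)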
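
A month-by-weekday table is easier to read when both axes follow calendar order rather than alphabetical order. The sketch below, on synthetic timestamps, reindexes a count table with calendar.month_name and calendar.day_name. It assumes the default English locale, and the variable names are illustrative.

```python
import calendar
import pandas as pd

# Synthetic timestamps, for illustration only.
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2023-07-02 10:38:46', '2023-07-16 13:11:07',
        '2023-05-30 06:33:58', '2022-11-13 08:23:25',
    ]),
})
toy['Month'] = toy['Timestamp'].dt.month_name()
toy['Weekday'] = toy['Timestamp'].dt.day_name()

# Count table of month vs weekday.
table = pd.crosstab(toy['Month'], toy['Weekday'])

# Reorder rows January..December and columns Monday..Sunday, filling absent cells with 0.
month_order = list(calendar.month_name)[1:]
weekday_order = list(calendar.day_name)
table = table.reindex(index=month_order, columns=weekday_order, fill_value=0)
print(table)
```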

Plot a heatmap showing how the record counts are distributed across months and weekdays.

# Create a new figure, 12 inches wide and 8 inches high
plt.figure(figsize=(12, 8))
# Draw the heatmap: annotate cells with integer counts, use the 'YlGnBu' colormap, and separate cells with 0.5-wide lines
sns.heatmap(pivot_table_month_weekday, annot=True, fmt="d", cmap="YlGnBu", linewidths=.5)
# Set the title, x-axis label, and y-axis label
plt.title('Count of Records: Month vs Weekday')
plt.xlabel('Weekday')
plt.ylabel('Month')

# Show the heatmap
plt.show()

Word Cloud of High-Frequency Words in the Attack Payloads

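from wordcloud import WordCloud  # third-party package, not imported earlier

# Convert every value in the 'Payload Data' column to a string and join them into one string
text = ' '.join(data['Payload Data'].astype(str).values)

# Create the figure and axes
fig, ax = plt.subplots(figsize=(12, 12))

# Generate and display the word cloud
wordcloud = WordCloud(random_state=2023, height=1200, width=1200).generate(text)
ax.imshow(wordcloud)
ax.axis('off')  # hide the axes

# Show the word cloud
plt.show()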
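
A word cloud shows relative prominence but not exact frequencies. As a complement, the most common tokens can be counted with the standard library. This is a minimal sketch on short synthetic payload strings; the tokenization rule (lowercase alphabetic runs) is an assumption and differs from whatever preprocessing the wordcloud package applies.

```python
import re
from collections import Counter

# Short synthetic payload strings, for illustration only.
payloads = [
    'Qui natus odio asperiores nam',
    'Odit nesciunt dolorem nisi iste iusto',
    'Qui odio nisi',
]

# Join the payloads and split into lowercase alphabetic tokens.
text = ' '.join(payloads)
tokens = re.findall(r'[a-z]+', text.lower())

# The three most frequent tokens with their counts.
top_words = Counter(tokens).most_common(3)
print(top_words)
```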

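Suggestions for Improvement

Data preprocessing:

For columns with many missing values, consider a more careful imputation strategy than dropping them, such as simple statistical imputation (mean, median, or mode) or model-based methods (KNN imputation, multiple imputation, and so on).
For feature selection, beyond removing redundant columns based on domain understanding, feature-importance estimates (for example from random forest or gradient-boosted tree models) can help identify the features most useful for predicting the attack type.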
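
The simple statistical imputation mentioned above can be sketched with pandas alone: median for a numeric column and mode for a categorical one. The frame below is synthetic, and the missing entries are invented to demonstrate the calls; they do not reflect where the real dataset has gaps.

```python
import numpy as np
import pandas as pd

# Synthetic frame with artificial gaps, for illustration only.
toy = pd.DataFrame({
    'Packet Length': [503.0, np.nan, 306.0, 385.0, 1462.0],
    'Severity Level': ['Low', None, 'Low', 'Medium', 'Low'],
})

filled = toy.copy()

# Numeric column: fill gaps with the median, which is robust to extreme values.
filled['Packet Length'] = filled['Packet Length'].fillna(filled['Packet Length'].median())

# Categorical column: fill gaps with the most frequent value (the mode).
filled['Severity Level'] = filled['Severity Level'].fillna(filled['Severity Level'].mode().iloc[0])

print(filled)
```

Model-based approaches such as KNN imputation follow the same pattern (fit on observed values, fill the gaps) but need an additional library and numerically encoded features.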