Python酷库之旅-第三方库Pandas(020)

# 49、pandas.merge_asof函数
pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')
Perform a merge by key distance.

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

Optionally match on equivalent keys with ‘by’ before searching with ‘on’.

Parameters:
left
DataFrame or named Series
right
DataFrame or named Series
on
label
Field name to join on. Must be found in both DataFrames. The data MUST be ordered. Furthermore this must be a numeric column, such as datetimelike, integer, or float. On or left_on/right_on must be given.

left_on
label
Field name to join on in left DataFrame.

right_on
label
Field name to join on in right DataFrame.

left_index
bool
Use the index of the left DataFrame as the join key.

right_index
bool
Use the index of the right DataFrame as the join key.

by
column name or list of column names
Match on these columns before performing merge operation.

left_by
column name
Field names to match on in the left DataFrame.

right_by
column name
Field names to match on in the right DataFrame.

suffixes
2-length sequence (tuple, list, …)
Suffix to apply to overlapping column names in the left and right side, respectively.

tolerance
int or Timedelta, optional, default None
Select asof tolerance within this range; must be compatible with the merge index.

allow_exact_matches
bool, default True
If True, allow matching with the same ‘on’ value (i.e. less-than-or-equal-to / greater-than-or-equal-to)

If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-than).

direction
‘backward’ (default), ‘forward’, or ‘nearest’
Whether to search for prior, subsequent, or closest matches.

Returns:
DataFrame

49-2、参数

49-2-1、left(必须)：左侧DataFrame对象。

49-2-2、right(必须)：右侧DataFrame对象。

49-2-3、on(可选，默认值为None)：指定用于合并的列，这个列在两个DataFrame中都必须存在。如果没有指定left_on和right_on，则必须提供该参数。

49-2-4、left_on(可选，默认值为None)：左侧DataFrame中用于合并的列。

49-2-5、right_on(可选，默认值为None)：右侧DataFrame中用于合并的列。

49-2-6、left_index(可选，默认值为False)：布尔值，表示是否使用左侧DataFrame的索引来进行合并。

49-2-7、right_index(可选，默认值为False)：布尔值，表示是否使用右侧DataFrame的索引来进行合并。

49-2-8、by(可选，默认值为None)：在执行合并前先对指定列进行分组，by列在两个DataFrame 中都必须存在，类似于SQL中的“分区”合并。

49-2-9、left_by(可选，默认值为None)：左侧DataFrame中用于分组的列。

49-2-10、right_by(可选，默认值为None)：右侧DataFrame中用于分组的列。

49-2-11、suffixes(可选，默认值为('_x', '_y'))：当两个DataFrame中存在同名列时，指定列名的后缀。

49-2-12、tolerance(可选，默认值为None)：指定合并时允许的最大时间差，可以是一个数值或Timedelta对象。

49-2-13、allow_exact_matches(可选，默认值为True)：布尔值，表示是否允许精确匹配。

49-2-14、direction(可选，默认值为'backward')：指定匹配的方向，可以是'backward'(向后匹配)，'forward'(向前匹配)或者 'nearest'(最近匹配)。

49-3、功能

进行“按时间顺序的近似匹配”合并操作，它特别适用于时间序列数据，当两个DataFrame的时间戳并不完全匹配时，可以通过该函数找到最近的匹配点进行合并。

49-4、返回值

返回一个新的DataFrame，该DataFrame包含合并后的结果。

49-5、说明

49-5-1、功能

49-5-1-1、近似时间匹配：pandas.merge_asof()可以在两个DataFrame之间基于时间戳列进行合并，即使时间戳不完全匹配。它会根据指定的方向找到最近的匹配点。

49-5-1-2、方向控制：用户可以指定合并方向，如向后匹配(backward)、向前匹配(forward)或最近匹配(nearest)。

49-5-1-3、容差范围：可以设置一个容差范围(tolerance)，限制匹配点的最大时间差。

49-5-1-4、分组合并：可以按指定列进行分组，然后在每个分组内进行合并。

49-5-1-5、索引合并：允许使用索引进行合并，而不仅限于列。

49-5-2、返回值

49-5-2-1、合并列：用于合并操作的列(例如时间戳列)。

49-5-2-2、原始列：来自左侧和右侧DataFrame的所有列。对于同名列，会根据suffixes参数添加后缀。

49-5-2-3、匹配列：合并时所找到的最近匹配点的对应值。

49-6、用法

49-6-1、数据准备

无

49-6-2、代码示例

# 49、pandas.merge_asof函数
import pandas as pd
# 创建示例DataFrame
df1 = pd.DataFrame({
    'time': pd.to_datetime(['2024-07-13 01:00', '2024-07-13 02:00', '2024-07-13 03:00']),
    'value1': [10, 20, 30]
})
df2 = pd.DataFrame({
    'time': pd.to_datetime(['2024-07-13 01:30', '2024-07-13 02:30']),
    'value2': [100, 200]
})
# 使用merge_asof进行近似时间匹配合并
result = pd.merge_asof(df1, df2, on='time', direction='nearest', suffixes=('_left', '_right'))
print(result)

49-6-3、结果输出

# 49、pandas.merge_asof函数
#                  time  value1  value2
# 0 2024-07-13 01:00:00      10     100
# 1 2024-07-13 02:00:00      20     100
# 2 2024-07-13 03:00:00      30     200

50、pandas.concat函数

50-1、语法

# 50、pandas.concat函数
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)
Concatenate pandas objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Parameters:
objs
a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

axis
{0/’index’, 1/’columns’}, default 0
The axis to concatenate along.

join
{‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis (or axes).

ignore_index
bool, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

keys
sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.

levels
list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.

names
list, default None
Names for the levels in the resulting hierarchical index.

verify_integrity
bool, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

sort
bool, default False
Sort non-concatenation axis if it is not already aligned. One exception to this is when the non-concatentation axis is a DatetimeIndex and join=’outer’ and the axis is not already aligned. In that case, the non-concatenation axis is always sorted lexicographically.

copy
bool, default True
If False, do not copy data unnecessarily.

Returns:
object, type of objs
When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.

50-2、参数

50-2-1、objs(必须)：待连接的DataFrame或Series对象的列表或字典。

50-2-2、axis(可选，默认值为0)：沿指定轴进行连接，0表示纵向(沿行)，1表示横向(沿列)。

50-2-3、join(可选，默认值为'outer')：指定连接方式，‘outer’为外连接，‘inner’为内连接。

50-2-4、ignore_index(可选，默认值为False)：若为True，则忽略索引，生成新的整数索引。

50-2-5、keys(可选，默认值为None)：用于构建多层索引，如果提供该参数，则连接结果会有一个多层索引。

50-2-6、levels(可选，默认值为None)：用于构建多层索引级别，必须与keys参数一起使用。

50-2-7、names(可选，默认值为None)：多层索引级别的名称，必须与keys参数一起使用。

50-2-8、verify_integrity(可选，默认值为False)：若为True，检查新连接的对象是否有重复索引，如果有重复，抛出异常。

50-2-9、sort(可选，默认值为False)：若为True，则根据连接的索引进行排序，为了提升性能，默认不排序。

50-2-10、copy(可选，默认值为None)：若为False，则不复制数据。

50-3、功能

用于沿指定轴将多个DataFrame或Series对象进行连接。

50-4、返回值

返回值是一个新的DataFrame或Series，具体取决于输入对象和参数设置。

50-5、说明

返回值的类型和结构主要取决于以下几个因素：

50-5-1、输入对象的类型：输入的对象可以是DataFrame或Series。

50-5-2、连接的轴(axis)：指定是沿行(axis=0)还是沿列(axis=1)进行连接。

50-5-3、连接方式(join)：指定是内连接还是外连接。

50-5-4、是否忽略索引(ignore_index)：决定是否生成新的整数索引。

50-5-5、多层索引(keys, levels, names)：如果提供这些参数，返回的将是一个具有多层索引的DataFrame。

50-6、用法

50-6-1、数据准备

无

50-6-2、代码示例

# 50、pandas.concat函数
import pandas as pd
# 创建示例DataFrame
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
}, index=[0, 1, 2])
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
}, index=[3, 4, 5])
# 纵向连接
result = pd.concat([df1, df2], axis=0)
print(result,end='\n\n')
# 横向连接，忽略索引
result = pd.concat([df1, df2], axis=1, ignore_index=True)
print(result, end='\n\n')
# 多层索引
result = pd.concat([df1, df2], keys=['df1', 'df2'])
print(result)

50-6-3、结果输出

# 50、pandas.concat函数
#     A   B
# 0  A0  B0
# 1  A1  B1
# 2  A2  B2
# 3  A3  B3
# 4  A4  B4
# 5  A5  B5

#      0    1    2    3
# 0   A0   B0  NaN  NaN
# 1   A1   B1  NaN  NaN
# 2   A2   B2  NaN  NaN
# 3  NaN  NaN   A3   B3
# 4  NaN  NaN   A4   B4
# 5  NaN  NaN   A5   B5

#         A   B
# df1 0  A0  B0
#     1  A1  B1
#     2  A2  B2
# df2 3  A3  B3
#     4  A4  B4
#     5  A5  B5

51、pandas.get_dummies函数

51-1、语法

# 51、pandas.get_dummies函数
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables.

Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

Parameters:
data
array-like, Series, or DataFrame
Data of which to get dummy indicators.

prefix
str, list of str, or dict of str, default None
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep
str, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na
bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.

columns
list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object, string, or category dtype will be converted.

sparse
bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

drop_first
bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype
dtype, default bool
Data type for new columns. Only a single dtype is allowed.

Returns:
DataFrame
Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.

51-2、参数

51-2-1、data(必须)：要转换的输入数据，可以是数组、Series或DataFrame。

51-2-2、prefix(可选，默认值为None)：前缀字符串，用于哑变量列的命名，如果输入是DataFrame，可以传递一个字典来分别为每一列指定前缀。

51-2-3、prefix_sep(可选，默认值为'_')：前缀和分类值之间的分隔符。例如，如果前缀是A，分类值是cat，那么结果列名将是A_cat。

51-2-4、dummy_na(可选，默认值为False)：如果为True，则会为NaN/缺失值添加一列指示变量，缺失值将被视为一个有效的分类。

51-2-5、columns(可选，默认值为None)：指定要转换的列，如果未指定，将转换所有分类变量列(包括object和category类型的列)。

51-2-6、sparse(可选，默认值为False)：如果为True，返回的哑变量列将是稀疏的(SparseDataFrame 或 SparseArray)，这对于大数据集可能更有效。

51-2-7、drop_first(可选，默认值为False)：如果为True，则删除第一个分类变量的哑变量列，以避免多重共线性，这在回归模型中很常用。

51-2-8、dtype(可选，默认值为None)：指定输出哑变量列的dtype，默认情况下，输出列为uint8类型。

51-3、功能

用于将分类变量转换为哑变量(虚拟变量)或指标变量，它可以将带有分类数据的列转换为多个二进制(0/1)列，方便在机器学习模型中使用。

51-4、返回值

返回值是一个DataFrame，其中包含原始数据框中的所有非分类变量列，以及为每个分类变量生成的哑变量列。

51-5、说明

无

51-6、用法

51-6-1、数据准备

无

51-6-2、代码示例

# 51、pandas.get_dummies函数
# 51-1、基本用法
import pandas as pd
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['c', 'c', 'b'],
    'C': [1, 2, 3]
})
print('原始数据框：')
print(df, end='\n\n')
result = pd.get_dummies(df)
print('基本用法：')
print(result, end='\n\n')

# 51-2、指定前缀和前缀分隔符
import pandas as pd
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['c', 'c', 'b'],
    'C': [1, 2, 3]
})
result = pd.get_dummies(df, prefix=['colA', 'colB'], prefix_sep='-')
print('指定前缀和前缀分隔符：')
print(result, end='\n\n')

# 51-3、删除第一个哑变量列
import pandas as pd
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['c', 'c', 'b'],
    'C': [1, 2, 3]
})
result = pd.get_dummies(df, drop_first=True)
print('删除第一个哑变量列：')
print(result, end='\n\n')

# 51-4、为特定列生成哑变量
import pandas as pd
df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['c', 'c', 'b'],
    'C': [1, 2, 3]
})
result = pd.get_dummies(df, columns=['A'])
print('为特定列生成哑变量：')
print(result)

51-6-3、结果输出

# 51、pandas.get_dummies函数
# 51-1、基本用法
# 原始数据框：
#    A  B  C
# 0  a  c  1
# 1  b  c  2
# 2  a  b  3

# 基本用法：
#    C    A_a    A_b    B_b    B_c
# 0  1   True  False  False   True
# 1  2  False   True  False   True
# 2  3   True  False   True  False

# 51-2、指定前缀和前缀分隔符
# 指定前缀和前缀分隔符：
#    C  colA-a  colA-b  colB-b  colB-c
# 0  1    True   False   False    True
# 1  2   False    True   False    True
# 2  3    True   False    True   False

# 51-3、删除第一个哑变量列
# 删除第一个哑变量列：
#    C    A_b    B_c
# 0  1  False   True
# 1  2   True   True
# 2  3  False  False

# 51-4、为特定列生成哑变量
# 为特定列生成哑变量：
#    B  C    A_a    A_b
# 0  c  1   True  False
# 1  c  2  False   True
# 2  b  3   True  False