Python酷库之旅-第三方库Pandas(019)

# 47、pandas.merge函数
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.

Parameters:
leftDataFrame or named Series
rightDataFrame or named Series
Object to merge with.

how{‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
Type of merge to be performed.

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

cross: creates the cartesian product from both frames, preserves the order of the left keys.

onlabel or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_onlabel or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_onlabel or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_indexbool, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

right_indexbool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.

sortbool, default False
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).

suffixeslist-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

copybool, default True
If False, avoid copy if possible.

Note

The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.

You can already get the future behavior and improvements through enabling copy on write pd.options.mode.copy_on_write = True

indicatorbool or str, default False
If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.

validatestr, optional
If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

“one_to_many” or “1:m”: check if merge keys are unique in left dataset.

“many_to_one” or “m:1”: check if merge keys are unique in right dataset.

“many_to_many” or “m:m”: allowed, but does not result in checks.

Returns:
DataFrame
A DataFrame of the two merged objects.

47-2、参数

47-2-1、left(必须)：左侧DataFrame。

47-2-2、right(必须)：右侧DataFrame。

47-2-3、how(可选，默认值为'inner')：字符串，指定合并的方式：
47-2-3-1、'left': 左连接，返回左表的所有行，即使右表中没有匹配，如果右表中没有匹配，则结果中右表的部分会以NaN填充。
47-2-3-2、'right': 右连接，与左连接相反。
47-2-3-3、'outer': 外连接，返回两个表中所有的行，当某行在一侧表中没有匹配时，另一侧表的相应部分将以NaN填充。
47-2-3-4、'inner': 内连接，仅返回两个表中都有匹配的行。

47-2-4、on(可选，默认值为None)：字符串或字符串列表，用于连接的列名，必须是两个DataFrame共有的列名，如果提供了多个列名，则DataFrame会在这些列上共同进行合并。

47-2-5、left_on(可选，默认值为None)：字符串或字符串列表，左侧DataFrame中用于连接的列名或列名列表，如果未指定on参数，则left_on和right_on必须被定义。

47-2-6、right_on(可选，默认值为None)：字符串或字符串列表，右侧DataFrame中用于连接的列名或列名列表，与left_on类似。

47-2-7、left_index(可选，默认值为False)：布尔值，如果为True，则使用左侧DataFrame的索引(行标签)作为连接键。

47-2-8、right_index(可选，默认值为False)：布尔值，如果为True，则使用右侧DataFrame的索引(行标签)作为连接键。

47-2-9、sort(可选，默认值为False)：布尔值，如果为True，则合并后的DataFrame会根据合并键进行排序，注意，这在大数据集上可能非常耗时。

47-2-10、suffixes(可选，默认值为('_x', '_y'))：元组，用于在合并过程中处理非唯一列名的后缀，如果合并的DataFrame中存在同名的列，并且没有指定on、left_on或right_on参数来指定合并键，则这些列名会被加上后缀以区分。

47-2-11、copy(可选，默认值为None)：布尔值，如果为False，并且没有复制底层数据，则原始数据可能会被修改(尽管这通常不会发生)，此参数在Pandas的较新版本中已弃用，因为Pandas现在总是返回合并数据的副本。

47-2-12、indicator(可选，默认值为False)：布尔值或字符串，如果为True，则会在结果DataFrame中添加一个名为"_merge" 的列，指示每行数据源(如'left_only'、'right_only'、'both')，如果为字符串，则将该字符串用作新列的列名。

47-2-13、validate(可选，默认值为None)：用于指定合并时应如何验证DataFrame，例如'one_to_one'或'm:n'可以用来确保合并符合特定的关系类型，这主要用于调试目的。

47-3、功能

允许用户根据一个或多个键(列名或索引)将两个DataFrame对象合并起来，合并操作类似于SQL中的JOIN操作，可以根据需要选择内连接(inner join)、外连接(outer join)、左连接(left join)或右连接(right join)等不同的合并方式。此外，用户还可以指定合并时使用的列名(通过on、left_on和right_on参数)或索引(通过left_index和right_index参数)。

47-4、返回值

返回值是一个新的DataFrame对象，该对象包含了合并后的数据。根据合并方式的不同，返回的DataFrame将包含以下数据：

47-4-1、内连接(inner join)：仅返回两个DataFrame中在合并键上都有匹配的行。
47-4-2、外连接(outer join)：返回两个DataFrame中的所有行，如果某行在一侧表中没有匹配，则另一侧表的相应部分将以NaN填充。
47-4-3、左连接(left join)：返回左表的所有行，即使右表中没有匹配，如果右表中没有匹配，则结果中右表的部分会以NaN填充。
47-4-4、右连接(right join)：与左连接相反，返回右表的所有行，如果左表中没有匹配，则左表的部分会以NaN填充。

47-5、说明

如果设置了indicator参数为True，则返回的DataFrame还将包含一个名为"_merge"的列，用于指示每行数据的来源('left_only'、'right_only'或'both')。

47-6、用法

47-6-1、数据准备

无

47-6-2、代码示例

# 47、pandas.merge函数
import pandas as pd
# 创建示例数据帧
left = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value_left': [1, 2, 3, 4]
})
right = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value_right': [5, 6, 7, 8]
})
# 合并数据帧，使用'key'列作为连接键，默认内连接
merged_df = pd.merge(left, right, how='inner', on='key')
print("Inner Merge on 'key':")
print(merged_df)
# 创建示例数据帧，带有不同的连接键列名
left = pd.DataFrame({
    'key_left': ['A', 'B', 'C', 'D'],
    'value_left': [1, 2, 3, 4]
})
right = pd.DataFrame({
    'key_right': ['B', 'D', 'E', 'F'],
    'value_right': [5, 6, 7, 8]
})
# 合并数据帧，指定不同的连接键列名
merged_df = pd.merge(left, right, how='inner', left_on='key_left', right_on='key_right')
print("\nInner Merge with Different Key Names:")
print(merged_df)
# 创建示例数据帧，使用索引作为连接键
left = pd.DataFrame({
    'value_left': [1, 2, 3, 4]
}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({
    'value_right': [5, 6, 7, 8]
}, index=['B', 'D', 'E', 'F'])
# 合并数据帧，使用索引作为连接键
merged_df = pd.merge(left, right, how='inner', left_index=True, right_index=True)
print("\nInner Merge on Index:")
print(merged_df)
# 合并数据帧，显示连接指示器
merged_df = pd.merge(left, right, how='outer', left_index=True, right_index=True, indicator=True)
print("\nOuter Merge with Indicator:")
print(merged_df)
# 验证合并
merged_df = pd.merge(left, right, how='inner', left_index=True, right_index=True, validate='1:1')
print("\nValidated Inner Merge:")
print(merged_df)

47-6-3、结果输出

# 47、pandas.merge函数
# Inner Merge on 'key':
#   key  value_left  value_right
# 0   B           2            5
# 1   D           4            6
# 
# Inner Merge with Different Key Names:
#   key_left  value_left key_right  value_right
# 0        B           2         B            5
# 1        D           4         D            6
# 
# Inner Merge on Index:
#    value_left  value_right
# B           2            5
# D           4            6
# 
# Outer Merge with Indicator:
#    value_left  value_right      _merge
# A         1.0          NaN   left_only
# B         2.0          5.0        both
# C         3.0          NaN   left_only
# D         4.0          6.0        both
# E         NaN          7.0  right_only
# F         NaN          8.0  right_only
# 
# Validated Inner Merge:
#    value_left  value_right
# B           2            5
# D           4            6

48、pandas.merge_ordered函数

48-1、语法

# 48、pandas.merge_ordered函数
pandas.merge_ordered(left, right, on=None, left_on=None, right_on=None, left_by=None, right_by=None, fill_method=None, suffixes=('_x', '_y'), how='outer')
Perform a merge for ordered data with optional filling/interpolation.

Designed for ordered data like time series data. Optionally perform group-wise merge (see examples).

Parameters:
left
DataFrame or named Series
right
DataFrame or named Series
on
label or list
Field names to join on. Must be found in both DataFrames.

left_on
label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns.

right_on
label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per left_on docs.

left_by
column name or list of column names
Group left DataFrame by group columns and merge piece by piece with right DataFrame. Must be None if either left or right are a Series.

right_by
column name or list of column names
Group right DataFrame by group columns and merge piece by piece with left DataFrame. Must be None if either left or right are a Series.

fill_method
{‘ffill’, None}, default None
Interpolation method for data.

suffixes
list-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

how
{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘outer’
left: use only keys from left frame (SQL: left outer join)

right: use only keys from right frame (SQL: right outer join)

outer: use union of keys from both frames (SQL: full outer join)

inner: use intersection of keys from both frames (SQL: inner join).

Returns:
DataFrame
The merged DataFrame output type will be the same as ‘left’, if it is a subclass of DataFrame.

48-2、参数

48-2-1、left(必须)：左侧DataFrame对象。

48-2-2、right(必须)：右侧DataFrame对象。

48-2-3、on(可选，默认值为None)：字符串或字符串列表，用于连接的列名，必须是两个DataFrame共有的列名，如果提供了多个列名，则DataFrame会在这些列上共同进行合并。

48-2-4、left_on(可选，默认值为None)：左侧DataFrame中用作连接键的列名或列名列表。

48-2-5、right_on(可选，默认值为None)：右侧DataFrame中用作连接键的列名或列名列表。

48-2-6、left_by(可选，默认值为None)：对于左侧DataFrame，可以通过指定一列或多列来进行分组，然后在这些组内进行排序和合并，这对于按类别或其他准则进行排序的数据非常有用。

48-2-7、right_by(可选，默认值为None)：类似于left_by，但作用于右侧DataFrame对象。

48-2-8、fill_method(可选，默认值为None)：在进行连接后，此参数可以指定一种填充方法来处理那些在排序后产生的空缺值。例如，可以使用ffill来向前填充缺失值。

48-2-9、suffixes(可选，默认值为('_x', '_y'))：一个元组，用以指定当两个DataFrame中存在除连接键以外的同名列时，在左右DataFrame的列名后分别附加的后缀。

48-2-10、how(可选，默认值为'outer')：定义合并的方式，可选的参数包括outer，inner，left，right，即保留左右DataFrame中的所有键。

48-3、功能

用于在保持键值顺序的情况下，对数据帧进行连接操作，它常用于时间序列和其他需要考虑元素顺序的合并任务。

48-4、返回值

返回值是一个新的DataFrame，该DataFrame由输入的两个DataFrame按照指定的连接键和顺序合并而成。

48-5、说明

返回值的特点如下：

48-5-1、保持顺序：合并后的DataFrame会保持原始数据的顺序(通常是按时间或其他顺序列进行排序)，这对于时间序列数据尤其重要。

48-5-2、包含所有连接键：返回的DataFrame包含了所有连接键，除非使用了特定的how参数来限制返回的键(例如inner仅返回两个DataFrame共有的键)。

48-5-3、处理缺失值：如果在合并过程中有缺失值，返回的DataFrame会保留这些缺失值，除非使用了fill_method参数来填充这些缺失值。

48-5-4、带有后缀的重复列：如果两个DataFrame中存在同名列，返回的DataFrame会根据suffixes参数添加后缀，以区分来自不同DataFrame的列。

48-6、用法

48-6-1、数据准备

无

48-6-2、代码示例

# 48、pandas.merge_ordered函数
import pandas as pd
df1 = pd.DataFrame({
    'key': ['a', 'c', 'e', 'f'],
    'value1': [1, 3, 5, 7]
})
df2 = pd.DataFrame({
    'key': ['a', 'b', 'd', 'f'],
    'value2': [2, 4, 6, 8]
})
result = pd.merge_ordered(df1, df2, on='key', fill_method='ffill', suffixes=('_left', '_right'))
print(result)

48-6-3、结果输出

# 48、pandas.merge_ordered函数
#   key  value1  value2
# 0   a       1       2
# 1   b       1       4
# 2   c       3       4
# 3   d       3       6
# 4   e       5       6
# 5   f       7       8