A Tour of Cool Python Libraries - Third-Party Library Pandas (016)

39. pandas.DataFrame.to_stata function

39-1. Syntax

# 39. pandas.DataFrame.to_stata function
DataFrame.to_stata(path, *, convert_dates=None, write_index=True, byteorder=None, time_stamp=None, data_label=None, variable_labels=None, version=114, convert_strl=None, compression='infer', storage_options=None, value_labels=None)
Export DataFrame object to Stata dta format.

Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.

Parameters:
path : str, path object, or buffer
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function.

convert_dates : dict
Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.

write_index : bool
Write the index to Stata dataset.

byteorder : str
Can be “>”, “<”, “little”, or “big”. default is sys.byteorder.

time_stamp : datetime
A datetime to use as file creation date. Default is the current time.

data_label : str, optional
A label for the data set. Must be 80 characters or smaller.

variable_labels : dict
Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.

version : {114, 117, 118, 119, None}, default 114
Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. Version 114 can be read by Stata 10 and later. Version 117 can be read by Stata 13 or later. Version 118 is supported in Stata 14 and later. Version 119 is supported in Stata 15 and later. Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and version 119 supports more than 32,767 variables.

Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read version 119 files.

convert_strl : list, optional
List of column names to convert to string columns to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.

compression : str or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

New in version 1.5.0: Added support for .tar files.

Changed in version 1.4.0: Zstandard support.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; further examples on storage options are in the pandas I/O user guide.

value_labels : dict of dicts
Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.

New in version 1.4.0.

Raises:
NotImplementedError
If datetimes contain timezone information

Column dtype is not representable in Stata

ValueError
Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime

Column listed in convert_dates is not in DataFrame

Categorical label contains more than 32,000 characters
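
The error cases listed above can be reproduced with a small sketch. The helper export_error and the sample frames are hypothetical names made up for this illustration, and the behaviour shown is what I would expect from a recent pandas release rather than a guarantee for every version.

```python
import io

import pandas as pd


def export_error(df, **kwargs):
    """Return the name of the exception raised by to_stata, or None on success."""
    try:
        # Write to an in-memory buffer so no file is left behind.
        df.to_stata(io.BytesIO(), **kwargs)
    except (NotImplementedError, ValueError) as exc:
        return type(exc).__name__
    return None


# Timezone-aware datetimes are not supported by the writer.
tz_df = pd.DataFrame({"when": pd.date_range("2024-01-01", periods=2, tz="UTC")})
tz_error = export_error(tz_df)

# complex128 has no Stata equivalent.
complex_df = pd.DataFrame({"z": [1 + 2j, 3 + 4j]})
complex_error = export_error(complex_df)

# convert_dates names a column that is not in the DataFrame.
plain_df = pd.DataFrame({"x": [1, 2]})
missing_col_error = export_error(plain_df, convert_dates={"no_such_column": "td"})

print(tz_error, complex_error, missing_col_error)
```

Catching both exception types in one helper keeps the sketch short; in real code it is usually clearer to handle each case where it can actually occur.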

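39-2. Parameters

39-2-1. path (required): The path of the file to write (including the file name), a path object, or a file-like object with a binary write() method.

39-2-2. convert_dates (optional, default None): A dictionary that specifies which datetime columns are written in which Stata date format. Keys are column names or integer positions, and values are Stata formats ('tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'; for example, 'tc' is a datetime and 'td' is a date). Datetime columns without an entry are written as 'tc'. A column name that is not in the DataFrame raises ValueError rather than being ignored.

39-2-3. write_index (optional, default True): Whether to write the DataFrame index to the Stata file as a column. If False, the index is not written.

39-2-4. byteorder (optional, default None): The byte order used when writing the file, big-endian or little-endian. With None, pandas uses sys.byteorder, the byte order of the machine running the code (little-endian on most current hardware). Set it explicitly only if the file must be read on a system that needs a specific byte order.

39-2-5. time_stamp (optional, default None): A datetime stored as the file creation date; the default is the current time. It does not change the data itself.

39-2-6. data_label (optional, default None): The dataset label, a short descriptive string of at most 80 characters that identifies the dataset in Stata.

39-2-7. variable_labels (optional, default None): A dictionary of variable labels for the columns. Keys are column names and values are descriptive strings of at most 80 characters each.

39-2-8. version (optional, default 114): The dta format version of the output file: 114, 117, 118, 119, or None. Version 114 can be read by Stata 10 and later, 117 by Stata 13 and later, 118 by Stata 14 and later, and 119 by Stata 15 and later. The versions differ in what they support: 114 limits string variables to 244 characters, and only 118 and 119 support Unicode.

39-2-9. convert_strl (optional, default None): A list of names of string columns to store in Stata's StrL format. StrL is not available in format 114, so this parameter requires a newer version value. According to the pandas documentation, StrL can produce smaller files when strings have more than 8 characters and values are repeated.

39-2-10. compression (optional, default 'infer'): The compression method. 'infer' picks the method from the file extension of path ('.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz', or '.tar.bz2'; any other extension means no compression). A method such as 'zip' or 'xz' can also be given explicitly, either as a string or as a dict with a 'method' key. None disables compression.

39-2-11. storage_options (optional, default None): Extra options for a storage connection, such as account credentials. This is typically used with remote or cloud storage (S3, GCS, HDFS, and so on); for the local file system or ordinary file I/O it is normally not needed.

39-2-12. value_labels (optional, default None): A dictionary of value labels for columns. Keys are column names, and each value is a dictionary that maps the column's values to labels. This is useful for producing labelled variables that are easy to read in Stata.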
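
As a minimal sketch of the value_labels parameter, the snippet below writes integer codes together with their labels and then reads them back. The column name satisfaction and the label texts are invented for illustration, and the round trip assumes pandas 1.4.0 or later, where value_labels was introduced.

```python
import io

import pandas as pd

# Hypothetical survey data: integer codes that should carry text labels in Stata.
df = pd.DataFrame({"satisfaction": [1, 2, 1, 3]})
labels = {"satisfaction": {1: "low", 2: "medium", 3: "high"}}

buffer = io.BytesIO()
df.to_stata(buffer, write_index=False, value_labels=labels)
payload = buffer.getvalue()

# Inspect the value labels stored in the file through a StataReader.
with pd.read_stata(io.BytesIO(payload), iterator=True) as reader:
    stored_labels = reader.value_labels()

# With convert_categoricals=True (the default) the codes come back as categories.
restored = pd.read_stata(io.BytesIO(payload))

print(stored_labels)
print(restored["satisfaction"].tolist())
```

Writing to an in-memory buffer keeps the sketch free of temporary files; a file path works the same way.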

39-3. Purpose

Saves a pandas DataFrame to a file in Stata's .dta format.

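39-4. Return value

The function returns nothing (that is, its return value is None). Its job is to write the contents of the DataFrame to the specified .dta file, not to hand an object or value back to Python.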
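
A short sketch makes the point about the return value concrete. It writes to an in-memory buffer instead of a file, which the path parameter allows for any object with a binary write() method; the column name x is arbitrary.

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [1.5, 2.5, 3.5]})

# Any object with a binary write() method can stand in for a file path.
buffer = io.BytesIO()
result = df.to_stata(buffer, write_index=False)

# to_stata returns None; the data ends up in the target, here the buffer.
payload = buffer.getvalue()
print(result, len(payload) > 0)

# Reading the bytes back recovers the original values.
restored = pd.read_stata(io.BytesIO(payload))
print(restored["x"].tolist())
```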

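39-5. Notes

Stata is a widely used statistical software package, and .dta is its proprietary data format for storing datasets. With this function, users can save the data in a pandas DataFrame in a file format that Stata can read and process directly.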
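
Because .dta exists in several format versions, the choice of version affects what can be stored. The sketch below, with a made-up column name note, illustrates the 244-character string limit of format 114 and shows that the same frame is accepted by format 117. It assumes the writer rejects over-long strings with ValueError, which is what I would expect from current pandas.

```python
import io

import pandas as pd

# One string of 300 characters, longer than the 244-character limit of format 114.
long_text = "x" * 300
df = pd.DataFrame({"note": pd.Series([long_text], dtype=object)})

try:
    df.to_stata(io.BytesIO(), write_index=False, version=114)
    v114_ok = True
except ValueError:
    v114_ok = False

# Format 117 allows much longer strings, so the same frame is written without error.
buffer = io.BytesIO()
df.to_stata(buffer, write_index=False, version=117)
restored = pd.read_stata(io.BytesIO(buffer.getvalue()))

print(v114_ok, len(restored.loc[0, "note"]))
```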

39-6. Usage

39-6-1. Data preparation

None

39-6-2. Code example

# 39. pandas.DataFrame.to_stata function
import pandas as pd
# Create a sample DataFrame
data = {
    'name': ['John', 'Anna', 'Peter', 'Linda'],
    'age': [28, 34, 29, 32],
    'date_of_birth': pd.to_datetime(['1992-01-01', '1988-02-15', '1991-07-23', '1989-10-10']),
    'city': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
# Set the variable labels
variable_labels = {
    'name': 'Person Name',
    'age': 'Age in Years',
    'date_of_birth': 'Date of Birth',
    'city': 'City of Residence'
}
# Set the data label
data_label = 'Demo Dataset for Pandas to Stata Conversion'
# Save the DataFrame to a Stata file
# Format version 114 is used here; it can be read by Stata 10 and later and limits string variables to 244 characters
# The call also converts the date column, skips the index, and adds variable and data labels
df.to_stata('example.dta',
            convert_dates={'date_of_birth': 'td'},  # Store 'date_of_birth' in Stata's daily date format
            write_index=False,  # Do not write the index to the Stata file
            variable_labels=variable_labels,  # Add the variable labels
            data_label=data_label,  # Add the data label
            version=114)  # dta format version of the output file
print("DataFrame has been successfully saved to Stata file.")

39-6-3. Output

# DataFrame has been successfully saved to Stata file.
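
The printed message only confirms that the call finished. To check that the labels really reached the file, one option is to open it with a StataReader and read the metadata back. The sketch below uses a smaller frame and a temporary file with a made-up name so that it stays self-contained.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"age": [28, 34, 29, 32]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels_demo.dta")  # hypothetical file name
    df.to_stata(
        path,
        write_index=False,
        data_label="Demo Dataset",
        variable_labels={"age": "Age in Years"},
    )
    # StataReader exposes the metadata stored next to the data.
    with pd.read_stata(path, iterator=True) as reader:
        stored_data_label = reader.data_label
        stored_variable_labels = reader.variable_labels()

print(stored_data_label)
print(stored_variable_labels)
```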

40. pandas.read_stata function

40-1. Syntax

# 40. pandas.read_stata function
pandas.read_stata(filepath_or_buffer, *, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)
Read Stata file into DataFrame.

Parameters:
filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.dta.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

convert_dates : bool, default True
Convert date variables to DataFrame time values.

convert_categoricals : bool, default True
Read value labels and convert columns to Categorical/Factor variables.

index_col : str, optional
Column to set as index.

convert_missing : bool, default False
Flag indicating whether to convert missing values to their Stata representations. If False, missing values are replaced with nan. If True, columns containing missing values are returned with object data types and missing values are represented by StataMissingValue objects.

preserve_dtypes : bool, default True
Preserve Stata datatypes. If False, numeric data are upcast to pandas default types for foreign data (float64 or int64).

columns : list or None
Columns to retain. Columns will be returned in the given order. None returns all columns.

order_categoricals : bool, default True
Flag indicating whether converted categorical data are ordered.

chunksize : int, default None
Return StataReader object for iterations, returns chunks with given number of lines.

iterator : bool, default False
Return StataReader object.

compression : str or dict, default ‘infer’
For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

New in version 1.5.0: Added support for .tar files.

storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details; further examples on storage options are in the pandas I/O user guide.

Returns:
DataFrame or pandas.api.typing.StataReader.
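
The compression behaviour described above can be tried with a small round trip. This is only a sketch: the file name demo.dta.gz is made up, and the check on the first two bytes relies on the standard gzip magic number.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.dta.gz")  # hypothetical file name
    # compression='infer' (the default) picks gzip from the '.gz' extension.
    df.to_stata(path, write_index=False)

    # gzip files start with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        magic = handle.read(2)

    # The reader infers the same compression when loading the file.
    restored = pd.read_stata(path)

print(magic == b"\x1f\x8b", restored["x"].tolist())
```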

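40-2. Parameters

40-2-1. filepath_or_buffer (required): A string path, a path object, or any object that implements a read() method (such as a file handle or BytesIO). This is the path or file object of the .dta file to read.

40-2-2. convert_dates (optional, default True): Boolean. If True, Stata date variables are converted to pandas datetime values, which is useful when the data contains Stata dates or datetimes.

40-2-3. convert_categoricals (optional, default True): Boolean. If True, Stata value labels are read and the labelled columns are converted to the pandas Categorical dtype.

40-2-4. index_col (optional, default None): String. The name of the column to use as the row index of the DataFrame.

40-2-5. convert_missing (optional, default False): Boolean. If False, Stata missing values (such as .) are replaced with NaN. If True, columns that contain missing values are returned with object dtype and the missing values are represented by StataMissingValue objects, which keeps the distinction between Stata's different missing-value codes.

40-2-6. preserve_dtypes (optional, default True): Boolean. If True, the Stata data types are kept. If False, numeric data are upcast to the pandas default types for foreign data (float64 or int64).

40-2-7. columns (optional, default None): A list of the column names to keep; the columns are returned in the given order. None reads all columns.

40-2-8. order_categoricals (optional, default True): Boolean. Controls whether the converted categorical data are ordered.

40-2-9. chunksize (optional, default None): Integer. If set, a StataReader object is returned that yields DataFrames of chunksize rows each, which saves memory when processing large files.

40-2-10. iterator (optional, default False): Boolean. If True, a StataReader object is returned instead of a DataFrame; it can be used to read the file in pieces, for example together with chunksize.

40-2-11. compression (optional, default 'infer'): String, dict, or None. The compression type of the file, such as 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'tar'. With 'infer', the compression is detected from the file extension when filepath_or_buffer is path-like. None disables decompression.

40-2-12. storage_options (optional, default None): Dictionary. For files kept in cloud storage services such as Google Cloud Storage or Amazon S3, this parameter passes the extra options needed to access them.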
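
Of these parameters, convert_missing is the easiest to misread, so here is a minimal sketch of both settings. The column name score is invented, and StataMissingValue is imported from pandas.io.stata, where it lives in the pandas versions I am aware of.

```python
import io

import numpy as np
import pandas as pd
from pandas.io.stata import StataMissingValue

df = pd.DataFrame({"score": [1.5, np.nan, 3.0]})
buffer = io.BytesIO()
df.to_stata(buffer, write_index=False)
payload = buffer.getvalue()

# Default: Stata missing values come back as NaN in a float column.
as_nan = pd.read_stata(io.BytesIO(payload))

# convert_missing=True: the column becomes object dtype and keeps
# the missing value as a StataMissingValue instance.
as_objects = pd.read_stata(io.BytesIO(payload), convert_missing=True)

print(as_nan["score"].isna().tolist())
print(as_objects["score"].dtype, type(as_objects["score"].iloc[1]).__name__)
```

The default is usually what analysis code wants; convert_missing=True mainly helps when the specific Stata missing-value code has to be preserved.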

40-3. Purpose

Reads a file in Stata's .dta format into a pandas DataFrame.

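40-4. Return value

Returns a pandas DataFrame that contains the data read from the .dta file. When iterator=True or chunksize is set, a StataReader object is returned instead.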
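
The two possible return types can be seen side by side in the following sketch, which writes ten rows to memory and then reads them once in full and once in chunks of four. The chunk size and column name are arbitrary choices for illustration.

```python
import io

import pandas as pd
from pandas.io.stata import StataReader

df = pd.DataFrame({"x": list(range(10))})
buffer = io.BytesIO()
df.to_stata(buffer, write_index=False)
payload = buffer.getvalue()

# Default call: the whole file is loaded into a DataFrame.
whole = pd.read_stata(io.BytesIO(payload))

# With chunksize set, a StataReader is returned and yields DataFrames of up to 4 rows.
with pd.read_stata(io.BytesIO(payload), chunksize=4) as reader:
    is_reader = isinstance(reader, StataReader)
    chunk_lengths = [len(chunk) for chunk in reader]

print(type(whole).__name__, is_reader, chunk_lengths)
```

Using the reader as a context manager ensures the underlying file is closed once iteration is done.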

40-5. Notes

None

40-6. Usage

40-6-1. Data preparation

# Create the .dta file with pandas.DataFrame.to_stata
import pandas as pd
# Create a sample DataFrame
data = {
    'name': ['John', 'Anna', 'Peter', 'Linda'],
    'age': [28, 34, 29, 32],
    'date_of_birth': pd.to_datetime(['1992-01-01', '1988-02-15', '1991-07-23', '1989-10-10']),
    'city': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
# Set the variable labels
variable_labels = {
    'name': 'Person Name',
    'age': 'Age in Years',
    'date_of_birth': 'Date of Birth',
    'city': 'City of Residence'
}
# Set the data label
data_label = 'Demo Dataset for Pandas to Stata Conversion'
# Save the DataFrame to a Stata file
# Format version 114 is used here; it can be read by Stata 10 and later and limits string variables to 244 characters
# The call also converts the date column, skips the index, and adds variable and data labels
df.to_stata('example.dta',
            convert_dates={'date_of_birth': 'td'},  # Store 'date_of_birth' in Stata's daily date format
            write_index=False,  # Do not write the index to the Stata file
            variable_labels=variable_labels,  # Add the variable labels
            data_label=data_label,  # Add the data label
            version=114)  # dta format version of the output file
print("DataFrame has been successfully saved to Stata file.")

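40-6-2. Code example

# 40. pandas.read_stata function
import pandas as pd
# Path of the .dta file
file_path = 'example.dta'
# Read the file with pandas' read_stata function
df = pd.read_stata(file_path)
# Show the first few rows of the DataFrame to confirm the data was read correctly
print(df.head())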

40-6-3. Output

# 40. pandas.read_stata function
#     name  age date_of_birth      city
# 0   John   28    1992-01-01  New York
# 1   Anna   34    1988-02-15     Paris
# 2  Peter   29    1991-07-23    Berlin
# 3  Linda   32    1989-10-10    London
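
The printed table hides one detail: the numeric dtypes after a round trip may differ from the original frame, because the writer stores integers in a Stata type that is narrower than int64 when the values fit. The sketch below, using made-up columns age and height, compares the default read with preserve_dtypes=False and also shows columns selecting a subset. The exact narrow dtype is an assumption about current pandas behaviour, so the code prints it instead of stating it.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "age": pd.Series([28, 34, 29, 32], dtype="int64"),
    "height": [1.80, 1.65, 1.75, 1.70],
})
buffer = io.BytesIO()
df.to_stata(buffer, write_index=False)
payload = buffer.getvalue()

# Default: the dtype stored in the file is kept.
default_read = pd.read_stata(io.BytesIO(payload))

# preserve_dtypes=False: numeric columns are upcast to int64 or float64.
upcast_read = pd.read_stata(io.BytesIO(payload), preserve_dtypes=False)

# columns keeps only the listed columns, in the given order.
subset = pd.read_stata(io.BytesIO(payload), columns=["height"])

print(default_read["age"].dtype, upcast_read["age"].dtype, list(subset.columns))
```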