From Pandas to Polars, Part 15: Polars pivot is a powerful tool for feature engineering






import polars as pl

# Five made-up articles: publication name, headline and body text
fake_news_df = pl.DataFrame({
    'publication': [
        'The Daily Deception', 'Faux News Network', 'The Fabricator', 'The Misleader', 'The Hoax Herald', ],
    'title': [
        'Scientists Discover New Species of Flying Elephant',
        'Aliens Land on Earth and Offer to Solve All Our Problems',
        'Study Shows That Eating Pizza Every Day Leads to Longer Life',
        'New Study Finds That Smoking is Good for You',
        "World's Largest Iceberg Discovered in Florida"],
    'text': [
        'In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.',
        'In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.',
        'A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn\'t. The study has been hailed as a breakthrough in the field of nutrition.',
        'In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn\'t. The findings have sparked controversy among health experts.',
        'In a bizarre turn of events, the world\'s largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have']
})



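(
    fake_news_df
    .with_columns(
        # lower-case the text and split it on spaces into a list of words
        pl.col("text").str.to_lowercase().str.split(" "),
        # constant column of 1s, used later as the pivot values
        pl.lit(1).alias("placeholder"),
    )
)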
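shape: (5, 4)
┌─────────────────────┬───────────────────────────────┬──────────────────────────────┬─────────────┐
│ publication         ┆ title                         ┆ text                         ┆ placeholder │
│ ---                 ┆ ---                           ┆ ---                          ┆ ---         │
│ str                 ┆ str                           ┆ list[str]                    ┆ i32         │
╞═════════════════════╪═══════════════════════════════╪══════════════════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New       ┆ ["in", "a", … "zoology."]    ┆ 1           │
│                     ┆ Species …                     ┆                              ┆             │
│ Faux News Network   ┆ Aliens Land on Earth and      ┆ ["in", "a", … "out."]        ┆ 1           │
│                     ┆ Offer t…                      ┆                              ┆             │
│ The Fabricator      ┆ Study Shows That Eating Pizza ┆ ["a", "new", … "nutrition."] ┆ 1           │
│                     ┆ Ev…                           ┆                              ┆             │
│ The Misleader       ┆ New Study Finds That Smoking  ┆ ["in", "a", … "experts."]    ┆ 1           │
│                     ┆ is …                          ┆                              ┆             │
│ The Hoax Herald     ┆ World's Largest Iceberg       ┆ ["in", "a", … "have"]        ┆ 1           │
│                     ┆ Discover…                     ┆                              ┆             │
└─────────────────────┴───────────────────────────────┴──────────────────────────────┴─────────────┘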

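By splitting the string values, we turn the string column into a column with the Polars pl.List(str) dtype. In an earlier post I showed how the pl.List type allows fast operations, because under the hood each row is a Polars Series rather than a slow Python list.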

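Even so, it is better here to explode the pl.List column so that every element of every list gets its own row. At the same time, we want to keep the metadata of the original article, such as the publication name and the title.

We do this by calling explode on the text column; the publication and title values are repeated for every word.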

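(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder"),
    )
    # one row per word; publication and title are repeated
    .explode("text")
)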
shape: (306, 4)
┌─────────────────────┬───────────────────────────────────┬────────────────┬─────────────┐
│ publication         ┆ title                             ┆ text           ┆ placeholder │
│ ---                 ┆ ---                               ┆ ---            ┆ ---         │
│ str                 ┆ str                               ┆ str            ┆ i32         │
╞═════════════════════╪═══════════════════════════════════╪════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New Species … ┆ in             ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ a              ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ groundbreaking ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ discovery,     ┆ 1           │
│ …                   ┆ …                                 ┆ …              ┆ …           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ this           ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ size           ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ could          ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ have           ┆ 1           │
└─────────────────────┴───────────────────────────────────┴────────────────┴─────────────┘



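(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder"),
    )
    .explode("text")
    # one row per article, one column per word ("on" is called "columns" before Polars 1.0)
    .pivot(
        on="text",
        index=["publication", "title"],
        values="placeholder",
        # words repeat within an article, so an aggregate is needed; "first" is assumed here
        aggregate_function="first",
        sort_columns=True,
    )
)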
shape: (5, 166)
┌─────────────────────┬────────────────────┬────────┬──────┬───┬─────────┬───────┬──────┬──────────┐
│ publication         ┆ title              ┆ 10,000 ┆ 100  ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ ---                 ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---     ┆ ---   ┆ ---  ┆ ---      │
│ str                 ┆ str                ┆ i32    ┆ i32  ┆   ┆ i32     ┆ i32   ┆ i32  ┆ i32      │
╞═════════════════════╪════════════════════╪════════╪══════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists         ┆ null   ┆ 1    ┆ … ┆ null    ┆ null  ┆ null ┆ 1        │
│                     ┆ Discover New       ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│                     ┆ Species …          ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ Faux News Network   ┆ Aliens Land on     ┆ null   ┆ null ┆ … ┆ null    ┆ null  ┆ null ┆ null     │
│                     ┆ Earth and Offer t… ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Fabricator      ┆ Study Shows That   ┆ 1      ┆ null ┆ … ┆ null    ┆ 1     ┆ null ┆ null     │
│                     ┆ Eating Pizza Ev…   ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Misleader       ┆ New Study Finds    ┆ null   ┆ null ┆ … ┆ null    ┆ null  ┆ 1    ┆ null     │
│                     ┆ That Smoking is …  ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Hoax Herald     ┆ World's Largest    ┆ null   ┆ 1    ┆ … ┆ 1       ┆ null  ┆ null ┆ null     │
│                     ┆ Iceberg Discover…  ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
└─────────────────────┴────────────────────┴────────┴──────┴───┴─────────┴───────┴──────┴──────────┘



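(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder"),
    )
    .explode("text")
    # "on" is called "columns" before Polars 1.0; the "first" aggregate is assumed here
    .pivot(
        on="text",
        index=["publication", "title"],
        values="placeholder",
        aggregate_function="first",
        sort_columns=True,
    )
    # a word that is absent from an article is null after the pivot; encode it as 0
    .fill_null(0)
)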
shape: (5, 166)
┌─────────────────────┬─────────────────────┬────────┬─────┬───┬─────────┬───────┬──────┬──────────┐
│ publication         ┆ title               ┆ 10,000 ┆ 100 ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ ---                 ┆ ---                 ┆ ---    ┆ --- ┆   ┆ ---     ┆ ---   ┆ ---  ┆ ---      │
│ str                 ┆ str                 ┆ i32    ┆ i32 ┆   ┆ i32     ┆ i32   ┆ i32  ┆ i32      │
╞═════════════════════╪═════════════════════╪════════╪═════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists Discover ┆ 0      ┆ 1   ┆ … ┆ 0       ┆ 0     ┆ 0    ┆ 1        │
│                     ┆ New Species …       ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ Faux News Network   ┆ Aliens Land on      ┆ 0      ┆ 0   ┆ … ┆ 0       ┆ 0     ┆ 0    ┆ 0        │
│                     ┆ Earth and Offer t…  ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Fabricator      ┆ Study Shows That    ┆ 1      ┆ 0   ┆ … ┆ 0       ┆ 1     ┆ 0    ┆ 0        │
│                     ┆ Eating Pizza Ev…    ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Misleader       ┆ New Study Finds     ┆ 0      ┆ 0   ┆ … ┆ 0       ┆ 0     ┆ 1    ┆ 0        │
│                     ┆ That Smoking is …   ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Hoax Herald     ┆ World's Largest     ┆ 0      ┆ 1   ┆ … ┆ 1       ┆ 0     ┆ 0    ┆ 0        │
│                     ┆ Iceberg Discover…   ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
└─────────────────────┴─────────────────────┴────────┴─────┴───┴─────────┴───────┴──────┴──────────┘





