From Pandas to Polars, Part 15: Polars' pivot is a powerful tool for feature engineering

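Recently in my machine learning pipelines I have found myself replacing some of the simpler scikit-learn metrics, such as root mean squared error, with Polars expressions I write myself. This saves copying the data into a different format and keeps the usual Polars advantages: parallelization, query optimization, and scaling to large datasets.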

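While looking into pivoting recently, I realized that the CountVectorizer method is essentially a pivot. I decided to see how much effort it would take to re-implement it in Polars.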

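For anyone unfamiliar with CountVectorizer, it is a feature engineering technique that produces a two-dimensional array where each column corresponds to a word and each row corresponds to a document. By default each cell holds the number of times the word occurs in the document; with binary=True the cell is 1 if the word is in the document and 0 otherwise. In this post I build the 0/1 version.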
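To make the target shape concrete, here is a small plain-Python sketch of a 0/1 document-term matrix. The two documents are made up for illustration, and this is not how CountVectorizer is implemented internally.

```python
# Two tiny made-up documents, purely for illustration.
docs = ["the cat sat", "the dog sat down"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per word: 1 if the word is present, else 0.
matrix = [[1 if word in doc.split() else 0 for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```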

Getting some fake data

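I needed some fake text data for this exercise, so I asked ChatGPT to generate a small dataset of fake news articles with publication names and titles. It gave me a truly fake dataset, with articles from The Daily Deception and the Faux News Network: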

import polars as pl

fake_news_df = pl.DataFrame(
    {
    'publication': [
        'The Daily Deception', 'Faux News Network', 'The Fabricator', 'The Misleader', 'The Hoax Herald', ],
    'title': [
        'Scientists Discover New Species of Flying Elephant', 
        'Aliens Land on Earth and Offer to Solve All Our Problems', 
        'Study Shows That Eating Pizza Every Day Leads to Longer Life', 
        'New Study Finds That Smoking is Good for You', 
        "World's Largest Iceberg Discovered in Florida"],
    'text': [
        'In a groundbreaking discovery, scientists have found a new species of elephant that can fly. The flying elephants, which were found in the Amazon rainforest, have wings that span over 50 feet and can reach speeds of up to 100 miles per hour. This is a game-changing discovery that could revolutionize the field of zoology.',

        'In a historic moment for humanity, aliens have landed on Earth and offered to solve all our problems. The extraterrestrial visitors, who arrived in a giant spaceship that landed in Central Park, have advanced technology that can cure disease, end hunger, and reverse climate change. The world is waiting to see how this incredible offer will play out.',

        'A new study has found that eating pizza every day can lead to a longer life. The study, which was conducted by a team of Italian researchers, looked at the eating habits of over 10,000 people and found that those who ate pizza regularly lived on average two years longer than those who didn\'t. The study has been hailed as a breakthrough in the field of nutrition.',

        'In a surprising twist, a new study has found that smoking is actually good for you. The study, which was conducted by a team of British researchers, looked at the health outcomes of over 100,000 people and found that those who smoked regularly had lower rates of heart disease and cancer than those who didn\'t. The findings have sparked controversy among health experts.',

        'In a bizarre turn of events, the world\'s largest iceberg has been discovered in Florida. The iceberg, which is over 100 miles long and 50 miles wide, was found off the coast of Miami by a group of tourists on a whale-watching tour. Scientists are baffled by the discovery and are scrambling to figure out how an iceberg of this size could have']
    }
)

Split, explode and pivot

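The first thing we need to do is convert the text to lowercase and split each article into separate words. We do this with expressions from the str namespace. We also add a column called placeholder with a value of 1. These 1s will later populate our feature matrix.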

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
)
shape: (5, 4)
┌─────────────────────┬───────────────────────────────┬──────────────────────────────┬─────────────┐
│ publication         ┆ title                         ┆ text                         ┆ placeholder │
│ ---                 ┆ ---                           ┆ ---                          ┆ ---         │
│ str                 ┆ str                           ┆ list[str]                    ┆ i32         │
╞═════════════════════╪═══════════════════════════════╪══════════════════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New       ┆ ["in", "a", … "zoology."]    ┆ 1           │
│                     ┆ Species …                     ┆                              ┆             │
│ Faux News Network   ┆ Aliens Land on Earth and      ┆ ["in", "a", … "out."]        ┆ 1           │
│                     ┆ Offer t…                      ┆                              ┆             │
│ The Fabricator      ┆ Study Shows That Eating Pizza ┆ ["a", "new", … "nutrition."] ┆ 1           │
│                     ┆ Ev…                           ┆                              ┆             │
│ The Misleader       ┆ New Study Finds That Smoking  ┆ ["in", "a", … "experts."]    ┆ 1           │
│                     ┆ is …                          ┆                              ┆             │
│ The Hoax Herald     ┆ World's Largest Iceberg       ┆ ["in", "a", … "have"]        ┆ 1           │
│                     ┆ Discover…                     ┆                              ┆             │
└─────────────────────┴───────────────────────────────┴──────────────────────────────┴─────────────┘

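By splitting the string values we convert the string column into a column with the Polars pl.List(str) dtype. In a previous post I showed how the pl.List type allows fast operations, because each row is a Polars Series under the hood rather than a slow Python list.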

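However, we still want to explode the pl.List column so that there is one row for each element of each list, while keeping the article metadata such as the publication name and title.

We do this by calling the explode method on the text column. The publication and title values are repeated on every new row.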

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
)
shape: (306, 4)
┌─────────────────────┬───────────────────────────────────┬────────────────┬─────────────┐
│ publication         ┆ title                             ┆ text           ┆ placeholder │
│ ---                 ┆ ---                               ┆ ---            ┆ ---         │
│ str                 ┆ str                               ┆ str            ┆ i32         │
╞═════════════════════╪═══════════════════════════════════╪════════════════╪═════════════╡
│ The Daily Deception ┆ Scientists Discover New Species … ┆ in             ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ a              ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ groundbreaking ┆ 1           │
│ The Daily Deception ┆ Scientists Discover New Species … ┆ discovery,     ┆ 1           │
│ …                   ┆ …                                 ┆ …              ┆ …           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ this           ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ size           ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ could          ┆ 1           │
│ The Hoax Herald     ┆ World's Largest Iceberg Discover… ┆ have           ┆ 1           │
└─────────────────────┴───────────────────────────────────┴────────────────┴─────────────┘

Note that explode works with Polars' streaming engine, so you can use it on larger-than-memory datasets.

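Now it is time to pivot the text column so that we have one column for each distinct word and one row for each article. We do this by calling pivot, with the metadata columns (publication and title) as the index for each row, the text column supplying the column names, and the placeholder column supplying the values.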

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
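    .pivot(
        index=["publication","title"],
        on="text",  # this parameter was named `columns` before Polars 1.0
        values="placeholder",
        aggregate_function="first",  # a word can occur more than once in an article
        sort_columns=True
    )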
)
shape: (5, 166)
┌─────────────────────┬────────────────────┬────────┬──────┬───┬─────────┬───────┬──────┬──────────┐
│ publication         ┆ title              ┆ 10,000 ┆ 100  ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ ---                 ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---     ┆ ---   ┆ ---  ┆ ---      │
│ str                 ┆ str                ┆ i32    ┆ i32  ┆   ┆ i32     ┆ i32   ┆ i32  ┆ i32      │
╞═════════════════════╪════════════════════╪════════╪══════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists         ┆ null   ┆ 1    ┆ … ┆ null    ┆ null  ┆ null ┆ 1        │
│                     ┆ Discover New       ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│                     ┆ Species …          ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ Faux News Network   ┆ Aliens Land on     ┆ null   ┆ null ┆ … ┆ null    ┆ null  ┆ null ┆ null     │
│                     ┆ Earth and Offer t… ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Fabricator      ┆ Study Shows That   ┆ 1      ┆ null ┆ … ┆ null    ┆ 1     ┆ null ┆ null     │
│                     ┆ Eating Pizza Ev…   ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Misleader       ┆ New Study Finds    ┆ null   ┆ null ┆ … ┆ null    ┆ null  ┆ 1    ┆ null     │
│                     ┆ That Smoking is …  ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
│ The Hoax Herald     ┆ World's Largest    ┆ null   ┆ 1    ┆ … ┆ 1       ┆ null  ┆ null ┆ null     │
│                     ┆ Iceberg Discover…  ┆        ┆      ┆   ┆         ┆       ┆      ┆          │
└─────────────────────┴────────────────────┴────────┴──────┴───┴─────────┴───────┴──────┴──────────┘

Note that we pass the sort_columns argument so that the new word columns are sorted alphabetically.

The last step is to replace the null values with 0, which makes explicit how we treat words that do not appear in an article.

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.split(" "),
        pl.lit(1).alias("placeholder")
    )
    .explode("text")
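    .pivot(
        index=["publication","title"],
        on="text",  # this parameter was named `columns` before Polars 1.0
        values="placeholder",
        aggregate_function="first",  # a word can occur more than once in an article
        sort_columns=True
    )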
    .fill_null(value=0)
)
shape: (5, 166)
┌─────────────────────┬─────────────────────┬────────┬─────┬───┬─────────┬───────┬──────┬──────────┐
│ publication         ┆ title               ┆ 10,000 ┆ 100 ┆ … ┆ world's ┆ years ┆ you. ┆ zoology. │
│ ---                 ┆ ---                 ┆ ---    ┆ --- ┆   ┆ ---     ┆ ---   ┆ ---  ┆ ---      │
│ str                 ┆ str                 ┆ i32    ┆ i32 ┆   ┆ i32     ┆ i32   ┆ i32  ┆ i32      │
╞═════════════════════╪═════════════════════╪════════╪═════╪═══╪═════════╪═══════╪══════╪══════════╡
│ The Daily Deception ┆ Scientists Discover ┆ 0      ┆ 1   ┆ … ┆ 0       ┆ 0     ┆ 0    ┆ 1        │
│                     ┆ New Species …       ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ Faux News Network   ┆ Aliens Land on      ┆ 0      ┆ 0   ┆ … ┆ 0       ┆ 0     ┆ 0    ┆ 0        │
│                     ┆ Earth and Offer t…  ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Fabricator      ┆ Study Shows That    ┆ 1      ┆ 0   ┆ … ┆ 0       ┆ 1     ┆ 0    ┆ 0        │
│                     ┆ Eating Pizza Ev…    ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Misleader       ┆ New Study Finds     ┆ 0      ┆ 0   ┆ … ┆ 0       ┆ 0     ┆ 1    ┆ 0        │
│                     ┆ That Smoking is …   ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
│ The Hoax Herald     ┆ World's Largest     ┆ 0      ┆ 1   ┆ … ┆ 1       ┆ 0     ┆ 0    ┆ 0        │
│                     ┆ Iceberg Discover…   ┆        ┆     ┆   ┆         ┆       ┆      ┆          │
└─────────────────────┴─────────────────────┴────────┴─────┴───┴─────────┴───────┴──────┴──────────┘

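Of course, there are still some differences from the output of CountVectorizer. For example, CountVectorizer returns a sparse matrix by default. CountVectorizer also uses a more sophisticated regular expression to separate words, but we can reproduce this by using str.extract_all instead of str.split.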

(
    fake_news_df
    .with_columns(
        pl.col("text").str.to_lowercase().str.extract_all('(?u)\\b\\w\\w+\\b'),
        pl.lit(1).alias("placeholder")
    )
)

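So here we have seen how to quickly implement a classic NLP (natural language processing) feature engineering method with Polars. I am sure that in the coming years we will see many more examples of Polars as an all-round data workhorse.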
