NSFW
Teaching a machine to recognize indecent content wasn’t difficult in retrospect, but it sure was tough the first time through.
Here are some lessons learned, and some tips and tricks I uncovered while building an NSFW model.
Though there are lots of ways this could have been implemented, the hope of this post is to provide a friendly narrative so that others can understand what this process can look like.
If you’re new to ML, this will inspire you to train a model. If you’re familiar with it, I’d love to hear how you would have gone about building this model and ask you to share your code.
The Plan:
- Get lots and lots of data
- Label and clean the data
- Use Keras and transfer learning
- Refine your model
Get lots and lots of data
Fortunately, a really cool set of scraping scripts was released for an NSFW dataset. The code is simple and already comes with labeled data categories. This means that just accepting the data scraper’s defaults will give us 5 categories pulled from hundreds of subreddits.
The instructions are quite simple: just run the 6 friendly scripts. Pay attention to them, as you may decide to change things up.
If you have more subreddits that you’d like to add, you should edit the source URLs before running step 1.
E.g. — if you were to add a new source of neutral examples, you’d add to the subreddit list in nsfw_data_scraper/scripts/source_urls/neutral.txt.
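In practice that edit is just appending a URL to the text file. A minimal sketch (the subreddit chosen here is purely illustrative, and the mkdir only makes the snippet standalone — in the real repo the directory already exists):

```shell
# Ensure the path exists so this sketch runs standalone
mkdir -p nsfw_data_scraper/scripts/source_urls

# Hypothetical example: add another subreddit as a neutral source
echo "https://www.reddit.com/r/EarthPorn/" >> nsfw_data_scraper/scripts/source_urls/neutral.txt
```

After this, running step 1 of the scraper would pick up the new source along with the defaults.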
Reddit is a great source of content from around the web, since most subreddits are lightly policed by humans to stay on topic for that subreddit.
Label and clean the data
The data we got from the NSFW data scraper is already labeled! But expect some errors, especially since Reddit isn’t perfectly curated.
Duplication is also quite common, but fixable without slow human comparison.
The first thing I like to run is duplicate-file-finder, which is the fastest exact file matcher and deleter. It’s written in Python.
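The exact-duplicate pass above can be sketched with a simple content-hash scan — a minimal stand-in for what a tool like duplicate-file-finder does, not its actual implementation:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under `root` by content hash.

    Any group with more than one path is a set of byte-identical files;
    you could keep the first and delete the rest.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Hashing every file is enough for exact duplicates; near-duplicates (resized or re-encoded images) would need perceptual hashing, which is a separate step.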