How to Set Up NSFW Content Detection with Machine Learning

Teaching a machine to recognize indecent content wasn’t difficult in retrospect, but it sure was tough the first time through.

Here are some lessons learned, and some tips and tricks I uncovered while building an NSFW model.

Though there are lots of ways this could have been implemented, the hope of this post is to provide a friendly narrative so that others can understand what this process can look like.

If you’re new to ML, this will inspire you to train a model. If you’re familiar with it, I’d love to hear how you would have gone about building this model and ask you to share your code.

The Plan:

  1. Get lots and lots of data
  2. Label and clean the data
  3. Use Keras and transfer learning
  4. Refine your model

Get lots and lots of data

Fortunately, a really cool set of scraping scripts was released for an NSFW dataset. The code is simple and already comes with labeled data categories. This means that just accepting this data scraper’s defaults will give us 5 categories pulled from hundreds of subreddits.

The instructions are quite simple: you can just run the 6 friendly scripts. Pay attention to them, as you may decide to change things up.

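To make the order explicit, here is a minimal sketch of driving those steps from Python. The numbered shell-script layout under nsfw_data_scraper/scripts is an assumption based on the repo’s step-by-step instructions, so check the actual file names before running anything like this.

    import subprocess
    from pathlib import Path

    # Assumption: the scraper's steps are numbered shell scripts in scripts/.
    # Verify the real names in the repo before running.
    scripts_dir = Path("nsfw_data_scraper/scripts")
    for script in sorted(scripts_dir.glob("[1-6]_*.sh")):
        print(f"Running {script.name} ...")
        subprocess.run(["bash", str(script)], check=True)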

If you have more subreddits that you’d like to add, you should edit the source URLs before running step 1.

E.g. — If you were to add a new source of neutral examples, you’d add to the subreddit list in nsfw_data_scraper/scripts/source_urls/neutral.txt.

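As a minimal sketch, that edit can also be done from Python; r/pics is only a stand-in here, and the exact URL format should match the lines already in the file.

    # Append an extra neutral source before running step 1.
    # "pics" is an illustrative subreddit; match the format of existing entries.
    urls_file = "nsfw_data_scraper/scripts/source_urls/neutral.txt"
    with open(urls_file, "a") as f:
        f.write("https://www.reddit.com/r/pics\n")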

Reddit is a great resource of content around the web, since most subreddits are slightly policed by humans to be on target for that subreddit.

Label and clean the data

The data we got from the NSFW data scraper is already labeled! But expect some errors, especially since Reddit isn’t perfectly curated.

Duplication is also quite common, but fixable without slow human comparison.

The first thing I like to run is duplicate-file-finder, which is the fastest exact file match and deleter. It’s powered by Python.

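duplicate-file-finder has its own command line interface, so the snippet below is not its API; it is just a sketch of the same idea, hashing every file and deleting later byte-for-byte copies (the data directory path is an assumption).

    import hashlib
    import os

    # Sketch of exact-duplicate removal: keep the first file seen for each
    # content hash and delete any later byte-for-byte copy.
    def remove_exact_duplicates(root):
        seen = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                if digest in seen:
                    os.remove(path)
                else:
                    seen[digest] = path

    # Assumed download location; point this at wherever the scraper saved images.
    remove_exact_duplicates("nsfw_data_scraper/data")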
