NSFW
Teaching a machine to recognize indecent content wasn’t difficult in retrospect, but it sure was tough the first time through.
Here are some lessons learned, and some tips and tricks I uncovered while building an NSFW model.
Though there are lots of ways this could have been implemented, the hope of this post is to provide a friendly narrative so that others can understand what this process can look like.
If you’re new to ML, this will inspire you to train a model. If you’re familiar with it, I’d love to hear how you would have gone about building this model and ask you to share your code.
The Plan:
- Get lots and lots of data
- Label and clean the data
- Use Keras and transfer learning
- Refine your model
Get lots and lots of data
Fortunately, a really cool set of scraping scripts was released for an NSFW dataset. The code is simple and already comes with labeled data categories. This means that just accepting the data scraper’s defaults will give us 5 categories pulled from hundreds of subreddits.
The instructions are quite simple: just run the 6 friendly scripts. Pay attention to them, as you may decide to change things up.
If you have more subreddits that you’d like to add, you should edit the source URLs before running step 1.
E.g. — if you were to add a new source of neutral examples, you’d add to the subreddit list in nsfw_data_scraper/scripts/source_urls/neutral.txt.
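In practice that edit is just appending a URL to the text file. A minimal sketch (the subreddit chosen here is purely illustrative, and the mkdir only makes the snippet standalone — in the real repo the directory already exists):

```shell
# Ensure the path exists so this sketch runs standalone
mkdir -p nsfw_data_scraper/scripts/source_urls

# Hypothetical example: add another subreddit as a neutral source
echo "https://www.reddit.com/r/EarthPorn/" >> nsfw_data_scraper/scripts/source_urls/neutral.txt
```

After this, running step 1 of the scraper would pick up the new source along with the defaults.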
Reddit is a great source of content from around the web, since most subreddits are lightly policed by humans to stay on topic for that subreddit.
Label and clean the data
The data we got from the NSFW data scraper is already labeled! But expect some errors, especially since Reddit isn’t perfectly curated.
Duplication is also quite common, but fixable without slow human comparison.
The first thing I like to run is duplicate-file-finder, which is the fastest exact file matcher and deleter. It’s written in Python.
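The exact-duplicate pass above can be sketched with a simple content-hash scan — a minimal stand-in for what a tool like duplicate-file-finder does, not its actual implementation:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under `root` by content hash.

    Any group with more than one path is a set of byte-identical files;
    you could keep the first and delete the rest.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Hashing every file is enough for exact duplicates; near-duplicates (resized or re-encoded images) would need perceptual hashing, which is a separate step.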