Gender Recognition with Machine Learning: Can a Machine Classify Gender?

Preface

Our names do not always define us, but they do give us a sense of identity. Labelling ourselves with a set of letters and words gives a kind of structure to our sense of self and anchors our presence in this world.

“It ain’t what they call you, it’s what you respond to.” — W.C.Fields

Motivation

Somehow (and very strangely), we know, through our life experiences, that if someone is named ‘XYZ’ then he or she is likely to be a ‘little like this’ or have a ‘little bit of that’ in them.

We may or may not be correct (mostly incorrect in my case) in mapping a personality to a name, but I wondered if a machine could do better than me.

As a simple litmus test, I considered gender as one of the identities of an individual. The motivating question was:

Can you predict a person’s gender from his or her name?

Strangely simple, isn’t it? Yet it is not as easy as it seems: cultural and linguistic differences make the task difficult.

That motivated me to build a bare-bones machine learning model and see whether it could predict a person’s gender any better than I can.

The entire code can be found on my GitHub repository.

Data Source

So where can a machine get a ton of life experience? Through historic, meticulously collated data of course! One such source is the catalogue of baby names in the US, available at data.gov.

This data is freely available and is published by the Social Security Administration. It lists first names and how often each one occurred in Social Security card applications in each state from 1919 through 2018.

Data Preprocessing

I downloaded this data, which comes as a collection of text files, one per state. Each text file contains all the first names that occurred at least 5 times in a given year. After a little clean-up, the final concatenated dataset takes up roughly 320 MB of memory!

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6028151 entries, 0 to 28019
Data columns (total 6 columns):
 #   Column  Dtype
---  ------  -----
 0   yob     int64
 1   name    object
 2   gender  object
 3   number  int64
 4   state   object
 5   chk     int64
dtypes: int64(3), object(3)
memory usage: 321.9+ MB

(Note: the gender shown against each name is exactly as recorded in the source text files. The downloadable zip file also includes a helpful README that explains certain features of the dataset.)

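The concatenation step described above could look something like the sketch below. It is not the author's code: the `load_state_files` helper, the `*.TXT` file pattern, and the assumed layout of each file (comma-separated, no header row, columns in the order state, gender, year of birth, name, count) are assumptions for illustration, so check them against the README in the download. The column names `yob`, `name`, `gender`, `number` and `state` are taken from the DataFrame summary shown above; the `chk` column is left out because the text does not say how it was derived.

```python
from pathlib import Path

import pandas as pd

# Assumed column order inside each state file; the README bundled with the
# download is the authority on the real layout.
FILE_COLUMNS = ["state", "gender", "yob", "name", "number"]


def load_state_files(folder):
    """Read every per-state text file in `folder` into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob("*.TXT")):
        frames.append(
            pd.read_csv(
                path,
                names=FILE_COLUMNS,     # the files are assumed to have no header row
                header=None,
                keep_default_na=False,  # keep strings such as "NA" as text, not missing values
            )
        )
    if not frames:
        raise FileNotFoundError(f"no .TXT files found in {folder}")
    # ignore_index=True gives one continuous index instead of restarting per file.
    combined = pd.concat(frames, ignore_index=True)
    # Reorder the columns to match the summary shown in the text.
    return combined[["yob", "name", "gender", "number", "state"]]
```

Calling `load_state_files("namesbystate")` (a hypothetical folder name) and then `.info()` on the result should print a summary similar to the one above, although the index will run continuously because of `ignore_index=True`.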

Exploratory Data Analysis

I wanted to explore the data a little and visualise some summary stats.

1) 100 most common and uncommon names in the last 100 years

Certain names are favoured by parents year after year, and a word cloud is a handy way to represent this visually.

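One way to get the name frequencies behind such a word cloud is to total the `number` column per name and keep the largest (or smallest) totals. The sketch below assumes a DataFrame with the `name` and `number` columns shown earlier; the `top_names` helper and the sample rows are made up for illustration and are not the author's code.

```python
import pandas as pd


def top_names(df, n=100, most_common=True):
    """Return the n names with the largest (or smallest) total counts."""
    # Sum the per-year, per-state counts so each name gets a single total.
    totals = df.groupby("name")["number"].sum()
    # nlargest / nsmallest return the result already sorted by total.
    return totals.nlargest(n) if most_common else totals.nsmallest(n)


# Tiny made-up sample, only to show the shape of the input and output.
sample = pd.DataFrame(
    {
        "name": ["Mary", "John", "Mary", "Zed", "John", "Mary"],
        "number": [10, 7, 5, 5, 3, 1],
    }
)

common = top_names(sample, n=2)                       # two most frequent names
uncommon = top_names(sample, n=1, most_common=False)  # least frequent name

# A plain {name: total} dict is a convenient input for word cloud tools.
frequencies = common.to_dict()
```

With the `wordcloud` package, for instance, `WordCloud().generate_from_frequencies(frequencies)` accepts a dictionary like this; the text does not say which tool was actually used.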
