Gender Recognition with Machine Learning: Can a Machine Classify Genders?

Preface

Our names do not always define us, but they do give us a sense of identity. Labelling ourselves with a set of letters and words gives a kind of structure to our sense of self and anchors our presence in this world.

“It ain’t what they call you, it’s what you respond to.” - W.C. Fields

Motivation

Somehow (and very strangely), we know, through our life experiences, that if someone is named ‘XYZ’ then he or she is likely to be a ‘little like this’ or have a ‘little bit of that’ in them.

We may or may not be correct (mostly incorrect in my case) in mapping a personality to a name, but I wondered if a machine could do better than me.

As a simple litmus test, I considered gender as one of the identities of an individual. The motivating question was:

Can you predict a person’s gender from his or her name?

Strangely simple, isn’t it? Yet it is not as easy as it seems. Cultural and linguistic differences make the task difficult.

That motivated me to build a bare-bones machine learning model and see if it could predict a person’s gender any better than I can.

The entire code can be found on my GitHub repository.

Data Source

So where can a machine get a ton of life experience? Through historic, meticulously collated data of course! One such source is the catalogue of baby names in the US, available at data.gov.

This data is freely available and published by the Social Security Administration. It contains first names and their frequency of occurrence in Social Security Card applications in each state from 1919 to 2018.

Data Preprocessing

The download is a collection of text files, one per state. Each text file contains all the first names that occurred at least 5 times in a given year. I cleaned up the data a little, and the final concatenated dataset takes up nearly 320 MB of memory!

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6028151 entries, 0 to 28019
Data columns (total 6 columns):
 #   Column  Dtype
---  ------  -----
 0   yob     int64
 1   name    object
 2   gender  object
 3   number  int64
 4   state   object
 5   chk     int64
dtypes: int64(3), object(3)
memory usage: 321.9+ MB

(Note: The gender assigned here against each name is as per the data present in the text file. A helpful README file is also present in the downloadable zip file and it explains certain features about the dataset)
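
For readers who want to try the loading step, here is a minimal sketch that uses only the Python standard library rather than the pandas code from the original project. It assumes each line of a state file has the form `state,gender,year,name,count`; the README in the zip file is the place to confirm that layout. The names `NameRecord` and `parse_state_lines`, and the sample rows, are hypothetical.

```python
import csv
from typing import Iterable, NamedTuple


class NameRecord(NamedTuple):
    state: str
    gender: str
    yob: int
    name: str
    number: int


def parse_state_lines(lines: Iterable[str]) -> list[NameRecord]:
    """Parse lines assumed to look like 'AK,F,2018,Emma,30'."""
    records = []
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        state, gender, yob, name, number = row
        records.append(NameRecord(state, gender, int(yob), name, int(number)))
    return records


# Made-up sample rows, for illustration only.
sample = ["AK,F,2018,Emma,30", "AK,M,2018,Liam,28", ""]
records = parse_state_lines(sample)
print(records[0].name, records[0].number)
```

In practice the same function could be fed an open file handle for each state file, with the resulting lists concatenated.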

Exploratory Data Analysis

I wanted to explore the data a little and visualise some summary stats.

1) 100 most common and uncommon names in the last 100 years

Certain names are favoured by parents year after year, and a word cloud helps to represent this visually.

100 most common names in the last 100 years (image credit: author)

For example, popular names for girls have been Elizabeth, Katherine and Margaret (to name a few), while James, William and John have been popular names for boys. Apart from common names, it is also interesting to know a little about the uncommon ones.

100 most uncommon names in the last 100 years (image credit: author)

Some really interesting and uncommon names for girls have been Baelyn, Selda and Demani, while for boys they have been Rodarius, Jamaul and Kelvon!

2) Top 5 names of 2018

The top 5 names for 2018 among girls were Emma, Olivia, Ava, Isabella and Sophia. Liam, Noah, William, James and Oliver were the top-ranked boys’ names.
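
A ranking like this can be sketched by summing the per-state counts for one year and gender. The snippet below is only an illustration with made-up counts; the function name `top_names` and the row layout are assumptions, not the author's code.

```python
from collections import Counter

# (yob, name, gender, number) rows; the counts are made up for illustration.
rows = [
    (2018, "Emma", "F", 120), (2018, "Olivia", "F", 110),
    (2018, "Emma", "F", 40),   # same name reported by another state
    (2018, "Ava", "F", 90), (2018, "Liam", "M", 130),
    (2017, "Emma", "F", 999),  # different year, ignored below
]


def top_names(rows, year, gender, n=5):
    """Sum counts across states and return the n most frequent names."""
    totals = Counter()
    for yob, name, g, number in rows:
        if yob == year and g == gender:
            totals[name] += number
    return [name for name, _ in totals.most_common(n)]


print(top_names(rows, 2018, "F", n=2))  # ['Emma', 'Olivia']
```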

Top 5 names for 2018 (image credit: author)

3) History of the top-ranked names of 2018

To see how popular these names were in the past, I plotted a time series for each of the top 5 names of 2018. Interestingly, Emma and Olivia have been historically popular names for girls, while Ava, Isabella and Sophia gained popularity after the 1980s.

Historic popularity of top ranked female names of 2018 (image credit: author)

A similar trend is also observable among boys’ names. James and William have been consistently popular over the years and have ranked in the top 10 names. Liam, Noah and Oliver have become increasingly popular since 1980.

Historic popularity of top ranked male names of 2018 (image credit: author)

4) Visualising the popularity of names across each state in 2018

I also wanted to visualise the state-wise popularity of the top 10 ranked names of 2018, and an easy way to do that is to create a choropleth of the ranks.

(This also gave me a chance to play with Imageio, a simple library for creating GIFs. This post by Yong Cui, Ph.D. was really helpful.)

Popularity of top 10 female names of 2018 (image credit: author)
Popularity of top 10 male names of 2018 (image credit: author)

5) Length of names

I wanted to know if boys are given shorter names than girls, so I plotted the name lengths as a histogram. It seems that girls tend to have slightly longer names than boys, and most names are about 5 letters long.
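
The counts behind such a histogram can be sketched as follows. The six names are a hypothetical sample chosen only to show the mechanics and say nothing about the real distribution; the helper names are assumptions.

```python
from collections import Counter

# Hypothetical (name, gender) pairs; the real dataset has tens of thousands.
names = [("Emma", "F"), ("Olivia", "F"), ("Isabella", "F"),
         ("Liam", "M"), ("Noah", "M"), ("William", "M")]


def length_histogram(pairs, gender):
    """Count how many names of the given gender have each length."""
    return Counter(len(name) for name, g in pairs if g == gender)


def mean_length(pairs, gender):
    """Average name length for the given gender."""
    lengths = [len(name) for name, g in pairs if g == gender]
    return sum(lengths) / len(lengths)


print(sorted(length_histogram(names, "F").items()))  # [(4, 1), (6, 1), (8, 1)]
print(mean_length(names, "F"), mean_length(names, "M"))  # 6.0 5.0
```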

Length of names from the last 100 years (image credit: author)

6) The last letter in a name is a vowel

I had this strange feeling that girls tend to have names that end in vowels, while boys don’t. To test this, I plotted a pie chart, and my assumption held up. Names of girls largely tend to end in vowels, especially ‘a’ and ‘e’, while the boys’ names that do end in a vowel mostly end in ‘e’ or ‘o’.
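
The shares shown in a pie chart like this can be computed with a short helper. This is a sketch under two assumptions: only a, e, i, o and u count as vowels, and the sample list is hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def last_letter_shares(names):
    """Return {vowel: share of names ending in it}, plus the combined
    share of all vowel endings under the key 'vowel_total'."""
    endings = Counter(name[-1].lower() for name in names if name)
    total = sum(endings.values())
    shares = {ch: endings[ch] / total for ch in sorted(VOWELS) if endings[ch]}
    shares["vowel_total"] = sum(shares.values())
    return shares


girls = ["Emma", "Olivia", "Ava", "Sophie"]  # hypothetical sample
print(last_letter_shares(girls))
```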

Percentage of names ending in vowels (Image credit: author)

Predicting genders from names

Alright, now let’s get to the interesting stuff. Can a machine predict a gender from a name? To help it, I engineered a few features based on the insights I got from the data exploration.

1) Feature Engineering

Converting names to ASCII values

Names are proper nouns and I really didn’t want to follow a semantic approach. Instead, I converted each letter in a name to its ASCII value and calculated the average ASCII value for each name.

ascii_mean('Albert')
>>> 100.33333333333333
ascii_mean('Amelia')
>>> 97.5
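
The article does not show the body of `ascii_mean`, so the following is one plausible sketch. It keeps the original letter case, which is what reproduces the two values above.

```python
def ascii_mean(name: str) -> float:
    """Average of the character codes of the letters in `name`."""
    codes = [ord(ch) for ch in name]
    return sum(codes) / len(codes)


print(ascii_mean("Albert"))  # 100.33333333333333
print(ascii_mean("Amelia"))  # 97.5
```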

Length of a name

I created a new feature that contained the length of each name, i.e., the total number of letters in each name.

Number of vowels and consonants

I assumed that, in general, girls have more vowels in their names, while boys have more consonants. To quantify this assumption, I created two additional features: the number of vowels and the number of consonants in a name.

Last letter of a name

The assumption was that the name of a girl is more likely to end in a vowel as compared to the name of a boy. To quantify this belief, I created a feature that had binary values — 1 if the last letter in a name is a vowel and 0 if not.

With these 4 features, I wanted my binary classifier to classify the name into either a ‘Male’ class or a ‘Female’ class.

(This is a 2 class problem, purely on grounds of data availability. The dataset has only these 2 genders identified)
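
Putting the pieces together, the features described in this section could be assembled per name roughly as below. This is a sketch, not the author's implementation: the function name, the dictionary keys and the choice to treat only a, e, i, o and u as vowels are all assumptions.

```python
VOWELS = set("aeiou")  # assumption: 'y' counts as a consonant


def name_features(name: str) -> dict:
    """Build the features described above for a single name."""
    letters = [ch for ch in name if ch.isalpha()]
    n_vowels = sum(ch.lower() in VOWELS for ch in letters)
    return {
        "ascii_mean": sum(ord(ch) for ch in letters) / len(letters),
        "length": len(letters),
        "n_vowels": n_vowels,
        "n_consonants": len(letters) - n_vowels,
        "ends_in_vowel": int(letters[-1].lower() in VOWELS),
    }


print(name_features("Amelia"))
```

For "Amelia" this yields a length of 6, four vowels, two consonants and an `ends_in_vowel` flag of 1.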

2) Classifying genders from names

I built two very simple models: a Logistic Regression classifier and a Support Vector Classifier. I split the data 70:30 into a training set (24245 names) and a testing set (10392 names).
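
The two set sizes add up to 34637 names. A shuffled 70:30 split can be sketched with the standard library as below; rounding the test size up happens to reproduce the reported sizes. The function name and the seed are hypothetical, and the author may well have used a library helper instead.

```python
import math
import random


def split_70_30(items, test_fraction=0.3, seed=42):
    """Shuffle a copy of `items` and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)  # round the test size up
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_70_30(range(34637))
print(len(train), len(test))  # 24245 10392
```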

I was interested in building a no-frills, back-of-the-napkin model without too much hyper-parameter tuning, so I avoided creating a validation set altogether. The intent was to make fun, non-serious predictions!

Logistic Regression Classifier

The logistic regression model has a sensitivity (or recall) of 67.8% and a specificity of about 68.3%, which is somewhat better than random guessing. The precision for predicting Female is 76%, while for predicting Male it is 58%.

What this means is that although the model correctly identifies the gender about 68% of the time, it makes far fewer mistakes when it predicts Female than when it predicts Male.
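
To make the idea of a logistic regression classifier concrete, here is a from-scratch sketch trained with plain batch gradient descent. It is not the model from the article, which was presumably fitted with a library; the toy data, learning rate and epoch count are all made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Label 1 when the predicted probability is at least 0.5."""
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)
            for xi in X]


# Toy, made-up data: one feature (ends_in_vowel), label 1 = Female.
X = [[1], [1], [1], [0], [0], [0]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
print(predict([[1], [0]], w, b))  # [1, 0]
```

On real names the classes overlap heavily, which is why the reported scores sit well below the perfect separation of this toy example.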

Support Vector Classifier

The SVC with a linear kernel gives a lower sensitivity (64.5%) than the logistic regression model. Its specificity, 72.5%, is better than that of the previous model. The precision for predicting Female is nearly 78%, while for predicting Male it is 57%.

In other words, the SVC is better at recognising names that actually belong to Males (specificity) than names that actually belong to Females (sensitivity). However, when it predicts a name as Female, it is wrong far less often than when it predicts a name as Male.

Me vs Machine

Image courtesy: https://gph.is/g/aQQ9R9V

Now comes the real test: can I predict a gender better than the machine I built?

To find out, I randomly selected 10 names from the dataset (masking their gender of course) and pitted myself against the trained machines.

The results of the game can be understood through a series of questions about each player’s performance:

Q1: When the names actually belonged to Females, how often is the player correct (in technical terms: recall or sensitivity)?

Me: 83% of the time

Machine (both): 83% of the time

Q2: When the player predicts a name as Female, how often is the player correct (in technical terms: precision)?

Me: 100% of the time

Machine (both): 71% of the time

Q3: When the names actually belonged to Males, how often is the player correct (in technical terms: recall or sensitivity)?

Me: 100% of the time

Machine (both): 50% of the time

Q4: When the player predicts a name as Male, how often is the player correct (in technical terms: precision)?

Me: 80% of the time

Machine (both): 67% of the time
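
These percentages are consistent with 6 Female and 4 Male names among the 10 (83% is 5/6, 71% is 5/7, 67% is 2/3). The counts in the sketch below are a reconstruction inferred from the percentages, not figures taken from the article, and the function name is hypothetical.

```python
def class_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recall and precision for both classes, with Female as the positive class."""
    return {
        "female_recall": tp / (tp + fn),      # Q1 (sensitivity)
        "female_precision": tp / (tp + fp),   # Q2
        "male_recall": tn / (tn + fp),        # Q3 (specificity)
        "male_precision": tn / (tn + fn),     # Q4
    }


# Assumed counts (tp, fp, fn, tn), inferred from the percentages above.
me = class_metrics(tp=5, fp=0, fn=1, tn=4)
machine = class_metrics(tp=5, fp=2, fn=1, tn=2)

for player, m in (("Me", me), ("Machine", machine)):
    print(player, {k: f"{v:.0%}" for k, v in m.items()})
```

Under these assumed counts, the machines' weaker scores come from labelling two of the four Male names as Female.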

Who won? You decide.

So there you have it!

This was a quick and fun project for me and allowed me to learn a bunch of interesting concepts. I hope you liked it and any comments to improve or tweak the models are most welcome!

Ciao!

Translated from: https://towardsdatascience.com/can-a-machine-classify-genders-3119d6e39377
