Datacamp Project Practice | Python: Name Game: Gender Prediction using Sound

Trance_Fu63

Categories: Data Analysis, python. Tags: python

Article link: https://blog.csdn.net/qq_41504254/article/details/111923550

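Datacamp is a learning site for data science. Some courses and exercises are free and most are paid, but my school bought memberships for us, so I plan to use it for practice for a while. After each exercise I will write up a review of what it covered, both to share it and to consolidate what I learned.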

Name Game: Gender Prediction using Sound

This exercise mainly covers basic use of NumPy, pandas, and matplotlib, and is split into 8 tasks.
Dataset and solution: b3x1

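TASK 1: Explore the NYSIIS algorithm

The only goal of this task is to get familiar with the fuzzy.nysiis function.

Import the fuzzy module.
Explore the fuzzy.nysiis function using any word.
Use fuzzy.nysiis to test whether two words you think sound the same come out equal.

There are many fuzzy name-matching algorithms. In this exercise we use the NYSIIS algorithm from the fuzzy module. The function fuzzy.nysiis takes a string and returns a phonetic (sound-based) version of it. For example, fuzzy.nysiis('colour') and fuzzy.nysiis('color') both return 'CALAR', which reflects how the word is pronounced. The algorithm is useful for catching (and correcting) certain spelling mistakes. For example, tomorrow is often misspelled as tommorow. Can fuzzy.nysiis treat the two as equal?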

# Importing the fuzzy package
import fuzzy

# Exploring the output of fuzzy.nysiis
fuzzy.nysiis("grey")

# Testing equivalence of similar sounding words
fuzzy.nysiis("grey") == fuzzy.nysiis("gray")

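TASK 2: Read in the data and extract the authors' first names

Import the pandas module.
Read datasets/nytkids_yearly.csv into author_df. Use a semicolon (;) as the delimiter when reading nytkids_yearly.csv.
Loop through author_df['Author'] to extract each author's first name and append it to first_name.
Add first_name to author_df as a column.
Use author_df.head() to look at the first few rows of author_df.

Each author's full name (author_df['Author']) is a simple short string. The words that make up a string can be separated with the split method. In this dataset there is exactly one space between the parts of a name.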
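As a small illustration of the split method mentioned above, the sketch below breaks one full name into its parts and keeps the first one. The name is made up for the example and is not taken from the dataset.

```python
# Hypothetical full name standing in for one value of author_df['Author']
full_name = "Tom Van Dyke"

# split() with no argument breaks the string on whitespace
parts = full_name.split()

# The first element is the first name, however many words follow it
first = parts[0]

print(parts)
print(first)
```

Taking element 0 keeps the logic the same for two-word and three-word names.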

# Importing the pandas module
#code here#

# Reading in datasets/nytkids_yearly.csv, which is semicolon delimited.
#code here#

# Looping through author_df['Author'] to extract the authors' first names
first_name = []
#code here#

# Adding first_name as a column to author_df
#code here#

# Checking out the first few rows of author_df
#code here#
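One possible way to fill in the template above, sketched on an in-memory sample instead of the real file. The two sample rows and the names in them are invented, and the sample keeps only two columns; with the real data the path datasets/nytkids_yearly.csv would be passed to pd.read_csv instead of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited sample; the real file has its own rows
sample_csv = io.StringIO(
    "Year;Author\n"
    "2008;Anna Rivers\n"
    "2009;Tom Van Dyke\n"
)

# sep=';' tells pandas that fields are separated by semicolons, not commas
author_df = pd.read_csv(sample_csv, sep=';')

# Collect the first word of every full name
first_name = []
for name in author_df['Author']:
    first_name.append(name.split()[0])

# A list of the same length as the DataFrame can be assigned as a new column
author_df['first_name'] = first_name

print(author_df.head())
```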

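TASK 3: Create the NYSIIS equivalent of each author's first name (first_name)

Import the numpy module.
For each first_name, create its NYSIIS equivalent and append it to nysiis_name.
Add nysiis_name to author_df as a column.
As a sanity check, print the difference between the number of unique first names and the number of unique NYSIIS names. Is it greater than 0?

You can import numpy as np and apply np.unique() to get an array of the unique values.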
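The sanity check can be sketched as follows. The phonetic codes here are hand-written stand-ins, not real fuzzy.nysiis output, so that the example runs without the fuzzy package; the names are also made up.

```python
import numpy as np

# Hypothetical first names and illustrative phonetic codes
# (placeholders, NOT actual fuzzy.nysiis results)
first_name = ["Sara", "Sarah", "Jon", "John", "Dave"]
nysiis_name = ["SAR", "SAR", "JAN", "JAN", "DAV"]

# np.unique returns the sorted unique values; len() counts them
n_first = len(np.unique(first_name))
n_nysiis = len(np.unique(nysiis_name))

# A positive difference means several spellings share one phonetic code
difference = n_first - n_nysiis
print(difference)
```

In the exercise itself the two inputs would be author_df['first_name'] and author_df['nysiis_name'].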

# Importing numpy
#code here#

# Looping through authors' first names to create the nysiis (fuzzy) equivalent
nysiis_name = []
for firstname in author_df['first_name']:
    #code here#

# Adding nysiis_name as a column to author_df
#code here#

# Printing out the difference between unique firstnames and unique nysiis_names:
#code here#

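TASK 4: Read in the dataset and add a new column

Read datasets/babynames_nysiis.csv into babies_df. The file is semicolon (;) delimited.
Loop through the indexes (rows) of babies_df. Depending on the values of perc_female and perc_male, append 'M' (male), 'F' (female), or 'N' (neutral) to gender.
Add gender to the DataFrame as a column.
Print the first few rows of babies_df.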

# Reading in datasets/babynames_nysiis.csv, which is semicolon delimited.
#code here#

# Looping through the rows of babies_df and filling up gender
gender = []
#code here#

# Adding a gender column to babies_df
#code here#

# Printing out the first few rows of babies_df
#code here#
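The row loop can be sketched on plain Python lists. The decision rule used here (the larger percentage wins, a tie counts as neutral) is an assumption, since the task only says the label depends on perc_female and perc_male, and the sample percentages are invented.

```python
# Hypothetical values standing in for babies_df['perc_female'] and
# babies_df['perc_male']
perc_female = [98.5, 1.2, 50.0]
perc_male = [1.5, 98.8, 50.0]

gender = []
for idx in range(len(perc_female)):
    # Assumed rule: the larger share decides, equal shares are neutral
    if perc_female[idx] > perc_male[idx]:
        gender.append('F')
    elif perc_female[idx] < perc_male[idx]:
        gender.append('M')
    else:
        gender.append('N')

print(gender)
```

With the real data the same comparisons would read babies_df['perc_female'][idx] and babies_df['perc_male'][idx], and babies_df['gender'] = gender would add the column.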

TASK 5:

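Loop through author_df['nysiis_name'] to find the index of each author's name in babies_df.
Use this index to extract the gender from babies_df and append it to author_gender. Where the name does not exist in babies_df, append 'Unknown' instead.
Add author_gender to author_df and print the first few rows of author_df.
Use value_counts() to count the distinct values in author_df['author_gender'].

The project author provides the function locate_in_list(a_list, element), which looks up element in a_list. You can use it to match names from the authors list against names in the babies list. If the element is not in a_list, the function returns -1. locate_in_list takes a list of names as input, so the DataFrame column has to be converted to a list first.
This can be done with list(babies_df['babynysiis']).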

# This function returns the location of an element in a_list.
# Where an item does not exist, it returns -1.
def locate_in_list(a_list, element):
    loc_of_name = a_list.index(element) if element in a_list else -1
    return(loc_of_name)

# Looping through author_df['nysiis_name'] and appending the gender of each
# author to author_gender.
author_gender = []
#code here#

# Adding author_gender to the author_df
#code here#

# Counting the authors' genders
#code here#
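A minimal sketch of the lookup loop, using short hand-written lists in place of the DataFrame columns. The phonetic codes and genders are invented, and collections.Counter stands in for value_counts() so the example needs no pandas.

```python
from collections import Counter

def locate_in_list(a_list, element):
    # Position of element in a_list, or -1 when it is absent
    return a_list.index(element) if element in a_list else -1

# Hypothetical stand-ins for list(babies_df['babynysiis']),
# babies_df['gender'] and author_df['nysiis_name']
babies_names = ["SAR", "JAN", "DAV"]
babies_gender = ["F", "M", "M"]
author_names = ["JAN", "XYZ", "SAR"]

author_gender = []
for name in author_names:
    idx = locate_in_list(babies_names, name)
    if idx != -1:
        # Same position in both lists, so idx also picks the gender
        author_gender.append(babies_gender[idx])
    else:
        author_gender.append('Unknown')

# Tally of the labels, analogous to value_counts() on the column
counts = Counter(author_gender)
print(author_gender)
print(counts)
```

Converting the column to a list once, before the loop, avoids rebuilding it for every author.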

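TASK 6: Tally up the authors' genders

Create years, a list of the unique year values from author_df, sorted in ascending order.
Loop through the years, count the occurrences of each value (M, F, Unknown) per year, and append the counts to the lists males_by_yr, females_by_yr, and unknown_by_yr.
Print the yearly values to see how the gender split of authors changes over time.

Occurrences of a value in a column of author_df can be counted with:

len(author_df[author_df['author_gender']=='F'])

More conditions can be added with the & operator.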
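To show how the & operator combines two conditions, here is a sketch on a tiny hand-made DataFrame. The column names Year and author_gender are assumptions about the real author_df, and the rows are invented.

```python
import pandas as pd

# Hypothetical miniature author_df
author_df = pd.DataFrame({
    'Year': [2008, 2008, 2009, 2009, 2009],
    'author_gender': ['F', 'M', 'F', 'F', 'Unknown'],
})

# Each comparison gets its own parentheses because & binds
# more tightly than == in Python
females_2009 = len(author_df[(author_df['Year'] == 2009) &
                             (author_df['author_gender'] == 'F')])

# The same filter inside a loop gives one count per year
years = sorted(author_df['Year'].unique())
females_by_yr = []
for yr in years:
    mask = (author_df['Year'] == yr) & (author_df['author_gender'] == 'F')
    females_by_yr.append(len(author_df[mask]))

print(females_2009)
print(females_by_yr)
```

Leaving out the parentheses around each comparison is a common source of errors with combined masks.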

# Creating a list of unique years, sorted in ascending order.
#code here#

# Initializing lists
males_by_yr = []
females_by_yr = []
unknown_by_yr = []

# Looping through years to find the number of male, female and unknown authors per year
#code here#

# Printing out yearly values to examine changes over time
#code here#

TASK 7: Visualize the foreign-born authors with a bar plot

Use plt.bar to draw a bar chart of unknown_by_yr by year.
[Optional] Add a title and axis labels to the chart.

# Importing matplotlib
import matplotlib.pyplot as plt

# This makes plots appear in the notebook
%matplotlib inline

# Plotting the bar chart
#code here#

# [OPTIONAL] - Setting a title, and axes labels
#code here#

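TASK 8: Compare male and female authorship with a grouped bar chart

Create a new list, years_shifted, in which 0.25 is added to each element of years.
Draw a bar chart for male authors with width=0.25 and color='lightblue'.
Draw a bar chart for female authors with width=0.25 and color='pink'.
[Optional] Add axis labels and a title.

One way to draw a grouped bar chart is to plot two bar charts on the same axes, shifting the x positions of one of them and making the bars narrower so that they do not overlap.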
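The shifting idea can be sketched as below. The yearly counts are invented, and the Agg backend is selected only so the sketch runs without a display; in a notebook the %matplotlib inline line takes its place.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for this sketch
import matplotlib.pyplot as plt

# Hypothetical yearly counts standing in for the lists built in TASK 6
years = [2008, 2009, 2010]
males_by_yr = [8, 19, 27]
females_by_yr = [15, 45, 48]

# Move the second series a quarter unit to the right
years_shifted = [yr + 0.25 for yr in years]

# Narrow bars (width 0.25) leave room for both series side by side
male_bars = plt.bar(years, males_by_yr, width=0.25,
                    color='lightblue', label='Male')
female_bars = plt.bar(years_shifted, females_by_yr, width=0.25,
                      color='pink', label='Female')

plt.xlabel('Year')
plt.ylabel('Number of authors')
plt.title('Authors by gender and year')
plt.legend()
```

Because the shift equals the bar width, each pink bar sits directly next to its light blue partner.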

# Creating a new list, where 0.25 is added to each year
#code here#

# Plotting males_by_yr
#code here#

# Plotting females_by_yr by years_shifted
#code here#

# [OPTIONAL] - Adding relevant Axes labels and Chart Title
#code here#