NGender
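- Fewer than 20 lines of pure Python (core logic)
- No third-party dependencies
- Compatible with Python 3, Python 2, and PyPy
- 82% accuracy
- Guesses the gender behind a Chinese name
- Can also gauge how masculine or feminine a name is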
Usage
pip install ngender
or (OSX)
Then, on the command line:
$ ng 赵本山 宋丹丹
name: 赵本山 => gender: male, probability: 0.9836229687547046
name: 宋丹丹 => gender: female, probability: 0.9759486128949907
It can also be used from a Python program:
>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)
>>> %timeit guess('宋丹丹')
100000 loops, best of 3: 4.01 µs per loop
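How it works
Math
Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
When the components of X are conditionally independent given Y, P(X|Y) = P(X1|Y) * P(X2|Y) * ...
Applied to guessing gender from a name:
P(gender=male|name=本山)
= P(name=本山|gender=male) * P(gender=male) / P(name=本山)
= P(name has 本|gender=male) * P(name has 山|gender=male) * P(gender=male) / P(name=本山)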
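As a rough illustration of the derivation above, the sketch below scores a given name under each gender and then normalizes the two scores, so P(name) never has to be computed. The counts are made-up toy numbers, and `score` and `posterior_male` are hypothetical helper names, not part of ngender.

```python
# Toy counts, invented for illustration only: char -> occurrences per gender.
counts = {
    '本': {'male': 90, 'female': 10},
    '山': {'male': 80, 'female': 20},
    '丹': {'male': 10, 'female': 90},
}

# Total number of characters seen for each gender, and overall.
totals = {g: sum(c[g] for c in counts.values()) for g in ('male', 'female')}
grand_total = totals['male'] + totals['female']


def score(given_name, gender):
    """Return P(gender) * product of P(name has char | gender)."""
    p = totals[gender] / grand_total              # prior P(gender)
    for char in given_name:
        p *= counts[char][gender] / totals[gender]  # likelihood of one char
    return p


def posterior_male(given_name):
    """Normalize the two scores; the shared P(name) denominator cancels."""
    pm = score(given_name, 'male')
    pf = score(given_name, 'female')
    return pm / (pm + pf)


print(posterior_male('本山'))
print(posterior_male('丹'))
```

With real data the structure stays the same; only the counts change.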
Computation
The source data is the contents of 1.csv:
char,male,female
明,378860,63221
伟,378757,51232
军,378096,29518
建,366515,51477
华,344928,174529
文,314939,114048
国,314608,29055
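To make the table concrete, here is a minimal sketch that parses only the seven sample rows shown above and reports, for each character, the share of its occurrences that fall in male names. This raw share is just a quick way to read the table, not the quantity the classifier multiplies. `SAMPLE` and `male_share` are names introduced here for illustration, and the real file contains many more rows.

```python
import csv
import io

# The sample rows from the table above (a small excerpt of the full file).
SAMPLE = """char,male,female
明,378860,63221
伟,378757,51232
军,378096,29518
建,366515,51477
华,344928,174529
文,314939,114048
国,314608,29055
"""


def male_share(text):
    """Return {char: male / (male + female)} for each CSV row."""
    shares = {}
    for row in csv.DictReader(io.StringIO(text)):
        male, female = int(row['male']), int(row['female'])
        shares[row['char']] = male / (male + female)
    return shares


shares = male_share(SAMPLE)
# Print characters from most to least male-leaning.
for char in sorted(shares, key=shares.get, reverse=True):
    print(char, round(shares[char], 3))
```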
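Test code
step 1: compute each character's share of all characters in male names and in female names
step 2: p starts as the prior probability of the gender
step 3: multiply in the probability of each character among all characters of that gender
step 4: compare the male and female scores and normalize them to get the probability of the guessed gender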
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

__all__ = ['guess']


def py2compat(name):
    # On Python 2, decode byte strings to unicode. On Python 3, str has no
    # decode() method, so the name is returned unchanged.
    try:
        name = name.decode('utf-8')
    except (AttributeError, UnicodeError):
        pass
    return name


class Guesser(object):
    # step 1: build the model when the class is instantiated
    def __init__(self):
        self._load_model()

    # Read the per-character counts and turn them into frequencies.
    def _load_model(self):
        self.male_total = 0
        self.female_total = 0
        self.freq = {}
        with open(os.path.join(os.path.dirname(__file__),
                               '1.csv'),
                  'rb') as f:
            # skip the header line
            next(f)
            for line in f:
                line = line.decode('utf-8')
                char, male, female = line.split(',')
                # make sure the character is a unicode string
                char = py2compat(char)
                self.male_total += int(male)
                self.female_total += int(female)
                self.freq[char] = (int(female), int(male))
        self.total = self.male_total + self.female_total
        print(self.total)
        print(self.male_total)
        print(self.female_total)
        for char in self.freq:
            female, male = self.freq[char]
            self.freq[char] = (1. * female / self.female_total,
                               1. * male / self.male_total)
        print(self.freq)  # step 1: each character's share in female/male names
        # (output truncated) {'醪': (9.270400048354406e-08, 0.0), '咨': (2.781120014506322e-06, 2.396555163413343e-06), '屛':

    def guess(self, name):
        name = py2compat(name)
        # drop the first character, which is treated as the surname
        firstname = name[1:]
        for char in firstname:
            assert u'\u4e00' <= char <= u'\u9fa0', u'name must be in Chinese'
        pf = self.prob_for_gender(firstname, 0)
        print('------------')
        pm = self.prob_for_gender(firstname, 1)
        # step 4: compare the two scores and normalize to get a probability
        if pm > pf:
            return ('male', 1. * pm / (pm + pf))
        elif pm < pf:
            return ('female', 1. * pf / (pm + pf))
        else:
            return ('unknown', 0)

    def prob_for_gender(self, firstname, gender=0):
        # gender: 0 = female, 1 = male (index into the freq tuples)
        p = 1. * self.female_total / self.total \
            if gender == 0 \
            else 1. * self.male_total / self.total
        print(p)  # step 2: p is the prior probability of the gender
        for char in firstname:
            # step 3: frequency of the character among that gender's characters
            p *= self.freq.get(char, (0, 0))[gender]
            print(char)
            print(p)
        return p


guesser = Guesser()


def guess(name):
    return guesser.guess(name)


if __name__ == '__main__':
    print(guess("张结论"))
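One consequence of the `self.freq.get(char, (0, 0))` default is easier to see in isolation: a character missing from the table zeroes both products, so the result falls through to `('unknown', 0)`. The sketch below reproduces only that decision logic on a tiny hand-made table. The frequencies, the equal priors, and the name `decide` are illustrative assumptions, not values from 1.csv.

```python
# Hypothetical frequencies: char -> (P(char | female), P(char | male)).
freq = {
    '丹': (0.020, 0.002),
    '山': (0.001, 0.015),
}
PRIOR = (0.5, 0.5)  # assumed (P(female), P(male)) for this toy example


def decide(given_name):
    """Mirror the comparison logic of Guesser.guess on the toy table."""
    pf, pm = PRIOR
    for char in given_name:
        f, m = freq.get(char, (0, 0))  # unseen characters contribute zero
        pf *= f
        pm *= m
    if pm > pf:
        return ('male', pm / (pm + pf))
    elif pm < pf:
        return ('female', pf / (pm + pf))
    return ('unknown', 0)


print(decide('丹丹'))  # both characters are in the toy table
print(decide('丹龘'))  # '龘' is not in the toy table
```

Whether an unknown result is acceptable, or whether unseen characters should be skipped or smoothed instead, is a design choice the original code does not discuss.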
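- Where does the file charfreq.csv (named 1.csv in the test code above) come from? There was once a thing called 开房记录.avi (not really a video): roughly 20 million hotel check-in records containing names and genders. The counts were tallied from it.
- How is P(name has 本|gender=male) computed? The number of times 本 appears in male names / the total number of characters in male names.
- How is P(gender=male) computed? The number of male names / the total number of names. The test code above approximates this with character counts: male_total / total.
- How is P(name=本山) computed? It does not need to be. It cancels out when the male and female scores are normalized.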
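Written out, the cancellation looks like this. It is only a restatement of the formulas above, assuming the two genders are the only outcomes; the shorthand s_g for the unnormalized score of gender g is introduced here for convenience.

```latex
\begin{aligned}
s_{g} &= P(\text{gender}=g)\,\prod_{i} P(\text{name has } c_i \mid \text{gender}=g), \\
P(\text{gender}=\text{male} \mid \text{name})
  &= \frac{s_{\text{male}} / P(\text{name})}
          {s_{\text{male}} / P(\text{name}) + s_{\text{female}} / P(\text{name})}
   = \frac{s_{\text{male}}}{s_{\text{male}} + s_{\text{female}}}.
\end{aligned}
```

This is the same ratio the code returns as pm / (pm + pf).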
Pitfalls
Some names fool the model. 胜男 is typically a woman's name, yet:
>>> ngender.guess('李胜男')
('male', 0.851334658742)