Guessing gender from a Chinese name in PHP
Reference: https://github.com/observerss/ngender
Statistics data download: https://download.csdn.net/download/webben/11253847
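<?php
// Usage: php ngender.php <name>
if ($argc < 2) exit("Usage: php ngender.php <name>\n");
$name = $argv[1];
$gender = new Gender();
$ret = $gender->guess($name);
// Prints: name, guessed gender, probability (tab-separated).
echo $name . "\t" . $ret[0] . "\t" . $ret[1] . "\n";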
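class Gender
{
    /**
     * @var int Total number of samples
     */
    protected $total = 0;
    /**
     * @var int Number of male samples
     */
    protected $male_total = 0;
    /**
     * @var int Number of female samples
     */
    protected $female_total = 0;
    /**
     * @var array Per-character frequencies: char => array(female, male)
     */
    protected $freq = array();
    /**
     * @var string Path to the character frequency table
     */
    protected $sample = './charfreq.csv';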
    public function __construct()
    {
        $this->load_model();
    }
    public function guess($name)
    {
        $name = $this->name_to_array($this->to_utf8($name));
        // Drop the first character, assumed to be a single-character surname.
        array_shift($name);
        foreach ($name as $val) {
            if (!$this->is_chinese($val)) {
                throw new Exception('The name must be in Chinese');
            }
        }
        // Gender index: 0 = female, 1 = male.
        $pf = $this->prob_for_gender($name, 0);
        $pm = $this->prob_for_gender($name, 1);
        if ($pm > $pf) {
            return array('male', $pm / ($pm + $pf));
        } elseif ($pm < $pf) {
            return array('female', $pf / ($pm + $pf));
        }
        return array('unknown', 0);
    }
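    protected function prob_for_gender($name, $gender = 0)
    {
        // Start from the prior P(gender).
        $p = ($gender == 0)
            ? $this->female_total / $this->total
            : $this->male_total / $this->total;
        foreach ($name as $val) {
            // Characters missing from the table contribute a frequency of 0
            // instead of triggering an undefined-index warning.
            $p *= isset($this->freq[$val]) ? $this->freq[$val][$gender] : 0;
        }
        return $p;
    }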
    protected function is_chinese($name)
    {
        return preg_match('/^[\x{4e00}-\x{9fa5}]+$/u', $name) === 1;
    }
    protected function name_to_array($name)
    {
        // Split a UTF-8 string into an array of single characters.
        preg_match_all("/./u", $name, $matches);
        return $matches[0];
    }
    protected function load_model()
    {
        if (!file_exists($this->sample)) {
            throw new Exception('Sample data not found, please download it from: https://download.csdn.net/download/webben/11253847');
        }
        $fp = fopen($this->sample, 'r');
        fgets($fp); // skip the header row
        while (($line = fgets($fp)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // ignore blank lines
            }
            // Each row is: character, male count, female count.
            list($char, $male, $female) = explode(',', $line);
            $char = $this->to_utf8($char);
            $this->male_total += (int) $male;
            $this->female_total += (int) $female;
            $this->freq[$char] = array((int) $female, (int) $male);
        }
        fclose($fp);
        $this->total = $this->male_total + $this->female_total;
        // Convert raw counts into per-gender relative frequencies.
        foreach ($this->freq as $char => $val) {
            list($female, $male) = $val;
            $this->freq[$char] = array(
                $female / $this->female_total,
                $male / $this->male_total
            );
        }
    }
    protected function to_utf8($name)
    {
        // The input is assumed to be UTF-8 already, so this is a no-op.
        return $name;
    }
}
NGender
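Guess the gender from a Chinese name
- Fewer than 20 lines of pure Python (core part)
- No dependencies
- 82% accuracy
- Can be used to guess gender
- Can also be used to gauge how masculine or feminine a name is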
Usage
On the command line:
$ ng 赵本山 宋丹丹
name: 赵本山 => gender: male, probability: 0.9836229687547046
name: 宋丹丹 => gender: female, probability: 0.9759486128949907
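The PHP script above is run in a similar way. The results below are shown in the Python tuple notation of the original project; the script itself prints the name, gender, and probability separated by tabs.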
php ngender.php 赵本山
('male', 0.9836229687547046)
php ngender.php 宋丹丹
('female', 0.9759486128949907)
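How it works
Math
Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
When the components of X are conditionally independent given Y, P(X|Y) = P(X1|Y) * P(X2|Y) * ...
Applied to guessing from a name:
P(gender=male|name=本山)
= P(name=本山|gender=male) * P(gender=male) / P(name=本山)
= P(name has 本|gender=male) * P(name has 山|gender=male) * P(gender=male) / P(name=本山)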
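To make the derivation concrete, here is a minimal PHP sketch of the posterior computation. The function name `posterior_male` and every number in `$freq` are made up for illustration; they are not taken from charfreq.csv.

```php
<?php
// Hypothetical per-character likelihoods: char => [P(char|female), P(char|male)].
// These numbers are invented for illustration only.
$freq = array(
    '本' => array(0.001, 0.004),
    '山' => array(0.002, 0.010),
);

/**
 * Naive Bayes posterior P(gender=male | given-name characters).
 * $priorMale is P(gender=male). P(name) never appears because it
 * cancels when the two unnormalized scores are normalized.
 */
function posterior_male(array $chars, array $freq, $priorMale)
{
    $pf = 1.0 - $priorMale; // unnormalized score for female
    $pm = $priorMale;       // unnormalized score for male
    foreach ($chars as $ch) {
        // Unknown characters are treated as frequency 0 for both genders.
        $pair = isset($freq[$ch]) ? $freq[$ch] : array(0.0, 0.0);
        $pf *= $pair[0];
        $pm *= $pair[1];
    }
    $sum = $pf + $pm;
    return $sum > 0 ? $pm / $sum : null; // null: no evidence either way
}

$p = posterior_male(array('本', '山'), $freq, 0.5);
echo round($p, 4), "\n"; // prints 0.9524
```

Dividing by `$pf + $pm` is what removes P(name=本山) from the calculation: both scores share that denominator, so only their ratio matters.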
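Calculation
- Where does the file charfreq.csv come from? It was tallied from a data set of about 20 million records containing names and genders, jokingly referred to as 开房记录.avi ("hotel check-in records").
- How is P(name has 本|gender=male) computed? Number of times "本" appears in male names / total number of characters in male names.
- How is P(gender=male) computed? Number of male names / total number of names.
- How is P(name=本山) computed? It does not need to be; it cancels out when the probabilities are computed.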
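The counting step can be sketched roughly as follows. The helper name `tally_chars`, the `[name, gender]` input format, and the three sample records are assumptions made for this example; as in the PHP class earlier, the first character is assumed to be the surname and is dropped.

```php
<?php
/**
 * Tally how often each given-name character occurs per gender.
 * $records is a list of [name, gender] pairs with gender 'M' or 'F'
 * (a hypothetical input format).
 * Returns char => ['male' => n, 'female' => n].
 */
function tally_chars(array $records)
{
    $counts = array();
    foreach ($records as $rec) {
        list($name, $gender) = $rec;
        // Split the UTF-8 string into characters and drop the surname.
        $chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);
        array_shift($chars);
        foreach ($chars as $ch) {
            if (!isset($counts[$ch])) {
                $counts[$ch] = array('male' => 0, 'female' => 0);
            }
            $counts[$ch][$gender === 'M' ? 'male' : 'female']++;
        }
    }
    return $counts;
}

// Invented sample records, for illustration only.
$counts = tally_chars(array(
    array('赵本山', 'M'),
    array('宋丹丹', 'F'),
    array('李本', 'M'),
));

// P(name has 本 | gender=male)
//   = count of 本 in male names / all characters in male names
$maleTotal = array_sum(array_column($counts, 'male'));
echo $counts['本']['male'], '/', $maleTotal, "\n"; // prints 2/3
```

Writing each entry of `$counts` out as a `char,male,female` row would give a file in the column order that `load_model()` reads.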
Pitfalls
php ngender.php 李胜男
('male', 0.851334658742)
Both characters lean strongly male on their own, but combined they form a female name.
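One way to see why the model gets this name wrong is to look at each character's contribution separately. The sketch below is only an illustration: `char_leanings` is a hypothetical helper, and the frequencies are invented rather than read from charfreq.csv.

```php
<?php
// Hypothetical likelihoods: char => [P(char|female), P(char|male)].
// Invented numbers, for illustration only.
$freq = array(
    '胜' => array(0.0005, 0.0030),
    '男' => array(0.0004, 0.0008),
);

/**
 * For each character, return P(char|male) / P(char|female).
 * A ratio above 1 means the character leans male on its own.
 */
function char_leanings(array $chars, array $freq)
{
    $out = array();
    foreach ($chars as $ch) {
        if (isset($freq[$ch]) && $freq[$ch][0] > 0) {
            $out[$ch] = $freq[$ch][1] / $freq[$ch][0];
        } else {
            $out[$ch] = null; // unknown character or no female occurrences
        }
    }
    return $out;
}

$ratios = char_leanings(array('胜', '男'), $freq);
// Under the independence assumption the per-character ratios are simply
// multiplied, so two male-leaning characters push the whole name toward male.
$combined = array_product($ratios);
echo $combined > 1 ? "male-leaning\n" : "female-leaning\n"; // prints male-leaning
```

Because the characters are scored independently, the model has no way to learn that a particular combination behaves differently from its parts.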