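Preface
In earlier posts we learned how to crawl Jianshu and pictures of girls on Zhihu. Today Luo Luopan (罗罗攀) reaches for the Weibo influencers, to find out who is the prettiest of them. The workflow is: analyze the page, write the crawler, call the face recognition API, then combine everything.
Page analysis
The Weibo influencer link is https://weibo.com/a/hot/7549094253303809_1.html. It is a "新鲜事" (fresh news) collection under Weibo's follow feed (you don't need to know more than that; the URL is all we need), and it gathers recent popular influencer posts.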
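The page is simple, so we parse it directly with the lxml library. One point to stress: the images on this page are standard resolution. High-resolution versions are available on the detail page, but I found that replacing "thumb180" with "mw690" in the image URL is enough to get the high-resolution image. For example: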
https://ww4.sinaimg.cn/thumb180/6960aeaaly1g23wtlad3sj21sc2dsu0x.jpg
https://ww4.sinaimg.cn/mw690/6960aeaaly1g23wtlad3sj21sc2dsu0x.jpg
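The substitution above can be sketched as a small helper. The function name `to_hd_url` is made up for illustration, and it assumes (as seen on this page) that the thumbnail `src` is protocol-relative and contains the `thumb180` path segment.

```python
def to_hd_url(src: str) -> str:
    """Turn a thumbnail src into a full URL of the mw690 image."""
    # Swap the size segment; a src without 'thumb180' is left unchanged
    url = src.replace('/thumb180/', '/mw690/')
    # The page uses protocol-relative links, so prepend the scheme
    if url.startswith('//'):
        url = 'https:' + url
    return url


print(to_hd_url('//ww4.sinaimg.cn/thumb180/6960aeaaly1g23wtlad3sj21sc2dsu0x.jpg'))
```

Replacing the whole `/thumb180/` segment rather than the bare word is a small precaution against touching a file name that happens to contain the same text.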
Crawler code
Following the approach above, we write the crawler:
import os

import requests
from lxml import etree

# Paste your own Weibo cookie here
headers = {
    'cookie': ''
}

url = 'https://weibo.com/a/hot/7549094253303809_1.html'
res = requests.get(url, headers=headers)
html = etree.HTML(res.text)
# Each post on the page sits in its own UG_list_a block
infos = html.xpath('//div[@class="UG_list_a"]')

# Make sure the output folder exists before writing into it
os.makedirs('row_img', exist_ok=True)

for info in infos:
    name = info.xpath('div[2]/a[2]/span/text()')[0]
    content = info.xpath('h3/text()')[0].strip()
    imgs = info.xpath('div[@class="list_nod clearfix"]/div/img/@src')
    print(name, content)
    i = 1
    for img in imgs:
        # Swap the thumbnail size for the high-resolution one
        href = 'https:' + img.replace('thumb180', 'mw690')
        print(href)
        res_1 = requests.get(href, headers=headers)
        # File name: author + post text + running index
        with open('row_img/' + name + '+' + content + '+' + str(i) + '.jpg', 'wb') as fp:
            fp.write(res_1.content)
        i = i + 1
Remember to fill in your own cookie, and the script is ready to use.
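One thing to watch: the file name is built from the author name and the post text, and post text can contain characters such as "/" or "?" that a file system rejects. Below is a minimal sketch of a sanitizer you could use when building the name. The helper `safe_filename` and its `max_len` parameter are not part of the original script; they are assumptions for illustration.

```python
import re


def safe_filename(name: str, content: str, index: int, max_len: int = 80) -> str:
    """Build 'name+content+index.jpg' with unsafe characters replaced."""
    stem = '{}+{}'.format(name, content)
    # Replace characters that Windows or Unix file systems do not accept
    stem = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', stem)
    # Long post text is cut so the file name stays within a sane length
    return '{}+{}.jpg'.format(stem[:max_len], index)


print(safe_filename('a/b', 'hi?', 1))
```

The "+" separators are kept on purpose, so the resulting names still follow the same pattern as the ones the crawler writes.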
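Face recognition API
We covered the face recognition API in an earlier post; here is a recap.
First, open http://ai.baidu.com/tech/face, log in, click "立即使用" (Use now), and create a face recognition application. Using the API is easy in one sense (just read the docs) and hard in another (everyone's reading ability is slowly declining). So let's follow the documentation (https://ai.baidu.com/docs#/Face-Detect-V3/top) step by step.
Next, we use the API Key and Secret Key to get a token: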
import requests

ak = ''  # API Key
sk = ''  # Secret Key
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={}&client_secret={}'.format(ak, sk)
res = requests.post(host)
print(res.text)
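The response is a JSON string, so the token still has to be pulled out of it. Here is a minimal sketch, assuming the response carries an `access_token` field on success and an `error_description` field on failure (please check the Baidu documentation for the exact fields). The helper name and the sample string are made up for illustration.

```python
import json


def extract_access_token(response_text: str) -> str:
    """Return the access_token from the token endpoint's JSON response."""
    data = json.loads(response_text)
    if 'access_token' not in data:
        # Assumed error shape; fall back to showing the whole payload
        raise ValueError('token request failed: {}'.format(
            data.get('error_description', data)))
    return data['access_token']


# Made-up sample, not a real token
sample = '{"access_token": "24.example", "expires_in": 2592000}'
print(extract_access_token(sample))
```

In the real script you would pass `res.text` from the token request instead of the sample string.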
With the token, we can call the detection endpoint to get information about a picture. Let's use a photo of Chaoyue (超越妹妹) as the example.
import base64
import json

import requests

token = ''  # the access_token obtained in the previous step


def get_img_base(file):
    # Read the image file and return its content base64-encoded
    with open(file, 'rb') as fp:
        content = base64.b64encode(fp.read())
        return content


request_url = "https://aip.baidubce.com/rest/2.0/face/v3/detect"
request_url = request_url + "?access_token=" + token

params = {
    'image': get_img_base('test.jpg'),
    'image_type': 'BASE64',
    'face_field': 'age,beauty,gender'
}

res = requests.post(request_url, data=params)
result = res.text
json_result = json.loads(result)
code = json_result['error_code']
gender = json_result['result']['face_list'][0]['gender']['type']
beauty = json_result['result']['face_list'][0]['beauty']
print(code, gender, beauty)
# result: 0 female 76.25
The token here is the one obtained from the previous request, and the image in params must be base64-encoded. Chaoyue scores 76.25, which is pretty good.
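Indexing straight into `face_list[0]` fails when the API finds no face. Here is a rough sketch of a more defensive way to read the same fields, assuming the response has the shape used above (`error_code`, `result`, `face_list`, `gender`, `beauty`). The helper `parse_face` and the sample dictionaries are illustrative only.

```python
def parse_face(json_result: dict):
    """Return (gender, beauty) for the first face, or None if unavailable."""
    if json_result.get('error_code') != 0:
        return None
    # 'result' may be missing or null when detection fails
    faces = (json_result.get('result') or {}).get('face_list') or []
    if not faces:
        return None
    face = faces[0]
    return face['gender']['type'], face['beauty']


# Hand-written samples that mimic the fields read in the code above
ok = {'error_code': 0,
      'result': {'face_list': [{'gender': {'type': 'female'}, 'beauty': 76.25}]}}
no_face = {'error_code': 222202, 'result': None}
print(parse_face(ok), parse_face(no_face))
```

Returning `None` lets the caller skip a picture with a simple `if` instead of catching exceptions.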
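Putting it together
Finally, we send each saved image to the API, get its beauty score (rescaled here to a 10-point scale), and move the images into different folders by score.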
import base64
import json
import os
import time

import requests


def get_img_base(file):
    # Read the image file and return its content base64-encoded
    with open(file, 'rb') as fp:
        content = base64.b64encode(fp.read())
        return content


file_path = 'row_img'
token = '24.890f5b6340903be0642f9643559aa7a1.2592000.1557979582.282335-15797955'
request_url = "https://aip.baidubce.com/rest/2.0/face/v3/detect"
request_url = request_url + "?access_token=" + token

# Create the score folders up front, otherwise os.rename fails
for folder in ('8分', '7分', '6分', '5分', '其他分'):
    os.makedirs(folder, exist_ok=True)

list_paths = os.listdir(file_path)
for list_path in list_paths:
    img_path = file_path + '/' + list_path
    # print(img_path)
    params = {
        'image': get_img_base(img_path),
        'image_type': 'BASE64',
        'face_field': 'age,beauty,gender'
    }
    res = requests.post(request_url, data=params)
    json_result = json.loads(res.text)
    code = json_result['error_code']
    if code == 222202:  # the API found no face in this picture
        continue
    try:
        gender = json_result['result']['face_list'][0]['gender']['type']
        if gender == 'male':
            continue
        beauty = json_result['result']['face_list'][0]['beauty']
        # Rescale the 0-100 score to a 10-point scale with one decimal
        new_beauty = round(beauty / 10, 1)
        print(img_path, new_beauty)
        # Move the image into the folder for its score band
        if new_beauty >= 8:
            os.rename(os.path.join(file_path, list_path), os.path.join('8分', str(new_beauty) + '+' + list_path))
        elif new_beauty >= 7:
            os.rename(os.path.join(file_path, list_path), os.path.join('7分', str(new_beauty) + '+' + list_path))
        elif new_beauty >= 6:
            os.rename(os.path.join(file_path, list_path), os.path.join('6分', str(new_beauty) + '+' + list_path))
        elif new_beauty >= 5:
            os.rename(os.path.join(file_path, list_path), os.path.join('5分', str(new_beauty) + '+' + list_path))
        else:
            os.rename(os.path.join(file_path, list_path), os.path.join('其他分', str(new_beauty) + '+' + list_path))
        # Pause between pictures to stay under the API rate limit
        time.sleep(1)
    except KeyError:
        pass
    except TypeError:
        pass
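The five branches differ only in the folder name, so the score-to-folder rule can also be written as one small function. This is just a sketch of the same thresholds; the name `bucket_for` is made up for illustration.

```python
def bucket_for(beauty: float):
    """Map a 0-100 beauty score to (10-point score, folder name)."""
    score = round(beauty / 10, 1)
    # Same bands as the if/elif chain: 8+, 7+, 6+, 5+, everything else
    for threshold in (8, 7, 6, 5):
        if score >= threshold:
            return score, '{}分'.format(threshold)
    return score, '其他分'


print(bucket_for(76.25))
```

With it, the loop body needs a single `os.rename` call that joins the returned folder name with `str(score) + '+' + list_path`.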