Case one:
Open http://www.diyiziti.com/Builder, an online calligraphy-font generator.
Type some characters into the input box and convert them, then open Chrome's More Tools > Developer Tools, click Network > Doc, and look at the submitted data at the bottom of the panel.
There you can see the data that gets posted with the form. Of all those fields, the ones we actually entered by hand are probably just two: FontInfoId and Content.
FontInfoId is the font type chosen from the dropdown; its value is a number.
Content is the text we typed into the box to be converted.
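The same payload can be rebuilt and inspected offline. A minimal sketch in Python 3 (the scraper code below uses the older Python 2 urllib/urllib2 API instead); the FontSize value of '75' comes from the DevTools capture, the Content text here is just an example:

```python
from urllib.parse import urlencode, parse_qs

# Field names as observed in the DevTools Network panel.
# FontSize is one of the fields the browser defaults for us.
form = {
    'FontInfoId': '99',   # font chosen in the dropdown (numeric id)
    'Content': '书法',     # text typed into the input box
    'FontSize': '75',     # defaulted by the page
}
payload = urlencode(form)
print(payload)
```

Round-tripping the encoded string through parse_qs is a quick way to confirm every field survived encoding.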
So I scraped the site with the following code:
# -*- coding: utf-8 -*-
import urllib
import urllib2

half_url = 'http://www.diyiziti.com/Builder'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
headers = {'User-Agent': user_agent}
# numeric font ids and their names, read off the site's dropdown
sort = ('99', '104', '100', '103', '82', '105', '113', '392', '374', '384')
font = {'99': '柳公权柳体书法字体', '104': '颜真卿颜体书法字体', '100': '柳公权楷书繁体',
        '103': '赵孟頫楷书字体', '82': '欧阳询体书法字体', '105': '褚遂良楷书书法字体',
        '113': '毛笔字', '392': '北魏楷书字体', '374': '汉仪全唐诗字体', '384': '黄自元楷书'}

def getImg(wd):
    for Sort in sort:
        print '++++++++++', Sort, font[Sort]
        url = '%s/%s' % (half_url, Sort)
        # pass the complete form: the text to render plus the
        # fields the browser would otherwise default for us
        data = urllib.urlencode({
            'Content': wd,
            'FontInfoId': Sort,
            'FontSize': '75'})
        request = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(request)
        html = response.read()
        print(html)
What came back was a stream of gibberish:
++++++++++ 99 柳公权柳体书法字体
�PNG
IHDR : �GX� sRGB ��� gAMA ���a cHRM z& �� � �� u0 �` :� p��Q< �IDATx^�Ё à�S�Pa��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`��0`�����-t k�f� IEND�B`�
Pitfall 1: when a scraper submits a form, pass in the complete set of parameters; only the browser fills in default values for us.
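Note that the "gibberish" above is actually the raw bytes of a PNG image (the PNG, IHDR, and IEND chunk markers are visible), so the request itself succeeded; printing binary data to a terminal just renders it as noise. A minimal Python 3 sketch of writing the response body to disk instead (save_image is my own helper name, not part of the scraper above):

```python
def save_image(data, path):
    """Write the response body to disk if it looks like a PNG.

    PNG files always begin with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
    Returns True when the data was recognized and written, False otherwise.
    """
    if not data.startswith(b'\x89PNG\r\n\x1a\n'):
        return False          # not a PNG; probably an HTML error page
    with open(path, 'wb') as f:
        f.write(data)
    return True

# e.g. instead of print(html):
#     save_image(html, 'font_99.png')
```

Checking the signature first also catches the case where the server sends back an error page instead of an image.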
Case two:
Scrape the online calligraphy generator at http://www.zhenhaotv.com/.
Having learned the lesson of case one, I passed in the complete parameters:
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://www.zhenhaotv.com'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
headers = {'User-Agent': user_agent}  # , 'Referer': 'https://www.zhenhaotv.com/'}
sort = ('2', '4', '8', '9', '26', '30', '34')
font = {'2': '方正行楷繁体', '4': '汉仪雪君体繁', '8': '博洋柳体字体', '9': '博洋欧体字体',
        '26': '腾祥铁山楷繁', '30': '苏新诗柳楷简', '34': '新蒂赵孟頫楷'}

def getImg(wd):
    for Sort in sort:
        print '++++++++++', Sort, font[Sort]
        # this time every field the browser submits is passed along
        data = urllib.urlencode({
            'text': wd,
            'font': Sort,
            'size': '68',
            'color': '#000000',
            'bg': '#ffffff',
            'list': 'open'})
        request = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(request)
        html = response.read()
        print(html)
The result:
The page seemed not to change at all, identical to what you get just typing the URL into the browser, as if the parameters had never been sent.
I racked my brain over it; then by chance I copied the address back out of the browser's address bar, pasted it into the scraper, and the door to a new world opened...
Pitfall 2: scrape the exact URL you paste back out of the browser, so the protocol and the rest of the address are guaranteed correct.
My mistake was that the address was not http://www.zhenhaotv.com but https://www.zhenhaotv.com/. The protocols differ.
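One cheap guard against exactly this mistake: urllib follows redirects, and the response object's geturl() reports the final URL, so you can compare its scheme with the one you requested. A minimal offline sketch of just that comparison, in Python 3 (scheme_changed is my own helper name):

```python
from urllib.parse import urlsplit

def scheme_changed(requested, final):
    """True when the server moved us to a different protocol,
    e.g. the http:// -> https:// upgrade that bit me above."""
    return urlsplit(requested).scheme != urlsplit(final).scheme

print(scheme_changed('http://www.zhenhaotv.com',
                     'https://www.zhenhaotv.com/'))  # -> True
```

When this returns True, switch the scraper over to the final URL so the POST actually lands where the browser's form does.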
These are still just the basics of scraping.
Scraping is full of pitfalls; tread carefully, and cherish the journey...