Crawler: Scraping All Car News on Autohome (2005~2019)


weixin_30407613


Tags: crawler

Original link: http://www.cnblogs.com/clement-chiu/p/11303626.html

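Without further ado, here is the code. It still has shortcomings: writing to a database would be better than writing to a file, and it is too slow, so a distributed crawler would do better. The run went from 10 p.m. until 1 o'clock the next day.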

from bs4 import BeautifulSoup
import requests
import time

# Walk every listing page of the news index (pages 1 to 7017).
for i in range(1, 7018):
    url = 'https://www.autohome.com.cn/all/' + str(i) + '/'
    response = requests.get(url=url)
    response.encoding = response.apparent_encoding  # avoid garbled text when decoding

    soup = BeautifulSoup(response.text, features='html.parser')
    target = soup.find(id='auto-channel-lazyload-article')
    if target is None:  # this page has no article list, so skip it
        continue
    li_list = target.find_all('li')

    for item in li_list:
        a = item.find('a')  # find() returns one tag or None; find_all() returns a list
        try:  # None has no attrs, so items without the expected tags raise here
            href = a.attrs.get('href')
            title = a.find('h3').text
            img_src = a.find('img').attrs.get('src')
            print('Link: ' + href)
            print('Title: ' + title)
            print('Image URL: ' + img_src)
            time_write = time.asctime(time.localtime(time.time()))
            print('Written at', time_write)
            print('=========================================================')
            # utf-8 keeps Chinese titles intact regardless of the platform default
            with open(r'1.txt', 'a+', encoding='utf-8') as f:
                f.write(href + '\n' + title + '\n' + img_src + '\n' + time_write + '\n'
                        + '==========================================' + '\n')
        except (AttributeError, TypeError):
            # skip list items that lack the link, title, or image
            continue
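The intro says a database would be a better target than a text file. Here is a minimal sketch of that idea, assuming Python's built-in sqlite3 module. The table name `articles`, the helpers `open_db` and `save_article`, and the sample values are all hypothetical and do not come from the original post.

```python
import sqlite3
import time


def open_db(path=':memory:'):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS articles ('
        ' href TEXT PRIMARY KEY,'
        ' title TEXT,'
        ' img_src TEXT,'
        ' time_write TEXT)'
    )
    return conn


def save_article(conn, href, title, img_src):
    """Insert one record; a repeated href is ignored instead of duplicated."""
    time_write = time.asctime(time.localtime(time.time()))
    with conn:  # commits on success, rolls back on error
        conn.execute(
            'INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)',
            (href, title, img_src, time_write),
        )


# Hypothetical sample record; pass a file name to open_db() to persist data.
conn = open_db()
save_article(conn, '//www.autohome.com.cn/news/1.html', 'Sample title', '//img/1.jpg')
save_article(conn, '//www.autohome.com.cn/news/1.html', 'Sample title', '//img/1.jpg')
count = conn.execute('SELECT COUNT(*) FROM articles').fetchone()[0]
print(count)
```

In the crawler, `save_article(conn, href, title, img_src)` would take the place of the `with open(...)` block. Using `href` as the primary key means re-running the crawl does not store the same article twice.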

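Note

w+: truncates the file first, then writes. You can read back only what you have just written, after seeking to the start.
r+: does not truncate the file and allows both reading and writing. The file must already exist. Writing starts at the beginning of the file and overwrites what is there.
a+: append mode. Everything written goes to the end of the file.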
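The three modes can be compared with a small experiment. This is only an illustrative sketch; the temporary file name `demo.txt` and the variable names are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# w+ creates the file (or empties an existing one); seek(0) before reading back.
with open(path, 'w+', encoding='utf-8') as f:
    f.write('abcdef')
    f.seek(0)
    after_w = f.read()

# r+ keeps the content; the first write starts at offset 0 and overwrites.
with open(path, 'r+', encoding='utf-8') as f:
    f.write('XY')
    f.seek(0)
    after_r = f.read()

# a+ keeps the content; every write goes to the end of the file.
with open(path, 'a+', encoding='utf-8') as f:
    f.write('!')
    f.seek(0)
    after_a = f.read()

print(after_w, after_r, after_a)  # abcdef XYcdef XYcdef!
```

This is why the crawler uses a+: each record is added after the previous ones, and earlier results survive if the script is restarted.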

Reposted from: https://www.cnblogs.com/clement-chiu/p/11303626.html
