A beginner-level Python web scraper project
Suppose we want the postal codes of every city in China as supporting data for some project; we can write a scraper to collect them.
The site I scraped postal codes from is http://www.ip138.com/post/postal22/
Which Python libraries does the scraper need?
The commonly used HTML parsing libraries are lxml and bs4:
from lxml import etree
from bs4 import BeautifulSoup
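As a minimal sketch of the difference between the two, the invented table row below (not taken from ip138) can be parsed by either library, XPath-style with lxml or tag-search-style with bs4:

```python
from lxml import etree
from bs4 import BeautifulSoup

# A made-up table row, standing in for a fragment of a real page
html = '<table><tr><td>Beijing</td><td>100000</td></tr></table>'

# lxml: XPath-based extraction of the cell text nodes
cells_lxml = etree.HTML(html).xpath("//td/text()")

# bs4: find the <td> tags, then read their text
soup = BeautifulSoup(html, "html.parser")
cells_bs4 = [td.get_text() for td in soup.find_all("td")]

print(cells_lxml)  # ['Beijing', '100000']
```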
Some sites return data in JSON format, which calls for the json library to parse it:
import json
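For instance, a sketch with a made-up JSON payload of the kind an API endpoint might return:

```python
import json

# An invented JSON string; real endpoints would return something similar
raw = '{"province": "Beijing", "postnumber": "100000"}'
record = json.loads(raw)  # parse the string into a dict
print(record["postnumber"])  # 100000
```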
Some information is too irregular for an HTML parser alone, so regular expressions are needed to extract it, which means importing the re library:
import re
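A sketch of the kind of extraction the scraper below relies on; the anchor tag here is invented for illustration:

```python
import re

# An invented anchor tag, shaped like the province links on the index page
fragment = '<a href="demo-province.html">Beijing</a>'

# Capture groups pull out just the pieces we want
href = re.findall(r'"(\S+\.html)"', fragment)[0]
name = re.findall(r'>(\S+)<', fragment)[0]
print(href, name)
```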
Sometimes we want the publication time of a page's content, but sites format dates inconsistently; to normalize them into a uniform format we need the datetime library:
import datetime
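As a sketch, assuming a few invented date formats a site might use, one way to normalize them all to YYYY-MM-DD:

```python
from datetime import datetime

# Invented examples of inconsistent date strings from different pages
raw_dates = ["2019/06/01", "2019-6-1 12:30", "01.06.2019"]
patterns = ["%Y/%m/%d", "%Y-%m-%d %H:%M", "%d.%m.%Y"]

def normalize(date_string):
    """Try each known pattern and return a uniform YYYY-MM-DD string."""
    for pattern in patterns:
        try:
            return datetime.strptime(date_string, pattern).strftime("%Y-%m-%d")
        except ValueError:
            continue  # pattern did not match, try the next one
    return None  # format not recognized

normalized = [normalize(d) for d in raw_dates]
print(normalized)  # ['2019-06-01', '2019-06-01', '2019-06-01']
```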
And requests, used to issue the HTTP requests themselves, is indispensable:
import requests
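A minimal sketch of a requests Session carrying default headers; the request is only prepared, never sent, so nothing goes over the network:

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Prepare (but do not send) a request, to inspect what would go on the wire
req = session.prepare_request(
    requests.Request("GET", "http://www.ip138.com/post/postal22/"))
print(req.headers["User-Agent"])
```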
A scraper fetches its data by requesting someone else's site, and overly frequent requests put pressure on their server, which is little different from attacking the site; so we should space requests out, using the time library:
import time
Some sites also have anti-scraping measures: requests arriving at a fixed rate get you blacklisted, so the interval between requests should be randomized:
import random
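Combining time and random, a sketch of a politeness delay (the 3 to 5 second range matches what the scraper below uses):

```python
import random
import time

def polite_sleep(low=3, high=5):
    """Sleep for a random whole number of seconds in [low, high]."""
    delay = random.randint(low, high)
    time.sleep(delay)
    return delay
```

Called between page fetches, this keeps the request rate both low and irregular.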
Scrapers are fragile and can die at any moment; to keep track of the scraper's state, import the logging library and write logs:
import logging
The complete code is as follows:
import datetime
import json
import logging
import random
import re
import time

import requests
from bs4 import BeautifulSoup
from lxml import etree


class GetPostNum:
    def __init__(self):
        self.url = "http://www.ip138.com/post/postal22/%s"
        self.header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "www.ip138.com",
            "Referer": "http://www.ip138.com/post/postal22/indexdbc4.html",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        self.file_path = "d:/post_data.txt"

    def start(self):
        session = requests.Session()
        session.headers = self.header
        data = session.get(self.url % "top.js")
        if data.status_code == 200:
            logging.info("index request succeeded")
            # Pull the per-province links out of the index page
            urls = re.findall(r'<a href="\S+html">\S+</a>', data.text)
            for url in urls:
                href = re.findall(r'"(\S+\.html)"', url)[0]
                province = re.findall(r'>(\S+)<', url)[0]
                results = []
                detail = session.get(self.url % href)
                detail.encoding = 'gb2312'  # the site serves GB2312-encoded pages
                trs = etree.HTML(detail.text).xpath("//table//tr")
                if len(trs) > 0:
                    city = ""
                    for tr in trs[2:]:  # skip the two header rows
                        tds = tr.xpath("./td")
                        if len(tds) == 4:
                            if len(tds[0].xpath("./b/text()")) > 0:
                                # A bold first cell starts a new city block
                                # (municipalities have no separate city level)
                                if "市" not in province:
                                    city = tds[0].xpath("./b/text()")[0]
                                results.append({"province": province, "city": city,
                                                "district": "",
                                                "disnum": tds[1].text,
                                                "postnumber": tds[2].text})
                            else:
                                results.append({"province": province, "city": city,
                                                "district": tds[0].text,
                                                "disnum": tds[1].text,
                                                "postnumber": tds[2].text})
                        elif len(tds) >= 6:
                            # Wide rows hold two district entries side by side
                            results.append({"province": province, "city": city,
                                            "district": tds[0].text,
                                            "disnum": tds[1].text,
                                            "postnumber": tds[2].text})
                            results.append({"province": province, "city": city,
                                            "district": tds[3].text,
                                            "disnum": tds[4].text,
                                            "postnumber": tds[5].text})
                time.sleep(random.randint(3, 5))  # randomized politeness delay
                logging.info("%s: %d records", province, len(results))
                self.save_to_file(results)

    def save_to_file(self, results):
        # Append one JSON object per line
        with open(self.file_path, "a", encoding='utf-8') as f:
            for record in results:
                f.write(json.dumps(record, ensure_ascii=False))
                f.write("\n")


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    GetPostNum().start()