(1) Import the required libraries:

```python
import requests
import re
```
![Crawl flow diagram](https://img-blog.csdnimg.cn/20210422201857479.png)
(2) Define a function to fetch the page source:

```python
root_url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/index.html'

def getHTML(url):
    """Fetch a page and return its decoded HTML text."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    # The site does not always declare its encoding; guess it from the content
    r.encoding = r.apparent_encoding
    return r.text

html = getHTML(root_url)
```
(3) Use a regular expression to match the relevant parts:

```python
# Each province cell carries a relative href and the province name
pattern = "<td><a href='(.+?)'>(.+?)<br/></a></td>"
p = re.compile(pattern)
news_links = p.findall(html)  # list of (href, name) tuples
news_urls = [root_url[:root_url.rfind("/") + 1] + item[0] for item in news_links]
news_title = [item[1] for item in news_links]
```
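Because the pattern contains two capture groups, `findall` returns a list of `(href, name)` tuples rather than plain strings. A minimal sketch on a hypothetical fragment mimicking the index page:

```python
import re

# Hypothetical fragment shaped like two province cells on the index page
sample = "<td><a href='11.html'>北京市<br/></a></td><td><a href='12.html'>天津市<br/></a></td>"
pattern = "<td><a href='(.+?)'>(.+?)<br/></a></td>"

links = re.findall(pattern, sample)
print(links)  # [('11.html', '北京市'), ('12.html', '天津市')]
```

Indexing `item[0]` and `item[1]` in the list comprehensions above then splits these tuples back into the href and the name.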
(4) Define a function that crawls level by level and writes the results to files:

```python
def contiue(news_urls, news_title):
    # One output file per province, named after the province
    for prov_url, prov_name in zip(news_urls, news_title):
        f = open(prov_name + '.txt', 'w', encoding='utf-8')
        html = getHTML(prov_url)
        # City rows: capture the (href, name) of the second cell
        pattern = "<tr class='citytr'><td><a href='.+?'>.+?</a></td><td><a href='(.+?)'>(.+?)</a></td>"
        p = re.compile(pattern)
        city_links = p.findall(html)
        city_urls = [prov_url[:prov_url.rfind("/") + 1] + item[0] for item in city_links]
        city_titles = [item[1] for item in city_links]
        for m in city_titles:
            f.write(m + "\n")
        for city_url in city_urls:
            html = getHTML(city_url)
            # County rows
            pattern = "<tr class='countytr'><td><a href='.+?'>.+?</a></td><td><a href='(.+?)'>(.+?)</a></td>"
            p = re.compile(pattern)
            county_links = p.findall(html)
            county_urls = [city_url[:city_url.rfind("/") + 1] + item[0] for item in county_links]
            county_titles = [item[1] for item in county_links]
            for m in county_titles:
                f.write(m + "\n")
            for county_url in county_urls:
                html = getHTML(county_url)
                # Town rows
                pattern = "<tr class='towntr'><td><a href='.+?'>.+?</a></td><td><a href='(.+?)'>(.+?)</a></td>"
                p = re.compile(pattern)
                town_links = p.findall(html)
                town_urls = [county_url[:county_url.rfind("/") + 1] + item[0] for item in town_links]
                town_titles = [item[1] for item in town_links]
                for m in town_titles:
                    f.write(m + "\n")
                for town_url in town_urls:
                    html = getHTML(town_url)
                    # Village rows have no link; capture the name in the third cell
                    pattern = "<tr class='villagetr'><td>.+?</td><td>.+?</td><td>(.+?)</td></tr>"
                    p = re.compile(pattern)
                    village_names = p.findall(html)
                    for m in village_names:
                        f.write(m)
        f.close()
```
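At every level the function turns a relative href into an absolute URL with the slicing idiom `url[:url.rfind("/") + 1] + href`. The standard library's `urllib.parse.urljoin` does the same resolution and handles more edge cases (absolute hrefs, `../` paths). A small sketch with a hypothetical relative link:

```python
from urllib.parse import urljoin

root_url = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/index.html'
href = '11.html'  # hypothetical relative link taken from the index page

# Manual slicing, as used in the function above
manual = root_url[:root_url.rfind("/") + 1] + href
# Standard-library equivalent
joined = urljoin(root_url, href)

print(manual == joined)  # True
```

For simple same-directory links the two give identical results, so either form works here.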
(5) Call the function:

```python
contiue(news_urls, news_title)
```
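The function above opens and closes each province file by hand; if an exception occurs mid-crawl, the file may never be closed. A `with` block closes it automatically. A minimal sketch of the same writing logic, using a temporary directory and hypothetical names:

```python
import os
import tempfile

titles = ['市辖区', '县']  # hypothetical city names for one province

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '北京市.txt')
    # The file is closed automatically when the with-block exits,
    # even if an exception is raised inside it
    with open(path, 'w', encoding='utf-8') as f:
        for t in titles:
            f.write(t + "\n")
    with open(path, encoding='utf-8') as f:
        content = f.read()

print(repr(content))
```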
The final result is shown in the figure.

The key point in this crawl is the regular expressions: they must match the target strings exactly.
A detailed `re` tutorial is available here: https://www.runoob.com/python/python-reg-expressions.html
The other point is controlling the file output: handle the line breaks between the province, city, county, and town levels carefully so that the output files stay well organized.
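The non-greedy `.+?` used in every pattern above is what keeps each match inside a single table cell; a greedy `.+` would run on to the last closing tag in the row. A quick comparison on a hypothetical two-cell row:

```python
import re

row = "<td>a</td><td>b</td>"

print(re.findall("<td>(.+?)</td>", row))  # ['a', 'b']        non-greedy: stops at the first </td>
print(re.findall("<td>(.+)</td>", row))   # ['a</td><td>b']   greedy: runs to the last </td>
```

Forgetting the `?` is one of the easiest ways to get a single garbled match instead of one match per cell.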