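Whatever stage you are at right now, start paying attention to how you dress. Programmers, please retire the plaid shirts that everyone wears. Your appearance matters far more than you think: nobody is asking you to be good-looking, but at the very least you should be tidy and clean.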
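Writing a crawler in an object-oriented style
First, settle the basic approach: send the request, parse the response, save the content.
Set up the basic structure: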
# Modules we need
import requests
from lxml import etree
import json


# This time we parse with XPath
class ZhiHuArticle(object):
    def __init__(self):
        pass

    def get_response(self):
        pass

    def str_to_html(self):
        pass

    def parse(self):
        pass

    def save(self):
        pass

    def work(self):
        pass


if __name__ == '__main__':
    article = ZhiHuArticle()
    article.work()
Sending the request
Inspect the article page to find its URL and User-Agent, then send the request.
def __init__(self):
    self.url = "https://www.zhihu.com/question/268776431/answer/1276747242"
    # self.file =open("jitang.text","wb")

def get_response(self, url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    }
    text = requests.get(url, headers=headers).text
    return text
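The request step can be made a little more defensive. The following is only a sketch under a few assumptions: `fetch_text`, `DEFAULT_HEADERS`, and the optional `session` parameter are names introduced here for illustration and are not part of the original code. It adds a timeout and raises on HTTP error statuses instead of silently returning an error page.

```python
import requests

# Same User-Agent string as in the article's get_response.
DEFAULT_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
}


def fetch_text(url, session=None, timeout=10):
    """Return the response body as text, raising on HTTP error statuses.

    `session` is a hypothetical hook: any object with a requests-style
    `get(url, headers=..., timeout=...)` method, e.g. requests.Session().
    """
    http = requests if session is None else session
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

Called as `fetch_text(self.url)`, this would be a drop-in body for `get_response`. Passing a `requests.Session()` is optional and mainly useful when several pages are fetched in a row.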
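Parsing the response
First, convert the fetched text into an HTML element tree:
html = etree.HTML(text)
Then parse the tree with XPath to locate the content you want. You can run the expression against the page with an XPath helper tool in the browser to check that it matches the right content.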
def str_to_html(self, text):
    html = etree.HTML(text)
    return html

def parse(self, html):
    nodes = html.xpath("//div[@class='RichContent-inner']/span/p/text()")
    return nodes
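To see what this XPath expression selects, here is a small sketch run against a hand-written HTML snippet. The snippet is a made-up stand-in that only imitates the structure the expression expects. It is not Zhihu's real markup, which may differ or change over time.

```python
from lxml import etree

# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<html><body>
  <div class="RichContent-inner">
    <span>
      <p>First paragraph.</p>
      <p>Second <b>bold</b> paragraph.</p>
    </span>
  </div>
  <div class="Other"><span><p>Ignored.</p></span></div>
</body></html>
"""

html = etree.HTML(SAMPLE_HTML)

# text() returns only the text nodes that are direct children of each <p>,
# so text wrapped in an inline tag such as <b> is skipped.
nodes = html.xpath("//div[@class='RichContent-inner']/span/p/text()")
print(nodes)  # ['First paragraph.', 'Second ', ' paragraph.']

# If the full text of each paragraph is wanted, string() concatenates
# all descendant text of the element it is evaluated on.
paragraphs = html.xpath("//div[@class='RichContent-inner']/span/p")
full_text = [p.xpath("string()") for p in paragraphs]
print(full_text)  # ['First paragraph.', 'Second bold paragraph.']
```

Note that `@class='RichContent-inner'` compares against the whole attribute value, so a `div` carrying additional class names would not match this expression.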
Saving the content
def save(self, nodes):
    # self.file.write(dict_data)
    with open("jitang.text", 'wb') as f:
        # Only the first 53 text nodes are written
        for node in nodes[0:53]:
            # One JSON string per line, so the paragraphs do not run together
            f.write(json.dumps(node, ensure_ascii=False).encode("utf-8") + b"\n")
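One way to check that the saved file is usable is to read it back. The sketch below assumes a one-JSON-string-per-line layout. `save_lines`, `load_lines`, the sample strings, and the temporary file name are all introduced here for illustration and are not part of the original crawler.

```python
import json
import os
import tempfile


def save_lines(nodes, path):
    """Write each string as one JSON value per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for node in nodes:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(node, ensure_ascii=False) + "\n")


def load_lines(path):
    """Read the file back into a list of strings, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample data standing in for the text nodes returned by parse().
sample = ["生活不易", "一起加油"]
path = os.path.join(tempfile.mkdtemp(), "jitang_demo.txt")
save_lines(sample, path)
restored = load_lines(path)
print(restored == sample)  # True
```

Because `json.dumps` escapes any newline inside a string, each paragraph stays on a single line and the file can be parsed line by line.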
Full source (it may not be very polished, so please bear with me)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from lxml import etree
import json


class Jitang(object):
    def __init__(self):
        self.url = "https://www.zhihu.com/question/268776431/answer/1276747242"
        # self.file =open("jitang.text","wb")

    def get_response(self, url):
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
        }
        text = requests.get(url, headers=headers).text
        return text

    def str_to_html(self, text):
        html = etree.HTML(text)
        return html

    def parse(self, html):
        nodes = html.xpath("//div[@class='RichContent-inner']/span/p/text()")
        return nodes

    def save(self, nodes):
        # self.file.write(dict_data)
        with open("jitang.text", 'wb') as f:
            # Only the first 53 text nodes are written
            for node in nodes[0:53]:
                # One JSON string per line, so the paragraphs do not run together
                f.write(json.dumps(node, ensure_ascii=False).encode("utf-8") + b"\n")

    def run(self):
        # Fetch, build the element tree, extract the text, then save it
        response = self.get_response(self.url)
        html_ = self.str_to_html(response)
        parsed = self.parse(html_)
        self.save(parsed)


if __name__ == '__main__':
    jitang = Jitang()
    jitang.run()
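Result
I also thought this article was quite well written. Maybe I am still at a sentimental age. Life is not easy, so let's keep going together and work toward the life we want.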