Fetching Zhihu Question Answers with Python and Converting Them to Markdown Files

Prerequisites

  • Python 2.7
  • html2text
  • MarkdownPad (anything that can render Markdown will do)
  • knowing how to capture packets
  • most importantly, a proxy, because Zhihu has started banning IPs (a minimal setup is sketched right after this list)
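A minimal sketch of how a proxy is passed to requests; the address below is a placeholder, not a working proxy:

import requests

# requests accepts a per-request proxies mapping; both schemes are routed
# through the same HTTP proxy here. '1.2.3.4:8080' is a placeholder.
proxy = {
    'http': 'http://1.2.3.4:8080',
    'https': 'http://1.2.3.4:8080',
}
# response = requests.get('https://www.zhihu.com', proxies=proxy, timeout=10)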

How it works

The idea is simple: take the BODY of the fetched response, rebuild it into a complete HTML file, convert that file to Markdown with the html2text module, and finally post-process images and headings into proper Markdown format. So far the main target is Zhihu.
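A minimal sketch of that conversion step, using an invented HTML fragment; only the html2text call itself comes from the pipeline described above:

# -*- coding: utf-8 -*-
import html2text

# An invented fragment standing in for a response body.
body = u'<p>An answer with a <b>bold</b> phrase and an image <img src="https://pic1.zhimg.com/example.jpg"/></p>'

# Rebuild a complete page around the fragment, then convert it to Markdown.
page = u'<html><head></head><body>%s</body></html>' % body
print html2text.html2text(page).encode('utf-8')
# prints roughly: An answer with a **bold** phrase and an image ![](https://pic1.zhimg.com/example.jpg)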


Implementation

Fetching Zhihu answers

When writing the code I had two use cases in mind: first, fetch one specific answer and convert it; second, fetch every answer to a question and convert them one by one. In the second case the vote count can serve as a quality filter for which answers to keep.

Fetching a specific answer

URL: https://www.zhihu.com/question/27621722/answer/48658220 (the first number is the question ID, the second is the answer ID)
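Both IDs can be pulled straight out of the URL; this is the same regex approach the main script below uses:

import re

url = 'https://www.zhihu.com/question/27621722/answer/48658220'
question_id, answer_id = re.findall(r'question/(\d+)/answer/(\d+)', url)[0]
print question_id, answer_id  # -> 27621722 48658220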

I split this fetch into two parts. The first requests the URL above and takes the answer body and the vote count from the page; the second requests the endpoint below:

https://www.zhihu.com/api/v4/answers/48658220

Why two steps? Because the answer body returned by this endpoint is not the complete data, the page itself has to be requested as well.
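A sketch of the second step. The 'oauth ...' token is the public one hard-coded in the main script below; whether it is still accepted, and exactly which fields this endpoint returns, are assumptions worth verifying:

import json
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',  # public token from the main script
}
resp = requests.get('https://www.zhihu.com/api/v4/answers/48658220', headers=headers, timeout=10)
answer = json.loads(resp.content)
print sorted(answer.keys())  # inspect which fields this endpoint actually returns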

Fetching all answers to a question

This data is much simpler to get; the endpoint is:

https://www.zhihu.com/api/v4/questions/27621722/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=3

The responses are all JSON, which is easy to work with. One thing to watch: the answer body taken from this endpoint is a bare HTML fragment, not a complete HTML file, so it has to be wrapped before conversion.
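The listing is paginated via paging.is_end and paging.next. A minimal paging loop with the vote-count quality filter folded in; it assumes the endpoint answers a plain request, whereas in practice you may need the same headers and proxy as the main script, and the include list here is trimmed to the two fields the loop reads:

import json
import requests

MIN_VOTES = 100  # quality threshold for which answers to keep


def fetch(url):
    # Stand-in for the main script's request() method (no retries, no proxy here).
    return requests.get(url, timeout=10).content

first_url = ('https://www.zhihu.com/api/v4/questions/27621722/answers'
             '?sort_by=default&include=data%5B%2A%5D.content%2Cvoteup_count'
             '&limit=20&offset=0')

contents = json.loads(fetch(first_url))
while not contents['paging']['is_end']:
    for answer in contents['data']:
        if answer.get('voteup_count', 0) >= MIN_VOTES:
            pass  # convert this answer to Markdown
    contents = json.loads(fetch(contents['paging']['next']))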

Fields saved

  • author_name: the answerer's user name
  • answer_id: answer ID
  • question_id: question ID
  • question_title: question title
  • vote_up_count: vote count
  • create_time: creation time
  • the answer body (all of the above end up in one dict, shown below)
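Put together, these fields form the dict that the conversion step consumes; the keys match what transform_to_markdown() reads, the values here are illustrative:

# The dict parse() hands to transform_to_markdown(); 'content' is the answer body.
data = {
    'author_name': u'some_user',
    'answer_id': 48658220,
    'question_id': 27621722,
    'question_title': u'question title here',
    'vote_up_count': 1024,
    'create_time': '2015-06-20 10:00',
    'content': '<p>answer body as an HTML fragment</p>',
}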

Main script: zhihu.py

import os
import re
import json

import requests
import html2text

from parse_content import parse

"""
just for study and fun
Talk is cheap
show me your code
"""


class ZhiHu(object):
    def __init__(self):
        self.request_content = None

    def request(self, url, retry_times=10):
        """GET a URL through the proxy, retrying on failure."""
        header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
            'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
            'Host': 'www.zhihu.com'
        }
        times = 0
        while retry_times > 0:
            times += 1
            print 'request %s, times: %d' % (url, times)
            try:
                ip = 'your proxy ip'  # fill in your own proxy, e.g. '1.2.3.4:8080'
                proxy = None
                if ip:
                    proxy = {
                        'http': 'http://%s' % ip,
                        'https': 'http://%s' % ip
                    }
                self.request_content = requests.get(
                    url, headers=header, proxies=proxy, timeout=10).content
            except Exception, e:
                print e
                retry_times -= 1
            else:
                return self.request_content

    def get_all_answer_content(self, question_id, flag=2):
        """Scenario two: walk every answer to a question via the v4 API."""
        first_url_format = 'https://www.zhihu.com/api/v4/questions/{}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=3'
        first_url = first_url_format.format(question_id)
        response = self.request(first_url)
        if response:
            contents = json.loads(response)
            print contents.get('paging').get('is_end')
            while not contents.get('paging').get('is_end'):
                for content in contents.get('data'):
                    self.parse_content(content, flag)
                # follow the pagination cursor until is_end turns true
                next_page_url = contents.get('paging').get('next').replace('http', 'https')
                contents = json.loads(self.request(next_page_url))
        else:
            raise ValueError('request failed, quit......')

    def get_single_answer_content(self, answer_url, flag=1):
        """Scenario one: fetch one answer, in the two steps described above."""
        all_content = {}
        question_id, answer_id = re.findall(
            r'https://www.zhihu.com/question/(\d+)/answer/(\d+)', answer_url)[0]

        # step one: the answer page itself (answer body + vote count)
        html_content = self.request(answer_url)
        if html_content:
            all_content['main_content'] = html_content
        else:
            raise ValueError('request failed, quit......')

        # step two: the v4 answer endpoint
        ajax_answer_url = 'https://www.zhihu.com/api/v4/answers/{}'.format(answer_id)
        ajax_content = self.request(ajax_answer_url)
        if ajax_content:
            all_content['ajax_content'] = json.loads(ajax_content)
        else:
            raise ValueError('request failed, quit......')

        self.parse_content(all_content, flag)

    def parse_content(self, content, flag=None):
        data = parse(content, flag)
        self.transform_to_markdown(data)

    def transform_to_markdown(self, data):
        """Write one answer out as a Markdown file in a per-question folder."""
        content = data['content']
        author_name = data['author_name']
        answer_id = data['answer_id']
        question_id = data['question_id']
        question_title = data['question_title']
        vote_up_count = data['vote_up_count']
        create_time = data['create_time']

        file_name = u'%s--%s的回答[%d].md' % (question_title, author_name, answer_id)
        folder_name = u'%s' % question_title

        if not os.path.exists(os.path.join(os.getcwd(), folder_name)):
            os.mkdir(folder_name)
        os.chdir(folder_name)

        f = open(file_name, "wt")
        f.write("-" * 40 + "\n")
        origin_url = 'https://www.zhihu.com/question/{}/answer/{}'.format(
            question_id, answer_id)
        f.write("## 本答案原始链接: " + origin_url + "\n")
        f.write("### question_title: " + question_title.encode('utf-8') + "\n")
        f.write("### Author_Name: " + author_name.encode('utf-8') + "\n")
        f.write("### Answer_ID: %d" % answer_id + "\n")
        f.write("### Question_ID %d: " % question_id + "\n")
        f.write("### VoteCount: %s" % vote_up_count + "\n")
        f.write("### Create_Time: " + create_time + "\n")
        f.write("-" * 40 + "\n")

        text = html2text.html2text(content.decode('utf-8')).encode("utf-8")

        # headings: strip stray whitespace inside bold markers
        r = re.findall(r'\*\*(.*?)\*\*', text, re.S)
        for i in r:
            if i != " ":
                text = text.replace(i, i.strip())

        r = re.findall(r'_(.*)_', text)
        for i in r:
            if i != " ":
                text = text.replace(i, i.strip())
        text = text.replace('_ _', '')

        # images: put each ![](url) on its own line
        r = re.findall(r'!\[\]\((?:.*?)\)', text)
        for i in r:
            text = text.replace(i, i + "\n")

        f.write(text)
        f.close()


if __name__ == '__main__':
    zhihu = ZhiHu()
    url = 'https://www.zhihu.com/question/27621722/answer/105331078'
    zhihu.get_single_answer_content(url)

    # question_id = '27621722'
    # zhihu.get_all_answer_content(question_id)

zhihu.py is the main script, and its flow is simple: send the request, hand the response to the parse function, and finally save the result.
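The image fix-up in transform_to_markdown is worth a quick standalone illustration; the sample text here is made up:

# html2text emits images inline as ![](url); appending a newline after each
# match gives every image its own line in the Markdown output.
import re

text = 'some prose ![](https://pic1.zhimg.com/a.jpg) more prose'
for m in re.findall(r'!\[\]\((?:.*?)\)', text):
    text = text.replace(m, m + '\n')
print text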

Parser script: parse_content.py

import time
from bs4 import BeautifulSoup


def html_template(data):
    # wrap the api content fragment in a complete HTML page
    html = '''
    <html>
    <head></head>
    <body>
    %s
    </body>
    </html>
    ''' % data
    return html


def parse(content, flag=None):
    data = {}
    if flag == 1:  # single answer
        main_content = content.get('main_content')
        ajax_content = content.get('ajax_content')

        soup = BeautifulSoup(main_content.decode("utf-8"), "lxml")
        answer = soup.find("span