Python: Fetching the answers to a Zhihu question and converting them to Markdown files

Latest recommended article published on 2024-08-25 07:06:38

weixin_39743622


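First, a disclaimer: this code is not original. It is based on an article on Cui Qingcai's blog, and most of the code is copied from it. The original article is here: https://cuiqingcai.com/4607.html

The original project was written in Python 2, however, so I ported it to Python 3.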

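Inspecting the page requests to find a pattern

Open the link to a Zhihu question, for example this one: Zhihu, 男生 25 岁了，应该明白哪些道理？ ("What should a man understand by the time he is 25?")

Then open the browser's developer tools. Most of the text on the page turns out to come from this API: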

https://www.zhihu.com/api/v4/questions/37400041/answers?include=data%5B%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%5D.mark_infos%5B%5D.url%3Bdata%5B%5D.author.follower_count%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=3&limit=20&sort_by=default

Well, never mind the details; let's just use this endpoint as it is.
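To see which parts of that long URL matter, it can help to split it into its query parameters. Below is a minimal sketch using only the standard library. The `include` value is shortened here for readability, and the helper name `describe_api_url` is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

def describe_api_url(url):
    """Return the question id and paging parameters found in an answers API URL."""
    parts = urlsplit(url)
    # The path looks like /api/v4/questions/<question_id>/answers
    question_id = parts.path.split('/')[4]
    query = parse_qs(parts.query)
    return {
        'question_id': question_id,
        'offset': int(query['offset'][0]),
        'limit': int(query['limit'][0]),
        'sort_by': query['sort_by'][0],
        'include_fields': query['include'][0].split(','),
    }

# `include` is shortened here; the real URL lists many more fields
url = ('https://www.zhihu.com/api/v4/questions/37400041/answers'
       '?include=data%5B%5D.is_normal%2Ccontent%2Cvoteup_count'
       '&offset=3&limit=20&sort_by=default')
info = describe_api_url(url)
print(info)
```

The parts that change between requests are `offset` and `limit`; the rest of the query string can be left untouched.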

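Parsing the API data

Looking at the JSON returned by the endpoint above, the structure is as follows:

In paging, previous is the URL of the preceding ajax request and next is the URL of the following one.

is_end indicates whether this is the last request.

is_start indicates whether this is the first request.

data holds 20 items, each of which is an answer to this question.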
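The shape described above can be written out as a trimmed Python dict. All values below are placeholders made up for illustration (the real response has many more fields per answer); only the key names come from the description and code in this post. The helper `has_more_pages` is also invented here.

```python
# Illustrative response shape; all values are placeholders, not real data
sample_response = {
    'paging': {
        'is_start': True,
        'is_end': False,
        'previous': 'https://www.zhihu.com/api/v4/questions/37400041/answers?offset=0&limit=20',
        'next': 'https://www.zhihu.com/api/v4/questions/37400041/answers?offset=20&limit=20',
    },
    'data': [
        {
            'id': 1,
            'content': '<p>answer body as HTML</p>',
            'voteup_count': 0,
            'created_time': 0,
            'author': {'name': 'placeholder author'},
            'question': {'id': 37400041, 'title': 'placeholder title'},
        },
    ],
}

def has_more_pages(response):
    """True when the paging block says more pages follow."""
    return not response['paging']['is_end']

print(has_more_pages(sample_response))
```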

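So the parsing function can be written like this: parse the items in data first, then check whether this is the last page; if it is not, call the function recursively to keep parsing.

Also note that is_start is true for this endpoint, so it really is the first request. Parsing onward in this order therefore retrieves all the data, and there seems to be no need to split the work into two kinds of requests as the original article describes.

The code is as follows: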

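import json
import requests
from requests.exceptions import RequestException

# `headers` is assumed to be defined elsewhere in the project (request headers such as User-Agent)

def request(self, url):
    try:
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            # Parse this page first, whether or not it is the last one
            text = response.text
            # print('url =', url, 'text =', text)
            content = json.loads(text)
            self.parse_content(content)
            # If this is not the last page, request and parse the next one recursively
            if not content.get('paging').get('is_end'):
                next_page_url = content.get('paging').get('next')
                # Upgrade only plain http URLs; a blanket replace('http', 'https')
                # would turn an existing 'https' into 'httpss'
                if next_page_url.startswith('http://'):
                    next_page_url = 'https://' + next_page_url[len('http://'):]
                self.request(next_page_url)
        return None
    except RequestException:
        print('Error requesting the URL')
        return None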
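Because each page only needs the `next` URL of the page before it, the same traversal can also be written as a loop instead of recursion, which avoids growing the call stack on questions with very many pages. This is a sketch, not the code from the post: `iter_answers` and the injected `fetch_json` callable are names invented here so the loop can be tried without network access.

```python
def iter_answers(first_url, fetch_json):
    """Yield every answer dict, following paging.next until is_end is true.

    fetch_json is any callable that takes a URL and returns the decoded JSON
    (for real use it could wrap requests.get(...).json()).
    """
    url = first_url
    while url:
        page = fetch_json(url)
        for answer in page.get('data', []):
            yield answer
        paging = page.get('paging', {})
        # Stop on the last page, or if the API gives no next URL
        url = None if paging.get('is_end') else paging.get('next')

# Offline demonstration with two fake pages
fake_pages = {
    'page1': {'data': [{'id': 1}, {'id': 2}],
              'paging': {'is_end': False, 'next': 'page2'}},
    'page2': {'data': [{'id': 3}],
              'paging': {'is_end': True, 'next': 'page3'}},
}
answer_ids = [a['id'] for a in iter_answers('page1', fake_pages.__getitem__)]
print(answer_ids)
```

Writing it as a generator also keeps fetching separate from parsing, so each answer can be handed to whatever processing step comes next.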

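Converting the content to Markdown

I copied this part of the code almost verbatim and did not study it closely. Roughly, it uses the html2text function of the html2text module to convert the HTML into Markdown-style text, tidies the formatting with regular expressions, and then uses a regular expression to find the image links and replace them with local image paths.

The code is somewhat long: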

def parse_content(self, content):
    # Each item in 'data' is one answer
    if 'data' in content.keys():
        for data in content.get('data'):
            parsed_data = self.parse_data(data)
            self.transform_to_markdown(parsed_data)

import time
from bs4 import BeautifulSoup

def parse_data(self, content):
    data = {}
    answer_content = content.get('content')
    # print('content =', content)
    author_name = content.get('author').get('name')
    print('author_name =', author_name)
    answer_id = content.get('id')
    question_id = content.get('question').get('id')
    question_title = content.get('question').get('title')
    vote_up_count = content.get('voteup_count')
    create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(content.get('created_time')))
    # html_template is not shown in this post; it presumably wraps the answer HTML in a full page
    content = html_template(answer_content)
    soup = BeautifulSoup(content, 'lxml')
    # Move the answer into a fresh <body class="zhi">
    answer = soup.find("body")
    soup.body.extract()
    soup.head.insert_after(soup.new_tag("body", **{'class': 'zhi'}))
    soup.body.append(answer)
    # Point src at the URL stored in data-actualsrc
    img_list = soup.find_all("img", class_="content_image lazy")
    for img in img_list:
        img["src"] = img["data-actualsrc"]
    img_list = soup.find_all("img", class_="origin_image zh-lightbox-thumb lazy")
    for img in img_list:
        img["src"] = img["data-actualsrc"]
    # Drop all <noscript> elements
    noscript_list = soup.find_all("noscript")
    for noscript in noscript_list:
        noscript.extract()
    data['content'] = soup
    data['author_name'] = author_name
    data['answer_id'] = answer_id
    data['question_id'] = question_id
    data['question_title'] = question_title
    data['vote_up_count'] = vote_up_count
    data['create_time'] = create_time
    return data

import os
import re
import html2text

def transform_to_markdown(self, data):
    content = data['content']
    author_name = data['author_name']
    answer_id = data['answer_id']
    question_id = data['question_id']
    question_title = data['question_title']
    vote_up_count = data['vote_up_count']
    create_time = data['create_time']
    file_name = 'vote[%d]_answer_by_%s.md' % (vote_up_count, author_name)
    folder_name = question_title
    # Create the question folder if it does not exist yet
    question_dir = os.path.join(os.getcwd(), folder_name)
    if not os.path.exists(question_dir):
        os.mkdir(question_dir)
    answer_path = os.path.join(os.getcwd(), folder_name, file_name)
    with open(answer_path, 'w+', encoding='utf-8') as f:
        # f.write("-" * 40 + "\n")
        origin_url = 'https://www.zhihu.com/question/{}/answer/{}'.format(question_id, answer_id)
        # print('origin_url =', origin_url)
        f.write("### Original answer URL: " + origin_url + "\n")
        f.write("### question_title: " + question_title + "\n")
        f.write("### Author_Name: " + author_name + "\n")
        f.write("### Answer_ID: %d" % answer_id + "\n")
        f.write("### Question_ID: %d" % question_id + "\n")
        f.write("### VoteCount: %s" % vote_up_count + "\n")
        f.write("### Create_Time: " + create_time + "\n")
        f.write("-" * 40 + "\n")
        # content is a BeautifulSoup object; str() returns its markup as text
        text = html2text.html2text(str(content))
        # Headings: strip padding whitespace inside ** ** and _ _ markers
        r = re.findall(r'\*\*(.*?)\*\*', text, re.S)
        for i in r:
            if i != " ":
                text = text.replace(i, i.strip())
        r = re.findall(r'_(.*)_', text)
        for i in r:
            if i != " ":
                text = text.replace(i, i.strip())
        text = text.replace('_ _', '')
        text = text.replace('_b.', '_r.')
        # Images: add a blank line after each one, download it, and link the local copy
        r = re.findall(r'!\[\]\((?:.*?)\)', text)
        for i in r:
            text = text.replace(i, i + "\n\n")
            folder_name = '%s/image' % os.getcwd()
            if not os.path.exists(folder_name):
                os.mkdir(folder_name)
            img_url = re.findall(r'\((.*)\)', i)[0]
            save_name = img_url.split('/')[-1]
            file_path = '%s/%s' % (folder_name, save_name)
            try:
                content = self.download_image(img_url)
                if content:
                    self.save_image(content, file_path)
            except Exception as e:
                print(e)
            else:  # reached only if no exception was raised
                text = text.replace(img_url, file_path)
        # No explicit f.close() is needed; the with block closes the file
        f.write(text)
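The image step in the function above boils down to: find each Markdown image link, derive a local file name from the last URL segment, and swap the remote URL for the local path. Here is a stripped-down sketch of just that rewriting, without any downloading. `localize_image_links` and the example URL are invented for illustration.

```python
import re

IMAGE_LINK = re.compile(r'!\[\]\((.*?)\)')

def localize_image_links(text, image_dir):
    """Rewrite ![](remote_url) links to point at files inside image_dir.

    Returns the new text and a list of (remote_url, local_path) pairs
    that still have to be downloaded by the caller.
    """
    downloads = []

    def replace(match):
        remote_url = match.group(1)
        local_path = '%s/%s' % (image_dir, remote_url.split('/')[-1])
        downloads.append((remote_url, local_path))
        # Blank line after each image, as in the code above
        return '![](%s)\n\n' % local_path

    return IMAGE_LINK.sub(replace, text), downloads

markdown = 'before ![](https://example.com/pics/abc_r.jpg) after'
new_text, downloads = localize_image_links(markdown, 'image')
print(new_text)
```

Passing a function to `re.sub` rewrites each link at the position where it was matched, instead of replacing every occurrence of the URL string anywhere in the text.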

Results

In Finder I sorted the files by name in descending order, so the answers can be browsed from the most upvoted to the least.
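Sorting by name gives the right order only when the file manager compares embedded numbers naturally; in a plain descending string sort, `vote[9]` would come before `vote[1000]`. Where that matters, the count can be parsed out of the file name. Below is a small sketch, assuming the `vote[N]_` prefix used for file names in this post; the function names and sample file names are made up.

```python
import re

VOTE_PREFIX = re.compile(r'^vote\[(\d+)\]_')

def vote_count(file_name):
    """Extract N from a name like 'vote[N]_someone.md'; -1 if it does not match."""
    match = VOTE_PREFIX.match(file_name)
    return int(match.group(1)) if match else -1

def sort_by_votes(file_names):
    """Return file names ordered from most to fewest upvotes."""
    return sorted(file_names, key=vote_count, reverse=True)

names = ['vote[9]_a.md', 'vote[1000]_b.md', 'vote[25]_c.md']
ordered = sort_by_votes(names)
print(ordered)
```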

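By the way, what on earth is the top-voted answer? It is plainly an advertisement: comments are disabled, it quotes a single mawkish line, and the upvotes were surely inflated. I reported it right away.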

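Summary

What I originally wanted was a good way to read the questions I follow on Zhihu, because some people's answers are really valuable. The Markdown approach is clearly not ideal, though, since reading the answers means opening the Markdown files one by one. The ideal, I think, would be to browse as on the Zhihu website but without paging, with all the content loaded in a single HTML page, like in an earlier post of mine, "Python: Adding a search function to Jianshu bookmarks".

As things stand, saving the scraped data to a local database and then loading it into a web page in one go looks like the better fit. To make that happen I will have to learn some front-end development. I would also like the page to save my reading position automatically when I leave and to return to it the next time I open the page. I hope I can get that working.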
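As a rough idea of what the storage half of that plan could look like, here is a sketch using SQLite from the standard library. The post does not say which database would be used, so SQLite, the table layout, and the function names are all assumptions; the column names mirror the keys returned by `parse_data`.

```python
import sqlite3

def save_answers(conn, answers):
    """Insert or update parsed answers, keyed by answer_id."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS answers ('
        'answer_id INTEGER PRIMARY KEY, question_id INTEGER, '
        'author_name TEXT, vote_up_count INTEGER, create_time TEXT, content TEXT)'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO answers VALUES '
        '(:answer_id, :question_id, :author_name, :vote_up_count, :create_time, :content)',
        answers,
    )
    conn.commit()

def load_answers(conn, question_id):
    """Return all stored answers for a question, most upvoted first."""
    rows = conn.execute(
        'SELECT author_name, vote_up_count, content FROM answers '
        'WHERE question_id = ? ORDER BY vote_up_count DESC',
        (question_id,),
    )
    return rows.fetchall()

conn = sqlite3.connect(':memory:')  # use a file path to keep data between runs
save_answers(conn, [
    {'answer_id': 1, 'question_id': 37400041, 'author_name': 'a',
     'vote_up_count': 5, 'create_time': '2018-01-01 00:00:00', 'content': '<p>x</p>'},
    {'answer_id': 2, 'question_id': 37400041, 'author_name': 'b',
     'vote_up_count': 9, 'create_time': '2018-01-02 00:00:00', 'content': '<p>y</p>'},
])
top = load_answers(conn, 37400041)
print(top)
```

A single query then returns every answer to a question in vote order, which is the data a one-page HTML view would need.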

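As for the code: downloading the images to local disk takes up most of the running time, so linking to the images' online URLs instead would be much faster. I had also wanted to use multiple processes with a process pool, as in the street-snap photos project, but that does not seem applicable here: the URL of the next request is contained in the response to the current one, so the current response has to be parsed before the next request can be made.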
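Pagination has to stay sequential for the reason given above, but the image downloads do not depend on each other, so they could run concurrently. Below is a sketch that uses a thread pool rather than the process pool mentioned above, since downloading is I/O-bound. `download_all` and the injected `download` callable are invented here; this is a suggestion, not something the post's code does.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, download, max_workers=8):
    """Run download(url) for every URL concurrently.

    Returns {url: result}; a URL whose download raised maps to None.
    """
    def safe_download(url):
        try:
            return download(url)
        except Exception as error:
            print('download failed:', url, error)
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as urls
        results = list(pool.map(safe_download, urls))
    return dict(zip(urls, results))

# Offline demonstration with a fake downloader
def fake_download(url):
    if url.endswith('bad.jpg'):
        raise IOError('simulated failure')
    return b'bytes of ' + url.encode('utf-8')

images = download_all(['a.jpg', 'bad.jpg', 'c.jpg'], fake_download)
print(sorted(images))
```

In real use, `download` could be a function that fetches one image URL and returns its bytes, with the results written to disk afterwards.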

Code on GitHub