As everyone knows, Baidu Wenku doesn't let non-VIP users copy its content directly. As a penniless user I deeply resent this, so I decided to scrape the content with Python.
# coding=utf-8
import re                # regular expressions, used to pull the text out of the response
import tkinter as tk     # simple GUI
import urllib.request    # the urllib submodules must be imported explicitly
import urllib.error

window = tk.Tk()
window.title('百度文库爬虫')        # "Baidu Wenku scraper"
window.geometry('500x300')

baseNum = tk.Label(window, text='请输入网址:')   # "Please enter the URL:"
baseNum.pack()
base_text = tk.StringVar()
base = tk.Entry(window, textvariable=base_text)
base.pack()

def xxxx():
    # Read the URL from the entry box, scrape it, and append the result
    # to a text file on the desktop.
    url = base_text.get()
    content_list = kaishi(url)
    with open(r'C:\Users\Administrator\Desktop\123.txt', 'a+', encoding='utf-8') as f:
        for item in content_list:
            f.write(str(item))
    print(content_list)

def main():
    tk.Button(window, text="生成桌面文件", command=xxxx).pack()   # "generate desktop file"
    tk.Button(window, text="退出", command=window.quit).pack()     # "quit"
    window.mainloop()

def kaishi(url):
    # Fetch the captured Request URL and extract the text fragments.
    headers = {
        # Copy the request headers from your own packet capture here;
        # a User-Agent is usually the minimum Baidu will accept.
        'User-Agent': 'Mozilla/5.0',
    }
    request = urllib.request.Request(url, headers=headers)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        # Chinese characters come back as \uXXXX escapes, so decode
        # with unicode_escape to get readable text.
        html = response.read().decode('unicode_escape')
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # Each fragment of text sits in a "c" field: {"c":"...","p":{...}}
    content_list = re.findall('"c":"(.*?)","p"', html)
    return content_list

if __name__ == "__main__":
    main()
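If you'd rather use requests than urllib, the same fetch-and-extract step can be written as below. This is only a minimal sketch: fetch_content is an illustrative name, the User-Agent is a placeholder, and you should fill in whatever headers your packet capture actually shows.

import re
import requests

def fetch_content(url):
    # Placeholder headers; copy the real ones from your packet capture.
    headers = {'User-Agent': 'Mozilla/5.0'}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of printing codes
    # Chinese text comes back as \uXXXX escapes, same as the urllib version.
    text = resp.content.decode('unicode_escape')
    # Grab every "c" field from the JSON-like response: {"c":"...","p":{...}}
    return re.findall('"c":"(.*?)","p"', text)

One advantage of requests is that requests.Session() can carry cookies across several calls, which may matter if Baidu ties the content URLs to your browsing session.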
If all the libraries are installed, running the script brings up a window like this:
Note that the address you enter must be a URL obtained through packet capture, namely the Request URL field of a captured request like the one shown here:
A 123.txt file will then be created on the desktop, containing the scraped content. Of course, if the document has many pages you also have to copy a separate URL for each page; my skills aren't up to solving that properly yet, so I'll come back and improve it later (a crude stopgap is sketched below).
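A minimal sketch of that stopgap, assuming you capture the Request URL of every page by hand: paste them into a list and reuse kaishi() from the script above.

# Stopgap for multi-page documents: one captured Request URL per page.
urls = [
    # 'https://.../page-1-request-url',
    # 'https://.../page-2-request-url',
]

with open(r'C:\Users\Administrator\Desktop\123.txt', 'a+', encoding='utf-8') as f:
    for page_url in urls:
        for piece in kaishi(page_url):   # kaishi() from the script above
            f.write(str(piece))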