1. Web Scraping with Python (Python网络数据采集) by Ryan Mitchell, translated by 陶俊杰 and 陈小莉
2. Python爬虫开发与项目实战 (Python crawler development and project practice) by 范传辉
https://blog.csdn.net/weixin_43160833/article/details/82818530
Tutorial: https://blog.csdn.net/weixin_41269004/article/details/80869692
First, install the requests library:
pip install requests
Then install Beautiful Soup:
https://www.crummy.com/software/BeautifulSoup/#Download
pip install beautifulsoup4
A quick test of requests:
import requests
url = 'http://www.baidu.com'
response = requests.get(url)  # request the Baidu homepage
print(response.status_code)   # print the status code of the response
print(response.content)       # print the fetched page source (as bytes)
I use Bing's browser, which has an accessibility-tree view for inspecting pages. In the Jianshu page source, an article title looks like this:
<a class="title" href="/p/4b3204829fb5" target="_blank">新《新白娘子传奇》,你不配叫这个名</a>
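Before scraping the live site, the snippet above can be parsed offline to confirm how BeautifulSoup matches it (a minimal sketch; the HTML string is copied from the example above):

```python
from bs4 import BeautifulSoup

# The single <a> tag shown above, as a standalone HTML string.
html = '<a class="title" href="/p/4b3204829fb5" target="_blank">新《新白娘子传奇》,你不配叫这个名</a>'
soup = BeautifulSoup(html, 'html.parser')

# The second positional argument of find() filters on the class attribute.
link = soup.find('a', 'title')
print(link.get('href'))  # → /p/4b3204829fb5
print(link.string)       # the link text, i.e. the article title
```

This is exactly the pattern the full script below relies on: `find_all('a', 'title')` returns every such tag on the page.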
from urllib import request
from bs4 import BeautifulSoup

url = "http://www.jianshu.com"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
page = request.Request(url, headers=headers)
page_info = request.urlopen(page).read().decode('utf-8')  # open the URL and decode the response body as UTF-8
soup = BeautifulSoup(page_info, 'html.parser')  # parse the page, using html.parser as the parser
titles = soup.find_all('a', 'title')  # find all <a> tags whose class is 'title'
# open/create python_json.txt for writing; encoding='utf-8' avoids a gbk encode error on Windows
with open(r"C:\Users\renxianshou\Desktop\python_json\python_json.txt", "w", encoding='utf-8') as file:
    for title in titles:  # write each article's title and link to the txt file
        file.write(title.string + "\n")
        file.write("http://www.jianshu.com" + title.get('href') + '\n\n')
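The write loop above can be exercised without any network access by feeding BeautifulSoup a hard-coded page fragment and writing to a temporary file (the sample HTML and file location below are made up for illustration):

```python
import os
import tempfile
from bs4 import BeautifulSoup

# A made-up fragment standing in for the downloaded Jianshu page.
sample = '''
<a class="title" href="/p/aaa111">First article</a>
<a class="title" href="/p/bbb222">Second article</a>
'''
soup = BeautifulSoup(sample, 'html.parser')
titles = soup.find_all('a', 'title')

# Same loop as the script above: one title line, one full-URL line, blank line.
path = os.path.join(tempfile.gettempdir(), 'python_json.txt')
with open(path, 'w', encoding='utf-8') as file:
    for title in titles:
        file.write(title.string + '\n')
        file.write('http://www.jianshu.com' + title.get('href') + '\n\n')

with open(path, encoding='utf-8') as file:
    text = file.read()
print(text)
```

Testing offline like this also makes it easy to check edge cases, e.g. a tag whose `title.string` is None, before pointing the script at the live site.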