python
I. Install the requests library and bs4
pip install requests
pip install bs4
pip install lxml
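To confirm the installs worked before running the scripts, a quick check along the lines of the sketch below may help. The helper name `missing_packages` is made up for this illustration; it relies only on the standard library's `importlib.util.find_spec`. `lxml` is included in the check because the code in this article passes 'lxml' to BeautifulSoup as the parser.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by the scripts in this article;
# an empty list means everything is installed.
print(missing_packages(["requests", "bs4", "lxml"]))
```

Any name that shows up in the printed list still needs a `pip install`.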
II. Analysis steps
III. Practice (scraping articles)
1. Code:
import io
import os
import sys

import requests
from bs4 import BeautifulSoup

# Force UTF-8 output so Chinese text prints correctly in the console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')


def urlBS(url):
    """Request a page and return it as a parsed BeautifulSoup object."""
    resp = requests.get(url)
    html = resp.content.decode('gbk')  # decode the raw bytes as GBK
    soup = BeautifulSoup(html, 'lxml')  # parse the page
    # print(soup)
    return soup


def main(url):
    soup = urlBS(url)
    # Article list entries, located by inspecting the page source
    lis = soup.find('ul', class_="i1 ico1").find_all('li')
    # Output directory (os.getcwd() returns the current working directory)
    path = os.getcwd() + '/爬取的文章/'
    if not os.path.isdir(path):  # create the folder if it does not exist
        os.mkdir(path)
    for i in lis:
        newurl = i.find('a')['href']
        print(newurl)
        # Request each article
        result = urlBS(newurl)
        title = result.find('div', class_="artview").find('h1').get_text()  # title
        print(title)
        writer = result.find('div', class_="artinfo").get_text()  # author info
        print(writer)
        # Each article is saved as <title>.txt
        filename = path + title + '.txt'
        print(filename)
        text = result.find('div', class_="artbody").find('p').get_text()
        # Write title, author, and body; an explicit encoding avoids
        # depending on the platform default
        with open(filename, 'w', encoding='utf-8') as new:
            new.write('<<' + title + '>>\n\n')
            new.write(writer + '\n\n')
            new.write(text)


if __name__ == '__main__':
    firsturl = 'http://www.rensheng5.com/zx/onduzhe/'  # target URL
    main(firsturl)
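One thing the script above takes for granted is that every article title is a valid file name. On Windows, characters such as ? : * " / \ | < > are not allowed in file names, so `open` can fail on some titles. A minimal sketch of one way to guard against that follows; `safe_filename` is a hypothetical helper, not part of the original script, and the replacement character and fallback name are arbitrary choices.

```python
import re

# Characters that Windows does not allow in file names
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement='_'):
    """Replace characters that are invalid in Windows file names and trim whitespace."""
    cleaned = _INVALID_CHARS.sub(replacement, title).strip()
    return cleaned or 'untitled'  # fall back if nothing usable is left


print(safe_filename('What is life? Part 1/2'))  # prints: What is life_ Part 1_2
```

In `main`, the file name line would then read `filename = path + safe_filename(title) + '.txt'`.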
2. Result:
3. Notes:
IV. Merge into a single .txt file
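1. In a command prompt window, change to the directory that contains the .txt files to merge.
2. After confirming the directory is correct, run type *.txt >>e:\111.txt. This command appends the contents of every .txt file in the current directory to e:\111.txt.
3. Open the merged e:\111.txt to see that the .txt files have all been merged, in order, into 111.txt on drive E.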
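The same merge can also be done from Python, which avoids depending on the Windows `type` command. The sketch below is one possible approach, not the method used above: `merge_txt_files` is a hypothetical helper, and it assumes the files are UTF-8 encoded and should be merged in sorted file-name order.

```python
from pathlib import Path


def merge_txt_files(src_dir, out_file, encoding='utf-8'):
    """Append the contents of every .txt file in src_dir to out_file.

    Files are processed in sorted name order, and out_file itself is skipped
    if it lives inside src_dir. Returns the number of files merged.
    """
    src_dir = Path(src_dir)
    out_path = Path(out_file).resolve()
    count = 0
    with open(out_path, 'a', encoding=encoding) as out:
        for txt in sorted(src_dir.glob('*.txt')):
            if txt.resolve() == out_path:
                continue  # do not merge the output file into itself
            out.write(txt.read_text(encoding=encoding))
            out.write('\n')  # keep files separated by a newline
            count += 1
    return count
```

Calling `merge_txt_files('.', r'e:\111.txt')` from the article folder would mirror the command in step 2. If the files were written with a different encoding, pass it through the `encoding` parameter.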
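V. Fixing garbled text in scraped pages
General solution:
response = requests.get("page URL")
# Re-encode the text with the encoding requests guessed to recover the raw bytes,
# then decode those bytes with the page's real encoding (gbk here)
data = bytes(response.text, response.encoding).decode("gbk", "ignore")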
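Why this re-encode/decode trick works can be seen without any network access. When a server does not declare a charset, requests may fall back to ISO-8859-1 for `response.text`, which maps every byte to exactly one character; encoding that text back with the same codec recovers the original bytes, which can then be decoded correctly. The sketch below simulates this with a hard-coded string standing in for a downloaded page (an assumption made purely for illustration).

```python
# Simulate a GBK-encoded page body as raw bytes (stand-in for response.content)
raw = '中文网页'.encode('gbk')

# Decoding with the wrong codec produces garbled text (stand-in for response.text)
garbled = raw.decode('ISO-8859-1')

# Re-encoding with the same wrong codec recovers the original bytes ...
recovered = bytes(garbled, 'ISO-8859-1')

# ... which can then be decoded with the page's real encoding
fixed = recovered.decode('gbk', 'ignore')

print(recovered == raw)  # True: no bytes were lost in the round trip
print(fixed)             # 中文网页
```

Since `response.content` already holds the raw bytes, `response.content.decode('gbk', 'ignore')` reaches the same result in one step, which is what the `urlBS` function in the article-scraping code does.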
VI. Practice (scraping a novel)
1. Code:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.biquw.com/book/19877/')
# Recover the raw bytes and decode them as UTF-8 to avoid garbled text
response2 = bytes(response.text, response.encoding).decode("utf-8", "ignore")
# print(response2)
# Create the parser for the chapter index page
soup = BeautifulSoup(response2, 'lxml')
data_list = soup.find('ul')
for book in data_list.find_all('a'):
    book_url = 'http://www.biquw.com/book/19877/' + book['href']
    print('{}:{}'.format(book.text, book_url))
    data_book = requests.get(book_url).text
    soup = BeautifulSoup(data_book, 'lxml')  # parse the chapter page
    data = soup.find('div', {'id': 'htmlContent'})  # chapter body, found by inspecting the page
    data2 = bytes(data.text, response.encoding).decode("utf-8", "ignore")
    print(data2)
    # File output
    # Option 1: append every chapter to the same txt file
    with open('book2.txt', 'a', encoding='utf-8') as file:
        file.write(data2)
    # Option 2: write each chapter to its own txt file
    # with open(book.text + '.txt', 'a', encoding='utf-8') as f:
    #     f.write(data2)
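The chapter URLs above are built by string concatenation, which works as long as every `href` is a bare file name relative to the book page. If a site uses absolute paths or full URLs instead, `urllib.parse.urljoin` from the standard library handles all three cases. The `href` values below are made-up examples, not taken from the actual site.

```python
from urllib.parse import urljoin

base_url = 'http://www.biquw.com/book/19877/'

# Hypothetical href values in the three common forms
hrefs = [
    '1001.html',                                   # relative to the current page
    '/book/19877/1002.html',                       # absolute path on the same host
    'http://www.biquw.com/book/19877/1003.html',   # already a full URL
]

chapter_urls = [urljoin(base_url, href) for href in hrefs]
for url in chapter_urls:
    print(url)
```

All three forms resolve to a URL under `http://www.biquw.com/book/19877/`, so the loop body could use `urljoin(base_url, book['href'])` in place of the `+` concatenation.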
2. Result:
3. Notes