Web Scraping
1. Import the modules

```python
import requests
from lxml import etree
```
2. Find the URL to scrape

```python
# Declare the URL variable
url = "https://www.12zw.com/2/2671/"  # biquge URL for the novel "成神"
# Add a browser User-Agent header to mimic a real browser
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3620.400"}
# Note: requests(url, header) is not callable; use requests.get and pass headers by keyword
result = requests.get(url, headers=header)
resultHtml = etree.HTML(result.text)  # auto-completes missing HTML tags
context = resultHtml.xpath("//div[@id='context']")
```
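The comment above notes that `etree.HTML` auto-completes missing tags. A minimal self-contained sketch of that behavior (the fragment below is made up for illustration):

```python
from lxml import etree

# A deliberately broken fragment: unclosed <p> tags, no <html>/<body> wrapper.
fragment = "<div id='context'><p>first line<p>second line</div>"
tree = etree.HTML(fragment)  # the parser wraps it in <html><body>...</body></html>
texts = tree.xpath("//div[@id='context']/p/text()")
print(texts)  # ['first line', 'second line']
```

Because the parser is this forgiving, real-world pages with sloppy markup can still be queried with xpath.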
Web Page Analysis
1. Using xpath
| Expression | Meaning |
| --- | --- |
| `/` | select the direct children of the current node |
| `//` | select all descendant nodes of the current node |
| `.` | select the current node |
| `..` | select the parent of the current node |
| `@` | select an attribute |
- Getting text: `/text()`, e.g. `result = html.xpath('//li[@class="item-0"]/text()')`
- Getting an attribute: `/@id`, e.g. `result = html.xpath('//li[@class="item-0"]/@id')`
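The selectors above can be tried end to end on a small snippet. The HTML below is a hypothetical document written to mirror the `li[@class="item-0"]` examples:

```python
from lxml import etree

# Hypothetical document mirroring the examples above.
html_doc = """
<ul>
  <li class="item-0" id="first"><a href="link1.html">one</a></li>
  <li class="item-1"><a href="link2.html">two</a></li>
</ul>
"""
tree = etree.HTML(html_doc)
print(tree.xpath('//li[@class="item-0"]/a/text()'))  # ['one']
print(tree.xpath('//li[@class="item-0"]/@id'))       # ['first']
print(tree.xpath('//a/@href'))                       # ['link1.html', 'link2.html']
```

Note that xpath always returns a list, even for a single match, so results are usually indexed or iterated.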
Example: scraping the web novel "仙王的日常生活" (The Daily Life of the Immortal King)
```python
import requests
from lxml import etree

url = "https://www.biqukan.com/38_38269/"
header = {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"
}
# The second positional argument of requests.get is params, so headers must be passed by keyword
result = requests.get(url, headers=header)
result.encoding = "gb2312"
html = result.text
htmlContext = etree.HTML(html)
nameList = htmlContext.xpath("//div[@class='listmain']//dd/a/text()")  # chapter titles
hrefList = htmlContext.xpath("//div[@class='listmain']//dd/a/@href")   # chapter links
print(hrefList)
url2 = "https://www.biqukan.com"
for i in range(len(hrefList)):
    context = requests.get(url2 + hrefList[i], headers=header)
    context.encoding = "gb2312"
    context = context.text
    contextHtml = etree.HTML(context)
    con = contextHtml.xpath("//div[@id='content']//text()")  # chapter body
    contitle = nameList[i]  # chapter title
    con.insert(0, contitle + "\n")
    file = open("仙王的日常生活.txt", "a", encoding="utf-8")
    file.writelines(con)
    file.close()
    print(contitle)
```
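One fragile spot in the loop above is `url2 + hrefList[i]`: plain string concatenation only works while every `href` is a site-absolute path. The standard library's `urllib.parse.urljoin` resolves both absolute and relative links against a base URL; a small sketch (the chapter path below is a made-up example, not a real link from the site):

```python
from urllib.parse import urljoin

base = "https://www.biqukan.com/38_38269/"
# A site-absolute path, like the hrefs in the chapter list:
print(urljoin(base, "/38_38269/286945.html"))  # https://www.biqukan.com/38_38269/286945.html
# A relative path resolves against the base URL too:
print(urljoin(base, "286945.html"))            # https://www.biqukan.com/38_38269/286945.html
```

Swapping `url2 + hrefList[i]` for `urljoin(url2, hrefList[i])` would keep the loop working even if the site changed its link format.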