Python Web Crawler

Latest recommended article published 2024-03-29 22:05:02

sunshine0625

Category: python

Article link: https://blog.csdn.net/u012680593/article/details/53818792

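This article describes how to write a crawler in Python 3.5 that scrapes jokes from Qiushibaike. It covers downloading the page and parsing it with XPath, along with the key points to watch while crawling: setting request headers, creating an opener, handling cookies, and catching exceptions.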

Summary generated automatically by CSDN.

Crawler: scraping jokes from Qiushibaike (Python 3.5 environment):

1. Download the page

2. Parse it (using XPath)
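
The two steps above can be sketched in a few lines using only the standard library. This is a minimal sketch, not the article's own listing: the function names, the User-Agent value, and the `<div class="content">` markup are all hypothetical, and `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for `lxml` so the example runs without extra packages. With `lxml`, the parsing step would typically go through `etree.HTML(page).xpath(...)` instead.

```python
import http.cookiejar
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET


def build_opener_with_headers():
    """Create an opener that keeps cookies and sends a browser-like header."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # addheaders must be a list of (name, value) tuples, not a dict.
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0 (compatible; example-crawler)"),  # assumed value
    ]
    return opener


def download(opener, url, encoding="utf-8"):
    """Step 1: return the decoded page text, or None if the request fails."""
    try:
        with opener.open(url, timeout=10) as response:
            return response.read().decode(encoding)
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print("download failed:", exc.reason)
        return None


def parse(page):
    """Step 2: collect the text of every <div class="content"> (hypothetical markup)."""
    root = ET.fromstring(page)
    return [(div.text or "").strip()
            for div in root.findall(".//div[@class='content']")]
```

Returning `None` from `download` on failure lets the caller skip a bad page and move on to the next one instead of stopping the whole crawl.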

# -*-coding:utf-8 -*-
import urllib.request
import sys
import io
from lxml import etree
from urllib.parse import urljoin
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030') # change the default encoding of standard output

def download(originer_url,p):
    url=str(originer_url)+str(p)
    print(url)
    print (p)
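    # Add request headers
    headers={'User-Agent':r'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)','Connection':'keep-alive'}
    # Create the opener
    opener=urllib.request.build_opener()
    # addheaders expects a list of (name, value) tuples, not a list holding a dict
    opener.addheaders=list(headers.items())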
    try:
        page=opener.open(str(url)).read().dec