Python Web Crawling, Part 1

狗瑶宝贝蛋

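1. Parsing methods

JSON parsing
Direct processing
Regular expressions
BeautifulSoup
PyQuery
XPath
The source a crawler receives often differs from what the browser shows, because much of the page is generated by JavaScript after it loads.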
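
As a rough illustration of two of the parsing methods listed above, here is a minimal sketch using only the standard library. The JSON string and the HTML fragment are made-up stand-ins for a real response body, not data from any actual site.

```python
import json
import re

# Hypothetical response body, standing in for what an API might return.
raw_json = '{"name": "Germey", "age": 22}'
record = json.loads(raw_json)  # JSON parsing: text -> Python dict
print(record["name"], record["age"])

# Hypothetical HTML fragment; a regular expression pulls out the link targets.
html = '<a href="/page/1">one</a> <a href="/page/2">two</a>'
links = re.findall(r'href="([^"]+)"', html)
print(links)
```

Regular expressions are fine for small, predictable fragments like this one; for whole pages, a real parser such as BeautifulSoup, PyQuery or XPath is usually less fragile.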

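urllib in detail
urllib is Python's built-in HTTP request library:
urllib.request: request module
urllib.error: exception handling module
urllib.parse: URL parsing module
urllib.robotparser: parses robots.txt to determine which pages may be crawled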
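
To give a feel for the two helper modules, the sketch below splits and builds URLs with urllib.parse and checks crawl permission with urllib.robotparser. The robots.txt rules and the example.com URLs are assumptions made up for the example, and the rules are fed in directly so no network access is needed.

```python
from urllib import parse, robotparser

# urllib.parse: split a URL into components and build a query string.
parts = parse.urlparse('http://httpbin.org/get?name=germey&age=22')
print(parts.netloc, parts.path, parts.query)
query = parse.urlencode({'name': 'germey', 'age': 22})
print(query)

# urllib.robotparser: hypothetical robots.txt rules, parsed from a list of lines.
rules = ['User-agent: *', 'Disallow: /private/']
rp = robotparser.RobotFileParser()
rp.parse(rules)
allowed_index = rp.can_fetch('*', 'http://example.com/index.html')
allowed_private = rp.can_fetch('*', 'http://example.com/private/data')
print(allowed_index, allowed_private)
```

For a real site you would normally call set_url() with the site's robots.txt address and then read() instead of passing the rules by hand.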

import urllib.request

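urllib.request.urlopen
Parameter: url. It returns a response object; calling read() on it gives the page source as bytes.
Step 1: fetch the source of the Baidu home page with a GET request.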

import urllib.request
response=urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

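POST requests: the kind used to submit form data such as a password.
Pass the data argument to send the request as POST; without it the request is sent as GET.

import urllib.parse
import urllib.request
# urlencode needs a dict of key/value pairs; urlopen needs the result as bytes
data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())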
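
To make the GET/POST distinction concrete, the sketch below builds Request objects without sending them and asks urllib which method it would use. The URL and form field mirror the example above; nothing goes over the network.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'

# urlencode turns a dict into a form-encoded string; data must be bytes.
data = urllib.parse.urlencode({'world': 'hello'}).encode('utf-8')
print(data)

# Without data the request defaults to GET; with data it defaults to POST.
get_req = urllib.request.Request(url)
post_req = urllib.request.Request(url, data=data)
print(get_req.get_method())
print(post_req.get_method())
```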

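The timeout parameter
timeout sets how many seconds to wait for a response; if it is exceeded, an exception is raised.

import urllib.request
# GET request with a 1-second timeout; httpbin's /get endpoint accepts GET
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())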
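
When the timeout is exceeded the exception needs to be caught, otherwise the crawler stops. Below is one possible sketch of that handling. To keep it runnable without relying on an external site, it starts a deliberately slow local server; SlowHandler and fetch are hypothetical names invented for this example, and the delay and timeout values are arbitrary.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers too slowly on purpose."""

    def do_GET(self):
        time.sleep(0.5)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'late')
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch(url, timeout):
    """Return the body, or 'TIME OUT' if the server is too slow."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting arrive wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            return 'TIME OUT'
        raise
    except socket.timeout:
        # Timeouts while waiting for the response are raised directly.
        return 'TIME OUT'


# Port 0 lets the OS pick a free port for the local test server.
server = HTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

result = fetch(url, timeout=0.1)
print(result)
server.shutdown()
```

Against a real site only the fetch function is needed; the local server is there just to trigger the timeout reliably.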

Reading the status code and response headers

import urllib.request
response = urllib.request.urlopen('http://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

A clearer structure: build a Request object that spells out both the headers and the data.

from urllib import request,parse
url='http://httpbin.org/post'
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Host':'httpbin.org'
}
form_data = {
    'name':'Germey'
}
data = bytes(parse.urlencode(form_data),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))

Setting a proxy with a handler: handlers take care of extra work for us, such as routing requests through a proxy.
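
A minimal sketch of what a proxy handler setup might look like with urllib. The proxy address 127.0.0.1:9743 is a made-up placeholder and has to be replaced with a proxy that is actually running; the request itself is left commented out for that reason.

```python
import urllib.request

# Hypothetical local proxy address; replace with a proxy you actually run.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})

# build_opener returns an opener that routes requests through the handler.
opener = urllib.request.build_opener(proxy_handler)

# opener.open() is used in place of urlopen(); it only succeeds when the
# proxy above is reachable, so it is not executed here.
# response = opener.open('http://httpbin.org/get')
# print(response.read().decode('utf-8'))
```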

Using the requests library
Passing parameters with requests.get

import requests
data = {
    'age': '22',
    'name': 'germey'
}
# params is appended to the URL as a query string
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)
