Most of the code below was typed by hand and has not been fully verified; there may be minor errors such as typos.
1. One type and six methods
import urllib.request
url = "http://www.baidu.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3877.400 QQBrowser/10.8.4506.400"
}  # defined here for reference; urlopen(url) below does not use it
response = urllib.request.urlopen(url)
# response is an http.client.HTTPResponse object -- the "one type"
# note: read()/readline()/readlines() consume the stream, so use only one of them per response
# read the whole body as bytes
content = response.read()
# read the given number of bytes
content = response.read(6)
# read a single line
content = response.readline()
# read all lines into a list
content = response.readlines()
# return the status code (200 means OK)
content = response.getcode()
# return the URL that was actually requested
content = response.geturl()
# return the response headers as a list of (name, value) pairs
content = response.getheaders()
2. Downloading files (images, videos)
import urllib.request
url = "URL of the image or video"
urllib.request.urlretrieve(url, "local file name with extension")
# image extensions: .png .jpg
# video extension: .mp4
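urlretrieve also accepts file:// URLs, which makes it possible to try the call offline. A minimal sketch (the file names here are made up for the demo):

```python
import os
import tempfile
import urllib.request

# create a small local file to stand in for the remote resource
src = os.path.join(tempfile.gettempdir(), "urlretrieve_demo_src.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("hello")

# turn the local path into a file:// URL and download it to a second file
url = "file://" + urllib.request.pathname2url(src)
dest = os.path.join(tempfile.gettempdir(), "urlretrieve_demo_copy.txt")
urllib.request.urlretrieve(url, dest)

with open(dest, encoding="utf-8") as f:
    print(f.read())  # hello
```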
3. urllib.parse: the quote() and urlencode() methods
For GET requests with only a few parameters, use urllib.parse.quote().
It percent-encodes Chinese characters into a form the URL can carry.
import urllib.request
import urllib.parse
url = 'https://www.baidu.com/s?wd='
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}
# percent-encode the keyword and append it to the URL first
name = urllib.parse.quote("周杰伦")
url = url + name
# only then build the request object, so it carries the finished URL
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
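quote() can be tried without any network access: it turns the UTF-8 bytes of each character into %XX escapes, and unquote() reverses the process.

```python
import urllib.parse

# each Chinese character becomes three %XX escapes (its UTF-8 bytes)
encoded = urllib.parse.quote("周杰伦")
print(encoded)  # %E5%91%A8%E6%9D%B0%E4%BC%A6

# unquote() is the inverse
print(urllib.parse.unquote(encoded))  # 周杰伦
```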
# for multiple parameters, use urllib.parse.urlencode
import urllib.parse
import urllib.request
url = "http://www.baidu.com/s?"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
data = {
    'wd': '周杰伦',
    'sex': '男'
}
# urlencode joins the pairs as key=value&key=value, percent-encoding as needed
a = urllib.parse.urlencode(data)
url = url + a
# print(url)
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
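urlencode() itself is also easy to check offline: it percent-encodes each value and joins the pairs with &, and parse_qs parses such a string back.

```python
import urllib.parse

data = {'wd': '周杰伦', 'sex': '男'}
qs = urllib.parse.urlencode(data)
print(qs)  # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7

# parse_qs goes the other way, back to a dict of value lists
print(urllib.parse.parse_qs(qs))  # {'wd': ['周杰伦'], 'sex': ['男']}
```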
4. POST request: Baidu Translate's detailed translation
import urllib.request
import urllib.parse
url="https://fanyi.baidu.com/v2transapi?from=en&to=zh"
headers={
'Accept': '*/*',
# Accept-Encoding is left commented out: a gzip-compressed response could not be decoded with utf-8 directly
'Accept-Language': 'zh-CN,zh;q=0.9',
'Connection': 'keep-alive',
'Content-Length': '116',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': 'PSTM=1631544540; BIDUPSID=B74ACF8D3BD1D161547588A33D619C78; __yjs_duid=1_efef6d99e93049494baaea9a626323521631597478758; BDUSS_BFESS=ElkMFRONHJ6b0RFYkRHfjF6eTBWV2k3cS04WUVPZVZJd081UHNpY3VKanItcDFoRUFBQUFBJCQAAAAAAAAAAAEAAABaaQK9AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOttdmHrbXZhM; BAIDUID_BFESS=7E5DC3BDB9905D0DCED9DB249F08E575:FG=1; BAIDUID=CBCAB115F7BD0D37F327A6A010530209:FG=1; BDRCVFR[S_ukKV6dOkf]=mk3SLVN4HKm; H_PS_PSSID=34947_34067_31253_34712_34600_34584_34517_34706_34916_26350_34760_34827_34868; delPer=0; PSINO=6; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; BA_HECTOR=20200k8085850galu01gnq90n0r; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1635582592,1635591192; ab_sr=1.0.1_M2QyYzY2ZjVkYmVmNWM0M2Y2MTkzODE2ZTAwOWQ5MmFiODZlY2MxZDhkZWVhNmQzNGE0YzM5NzY5ZTY5MzVjZjE3N2IwZDZhMzkwYzhhYWEyMzhkMmEyN2I2MmMyYzhhYmM5OGQ2ZDQyNmQxMDk5MTk1ZDA4MGJiY2ZmNjEzZmM2ODY5MThhZDk2NTA1NWMwMWU1MDQ1ZjRhMDBmZTk4OQ==; __yjs_st=2_ZWE4ZWU1ZGZlMmYwNjFmZTc2MTRkNjMwMGVlZjU5Y2I0ZDM0NGYxNzViOWFhODBkMGE4ZTcyNGRmN2Y4MTM1MjQ1ODY5ZDFkMzYzYzA5ZGU4YmVkZGYzNzE4YTU2MmRhMDViYjM0MjNjMjFiYTI5NzE3OWI0NmMzZWMwZjRlMGM4Y2M5NjFiYTJmZjI4NTk5MzBjZjllMTNiNDI3NzlkNGY3MGJmNjVhZmQ4MzE3OWU3MzI3YjdkYmJlN2IxZjQyMjMxMzA5OWJlNGRkYTY0OWFiYTMwYzA5N2U1NjNmMTJmNmQzM2M5N2M2ZTRlZGEwZWRlMGU2N2MzOTFmZmVlOF83XzAzMmE0MzU3; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1635591388',
'Host': 'fanyi.baidu.com',
'Origin': 'https://fanyi.baidu.com',
'Referer':' https://fanyi.baidu.com/?aldtype=16047',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 Core/1.77.54.400 QQBrowser/10.9.4520.400',
'X-Requested-With': 'XMLHttpRequest'
}
data={
'from': 'en',
'to': 'zh',
'query': 'love',
'simple_means_flag': '3',
'sign': '198772.518981',
'token': 'ce70ebe3cb3e4b29e681869508f71909',
'domain': 'common'
}
# for a POST request the form data must be URL-encoded and then converted to bytes with encode()
data= urllib.parse.urlencode(data).encode("utf-8")
request= urllib.request.Request(url=url,data=data,headers=headers)
response= urllib.request.urlopen(request)
content= response.read().decode("utf-8")
print(content)
import json
obj= json.loads(content)
print(obj)
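The encode step matters because Request sends data as the request body, and attaching a body is also what switches the request method to POST. This can be seen without hitting the network (example.com here is just a placeholder):

```python
import urllib.parse
import urllib.request

data = {'from': 'en', 'to': 'zh', 'query': 'love'}
body = urllib.parse.urlencode(data).encode("utf-8")
print(type(body))  # <class 'bytes'>

# a Request with data set reports POST as its method; nothing is sent yet
req = urllib.request.Request("https://example.com", data=body)
print(req.get_method())  # POST
```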
5. AJAX GET request
# scrape page 1 of the action category of Douban Movie
import urllib.request
url="https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20"
headers={
'User-Agent':' Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 Core/1.77.54.400 QQBrowser/10.9.4520.400'
}
request=urllib.request.Request(url=url,headers=headers)
response= urllib.request.urlopen(request)
content = response.read().decode("utf-8")
with open("doubanp1.json", "w", encoding="utf-8") as fp:
    fp.write(content)
6. AJAX GET: download the first ten pages of Douban's action movie chart
# AJAX GET: fetch the first 10 pages of the action category of Douban Movie
# the thing to pin down is how the url changes from page to page
# page 1: https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&
#         start=0&limit=20
# page 2: https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&
#         start=20&limit=20
# page:  1   2   3
# start: 0   20  40
# so start = (page - 1) * 20
import urllib.request
import urllib.parse

def get_request(page):
    url = "https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    new_url = urllib.parse.urlencode(data)
    url = url + new_url
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    return content

def get_down(page, content):
    with open("douban_" + str(page) + ".json", "w", encoding="utf-8") as fp:
        fp.write(content)

# program entry point
if __name__ == "__main__":
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    for page in range(start_page, end_page + 1):
        # build the request object for each page
        request = get_request(page)
        content = get_content(request)
        get_down(page, content)
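The start = (page - 1) * 20 mapping worked out above can be sanity-checked on its own:

```python
def start_for(page):
    # same formula the Douban pager uses: 20 items per page, zero-based offset
    return (page - 1) * 20

print([start_for(p) for p in (1, 2, 3)])  # [0, 20, 40]
```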
7. AJAX POST: KFC restaurant lookup
# every page has the same url; what changes is pageIndex in the POST data,
# and a little probing shows pageIndex equals the page number
import urllib.request
import urllib.parse

def get_request(page):
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
    # this is a POST request, so the parameters go in data
    data = {
        'cname': '重庆',
        'pid': '',
        'pageIndex': page,
        'pageSize': 10
    }
    # POST data must be URL-encoded and converted to bytes with encode("utf-8")
    data = urllib.parse.urlencode(data).encode("utf-8")
    # build the request object
    request = urllib.request.Request(url=url, data=data, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    return content

def get_down(content, page):
    with open("kfc_" + str(page) + ".json", "w", encoding="utf-8") as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    for page in range(start_page, end_page + 1):
        request = get_request(page)
        content = get_content(request)
        get_down(content, page)
8. Exceptions
# A full URL is made of these parts: scheme://host:port/path?query#hash
# scheme: the protocol; common ones are http, https, ftp, mailto, etc.
# host: the host domain name or IP address.
# port: the port number, optional; when omitted the protocol's default is used (http defaults to 80).
# path: zero or more segments separated by "/", usually naming a directory or file on the host.
# query: optional; carries parameters, multiple parameters separated by "&", each name and value joined by "=".
# hash: the fragment string, also called the anchor; names a fragment within the resource.
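The URL parts listed above map directly onto the fields of urllib.parse.urlparse (the URL below is just an illustrative example):

```python
from urllib.parse import urlparse

u = urlparse("https://movie.douban.com:443/j/chart/top_list?type=5&limit=20#top")
print(u.scheme)    # https
print(u.hostname)  # movie.douban.com
print(u.port)      # 443
print(u.path)      # /j/chart/top_list
print(u.query)     # type=5&limit=20
print(u.fragment)  # top
```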
There are two exception classes: HTTPError and URLError.
HTTPError is a subclass of URLError.
import urllib.request
import urllib.error
# this URL is deliberately malformed ("netcuowu") so that an exception is raised
url = "https://blog.csdn.netcuowu/qq_40608132/article/details/120958953"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}
try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    print(content)
except urllib.error.HTTPError:
    print("HTTPError")
except urllib.error.URLError:
    print("URLError")
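Because HTTPError is a subclass of URLError, the except urllib.error.HTTPError branch must come first; the other order would let URLError swallow everything. The relationship can be verified directly:

```python
import urllib.error

# HTTPError derives from URLError, so catch the more specific one first
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
# the reverse does not hold
print(issubclass(urllib.error.URLError, urllib.error.HTTPError))  # False
```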
9. Cookies
Used when data collection needs to get past a login.
# visiting a QQ Zone page directly redirects to the login page; sending the cookie skips it
# (provided you have already logged in once in the browser)
import urllib.request
url="https://user.qzone.qq.com/1963855603"
headers={
'User-Agent':' Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
'cookie':'pgv_pvid=2618165221; RK=vvo17PPrMU; ptcz=399a0338e92e03f3be52f71ba6ef83054b5ac12778301a2791064cc1c26037c1; eas_sid=A1n6W391k994c3r4X2p761P6E6; tvfe_boss_uuid=a2156721322910a8; ptui_loginuin=3430763371; qz_screen=1549x872; QZ_FE_WEBP_SUPPORT=1; cpu_performance_v8=3; uin_cookie=o3430763371; ied_qq=o3430763371; o_cookie=3430763371; pac_uid=1_3430763371; _qpsvr_localtk=0.9606942087477905; uin=o1963855603; skey=@DKXo4hYDY; p_uin=o1963855603; pt4_token=zTgSVZkBhKXiw4teH8nl3W5M-swrQrRdNHRZ5pLcLYY_; p_skey=O8*6kH9t6fw2ZD8E3K*YSIOUWVJRiUJhBiNeI3RmL2s_; Loading=Yes; 1963855603_todaycount=2; 1963855603_totalcount=22681; pgv_info=ssid=s324147360if-modified-since: Sat, 30 Oct 2021 12:32:47 GMT'
}
request= urllib.request.Request(url=url,headers=headers)
response=urllib.request.urlopen(request)
content= response.read().decode("utf-8")
print(content)
10. Handlers and openers
import urllib.request
url="https://www.baidu.com"
headers={
'User-Agent':' Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}
#build the request object
request = urllib.request.Request(url=url, headers=headers)
#the pattern: handler -> build_opener -> open
#get a handler object
handler = urllib.request.HTTPHandler()
#get an opener object
opener = urllib.request.build_opener(handler)
#call the open method
response = opener.open(request)
content = response.read().decode("utf-8")
print(content)
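The handler -> build_opener -> open chain can be constructed without sending anything; only opener.open() actually touches the network. A quick offline check of the first two steps:

```python
import urllib.request

handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)
# build_opener returns an OpenerDirector that routes requests through the handler
print(type(opener).__name__)  # OpenerDirector
```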
11. Proxies
1)
import urllib.request
url="https://www.baidu.com/s?wd=ip"
headers={
'User-Agent':' Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 Core/1.77.54.400 QQBrowser/10.9.4520.400'
}
#build the request object
request = urllib.request.Request(url=url, headers=headers)
#the pattern: handler -> build_opener -> open
proxies = {
    'http': '121.232.194.143:9000'  # format: 'http': 'ip:port'
}
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode("utf-8")
with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
2) Proxy pool
import urllib.request
import random
url="https://www.baidu.com/s?wd=ip"
headers={
'User-Agent':' Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 Core/1.77.54.400 QQBrowser/10.9.4520.400'
}
#build the request object
request = urllib.request.Request(url=url, headers=headers)
proxies_list = [
    {'http': '121.232.194.143:9000'},
    {'http': '183.195.106.118:8118'}
]
#use random.choice to pick a random proxy from the pool
proxies = random.choice(proxies_list)
#the pattern: handler -> build_opener -> open
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode("utf-8")
with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
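The pool-plus-random.choice part can be exercised offline: whichever proxy is drawn, a ProxyHandler and opener can be built from it without contacting the proxy. (The addresses are the sample ones from above and are not expected to still work.)

```python
import random
import urllib.request

proxies_list = [
    {'http': '121.232.194.143:9000'},
    {'http': '183.195.106.118:8118'},
]
# pick one proxy at random from the pool
proxy = random.choice(proxies_list)
print(proxy in proxies_list)  # True

# building the handler and opener does not contact the proxy yet
handler = urllib.request.ProxyHandler(proxies=proxy)
opener = urllib.request.build_opener(handler)
print(type(opener).__name__)  # OpenerDirector
```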