Introduction to Python Web Crawlers - CSDN Blog

Article link: https://blog.csdn.net/u013000310/article/details/100545543

1: Import the requests module

import requests

2: Decide whether the request is GET or POST, and build the request parameters

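response = requests.get("https://www.baidu.com")  # GET request

# POST request: url (a string), data and headers must be defined beforehand
response = requests.post(url=url, data=data, headers=headers)

3: Receive and parse the response

print(response.text)
content = json.loads(response.text)  # requires import json; works only if the body is JSON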
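
json.loads raises an error when the body is not JSON, which is the case for an HTML page such as the Baidu home page requested above. Below is a minimal sketch of a guard around that call. The helper name parse_json_body is hypothetical and not part of requests or json.

```python
import json


def parse_json_body(text):
    """Return the decoded JSON value, or None when text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The body was something else (for example an HTML page)
        return None


print(parse_json_body('{"status": "ok", "count": 2}'))
print(parse_json_body("<html><body>not JSON</body></html>"))
```

With a real response, the helper would be called on the response's .text attribute. Note that a body containing only the JSON literal null also yields None, so this sketch cannot tell the two cases apart.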

4: Parse the page with Beautiful Soup

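Beautiful Soup is a Python library whose main purpose is extracting data from web pages.

Beautiful Soup is now distributed as the bs4 package, so it has to be installed (pip install beautifulsoup4) before it can be imported. The lxml library must be installed as well (pip install lxml) to use the lxml parser.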

import requests
from bs4 import BeautifulSoup
url = "http://www.cntour.cn/"
response = requests.get(url)
# Parse with the lxml parser. Parsing turns the complex HTML document into a tree in which every node is a Python object.
soup = BeautifulSoup(response.text, 'lxml')
data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.leftBox > div:nth-child(2) > ul > li:nth-child(2) > a')
print(data)
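
The selector above is tied to one specific page layout, so it is hard to check in isolation. As a minimal sketch of the same select call, the example below parses a small inline HTML string instead of a live page. The sample markup, the extract_links helper, and the choice of the built-in html.parser (so that lxml is not required) are assumptions for illustration and are not taken from cntour.cn.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
SAMPLE_HTML = """
<div id="main">
  <ul class="news">
    <li><a href="/news/1.html">First headline</a></li>
    <li><a href="/news/2.html">Second headline</a></li>
  </ul>
</div>
"""


def extract_links(html):
    """Return a list of {"title", "href"} dicts for each link in the news list."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # select() takes a CSS selector and returns a list of matching tags
    for a in soup.select("#main > ul.news > li > a"):
        links.append({"title": a.get_text(strip=True), "href": a.get("href")})
    return links


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link["title"], link["href"])
```

select always returns a list, so an empty list means the selector matched nothing. select_one returns only the first match, or None if there is none.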

5: Dealing with anti-crawler measures

5.1 Build the request headers:

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
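
The dictionary only has an effect once it is attached to a request. Here is a minimal sketch, assuming a requests.Session is used so that every request shares the same header. The fetch helper and the 10-second timeout are illustrative choices, not from the original.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}

# A Session stores the headers and sends them with every request it makes
session = requests.Session()
session.headers.update(headers)


def fetch(url):
    """GET url with the browser-like User-Agent; raise on 4xx/5xx responses."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

For a single request, requests.get(url, headers=headers) has the same effect.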

5.2 Control the request rate

Add a delay between requests, for example time.sleep(3) (requires import time).
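
The delay can be sketched as a small loop that pauses between consecutive requests. The function name crawl_politely, the random jitter, and the injectable fetch and sleep parameters are assumptions added for illustration. Only the 3-second base delay comes from the text above.

```python
import random
import time


def crawl_politely(urls, fetch, delay=3.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait `delay` seconds plus a random extra of up to `jitter` seconds
            sleep(delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results
```

In a real crawler, fetch could be something like lambda u: requests.get(u, timeout=10). Passing sleep as a parameter makes it possible to try the loop without actually waiting.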

Build your own pool of proxy IPs

proxies={
    "http":"http://10.10.1.10:3128",
    "https":"http://10.10.1.10:1080",
}
response = requests.get(url, proxies=proxies)
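
The snippet above uses one fixed proxy. A pool can be sketched as a list from which one address is picked per request. The pool contents beyond the first address and the pick_proxies helper are hypothetical placeholders.

```python
import random

# Hypothetical pool; replace with proxies you actually control
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]


def pick_proxies(pool):
    """Pick one proxy at random and return it in the dict format requests expects."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}


proxies = pick_proxies(PROXY_POOL)
print(proxies)
```

The returned dictionary can be passed as requests.get(url, proxies=proxies), and calling pick_proxies again before each request rotates through the pool.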

Open problems

I still have trouble locating and extracting a specific element with Beautiful Soup; this needs further study.

Reference link:

http://c.biancheng.net/view/2011.html