Python Web Scraping Notes
Development Environment
- Interpreter version: Python 3.6
- 32-bit installer: python-3.6.2.exe
- 64-bit installer: python-3.6.5.exe
- Tools: PyCharm, Jupyter Notebook
- Browser: latest version of Google Chrome
Installation Steps
- Python 3.6
- PyCharm
- Jupyter
Installing Libraries
- requests -> install with: pip install requests
- beautifulsoup4 -> install with: pip install beautifulsoup4
- html5lib -> install with: pip install html5lib
- lxml -> install with: pip install lxml
The requests Library
- Import: import requests
- HTTP request methods: get(), post(), put(), delete(), head(), options()
- r = requests.get('https://api.github.com/events')
- r = requests.post('http://httpbin.org/post', data={'key': 'value'})
- r = requests.put('http://httpbin.org/put', data={'key': 'value'})
- r = requests.delete('http://httpbin.org/delete')
- r = requests.head('http://httpbin.org/get')
- r = requests.options('http://httpbin.org/get')
Custom Request Headers
- url = 'https://api.github.com/some/endpoint'
- headers = {'user-agent': 'my-app/0.0.1'}
- r = requests.get(url, headers=headers)
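The headers that requests will actually send can be inspected without making any network call by preparing the request first. A small sketch (requests must be installed; the URL is the placeholder endpoint from above and is never contacted):

```python
import requests

# Build the request object but do not send it; prepare() applies the
# custom headers exactly as they would go out on the wire.
req = requests.Request('GET', 'https://api.github.com/some/endpoint',
                       headers={'user-agent': 'my-app/0.0.1'})
prepared = req.prepare()
print(prepared.headers['user-agent'])  # my-app/0.0.1
```

This is handy for debugging scrapers: you can verify the User-Agent before a site ever sees the request.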
Response Status Codes
r = requests.get('http://httpbin.org/get')
r.status_code
A successful request returns: 200
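To see status codes other than 200 without depending on an external site, the sketch below spins up a throwaway local http.server in a background thread and requests one path that exists and one that does not (assumes requests is installed; the `/nope` path is made up):

```python
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

import requests

# Serve the current directory on an ephemeral local port.
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

ok = requests.get(f"http://127.0.0.1:{port}/")        # directory listing exists
missing = requests.get(f"http://127.0.0.1:{port}/nope")  # no such file

print(ok.status_code)       # 200
print(missing.status_code)  # 404

server.shutdown()
```

In a real scraper you would branch on `r.status_code` (or call `r.raise_for_status()`) before trying to parse the body, as the sample script further below does.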
BeautifulSoup4
Introduction
Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the parse tree, and can save you hours or even days of work.
- Import: from bs4 import BeautifulSoup
- soup = BeautifulSoup(open("index.html"), 'html.parser')
- soup = BeautifulSoup("<html>data</html>", 'html.parser')
Object Types
Tag: a Tag object corresponds to a tag in the original XML or HTML document
- soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
- tag = soup.b
- type(tag)
- Output: <class 'bs4.element.Tag'>
String
- NavigableString: the text contained within a tag
- tag.string
- Output: 'Extremely bold'
- type(tag.string)
- Output: <class 'bs4.element.NavigableString'>
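The Tag and NavigableString types shown above can be checked with one short snippet. This sketch uses the stdlib 'html.parser' backend, so no extra parser library is needed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b  # the first <b> tag in the document

print(type(tag).__name__)         # Tag
print(tag.string)                 # Extremely bold
print(type(tag.string).__name__)  # NavigableString
```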
find_all and find
- find_all(): searches for every element matching the criteria and returns them as a list
- find(): returns only the first match, as a Tag object (or None if nothing matches)
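A minimal sketch contrasting the two (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">apple</li>
  <li class="item">banana</li>
  <li>cherry</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of every matching tag
items = soup.find_all('li', attrs={'class': 'item'})
print([str(li.string) for li in items])  # ['apple', 'banana']

# find() returns only the first match, or None when nothing matches
first = soup.find('li')
print(first.string)            # apple
print(soup.find('table'))      # None
```

The None return from find() is why real scrapers (like the script below) should guard lookups before using the result.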
A sample script is given below:
# -*- coding: utf-8 -*-
# Author: WX
# Date: 2018-10-30
import os

import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve


def get_two_page():
    # 1. Send the request
    # 2. Check the status code
    # 3. Get the content
    # 4. Parse the content with bs4
    # 5. Extract by rule: 1. name 2. birthday 3. height 4. measurements
    #    5. detailed profile ... 6. photos
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/68.0.3440.106 Safari/537.36'
    }
    response = requests.get(url=URL, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.content, 'html5lib')
        file = open("detail_page_data.txt", "w", encoding='utf-8')
        txt = ''
        # Profile fields live in table cells
        for table in soup.find_all('table'):
            for tr in table.find_all('tr'):
                td = tr.find('td')
                if td is not None:
                    name = td.next_element.next_element.string
                    txt += "Name: " + str(name) + "\n"
        # Detailed profile text
        div_info = soup.find('div', attrs={'class': 'infocontent'})
        if div_info is not None:
            txt += "Details: " + div_info.get_text(strip=True) + "\n"
        # Photos
        div_entry = soup.find('div', attrs={'class': 'post_entry'})
        if div_entry is not None:
            for li in div_entry.find_all('li'):
                img = li.find('img')
                if img is not None and img.get('src'):
                    img_path = img['src']
                    txt += "Image: " + img_path + "\n"
                    get_info(img_path)
        # Write out the results
        file.write(txt)
        # Close the file
        file.close()
        print("Scraping finished")
    else:
        print("Request failed with status", response.status_code)


def get_info(img_path):
    download_dir = 'download2Pic'
    # Create the download directory if it does not exist
    if not os.path.exists(download_dir):
        os.mkdir(download_dir)
    # The last path segment is the file name
    file_name = img_path.split('/')[-1]
    try:
        print(file_name + ".jpg downloading....................")
        urlretrieve(img_path, download_dir + '/' + file_name + '.jpg')
    except Exception as e:
        print("Download failed:", e)


if __name__ == '__main__':
    URL = "http://www.xiaohuar.com/p-1-1994.html"
    get_two_page()
2018/10/30 20:22:40