I. Scraping Taobao listings with IDLE (interactive)
*This mimics a human browsing the page and puts no real load on the server, so /robots.txt was not checked here.
For any large-scale crawl, always follow the site's /robots.txt.
1. Open the target site in a browser, right-click to view the page source, and locate where the content you want is stored.
2. Fetch the page with requests.
3. Extract the content with BeautifulSoup.
4. Save the extracted content.
'''Load the libraries'''
>>> import requests
>>> from bs4 import BeautifulSoup
>>> import json
'''Fetch the site with requests, confirm the request succeeded, and fix the encoding'''
>>> url = 'https://sf.taobao.com/item_list.htm?province=%B9%E3%B6%AB'
>>> html = requests.get(url)
>>> print(html.status_code)
200
>>> html.encoding = html.apparent_encoding
'''Filter out the needed data with BeautifulSoup and verify it'''
>>> soup = BeautifulSoup(html.text, 'html.parser')
>>> soup.head.text[:500]
"\n\n\n广东拍卖 - 司法拍卖 - 阿里拍卖_ 拍卖房产汽车车牌土地海关罚没等\n\n\n\n\n\n\r\n@font-face {\r\n font-family: 'iconfont-sf';\r\n src: url('//at.alicdn.com/t/font_1449481554_1450233.eot'); /* IE9*/\r\n src: url('//at.alicdn.com/t/font_1449481554_1450233.eot?#iefix') format('embedded-opentype'), /* IE6-IE8 */\r\n url('//at.alicdn.com/t/font_1449481554_1450233.woff') format('woff'), /* chrome、firefox */\r\n url('//at.alicdn.com/t/font_1449481554_1450233.ttf') format('truetype'), /* chrome、firefox、opera、Safari, Android, iOS 4.2+*/\r\n url"
>>> soup2 = soup.find('script',{'id':"sf-item-list-data"}).contents
>>> soup3 = json.loads(soup2[0])
>>> soup_data = soup3['data']
>>> ls=[['标的物','起拍价','围观人数']]  # lot title, starting price, viewer count
>>> for t in range(len(soup_data)):
	ls.append([soup_data[t]['title'],str(soup_data[t]['initialPrice']),str(soup_data[t]['viewerCount'])])
>>> ls[1][2]
'5753'
>>> ls[2]
['东莞坦博鞋业科技有限公司的存放于厚街镇三屯恒通科技工业园一楼内机器设备等财产一批', '32000.0', '1435']
'''Create a file and save the data in CSV format'''
>>> f = open('d:/python_测试/data.csv','x')
>>> for x in ls:
	for z in x:
		f.write(z+',')
	f.write('\n')
>>> f.close()
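Writing fields with `f.write(z + ',')` breaks as soon as a title itself contains a comma, and it leaves a trailing comma on every row. The standard-library `csv` module handles the quoting automatically; a minimal sketch (the file name `data.csv` and the sample rows here are illustrative, not real scraped data):

```python
import csv

# Rows in the same shape as ls above (sample values only)
rows = [['标的物', '起拍价', '围观人数'],
        ['A title, containing a comma', '32000.0', '1435']]

with open('data.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)  # commas inside fields get quoted
```

`utf-8-sig` adds a BOM so Excel on Windows detects the encoding correctly; plain `utf-8` is fine for most other tools.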
The saved file contains the rows of `ls` built above, one per line.
II. Scraping Taobao listings with IDLE (script file)
Three parts: 1. fetch the page content, 2. parse the content, 3. save the data.
Compared with the interactive session, the script version is modular and has basic error handling; crucially, it can be re-run, and with small changes it can be reused for other sites.
import requests
import os
import json
from bs4 import BeautifulSoup
def Geturl(url, num_retries=5):
    try:
        down = requests.get(url)
        html = down.text
        if down.status_code >= 400:
            html = None
        if num_retries and 500 <= down.status_code < 600:
            return Geturl(url, num_retries - 1)
    except requests.exceptions.RequestException as e:
        print('download error:', e)
        html = None
    return html
def Soup_data(html):
    soup = BeautifulSoup(html, 'html.parser')
    soup1 = soup.find('script', {'id': "sf-item-list-data"}).contents
    soup2 = json.loads(soup1[0])
    soup3 = soup2['data']
    ls = [['标的物', '起拍价', '围观人数']]  # lot title, starting price, viewer count
    for t in range(len(soup3)):
        ls.append([soup3[t]['title'], str(soup3[t]['initialPrice']), str(soup3[t]['viewerCount'])])
    return ls
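The listing page embeds its data as JSON inside a `<script id="sf-item-list-data">` tag, which is why Soup_data runs `json.loads` on the tag's contents. A self-contained miniature of that structure (the HTML and values below are made up, but the tag id and field names match those used above):

```python
import json
from bs4 import BeautifulSoup

html = '''<html><body>
<script id="sf-item-list-data">
{"data": [{"title": "demo lot", "initialPrice": 100.0, "viewerCount": 7}]}
</script>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('script', {'id': 'sf-item-list-data'})
payload = json.loads(tag.string)  # the tag's text is plain JSON
for item in payload['data']:
    print(item['title'], item['initialPrice'], item['viewerCount'])
```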
def Save(ls):
    add = 'D:/python_测试/data/'
    path = add + '测试数据.csv'
    try:
        if not os.path.exists(add):
            os.mkdir(add)
        with open(path, 'x') as f:
            for x in ls:
                for z in x:
                    f.write(z + ',')
                f.write('\n')
        print('数据保存成功。')  # "data saved successfully"
    except OSError:
        print('文件保存出错!')  # "error saving file"
        return None
def main():
    url = 'https://sf.taobao.com/item_list.htm?province=%B9%E3%B6%AB'
    Save(Soup_data(Geturl(url)))

main()
Result after running with F5; the saved data is identical to the interactive version:
==================== RESTART: C:/Users/xx/Desktop/1.py ====================
数据保存成功。
>>>
III. A minimal Scrapy crawl
1. Analyze the site's content
2. In cmd, run scrapy startproject <project name> to create the project
3. Run scrapy genspider <spider name> <start URL> to create the spider
4. Define the item model
5. Edit the spider to handle the returned response
6. Run the crawler and save the data
PS: Scrapy can do far more than this; its pipelines, middleware, and so on make it better suited to more complex, deeper, incremental crawls. For a single targeted crawl like this one, requests + BeautifulSoup is actually simpler and more convenient. That said, a small project like this is a good way to start learning the Scrapy framework.
'''Create the project'''
scrapy startproject <project name>   '''creates the project; the following files are generated in the current directory'''
|——(project folder)
|    |——— __init__.py      '''package marker'''
|    |——— items.py         '''item model definitions'''
|    |——— pipelines.py     '''pipeline definitions'''
|    |——— settings.py      '''project settings'''
|    |——— spiders          '''spiders folder'''
|    |       |——— __init__.py       '''package marker'''
|    |       |——— (spider name).py  '''spider code'''
|——scrapy.cfg              '''deploy/run configuration'''
'''Create the spider'''
scrapy genspider <spider name> <start URL>
'''Define the item model, i.e. edit items.py'''
import scrapy

class DemoItem(scrapy.Item):
    title = scrapy.Field()
    data = scrapy.Field()
'''Edit the spider to handle the returned response, i.e. edit (spider name).py'''
import scrapy
from ..items import DemoItem
from bs4 import BeautifulSoup

class DomeWebSpider(scrapy.Spider):
    name = 'dome_web'
    allowed_domains = ['python123.io']  # domain names only, not full URLs
    start_urls = ['https://python123.io/ws/demo.html']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        item = DemoItem()
        item['title'] = soup.find('p', {'class': 'title'}).text
        item['data'] = soup.find('p', {'class': 'course'}).text
        return item
'''Run the crawler and save the output'''
scrapy crawl dome_web -o date.json
# spider name: dome_web; the data is saved to date.json
URL crawled: https://python123.io/ws/demo.html
Scraped data:
[
{"title": "The demo python introduces several python courses.", "data": "Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\nBasic Python and Advanced Python."}
]