Web Scraping with Python (5): Storing Data

This chapter introduces three main approaches to managing data, suitable for the vast majority of applications. If you plan to build the back end of a website or create your own API, you will probably need your scrapers to write to a database. If you need a quick, simple way to collect documents off the web and save them to your hard drive, you will probably want to create a file stream to do it. And if you need alerts about occasional events, or want to collect aggregated data at a set time every day, just send yourself an email!

5.1 Media Files

Drawbacks of storing only the URL of a file:

  • Embedding URLs that point to files hosted on other sites is called hotlinking. Hotlinking can cause no end of trouble, and most sites take measures to prevent it.
  • Because the linked files live on someone else's servers, your application runs at the mercy of their pace.
  • Hotlinked content can change easily. If you put a hotlinked image on your blog and the remote server notices, it may be swapped for something embarrassing. If you store the URL intending to use it later, the link may be dead by then, or may point to something completely unrelated.
  • Real web browsers don't just request a page's HTML and move on; they also download every resource on the page. Downloading files therefore makes your scraper look more like a human browsing the site, which works in your favor.

In Python 3.x, urllib.request.urlretrieve downloads a file from a given URL:

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
# Find the URL of the site logo image and save it locally as logo.jpg
imageLocation = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")
Batch download:

# -*- coding: utf-8 -*-
import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    # Normalize the src value into an absolute URL on the target site
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    # Skip resources hosted on other sites
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    # Mirror the URL's path structure under the download directory
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)

    if not os.path.exists(directory):
        os.makedirs(directory)

    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
# Find every tag that has a src attribute (images, scripts, etc.)
downloadList = bsObj.findAll(src=True)

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))
Note: this program does not check the type of the files it downloads, so do not run it with administrator privileges. Remember to back up important files regularly, and do not keep sensitive information on the disk it writes to.
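
If you want to guard against downloading unexpected file types, one simple approach is to whitelist extensions before calling urlretrieve. This is only a minimal sketch; the extension list is an illustrative assumption, not part of the original program:

from urllib.parse import urlparse
import os

# Illustrative whitelist -- adjust to whatever your application actually needs
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".pdf"}

def isAllowed(fileUrl):
    # Inspect only the path component so query strings don't confuse the check
    extension = os.path.splitext(urlparse(fileUrl).path)[1].lower()
    return extension in ALLOWED_EXTENSIONS

# Usage inside the download loop above:
# if fileUrl is not None and isAllowed(fileUrl):
#     urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))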

5.2 Storing Data to CSV

CSV (comma-separated values) is a common format for storing spreadsheet-style data.

import csv

# newline='' keeps the csv module from inserting blank rows on Windows
csvFile = open("test.csv", 'w+', newline='')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()
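
To verify the result, you can read the file back with csv.reader. A minimal sketch, using the same test.csv produced above:

import csv

# Read test.csv back and print each row as a list of strings
with open("test.csv", newline='') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        print(row)
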
Converting an HTML table into a CSV file:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
# The main comparison table is the first table with class "wikitable" on the page
table = bsObj.findAll("table", {"class": "wikitable"})[0]
rows = table.findAll("tr")

csvFile = open("editors.csv", 'wt', newline="", encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()

5.3 MySQL

$ sudo apt-get install mysql-server

mysql -u root -p

create database scraping;
use scraping;
create table pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));

describe pages;

insert into pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long.");

select * from pages where id=1;

select * from pages where title like "%test%";

delete from pages where id = 1;

Integrating with Python

import pymysql

# Connect through the local MySQL socket; adjust user/passwd to your own setup
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql')
cur = conn.cursor()
cur.execute("use scraping")
cur.execute("select * from pages where id=2")
print(cur.fetchone())
cur.close()
conn.close()
The connection/cursor pattern is common in database programming. The connection object not only connects to the database; it also sends database information, handles rollbacks, creates new cursor objects, and so on. One connection can have many cursors, and each cursor tracks certain state information, such as which database it is currently using. If you have multiple databases and need to write to all of them, you will need multiple cursors to keep things straight. A cursor also holds the result of the last query it executed, which you retrieve by calling cursor methods such as cur.fetchone().
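
A minimal sketch of this pattern, using the scraping database and pages table created earlier (the password follows the examples in this post; substitute your own):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='dong', db='scraping', charset='utf8')
cur = conn.cursor()              # one connection can hand out many cursors
try:
    cur.execute("insert into pages (title, content) values (%s, %s)", ("A title", "Some content"))
    conn.commit()                # the connection, not the cursor, commits the transaction
    cur.execute("select title from pages order by id desc")
    print(cur.fetchone())        # the cursor holds the result of its last query
except Exception:
    conn.rollback()              # undo the insert if anything above failed
finally:
    cur.close()
    conn.close()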

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("use scraping")

random.seed(datetime.datetime.now().timestamp())

def store(title, content):
    # Parameterized insert -- pymysql quotes and escapes the values itself
    cur.execute("insert into pages (title, content) values (%s, %s)", (title, content))
    cur.connection.commit()

# Example article URL: https://en.wikipedia.org/wiki/Python_(programming_language)
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    content = bsObj.find("div", {"id": "mw-content-text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()
The "Six Degrees" game in MySQL

A link from page A to page B: insert into links (fromPageId, toPageId) values (A, B)

You need to design a database with two tables, one to store pages and one to store links; both tables have a creation timestamp and their own independent ID column:

create table pages ( id INT NOT NULL AUTO_INCREMENT, url VARCHAR(255) NOT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));

create table links (id INT NOT NULL AUTO_INCREMENT, fromPageId INT NULL, toPageId INT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("use wikipedia")

def insertPageIfNotExists(url):
    cur.execute("select * from pages where url = %s", (url,))
    if cur.rowcount == 0:
        print("inserting page: " + url)
        cur.execute("insert into pages (url) values (%s)", (url,))
        conn.commit()
        return cur.lastrowid
    else:
        return cur.fetchone()[0]

def insertLink(fromPageId, toPageId):
    cur.execute("select * from links where fromPageId=%s and toPageId=%s", (int(fromPageId), int(toPageId)))
    if cur.rowcount == 0:
        print("inserting link %d -> %d" % (int(fromPageId), int(toPageId)))
        cur.execute("insert into links (fromPageId, toPageId) values (%s, %s)", (int(fromPageId), int(toPageId)))
        conn.commit()

pages = set()
def getLinks(pageUrl, recursionLevel):
    global pages
    # Stop following links once we are five hops from the start page
    if recursionLevel > 4:
        return
    pageId = insertPageIfNotExists(pageUrl)
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
        insertLink(pageId, insertPageIfNotExists(link.attrs['href']))
        if link.attrs['href'] not in pages:
            # A new page: add it to the set and recurse into its article links
            newPage = link.attrs['href']
            pages.add(newPage)
            getLinks(newPage, recursionLevel + 1)

getLinks("/wiki/Kevin_Bacon", 0)
cur.close()
conn.close()
Note that this program may have to run for several days before it finishes.
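
While it runs, you can check progress by counting what has been stored so far. A minimal sketch against the wikipedia database used above (connection parameters follow the earlier examples):

import pymysql

conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='wikipedia', charset='utf8')
cur = conn.cursor()
cur.execute("select count(*) from pages")
print("pages stored:", cur.fetchone()[0])
cur.execute("select count(*) from links")
print("links stored:", cur.fetchone()[0])
cur.close()
conn.close()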

5.4 Email

Email is sent over SMTP (Simple Mail Transfer Protocol), so you need access to a server that speaks SMTP.

import smtplib
from email.mime.text import MIMEText

msg = MIMEText("The body of the email is here")
msg['Subject'] = "An Email Alert"
msg['From'] = "ryan@pythonscraping.com"
msg['To'] = "webmaster@pythonscraping.com"

# Assumes an SMTP server is running on localhost
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()
Python has two important packages for sending email: smtplib and email.
import smtplib
from email.mime.text import MIMEText
from bs4 import BeautifulSoup
from urllib.request import urlopen
import time

def sendMail(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = "christmas_alerts@pythonscraping.com"
    msg['To'] = "ryan@pythonscraping.com"
    s = smtplib.SMTP('localhost')
    s.send_message(msg)
    s.quit()

bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
while bsObj.find("a", {"id": "answer"}).attrs['title'] == "NO":
    print("It is not Christmas yet.")
    time.sleep(3600)
    bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
sendMail("It's Christmas!",
         "According to http://itischristmas.com, it is Christmas!")
This program checks https://isitchristmas.com/ once an hour (the site answers, based on the date, whether today is Christmas) and emails an alert as soon as the answer changes.
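
Both examples assume an SMTP server listening on localhost. If you do not run one, a hedged sketch using an external, authenticated SMTP server looks like this; the hostname, port, addresses, and credentials are placeholders you must replace:

import smtplib
from email.mime.text import MIMEText

msg = MIMEText("The body of the email is here")
msg['Subject'] = "An Email Alert"
msg['From'] = "you@example.com"        # placeholder sender
msg['To'] = "someone@example.com"      # placeholder recipient

# Placeholder server details -- substitute your provider's SMTP host and credentials
s = smtplib.SMTP("smtp.example.com", 587)
s.starttls()                           # upgrade the connection to TLS
s.login("you@example.com", "your-password")
s.send_message(msg)
s.quit()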



