Data Storage
I. Storing to TXT or CSV
1. Storing data to TXT
Several modes for opening a file:
Mode | Access | If file does not exist | Write behavior |
---|---|---|---|
w | write | created | overwrite |
w+ | read + write | created | overwrite |
r | read | error | not writable |
r+ | read + write | error | overwrite from the start (no truncation) |
a | write | created | append |
a+ | read + write | created | append |
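To see the difference between overwrite ('w') and append ('a') concretely, here is a minimal sketch (the file path is made up for illustration):
with open(r'D:\mode_demo.txt', 'w') as f:   # 'w' truncates any existing content
    f.write('first\n')
with open(r'D:\mode_demo.txt', 'w') as f:   # opening with 'w' again discards 'first'
    f.write('second\n')
with open(r'D:\mode_demo.txt', 'a') as f:   # 'a' keeps 'second' and appends after it
    f.write('third\n')
# The file now contains 'second' and 'third', but not 'first'.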
Appending a string to a TXT file (the with statement closes the file automatically, so no explicit f.close() is needed):
title = "This is a test sentence."
with open(r'D:\title.txt', "a+") as f:
    f.write(title)
Formatted (tab-separated) storage:
output = '\t'.join(['name', 'title', 'age', 'gender'])
with open(r'D:\test.txt', "a+") as f:
    f.write(output)
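When appending several records this way, it helps to end each line with a newline so rows do not run together. A minimal sketch (the records are made up for illustration):
rows = [
    ['Alice', 'Engineer', '30', 'F'],
    ['Bob', 'Writer', '25', 'M'],
]
with open(r'D:\test.txt', "a+") as f:
    for row in rows:
        f.write('\t'.join(row) + '\n')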
2. Reading a file
with open(r'D:\title.txt', "r", encoding='utf-8') as f:
    result = f.read()          # the whole file as one string
print(result)

with open(r'D:\title.txt', "r", encoding='utf-8') as f:
    result = f.read().splitlines()   # a list of lines, without newline characters
print(result)
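To recover the tab-separated fields written earlier, each line can be split back on '\t'. A minimal sketch, assuming D:\test.txt holds the tab-separated rows from above:
with open(r'D:\test.txt', "r", encoding='utf-8') as f:
    for line in f.read().splitlines():
        fields = line.split('\t')
        print(fields)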
3. Storing data in CSV
import csv

with open('test.csv', 'r', encoding='utf-8') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)      # the whole row as a list of strings
        print(row[0])   # the first column of the row
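If the first row of the CSV file is a header, csv.DictReader maps every following row to a dict keyed by those column names. A sketch; it assumes test.csv has a header row, which the source does not specify:
import csv

with open('test.csv', 'r', encoding='utf-8') as csvfile:
    for row in csv.DictReader(csvfile):
        print(row)   # e.g. a dict keyed by the header's column names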
import csv

output_list = ['1', '2', '3', '4']
with open('test2.csv', 'a+', encoding='utf-8', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow(output_list)
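To write several rows at once, csv.writer also provides writerows. A minimal sketch with made-up data:
import csv

rows = [['1', '2', '3', '4'], ['5', '6', '7', '8']]
with open('test2.csv', 'a+', encoding='utf-8', newline='') as csvfile:
    csv.writer(csvfile).writerows(rows)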
II. Storing to a MySQL Database
import pymysql

# Open the database connection
db = pymysql.connect(host="localhost", user="root", password="password", database="scraping")
# Get a cursor for executing statements
cursor = db.cursor()
# SQL INSERT statement
sql = """INSERT INTO urls (url, content) VALUES ('www.baidu.com', 'This is content.')"""
try:
    # Execute the SQL statement
    cursor.execute(sql)
    # Commit the transaction
    db.commit()
except Exception:
    # Roll back on any error
    db.rollback()
# Close the database connection
db.close()
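Both this snippet and the next assume that the scraping database already contains a urls table. The source never shows its schema; a plausible sketch (the column types here are assumptions, not from the source):
import pymysql

db = pymysql.connect(host="localhost", user="root", password="password", database="scraping")
cursor = db.cursor()
# Hypothetical schema: the source does not show the real CREATE TABLE statement
cursor.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        id INT NOT NULL AUTO_INCREMENT,
        url VARCHAR(1000) NOT NULL,
        content VARCHAR(4000) NOT NULL,
        PRIMARY KEY (id)
    )
""")
db.close()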
Storing content scraped from the web into the MySQL database:
import requests
from bs4 import BeautifulSoup
import pymysql

db = pymysql.connect(host="localhost", user="root", password="password", database="scraping")
cursor = db.cursor()

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    cursor.execute("INSERT INTO urls (url, content) VALUES (%s, %s)", (url, title))
db.commit()
db.close()
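To check that the rows were actually written, you can query them back. A minimal sketch:
import pymysql

db = pymysql.connect(host="localhost", user="root", password="password", database="scraping")
cursor = db.cursor()
cursor.execute("SELECT url, content FROM urls")
for url, content in cursor.fetchall():
    print(url, content)
db.close()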
III. Storing to a MongoDB Database
After the installation is complete, you can try operating MongoDB from Python to check whether it can connect to the database normally:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog
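This alone only creates the client object; pymongo connects lazily. To actually confirm the server is reachable, you can ask for its build info. A minimal sketch; server_info() raises an error if the server cannot be reached:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
print(client.server_info())   # raises ServerSelectionTimeoutError if MongoDB is not running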
Storing all the article titles scraped from the blog homepage into the MongoDB database:
import requests
import datetime
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.blog_database
collection = db.blog

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    post = {"url": url,
            "title": title,
            "date": datetime.datetime.utcnow()}
    collection.insert_one(post)
In the code above, the scraped data is first placed in the post dictionary and then added to the collection with insert_one. To inspect the results, go to the directory C:\Program Files\MongoDB\Server\4.0\bin, double-click mongo.exe, and enter:
use blog_database
db.blog.find().pretty()
This lets you query the data stored in the collection.
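The same check can also be done from Python with find(). A minimal sketch:
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client.blog_database.blog
for post in collection.find():
    print(post["url"], post["title"])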