Web Scraping Notes: Data Storage

Latest recommended article published on 2024-08-07 08:19:34

勤奋的小学生

Category: Web Scraping. Tags: Python, web scraping, TXT and CSV, MySQL, MongoDB

Article link: https://blog.csdn.net/gyt15663668337/article/details/86286131

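The first step of web scraping is fetching the page, the second is parsing it, and the third is storing the data we extracted. This post covers the following two ways of storing it.

Storing in files, including TXT files and CSV files
Storing in databases, including the MySQL relational database and MongoDB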

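I. Storing Data in TXT or CSV Files

1. Storing data in a TXT file

Writing data to a TXT file is simple, and earlier posts already did it. This section of the book, however, shows three ways of writing a file path, which is well worth learning. We go through the book's examples one by one.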

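Form 1: C:\\Python\\title.txt

title = "This is a test sentence."
# "a+" appends to the file and creates it if it does not exist;
# the with block closes the file, so no explicit f.close() is needed
with open('C:\\Python\\title.txt', "a+") as f:
    f.write(title)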

Form 2: r'C:\Python\title.txt'

title = "Hello world,Hello python"
# The r prefix makes a raw string, so the backslashes are taken literally
with open(r'C:\Python\title.txt', "a+") as f:
    f.write(title)

Form 3: 'C:/Python/title.txt'

title = "\n nihao,shijie"
# Forward slashes need no escaping
with open('C:/Python/title.txt', "a+") as f:
    f.write(title)

On Windows, all three forms locate the file at the correct path.
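To see why the three forms are interchangeable, the following minimal sketch compares them without touching the disk. It relies on `pathlib.PureWindowsPath` from the standard library, which applies Windows path rules on any operating system; the variable names are only illustrative. Note that a plain `'C:\Python\title.txt'` (single backslashes, no `r` prefix) would not work, because `\t` would be read as a tab character.

```python
from pathlib import PureWindowsPath

escaped = 'C:\\Python\\title.txt'   # backslashes escaped by hand
raw = r'C:\Python\title.txt'        # raw string: backslashes taken literally
forward = 'C:/Python/title.txt'     # forward slashes, also accepted by Windows

# The first two literals produce the very same string.
same_string = (escaped == raw)

# Under Windows path rules, all three name the same file.
same_path = (
    PureWindowsPath(escaped) == PureWindowsPath(raw) == PureWindowsPath(forward)
)

print(same_string, same_path)
```

In new code, the raw-string or forward-slash form is usually the easiest to read and the hardest to get wrong.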

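Sometimes several variables have to be written to a TXT file, and then the choice of separator matters. A tab is a good choice, because strings rarely contain tab characters. Use '\t'.join() to join the values into a single string.

# Write several values to a TXT file, separated by tabs
output = '\t'.join(['name', 'title', 'age', 'gender'])
with open('C:/Python/test.txt', "a+") as f:
    f.write(output)

Reading data from the TXT file

# Read the data back
with open('C:/Python/test.txt', 'r', encoding='UTF-8') as f:
    result = f.read()
    print(result)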

name	title	age	gender
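The example above writes a single header line. As a minimal sketch of how this extends to several records, the code below writes one tab-separated record per line and splits each line back into fields when reading. The sample record is made up, and a temporary directory is used (an assumption for portability) so the sketch does not depend on C:/Python existing.

```python
import os
import tempfile

rows = [
    ['name', 'title', 'age', 'gender'],
    ['Alice', 'Engineer', '30', 'F'],   # made-up sample record
]

# Write to a temporary location so the sketch runs on any machine.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'a+', encoding='UTF-8') as f:
    for row in rows:
        f.write('\t'.join(row) + '\n')   # one record per line

# Read the file back and split every line on the tab separator.
with open(path, 'r', encoding='UTF-8') as f:
    parsed = [line.rstrip('\n').split('\t') for line in f]

print(parsed)
```

Passing the same encoding when writing and reading avoids surprises once the data contains non-ASCII text.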

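2. Storing data in a CSV file

CSV (Comma-Separated Values) is a file format that stores tabular data as plain text. Rows are separated by newlines, and the columns within a row are separated by commas. CSV files can be opened with Excel or Notepad.

Writing a CSV file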

import csv
output_list = ['A1', 'A2', 'A3', 'A4']
with open('test2.csv', 'a+', encoding='UTF-8', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow(output_list)
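One reason to use the csv module instead of joining strings by hand is that scraped text often contains commas itself. The sketch below, which uses an in-memory `io.StringIO` buffer as a stand-in for a real file, shows that `csv.writer` quotes such a field and `csv.reader` restores it intact. The sample values are made up.

```python
import csv
import io

buffer = io.StringIO()   # in-memory stand-in for a file opened with newline=''
w = csv.writer(buffer)
w.writerow(['A1', 'hello, world', 'A3'])   # the middle field contains a comma

csv_text = buffer.getvalue()
print(repr(csv_text))    # the field with the comma is wrapped in double quotes

# Reading the text back yields the original three fields.
row = next(csv.reader(io.StringIO(csv_text)))
print(row)
```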

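Reading a CSV file

import csv
# Assumption: test.csv already exists and each of its lines holds tab-separated numbers
with open('test.csv', 'r', encoding='UTF-8', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)

csv.reader splits on commas by default, so each tab-separated line comes back as a single-element list:

['1\t2\t3\t4']
['5\t6\t7\t8']
['9\t10\t11\t12']
['13\t14\t15\t16']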
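In the output above, every row is one string that still contains the tabs, because the reader was splitting on commas. As a minimal sketch, passing `delimiter='\t'` to `csv.reader` splits such rows into separate fields. The sample text mirrors the first two output rows and is held in an in-memory buffer rather than a file.

```python
import csv
import io

# Sample content mirroring the output above: numbers separated by tabs.
data = "1\t2\t3\t4\n5\t6\t7\t8\n"

# Default comma delimiter: each line stays a single field.
default_rows = list(csv.reader(io.StringIO(data)))

# Tab delimiter: each line is split into four fields.
tab_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

print(default_rows)
print(tab_rows)
```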

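II. Storing Data in MySQL and MongoDB

1. Storing data in MySQL

Before using a database, it has to be installed. There are plenty of tutorials online to follow. After that come the basic database operations, briefly introduced below.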

# Log in to MySQL (run in the system shell, then enter the password when prompted)
mysql -u root -p
# List the databases
show databases;
# Create a database
create database scraping;
# Switch to the database
use scraping;
# Create a table
create table urls(
id int not null auto_increment,
url varchar(1000) not null,
content varchar(4000) not null,
created_time timestamp default current_timestamp,
primary key (id));
# Show the table structure
describe urls;
# Insert a row
insert into urls (url, content) values ("www.baidu.com", "content");
# Query a row
select * from urls where id=1;
# Query only url and content
select url, content from urls where id=1;
# Delete rows
delete from urls where url='www.baidu.com';
# Update a row
update urls set url="www.google.com", content="Google" where id=2;
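To try the same insert, select, update, and delete statements without installing a server, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. SQLite is a stand-in here, not MySQL: the auto-increment column is declared as `integer primary key autoincrement`, and parameters use `?` placeholders. Everything else follows the urls table above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway database that lives in RAM
cur = conn.cursor()

# SQLite spelling of the urls table (auto-increment syntax differs from MySQL).
cur.execute("""
    create table urls(
        id integer primary key autoincrement,
        url varchar(1000) not null,
        content varchar(4000) not null,
        created_time timestamp default current_timestamp
    )
""")

cur.execute("insert into urls (url, content) values (?, ?)",
            ("www.baidu.com", "content"))
cur.execute("select url, content from urls where id=1")
inserted = cur.fetchone()

cur.execute("update urls set url=?, content=? where id=1",
            ("www.google.com", "Google"))
cur.execute("select url, content from urls where id=1")
updated = cur.fetchone()

cur.execute("delete from urls where url=?", ("www.google.com",))
cur.execute("select count(*) from urls")
remaining = cur.fetchone()[0]

conn.commit()
conn.close()
print(inserted, updated, remaining)
```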

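Next, connect to the database from Python and use it to store data.

import pymysql
# Let PyMySQL stand in for the MySQLdb module
pymysql.install_as_MySQLdb()
import MySQLdb

# Connect to the scraping database created earlier
conn = MySQLdb.connect(host='localhost', user='root', passwd='mysql123', db='scraping')
cur = conn.cursor()
cur.execute("insert into urls(url, content) values ('www.bai.com', 'this is content')")
cur.close()
# The insert is only saved once the transaction is committed
conn.commit()
conn.close()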
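Because nothing is saved until `commit()` is called, it is worth deciding what happens when an insert fails halfway. The sketch below is one possible pattern, not the book's method: a hypothetical helper `insert_urls` inserts a batch with `executemany`, commits on success, and rolls back on any error. It only uses standard DB-API calls (`cursor`, `executemany`, `commit`, `rollback`), so it should work with a PyMySQL connection and its default `%s` placeholder. The demonstration uses an in-memory sqlite3 database with the `?` placeholder as a stand-in, so that it runs without a MySQL server.

```python
import sqlite3

def insert_urls(conn, rows, placeholder="%s"):
    """Insert (url, content) pairs as one transaction; roll back on any error.

    Hypothetical helper: placeholder is "%s" for PyMySQL, "?" for sqlite3.
    """
    sql = "insert into urls(url, content) values ({0}, {0})".format(placeholder)
    cur = conn.cursor()
    try:
        cur.executemany(sql, rows)
        conn.commit()
    except Exception:
        conn.rollback()   # undo the rows inserted before the failure
        raise
    finally:
        cur.close()

# Demonstration with sqlite3 standing in for MySQL.
demo = sqlite3.connect(':memory:')
demo.execute("create table urls(url text not null, content text not null)")
insert_urls(demo, [("www.baidu.com", "content")], placeholder="?")

try:
    # The second row violates NOT NULL, so the whole batch is rolled back.
    insert_urls(demo, [("www.a.com", "ok"), ("www.b.com", None)], placeholder="?")
except sqlite3.IntegrityError:
    pass

stored = demo.execute("select url from urls").fetchall()
print(stored)
```

Passing values as parameters, instead of building the SQL string by hand, also protects against quoting problems in scraped text.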

Next, scrape the blog page and store the data in the MySQL database.

import requests
from bs4 import BeautifulSoup
import pymysql
pymysql.install_as_MySQLdb()
import MySQLdb

# Connect to the database
conn = MySQLdb.connect(host='localhost', user='root', passwd='mysql123', db='scraping')
cur = conn.cursor()

link = "http://www.santostang.com//"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    # Pass the values as parameters so the driver escapes them
    cur.execute("insert into urls(url, content) values (%s, %s)", (url, title))
cur.close()
conn.commit()
conn.close()

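2. Storing data in MongoDB

MongoDB also has to be installed first. There are plenty of tutorials online; pick one and follow it.

A MongoDB collection corresponds to a table in MySQL, a document corresponds to a row (record), and a field corresponds to a column. With that in mind, we can use MongoDB to store data.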
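The correspondence can be made concrete with plain Python data, as in this small sketch. The column names and values are taken from the urls table used in the MySQL section; a row is a tuple whose meaning depends on the column order, while a document carries its field names with it.

```python
columns = ("id", "url", "content")        # MySQL columns correspond to MongoDB fields
row = (1, "www.baidu.com", "content")     # one MySQL row (record)

# The same record written as a MongoDB-style document (a Python dict).
document = dict(zip(columns, row))
print(document)
```

When a document without an `_id` field is inserted, MongoDB assigns one automatically, which plays the role of the primary key.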

import requests
import datetime
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Connect to MongoDB
# Create a client
client = MongoClient('localhost', 27017)
# Select the database
db = client.blog_database
# Select the collection
collection = db.blog

# Scrape the data
link = "http://www.santostang.com//"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")
title_list = soup.find_all("h1", class_="post-title")
for eachone in title_list:
    url = eachone.a['href']
    title = eachone.a.text.strip()
    # Build the document to store
    post = {
        "url": url,
        "title": title,
        "date": datetime.datetime.utcnow()
    }
    # Insert the document into the collection
    collection.insert_one(post)
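One detail in the code above may need attention on newer interpreters: `datetime.datetime.utcnow()` is deprecated as of Python 3.12. A possible replacement, sketched below with a hypothetical helper `make_post`, builds the same document with a timezone-aware UTC timestamp. The URL and title passed in are placeholders, not scraped values.

```python
import datetime

def make_post(url, title):
    """Build the document for one blog post (hypothetical helper)."""
    return {
        "url": url,
        "title": title,
        # Timezone-aware UTC timestamp instead of the deprecated utcnow()
        "date": datetime.datetime.now(datetime.timezone.utc),
    }

post = make_post("http://example.com/post-1", "sample title")
print(post["date"].tzinfo)
```

The resulting dict can be passed to `collection.insert_one(post)` exactly as in the loop above.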

Then, in the open mongo.exe window, enter

use blog_database

db.blog.find().pretty()

and you can see that the scraped data has been stored in the database.