Saving the data!!! Three ways to store data scraped with Python: txt files, CSV files, and a MySQL database

    After scraping a website with Python, how to save the data is something many people care about, and it is also the key step that determines how easily the data can be processed afterwards. Below is a brief walkthrough of three ways to save scraped web data: as a txt file, as a CSV file, and into a database.

    First, saving as a txt file. Without further ado, here is the code!

# -*- coding: utf-8 -*-
import requests
import json
import sys
import random


reload(sys)
sys.setdefaultencoding('utf-8')
headers ={'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.7 Safari/537.36'}


def get_html1(i):
    # Fetch page i of the product list; the random query parameter defeats caching.
    url = 'https://www.ppmoney.com/StepUp/List/-1/{}/fixedterm/true/false?_={}'
    html = requests.get(url.format(i, random.randint(1501050773102, 1501051774102)), headers=headers)
    return html.content


def get_data1(html):
    # Parse the JSON response and append each product record to the txt file.
    data1 = json.loads(html)
    data = data1['PackageList']['Data']
    for i in data:
        # product name, rate, invested amount
        print i['name'], '\t', i['profit'], '\t', i['investedMoney']
        with open('d:PPmonenyshengxinbao9.6.txt', 'a') as f:
            f.write(i['name'] + '\t' + str(i['profit']) + '\t' + str(i['investedMoney']) + '\n')


for i in range(1, 10):
    get_data1(get_html1(i))

After running the code, the generated file holds one product per line: the name, rate, and invested amount separated by tabs.
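To double-check the output, here is a minimal sketch that reads the file back as a tab-separated table (the path simply mirrors the one in the script above, and pandas is used only for convenience):

import pandas as pd

# Read the tab-separated txt file produced by the script above.
df = pd.read_csv('d:PPmonenyshengxinbao9.6.txt', sep='\t',
                 header=None, names=['name', 'profit', 'investedMoney'])
print(df.head())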

    2. Saving as a CSV file.

# -*- coding: utf-8 -*-
import requests
import pandas as pd
import json
import sys
import random


reload(sys)
sys.setdefaultencoding('utf8')
headers ={'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.7 Safari/537.36'}

def get_html1(i):
    # Fetch page i and return the list of product records from the JSON response.
    url = 'https://www.ppmoney.com/StepUp/List/-1/{}/fixedterm/true/false?_={}'
    html = requests.get(url.format(i, random.randint(1501050773102, 1501051774102)), headers=headers)
    ceshi1 = html.content
    data = json.loads(ceshi1)
    return data['PackageList']['Data']


data_ceshi = pd.DataFrame([])
html_list = []
for i in range(100):
    html_list.append(get_html1(i))
for i, html_page in enumerate(html_list):
    # Turn each page's records into a DataFrame and tag it with the page number.
    tmp = pd.DataFrame(html_page)
    tmp["page_id"] = i
    data_ceshi = data_ceshi.append(tmp)


print data_ceshi
data_ceshi.to_csv('e:/data.csv', encoding='gbk')

After saving, e:/data.csv holds the records from every page, with a page_id column marking which page each row came from.
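As a quick sanity check, a minimal sketch that loads the exported CSV back, using the same path and encoding as above:

import pandas as pd

# Load the CSV written above; the first column is the saved index.
df = pd.read_csv('e:/data.csv', encoding='gbk', index_col=0)
print(df.shape)
print(df.head())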

3. Saving to a MySQL database.

# -*- coding: utf-8 -*-
import requests
import pandas as pd
import json
import sys
import random
import MySQLdb


reload(sys)
sys.setdefaultencoding('utf8')
headers ={'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.7 Safari/537.36'}

db = MySQLdb.connect(host='localhost', port=3306, user='zhouliye', passwd='123456zz', db='abbc', charset='utf8')
print 'Connected to the database!'
cursor = db.cursor()


# Recreate the target table so the script can be rerun cleanly.
cursor.execute("DROP TABLE IF EXISTS SHENGXB")
sql = """CREATE TABLE SHENGXB(
          beginTime DATETIME,
          endTime DATETIME,
          investedMoney FLOAT,
          name CHAR(50))"""


cursor.execute(sql)


def get_html1(i):
    # Fetch page i and return the list of product records from the JSON response.
    url = 'https://www.ppmoney.com/StepUp/List/-1/{}/fixedterm/true/false?_={}'
    html = requests.get(url.format(i, random.randint(1501050773102, 1501051774102)), headers=headers)
    ceshi1 = html.content
    data = json.loads(ceshi1)
    return data['PackageList']['Data']


data_ceshi = pd.DataFrame([])  # empty DataFrame
html_list = []  # empty list


for i in range(10):
    html_list.append(get_html1(i))


for i in html_list:
    for j in i:
        a = j['beginTime']
        b = j['endTime']
        c = j['investedMoney']
        d = j['name']
        print 'Start time: ' + str(a) + ' End time: ' + str(b) + ' Invested amount: ' + str(c) + ' Name: ' + str(d)
        insert_SHENGXB = ("INSERT INTO SHENGXB (beginTime,endTime,investedMoney,name) VALUES(%s,%s,%s,%s)")
        data_data = (a, b, c, d)
        cursor.execute(insert_SHENGXB, data_data)
        db.commit()
        print '****** Row inserted!'


db.close()
print 'Scraping and inserting into the MySQL database finished...'

After running the script, the scraped records sit in the SHENGXB table, one row per product.
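To confirm the inserts, a minimal sketch that queries a few rows back, reusing the connection settings from the script above:

import MySQLdb

# Re-open the connection and read back a handful of inserted rows.
db = MySQLdb.connect(host='localhost', port=3306, user='zhouliye', passwd='123456zz', db='abbc', charset='utf8')
cursor = db.cursor()
cursor.execute("SELECT beginTime, endTime, investedMoney, name FROM SHENGXB LIMIT 5")
for row in cursor.fetchall():
    print(row)
db.close()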
