Python and Data Storage

银霜覆秋枫

Article link: https://blog.csdn.net/u011974126/article/details/51381592

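1. Storing Media Files

There are two ways to store media files. The first is to store only the URL of the media file, which suits cases where the URL rarely changes or the file is referenced only once. When the URL changes frequently, or the file is referenced many times, the file usually needs to be downloaded to local storage.

In Python 3.x, urllib.request.urlretrieve downloads a file given its URL. The code below downloads the images in the body of the Baidu Baike entry "英雄联盟" (League of Legends).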

import os
from urllib.request import urlopen
from urllib.request import urlretrieve

from bs4 import BeautifulSoup

# Local directory in which the downloaded images are saved
downLoadDirectory="F:\\PythonScrawler\\LOL\\"

url="http://baike.baidu.com/subview/3049782/11262116.htm"

html=urlopen(url)
bsObj=BeautifulSoup(html,"html.parser")


def GetDownLoadPath(url,downLoadPath):
    # The last segment of the URL is used as the local file name
    filename=url.split('/')[-1]
    filePathName=downLoadPath+filename
    # Create the target directory if it does not exist yet
    directory=os.path.dirname(downLoadPath)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return filePathName

# Walk every paragraph of the entry body and look for picture blocks
for node in bsObj.find("div",{"class":"main-content"}).findAll("div",{"class":"para"}):
    links=node.findAll("div",{"class":"lemma-picture text-pic layout-right"})
    for link in links:
        for child in link.findAll('img'):
            # The image URL is read from data-src; images without it are skipped
            imgUrl=child.attrs.get('data-src')
            if imgUrl:
                urlretrieve(imgUrl,GetDownLoadPath(imgUrl,downLoadDirectory))

In the code above, the GetDownLoadPath() function builds the full path of the file to be saved. The key calls are annotated below:

os.path.dirname(downLoadPath)   # get the directory part of the path
os.path.exists(directory)       # check whether the directory exists
os.makedirs(directory)          # create the directory
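
As a complement to the calls annotated above, here is a minimal sketch of the same path-building step written with `urllib.parse` and `os.makedirs(..., exist_ok=True)`. The function name `build_download_path` and the example URL are hypothetical and not part of the original program. Unlike the original, this version drops any query string from the file name.

```python
import os
import tempfile
from urllib.parse import urlparse


def build_download_path(url, directory):
    """Return the local path for `url` inside `directory` (hypothetical helper)."""
    # Use only the path part of the URL so "?size=large" does not end up in the file name
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        # URLs ending in "/" carry no file name; fall back to a fixed placeholder
        filename = "unnamed"
    # exist_ok=True replaces the separate os.path.exists() check
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)


if __name__ == "__main__":
    target_dir = os.path.join(tempfile.mkdtemp(), "LOL")
    print(build_download_path("http://example.com/pic/a.jpg?size=large", target_dir))
```

The returned path can be passed directly as the second argument of urlretrieve.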

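Inspecting the page in the browser shows that each image URL sits in the src attribute of an img node. In the document parsed by BeautifulSoup, however, the URL appears in the data-src attribute instead. The cause has not been investigated; a likely explanation is that the page lazy-loads its images with JavaScript, which BeautifulSoup does not execute. The code in this article therefore reads data-src. The result of running the program is as follows: [screenshot of the download result]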
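
One way to cope with both attributes is to prefer data-src and fall back to src. The sketch below illustrates the idea on a hand-written HTML fragment, not the real Baidu Baike markup. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the class name `ImageUrlCollector` is hypothetical.

```python
from html.parser import HTMLParser


class ImageUrlCollector(HTMLParser):
    """Collect one URL per <img> tag, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Use data-src when present, otherwise fall back to src
        url = attributes.get("data-src") or attributes.get("src")
        if url:
            self.urls.append(url)


# Made-up fragment for illustration only
SAMPLE_HTML = """
<div class="para">
  <img data-src="http://example.com/real.jpg" src="placeholder.gif">
  <img src="http://example.com/plain.png">
  <img alt="no url at all">
</div>
"""

collector = ImageUrlCollector()
collector.feed(SAMPLE_HTML)
print(collector.urls)
```

With BeautifulSoup the same fallback could be written as `child.attrs.get('data-src') or child.attrs.get('src')`.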

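2. Saving to a CSV File

CSV is a common file format for storing tabular data. The code below scrapes the first table in the Baidu Baike entry for PHP and saves it to a local CSV file.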

# Download the table in the Baidu Baike entry for PHP
import csv
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup


# newline='' keeps the csv module from writing blank lines between rows on Windows;
# utf-8 is used so that Chinese text can always be encoded
csvFile=open('F:\\PythonScrawler\\PHPCSV\\set.csv','wt',newline='',encoding='utf-8')
writer=csv.writer(csvFile)


url="http://baike.baidu.com/subview/99/5828265.htm"
html=urlopen(url)
bsObj=BeautifulSoup(html.read(),'html.parser')

# First table with the given class
table=bsObj.findAll("table",{"class":"table-view log-set-param"})[0]
rows=table.findAll("tr")

try:
    for row in rows:
        csvRows=[]
        # Collect every header (th) and data (td) cell of the current row
        for col in row.findAll(re.compile("^t[dh]$")):
            csvRows.append(col.get_text())
        writer.writerow(csvRows)
finally:
    csvFile.close()

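The code above first opens a local CSV file and creates a writer object for it, which handles all writes to the file. For each table row it creates an empty list, appends the cells of that row to the list one column at a time, and then calls the writer's writerow method to write all elements of the list as one line of the CSV file. The result is as follows: [screenshot of the resulting CSV file]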
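
The same row-by-row pattern can be tried without any network access. The following sketch writes a small made-up table (the rows are illustrative, not taken from the PHP entry) to a temporary file and reads it back. It assumes UTF-8 output and passes `newline=''`, which the `csv` module documentation recommends when opening files.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the cells of an HTML table
rows = [
    ["name", "note"],
    ["英雄联盟", "contains, a comma"],
]

csv_path = os.path.join(tempfile.mkdtemp(), "set.csv")

# Write one list per table row, as in the scraper above
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    for row in rows:
        writer.writerow(row)

# Read the file back to check that quoting and encoding round-trip
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    read_back = list(csv.reader(csv_file))

print(read_back == rows)
```

The cell containing a comma is quoted automatically by the writer, so it comes back as a single field.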

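3. Saving to a MySQL Database

Working with a MySQL database from Python requires the PyMySQL library; this article uses version 0.7.2. On a freshly installed MySQL server, the account must first be granted access, otherwise the connection fails with an access error. The grant statement is: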

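grant all privileges on *.* to root@'%' identified by '12345678' with grant option;

The statement above allows the account root, with password 12345678, to connect from any IP address with all privileges on all databases. To restrict access to a single IP address, use: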

grant all privileges on *.* to root@'xxx.xxx.xxx.xxx' identified by '12345678' with grant option;

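Here xxx.xxx.xxx.xxx is the IP address to allow. This GRANT ... IDENTIFIED BY form applies to MySQL 5.x; MySQL 8.0 no longer accepts IDENTIFIED BY in GRANT, so the account has to be created with CREATE USER first.

The code below uses Python to scrape the sub-entries linked from the Baidu Baike entry "英雄联盟" and saves each link and entry name to the database.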

import re
from urllib.request import urlopen

import pymysql
from bs4 import BeautifulSoup


# Baidu Baike entry: 英雄联盟
html=urlopen("http://baike.baidu.com/subview/3049782/11262116.htm")
bsObj=BeautifulSoup(html.read(),"html.parser")

conn=pymysql.connect(host='202.115.52.234',unix_socket='/tmp/mysql.sock',user='Julian',passwd='Julian2016',db='mysql',charset='gbk')
cur=conn.cursor()
cur.execute("USE LOLset")

for node in bsObj.find("div",{"class":"main-content"}).findAll("div",{"class":"para"}):
    # Keep only links of the form /view/<digits>.htm
    links=node.findAll("a",href=re.compile(r"^(/view/)[0-9]+\.htm$"))
    for link in links:
        if 'href' in link.attrs:
            print(link.attrs['href'],link.get_text())
            # PyMySQL quotes and escapes the parameters itself, so the
            # placeholders must not be wrapped in quotes
            cur.execute("INSERT INTO pages (links,words) VALUES (%s,%s)",(link.attrs['href'],link.get_text()))
            cur.connection.commit()

cur.close()
conn.close()
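
Passing values as query parameters can be tried locally without a MySQL server. The sketch below swaps in the standard library's `sqlite3` module purely as a stand-in: its placeholder is `?` where PyMySQL uses `%s`. The two-column pages table created here is an assumption for illustration, not the author's actual table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Assumed minimal schema; the real table may differ
cur.execute("CREATE TABLE pages (links TEXT, words TEXT)")

samples = [
    ("/view/1.htm", "英雄联盟"),
    ("/view/2.htm", 'a "quoted" name'),
]
# The driver binds each value, so no quotes are written around the placeholders
cur.executemany("INSERT INTO pages (links, words) VALUES (?, ?)", samples)
conn.commit()

cur.execute("SELECT links, words FROM pages ORDER BY links")
stored = cur.fetchall()
print(stored)

cur.close()
conn.close()
```

Values that contain quotes are stored unchanged, which is the main reason to pass them as parameters rather than formatting them into the SQL string.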

Because the entry names are in Chinese, a character set that supports Chinese has to be specified when the database connection is created; this article uses gbk:

conn=pymysql.connect(host='202.115.52.234',unix_socket='/tmp/mysql.sock',user='Julian',passwd='Julian2016',db='mysql',charset='gbk')

The database is named LOLset and the table is named pages. The pages table is defined as follows: