Python crawlers: using the pymysql library (2)

Latest recommended article published on 2023-08-30 08:38:27

田野里的秋刀鱼仔

Category: Python web data collection. Tags: pymysql, python, regular expressions, random, crawler

Article link: https://blog.csdn.net/anmo1221/article/details/77849161

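pymysql is a third-party library, so it has to be installed before you can use it.

As with other packages, pip does the job (pip install pymysql). Installation is simple, so I won't walk through it.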
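If you want to confirm that the install worked before running the crawler, a small check like the one below can help. It is only a sketch: the helper name `is_installed` is made up for this example, and it uses the standard library's `importlib.util.find_spec` rather than anything from pymysql itself.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once pymysql is importable in the current environment.
print("pymysql installed:", is_installed("pymysql"))
```

If it prints False, pip probably installed the package into a different Python environment than the one running the script.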

The example below shows how to combine this library with a crawler:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re                  # regular expression library
import datetime
import random
import pymysql

config = {
          'host':'localhost',
          'port':3306,
          'user':'root',
          'password':'root',
          'database':'scraping',   # 'db' is an older alias for 'database'
          'charset':'utf8'         # character encoding used by the connection
          }
conn=pymysql.connect(**config)
cur=conn.cursor()              # get a cursor object from the connection

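# A cursor offers two kinds of methods: 1. running statements, 2. fetching results
# execute(query, args): runs a single SQL statement; takes the SQL text and its parameters, and returns the number of affected rows
# fetchone(): returns the next result row
# fetchall(): returns all remaining result rows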

random.seed(datetime.datetime.now().timestamp())   # seed with the current time; random.seed() rejects datetime objects on Python 3.11+

def store(title,content):
	# %s placeholders let pymysql escape the values safely
	cur.execute("INSERT INTO pages (title,content) VALUES (%s,%s)",(title,content))
	cur.connection.commit()           # commit so the inserted row is saved

def getLinks(articleUrl):
	html=urlopen("http://en.wikipedia.org"+articleUrl)
	bsObj=BeautifulSoup(html,"html.parser")
	title=bsObj.find("h1").get_text()
	# first paragraph of the article body
	content=bsObj.find("div",{"id":"mw-content-text"}).find("p").get_text()
	store(title,content)
	# keep only internal article links: they start with /wiki/ and contain no colon
	return bsObj.find("div",{"id":"bodyContent"}).find_all("a",href=re.compile("^(/wiki/)((?!:).)*$"))

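links=getLinks("/wiki/Kevin_Bacon")
try:
	while len(links)>0:
		# pick a random link on the page and follow it
		newArticle = links[random.randint(0,len(links)-1)].attrs['href']
		print(newArticle)
		links = getLinks(newArticle)
finally:
	# always release the cursor and the connection, even on errors
	cur.close()
	conn.close()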
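The regular expression passed to the link search in getLinks is the least obvious part of the crawler, so here is a small sketch that tries it on a few hrefs without touching the network. The sample hrefs and the helper name `is_article_link` are invented for illustration; only the pattern itself comes from the code above.

```python
import re

# Same pattern as in getLinks: the href must start with /wiki/
# and must not contain a colon anywhere after that prefix.
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")


def is_article_link(href):
    """Return True if href looks like an internal article link."""
    return ARTICLE_LINK.match(href) is not None


# Made-up sample hrefs, chosen to show what passes and what is filtered out.
samples = [
    "/wiki/Kevin_Bacon",                     # ordinary article: kept
    "/wiki/Category:American_actors",        # namespace page (has a colon): dropped
    "/wiki/File:Example.jpg",                # file page (has a colon): dropped
    "https://example.org/wiki/Kevin_Bacon",  # external link: dropped
    "/w/index.php?title=Kevin_Bacon",        # not under /wiki/: dropped
]

for href in samples:
    print(href, is_article_link(href))
```

The negative lookahead `(?!:)` is what rejects pages such as category or file pages, whose paths contain a colon.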

When you are done with the database, remember to close the cursor and the connection so that connections are not leaked.
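To try the cursor methods and the close-when-done habit without a running MySQL server, the sketch below uses the standard library's sqlite3 module as a stand-in. That is a substitution, not the article's setup: sqlite3 uses `?` placeholders where pymysql uses `%s`, and the `pages` table schema here is an assumption, since the real table definition is not shown. `contextlib.closing` is used so that `close()` is called even if an exception is raised.

```python
import sqlite3
from contextlib import closing


def store(cur, title, content):
    # sqlite3 uses ? placeholders; with pymysql the same call would use %s.
    cur.execute("INSERT INTO pages (title, content) VALUES (?, ?)", (title, content))
    cur.connection.commit()  # commit so the inserted row is saved


def demo():
    # closing() calls .close() on exit, even when an exception is raised.
    with closing(sqlite3.connect(":memory:")) as conn:
        with closing(conn.cursor()) as cur:
            # Assumed schema, for illustration only.
            cur.execute(
                "CREATE TABLE pages ("
                "id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
            )
            store(cur, "Kevin Bacon", "placeholder text one")
            store(cur, "Footloose", "placeholder text two")

            cur.execute("SELECT title FROM pages ORDER BY id")
            first = cur.fetchone()  # the next row, as a tuple
            rest = cur.fetchall()   # all remaining rows, as a list of tuples
    return conn, first, rest


conn, first, rest = demo()
print(first, rest)
```

After `demo()` returns, both the cursor and the connection are already closed, which is the same outcome the `finally` block in the crawler aims for.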