Python web scraping (Douban): "Web Scraping with Python", Day 3

Latest recommended article published on 2022-10-17 19:14:06

weixin_39747075

Article tags: Python web scraping, Douban

7-17

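Six Degrees of Wikipedia

This is the small-world phenomenon: any two people who do not know each other can be connected through a small number of intermediaries.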

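Links that point to article pages share three traits:

1. They sit inside the div whose id is bodyContent.

2. Their URLs contain no colon.

3. Their URLs all begin with /wiki/.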
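To see the URL rules in isolation, here is a minimal offline sketch that applies the same regular expression the crawler uses to a handful of made-up href values. The sample list and the helper name filter_article_links are assumptions for illustration only. The regex covers only the second and third traits; the first (being inside div#bodyContent) is handled by the HTML lookup, not the pattern.

```python
import re

# Same pattern the crawler uses: begins with /wiki/ and has no colon after it.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values of the kind found on an article page.
sample_hrefs = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_film_actors",
    "/wiki/Talk:Kevin_Bacon",
    "//wikimediafoundation.org/wiki/Privacy_policy",
    "#cite_note-1",
    "/wiki/Footloose_(1984_film)",
]


def filter_article_links(hrefs):
    """Keep only hrefs that look like links to article pages."""
    return [h for h in hrefs if ARTICLE_LINK.match(h)]


article_links = filter_article_links(sample_hrefs)
print(article_links)
```

Namespace pages such as Category: and Talk: are dropped because of the colon, and anything that does not start with /wiki/ is dropped by the prefix.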

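Next, the article links on the starting page become the initial link list. A loop then picks one article link from the page, extracts its href attribute, prints the link, and passes it back into getLinks to fetch a fresh list of links.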

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed with a float timestamp; newer Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Article links: inside div#bodyContent, start with /wiki/, contain no colon
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick a random article link, print it, and follow it
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

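The datetime module

The datetime module defines five classes:

1. datetime.date: represents a date

2. datetime.datetime: represents a date and time together

3. datetime.time: represents a time of day

4. datetime.timedelta: represents a duration, that is, the gap between two points in time

5. datetime.tzinfo: time zone information

In short, the whole module deals with date and time information: https://www.cnblogs.com/cindy-cindy/p/6720196.html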
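A short sketch of how these classes fit together. The date and time values are arbitrary placeholders, and timezone.utc is the ready-made tzinfo instance from the standard library, used here only so that tzinfo has something concrete to show.

```python
from datetime import date, time, datetime, timedelta, timezone

# Arbitrary example values, chosen only for illustration.
d = date(2020, 7, 17)                 # a calendar date
t = time(19, 14, 6)                   # a time of day
dt = datetime.combine(d, t)           # date and time together

one_week_later = dt + timedelta(days=7)   # shift by a duration
gap = one_week_later - dt                 # subtracting datetimes gives a timedelta

# Attach time zone information (a tzinfo instance) to a naive datetime.
utc_dt = dt.replace(tzinfo=timezone.utc)

print(dt.isoformat())
print(one_week_later.date())
print(gap.total_seconds())
print(utc_dt.utcoffset())
```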

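The re module

re is Python's regular expression module.

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." Jamie Zawinski (a line I came across and fully agree with, haha)

One important function in the re module is compile(pattern[, flags]), which builds a pattern object from a string containing a regular expression. When you call search, match, or findall with a plain string, Python first has to convert that string into a regular expression object; after a single compile call, the pattern object can be reused without repeating the conversion. Once compiled, a call written as re.search(pattern, string) becomes pattern.search(string). In short, compile saves the repeated conversion when the same pattern is used many times.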

https://www.jianshu.com/p/eb87f02a7e34
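The two calling styles can be compared side by side. This is a minimal sketch; the sample strings are made up, and the pattern is the article-link regex from earlier in these notes.

```python
import re

PATTERN_TEXT = r"^(/wiki/)((?!:).)*$"
href = "/wiki/Kevin_Bacon"

# Style 1: pass the pattern string to the module-level function.
via_module = re.search(PATTERN_TEXT, href)

# Style 2: compile once, then call methods on the pattern object.
article_link = re.compile(PATTERN_TEXT)
via_pattern = article_link.search(href)

# Both styles find the same match.
same_result = via_module.group(0) == via_pattern.group(0)
print(same_result)

# A compiled pattern can be reused across many strings.
candidates = ["/wiki/Python", "/wiki/Help:Contents", "/w/index.php"]
matches = [c for c in candidates if article_link.search(c)]
print(matches)

# findall works the same way on a pattern object.
numbers = re.compile(r"\d+").findall("a1 b22 c333")
print(numbers)
```

The re module also keeps an internal cache of recently compiled patterns, so in small scripts the speed gain is usually modest; a named pattern object is often worth having for readability alone.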

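random

The program starts by seeding the random number generator with the current system time, so that a different sequence of Wikipedia articles is chosen on each run.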
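The effect of the seed can be shown without touching the network. In this sketch the helper pick_sequence is a hypothetical name, and a separate random.Random instance is used instead of the module-level generator so that the two seeds do not interfere with each other.

```python
import datetime
import random


def pick_sequence(seed, n=5):
    """Return n pseudo-random integers in [0, 99] produced from the given seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


# A fixed seed reproduces the same sequence every time.
fixed_a = pick_sequence(42)
fixed_b = pick_sequence(42)
print(fixed_a == fixed_b)

# Seeding from the clock gives a sequence that changes from run to run.
time_seed = datetime.datetime.now().timestamp()
time_based = pick_sequence(time_seed)
print(time_based)
```

A fixed seed is handy when debugging a crawler, because the same path through the links can be replayed.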

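Deduplicating links

While the code runs, store every link already discovered in a structure that is cheap to query (Python's set type), so that only new links get crawled.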

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages  # module-level set of links already seen
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have found a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
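The set-based deduplication can be tried offline. The sketch below is not the approach used in these notes: it swaps the recursion for an explicit queue, since a recursive crawl of a large site can run into Python's default recursion limit of 1000. FAKE_SITE is a made-up link graph standing in for real pages, and crawl is a hypothetical helper name.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it.
FAKE_SITE = {
    "": ["/wiki/A", "/wiki/B"],
    "/wiki/A": ["/wiki/B", "/wiki/C", ""],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": ["/wiki/A", "/wiki/D"],
    "/wiki/D": [],
}


def crawl(start, get_links):
    """Visit every reachable page once, using a set to skip links already seen."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:      # only new links are queued
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("", lambda page: FAKE_SITE.get(page, []))
print(visited)
```

Against a real site, get_links would be replaced by a function that downloads the page and extracts its /wiki/ links.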

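Collecting data across the site

First settle on a collection pattern. Every title sits under h1 → span, and each page has only one h1.

All body text lives under div#mw-content-text → p. Edit links appear only on article pages, under li#ca-edit → span → a.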

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        # Title, first paragraph of the body, and the edit link
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        # A missing element makes find() return None, so the next call raises
        print("This page is missing something! No need to worry though!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have found a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

The final line calls getLinks with an empty URL, which resolves to the Wikipedia home page.
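The collection pattern for the title and the first body paragraph can be checked on a small inline document. This sketch deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; it is a substitute technique, not the method used in these notes. SAMPLE_HTML and the class ArticleParser are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment that mimics the structure described in the notes.
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading"><span>Kevin Bacon</span></h1>
<p>Outside the content div.</p>
<div id="mw-content-text">
  <div class="infobox">Infobox text</div>
  <p>Kevin Bacon is an <a href="/wiki/Actor">actor</a>.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


class ArticleParser(HTMLParser):
    """Collects the h1 text and the first p inside div#mw-content-text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.first_paragraph = ""
        self._in_h1 = False
        self._content_depth = 0   # greater than 0 while inside the content div
        self._in_p = False
        self._p_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_h1 = True
        elif tag == "div":
            if self._content_depth > 0:
                self._content_depth += 1      # nested div inside the content div
            elif attrs.get("id") == "mw-content-text":
                self._content_depth = 1
        elif tag == "p" and self._content_depth > 0 and not self._p_done:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "div" and self._content_depth > 0:
            self._content_depth -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            self._p_done = True               # keep only the first paragraph

    def handle_data(self, data):
        if self._in_h1:
            self.title += data
        elif self._in_p:
            self.first_paragraph += data


parser = ArticleParser()
parser.feed(SAMPLE_HTML)
title = parser.title.strip()
first_paragraph = parser.first_paragraph.strip()
print(title)
print(first_paragraph)
```

On a page without an h1 or without the content div, both fields simply stay empty, which plays the same role as the except AttributeError branch in the crawler.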

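Collecting internal and external links

The Scrapy library

Scrapy is a library that takes much of the complexity out of finding and classifying links on web pages. According to the book, Scrapy currently supports only Python 2.7 (later Scrapy releases added Python 3 support), so I am not trying it out yet.

A Scrapy crawler needs some setup first: https://www.jianshu.com/p/be856bc15afb (bookmarked for later)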
