Python Web Data Collection (1)

Latest recommended article published on 2023-05-06 10:16:50

thomasdxsn

Categories: Reading, Crawlers. Tags: crawler, reading, python

Article link: https://blog.csdn.net/thomasdxsn/article/details/53787025

Python Web Data Collection

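This is my first blog post, written to keep a record of what I am learning. Today I sent too many requests and Wikipedia blocked my IP.

Here is a small crawler: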

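import re

import requests
from bs4 import BeautifulSoup

pages = set()   # set of URLs already seen, used for de-duplication


def get_link(pageUrl):
    """Search the page for href links starting with /wiki/, follow the first unseen link found and continue from there, adding each one to the set for de-duplication."""
    global pages
    html = requests.get('http://en.wikipedia.org' + pageUrl).text
    soup = BeautifulSoup(html, 'html.parser')

    for link in soup.find_all("a", href=re.compile('^(/wiki/)')):
        if link.attrs['href'] is not None:
            newLink = link.attrs['href']
            fullUrl = 'http://en.wikipedia.org' + newLink
            # compare the full URL, since that is what the set stores
            if fullUrl not in pages:
                print(newLink)
                pages.add(fullUrl)
                with open('pages.txt', 'w') as f:
                    # join the set into one newline-separated string
                    pageStr = '\n'.join(pages)
                    f.write(pageStr)
                get_link(newLink)

# Passing an empty string on the first call is enough
get_link('')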
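
The IP block mentioned above is what tends to happen when a recursive crawler requests pages as fast as it can. Below is a rough sketch of a gentler variant, not the code I actually ran: it walks the links with a queue instead of recursion, sleeps between requests, and stops after a fixed number of pages. To keep it self-contained it swaps requests and BeautifulSoup for the standard library (urllib and html.parser). The names WikiLinkParser, extract_wiki_links, fetch, crawl, max_pages, delay and fetch_page, as well as the User-Agent string, are all made up for illustration, and the one-second delay is an arbitrary guess rather than a limit documented by Wikipedia.

```python
import re
import time
from collections import deque
from html.parser import HTMLParser
from urllib.request import Request, urlopen

BASE_URL = "http://en.wikipedia.org"
WIKI_HREF = re.compile(r"^/wiki/")


class WikiLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with /wiki/."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and WIKI_HREF.match(href):
            self.hrefs.append(href)


def extract_wiki_links(html):
    """Return the /wiki/ hrefs found in an HTML string, in document order."""
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.hrefs


def fetch(path):
    """Download one page; the User-Agent value is a placeholder (assumption)."""
    req = Request(BASE_URL + path, headers={"User-Agent": "learning-crawler/0.1"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_path, max_pages=20, delay=1.0, fetch_page=fetch):
    """Breadth-first crawl: pause between requests, stop after max_pages fetches."""
    seen = {start_path}          # every path discovered so far
    queue = deque([start_path])  # paths waiting to be fetched
    fetched = 0
    while queue and fetched < max_pages:
        path = queue.popleft()
        if fetched:
            time.sleep(delay)    # wait before every request except the first
        html = fetch_page(path)
        fetched += 1
        for href in extract_wiki_links(html):
            if href not in seen:
                seen.add(href)
                queue.append(href)
    return seen
```

Calling crawl('', max_pages=5) would fetch at most five pages, starting from the home page, and return the set of /wiki/ paths discovered along the way. The fetch_page parameter exists so the download step can be replaced, for example with a function that reads from a dict of saved HTML, which makes it possible to try the logic without sending any requests at all.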