How to extract the relevant links from a website and save them to a .csv file

1. Problem Background

When scraping a website, you often need to extract a specific kind of content, such as a particular class of links, and then save those links to a table file (for example, a CSV file) for later analysis.

In Python 2 you can use the urllib2 library to fetch the page and the BeautifulSoup library to parse its content. In practice, however, problems come up: the links are not extracted correctly, or the extracted links cannot be written out to a table file.
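The basic pattern is: download the HTML, parse it, pick out the <a href> elements you care about, and write one row per link. Below is a minimal sketch of that pattern, assuming Python 2 (urllib2 and urlparse only exist there); the target URL and output file name are placeholders for illustration only.

import csv
import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

url = 'http://example.com'                 # hypothetical target page
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

with open('links.csv', 'wb') as f:         # 'wb' is the correct csv mode on Python 2
    writer = csv.writer(f)
    for a in soup.find_all('a', href=True):
        # make relative links absolute before saving them
        writer.writerow([urljoin(url, a['href'])])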

2. Solution

import csv
import urllib2
from datetime import datetime
from urlparse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://en.wikipedia.org'
page = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page, 'html.parser')  # pass an explicit parser to avoid the 'no parser specified' warning

# build a mapping: citation id -> list of absolute reference URLs
references = {}
for item in soup.select('ol.references li[id]'):
    links = [a['href'] if a['href'].startswith('http') else urljoin(base_url, a['href'])
             for a in item.select('span.reference-text a[href]')]
    references[item['id']] = links


# collect everything that follows the '20th century' heading, up to the next <h2>
events = soup.find('span', id='20th_century').parent.find_next_siblings()
with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    for tag in events:
        if tag.name == 'h2':
            break

        for event in tag.find_all('li'):
            # parse the leading 'Month day, Year:' prefix and reformat the date
            try:
                date_string, _ = event.text.split(':', 1)
                date = datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y')
            except ValueError:
                continue

            # extract the citation links for this event and write one row per referenced URL;
            # the 'x and ...' guard skips <a> tags without an href, and references.get()
            # skips citation ids that were not collected above
            links = event.find_all('a', href=lambda x: x and x.startswith('#cite_note-'))
            if links:
                for link in links:
                    for ref in references.get(link['href'][1:], []):
                        writer.writerow([date, ref])
            else:
                writer.writerow([date, ''])

That is the complete solution. Running it produces output.csv, a two-column table with one row per (event date, reference URL) pair; an event with no citation gets a single row with an empty second column.
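Note that urllib2 and urlparse exist only on Python 2. A hedged sketch of the same idea on Python 3 follows: the imports move to urllib.request and urllib.parse, and the csv file is opened in text mode with newline='' instead of 'wb'. For brevity this sketch only writes out the collected reference URLs keyed by citation id (to a hypothetical references.csv); the event-parsing loop from the solution carries over unchanged.

import csv
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://en.wikipedia.org'
page = urlopen('http://en.wikipedia.org/wiki/List_of_human_stampedes')
soup = BeautifulSoup(page, 'html.parser')

# on Python 3 the csv module expects a text-mode file opened with newline=''
with open('references.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in soup.select('ol.references li[id]'):
        for a in item.select('span.reference-text a[href]'):
            href = a['href'] if a['href'].startswith('http') else urljoin(base_url, a['href'])
            writer.writerow([item['id'], href])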
