Python_2018-10-24_Machine Learning - Python - Web Scraping

智能之心

Category: Machine Learning. Tags: data mining, Python utilities, python

Article link: https://blog.csdn.net/weixin_41275726/article/details/83340234

Tutorial code

# coding:utf-8
'''
Step 1:
We will scrape turnstile data for New York public transit subway stations
from this page:
http://web.mta.info/developers/turnstile.html
'''
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

'''
Step 2:
Inspect the page source to find where the links sit (sample below), then start coding.
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
'''
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
print(response)  # <Response [200]> means the request succeeded

'''
Step 3:
Parse the HTML into BeautifulSoup's nested data structure.
'''
soup = BeautifulSoup(response.text, "html.parser")

'''
Step 4:
Find every <a> tag; these hold the links we want to scrape.
'''
soup.find_all('a')

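'''
Step 5:
Extract the actual link we want. Test with the first data link,
which sat at index 36 of the <a> tag list when this was written.
The printed link is: data/nyct/turnstile/turnstile_181020.txt
'''
one_a_tag = soup.find_all('a')[36]
link = one_a_tag['href']
print(link)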

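'''
Step 6:
Build the full link, download_url, and the local path to save to.
urlretrieve(url, filename=None, reporthook=None, data=None)
  url: the link to download
  filename: the local path where the file is saved
'''
download_url = 'http://web.mta.info/developers/' + link
print(download_url)
print('./'+link[link.find('/turnstile_')+1:])  # ./turnstile_181020.txt for the link above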
'''
Step 7:
Download the file to the local disk.
'''
urllib.request.urlretrieve(download_url, './'+link[link.find('/turnstile_')+1:])
time.sleep(1)  # Pause for a second so repeated requests are not mistaken for spam and blocked.
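
The tutorial code above picks the first data link by a hard-coded position (index 36), which only holds for the page layout at the time of writing. A minimal sketch of a less brittle approach, assuming the data links keep the data/nyct/turnstile/turnstile_*.txt pattern shown in the sample link: collect every href and keep only those that match the pattern. To stay runnable without extra installs, this sketch swaps BeautifulSoup for the standard library's html.parser. SAMPLE_HTML, HrefCollector, and find_turnstile_links are hypothetical names, and the sample markup is a made-up stand-in for the real page.

```python
from html.parser import HTMLParser
import re

# Made-up stand-in for the real page; only the link pattern comes from the tutorial.
SAMPLE_HTML = """
<a href="index.html">Home</a>
<a href="data/nyct/turnstile/turnstile_181020.txt">Saturday, October 20, 2018</a>
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
<a name="top">anchor without an href</a>
"""

# Assumption: data files are named turnstile_ plus six digits plus .txt.
TURNSTILE_HREF = re.compile(r"data/nyct/turnstile/turnstile_\d{6}\.txt$")


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip anchors that have no href
                self.hrefs.append(href)


def find_turnstile_links(html):
    """Return the hrefs that look like turnstile data files, in page order."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if TURNSTILE_HREF.search(h)]


links = find_turnstile_links(SAMPLE_HTML)
print(links)
```

With BeautifulSoup the same filter could be applied to the href of each tag returned by soup.find_all('a'); the point is to select by pattern, not by position.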

Data source

We will scrape turnstile data for New York public transit subway stations from this page:
http://web.mta.info/developers/turnstile.html
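
The sample link in the tutorial code pairs the file name turnstile_180922.txt with the label "Saturday, September 22, 2018", which suggests the six digits are a YYMMDD date. A small sketch under that assumption, useful for sorting or selecting files by date before downloading; file_date is a hypothetical helper, not part of the original script.

```python
from datetime import datetime
import posixpath


def file_date(link):
    """Parse the date out of a link such as .../turnstile_180922.txt.

    Assumes the YYMMDD naming convention inferred from the sample link.
    """
    name = posixpath.basename(link)    # 'turnstile_180922.txt'
    stem = name.rsplit(".", 1)[0]      # 'turnstile_180922'
    digits = stem.split("_", 1)[1]     # '180922'
    return datetime.strptime(digits, "%y%m%d").date()


d = file_date("data/nyct/turnstile/turnstile_180922.txt")
print(d.isoformat())
```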

Applied code

# coding:utf-8
'''
http://web.mta.info/developers/turnstile.html
'''
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

'''
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
'''
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
print(response)

soup = BeautifulSoup(response.text, "html.parser")

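# To download the whole data set, loop over every <a> tag from index 36 onward.
a_tags = soup.find_all('a')  # parse the tag list once, not on every iteration
for one_a_tag in a_tags[36:]:  # slicing avoids the IndexError that range(36, len(...)+1) raises
    link = one_a_tag.get('href', '')
    if '/turnstile_' not in link:  # skip anchors that are not data files
        continue
    print(one_a_tag)
    download_url = 'http://web.mta.info/developers/' + link
    urllib.request.urlretrieve(download_url, './'+link[link.find('/turnstile_')+1:])
    time.sleep(1)  # pause for a second between downloads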
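
Two details of the download loop can be checked without touching the network: how the download URL and local file name are derived from an href, and which indexes are valid when iterating a list from a starting offset. The sketch below uses urllib.parse.urljoin and posixpath.basename as alternatives to the string concatenation and slicing in the script. BASE_PAGE and download_target are hypothetical names, and the tags list is a made-up stand-in for the page's <a> tags.

```python
from urllib.parse import urljoin
import posixpath

BASE_PAGE = "http://web.mta.info/developers/turnstile.html"


def download_target(href):
    """Return (absolute URL, local save path) for a relative href."""
    url = urljoin(BASE_PAGE, href)                # resolves relative to the page's directory
    local_path = "./" + posixpath.basename(href)  # keep only the file name
    return url, local_path


url, path = download_target("data/nyct/turnstile/turnstile_181020.txt")
print(url)
print(path)

# Iterating from a start offset: a slice stops at the last element,
# while a range that runs to len(tags) + 1 indexes one past the end.
tags = ["a0", "a1", "a2", "a3"]
start = 2
tail = tags[start:]

overran = False
try:
    for i in range(start, len(tags) + 1):
        tags[i]
except IndexError:
    overran = True

print(tail, overran)
```

urljoin also copes with absolute hrefs and leading slashes, which plain concatenation does not.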

Reference

https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460
