Python Beginner Notes (10): Scraping Data from the Web

12. Scraping Data from the Web
1) Project: mapIt.py using the webbrowser module
     The webbrowser module: webbrowser.open('url')
     test_1201.py

#! python3
# mapIt.py - launches a map in the browser using an address from the command line or clipboard.

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

# Include the scheme so webbrowser.open() reliably treats this as a URL.
webbrowser.open('https://www.baidu.com/s?wd=' + address)

2) Downloading files with the requests module
     Download a web page: requests.get('url')
     Check the status code: res.status_code == requests.codes.ok
     Read the page text: res.text
     Check for errors: res.raise_for_status() (a short sketch follows)
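  A minimal sketch of both error checks, using example.com as a placeholder URL:

import requests

res = requests.get('https://www.example.com/')
# Option 1: compare the status code yourself.
if res.status_code == requests.codes.ok:
    print('Downloaded %s characters.' % len(res.text))
# Option 2: let requests raise an exception for 4xx/5xx responses.
res.raise_for_status()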
3) Saving downloaded files to disk
     Save the file with res.iter_content()
     Test program: test_1202.py

import requests

res = requests.get('https://www.csdn.net/')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

# Open the file in binary mode ('wb') to preserve the page's text encoding.
playFile = open('d:/temp/csdn.txt','wb')
for chunk in res.iter_content(100000):    # write in 100,000-byte chunks
    playFile.write(chunk)

playFile.close()
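  The same save loop can be written with a with statement, which closes the file automatically even if an exception occurs (behavior is otherwise identical):

import requests

res = requests.get('https://www.csdn.net/')
res.raise_for_status()

with open('d:/temp/csdn.txt', 'wb') as playFile:
    for chunk in res.iter_content(100000):
        playFile.write(chunk)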

4) HTML
    View the page's source code.
    Open the browser's developer tools and locate HTML elements.
    test_1203.html

<html>
    <body>
        <strong>hello </strong> world!
        Al's free <a href="https://www.csdn.net">Python books</a>
    </body>
</html>

5) Parsing HTML with the bs4 module
  The Beautiful Soup module (bs4) extracts information from HTML pages.
  test_1204.py

import bs4

exampleFile = open('test_1204.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
exampleFile.close()

# Select the element whose id is "author".
elems = exampleSoup.select('#author')
print('elem:' + str(elems[0]))
print('elem.text :' + elems[0].getText())
print('elem.attrs:' + str(elems[0].attrs))

# Select all <p> elements.
pelems = exampleSoup.select('p')
print('<p>0:' + str(pelems[0]))
print('<p>0-text: ' + pelems[0].getText())

print('<p>1:' + str(pelems[1]))
print('<p>2:' + str(pelems[2]))

  test_1204.html

<!-- This is the example.html example file. -->

<html>
<head><title>The Website Title</title></head>
<body>
    <p>Download my <strong>Python</strong> book from <a href="https://www.csdn.net">my website</a>.</p>
    <p class="slogan">Learn Python the easy way!</p>
    <p>By <span id="author">Al Sweigart</span></p>
</body>
</html>

    Create a BeautifulSoup object from HTML: bs4.BeautifulSoup()
    Find elements with the select() method: soup.select()
  a) Find <a> tags:
    print(soup.select('a')) # find by tag name
  b) Find by class name, e.g. class="sister":
    print(soup.select('.sister'))
  c) Find by id, e.g. id="link1":
    print(soup.select('#link1'))
  d) Special selectors:
    Select all <title> elements whose parent is <head>. Note that the selector is the single string 'head > title', not two strings 'head >' and 'title'.
    print(soup.select('head > title'))
  e) Get the text content:
    print(soup.select('title')[0].string)
    print(soup.select('title')[0].get_text())
  Get data from an element's attributes with elem.get(); a short sketch follows.
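  A short sketch of elem.get(), reusing the author <span> from test_1204.html:

import bs4

soup = bs4.BeautifulSoup('<span id="author">Al Sweigart</span>', 'html.parser')
spanElem = soup.select('span')[0]
print(spanElem.get('id'))            # 'author'
print(spanElem.get('nonexistent'))   # None for attributes the element lacks
print(spanElem.attrs)                # {'id': 'author'}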
6) Project: opening all search results
  Google is blocked here, so it only works through a proxy.
  Baidu has anti-scraping measures: searching directly on the site works, but fetching the same URL from a program returns a page without the result links.
  test_1205.py (currently not working)

#! python3
# searchpypi.py - Opens several search results.

import requests,sys,webbrowser,bs4
import logging
logging.basicConfig(level=logging.DEBUG,format='%(asctime)s - %(levelname)s - %(message)s')

searchUrl = 'https://www.baidu.com/s?wd=site:finance.sina.com.cn ' + ' '.join(sys.argv[1:])
logging.info('Search...' + searchUrl) # display text while downloading the search result page
webbrowser.open(searchUrl)

res = requests.get(searchUrl)
res.raise_for_status()

# Retrieve top search result links.
logging.debug('res len:' + str(len(res.text)))

pFile= open('d:/temp/res.html','wb')
for chunk in res.iter_content(10000):
    pFile.write(chunk)
pFile.close()

soup = bs4.BeautifulSoup(res.text,'html.parser')

# Open a browser tab for each result.
# Note: this selector is an assumption about Baidu's markup; a[href^="..."]
# matches links whose href starts with the given prefix. Adjust it to the real markup.
linkElems = soup.select('a[href^="https://www.baidu.com"]')

numOpen = min(5,len(linkElems))
logging.debug('numOpen:' + str(numOpen))


for i in range(numOpen):
    urlToOpen = linkElems[i].get('href')
    print('Open--',urlToOpen)
    webbrowser.open(urlToOpen)
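  One mitigation worth trying against Baidu's anti-scraping checks (with no guarantee it is enough) is sending a browser-like User-Agent header, which requests.get() accepts through its headers parameter:

import requests

# Example User-Agent value; any common browser string can be used here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
res = requests.get('https://www.baidu.com/s?wd=python', headers=headers)
print(res.status_code, len(res.text))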

7) Project: downloading all XKCD comics
  This uses the XKCD Chinese mirror (xkcd.in); its page elements differ from the book's, so the code was adapted, and a limit of the first ten comic pages was added.
  test_1206.py

#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests,os,bs4,logging

os.chdir('d:/temp')

urlXkcd = 'https://xkcd.in/'            # starting url

url = urlXkcd
os.makedirs('d:/temp/xkcd',exist_ok = True)     # store comics in ./xkcd
for phn in range(10):       # limit the crawl to the first ten pages
    if url.endswith('#'):   # the final page's Next link points to '#'
        break
    # Download the page.
    print('No. %s Downloading page %s...' % (phn,url))
    res = requests.get(url)
    res.raise_for_status()
   
    print('res len:' + str(len(res.text)))

    # Debug: uncomment to save the raw page for inspection.
    #pFile = open('d:/temp/img_res.html','wb')
    #for chunk in res.iter_content(10000):
    #    pFile.write(chunk)
    #pFile.close()
  
    soup = bs4.BeautifulSoup(res.text,'html.parser')

    # Find the Url of the comic image.
    comicElem = soup.select('.comic-body a img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = urlXkcd + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd.
        imageFile = open(os.path.join('d:/temp/xkcd',os.path.basename(comicUrl)),'wb')

        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
        
    # Get the Next button's url
    nextLink = soup.select('.nextLink a')[0]
    url = urlXkcd + str(nextLink.get('href'))

print('Done.')
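  The comicUrl concatenation above assumes every img src is relative to the site root. If the site mixes relative and absolute src values, urllib.parse.urljoin from the standard library handles both:

from urllib.parse import urljoin

print(urljoin('https://xkcd.in/', '/comics/123.png'))
# -> https://xkcd.in/comics/123.png
print(urljoin('https://xkcd.in/', 'https://other.example/img.png'))
# -> https://other.example/img.png (absolute URLs pass through unchanged)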

8) Controlling the browser with the selenium module
  from selenium import webdriver
  browser = webdriver.Chrome() # use the Chrome browser
  browser.get('http://www.baidu.com')
  WebDriver methods for finding elements on the page, e.g.:
    find_element_by_class_name(name)
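  A minimal sketch (selenium 3 API; assumes ChromeDriver is installed and on the PATH):

from selenium import webdriver

browser = webdriver.Chrome()              # launches a visible Chrome window
browser.get('http://www.baidu.com')
# 'some-class' is a placeholder; inspect the page for a real class name.
elem = browser.find_element_by_class_name('some-class')
print(elem.text)
browser.quit()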
 
