Python Study Notes 9: 11.1-11.10 Scraping Information from the Web

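This article covers the basics of scraping information from the web with Python, including the webbrowser, requests, and BeautifulSoup modules, and selenium for controlling a browser. For selenium, it describes how to resolve the "'geckodriver' executable needs to be in PATH" error: placing geckodriver in the Firefox installation directory, editing the hosts file, and updating the environment variables. Even after these steps, the slow download speed may remain.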


These are notes I took while learning Python programming, shared for study and discussion only.

11.1 mapIt.py with the webbrowser module

>>> import webbrowser
>>> webbrowser.open('http://www.baidu.com')
True

#! python3
# mapIt.py - Launches a map in the browser using an address from the command line or clipboard.

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

webbrowser.open('https://map.baidu.com/search/' + address)
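The address handling in mapIt.py can be pulled into a small helper, which makes it easy to check without opening a browser. The following is a minimal sketch, not the book's code: build_map_url is a hypothetical name, and percent-encoding the address with urllib.parse.quote is my own addition, on the assumption that the search URL accepts an encoded address in its path.

```python
import sys
import webbrowser
from urllib.parse import quote

BASE_URL = 'https://map.baidu.com/search/'  # same prefix as in mapIt.py


def build_map_url(args, clipboard_text=''):
    """Return the map URL for command-line words, or for the clipboard text."""
    if args:
        # Join the words with single spaces so "870 Valencia St" stays intact.
        address = ' '.join(args)
    else:
        address = clipboard_text
    # quote() percent-encodes spaces and non-ASCII characters (assumption:
    # the site accepts an encoded address in the URL path).
    return BASE_URL + quote(address.strip())


def main():
    clipboard_text = ''
    if len(sys.argv) <= 1:
        import pyperclip  # third-party; only needed for the clipboard case
        clipboard_text = pyperclip.paste()
    webbrowser.open(build_map_url(sys.argv[1:], clipboard_text))
```

Calling main() at the bottom of a script gives the same behavior as mapIt.py; build_map_url can be tried on its own in the interactive shell.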

11.2 Downloading files from the web with the requests module

>>> import requests
>>> res=requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> type(res)
<class 'requests.models.Response'>
>>> res.status_code==requests.codes.ok
True
>>> len(res.text)
179378
>>> print(res.text[:250])
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare





*******************************************************************

THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A

TIME WHEN PROOFING METHODS AND TOO

11.3 Saving downloaded files to the hard drive

>>> import requests
>>> res=requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> res.raise_for_status()
>>> playFile=open('RomeoandJuliet.txt','wb')
>>> for chunk in res.iter_content(100000):
	playFile.write(chunk)

	
100000
79380
>>> palyFile.close()
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    palyFile.close()
NameError: name 'palyFile' is not defined
>>> playFile.close()
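In a script, a with statement closes the file automatically, so a mistyped close() call like the one above cannot leave the file open. The following is a minimal sketch under that idea; save_chunks and download are hypothetical helper names, and the demonstration at the bottom uses in-memory byte strings instead of a real download so that it runs offline.

```python
import tempfile
from pathlib import Path


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to `path`; return the total bytes written."""
    total = 0
    with open(path, 'wb') as out_file:   # closed automatically, even on error
        for chunk in chunks:
            total += out_file.write(chunk)
    return total


def download(url, path, chunk_size=100000):
    """Download `url` to `path` in chunks (requires the third-party requests module)."""
    import requests
    res = requests.get(url)
    res.raise_for_status()               # stop here if the download failed
    return save_chunks(res.iter_content(chunk_size), path)


# Offline demonstration with made-up data in place of res.iter_content(...).
demo_path = Path(tempfile.mkdtemp()) / 'demo.bin'
written = save_chunks([b'abc', b'de'], demo_path)
print(written)
```

For the file used in this section, the call would be download('http://www.gutenberg.org/cache/epub/1112/pg1112.txt', 'RomeoandJuliet.txt').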

11.4 HTML

11.5 Parsing HTML with the BeautifulSoup module

>>> import requests,bs4
>>> exampleFile=open('example.html')
>>> exampleSoup=bs4.BeautifulSoup(exampleFile)

Warning (from warnings module):
  File "C:/Users/VECTOR/AppData/Local/Programs/Python/Python37/mapIt.py", line 1
    #! python3
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file C:/Users/VECTOR/AppData/Local/Programs/Python/Python37/mapIt.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>
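The warning above goes away once the parser is named explicitly, as the message itself suggests. A minimal sketch, assuming bs4 is installed; the inline HTML string is a made-up stand-in for example.html, whose contents are not shown in these notes.

```python
import bs4

# Hypothetical stand-in for the contents of example.html.
html = '<html><body><p id="author">Al Sweigart</p></body></html>'

# Passing 'html.parser' as the second argument selects Python's built-in
# parser explicitly, so bs4 has no reason to emit the UserWarning.
soup = bs4.BeautifulSoup(html, 'html.parser')

elems = soup.select('#author')      # CSS selector: the element with id="author"
author_text = elems[0].getText()

print(type(soup))
print(author_text)
```

With a file object the call has the same shape: bs4.BeautifulSoup(exampleFile, 'html.parser').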

11.8 Controlling the browser with the selenium module

>>> from selenium import webdriver
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    from selenium import webdriver
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\selenium.py", line 1, in <module>
    from selenium import webdriver
ImportError: cannot import name 'webdriver' from 'selenium' (C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\selenium.py)

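The cause: a file named selenium.py had been created in the directory shown in the traceback, so Python imports that file first and never reaches the real selenium package installed in site-packages (selenium is a third-party package, not part of the standard library). Delete or rename the local file, run the import again, and it works.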
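One way to spot this kind of name clash is to ask Python which file it would import for a given module name. The following is a minimal sketch using only the standard library; module_origin is a hypothetical helper name, and json is used in the demonstration only because it is always available.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# A package reports the path of its __init__.py.
print(module_origin('json'))

# A name that cannot be imported at all gives None.
print(module_origin('no_such_module_for_this_demo'))
```

If module_origin('selenium') points at a selenium.py in your own working directory rather than at a path under site-packages, a local file is shadowing the installed package.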

>>> from selenium import webdriver
>>> browser=webdriver.Firefox()
Traceback (most recent call last):
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
    stdin=PIPE)
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    browser=webdriver.Firefox()
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 164, in __init__
    self.service.start()
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\common\service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 

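The 'geckodriver' executable cannot be found on the PATH. To fix this:
(1) Download geckodriver.exe from GitHub and place it in the Firefox installation directory, e.g. C:\Program Files\Mozilla Firefox.
If the GitHub download is extremely slow, one workaround is to edit the hosts file:
a. The hosts file is located in C:\Windows\System32\drivers\etc
b. Use https://www.ipaddress.com/ to look up the IP addresses of the following GitHub domains (the values at the time of writing):
github.com 140.82.113.3
github.global.ssl.fastly.net 199.232.69.194
c. Add these entries to the hosts file
d. Flush the DNS cache from cmd:
ipconfig /flushdns
After all that, though, it did not seem to help; the download was still slow.

(2) Add the Firefox installation directory (C:\Program Files\Mozilla Firefox) to the PATH environment variable
(3) Close IDLE and reopen it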
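Step (3) matters because a running process keeps the PATH it started with; only a newly started IDLE sees the updated value. Before calling webdriver.Firefox() again, the result of the steps above can be checked from Python. This is a minimal sketch using only the standard library; find_executable is a hypothetical helper name.

```python
import os
import shutil


def find_executable(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


location = find_executable('geckodriver')
if location is None:
    # In this state selenium raises the WebDriverException shown earlier.
    print("geckodriver is not on PATH")
else:
    print('geckodriver found at', location)
```

On Windows, shutil.which also tries the extensions listed in PATHEXT, so searching for 'geckodriver' finds geckodriver.exe.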

>>> browser=webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> browser.get('http://inventwithpython.com')
>>> try:
	elem=browser.find_element_by_class_name('bookcover')
	print('Found <%s> element with that class name!'%(elem.tag_name))
except:
	print('Was not able to find an element with that name.')

	
Was not able to find an element with that name.
>>> try:
	elem=browser.find_element_by_class_name('navbar-brand')
	print('Found <%s> element with that class name!'%(elem.tag_name))
except:
	print('Was not able to find an element with that name.')

	
Found <a> element with that class name!
>>> linkElem=browser.find_element_by_link_text('Read It Online')
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    linkElem=browser.find_element_by_link_text('Read It Online')
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in find_element_by_link_text
    return self.find_element(by=By.LINK_TEXT, value=link_text)
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: Read It Online

>>> linkElem=browser.find_element_by_link_text('Read Online for Free')
>>> type(linkElem)
<class 'selenium.webdriver.firefox.webelement.FirefoxWebElement'>
>>> linkElem.click()

passwordElem.send_keys('12345')

Source

[1] Al Sweigart. Python编程快速上手——让繁琐工作自动化 (Automate the Boring Stuff with Python) [M]. Translated by Wang Haipeng. Beijing: Posts & Telecom Press, 2016.7. p189-
