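Python Study Notes 9: 11.1-11.10 Scraping Information from the Web
These are notes I took while learning Python programming. They are for study and discussion only.
11.1 mapIt.py with the webbrowser module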
>>> import webbrowser
>>> webbrowser.open('http://www.baidu.com')
True
#! python3
# mapIt.py - Launches a map in the browser using an address from the command line or clipboard.
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Get address from command line; join the words with spaces.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()
webbrowser.open('https://map.baidu.com/search/' + address)
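An address typed on the command line or copied from a web page may contain spaces or non-ASCII characters. Below is a minimal sketch that percent-encodes the address before building the same Baidu Maps URL. `build_map_url` is a hypothetical helper name, not part of the book's mapIt.py, and the sketch assumes the `https://map.baidu.com/search/` pattern used in the script above.

```python
from urllib.parse import quote

# Base URL taken from the mapIt.py script above.
BASE_URL = 'https://map.baidu.com/search/'

def build_map_url(words):
    """Join address words with spaces and percent-encode the result.

    build_map_url is a hypothetical helper, not part of the original script.
    """
    address = ' '.join(words).strip()
    # safe='' also encodes '/', so the address stays a single path segment.
    return BASE_URL + quote(address, safe='')

print(build_map_url(['870', 'Valencia', 'St']))
```

The returned string can be passed to webbrowser.open() in place of the plain concatenation.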
11.2 Downloading files from the web with the requests module
>>> import requests
>>> res=requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> type(res)
<class 'requests.models.Response'>
>>> res.status_code==requests.codes.ok
True
>>> len(res.text)
179378
>>> print(res.text[:250])
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare
*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A
TIME WHEN PROOFING METHODS AND TOO
11.3 Saving downloaded files to the hard drive
>>> import requests
>>> res=requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
>>> res.raise_for_status()
>>> playFile=open('RomeoandJuliet.txt','wb')
>>> for chunk in res.iter_content(100000):
	playFile.write(chunk)
100000
79380
>>> palyFile.close()
Traceback (most recent call last):
File "<pyshell#15>", line 1, in <module>
palyFile.close()
NameError: name 'palyFile' is not defined
>>> playFile.close()
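The NameError above came from a mistyped variable name in the manual close() call. Below is a sketch of the same chunked write using a `with` block, which closes the file automatically. `save_chunks` is a hypothetical helper name, and the byte strings stand in for what `res.iter_content(100000)` would yield, so the sketch runs without a network connection.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the total bytes written.

    save_chunks is a hypothetical helper; chunks can be any iterable of bytes,
    for example res.iter_content(100000) from a requests response.
    """
    total = 0
    # 'wb' keeps the bytes exactly as received; the with block closes the file.
    with open(path, 'wb') as out_file:
        for chunk in chunks:
            total += out_file.write(chunk)
    return total

# Stand-in data instead of a real download.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
written = save_chunks([b'Romeo ', b'and ', b'Juliet'], demo_path)
print(written)
```

With a real response, the call would look like save_chunks(res.iter_content(100000), 'RomeoandJuliet.txt').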
11.4 HTML
11.5 Parsing HTML with the BeautifulSoup module
>>> import requests,bs4
>>> exampleFile=open('example.html')
>>> exampleSoup=bs4.BeautifulSoup(exampleFile)
Warning (from warnings module):
File "C:/Users/VECTOR/AppData/Local/Programs/Python/Python37/mapIt.py", line 1
#! python3
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 1 of the file C:/Users/VECTOR/AppData/Local/Programs/Python/Python37/mapIt.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>
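The UserWarning above goes away when the parser is named explicitly, as the message itself suggests. Below is a small sketch that assumes the bs4 package is installed. The HTML string is made-up sample data, not the contents of example.html.

```python
import bs4

# Made-up sample HTML; the real example.html is not reproduced in these notes.
html = '<html><body><p>Written by <span id="author">Al Sweigart</span></p></body></html>'

# Naming the parser explicitly avoids the "No parser was explicitly specified" warning.
soup = bs4.BeautifulSoup(html, features='html.parser')

# select() takes a CSS selector and returns a list of matching Tag objects.
elems = soup.select('#author')
print(type(soup))
print(elems[0].get_text())
```

The same `features='html.parser'` argument works when a file object is passed in, as in the session above.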
11.8 Controlling the browser with the selenium module
>>> from selenium import webdriver
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
from selenium import webdriver
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\selenium.py", line 1, in <module>
from selenium import webdriver
ImportError: cannot import name 'webdriver' from 'selenium' (C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\selenium.py)
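Cause: a file named selenium.py had been created in the directory shown in the path above, so Python finds and imports that file instead of the installed selenium package. Note that selenium is a third-party package under site-packages, not part of the standard library. After deleting or renaming the stray file, the import works normally.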
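To see which file an import name resolves to without actually importing it, the standard library offers importlib.util.find_spec. Below is a sketch; `module_origin` is a hypothetical helper name.

```python
import importlib.util

def module_origin(name):
    """Return the file path an import of `name` would load, or None if not found.

    module_origin is a hypothetical helper built on importlib.util.find_spec.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin

# For a package this is normally the path of its __init__.py.
print(module_origin('json'))
# A stray selenium.py next to the script would show up here instead of
# a path under site-packages; None means the name cannot be found at all.
print(module_origin('selenium'))
```

If the printed path points into the working directory rather than site-packages, a local file is shadowing the package.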
>>> from selenium import webdriver
>>> browser=webdriver.Firefox()
Traceback (most recent call last):
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start
stdin=PIPE)
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
restore_signals, start_new_session)
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1178, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
browser=webdriver.Firefox()
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\firefox\webdriver.py", line 164, in __init__
self.service.start()
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\common\service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
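The 'geckodriver' executable cannot be found on PATH. Solution:
(1) Download geckodriver.exe (hosted on GitHub) and put it in the Firefox installation directory, e.g. C:\Program Files\Mozilla Firefox.
GitHub download very slow? Try editing the hosts file:
a. The hosts file is in C:\Windows\System32\drivers\etc
b. Use https://www.ipaddress.com/ to look up the IP addresses of the following GitHub domains:
github.com 140.82.113.3
github.global.ssl.fastly.net 199.232.69.194
c. Add these entries to the hosts file
d. Flush the DNS cache from a cmd prompt:
ipconfig /flushdns
I did all of this, and yet it did not seem to help much; the download was still slow.
(2) Add the Firefox installation directory (C:\Program Files\Mozilla Firefox) to the PATH environment variable
(3) Close IDLE and reopen it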
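Before trying webdriver.Firefox() again, it can help to confirm that the driver is actually visible on PATH. Below is a sketch using the standard library's shutil.which; `find_driver` is a hypothetical helper name.

```python
import shutil

def find_driver(name='geckodriver'):
    """Return the full path of the executable `name` if it is on PATH, else None.

    find_driver is a hypothetical helper; on Windows, shutil.which also tries
    the extensions listed in PATHEXT, so 'geckodriver' can match geckodriver.exe.
    """
    return shutil.which(name)

path = find_driver()
if path is None:
    print('geckodriver is not on PATH')
else:
    print('geckodriver found at', path)
```

A change to PATH is only seen by processes started afterwards, which is why IDLE has to be closed and reopened.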
>>> browser=webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> browser.get('http://inventwithpython.com')
>>> try:
	elem=browser.find_element_by_class_name('bookcover')
	print('Found <%s> element with that class name!'%(elem.tag_name))
except:
	print('Was not able to find an element with that name.')
Was not able to find an element with that name.
>>> try:
	elem=browser.find_element_by_class_name('navbar-brand')
	print('Found <%s> element with that class name!'%(elem.tag_name))
except:
	print('Was not able to find an element with that name.')
Found <a> element with that class name!
>>> linkElem=browser.find_element_by_link_text('Read It Online')
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
linkElem=browser.find_element_by_link_text('Read It Online')
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in find_element_by_link_text
return self.find_element(by=By.LINK_TEXT, value=link_text)
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
'value': value})['value']
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\VECTOR\AppData\Local\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: Read It Online
>>> linkElem=browser.find_element_by_link_text('Read Online for Free')
>>> type(linkElem)
<class 'selenium.webdriver.firefox.webelement.FirefoxWebElement'>
>>> linkElem.click()
passwordElem.send_keys('12345')
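Source
[1] Al Sweigart (US). Python编程快速上手——让繁琐工作自动化 (Chinese edition of Automate the Boring Stuff with Python) [M]. Translated by Wang Haipeng. Beijing: Posts & Telecom Press, 2016.7. p189-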