#### Web Scraping with Python ####
--------------------------------------------------------------------------------------------(Translation and application)
_____________________________ONE_____________________________
Example:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.pythonscraping.com/pages/page1.html")
>>> bsObj = BeautifulSoup(html.read(), "html.parser")
>>> print(bsObj.h1)
Output:
<h1>An Interesting Title</h1>
In fact, any of the following function calls would produce the same output:
>>> bsObj.html.body.h1
>>> bsObj.body.h1
>>> bsObj.html.h1
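That these chains all resolve to the same tag can be checked without any network access by parsing a small HTML snippet directly. This is just a sketch; the HTML string below is a made-up stand-in for the remote page:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the remote page, so no network is needed.
html_doc = "<html><body><h1>An Interesting Title</h1></body></html>"
bsObj = BeautifulSoup(html_doc, "html.parser")

# All of these chains resolve to the same <h1> node in the parse tree.
print(bsObj.h1)                # <h1>An Interesting Title</h1>
print(bsObj.html.body.h1)      # same tag
print(bsObj.body.h1)           # same tag
print(bsObj.html.h1)           # same tag
```

BeautifulSoup lets you skip levels of the tree because it searches descendants for the first matching tag name, which is why the shorter chains find the same node.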
In the next line, something can go wrong in two ways:
1. The page is not found on the server (or there was some error in retrieving it), or
2. The server itself is not found.
>> html = urlopen("http://www.pythonscraping.com/pages/page1.html")
We didn't write any try...except around this call, so if something goes wrong, the scraper may break down or even stop working entirely. Because we cannot foresee every error on the internet, we should protect our code.
In the first situation, an HTTP error will be returned. This HTTP error may be "404 Page Not Found," "500 Internal Server Error,"
etc. In all of these cases, the urlopen function will throw an HTTPError. We can handle this exception in the following way:
>> from urllib.error import HTTPError
>> try:
>>     html = urlopen("http://www.pythonscraping.com/pages/page1.html")
>> except HTTPError as e:
>>     print(e)
>>     # return None, break, or do some other "Plan B"
>> else:
>>     # program continues. Note: if you return or break
>>     # in the exception catch, you do not need
>>     # to use the "else" statement
If an HTTP error code is returned, the program prints the error and does not execute the rest of the program under the else statement.
For the second situation (the server is not found), we can check whether html is None:
>>> if html is None:
>>>     print("URL is not found")
>>> else:
>>>     # program continues
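Both failure modes can be handled together. The sketch below wraps the fetch in a helper function; the `opener` parameter is a hypothetical injection point (not part of urllib) so the logic can be exercised without a live network call. Note that when the server itself cannot be reached, urlopen actually raises a URLError, so the sketch catches that as well:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def safeFetch(url, opener=urlopen):
    """Return the response for url, or None if the page or server is missing.

    `opener` is a hypothetical parameter for testability; by default
    it is the real urlopen.
    """
    try:
        return opener(url)
    except HTTPError as e:
        print(e)          # e.g. "HTTP Error 404: Not Found"
        return None       # "Plan B" for a missing page
    except URLError as e:
        print(e)          # the server itself could not be reached
        return None

# Simulate a missing page without any network traffic.
def fakeOpener(url):
    raise HTTPError(url, 404, "Not Found", hdrs=None, fp=None)

print(safeFetch("http://www.pythonscraping.com/pages/page1.html",
                opener=fakeOpener))   # prints the error, then None
```

With the stub opener, the HTTPError branch runs and the caller receives None, exactly the "Plan B" behavior described above.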
Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being
what we expected. Every time you access a tag in a BeautifulSoup object, it's smart to add a check to make sure the tag
actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is,
attempting to access a tag on that None object will itself result in an AttributeError being thrown.
For example:
>>> print(bsObj.nonExistentTag)
If the tag doesn't exist in the object bsObj, the expression bsObj.nonExistentTag returns a None object. It is necessary to check
for this None object, because the trouble comes if you don't check for it, but instead go on and try to access something on the
None object (e.g., a child tag of this object). For example:
>>> print(bsObj.nonExistentTag.someTag)
Because the object bsObj.nonExistentTag is itself None, accessing an attribute on it throws the exception:
AttributeError: 'NoneType' object has no attribute 'someTag'
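This AttributeError is ordinary Python behavior, not anything specific to BeautifulSoup: any attribute access on None fails the same way. A stdlib-only sketch:

```python
# bsObj.nonExistentTag evaluates to None, so the failing lookup is
# equivalent to accessing an attribute directly on None:
nonExistentTag = None

try:
    nonExistentTag.someTag
except AttributeError as e:
    print(e)   # 'NoneType' object has no attribute 'someTag'

# getattr with a default sidesteps the exception entirely:
print(getattr(nonExistentTag, "someTag", None))   # None
```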
So how can we guard against these two situations? The easiest way is to explicitly check for both situations:
>>> try:
>>>     badContent = bsObj.nonExistingTag.anotherTag
>>> except AttributeError as e:
>>>     print("Tag was not found")
>>> else:
>>>     if badContent is None:
>>>         print("Tag was not found")
>>>     else:
>>>         print(badContent)
Or, written more cleanly as a reusable function:
>> from urllib.request import urlopen
>> from urllib.error import HTTPError
>> from bs4 import BeautifulSoup
>>
>> def getTitle(url):
>>     try:
>>         html = urlopen(url)
>>     except HTTPError as e:
>>         return None
>>     try:
>>         bsObj = BeautifulSoup(html.read(), "html.parser")
>>         title = bsObj.body.h1
>>     except AttributeError as e:
>>         return None
>>     return title
>>
>> title = getTitle("http://www.pythonscraping.com/pages/page1.html")
>> if title is None:
>>     print("Title could not be found")
>> else:
>>     print(title)
When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions
and make it readable at the same time.
Wrapping the logic in a function like getTitle also makes the code easy to reuse.
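The getTitle pattern generalizes. The helper below is a hypothetical sketch (not from the book) of the same idea: it walks any chain of attribute lookups and returns None as soon as a link is missing, so callers only ever need one None check instead of a nested tower of them:

```python
def safeChain(obj, *names):
    """Follow obj.name1.name2... and return None if any step fails.

    A hypothetical generalization of the getTitle pattern: one
    try/except replaces a check after every attribute access.
    """
    try:
        for name in names:
            obj = getattr(obj, name)
        return obj
    except AttributeError:
        return None

class Page:            # stand-in for a parsed document
    class body:
        h1 = "An Interesting Title"

print(safeChain(Page, "body", "h1"))        # An Interesting Title
print(safeChain(Page, "body", "missing"))   # None
```

The same call works on a BeautifulSoup object, e.g. `safeChain(bsObj, "body", "h1")`, because a failed tag lookup there also ends in an AttributeError.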