17 - 05 - 25 Python codes run happily

                                #### Web Scraping with Python####     

--------------------------------------------------------------------------------------------(Translation and application )

_____________________________ONE_____________________________

Example

>>>from urllib.request import urlopen

>>>from bs4 import BeautifulSoup

>>>html = urlopen("http://www.pythonscraping.com/pages/page1.html")

>>>bsObj = BeautifulSoup(html.read(), "html.parser")

>>>print(bsObj.h1)

Output:

 <h1>An Interesting Title</h1>

 

In fact, any of the following calls would produce the same output:

>>>bsObj.html.body.h1

>>>bsObj.body.h1

>>>bsObj.html.h1      # all three produce the same output

 

Two things can go wrong in the following line:

>>>html = urlopen("http://www.pythonscraping.com/pages/page1.html")

1. The page is not found on the server (or there was some error in retrieving it), or

2. The server is not found.

This line is not wrapped in any "try...except", so if either problem occurs the scraper will crash and stop working.

Errors on the internet cannot be foreseen,

so we should protect our code against them.

 

In the first situation, an HTTP error will be returned. This HTTP error may be “404 Page Not Found,” “500 Internal Server Error,”

etc. In all of these cases, the urlopen function will throw an "HTTPError". We can handle this exception in the following way:

>>>from urllib.error import HTTPError
>>>try:
>>>    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
>>>except HTTPError as e:
>>>    print(e)
>>>    # return None, break, or do some other "Plan B"
>>>else:
>>>    # program continues. Note: if you return or break
>>>    # in the exception handler, you do not need
>>>    # to use the "else" statement
>>>    pass


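If an HTTP error code is returned, the program prints the error and does not execute the rest of the program under

the else statement.

In the second situation, where the server is not found at all, urlopen does not return a page either: it raises a URLError (the parent class of HTTPError), which can be caught with its own except clause placed after the HTTPError one. If the except branches set html to None as their "Plan B", the rest of the program can check for that:

>>>if html is None:
>>>    print("URL is not found")
>>>else:
>>>    # program continues
>>>    pass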
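
The two failure cases listed above (page not found, server not found) can be handled together in one small helper. This is only a sketch: the function name fetch_html and its opener parameter are assumptions made for illustration, and the two stand-in openers simulate the failures so the example runs without a network connection.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, opener=urlopen):
    """Return the raw page bytes, or None if the page or server cannot be reached.

    `opener` is a hypothetical hook so the function can be tried offline;
    by default it is the real urlopen.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # Situation 1: the server answered with an error status (404, 500, ...).
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # Situation 2: the server could not be reached at all.
        print("Server not found:", e.reason)
        return None
    return response.read()


# Stand-in openers that simulate the two failures (assumptions for the demo).
def page_missing(url):
    raise HTTPError(url, 404, "Not Found", None, None)


def server_missing(url):
    raise URLError("name or service not known")


print(fetch_html("http://www.pythonscraping.com/pages/page1.html", opener=page_missing))
print(fetch_html("http://www.pythonscraping.com/pages/page1.html", opener=server_missing))
```

HTTPError is a subclass of URLError, so its except clause has to come first; otherwise the URLError clause would catch both cases.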

Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being

what we expected. Every time you access a tag

in a BeautifulSoup object, it’s smart to add a check to make sure the tag

actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is,

attempting to access a tag on a None object itself will result in an AttributeError being thrown.

 

For example:

>>>print(bsObj.nonExistentTag)

The tag does not exist in the object bsObj, so "bsObj.nonExistentTag" returns a None object. This

None object must be checked for. The trouble comes if you don't check for it, but instead go on and try to call

some other function on the None object

(e.g. access a child tag of this object). For example:

>>> print(bsObj.nonExistentTag.someTag)

Because "bsObj.nonExistentTag" is itself None, calling anything on it

raises an exception:

AttributeError: 'NoneType' object has no attribute 'someTag'

 

So how can we guard against these two situations? The easiest way is to explicitly check for both situations:

>>> try:

>>>     badContent = bsObj.nonExistingTag.anotherTag

>>> except AttributeError as e:

>>>     print("Tag was not found")

>>> else:

>>>     if badContent is None:

>>>         print ("Tag was not found")

>>>     else:

>>>         print(badContent)

Or, more neatly, the same checks can be wrapped in a function like the following:

>>>from urllib.request import urlopen
>>>from urllib.error import HTTPError
>>>from bs4 import BeautifulSoup
>>>def getTitle(url):
>>>    # Step 1: fetch the page; give up if the server returns an HTTP error
>>>    try:
>>>        html = urlopen(url)
>>>    except HTTPError as e:
>>>        return None
>>>    # Step 2: parse the page; give up if the expected tags are missing
>>>    try:
>>>        bsObj = BeautifulSoup(html.read(), "html.parser")
>>>        title = bsObj.body.h1
>>>    except AttributeError as e:
>>>        return None
>>>    return title
>>>title = getTitle("http://www.pythonscraping.com/pages/page1.html")
>>>if title is None:
>>>    print("Title could not be found")
>>>else:
>>>    print(title)
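
The None check for chained tags can also be pulled out into a small reusable helper. The sketch below is not from the book: the name find_nested is hypothetical, and plain SimpleNamespace objects stand in for a parsed page so the example runs without bs4. Since a BeautifulSoup object returns None for a missing tag, passing bsObj instead of the stand-in should behave the same way.

```python
from types import SimpleNamespace


def find_nested(root, *tag_names):
    """Follow root.tag1.tag2... and return None as soon as a step is missing."""
    node = root
    for name in tag_names:
        if node is None:
            return None
        # getattr with a default avoids AttributeError on objects that
        # raise instead of returning None for a missing attribute.
        node = getattr(node, name, None)
    return node


# Stand-in for a parsed page (an assumption for the demo): body holds one h1.
page = SimpleNamespace(body=SimpleNamespace(h1="<h1>An Interesting Title</h1>"))

print(find_nested(page, "body", "h1"))
print(find_nested(page, "nonExistentTag", "someTag"))
```

The first call prints the h1 and the second prints None instead of raising AttributeError, so the caller needs only a single "is None" check.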

 

When writing scrapers, it’s important to think about the overall pattern of your code so that it handles exceptions

and stays readable at the same time.

Wrapping the logic in a function such as getTitle also makes the code easy to reuse.


 
