#### Web Scraping with Python ####
--------------------------------------------------------------------------------------------(Translation and application)
_____________________________ONE_____________________________
Example:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen("http://www.pythonscraping.com/pages/page1.html")
>>> bsObj = BeautifulSoup(html.read(), "html.parser")
>>> print(bsObj.h1)
Output:
<h1>An Interesting Title</h1>
In fact, any of the following function calls would produce the same output:
>>> bsObj.html.body.h1
>>> bsObj.body.h1
>>> bsObj.html.h1
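That these chains all resolve to the same tag can be checked without any network access by parsing a small HTML snippet directly. This is just a sketch; the HTML string below is a made-up stand-in for the remote page:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the remote page, so no network is needed.
html_doc = "<html><body><h1>An Interesting Title</h1></body></html>"
bsObj = BeautifulSoup(html_doc, "html.parser")

# All of these chains resolve to the same <h1> node in the parse tree.
print(bsObj.h1)                # <h1>An Interesting Title</h1>
print(bsObj.html.body.h1)      # same tag
print(bsObj.body.h1)           # same tag
print(bsObj.html.h1)           # same tag
```

BeautifulSoup lets you skip levels of the tree because it searches descendants for the first matching tag name, which is why the shorter chains find the same node.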
In the next line, something can go wrong in two ways:
1. The page is not found on the server (or there was some error in retrieving it), or
2. The server itself is not found.
>> html = urlopen("http://www.pythonscraping.com/pages/page1.html")
We didn't write any try...except around this call, so if something goes wrong, the scraper may break down or even stop working entirely. Because we cannot foresee every error on the internet, we should protect our code.
In the first situation, an HTTP error will be returned. This HTTP error may be "404 Page Not Found," "500 Internal Server Error,"
etc. In all of these cases, the urlopen function will throw an HTTPError. We can handle this exception in the following way:
>> from urllib.error import HTTPError
>> try:
>>     html = urlopen("http://www.pythonscraping.com/pages/page1.html")
>> except HTTPError as e:
>>     print(e)
>>     # return None, break, or do some other "Plan B"
>> else:
>>     # program continues. Note: if you return or break
>>     # in the exception catch, you do not need
>>     # to use the "else" statement
If an HTTP error code is returned, the program prints the error and does not execute the rest of the program under the else statement.
For the second situation (the server is not found), we can check whether html is None:
>>> if html is None:
>>>     print("URL is not found")
>>> else:
>>>     # program continues
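Both failure modes can be handled together. The sketch below wraps the fetch in a helper function; the `opener` parameter is a hypothetical injection point (not part of urllib) so the logic can be exercised without a live network call. Note that when the server itself cannot be reached, urlopen actually raises a URLError, so the sketch catches that as well:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def safeFetch(url, opener=urlopen):
    """Return the response for url, or None if the page or server is missing.

    `opener` is a hypothetical parameter for testability; by default
    it is the real urlopen.
    """
    try:
        return opener(url)
    except HTTPError as e:
        print(e)          # e.g. "HTTP Error 404: Not Found"
        return None       # "Plan B" for a missing page
    except URLError as e:
        print(e)          # the server itself could not be reached
        return None

# Simulate a missing page without any network traffic.
def fakeOpener(url):
    raise HTTPError(url, 404, "Not Found", hdrs=None, fp=None)

print(safeFetch("http://www.pythonscraping.com/pages/page1.html",
                opener=fakeOpener))   # prints the error, then None
```

With the stub opener, the HTTPError branch runs and the caller receives None, exactly the "Plan B" behavior described above.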
Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being
what we expected. Every time you access a tag in a BeautifulSoup object, it's smart to add a check to make sure the tag
actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is,
attempting to access a tag on that None object will itself result in an AttributeError being thrown.
For example:
>>> print(bsObj.nonExistentTag)
If the tag doesn't exist in the object bsObj, the expression bsObj.nonExistentTag returns a None object. It is necessary to check
for this None object, because the trouble comes if you don't check for it, but instead go on and try to access something on the
None object (e.g., a child tag of this object). For example:
>>> print(bsObj.nonExistentTag.someTag)
Because the object bsObj.nonExistentTag is itself None, accessing an attribute on it throws the exception:
AttributeError: 'NoneType' object has no attribute 'someTag'
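This AttributeError is ordinary Python behavior, not anything specific to BeautifulSoup: any attribute access on None fails the same way. A stdlib-only sketch:

```python
# bsObj.nonExistentTag evaluates to None, so the failing lookup is
# equivalent to accessing an attribute directly on None:
nonExistentTag = None

try:
    nonExistentTag.someTag
except AttributeError as e:
    print(e)   # 'NoneType' object has no attribute 'someTag'

# getattr with a default sidesteps the exception entirely:
print(getattr(nonExistentTag, "someTag", None))   # None
```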
So how can we guard against these two situations? The easiest way is to explicitly check for both situations:
>>> try:
>>>     badContent = bsObj.nonExistingTag.anotherTag
>>> except AttributeError as e:
>>>     print("Tag was not found")
>>> else:
>>>     if badContent is None:
>>>         print("Tag was not found")
>>>     else:
>>>         print(badContent)
Or, written more cleanly as a reusable function:
>> from urllib.request import urlopen
>> from urllib.error import HTTPError
>> from bs4 import BeautifulSoup
>>
>> def getTitle(url):
>>     try:
>>         html = urlopen(url)
>>     except HTTPError as e:
>>         return None
>>     try:
>>         bsObj = BeautifulSoup(html.read(), "html.parser")
>>         title = bsObj.body.h1
>>     except AttributeError as e:
>>         return None
>>     return title
>>
>> title = getTitle("http://www.pythonscraping.com/pages/page1.html")
>> if title is None:
>>     print("Title could not be found")
>> else:
>>     print(title)
When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions
and make it readable at the same time.
Wrapping the logic in a function like getTitle also makes the code easy to reuse.
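The getTitle pattern generalizes. The helper below is a hypothetical sketch (not from the book) of the same idea: it walks any chain of attribute lookups and returns None as soon as a link is missing, so callers only ever need one None check instead of a nested tower of them:

```python
def safeChain(obj, *names):
    """Follow obj.name1.name2... and return None if any step fails.

    A hypothetical generalization of the getTitle pattern: one
    try/except replaces a check after every attribute access.
    """
    try:
        for name in names:
            obj = getattr(obj, name)
        return obj
    except AttributeError:
        return None

class Page:            # stand-in for a parsed document
    class body:
        h1 = "An Interesting Title"

print(safeChain(Page, "body", "h1"))        # An Interesting Title
print(safeChain(Page, "body", "missing"))   # None
```

The same call works on a BeautifulSoup object, e.g. `safeChain(bsObj, "body", "h1")`, because a failed tag lookup there also ends in an AttributeError.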