BeautifulSoup Notes

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.


BeautifulSoup has four kinds of objects: Tag, NavigableString, BeautifulSoup, Comment

  • A Tag object corresponds to an XML or HTML tag in the original document.
  • A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text.

  • The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object.

  • Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment.
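
A minimal sketch of the four object types, assuming the bs4 package is installed and using Python's built-in "html.parser". The markup is hypothetical and chosen only so that each kind of object appears once.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag, one string, one comment.
html = "<p class='note'>Hello<!-- hidden remark --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p               # a Tag
text = tag.contents[0]     # a NavigableString
comment = tag.contents[1]  # a Comment (a special kind of NavigableString)

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
print(tag["class"])            # ['note'] (class is a multi-valued attribute)
```

Because the BeautifulSoup object is itself a Tag, `isinstance(soup, Tag)` is true, which is why it can usually be treated like one.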


Navigating the HTML tree:

  • Going down:
  • Navigating using tag names
  •  .contents  :   returns a list
  •  .children  :  returns an iterator

The .contents and .children attributes only consider a tag’s direct children.

  •   .descendants :    The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on. Returns an iterator.
  •  .string : returns a string, or None
    •  If a tag has only one child, and that child is a NavigableString, the child is made available as .string
    •  If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.
    •  If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
  •  .strings  AND  .stripped_strings :      If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator. These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead.
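
The downward-navigation attributes can be compared side by side. This is a small sketch on a hypothetical snippet, assuming bs4 and the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a <div> with two <p> children, one of them nested.
html = "<div><p>One <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children / .contents look at direct children only.
direct = [child.name for child in div.children]
# .descendants walks the whole subtree (tags and strings); keep the tags.
nested = [node.name for node in div.descendants if isinstance(node, Tag)]
first_p, second_p = div.contents

print(direct)                      # ['p', 'p']
print(nested)                      # ['p', 'b', 'p']
print(first_p.string)              # None: this <p> holds a string and a <b>
print(second_p.string)             # three
print(list(div.strings))           # ['One ', 'two', 'three']
print(list(div.stripped_strings))  # ['One', 'two', 'three']
```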
  • Going up:
  •  .parent : You can access an element’s parent with the .parent attribute.
  •  .parents : You can iterate over all of an element’s parents with .parents.
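
A short sketch of walking upward, again on a hypothetical snippet and assuming the built-in "html.parser". The last parent in the chain is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

print(b.parent.name)                          # p
print([parent.name for parent in b.parents])  # ['p', 'body', 'html', '[document]']
print(soup.parent)                            # None: the document has no parent
```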
  • Going sideways:

When tags are all direct children of the same tag, we call them siblings. When a document is pretty-printed, siblings show up at the same indentation level.

  •  .next_sibling   .previous_sibling :  You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree.

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# ',\n'
  •  .next_siblings   .previous_siblings :  You can iterate over a tag’s siblings with .next_siblings or .previous_siblings.
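
Since the immediate sibling is often a whitespace or punctuation string, one common way to reach the next tag is to filter the sibling iterator, or to use find_next_sibling(). The sketch below assumes a cut-down version of the "three sisters" markup and the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down "three sisters" markup (attributes trimmed for brevity).
html = """<p><a id="link1">Elsie</a>,
<a id="link2">Lacie</a>,
<a id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The raw sibling is the separator string, not the next <a> tag.
print(repr(first.next_sibling))  # ',\n'

# Keep only Tag siblings, skipping the strings in between.
tag_siblings = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(tag_siblings)              # ['link2', 'link3']

# find_next_sibling() does the filtering for you.
print(first.find_next_sibling("a")["id"])  # link2
```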
  • Going back and forth:

Take a look at the beginning of the document below:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: “open an <html> tag”, “open a <head> tag”, “open a <title> tag”, “add a string”, “close the <title> tag”, “open a <p> tag”, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

  •  .next_element   .previous_element :  The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.

example:

<p><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

soup.p.a.next_sibling
# ';\nand they lived at the bottom of a well.'
soup.p.a.next_element
# 'Tillie'
  •    .next_elements and .previous_elements : You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed.
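
The difference between sibling order and parse order can be checked on a one-line version of the example. This is a sketch assuming the built-in "html.parser"; the markup is simplified to a single line so the strings contain no newline.

```python
from bs4 import BeautifulSoup

html = "<p><a>Tillie</a>; and they lived at the bottom of a well.</p>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# Sibling: the next node at the same level as <a>.
print(repr(a.next_sibling))  # '; and they lived at the bottom of a well.'
# Element: whatever was parsed right after the opening <a> tag.
print(repr(a.next_element))  # 'Tillie'

# Walking forward in parse order from <a>.
forward = list(a.next_elements)
print(forward)  # ['Tillie', '; and they lived at the bottom of a well.']

# Walking backward: the thing parsed just before <a> is the <p> tag.
print(a.previous_element.name)  # p
```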



Searching the HTML tree:

  • Filters: a string ; a regular expression ; a list ; True ; a function

You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
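
The five kinds of filter can each be passed as the first argument of find_all(). The sketch below assumes a hypothetical document and the built-in "html.parser"; has_id is an illustrative helper, not part of Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with a handful of differently named tags.
html = ("<html><head><title>T</title></head>"
        "<body><b>x</b><a id='k'>y</a></body></html>")
soup = BeautifulSoup(html, "html.parser")


def has_id(tag):
    """Illustrative function filter: keep tags that carry an id attribute."""
    return tag.has_attr("id")


def names(tags):
    return [tag.name for tag in tags]


print(names(soup.find_all("b")))               # string: ['b']
print(names(soup.find_all(re.compile("^b"))))  # regex: ['body', 'b']
print(names(soup.find_all(["a", "b"])))        # list: ['b', 'a']
print(names(soup.find_all(True)))              # True: every tag, no strings
print(names(soup.find_all(has_id)))            # function: ['a']
```

Results come back in document order, which is why the list filter returns b before a.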

  •  find_all(name, attrs, recursive, string, limit, **kwargs) :
  • name : Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names.
  • keyword : Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute.
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
  • searching by CSS class/attrs : It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
  • string :  With string you can search for strings instead of tags.
  • limit :  find_all() returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for limit. This works just like the LIMIT keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
  • recursive : If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in recursive=False.
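
The class_, attrs, string, and recursive arguments can be tried together on a trimmed version of the "three sisters" document. This is a sketch assuming the built-in "html.parser"; the markup is shortened for illustration.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story"><a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with Python's reserved word "class".
by_class = soup.find_all("a", class_="sister")
# The attrs dictionary is an equivalent way to filter on an attribute.
by_attrs = soup.find_all(attrs={"class": "sister"})
# string= searches the text nodes instead of the tags.
by_text = soup.find_all(string="Elsie")
# <title> is a grandchild of <html>, so only the recursive search finds it.
deep = soup.html.find_all("title")
shallow = soup.html.find_all("title", recursive=False)

print([a["id"] for a in by_class])  # ['link1', 'link2']
print(by_text)                      # ['Elsie']
print(len(deep), len(shallow))      # 1 0
```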
  • Calling a tag is like calling find_all() :

Because find_all() is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, then it’s the same as calling find_all() on that object. These two lines of code are equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)
  • find(name, attrs, recursive, string, **kwargs) :

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all(), you can use the find() method.

The only difference is that find_all() returns a list containing the single result, and find() just returns the result.

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None .
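
A small sketch of the difference in return values, assuming the built-in "html.parser" and a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

as_list = soup.find_all("title", limit=1)  # a list holding one Tag
single = soup.find("title")                # the Tag itself
missing_list = soup.find_all("table")      # nothing matches: empty list
missing = soup.find("table")               # nothing matches: None

print(as_list)       # [<title>The Dormouse's story</title>]
print(single)        # <title>The Dormouse's story</title>
print(missing_list)  # []
print(missing)       # None
```

Because find() can return None, chaining a further call onto a failed lookup, such as soup.find("table").find("tr"), raises AttributeError; checking for None first avoids that.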

  • find_parents(name, attrs, string, limit, **kwargs)
  • find_parent(name, attrs, string, **kwargs)

  • find_next_siblings(name, attrs, string, limit, **kwargs)
  • find_next_sibling(name, attrs, string, **kwargs)

  • find_previous_siblings(name, attrs, string, limit, **kwargs)
  • find_previous_sibling(name, attrs, string, **kwargs)

  • find_all_next(name, attrs, string, limit, **kwargs) : these work on elements in parse order (.next_elements / .previous_elements), not on siblings
  • find_next(name, attrs, string, **kwargs)

  • find_all_previous(name, attrs, string, limit, **kwargs)
  • find_previous(name, attrs, string, **kwargs)
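
These methods take the same filters as find_all() but search in a single direction from the starting element. The sketch below assumes a trimmed "three sisters" paragraph and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

html = """<html><body><p class="story"><a id="link1">Elsie</a>,
<a id="link2">Lacie</a>,
<a id="link3">Tillie</a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")
middle = soup.find("a", id="link2")

# Upward: the nearest enclosing <p>.
print(middle.find_parent("p")["class"])           # ['story']
# Sideways: nearest matching sibling in each direction.
print(middle.find_next_sibling("a")["id"])        # link3
print(middle.find_previous_sibling("a")["id"])    # link1
# Parse order: the next string after the opening tag is the tag's own text.
print(middle.find_next(string=True))              # Lacie
print([a["id"] for a in middle.find_all_next("a")])  # ['link3']
print(middle.find_previous("a")["id"])            # link1
```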



Modifying the tree:

  • Changing tag names and attributes :  You can rename a tag, change the values of its attributes, add new attributes, and delete attributes.
  • Modifying .string : If you set a tag’s .string attribute, the tag’s contents are replaced with the string you give.
  • append()  :  You can add to a tag’s contents with Tag.append(). It works just like calling .append() on a Python list.
  • NavigableString()  and  .new_tag() : To add a string to a document, you can pass a Python string to append() or call the NavigableString constructor. To create a whole new tag, call the factory method BeautifulSoup.new_tag().
  • insert() :  Tag.insert() is just like Tag.append(), except the new element doesn’t necessarily go at the end of its parent’s .contents. It’ll be inserted at whatever numeric position you say. It works just like .insert() on a Python list.
  • insert_before()  insert_after() : The insert_before() method inserts a tag or string immediately before something else in the parse tree. The insert_after() method inserts a tag or string so that it immediately follows something else in the parse tree.
  • clear() :  Tag.clear() removes the contents of a tag.
  • extract() :  PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
  • decompose() : Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
  • replace_with() : PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice.
  • wrap() : PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper.
  • unwrap() : Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup.
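
Several of these operations can be chained on one small tree. This is a sketch on hypothetical markup, assuming the built-in "html.parser"; the comments trace the tree after each step.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b class="old">text</b><i>gone</i></p>', "html.parser")
b = soup.b

b.name = "strong"     # rename the tag
b["class"] = "new"    # change an attribute
b["id"] = "x"         # add an attribute...
del b["id"]           # ...and delete it again
b.string = "hello"    # replace the tag's contents
b.append(" world")    # add to the end of its contents
# <p><strong class="new">hello world</strong><i>gone</i></p>

link = soup.new_tag("a", href="http://example.com/")
link.string = "link"
b.insert_after(link)  # place the new <a> right after <strong>

removed = soup.i.extract()                   # take <i> out and keep it
wrapper = soup.a.wrap(soup.new_tag("span"))  # <span><a ...>link</a></span>
soup.strong.unwrap()                         # drop <strong>, keep its text

print(removed)  # <i>gone</i>
print(soup)     # <p>hello world<span><a href="http://example.com/">link</a></span></p>
```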


Output:

  • pretty-printing : The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line.
  • Non-pretty printing :  If you just want a string, with no fancy formatting, you can call str() on a BeautifulSoup object, or a Tag within it (unicode() on Python 2).
  • Output formatters : prettify(), encode(), and decode() accept a formatter argument (for example "minimal", "html", or None) that controls how special characters are escaped on output.
  • get_text() :  If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
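
The three main output routes can be compared on one hypothetical snippet. This sketch assumes Python 3 and the built-in "html.parser"; the exact whitespace produced by prettify() is not shown because it depends on the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

compact = str(soup)       # markup as a plain string, no reformatting
pretty = soup.prettify()  # one tag or string per line, indented
text = soup.get_text()    # text only, markup dropped
# get_text() also accepts a separator and can strip each piece of text.
spaced = soup.get_text(" ", strip=True)

print(compact)  # <p>Hello, <b>world</b>!</p>
print(pretty)
print(text)     # Hello, world!
print(spaced)   # Hello, world !
```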




Most of the content above is excerpted from the original Beautiful Soup Documentation.