Python Web Scraping Notes

Latest recommended article published on 2024-09-05 10:00:00

张章章Sam

Tags: python, html, web scraping

Article link: https://blog.csdn.net/qq_16103331/article/details/52671215

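from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it; naming the parser explicitly avoids a bs4 warning
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)  # the first <h1> tag on the page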

On my machine the available module turned out to be urllib2 (that is, Python 2) rather than urllib.request.
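
For a script that has to run on either interpreter, one common pattern is to try the Python 3 import first and fall back to urllib2. This is only a sketch, and it assumes the older interpreter is Python 2:

```python
# Prefer the Python 3 location of urlopen; fall back to Python 2's urllib2.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# Either way, the rest of the script can call urlopen(url) unchanged.
print(urlopen.__module__)
```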

BeautifulSoup4 is a parsing library.

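import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://python.org/')
# The response body can only be read once; fetch the page again to re-read it
html = response.read()
# html is already a string here, so pass it directly instead of calling .read() again
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.h1)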
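
The read-once behaviour noted in the comment can be seen without touching the network. In this sketch an io.BytesIO object stands in for the HTTP response (an assumption made for illustration; a real response is also a file-like object with a read position):

```python
import io

# io.BytesIO is a stand-in for the HTTP response, so no network is needed.
response = io.BytesIO(b"<html><h1>Hello</h1></html>")

first = response.read()   # consumes the whole body
second = response.read()  # nothing is left, so this is empty

print(len(first), len(second))

# Keep the bytes from the first read in a variable and reuse that instead.
html = first
```

Storing the result of the first read() and passing that variable around avoids a second request.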

Scrape the content of span tags whose class is green

#! /usr/bin/env python
#coding=utf-8

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
# Collect every <span class="green"> and print its text
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
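
The same filter can be tried offline. The sketch below is written for Python 3 and uses a small made-up HTML string (SAMPLE_HTML is hypothetical, not the real page) together with find_all, which is the current bs4 name for findAll:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the warandpeace.html page, so the example runs offline.
SAMPLE_HTML = """
<div id="text">
  <span class="red">Some quoted dialogue.</span>
  <span class="green">Anna Pavlovna</span> spoke to
  <span class="green">Prince Vasili</span>.
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the <span> tags whose class attribute is "green".
names = [span.get_text() for span in soup.find_all("span", {"class": "green"})]
print(names)
```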

find() and findAll() with BeautifulSoup
BeautifulSoup’s find() and findAll() are the two functions you will likely use the most.
With them, you can easily filter HTML pages to find lists of desired tags, or a single tag,
based on their various attributes.
The two functions are extremely similar, as evidenced by their definitions in the
BeautifulSoup documentation:
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
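
To make the difference between the two signatures concrete, here is a small sketch on made-up markup, using the current bs4 spelling find_all. It shows that find_all returns a list (optionally capped by limit), while find returns a single tag or None:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
soup = BeautifulSoup("<p>one</p><p>two</p><p>three</p>", "html.parser")

all_p = soup.find_all("p")               # list of every match
first_two = soup.find_all("p", limit=2)  # stop after two matches
first_p = soup.find("p")                 # first match only, not a list
missing = soup.find("table")             # None when nothing matches

print(len(all_p), len(first_two), first_p.get_text(), missing)
```

Because find can return None, it is worth checking the result before calling methods on it.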
In all likelihood, 95% of the time you will find yourself only needing to use the first two
arguments: tag and attributes. However, let’s take a look at all of the arguments in
greater detail.
The tag argument is one that we’ve seen before: you can pass a string name of a tag or
even a Python list of string tag names. For example, the following will return a list of all
the header tags in a document:

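bsObj.findAll(["h1", "h2", "h3", "h4", "h5", "h6"])
# A dict cannot repeat the "class" key (the last value would win), so pass both values together
bsObj.findAll("span", {"class": ["green", "red"]})
# text matches on the text content of tags rather than on their attributes
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
# Keyword arguments select tags that carry a particular attribute
allText = bsObj.findAll(id="text")
print(allText[0].get_text())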
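
The calls above can be exercised together on a small offline document. This is a sketch under a few assumptions: SAMPLE_HTML is made up, find_all is used as the current name for findAll, and string= is used as the newer spelling of the text argument:

```python
from bs4 import BeautifulSoup

# Hypothetical document, so the example runs without the network.
SAMPLE_HTML = """
<h1>Title</h1>
<h2>Subtitle</h2>
<div id="text">
  <span class="green">the prince</span> met
  <span class="red">the prince</span> and
  <span class="blue">the count</span>.
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

headers = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])  # tag list
colored = soup.find_all("span", {"class": ["green", "red"]})   # either class
princes = soup.find_all(string="the prince")                   # exact text match
by_id = soup.find_all(id="text")                               # keyword argument

print([h.name for h in headers])
print(len(colored), len(princes), len(by_id))
```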

A Caveat to the keyword Argument
The keyword argument can be very helpful in some situations. However, it is technically redundant as a
BeautifulSoup feature. Keep in mind that anything that can be done with keyword can also be accomplished
using techniques we will discuss later in this chapter (see Regular Expressions and Lambda Expressions).
For instance, the following two lines are identical:

bsObj.findAll(id="text")
bsObj.findAll("", {"id": "text"})
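
A quick way to convince yourself of the equivalence is to run both styles on the same document. The sketch below uses made-up markup and, as a substitution for the positional dictionary shown above, passes the dictionary through the explicit attrs= parameter of find_all:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
soup = BeautifulSoup(
    '<div id="text"><p>body</p></div><div id="footer"></div>', "html.parser"
)

by_keyword = soup.find_all(id="text")           # keyword-argument form
by_attrs = soup.find_all(attrs={"id": "text"})  # attribute-dictionary form

# Both calls select the same single <div>.
print(len(by_keyword), len(by_attrs), by_keyword[0] is by_attrs[0])
```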

In addition, you might occasionally run into problems using keyword, most notably when searching for
elements by their class attribute, because class is a protected keyword in Python. That is, class is a
reserved word in Python that cannot be used as a variable or argument name (no relation to the
BeautifulSoup.findAll() keyword argument, previously discussed). For example, if you try the following
call, you’ll get a syntax error due to the nonstandard use of class:

bsObj.findAll(class="green")

Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:
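
A minimal sketch of the underscore form, run on made-up markup with the current find_all spelling, next to the attribute-dictionary alternative that avoids the reserved word altogether:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
soup = BeautifulSoup(
    '<span class="green">a</span><span class="red">b</span>', "html.parser"
)

# class="green" would be a SyntaxError, so bs4 accepts class_ instead.
with_underscore = soup.find_all(class_="green")
# The attribute dictionary sidesteps the reserved word.
with_dict = soup.find_all("span", {"class": "green"})

print([tag.get_text() for tag in with_underscore])
print([tag.get_text() for tag in with_dict])
```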