Python and Web Scraping - 01: A Brief Introduction

PS: The thesis is written... and yet somehow not quite finished! I'm planning to dig into web scraping! No idea how long I'll keep it up! Finally, please let the thesis pass!!! Pass, pass, pass!!!

Preface: web scraping means stripping away the layers of interface that normally hide the details, such as the browser layer and the network connection layer.

1. Modeling the network communication between A and B (B is the client, A is the server)
B's computer sends a stream of bits (10101010...) that forms a packet with a request header and a message body. The header contains the MAC address of B's local router and A's IP address as the destination; the body contains B's request to the server application running on A.
B's local router receives those bits, interprets them as a packet going from B's MAC address to A's IP address, stamps its own IP address on the packet as the "from" address, and sends it out over the internet.
B's packet passes through several intermediary servers until it reaches A's server. A's server receives the packet at A's IP address, reads the destination port in the request header (usually port 80 for web applications), and passes the data on to the web server application.
The web server application receives the data from the server's processor as a GET request for the file index.html; it bundles up that file, sends it back towards B, and B's local router delivers it to B's computer.
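
To make this exchange concrete, here is a minimal sketch (my own, not from the book) that opens a TCP connection to port 80 and writes an HTTP GET request by hand. It assumes pythonscraping.com still answers plain HTTP on port 80 and reuses the page path from the example in the next section.

import socket

HOST = 'pythonscraping.com'

# Open a TCP connection to port 80 and send a bare HTTP GET request by hand.
with socket.create_connection((HOST, 80), timeout=10) as sock:
    request = (
        f'GET /pages/page1.html HTTP/1.1\r\n'
        f'Host: {HOST}\r\n'
        f'Connection: close\r\n'
        f'\r\n'
    )
    sock.sendall(request.encode('ascii'))

    # Read the raw response: status line, headers, then the HTML body.
    response = b''
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk

print(response.decode('utf-8', errors='replace')[:500])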

2. A simple example

from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

PS: For the code, search for python-scraping on Gitee and you'll find it! There will probably be many copies, but anyone familiar with this book will know which one is the right repo. If you don't know the book, never mind!
The scraping result is as follows (the \n in the output just means a newline):

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'

PS: That doesn't look like English! (It's Lorem ipsum filler text.)
Screenshot of the page address in the browser: (image omitted)
This prints the complete HTML of the target address, i.e., the source of the HTML file page1.html in the <web application root>/pages folder on the server at the domain http://pythonscraping.com.
The Python program requests this single HTML file directly, through Python's urllib.request module.
Note: urllib is part of Python's standard library. It provides functions for requesting data from web pages, handling cookies, and changing metadata such as request headers and the user agent. urlopen opens and reads a remote object fetched over the network.
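
Since urllib can change request headers and the user agent, here is a small sketch (my own code, not the book's) that does exactly that with urllib.request.Request; the User-Agent string is just an illustrative value.

from urllib.request import Request, urlopen

# Attach a custom User-Agent header to the request before opening it.
req = Request(
    'http://pythonscraping.com/pages/page1.html',
    headers={'User-Agent': 'my-little-scraper/0.1'}
)
html = urlopen(req)
print(html.read()[:200])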

PS: When I run the following code: html = urlopen('https://github.com/search?q=REMitchell/python-scraping/blob/master/Chapter01_BeginningToScrape.ipynb') print(html.read()), the page turns out to be very hard to scrape, so I get: URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.>
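
If you hit this kind of timeout, one thing to try (a sketch, not a guaranteed fix, since the host may simply be unreachable from your network) is to pass urlopen's timeout parameter so the failure comes back quickly, and catch URLError:

from urllib.request import urlopen
from urllib.error import URLError

url = 'https://github.com/search?q=REMitchell/python-scraping/blob/master/Chapter01_BeginningToScrape.ipynb'
try:
    # Give up after 10 seconds instead of waiting for the OS-level timeout.
    html = urlopen(url, timeout=10)
    print(html.read()[:200])
except URLError as e:
    print('Request failed:', e.reason)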

3. BeautifulSoup
Tip: the book may suggest installing some modules separately. One shortcut: Jupyter and these libraries are usually all covered by installing Anaconda3, since it comes with most of them pre-installed. Look carefully under the lib folder of the install directory (the site-packages folder) and you should find the modules you need.
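
If you are not sure whether a module actually landed in site-packages, a quick check (just a sketch; it assumes bs4, lxml, and html5lib are importable) is:

# Print where each package was loaded from; the paths should point into
# Anaconda's site-packages folder if the bundled copies are being used.
import bs4
import lxml
import html5lib

for module in (bs4, lxml, html5lib):
    print(module.__name__, '->', module.__file__)
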
3.1. Example analysis
Run the code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(),'html.parser')
print(bs.h1)
print(bs.html.body.div)

The output is:

<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

Now run this again, reusing the same html object:

bs = BeautifulSoup(html.read(),'lxml')
print(bs.h1)
bs = BeautifulSoup(html.read(),'html5lib')
print(bs.h1)

The result is:

None
None

PS: Think those two modules aren't installed? They are! pip install lxml reports they're already there! The real culprit is that the earlier html.read() already consumed the HTTP response stream, so these later read() calls return an empty string and there is no <h1> for any parser to find, hence None.
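
The fix is to read the response body once into a variable and hand that same text to every parser (a sketch; it assumes lxml and html5lib are installed, which the PS above confirms):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# An HTTPResponse can only be read once, so keep the body in a variable.
html_text = urlopen('http://www.pythonscraping.com/pages/page1.html').read()

# Reuse the same text with each parser instead of re-reading the stream.
for parser in ('html.parser', 'lxml', 'html5lib'):
    bs = BeautifulSoup(html_text, parser)
    print(parser, '->', bs.h1)
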
3.2. Summary
The BeautifulSoup library formats and organizes messy web content by navigating HTML tags. Creating a BeautifulSoup object takes two arguments: the HTML text and a parser. Available parsers include html.parser, lxml, and html5lib.
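
As a quick illustration of those two arguments, the HTML text can be any string, not just a network response (this snippet is my own sketch):

from bs4 import BeautifulSoup

# First argument: the HTML text; second argument: the parser name.
snippet = '<html><body><h1>An Interesting Title</h1></body></html>'
bs = BeautifulSoup(snippet, 'html.parser')
print(bs.h1)             # <h1>An Interesting Title</h1>
print(bs.h1.get_text())  # An Interesting Title
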
3.3. Simple exception handling
Case 1: the page does not exist on the server (or an error occurred while retrieving it). Handle it with HTTPError.
Case 2: the server itself cannot be found. Handle it with URLError.
Example code:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen('https://github.com/search?q=REMitchell/python-scraping/blob/master/Chapter01_BeginningToScrape.ipynb')    
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print(html.read())

PS: When I ran this code, that page actually got scraped successfully! Now I'm oddly upset!!!
Even when the page is fetched from the server successfully, it may not contain what you expect. If the tag you ask for is missing, BeautifulSoup returns a None object, and calling an attribute on that None object raises an AttributeError, so check for both.
Code:

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
    bs = BeautifulSoup(html.read(),'html.parser')
    badContent = bs.html.h1
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent == None:
        print('Tag was not found')
    else:
        print(badContent)

Result: <h1>An Interesting Title</h1>
If you change the line to badContent = bs.html.h1.ss, the output is: Tag was not found

Note: here is a general-purpose reference function.

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)  
        bs = BeautifulSoup(html.read(),'html.parser')
        title = bs.html.h1
    except HTTPError as e:
        print(e)
    except URLError as e:
        print('The server could not be found!')
    except AttributeError as e:
        print("Tagtitle was not found")
    else:
        if title == None:
            print('title was not found')
        else:
            print(title)

Run: title = getTitle('http://www.pythonscraping.com/pages/page1.html')
Printed result: <h1>An Interesting Title</h1>
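
Note that because getTitle prints instead of returning, the title variable above is actually None. A variant (my own sketch, not necessarily how the book writes it) returns the tag so the caller can decide what to do with it:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def get_title(url):
    """Return the page's <h1> tag, or None if it cannot be fetched or found."""
    try:
        html = urlopen(url)
        bs = BeautifulSoup(html.read(), 'html.parser')
        return bs.html.h1
    except (HTTPError, URLError, AttributeError):
        return None

title = get_title('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title was not found')
else:
    print(title)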

That's it for today!
This covers Chapter 1!
The thesis has been handed in and I'm on edge! I also need to lose some weight!
I've also started a new side project, a low-key little bit of fiction!!! Life without money can be colorful too! That's my spare time in a nutshell!
Gaming? No money! Going out shopping? No money! Buying things? No money! There's still a house to buy and a family to support, which is hard enough! So I have to find other little pleasures!
