Python Crawler Basics, Super Easy Series (Part 4)

Latest recommended article published on 2020-12-02 20:23:18

古艺轩

Article link: https://blog.csdn.net/weixin_44278529/article/details/88078631

Common databases

1. Relational databases (MySQL, Oracle, PostgreSQL, sqlite3)
2. Non-relational databases (NoSQL) (Redis, MongoDB, Cassandra, HBase, the graph database Neo4j)

Qiushibaike data parsing example

Sample XML, showing one kind of nested XML structure a web page can have

test_data = '''

<div>
<div class="content_item">
                <content_pics>
                    <img src=""></img>
                    <img src=""></img>
                    <img src=""></img>
                </content_pics>
                <content_video>
                    <video src="http://qiubai-video-web.qiushibaike.com/EVQ5BZFGXGVZJH13_hd.mp4"></video>
                </content_video>
                <content_text>
                    你好
                    <br>
                    你好不好
                    <br>
                    好好好好
                </content_text>
           </div>
           
           <div class="content_item">
                <content_pics>
                    <img src=""></img>
                    <img src=""></img>
                    <img src=""></img>
                </content_pics>
                <content_video>
                    <video src="http://qiubai-video-web.qiushibaike.com/EVQ5BZFGXGVZJH13_hd.mp4"></video>
                </content_video>
                <content_text>
                    你好
                    <br>
                    你好不好
                    <br>
                    好好好好
                </content_text>
           </div>
</div>

'''

Parsing approach

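import lxml.html
import requests

# Parse the sample markup into an element tree
parse_result = lxml.html.fromstring(test_data)

# Each post lives in a <div class="content_item">
content_items = parse_result.xpath("//div[@class='content_item']")

for item in content_items:
    # Join the text fragments that the <br> tags split apart
    print("One post:", "".join(item.xpath(".//content_text/text()")))

    # Download the video if the post has one
    video_urls = item.xpath(".//content_video/video/@src")
    if video_urls:
        video_content = requests.get(video_urls[0])

    # Download every picture, skipping empty src attributes
    pic_urls = item.xpath(".//content_pics/img/@src")
    for pic_url in pic_urls:
        if pic_url:
            pic_content = requests.get(pic_url)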

Advanced use of magic methods

How to make an object perform subtraction when the "+" operator is applied to it

class CustomPlus(object):

    def __init__(self,value):
        self.add_number = value

    def __add__(self, other):
        result = self.add_number - other.add_number
        return CustomPlus(result)  # return the same type so results can be chained

    def __str__(self):
        return str(self.add_number)

a = CustomPlus(20)
b = CustomPlus(10)
c = CustomPlus(40)

print(isinstance(10, CustomPlus))  # check the runtime type to make sure the operation is valid
print(type(10) == int)

c = a + b  # equivalent to c = a.__add__(b)
print(c)
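
The isinstance check printed above suggests a safer `__add__`. Below is a minimal sketch, assuming we want "+" to refuse operands that are not instances of the class. The name SafePlus is hypothetical and exists only for this illustration.

```python
class SafePlus(object):
    """Hypothetical variant of CustomPlus that checks the operand type."""

    def __init__(self, value):
        self.add_number = value

    def __add__(self, other):
        # Returning NotImplemented lets Python raise TypeError for
        # operands this class does not know how to handle
        if not isinstance(other, SafePlus):
            return NotImplemented
        return SafePlus(self.add_number - other.add_number)

    def __str__(self):
        return str(self.add_number)


print(SafePlus(20) + SafePlus(10))  # prints 10

try:
    SafePlus(20) + 10
except TypeError as exc:
    print("rejected:", exc)
```

Returning NotImplemented rather than raising directly gives the other operand a chance to handle the operation through its own `__radd__`.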

What the "[]" operator really is

Any class that implements __setitem__ and __getitem__ can be used with the [] operator

class MyCache(object):

    def __init__(self):
        self.my_content = 0

    # Called on assignment, e.g. m_cache['poip'] = "python is best"
    def __setitem__(self, key, value):
        print("test")
        self.my_content = value

    # Called on lookup, e.g. m_cache['tdjsjffsjdfj']
    def __getitem__(self, item):
        return self.my_content

    # Called when "in" tests membership, e.g. ("test" in m_cache)
    def __contains__(self, item):
        return True

m_cache = MyCache()
m_cache['poip'] = "python is best"  # m_cache.__setitem__('poip', "python is best")
print(m_cache['tdjsjffsjdfj'])  # m_cache.__getitem__('tdjsjffsjdfj')
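
MyCache ignores the key and keeps a single value, which is enough to show when each method is called. As a rough sketch of a cache that actually stores one value per key, the same three methods can delegate to a dict. The class name DictCache and the attribute `_store` are assumptions made for this example.

```python
class DictCache(object):
    """Hypothetical cache that stores one value per key."""

    def __init__(self):
        self._store = {}

    def __setitem__(self, key, value):
        self._store[key] = value

    def __getitem__(self, key):
        return self._store[key]  # raises KeyError for unknown keys

    def __contains__(self, key):
        return key in self._store


d_cache = DictCache()
d_cache["poip"] = "python is best"
print(d_cache["poip"])
print("poip" in d_cache)
print("missing" in d_cache)
```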

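Common crawler techniques

Configurable crawlers

1. Starting and stopping the crawler (os.system)

2. Basic crawl settings (start pages, parsing rules, stop conditions, etc.)

3. Progress reporting

4. Data storage

Incremental crawlers

1. How to tell whether an already-crawled page has changed

Distributed crawlers

Anti-crawling measures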
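
To make the "configurable crawler" items more concrete, here is a small sketch of a crawl loop driven by a settings dict, with a stop condition and a progress line. Everything in it is an assumption for illustration: the key names `start_urls` and `max_pages`, the example.com URLs, and `fake_fetch`, which stands in for a real HTTP request so the sketch runs offline.

```python
# Hypothetical settings; every key name here is made up for illustration
CRAWLER_CONFIG = {
    "start_urls": ["http://example.com/page/1"],
    "max_pages": 3,  # stop condition
}


def fake_fetch(url):
    """Stand-in for a real HTTP request: returns the links found on a page."""
    page_no = int(url.rsplit("/", 1)[1])
    return ["http://example.com/page/%d" % (page_no + 1)]


def run_crawler(config, fetch):
    """Crawl from the start URLs until max_pages pages have been visited."""
    queue = list(config["start_urls"])
    seen = set()
    while queue and len(seen) < config["max_pages"]:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        # Progress indicator
        print("[%d/%d] %s" % (len(seen), config["max_pages"], url))
        queue.extend(fetch(url))
    return seen


crawled = run_crawler(CRAWLER_CONFIG, fake_fetch)
```

Keeping the settings in one dict means the start pages and the stop condition can be changed without touching the loop itself.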

Data storage

It is worth reading up on Taobao Mofang (淘宝魔方) and on data structures

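Basic mongo operations

1. How to start it

  1) cd <MongoDB install directory>

  2) cd bin  # enter the bin directory under the install directory

  3) ./mongod --dbpath <database path> (any directory will do; the simplest choice is the current directory)

  4) Open another terminal, go to <MongoDB install directory>/bin, and run ./mongo to connect to the database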

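2. Working with mongo from the command line

show dbs  # list all available databases
use xyz  # switch to database xyz; replace xyz with a database that exists on your machine
show collections  # list the collections (tables) in the current database (the one selected with use)
db.students.find()  # show all documents in the students collection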

Basic pymongo operations

import pymongo

# Connect to the server ---> pick a database ---> pick a collection (table)
client = pymongo.MongoClient(host='localhost', port=27017)
db = client.test
collection = db.students  # each document in a collection is essentially a dict

student1 = {
    'id': '20170101',
    'name': 'Jordan',
    'age': 20,
    'gender': 'male'
}

student_update = {
    'id': '20170101',
    'name': 'jack',
    'age': 19,  # an int, like student1's age, so numeric queries keep working
    'gender': 'male'
}


# NoSQL
# SQL: mysql sqlite3 sqlserver oracle
# If an inserted document has no _id field, MongoDB generates an ObjectId for it
# insert into students(id,name,age,gender) values('20170101','jordan',20,'male')
collection.insert_one(student1)
# find returns every document that matches; the filter is a dict and may use several fields
# select * from students where age > 19
result_find = collection.find({"age": {"$gt": 19}})
# find returns a cursor, which behaves like an iterator over the results; next() fetches one
print(next(result_find))

# update students set name='jack' where id = '20170101'
# Update the matching document; with upsert=True a document is inserted when nothing matches
# $set is a MongoDB update operator that overwrites the listed fields
collection.update_one({"id": "20170101"}, {"$set": student_update}, upsert=True)
# delete from students where id = "20170101"
collection.delete_many({"id": "20170101"})

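Differences between MongoDB and Redis

Simple incremental crawling

What is a hash function

Message-digest algorithms such as MD5 and SHA compute a fingerprint of the input data. The fingerprint is a string, usually of fixed length. In Python, MD5 can be computed as follows: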

import hashlib
md5 = hashlib.md5()
md5.update("qianfengpython".encode("utf-8"))  # update() requires bytes in Python 3
print(md5.hexdigest())

**Note: to compare two files, create a separate md5 object for each one; do not keep calling update on a single object.**
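
The fingerprint idea can be sketched for both uses mentioned in this post: hashing files with one md5 object each, and detecting that a crawled page has changed. This is only an illustration under stated assumptions: the function names, the in-memory `fingerprints` dict, and the example.com URL are all made up here, and a real crawler would persist the fingerprints in a database.

```python
import hashlib


def md5_of_bytes(data):
    """Return the hex MD5 fingerprint of a bytes object."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=8192):
    """Hash one file with its own md5 object, reading it in chunks."""
    md5 = hashlib.md5()  # a fresh object per file
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Hypothetical in-memory store: URL -> fingerprint from the last crawl
fingerprints = {}


def page_changed(url, body):
    """Return True if body differs from what was last stored for url."""
    new_print = md5_of_bytes(body)
    changed = fingerprints.get(url) != new_print
    fingerprints[url] = new_print
    return changed


print(page_changed("http://example.com/a", b"version 1"))  # True (first visit)
print(page_changed("http://example.com/a", b"version 1"))  # False
print(page_changed("http://example.com/a", b"version 2"))  # True
```

Two files are then equal when `md5_of_file(path_a) == md5_of_file(path_b)`, and each call builds its own md5 object as the note above requires.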
