Web Scraping Supplementary Study, with Python Study Notes 3

H_Hao

Category: Study

Article link: https://blog.csdn.net/haoyuexihuai/article/details/82596324

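Data Analysis
1.Ask the right question

A good question explains a phenomenon;
a bad question forces a link between unrelated things.
A good question tests a hypothesis;
a bad question sets out to prove yourself right.
A good question explores a direction;
a bad question does not actually ask anything.

2.Find the answer through data-driven argument

Comparison: horizontal (against others) and vertical (against yourself at different points in time)
Segmentation
Tracing back to the source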
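
As a rough illustration of the two kinds of comparison, the sketch below uses the first four daily post counts for 朝阳 and 海淀 that appear in the line chart data later in this post. The variable names are illustrative only and are not part of the original code.

```python
# Daily post counts for 2015.12.24 through 2015.12.27,
# copied from the line chart data later in this post.
chaoyang = [220, 217, 259, 266]
haidian = [137, 146, 154, 156]

# Horizontal comparison: one area against another on the same day.
horizontal = [c - h for c, h in zip(chaoyang, haidian)]

# Vertical comparison: one area against itself on the previous day.
vertical = [today - yesterday for yesterday, today in zip(chaoyang, chaoyang[1:])]

print(horizontal)
print(vertical)
```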

3.Interpret the data and answer the question

Sampling problems
Mistaking correlation for causation
Ignoring preconditions

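1.Jupyter Notebook

Jupyter Notebook (formerly IPython Notebook) is an interactive notebook that supports more than 40 programming languages.
At its core, Jupyter Notebook is a web application for creating and sharing literate-programming documents that combine live code, mathematical equations, visualizations, and Markdown. Typical uses include data cleaning and transformation, numerical simulation, statistical modeling, and machine learning.

pip3 install jupyter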

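2.Charts

pip3 install charts

If import charts raises an error:

Download the replacement files from
https://github.com/mugglecoding/Plan-for-combating/tree/master/week3/charts_replace_file
and use them to overwrite the files in the installed charts directory.

To find where charts is installed:

import sys
print(sys.path)

Look for a path similar to
'/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages', then go into the charts directory under it and overwrite its files with the downloaded ones.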
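
Scanning sys.path by eye can be tedious. As a possible shortcut, the sketch below asks the import system where a package would be loaded from. The helper name package_dir is hypothetical, and json is used only because it ships with Python; substitute 'charts' once the package is installed.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# 'json' is a stand-in that is always available; replace it with 'charts'.
print(package_dir('json'))
```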

3.Build the data format that charts expects
For example: series = [{'name': 'John','data': [5],'type': 'column'},{'name': 'Jack','data': [7],'type': 'column'}]

# 1. Collect the distinct areas
# (item_info is assumed to be a pymongo collection whose documents have an 'area' array)
area_list = []
for i in item_info.find():
    area_list.append(i['area'][0])
area_index = list(set(area_list))    # convert the list to a set to drop duplicates, then back to a list
print(area_index)

# Output:
# ['西城', '燕郊', '密云', '石景山', '海淀', '朝阳', '宣武', '平谷', '怀柔', '昌平', '附近', '大兴', '顺义', '延庆', '不明', '丰台', '北京周边', '房山', '东城', '崇文', '通州', '门头沟']

# 2. Walk through every area in area_index and count how often it occurs in area_list
post_times = []
for index in area_index:
    post_times.append(area_list.count(index))
print(post_times)

# Output:
# [3376, 541, 386, 1958, 11768, 19224, 1803, 277, 381, 4963, 3, 4856, 1913, 164, 15505, 7991, 520, 1490, 3155, 1188, 4924, 464]
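
Calling list.count once per area rescans the whole list each time. As an alternative sketch, collections.Counter produces the same distinct areas and counts in a single pass. The sample list here is hypothetical and stands in for the values read from item_info.

```python
from collections import Counter

# Hypothetical sample standing in for the values read from item_info.
area_list = ['朝阳', '海淀', '朝阳', '通州', '海淀', '朝阳']

counts = Counter(area_list)                    # one pass over the list
area_index = list(counts)                      # distinct areas, in first-seen order
post_times = [counts[a] for a in area_index]   # counts aligned with area_index

print(area_index)
print(post_times)
```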

# 3. Define a generator function (using yield) that produces the data set
def data_gen(types):
    for area, times in zip(area_index, post_times):
        yield {
            'name': area,
            'data': [times],
            'type': types
        }

for i in data_gen('column'):
    print(i)

# Output:
# {'data': [3376], 'type': 'column', 'name': '西城'}
# {'data': [541], 'type': 'column', 'name': '燕郊'}
# {'data': [386], 'type': 'column', 'name': '密云'}
# {'data': [1958], 'type': 'column', 'name': '石景山'}
# {'data': [11768], 'type': 'column', 'name': '海淀'}
# {'data': [19224], 'type': 'column', 'name': '朝阳'}
# {'data': [1803], 'type': 'column', 'name': '宣武'}
# {'data': [277], 'type': 'column', 'name': '平谷'}
# {'data': [381], 'type': 'column', 'name': '怀柔'}
# {'data': [4963], 'type': 'column', 'name': '昌平'}
# {'data': [3], 'type': 'column', 'name': '附近'}
# {'data': [4856], 'type': 'column', 'name': '大兴'}
# {'data': [1913], 'type': 'column', 'name': '顺义'}
# {'data': [164], 'type': 'column', 'name': '延庆'}
# {'data': [15505], 'type': 'column', 'name': '不明'}
# {'data': [7991], 'type': 'column', 'name': '丰台'}
# {'data': [520], 'type': 'column', 'name': '北京周边'}
# {'data': [1490], 'type': 'column', 'name': '房山'}
# {'data': [3155], 'type': 'column', 'name': '东城'}
# {'data': [1188], 'type': 'column', 'name': '崇文'}
# {'data': [4924], 'type': 'column', 'name': '通州'}
# {'data': [464], 'type': 'column', 'name': '门头沟'}

# 4. Use a list comprehension to build the series for a charts column chart
series = [data for data in data_gen('column')]
charts.plot(series, show='inline', options=dict(title=dict(text='Second-hand item posts by Beijing district over seven days')))

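3.MongoDB array slicing with $slice

$slice limits how many elements of a matched array field are returned.

db.c.find({}, {'type': {'$slice': 3}})        returns the first three elements of type (-3 returns the last three)
db.c.find({}, {'type': {'$slice': [3, 10]}})  skips three elements and returns the next ten (the 4th through 13th)

for i in item_info.find({'pub_date':{'$in':['2016.01.12','2016.01.14']}},{'area':{'$slice':1},'_id':0,'price':0,'title':0}).limit(300):
    print(i)

Here $slice is applied to area so that only its first element is returned.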
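
To make the two forms of $slice concrete without a database, the sketch below mimics them on a plain Python list. The function slice_projection is a hypothetical helper written for this illustration; it is not part of pymongo or MongoDB.

```python
def slice_projection(values, spec):
    """Mimic the $slice projection on a plain Python list (illustration only)."""
    if isinstance(spec, int):
        # Positive n keeps the first n elements, negative n keeps the last |n|.
        return values[:spec] if spec >= 0 else values[spec:]
    skip, limit = spec
    # [skip, limit]: skip elements, then return up to `limit` of them.
    # A negative skip counts back from the end of the array.
    start = skip if skip >= 0 else max(len(values) + skip, 0)
    return values[start:start + limit]


types = list(range(1, 21))               # stand-in array holding 1..20
print(slice_projection(types, 3))        # first three elements
print(slice_projection(types, -3))       # last three elements
print(slice_projection(types, [3, 10]))  # 4th through 13th elements
```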

4.Stepping through dates in Python

from datetime import date, timedelta

def get_all_dates(date1, date2):
    # Convert the start and end date strings ('YYYY.MM.DD') into date objects
    the_date = date(*map(int, date1.split('.')))
    end_date = date(*map(int, date2.split('.')))
    days = timedelta(days=1)   # str(days) is '1 day, 0:00:00'
    while the_date <= end_date:
        yield the_date.strftime('%Y.%m.%d')   # yield each date in the required string format
        the_date = the_date + days            # advance by one day
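
A variant of the same idea, sketched here with datetime.strptime doing the parsing so the format string is stated only once. The name date_range is hypothetical; the example range crosses a year boundary to show that the rollover is handled by timedelta.

```python
from datetime import datetime, timedelta


def date_range(start, end, fmt='%Y.%m.%d'):
    """Yield every date from start to end (inclusive) as a formatted string."""
    current = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    while current <= last:
        yield current.strftime(fmt)
        current += timedelta(days=1)


print(list(date_range('2015.12.30', '2016.01.02')))
```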

5.print()

print('#'*20)   # prints 20 '#' characters

6.Generate the line chart data

def get_data_within(date1, date2, areas):
    for area in areas:
        area_day_posts = []
        for day in get_all_dates(date1, date2):   # walk through every date in the range
            # fetch the posts published in this area on this day and count them
            a = list(item_info.find({'pub_date': day, 'area': area}))
            each_day_post = len(a)
            area_day_posts.append(each_day_post)
        data = {
            'name': area,
            'data': area_day_posts,
            'type': 'line'
        }
        yield data

for i in get_data_within('2015.12.24','2016.01.05',['朝阳','海淀','通州']):
    print(i)

# Resulting series:
# {'data': [220, 217, 259, 266, 322, 287, 309, 307, 346, 440, 488, 641, 649], 'type': 'line', 'name': '朝阳'}
# {'data': [137, 146, 154, 156, 176, 183, 171, 217, 239, 284, 288, 397, 395], 'type': 'line', 'name': '海淀'}
# {'data': [58, 54, 74, 57, 82, 84, 93, 79, 114, 113, 133, 151, 201], 'type': 'line', 'name': '通州'}
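
The function above issues one query per area per day. If the documents are already in memory, the same series could be built in a single pass, as in the sketch below. The sample documents (including the extra '望京' entry) and the name line_series are hypothetical and serve only to make the example runnable.

```python
from collections import Counter

# Hypothetical documents shaped like those in item_info.
docs = [
    {'pub_date': '2015.12.24', 'area': ['朝阳', '望京']},
    {'pub_date': '2015.12.24', 'area': ['朝阳']},
    {'pub_date': '2015.12.25', 'area': ['海淀']},
    {'pub_date': '2015.12.25', 'area': ['朝阳']},
]


def line_series(docs, dates, areas):
    """Build one line-chart series per area from documents already in memory."""
    # Count every (area, date) pair in a single pass over the documents.
    counts = Counter(
        (area, doc['pub_date']) for doc in docs for area in set(doc['area'])
    )
    return [
        {'name': area, 'data': [counts[(area, d)] for d in dates], 'type': 'line'}
        for area in areas
    ]


series = line_series(docs, ['2015.12.24', '2015.12.25'], ['朝阳', '海淀'])
for s in series:
    print(s)
```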

7.Draw the line chart

# Fixed structure of the options dict
options = {
    'chart'   : {'zoomType':'xy'},
    'title'   : {'text': 'Post count statistics'},
    'subtitle': {'text': 'Visualized statistics chart'},
    'xAxis'   : {'categories': list(get_all_dates('2015.12.24','2016.01.05'))},
    'yAxis'   : {'title': {'text': 'Count'}}
    }

series = list(get_data_within('2015.12.24','2016.01.05',['朝阳','海淀','通州']))

charts.plot(series, options=options, show='inline')


8.Aggregation with aggregate

pipeline = [
    # $and requires every listed condition to match
    {'$match':{'$and':[{'pub_date':'2015.12.24'},{'time':3}]}},
    # Group the documents by the price field and store the tally in counts; '$sum':1 adds 1 per document
    {'$group':{'_id':'$price','counts':{'$sum':1}}},
    # Sort by counts in descending order
    {'$sort' :{'counts':-1}},
    # Limit the number of documents the pipeline returns
    {'$limit':10}
]

pipeline2 = [
    {'$match':{'$and':[{'pub_date':'2015.12.25'},{'time':3}]}},
    # Group by a slice of the cates array that holds only its third element (skip 2, take 1); '$sum':1 adds 1 per document
    {'$group':{'_id':{'$slice':['$cates',2,1]},'counts':{'$sum':1}}},
    {'$sort':{'counts':-1}}
]

pipeline3 = [
    # $in matches documents whose pub_date is either '2015.12.25' or '2015.12.27'
    {'$match':{'$and':[{'pub_date':{'$in':['2015.12.25','2015.12.27']}},{'time':1}]}},
    # Group by a slice of the area array that holds only its first element
    {'$group':{'_id':{'$slice':['$area',1]},'counts':{'$sum':1}}},
    {'$sort' :{'counts':-1}},
]
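
To show what the first pipeline computes, the sketch below reproduces its $match, $group, $sort, and $limit stages in plain Python. The sample documents are hypothetical. Against a real collection the pipeline would be passed to the collection's aggregate method instead.

```python
from collections import Counter

# Hypothetical documents; the field names follow the pipelines above.
docs = [
    {'pub_date': '2015.12.24', 'time': 3, 'price': 100},
    {'pub_date': '2015.12.24', 'time': 3, 'price': 100},
    {'pub_date': '2015.12.24', 'time': 3, 'price': 50},
    {'pub_date': '2015.12.24', 'time': 1, 'price': 50},
    {'pub_date': '2015.12.25', 'time': 3, 'price': 100},
]

# $match: keep documents that satisfy both conditions.
matched = [d for d in docs if d['pub_date'] == '2015.12.24' and d['time'] == 3]

# $group with {'$sum': 1}: count documents per distinct price.
grouped = Counter(d['price'] for d in matched)

# $sort by counts descending, then $limit to the first 10 groups.
result = [{'_id': price, 'counts': n} for price, n in grouped.most_common(10)]

print(result)
```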