MongoDB Document Database Summary
1. Link: installing MongoDB on CentOS 7
Note:
The linked article has one mistake: change https to http
2. Common MongoDB commands
- Show the current database
db
- Show all databases
show dbs / show databases
- Switch to (and implicitly create) a database
use db_name
- Drop the current database: db.dropDatabase()
- Show all collections: show collections
- Drop a collection: db.collection_name.drop()
3. MongoDB data types
- Object ID: the document ID
- An ObjectId is a 12-byte value, displayed as hex:
- the first 4 bytes are the current timestamp
- the next 3 bytes are a machine identifier
- the next 2 bytes are the MongoDB server process id
- the last 3 bytes are a simple incrementing counter
- String: the most common type; must be valid UTF-8
- Boolean: stores a boolean value, true or false
- Integer: integers can be 32-bit or 64-bit, depending on the server
- Arrays: an array or list; stores multiple values under one key
- Object: an embedded document, i.e. a value that is itself a document
- Null: stores a null value
- Timestamp: a timestamp; the number of seconds since 1970-01-01
- Date: stores a date or time in UNIX time format
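The ObjectId layout described above can be decoded by hand. Below is a minimal pure-Python sketch (no bson dependency) of the classic layout; note that newer MongoDB versions (3.4+) replace the machine-id and pid bytes with a single 5-byte random value, so treat the field names as illustrative:

```python
def decode_object_id(oid_hex):
    """Split a 12-byte ObjectId hex string into the four classic fields:
    4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter."""
    raw = bytes.fromhex(oid_hex)
    assert len(raw) == 12
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # seconds since epoch
        "machine":   raw[4:7].hex(),                    # machine identifier
        "pid":       int.from_bytes(raw[7:9], "big"),   # server process id
        "counter":   int.from_bytes(raw[9:12], "big"),  # incrementing counter
    }

fields = decode_object_id("5f1e2d3c0102030405060708")
print(fields)
```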
4. Insert, delete, update, query
Insert a document: db.test.insert({"name":"xiaofan","age":18})
- Query documents: db.test.find()
Insert (or replace by _id) a document: db.test.save({"_id":"0001", "name":"xiaofan", "age":98})
- Update documents
db.test.update({"name":"xiaofan"},{"name":"xiaoli"})  (full replacement: fields not listed are dropped)
- db.test.update({name:"xiaofan"},{$set:{name:"xiaoma"}})  (updates only the listed fields)
- db.test.update({name:"xiaoma"},{$set:{name:"xiaoxiaofan"}},{multi:true})  (updates all matching documents; multi:true only works together with the $set operator)
- Delete documents
db.test.remove({name:"xiaoxiaofan"}, {justOne:true})  deletes a single document
- db.test.remove({name:"xiaoxiaofan"})  deletes all matching documents
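A pure-Python sketch of the two update semantics (full replacement vs $set) — illustrative only, not pymongo:

```python
def mongo_update(doc, update):
    """Mimic db.collection.update() on one matched document:
    a plain document replaces everything except _id,
    while {$set: {...}} only patches the listed fields."""
    if "$set" in update:
        new_doc = dict(doc)
        new_doc.update(update["$set"])
        return new_doc
    replaced = dict(update)          # full replacement...
    if "_id" in doc:
        replaced["_id"] = doc["_id"]  # ...but _id is always kept
    return replaced

doc = {"_id": "0001", "name": "xiaofan", "age": 98}
print(mongo_update(doc, {"name": "xiaoli"}))            # age is gone
print(mongo_update(doc, {"$set": {"name": "xiaoma"}}))  # age survives
```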
05. Advanced queries
1. Querying data
- find({query document}): query, e.g. db.test.find({age:23})
- findOne({query document}): returns only one document, e.g. db.test.findOne({age:23})
- pretty(): formats the result, e.g. db.test.find({age:22}).pretty()
2. Comparison operators
- equals: the default; no operator needed
- less than: $lt
- less than or equal: $lte
- greater than: $gt, e.g. db.test.find({age:{$gt:22}})
- greater than or equal: $gte
- not equal: $ne, e.g. db.test.find({age:{$ne:23}})
3. Range operators
- $in: db.test.find({age:{$in:[22, 23]}})
- $nin: db.test.find({age:{$nin:[22, 23]}})
4. Logical operators
- and: simply list multiple conditions in one document, e.g. db.test.find({name:"xiao", age:22})
- or: use $or, e.g. db.test.find({$or:[{name:"xiao"},{age:23}]})
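The comparison, range, and logical operators above can be simulated with a small pure-Python matcher. This is an illustrative sketch, not how MongoDB evaluates queries internally; missing fields are not handled:

```python
OPS = {
    "$lt":  lambda v, arg: v < arg,
    "$lte": lambda v, arg: v <= arg,
    "$gt":  lambda v, arg: v > arg,
    "$gte": lambda v, arg: v >= arg,
    "$ne":  lambda v, arg: v != arg,
    "$in":  lambda v, arg: v in arg,
    "$nin": lambda v, arg: v not in arg,
}

def matches(doc, query):
    """Evaluate a small subset of the query language against one document."""
    for key, cond in query.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # operator document such as {"$gt": 22}
            if not all(OPS[op](doc.get(key), arg) for op, arg in cond.items()):
                return False
        elif doc.get(key) != cond:   # implicit equality; multiple keys mean AND
            return False
    return True

docs = [{"name": "xiao", "age": 22}, {"name": "fan", "age": 23}]
print([d["age"] for d in docs if matches(d, {"age": {"$gt": 22}})])  # [23]
print([d["name"] for d in docs if matches(d, {"$or": [{"name": "xiao"}, {"age": 23}]})])
```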
5. Regular expressions
- Write regular expressions with // or with $regex
- starts with ...: db.test.find({name:/^xiaoxiao/})
- ends with ...: db.test.find({name:{$regex:"24$"}})
6. skip and limit
- skip() works like OFFSET in MySQL
- limit() works like LIMIT in MySQL, e.g. db.test.find().skip(2).limit(2)
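The offset/limit analogy maps directly onto Python slicing, which is an easy way to remember it:

```python
def skip_limit(cursor, skip=0, limit=None):
    """skip() then limit(), expressed as list slicing -
    the same offset/limit idea as in MySQL."""
    end = None if limit is None else skip + limit
    return cursor[skip:end]

ages = [18, 21, 22, 23, 98]
print(skip_limit(ages, skip=2, limit=2))  # [22, 23]
```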
7. Custom queries with $where
8. Projection
- Select only the needed fields from the result. The second argument maps field names to flags: 1 means show, omitted means hide; _id is shown by default and must be explicitly set to 0 to hide it
- db.test.find({age:{$gt:22}},{name:1,age:1,_id:0})
{ "name" : "xiaoxiaofan", "age" : 98 }
{ "name" : "xiao24", "age" : 23 }
{ "name" : "xiao25", "age" : 23 }
9. Sorting with sort(): sorts the result; 1 means ascending, -1 means descending
db.test.find().sort({age:-1})
10. Counting and deduplication
- Count: db.test.find({age:{$gte:23}}).count()
- Distinct: db.collection_name.distinct('field', {condition}), e.g. db.test.distinct("name",{age:{$gt:22}})
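Projection and distinct() can likewise be mimicked over plain dicts — an illustrative sketch of the rules described above:

```python
def project(doc, fields):
    """Mimic the second argument of find(): keep fields marked 1;
    show _id unless it is explicitly set to 0."""
    out = {k: doc[k] for k, flag in fields.items()
           if flag == 1 and k in doc}
    if fields.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    return out

def distinct(docs, field, predicate=lambda d: True):
    """Mimic db.collection.distinct(field, condition):
    unique values of `field` over the docs matching `predicate`."""
    seen = []
    for d in docs:
        if predicate(d) and d.get(field) not in seen:
            seen.append(d.get(field))
    return seen

docs = [{"_id": 1, "name": "xiao24", "age": 23},
        {"_id": 2, "name": "xiao25", "age": 23},
        {"_id": 3, "name": "xiao24", "age": 18}]
print(project(docs[0], {"name": 1, "age": 1, "_id": 0}))
print(distinct(docs, "name", lambda d: d["age"] > 22))
```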
06. MongoDB backup and restore
- Backup
mongodump -d test1 -o ./test1000  (local backup)
- Restore
mongorestore -d test2000 --dir ./test1000/test1
07. MongoDB aggregation: aggregate
1. aggregate is a data-processing pipeline: each document passes through a pipeline built from several stages, each stage can group, filter, and so on, and after the whole series of steps the final result is emitted.
- db.collection_name.aggregate({stage:{expression}})
2. Common pipeline stages
- $group: groups documents in the collection; used for computing statistics
- $match: filters the data, passing on only the documents that match the condition
- $project: reshapes the input documents, e.g. rename, add, or remove fields, or create computed fields
- $sort: sorts the input documents and outputs them
- $limit: limits the number of documents the pipeline outputs
- $skip: skips a given number of documents and returns the rest
- $unwind: splits an array-typed field into one document per element
3. Expressions: process input documents and produce output
- Syntax: expression: "$field_name"
- Common expressions
- $sum: computes a total; $sum:1 counts one per document
- $avg: computes the average
- $min: gets the minimum
- $max: gets the maximum
- $push: appends a value to an array in the result document
- $first: gets the first document according to the sort order of the input
- $last: gets the last document according to the sort order of the input
4. Examples
- Grouping with $group
db.test.aggregate(
{$group:{_id: "$age"}}
)
- Count the documents in each group
db.test.aggregate(
{$group:{_id: "$age", count:{$sum:1}}}
)
- Use $group to count documents and compute the average age; documents without an age take no part in the calculation
db.test.aggregate(
{$group:{_id: null, count:{$sum:1}, mean_avg:{$avg:"$age"}}}
)
Notes on $group
- The result contains exactly as many values as the document passed to $group
- The grouping key goes after '_id'; if it is null, the whole collection is aggregated as one group
- Reference the value of a field with '$field', e.g. '$gender', '$age'
Using $project
db.test.aggregate(
{$group:{_id:"$age", count:{$sum:1}}},
{$project:{age: "$_id", count:1, _id:0}}
)
db.test.aggregate(
{$group:{_id:"$age", count:{$sum:1}}},
{$project:{age: "$_id", count:"$count", _id:0}}
)
$match works like find, but inside a pipeline only $match can be used
db.test.aggregate(
{$match:{age:{$gt:22}}}
)
Deduplication
db.test.aggregate({
$group:{_id:{country:"$country", province:"$province", userid:"$userid"}}
})
Deduplicate, then group by province and count
db.test.aggregate(
{$group:{_id:{country:"$country", province:"$province", userid:"$userid"}}},
{$group:{_id:{country:"$_id.country", province:"$_id.province"}, count:{$sum:1}}}
)
Format (prettify) the result
db.test.aggregate(
{$group:{_id:{country:"$country", province:"$province", userid:"$userid"}}},
{$group:{_id:{country:"$_id.country", province:"$_id.province"}, count:{$sum:1}}},
{$project:{country:"$_id.country", province:"$_id.province", count:1, _id:0}}
)
Aggregation: using $unwind
db.test.insert({username:"Alex", "tags":["C#","Java","C++"]})
db.test.aggregate({$match:{username:"Alex"}},{$unwind:"$tags"},{$group:{_id:null, count:{$sum:1}}})
- $unwind: when its preserveNullAndEmptyArrays option is false, documents whose field is null or an empty array are dropped; when true, they are kept
db.test.aggregate({$match:{username:"Alex"}},{$unwind:{path:"$tags",preserveNullAndEmptyArrays:true}},{$group:{_id:null, count:{$sum:1}}})
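The $unwind and $group examples above can be traced in pure Python to see what each stage does to the documents — an illustrative re-implementation, not MongoDB's engine:

```python
from collections import defaultdict

def unwind(docs, array_field):
    """Mimic {$unwind: "$array_field"}: one output document per element."""
    out = []
    for d in docs:
        for value in d.get(array_field, []):
            flat = dict(d)
            flat[array_field] = value
            out.append(flat)
    return out

def group_count(docs, key):
    """Mimic {$group: {_id: "$key", count: {$sum: 1}}} followed by a
    $project that renames _id back to the field name."""
    counts = defaultdict(int)
    for d in docs:
        counts[d.get(key)] += 1
    return [{key: k, "count": c} for k, c in counts.items()]

docs = [{"username": "Alex", "tags": ["C#", "Java", "C++"]}]
flat = unwind(docs, "tags")
print(len(flat))                    # 3, like the $unwind + $sum:1 example
print(group_count(flat, "tags"))
```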
08. Creating indexes
- Measure query execution time: db.test1.find({name:"test9999"}).explain("executionStats")
- List all indexes on the current collection: db.test1.getIndexes()
- By default, indexed field values are allowed to repeat
- Create an index: db.test1.ensureIndex({name:1})
- Create a unique index (indexed values must be unique; this can also be used to deduplicate): db.test1.ensureIndex({name:1},{unique:true})
- Create a compound index: db.test1.ensureIndex({name:1, age:1})
- Drop an index: db.test1.dropIndex({name:1})
09. Working with pymongo
from pymongo import MongoClient

client = MongoClient(host="mini3", port=27017)
# test2 is the database name, test1 the collection name
collection = client["test2"]["test1"]
ret = collection.find().limit(10)
for i in ret:
    print(i)
10. MongoDB case study
- Crawler source code, writing the scraped data into MongoDB
import requests
from lxml import etree
import time
from pymongo import MongoClient


class MySpider:
    def __init__(self):
        self.client = MongoClient(host="mini3", port=27017)
        # db_area is the database name, c_area the collection name
        self.collection = self.client["db_area"]["c_area"]
        self.base_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/{}"
        self.start_url = self.base_url.format("index.html")
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, "
                          "like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}

    def parse_url(self, url):  # send the request and fetch the response
        response = requests.get(url, headers=self.headers)
        return response.content.decode("gbk", errors='ignore')

    def run(self):  # main logic
        province_ret_str = self.parse_url(self.start_url)
        province_html = etree.HTML(province_ret_str)
        province_hrefs = province_html.xpath("//tr[@class='provincetr']/td/a")
        # extract each province and the link to its page
        for a in province_hrefs:
            province_dict = {}
            province = a.xpath("./text()")[0]
            province_link = a.xpath("./@href")[0]
            province_dict["province_name"] = province
            # crawl the city-level information
            city_ret_str = self.parse_url(self.base_url.format(province_link))
            city_html = etree.HTML(city_ret_str)
            city_trs = city_html.xpath("//tr[@class='citytr']")
            city_item = []
            for city_tr in city_trs:
                city_dict = {}
                country_link = city_tr.xpath("./td[1]/a/@href")[0]
                city_code = city_tr.xpath("./td[1]/a/text()")[0]
                city_code = city_code[0:6]
                city_name = city_tr.xpath("./td[2]/a/text()")[0]
                city_dict["city_name"] = city_name
                city_dict["city_code"] = city_code
                # crawl the county-level information
                country_ret_str = self.parse_url(self.base_url.format(country_link))
                country_html = etree.HTML(country_ret_str)
                country_trs = country_html.xpath("//tr[@class='countytr']")
                country_item = []
                for country_tr in country_trs:
                    country_dict = {}
                    country_code = country_tr.xpath("./td[1]//text()")[0]
                    country_code = country_code[0:6]
                    country_name = country_tr.xpath("./td[2]//text()")[0]
                    country_dict["country_name"] = country_name
                    country_dict["country_code"] = country_code
                    country_item.append(country_dict)
                city_dict["country"] = country_item
                city_item.append(city_dict)
            province_dict["city"] = city_item
            # insert_one replaces the deprecated Collection.insert
            self.collection.insert_one(province_dict)
            time.sleep(2)


if __name__ == '__main__':
    mySpider = MySpider()
    mySpider.run()
- Process the data stored in MongoDB and write it to Excel
import xlsxwriter
from pymongo import MongoClient


def save_to_excel(data_list):
    """Save the data to an Excel file.
    :param data_list: list of row dicts
    """
    workbook = xlsxwriter.Workbook("省份-城市-地区表.xlsx")
    worksheet = workbook.add_worksheet()
    cols = ['id', 'name', 'code']
    for col_index, col in enumerate(cols):
        worksheet.write(1, col_index, col)
    for row_index, data in enumerate(data_list, start=2):
        worksheet.write(row_index, 0, data['id'])
        worksheet.write(row_index, 1, data['name'])
        worksheet.write(row_index, 2, data['code'])
    workbook.close()


client = MongoClient(host="mini3", port=27017)
# db_area is the database name, c_area the collection name
collection = client["db_area"]["c_area"]
ret = collection.aggregate([{"$unwind": "$city"},
                            {"$group": {"_id": {"province": "$province_name", "city": "$city"}}},
                            {"$project": {"province_name": "$_id.province", "city": "$_id.city", "_id": 0}},
                            {"$unwind": {"path": "$city.country", "preserveNullAndEmptyArrays": True}},
                            {"$project": {"province_name": 1, "city_name": "$city.city_name",
                                          "city_code": "$city.city_code",
                                          "country_name": "$city.country.country_name",
                                          "country_code": "$city.country.country_code"}},
                            {"$sort": {"province_name": 1}}])

# Flatten the aggregated documents into numbered rows: write a row
# whenever a new province or city appears, then one row per county.
# County fields may be missing, hence the try/except.
province = ""
city_code = ""
data_list = []
i = 0
for item in ret:
    if province != item["province_name"]:
        # new province: write a province row (provinces carry no code here)
        i = i + 1
        data_list.append({"id": i, "code": None, "name": item["province_name"]})
    if city_code != item["city_code"]:
        # new city: write a city row
        i = i + 1
        data_list.append({"id": i, "name": item["city_name"], "code": item["city_code"]})
        city_code = item["city_code"]
    try:
        # write the county row; i is incremented even when the county
        # fields are absent, matching the original numbering behaviour
        i = i + 1
        data_list.append({"id": i, "name": item["country_name"], "code": item["country_code"]})
    except KeyError as e:
        print(e)
    province = item["province_name"]
print(data_list)
save_to_excel(data_list)
11. Cleaning the hospital data
- Crawl the CDC data and save it to MongoDB
import re
import requests
from lxml import etree
from pymongo import MongoClient

# database name in MongoDB
DB_HOSPITAL = "db_hospital"
# collection name in MongoDB
PROVINCE = "heibei"
DISEASE_NAME = "cdc"
TB_HOSPITAL = "t_hospital_{}_{}".format(PROVINCE, DISEASE_NAME)
# URL to crawl
URL = "http://www.hebeicdc.cn/jkjg/1107.jhtml"
# current province
PROVINCE_NAME = "河北省"


class HospitalSpider:
    """Crawl the CDC listings and save them to MongoDB."""

    def __init__(self):
        self.client = MongoClient(host="192.168.1.27", port=27017)
        # db_hospital is the database name, t_hospital_... the collection name
        self.collection = self.client[DB_HOSPITAL][TB_HOSPITAL]
        self.start_url = URL
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, "
                          "like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}
        self.city_name = ""

    def parse_url(self, url):  # send the request and fetch the response
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def run(self):  # main logic
        hospital_ret_str = self.parse_url(self.start_url)
        hospital_html = etree.HTML(hospital_ret_str)
        tr_elements = hospital_html.xpath("//div[@class='contxt']//tr")
        for tr in tr_elements:
            item_dict = {}
            tds = tr.xpath("./td")
            if len(tds) > 1:
                hospital_name = tr.xpath("./td[1]//text()")
                hospital_name = "".join(hospital_name)
                # the full text of the cell is needed
                other_info = tr.xpath("./td[2]//text()")
                other_info = "".join(other_info)
                ret = re.match(r"地址:(?P<name1>.*).*邮编:(?P<name2>.*)", other_info)
                hospital_address = ret.group(1)
                post_code = ret.group(2)
                item_dict['province_name'] = PROVINCE_NAME
                item_dict['city_name'] = self.city_name
                # first try to extract the county from the address
                hospital_address = hospital_address.strip()
                ret = re.match(r"(?P<name1>.*[区县]).*", hospital_address)
                if ret is not None:
                    item_dict['county_name'] = ret.group(1)
                else:
                    # try to extract a prefecture-level city from the address
                    ret = re.match(r"(?P<name1>.*市).*", hospital_address)
                    if ret is not None:
                        item_dict['county_name'] = ret.group(1)
                    else:
                        # try to extract the county from the hospital name
                        ret = re.match(r"(?P<name1>.*[区县]).*", hospital_name)
                        if ret is not None:
                            item_dict['county_name'] = ret.group(1)
                        else:
                            # try to extract a prefecture-level city from the hospital name
                            ret = re.match(r"(?P<name1>.*市).*", hospital_name)
                            if ret is not None:
                                item_dict['county_name'] = ret.group(1)
                            else:
                                item_dict['county_name'] = "无法识别"
                item_dict['hospital_name'] = hospital_name.strip()
                item_dict['hospital_address'] = hospital_address
                item_dict['post_code'] = post_code.strip()
                item_dict['alias_name'] = ""
                item_dict['organization_type'] = ""
                item_dict['note'] = ""
                print(item_dict)
                self.collection.insert_one(item_dict)
            else:
                # single-cell rows are city headers; remember the city name
                self.city_name = tr.xpath("./td[1]//text()")[0].strip()


if __name__ == '__main__':
    hospitalSpider = HospitalSpider()
    hospitalSpider.run()
- Export the CDC data from MongoDB to Excel
import xlsxwriter
from pymongo import MongoClient

# database name in MongoDB
DB_HOSPITAL = "db_hospital"
# collection name in MongoDB
PROVINCE = "heibei"
DISEASE_NAME = "cdc"
TB_HOSPITAL = "t_hospital_{}_{}".format(PROVINCE, DISEASE_NAME)
EXCEL_NAME = "河北省省、市、县(区)疾控机构 .xlsx"


def save_to_excel(title, data_list):
    """Save the data to an Excel file.
    :param data_list: list of row dicts
    """
    workbook = xlsxwriter.Workbook(title)
    worksheet = workbook.add_worksheet()
    cols = ['province_name', 'city_name', 'county_name', 'hospital_name', 'hospital_address',
            'alias_name', 'organization_type', 'note']
    for col_index, col in enumerate(cols):
        worksheet.write(1, col_index, col)
    for row_index, data in enumerate(data_list, start=2):
        worksheet.write(row_index, 0, data['province_name'])
        worksheet.write(row_index, 1, data['city_name'])
        worksheet.write(row_index, 2, data['county_name'])
        worksheet.write(row_index, 3, data['hospital_name'])
        worksheet.write(row_index, 4, data['hospital_address'])
        # None values would break the spreadsheet, so write "" instead
        worksheet.write(row_index, 5, data['alias_name'] or "")
        worksheet.write(row_index, 6, data['organization_type'] or "")
        worksheet.write(row_index, 7, data['note'] or "")
    workbook.close()


if __name__ == '__main__':
    # export the CDC data from MongoDB to Excel
    client = MongoClient(host="192.168.1.27", port=27017)
    collection = client[DB_HOSPITAL][TB_HOSPITAL]
    # group on every field to deduplicate, then project the fields back out
    ret = collection.aggregate(
        [{"$group": {
            "_id": {"province_name": "$province_name", "city_name": "$city_name", "county_name": "$county_name",
                    "hospital_name": "$hospital_name", "hospital_address": "$hospital_address",
                    "alias_name": "$alias_name",
                    "organization_type": "$organization_type", "note": "$note"}}},
         {"$project": {"province_name": "$_id.province_name", "city_name": "$_id.city_name",
                       "county_name": "$_id.county_name", "hospital_name": "$_id.hospital_name",
                       "hospital_address": "$_id.hospital_address",
                       "alias_name": "$_id.alias_name", "organization_type": "$_id.organization_type",
                       "note": "$_id.note",
                       "_id": 0}},
         {"$sort": {"province_name": 1, "city_name": 1, "county_name": 1}}  # was "country_name", a typo
         ])
    data_list = []
    for item in ret:
        print(item)
        data_list.append(item)
    save_to_excel(EXCEL_NAME, data_list)
- Import all Excel files into MongoDB
import os
import xlrd
from pymongo import MongoClient

count = 0
# database name in MongoDB
DB_HOSPITAL = "db_hospital"
# collection name in MongoDB
PROVINCE = "heibei"
DISEASE_NAME = "total"
TB_HOSPITAL = "t_hospital_{}_{}".format(PROVINCE, DISEASE_NAME)


def save_data_2_mongo(filename, collection):
    global count
    # xlrd reference: https://www.cnblogs.com/nancyzhu/p/8401552.html
    # xlrd is read-only; open the workbook
    book = xlrd.open_workbook(filename)
    # get a sheet by index
    sheet = book.sheet_by_index(0)
    # or get a sheet by name:
    # sheet = book.sheet_by_name(sheet_name)
    print("number of columns:", sheet.ncols)
    col_names = ['province_name', 'city_name', 'county_name', 'hospital_name',
                 'alias_name', 'organization_type', 'note']
    for row in range(sheet.nrows):
        if row < 2:
            # skip the two header rows
            continue
        count += 1
        item_dict = {}
        for col, col_name in enumerate(col_names[:sheet.ncols]):
            item_dict[col_name] = sheet.cell(row, col).value
        collection.insert_one(item_dict)


if __name__ == '__main__':
    # read every Excel file and save its rows to MongoDB
    client = MongoClient(host="192.168.1.27", port=27017)
    collection = client[DB_HOSPITAL][TB_HOSPITAL]
    directory = r"C:\Users\59404\Desktop\医院数据2019"
    filenames = os.listdir(directory)
    print(filenames)
    for filename in filenames:
        filename = directory + "/" + filename
        save_data_2_mongo(filename, collection)
    print(count)