MongoDB Document Database Summary
1. Link: installing MongoDB on CentOS 7
Note:
The linked article has one mistake: change https to http
2. Common MongoDB commands
- Show the current database
db
- Show all databases
show dbs / show databases
- Switch to (and implicitly create) a database
use db_name
- Drop the current database: db.dropDatabase()
- Show all collections: show collections
- Drop a collection: db.collection_name.drop()
3. MongoDB data types
- Object ID: the document ID
- An ObjectId is a 12-byte value, displayed as hex:
- the first 4 bytes are the current timestamp
- the next 3 bytes are a machine identifier
- the next 2 bytes are the MongoDB server process id
- the last 3 bytes are a simple incrementing counter
- String: the most common type; must be valid UTF-8
- Boolean: stores a boolean value, true or false
- Integer: integers can be 32-bit or 64-bit, depending on the server
- Arrays: an array or list; stores multiple values under one key
- Object: an embedded document, i.e. a value that is itself a document
- Null: stores a null value
- Timestamp: a timestamp; the number of seconds since 1970-01-01
- Date: stores a date or time in UNIX time format
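The ObjectId layout described above can be decoded by hand. Below is a minimal pure-Python sketch (no bson dependency) of the classic layout; note that newer MongoDB versions (3.4+) replace the machine-id and pid bytes with a single 5-byte random value, so treat the field names as illustrative:

```python
def decode_object_id(oid_hex):
    """Split a 12-byte ObjectId hex string into the four classic fields:
    4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter."""
    raw = bytes.fromhex(oid_hex)
    assert len(raw) == 12
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # seconds since epoch
        "machine":   raw[4:7].hex(),                    # machine identifier
        "pid":       int.from_bytes(raw[7:9], "big"),   # server process id
        "counter":   int.from_bytes(raw[9:12], "big"),  # incrementing counter
    }

fields = decode_object_id("5f1e2d3c0102030405060708")
print(fields)
```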
4. Insert, delete, update, query
Insert a document: db.test.insert({"name":"xiaofan","age":18})
- Query documents: db.test.find()
Insert (or replace by _id) a document: db.test.save({"_id":"0001", "name":"xiaofan", "age":98})
- Update documents
db.test.update({"name":"xiaofan"},{"name":"xiaoli"})  (full replacement: fields not listed are dropped)
- db.test.update({name:"xiaofan"},{$set:{name:"xiaoma"}})  (updates only the listed fields)
- db.test.update({name:"xiaoma"},{$set:{name:"xiaoxiaofan"}},{multi:true})  (updates all matching documents; multi:true only works together with the $set operator)
- Delete documents
db.test.remove({name:"xiaoxiaofan"}, {justOne:true})  deletes a single document
- db.test.remove({name:"xiaoxiaofan"})  deletes all matching documents
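A pure-Python sketch of the two update semantics (full replacement vs $set) — illustrative only, not pymongo:

```python
def mongo_update(doc, update):
    """Mimic db.collection.update() on one matched document:
    a plain document replaces everything except _id,
    while {$set: {...}} only patches the listed fields."""
    if "$set" in update:
        new_doc = dict(doc)
        new_doc.update(update["$set"])
        return new_doc
    replaced = dict(update)          # full replacement...
    if "_id" in doc:
        replaced["_id"] = doc["_id"]  # ...but _id is always kept
    return replaced

doc = {"_id": "0001", "name": "xiaofan", "age": 98}
print(mongo_update(doc, {"name": "xiaoli"}))            # age is gone
print(mongo_update(doc, {"$set": {"name": "xiaoma"}}))  # age survives
```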
05. Advanced queries
1. Querying data
- find({query document}): query, e.g. db.test.find({age:23})
- findOne({query document}): returns only one document, e.g. db.test.findOne({age:23})
- pretty(): formats the result, e.g. db.test.find({age:22}).pretty()
2. Comparison operators
- equals: the default; no operator needed
- less than: $lt
- less than or equal: $lte
- greater than: $gt, e.g. db.test.find({age:{$gt:22}})
- greater than or equal: $gte
- not equal: $ne, e.g. db.test.find({age:{$ne:23}})
3. Range operators
- $in: db.test.find({age:{$in:[22, 23]}})
- $nin: db.test.find({age:{$nin:[22, 23]}})
4. Logical operators
- and: simply list multiple conditions in one document, e.g. db.test.find({name:"xiao", age:22})
- or: use $or, e.g. db.test.find({$or:[{name:"xiao"},{age:23}]})
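The comparison, range, and logical operators above can be simulated with a small pure-Python matcher. This is an illustrative sketch, not how MongoDB evaluates queries internally; missing fields are not handled:

```python
OPS = {
    "$lt":  lambda v, arg: v < arg,
    "$lte": lambda v, arg: v <= arg,
    "$gt":  lambda v, arg: v > arg,
    "$gte": lambda v, arg: v >= arg,
    "$ne":  lambda v, arg: v != arg,
    "$in":  lambda v, arg: v in arg,
    "$nin": lambda v, arg: v not in arg,
}

def matches(doc, query):
    """Evaluate a small subset of the query language against one document."""
    for key, cond in query.items():
        if key == "$or":
            if not any(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # operator document such as {"$gt": 22}
            if not all(OPS[op](doc.get(key), arg) for op, arg in cond.items()):
                return False
        elif doc.get(key) != cond:   # implicit equality; multiple keys mean AND
            return False
    return True

docs = [{"name": "xiao", "age": 22}, {"name": "fan", "age": 23}]
print([d["age"] for d in docs if matches(d, {"age": {"$gt": 22}})])  # [23]
print([d["name"] for d in docs if matches(d, {"$or": [{"name": "xiao"}, {"age": 23}]})])
```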
5. Regular expressions
- Write regular expressions with // or with $regex
- starts with ...: db.test.find({name:/^xiaoxiao/})
- ends with ...: db.test.find({name:{$regex:"24$"}})
6. skip and limit
- skip() works like OFFSET in MySQL
- limit() works like LIMIT in MySQL, e.g. db.test.find().skip(2).limit(2)
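The offset/limit analogy maps directly onto Python slicing, which is an easy way to remember it:

```python
def skip_limit(cursor, skip=0, limit=None):
    """skip() then limit(), expressed as list slicing -
    the same offset/limit idea as in MySQL."""
    end = None if limit is None else skip + limit
    return cursor[skip:end]

ages = [18, 21, 22, 23, 98]
print(skip_limit(ages, skip=2, limit=2))  # [22, 23]
```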
7. Custom queries with $where
8. Projection
- Select only the needed fields from the result. The second argument maps field names to flags: 1 means show, omitted means hide; _id is shown by default and must be explicitly set to 0 to hide it
- db.test.find({age:{$gt:22}},{name:1,age:1,_id:0})
{ "name" : "xiaoxiaofan", "age" : 98 }
{ "name" : "xiao24", "age" : 23 }
{ "name" : "xiao25", "age" : 23 }
9. Sorting with sort(): sorts the result; 1 means ascending, -1 means descending
db.test.find().sort({age:-1})
10. Counting and deduplication
- Count: db.test.find({age:{$gte:23}}).count()
- Distinct: db.collection_name.distinct('field', {condition}), e.g. db.test.distinct("name",{age:{$gt:22}})
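Projection and distinct() can likewise be mimicked over plain dicts — an illustrative sketch of the rules described above:

```python
def project(doc, fields):
    """Mimic the second argument of find(): keep fields marked 1;
    show _id unless it is explicitly set to 0."""
    out = {k: doc[k] for k, flag in fields.items()
           if flag == 1 and k in doc}
    if fields.get("_id", 1) and "_id" in doc:
        out["_id"] = doc["_id"]
    return out

def distinct(docs, field, predicate=lambda d: True):
    """Mimic db.collection.distinct(field, condition):
    unique values of `field` over the docs matching `predicate`."""
    seen = []
    for d in docs:
        if predicate(d) and d.get(field) not in seen:
            seen.append(d.get(field))
    return seen

docs = [{"_id": 1, "name": "xiao24", "age": 23},
        {"_id": 2, "name": "xiao25", "age": 23},
        {"_id": 3, "name": "xiao24", "age": 18}]
print(project(docs[0], {"name": 1, "age": 1, "_id": 0}))
print(distinct(docs, "name", lambda d: d["age"] > 22))
```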
06. MongoDB backup and restore
- Backup
mongodump -d test1 -o ./test1000  (local backup)
- Restore
mongorestore -d test2000 --dir ./test1000/test1
07. MongoDB aggregation: aggregate
1. aggregate is a data-processing pipeline: each document passes through a pipeline built from several stages, each stage can group, filter, and so on, and after the whole series of steps the final result is emitted.
- db.collection_name.aggregate({stage:{expression}})
2. Common pipeline stages
- $group: groups documents in the collection; used for computing statistics
- $match: filters the data, passing on only the documents that match the condition
- $project: reshapes the input documents, e.g. rename, add, or remove fields, or create computed fields
- $sort: sorts the input documents and outputs them
- $limit: limits the number of documents the pipeline outputs
- $skip: skips a given number of documents and returns the rest
- $unwind: splits an array-typed field into one document per element
3. Expressions: process input documents and produce output
- Syntax: expression: "$field_name"
- Common expressions
- $sum: computes a total; $sum:1 counts one per document
- $avg: computes the average
- $min: gets the minimum
- $max: gets the maximum
- $push: appends a value to an array in the result document
- $first: gets the first document according to the sort order of the input
- $last: gets the last document according to the sort order of the input
4. Examples
- Grouping with $group
db.test.aggregate(
{$group:{_id: "$age"}}
)
- Count the documents in each group
db.test.aggregate(
{$group:{_id: "$age", count:{$sum:1}}}
)
- Use $group to count documents and compute the average age; documents without an age take no part in the calculation
db.test.aggregate(
{$group:{_id: null, count:{$sum:1}, mean_avg:{$avg:"$age"}}}
)
Notes on $group
- The result contains exactly as many values as the document passed to $group
- The grouping key goes after '_id'; if it is null, the whole collection is aggregated as one group
- Reference the value of a field with '$field', e.g. '$gender', '$age'
Using $project
db.test.aggregate(
{$group:{_id:"$age", count:{$sum:1}}},
{$project:{age: "$_id", count:1, _id:0}}
)
db.test.aggregate(
{$group:{_id:"$age", count:{$sum:1}}},
{$project:{age: "$_id", count:"$count", _id:0}}
)
$match works like find, but inside a pipeline only $match can be used
db.test.aggregate(
{$match:{age:{$gt:22}}}
)
Deduplication
db.test.aggregate({
$group:{_id:{country:"$country", province:"$province", userid:"$userid"}}
})
Deduplicate, then group by province and count
db.test.aggregate(
{$group:{_id:{country:"$country", province:"$province", userid:"$userid"}}},
{$group:{_id:{country:"$_id.country", province:"$_id.province"}, count:{$sum:1}}}
)
Format (prettify) the result
db.test.aggregate(
{$group:{_id:{country:"$country", province:"$province", userid:"$userid"}}},
{$group:{_id:{country:"$_id.country", province:"$_id.province"}, count:{$sum:1}}},
{$project:{country:"$_id.country", province:"$_id.province", count:1, _id:0}}
)
Aggregation: using $unwind
db.test.insert({username:"Alex", "tags":["C#","Java","C++"]})
db.test.aggregate({$match:{username:"Alex"}},{$unwind:"$tags"},{$group:{_id:null, count:{$sum:1}}})
- $unwind: when its preserveNullAndEmptyArrays option is false, documents whose field is null or an empty array are dropped; when true, they are kept
db.test.aggregate({$match:{username:"Alex"}},{$unwind:{path:"$tags",preserveNullAndEmptyArrays:true}},{$group:{_id:null, count:{$sum:1}}})
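The $unwind and $group examples above can be traced in pure Python to see what each stage does to the documents — an illustrative re-implementation, not MongoDB's engine:

```python
from collections import defaultdict

def unwind(docs, array_field):
    """Mimic {$unwind: "$array_field"}: one output document per element."""
    out = []
    for d in docs:
        for value in d.get(array_field, []):
            flat = dict(d)
            flat[array_field] = value
            out.append(flat)
    return out

def group_count(docs, key):
    """Mimic {$group: {_id: "$key", count: {$sum: 1}}} followed by a
    $project that renames _id back to the field name."""
    counts = defaultdict(int)
    for d in docs:
        counts[d.get(key)] += 1
    return [{key: k, "count": c} for k, c in counts.items()]

docs = [{"username": "Alex", "tags": ["C#", "Java", "C++"]}]
flat = unwind(docs, "tags")
print(len(flat))                    # 3, like the $unwind + $sum:1 example
print(group_count(flat, "tags"))
```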
08. Creating indexes
- Measure query execution time: db.test1.find({name:"test9999"}).explain("executionStats")
- List all indexes on the current collection: db.test1.getIndexes()
- By default, indexed field values are allowed to repeat
- Create an index: db.test1.ensureIndex({name:1})
- Create a unique index (indexed values must be unique; this can also be used to deduplicate): db.test1.ensureIndex({name:1},{unique:true})
- Create a compound index: db.test1.ensureIndex({name:1, age:1})
- Drop an index: db.test1.dropIndex({name:1})
09. Working with pymongo
from pymongo import MongoClient

client = MongoClient(host="mini3", port=27017)
# test2 is the database name, test1 the collection name
collection = client["test2"]["test1"]
ret = collection.find().limit(10)
for i in ret:
    print(i)
10. MongoDB case study
- Crawler source code, writing the scraped data into MongoDB
import requests
from lxml import etree
import time
from pymongo import MongoClient


class MySpider:
    def __init__(self):
        self.client = MongoClient(host="mini3", port=27017)
        # db_area is the database name, c_area the collection name
        self.collection = self.client["db_area"]["c_area"]
        self.base_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2018/{}"
        self.start_url = self.base_url.format("index.html")
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, "
                          "like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}

    def parse_url(self, url):  # send the request and fetch the response
        response = requests.get(url, headers=self.headers)
        return response.content.decode("gbk", errors='ignore')

    def run(self):  # main logic
        province_ret_str = self.parse_url(self.start_url)
        province_html = etree.HTML(province_ret_str)
        province_hrefs = province_html.xpath("//tr[@class='provincetr']/td/a")
        # extract each province and the link to its page
        for a in province_hrefs:
            province_dict = {}
            province = a.xpath("./text()")[0]
            province_link = a.xpath("./@href")[0]
            province_dict["province_name"] = province
            # crawl the city-level information
            city_ret_str = self.parse_url(self.base_url.format(province_link))
            city_html = etree.HTML(city_ret_str)
            city_trs = city_html.xpath("//tr[@class='citytr']")
            city_item = []
            for city_tr in city_trs:
                city_dict = {}
                country_link = city_tr.xpath("./td[1]/a/@href")[0]
                city_code = city_tr.xpath("./td[1]/a/text()")[0]
                city_code = city_code[0:6]
                city_name = city_tr.xpath("./td[2]/a/text()")[0]
                city_dict["city_name"] = city_name
                city_dict["city_code"] = city_code
                # crawl the county-level information
                country_ret_str = self.parse_url(self.base_url.format(country_link))
                country_html = etree.HTML(country_ret_str)
                country_trs = country_html.xpath("//tr[@class='countytr']")
                country_item = []
                for country_tr in country_trs:
                    country_dict = {}
                    country_code = country_tr.xpath("./td[1]//text()")[0]
                    country_code = country_code[0:6]
                    country_name = country_tr.xpath("./td[2]//text()")[0]
                    country_dict["country_name"] = country_name
                    country_dict["country_code"] = country_code
                    country_item.append(country_dict)
                city_dict["country"] = country_item
                city_item.append(city_dict)
            province_dict["city"] = city_item
            # insert_one replaces the deprecated Collection.insert
            self.collection.insert_one(province_dict)
            time.sleep(2)


if __name__ == '__main__':
    mySpider = MySpider()
    mySpider.run()
- Process the data stored in MongoDB and write it to Excel
import xlsxwriter
from pymongo import MongoClient


def save_to_excel(data_list):
    """Save the data to an Excel file.
    :param data_list: list of row dicts
    """
    workbook = xlsxwriter.Workbook("省份-城市-地区表.xlsx")
    worksheet = workbook.add_worksheet()
    cols = ['id', 'name', 'code']
    for col_index, col in enumerate(cols):
        worksheet.write(1, col_index, col)
    for row_index, data in enumerate(data_list, start=2):
        worksheet.write(row_index, 0, data['id'])
        worksheet.write(row_index, 1, data['name'])
        worksheet.write(row_index, 2, data['code'])
    workbook.close()


client = MongoClient(host="mini3", port=27017)
# db_area is the database name, c_area the collection name
collection = client["db_area"]["c_area"]
ret = collection.aggregate([{"$unwind": "$city"},
                            {"$group": {"_id": {"province": "$province_name", "city": "$city"}}},
                            {"$project": {"province_name": "$_id.province", "city": "$_id.city", "_id": 0}},
                            {"$unwind": {"path": "$city.country", "preserveNullAndEmptyArrays": True}},
                            {"$project": {"province_name": 1, "city_name": "$city.city_name",
                                          "city_code": "$city.city_code",
                                          "country_name": "$city.country.country_name",
                                          "country_code": "$city.country.country_code"}},
                            {"$sort": {"province_name": 1}}])

# Flatten the aggregated documents into numbered rows: write a row
# whenever a new province or city appears, then one row per county.
# County fields may be missing, hence the try/except.
province = ""
city_code = ""
data_list = []
i = 0
for item in ret:
    if province != item["province_name"]:
        # new province: write a province row (provinces carry no code here)
        i = i + 1
        data_list.append({"id": i, "code": None, "name": item["province_name"]})
    if city_code != item["city_code"]:
        # new city: write a city row
        i = i + 1
        data_list.append({"id": i, "name": item["city_name"], "code": item["city_code"]})
        city_code = item["city_code"]
    try:
        # write the county row; i is incremented even when the county
        # fields are absent, matching the original numbering behaviour
        i = i + 1
        data_list.append({"id": i, "name": item["country_name"], "code": item["country_code"]})
    except KeyError as e:
        print(e)
    province = item["province_name"]
print(data_list)
save_to_excel(data_list)
11. Cleaning the hospital data
- Crawl the CDC data and save it to MongoDB
import re
import requests
from lxml import etree
from pymongo import MongoClient

# database name in MongoDB
DB_HOSPITAL = "db_hospital"
# collection name in MongoDB
PROVINCE = "heibei"
DISEASE_NAME = "cdc"
TB_HOSPITAL = "t_hospital_{}_{}".format(PROVINCE, DISEASE_NAME)
# URL to crawl
URL = "http://www.hebeicdc.cn/jkjg/1107.jhtml"
# current province
PROVINCE_NAME = "河北省"


class HospitalSpider:
    """Crawl the CDC listings and save them to MongoDB."""

    def __init__(self):
        self.client = MongoClient(host="192.168.1.27", port=27017)
        # db_hospital is the database name, t_hospital_... the collection name
        self.collection = self.client[DB_HOSPITAL][TB_HOSPITAL]
        self.start_url = URL
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, "
                          "like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"}
        self.city_name = ""

    def parse_url(self, url):  # send the request and fetch the response
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def run(self):  # main logic
        hospital_ret_str = self.parse_url(self.start_url)
        hospital_html = etree.HTML(hospital_ret_str)
        tr_elements = hospital_html.xpath("//div[@class='contxt']//tr")
        for tr in tr_elements:
            item_dict = {}
            tds = tr.xpath("./td")
            if len(tds) > 1:
                hospital_name = tr.xpath("./td[1]//text()")
                hospital_name = "".join(hospital_name)
                # the full text of the cell is needed
                other_info = tr.xpath("./td[2]//text()")
                other_info = "".join(other_info)
                ret = re.match(r"地址:(?P<name1>.*).*邮编:(?P<name2>.*)", other_info)
                hospital_address = ret.group(1)
                post_code = ret.group(2)
                item_dict['province_name'] = PROVINCE_NAME
                item_dict['city_name'] = self.city_name
                # first try to extract the county from the address
                hospital_address = hospital_address.strip()
                ret = re.match(r"(?P<name1>.*[区县]).*", hospital_address)
                if ret is not None:
                    item_dict['county_name'] = ret.group(1)
                else:
                    # try to extract a prefecture-level city from the address
                    ret = re.match(r"(?P<name1>.*市).*", hospital_address)
                    if ret is not None:
                        item_dict['county_name'] = ret.group(1)
                    else:
                        # try to extract the county from the hospital name
                        ret = re.match(r"(?P<name1>.*[区县]).*", hospital_name)
                        if ret is not None:
                            item_dict['county_name'] = ret.group(1)
                        else:
                            # try to extract a prefecture-level city from the hospital name
                            ret = re.match(r"(?P<name1>.*市).*", hospital_name)
                            if ret is not None:
                                item_dict['county_name'] = ret.group(1)
                            else:
                                item_dict['county_name'] = "无法识别"
                item_dict['hospital_name'] = hospital_name.strip()
                item_dict['hospital_address'] = hospital_address
                item_dict['post_code'] = post_code.strip()
                item_dict['alias_name'] = ""
                item_dict['organization_type'] = ""
                item_dict['note'] = ""
                print(item_dict)
                self.collection.insert_one(item_dict)
            else:
                # single-cell rows are city headers; remember the city name
                self.city_name = tr.xpath("./td[1]//text()")[0].strip()


if __name__ == '__main__':
    hospitalSpider = HospitalSpider()
    hospitalSpider.run()
- Export the CDC data from MongoDB to Excel
import xlsxwriter
from pymongo import MongoClient

# database name in MongoDB
DB_HOSPITAL = "db_hospital"
# collection name in MongoDB
PROVINCE = "heibei"
DISEASE_NAME = "cdc"
TB_HOSPITAL = "t_hospital_{}_{}".format(PROVINCE, DISEASE_NAME)
EXCEL_NAME = "河北省省、市、县(区)疾控机构 .xlsx"


def save_to_excel(title, data_list):
    """Save the data to an Excel file.
    :param data_list: list of row dicts
    """
    workbook = xlsxwriter.Workbook(title)
    worksheet = workbook.add_worksheet()
    cols = ['province_name', 'city_name', 'county_name', 'hospital_name', 'hospital_address',
            'alias_name', 'organization_type', 'note']
    for col_index, col in enumerate(cols):
        worksheet.write(1, col_index, col)
    for row_index, data in enumerate(data_list, start=2):
        worksheet.write(row_index, 0, data['province_name'])
        worksheet.write(row_index, 1, data['city_name'])
        worksheet.write(row_index, 2, data['county_name'])
        worksheet.write(row_index, 3, data['hospital_name'])
        worksheet.write(row_index, 4, data['hospital_address'])
        # None values would break the spreadsheet, so write "" instead
        worksheet.write(row_index, 5, data['alias_name'] or "")
        worksheet.write(row_index, 6, data['organization_type'] or "")
        worksheet.write(row_index, 7, data['note'] or "")
    workbook.close()


if __name__ == '__main__':
    # export the CDC data from MongoDB to Excel
    client = MongoClient(host="192.168.1.27", port=27017)
    collection = client[DB_HOSPITAL][TB_HOSPITAL]
    # group on every field to deduplicate, then project the fields back out
    ret = collection.aggregate(
        [{"$group": {
            "_id": {"province_name": "$province_name", "city_name": "$city_name", "county_name": "$county_name",
                    "hospital_name": "$hospital_name", "hospital_address": "$hospital_address",
                    "alias_name": "$alias_name",
                    "organization_type": "$organization_type", "note": "$note"}}},
         {"$project": {"province_name": "$_id.province_name", "city_name": "$_id.city_name",
                       "county_name": "$_id.county_name", "hospital_name": "$_id.hospital_name",
                       "hospital_address": "$_id.hospital_address",
                       "alias_name": "$_id.alias_name", "organization_type": "$_id.organization_type",
                       "note": "$_id.note",
                       "_id": 0}},
         {"$sort": {"province_name": 1, "city_name": 1, "county_name": 1}}  # was "country_name", a typo
         ])
    data_list = []
    for item in ret:
        print(item)
        data_list.append(item)
    save_to_excel(EXCEL_NAME, data_list)
- Import all Excel files into MongoDB
import os
import xlrd
from pymongo import MongoClient

count = 0
# database name in MongoDB
DB_HOSPITAL = "db_hospital"
# collection name in MongoDB
PROVINCE = "heibei"
DISEASE_NAME = "total"
TB_HOSPITAL = "t_hospital_{}_{}".format(PROVINCE, DISEASE_NAME)


def save_data_2_mongo(filename, collection):
    global count
    # xlrd reference: https://www.cnblogs.com/nancyzhu/p/8401552.html
    # xlrd is read-only; open the workbook
    book = xlrd.open_workbook(filename)
    # get a sheet by index
    sheet = book.sheet_by_index(0)
    # or get a sheet by name:
    # sheet = book.sheet_by_name(sheet_name)
    print("number of columns:", sheet.ncols)
    col_names = ['province_name', 'city_name', 'county_name', 'hospital_name',
                 'alias_name', 'organization_type', 'note']
    for row in range(sheet.nrows):
        if row < 2:
            # skip the two header rows
            continue
        count += 1
        item_dict = {}
        for col, col_name in enumerate(col_names[:sheet.ncols]):
            item_dict[col_name] = sheet.cell(row, col).value
        collection.insert_one(item_dict)


if __name__ == '__main__':
    # read every Excel file and save its rows to MongoDB
    client = MongoClient(host="192.168.1.27", port=27017)
    collection = client[DB_HOSPITAL][TB_HOSPITAL]
    directory = r"C:\Users\59404\Desktop\医院数据2019"
    filenames = os.listdir(directory)
    print(filenames)
    for filename in filenames:
        filename = directory + "/" + filename
        save_data_2_mongo(filename, collection)
    print(count)