MongoDB聚合操作网站首页 学无止境

MongoDB聚合操作

一只咸鱼。。 2023-07-02 20:00:03

简介MongoDB聚合操作

文章目录

一、单一聚合
二、聚合管道
三、MapReduce

提示：以下是本篇文章正文内容，MongoDB 系列学习将会持续更新

一、单一聚合

单一作用聚合：提供了对常见聚合过程的简单访问，操作都从单个集合聚合文档。

函数	描述
db.collection.estimatedDocumentCount()	忽略查询条件，返回集合或视图中所有文档的计数
db.collection.count()	返回与find()集合或视图的查询匹配的文档计数，等同于 db.collection.find(query).count()
db.collection.distinct()	在单个集合或视图中查找指定字段的不同值，并在数组中返回结果

// 检索book集合中所有文档的计数, 效率高
db.book.estimatedDocumentCount()
// 结果同上，忽略查询条件
db.book.estimatedDocumentCount({type:"travel"})

// 统计所有文档数量
db.book.count()
// 统计符合条件的文档数量
db.book.count({favCount:{$gt:50}})

// 返回不同type的数组
db.book.distinct("type")
// 返回收藏数大于90的文档不同type的数组
db.book.distinct("type",{favCount:{$gt:90}})

注意：在分片群集上，如果存在孤立文档或正在进行块迁移，则 db.collection.count() 没有查询谓词可能导致计数不准确。要避免这些情况，请在分片群集上使用 db.collection.aggregate() 方法。

回到目录…

二、聚合管道

官方文档：Aggregation Pipeline Stages — MongoDB Manual

pipeline = [$stage1, $stage2, ...$stageN];
db.collection.aggregate(pipeline, {options})

pipelines 一组数据聚合阶段。除$out、$Merge和$geonear阶段之外，每个阶段都可以在管道中出现多次。
options 可选，聚合操作的其他参数。包含：查询计划、是否使用临时文件、游标、最大操作时间、读写策略、强制索引等等

在这里插入图片描述

数据准备：

var tags = ["nosql","mongodb","document","developer","popular"];
var types = ["technology","sociality","travel","novel","literature"];
var books=[];
for(var i=0;i<50;i++){
	var typeIdx = Math.floor(Math.random()*types.length);
	var tagIdx = Math.floor(Math.random()*tags.length);
	var tagIdx2 = Math.floor(Math.random()*tags.length);
	var favCount = Math.floor(Math.random()*100);
	var username = "xx00"+Math.floor(Math.random()*10);
	var age = 20 + Math.floor(Math.random()*15);
	var book = {
		_id: i+1,
		title: "book-"+i,
		type: types[typeIdx],
		tag: [tags[tagIdx],tags[tagIdx2]],
		favCount: favCount,
		author: {name:username,age:age}
	};
	books.push(book)
}
db.book2.insertMany(books);

2.1 $match / $project / $count

$match 条件筛选

$match 用于对文档进行筛选，之后可以在得到的文档子集上做聚合。(一般放在最前面，也可以根据需求放到后面做筛选)。

db.book2.aggregate([{$match:{type:"novel"}}])

$project 字段投影

//投影操作， 将原始字段投影成指定名称， 如将集合中的 title 投影成 name
db.book2.aggregate([{$project:{name:"$title"}}])
//$project 可以灵活控制输出文档的格式，也可以剔除不需要的字段
db.book2.aggregate([{$project:{_id:0,name:"$title",type:1}}])

$count 计数

> db.book2.aggregate([{$count:"all_count"}])
{ "all_count" : 50 }

// $match筛选出type匹配的文档；$count返回聚合管道中过滤后的文档计数，并将值分配给novel_count
> db.book2.aggregate([{$match:{type:"novel"}},{$count:"novel_count"}])
{ "novel_count" : 14 }

回到目录…

2.2 $limit / $skip / $sort

$limit / $skip

db.book2.find().limit(5)
db.book2.aggregate([{$limit:5}])

db.book2.find().skip(5)
db.book2.aggregate([{$skip:5}])

db.book2.find().skip(5).limit(5)
db.book2.aggregate([{$skip:5},{$limit:5}])

$sort

db.book2.find().sort({favCount:1})
db.book2.aggregate([{$sort:{favCount:1}}])

回到目录…

2.3 $group 分组查询

按指定的表达式对文档进行分组，并将每个不同分组的文档输出到下一个阶段。

{$group: {_id: <expression>, <field1>: {<accumulator1>: <expression1>}, ...} }

_id 表示分组条件 (必填)，指定为 null 时以整个文档为一组；剩余的计算字段是可选的，并使用运算符进行计算。
$group 阶段的内存限制为 100M。默认情况下，如果stage超过此限制会产生错误。但是，要允许处理大型数据集，请将 allowDiskUse 选项设置为 true 以启用 $group 操作以写入临时文件。

accumulator 操作符：

名称	描述	类比 SQL
$avg	计算均值。	avg
$first	返回每组第一个文档，如果有排序，按照排序，如果没有按照默认的存储的顺序的第一个文档。	limit 0,1
$last	返回每组最后一个文档，如果有排序，按照排序，如果没有按照默认的存储的顺序的最后个文档。	-
$max	根据分组，获取集合中所有文档对应值得最大值。	max
$min	根据分组，获取集合中所有文档对应值得最小值。	min
$push	将指定的表达式的值添加到一个数组中。	-
$addToSet	将表达式的值添加到一个集合中 (无重复值，无序)。	-
$sum	计算总和	sum
$stdDevPop	返回输入值的总体标准偏差（population standard deviation）	-
$stdDevSamp	返回输入值的样本标准偏差（the sample standard deviation）	-

示例一：统计文档中的书籍总量、收藏总数、平均收藏量

db.book2.aggregate([
	{$group:{_id:null,bookCount:{$sum:1},favCount:{$sum:"$favCount"},favAvg:{$avg:"$favCount"}}}
])
{ "_id" : null, "bookCount" : 50, "favCount" : 2392, "favAvg" : 47.84 }

示例二：统计每个作者的book收藏总数

db.book2.aggregate([
	{$group:{_id:"$author.name",favCount:{$sum:"$favCount"}}}
])
{ "_id" : "xx007", "favCount" : 259 }
{ "_id" : "xx009", "favCount" : 160 }
{ "_id" : "xx005", "favCount" : 229 }
{ "_id" : "xx008", "favCount" : 302 }
{ "_id" : "xx001", "favCount" : 300 }
{ "_id" : "xx002", "favCount" : 311 }
{ "_id" : "xx006", "favCount" : 174 }
{ "_id" : "xx003", "favCount" : 160 }
{ "_id" : "xx004", "favCount" : 373 }
{ "_id" : "xx000", "favCount" : 124 }

示例三：统计每个作者的每本book的收藏数

db.book2.aggregate([
	{$group:{_id:{author:"$author.name",title:"$title"},favCount:{$sum:"$favCount"}}}
])
{ "_id" : { "author" : "xx006", "title" : "book-42" }, "favCount" : 77 }
{ "_id" : { "author" : "xx002", "title" : "book-33" }, "favCount" : 70 }
{ "_id" : { "author" : "xx002", "title" : "book-3" }, "favCount" : 44 }
{ "_id" : { "author" : "xx007", "title" : "book-34" }, "favCount" : 50 }
{ "_id" : { "author" : "xx004", "title" : "book-46" }, "favCount" : 84 }
{ "_id" : { "author" : "xx008", "title" : "book-4" }, "favCount" : 64 }
{ "_id" : { "author" : "xx009", "title" : "book-8" }, "favCount" : 45 }
Type "it" for more

示例四：每个作者的book的type合集

db.book2.aggregate([
	{$group:{_id:"$author.name",types:{$addToSet:"$type"}}}
])
{ "_id" : "xx007", "types" : [ "sociality", "travel", "literature", "novel", "technology" ] }
{ "_id" : "xx009", "types" : [ "literature", "novel" ] }
{ "_id" : "xx005", "types" : [ "travel", "novel", "literature", "sociality" ] }
{ "_id" : "xx008", "types" : [ "literature", "sociality", "technology" ] }
{ "_id" : "xx001", "types" : [ "novel", "sociality", "technology" ] }
{ "_id" : "xx002", "types" : [ "literature", "novel", "sociality", "technology" ] }
{ "_id" : "xx006", "types" : [ "novel" ] }
{ "_id" : "xx003", "types" : [ "literature", "novel", "technology" ] }
{ "_id" : "xx004", "types" : [ "sociality", "travel", "literature", "novel" ] }
{ "_id" : "xx000", "types" : [ "literature", "sociality" ] }

示例五：统计每个分类的book文档数量

db.book2.aggregate([
	{$group:{_id:"$type",bookCount:{$sum:1}}},
	{$sort:{bookCount:1}}
])
{ "_id" : "travel", "bookCount" : 4 }
{ "_id" : "technology", "bookCount" : 9 }
{ "_id" : "literature", "bookCount" : 11 }
{ "_id" : "sociality", "bookCount" : 12 }
{ "_id" : "novel", "bookCount" : 14 }

回到目录…

2.4 $unwind 展开数组

可以将数组拆分为单独的文档。

> db.book2.aggregate([{$unwind:"$tag"}])
{ "_id" : 1, "title" : "book-0", "type" : "literature", "tag" : "developer", "favCount" : 76, "author" : { "name" : "xx003", "age" : 30 } }
{ "_id" : 1, "title" : "book-0", "type" : "literature", "tag" : "document", "favCount" : 76, "author" : { "name" : "xx003", "age" : 30 } }
{ "_id" : 2, "title" : "book-1", "type" : "sociality", "tag" : "document", "favCount" : 41, "author" : { "name" : "xx004", "age" : 27 } }
{ "_id" : 2, "title" : "book-1", "type" : "sociality", "tag" : "document", "favCount" : 41, "author" : { "name" : "xx004", "age" : 27 } }
{ "_id" : 3, "title" : "book-2", "type" : "novel", "tag" : "nosql", "favCount" : 17, "author" : { "name" : "xx009", "age" : 30 } }
{ "_id" : 3, "title" : "book-2", "type" : "novel", "tag" : "mongodb", "favCount" : 17, "author" : { "name" : "xx009", "age" : 30 } }
{ "_id" : 4, "title" : "book-3", "type" : "literature", "tag" : "popular", "favCount" : 44, "author" : { "name" : "xx002", "age" : 29 } }
{ "_id" : 4, "title" : "book-3", "type" : "literature", "tag" : "popular", "favCount" : 44, "author" : { "name" : "xx002", "age" : 29 } }
Type "it" for more

示例一：姓名为xx006的作者的book的tag数组拆分为多个文档

db.book2.aggregate([
	{$match:{"author.name":"xx006"}},
	{$unwind:"$tag"}
])
{ "_id" : 16, "title" : "book-15", "type" : "novel", "tag" : "popular", "favCount" : 97, "author" : { "name" : "xx006", "age" : 30 } }
{ "_id" : 16, "title" : "book-15", "type" : "novel", "tag" : "popular", "favCount" : 97, "author" : { "name" : "xx006", "age" : 30 } }
{ "_id" : 43, "title" : "book-42", "type" : "novel", "tag" : "developer", "favCount" : 77, "author" : { "name" : "xx006", "age" : 28 } }
{ "_id" : 43, "title" : "book-42", "type" : "novel", "tag" : "popular", "favCount" : 77, "author" : { "name" : "xx006", "age" : 28 } }

示例二：每个作者的book的tag合集

db.book2.aggregate([
	{$unwind:"$tag"},
	{$group:{_id:"$author.name",tags:{$addToSet:"$tag"}}}
])
{ "_id" : "xx003", "tags" : [ "document", "mongodb", "developer", "nosql" ] }
{ "_id" : "xx004", "tags" : [ "nosql", "developer", "popular", "mongodb", "document" ] }
{ "_id" : "xx007", "tags" : [ "document", "mongodb", "popular", "nosql", "developer" ] }
{ "_id" : "xx008", "tags" : [ "document", "mongodb", "popular", "developer", "nosql" ] }
{ "_id" : "xx006", "tags" : [ "popular", "developer" ] }
{ "_id" : "xx009", "tags" : [ "nosql", "developer", "popular", "mongodb", "document" ] }
{ "_id" : "xx005", "tags" : [ "nosql", "popular", "mongodb", "document" ] }
{ "_id" : "xx001", "tags" : [ "document", "mongodb", "popular", "nosql", "developer" ] }
{ "_id" : "xx002", "tags" : [ "document", "popular", "developer", "nosql" ] }
{ "_id" : "xx000", "tags" : [ "document", "developer", "nosql" ] }

示例三：统计tag标签的热度排行 (根据favCount总量排行即为热度)

db.book2.aggregate([
	{$unwind:"$tag"},
	{$group:{_id:"$tag",favCount:{$sum:"$favCount"}}},
	{$sort:{favCount:-1}}
])
{ "_id" : "popular", "favCount" : 1421 }
{ "_id" : "document", "favCount" : 977 }
{ "_id" : "developer", "favCount" : 893 }
{ "_id" : "nosql", "favCount" : 850 }
{ "_id" : "mongodb", "favCount" : 643 }

回到目录…

2.5 $lookup 左外连接

Mongodb3.2 新增，主要用来实现多表关联查询。每个输入待处理的文档，经过 $lookup 阶段的处理，输出的新文档中会包含一个新生成的数组。数组列存放的数据是来自被 Join 集合的适配文档 (如果没有即为[ ])。

db.collection.aggregate([{
	$lookup: {
		from: "<等待被Join的集合>",
		localField: "<源集合中的match值>",
		foreignField: "<待Join的集合的match值>",
		as: "<输出文档的新增值命名>"
	}
})

数据准备：

db.customer.insert({_id:1,name:"customer1",phone:"13112345678",address:"test1"})
db.customer.insert({_id:2,name:"customer2",phone:"13112345679",address:"test2"})
db.order.insert({_id:1,orderCode:"order001",customerId:1,price:200})
db.order.insert({_id:2,orderCode:"order002",customerId:2,price:400})
db.orderItem.insert({_id:1,productName:"apples",qutity:2,orderId:1})
db.orderItem.insert({_id:2,productName:"oranges",qutity:2,orderId:1})
db.orderItem.insert({_id:3,productName:"mangoes",qutity:2,orderId:1})
db.orderItem.insert({_id:4,productName:"apples",qutity:2,orderId:2})
db.orderItem.insert({_id:5,productName:"oranges",qutity:2,orderId:2})
db.orderItem.insert({_id:6,productName:"mangoes",qutity:2,orderId:2})

示例一：联表查询用户以及对应的订单信息

db.customer.aggregate([
	{$lookup: {
		from: "order",
		localField: "_id",
		foreignField: "customerId",
		as: "customerOrder"
		}
	}
]).pretty()
// 结果如下:
{
	"_id" : 1,
	"name" : "customer1",
	"phone" : "13112345678",
	"address" : "test1",
	"customerOrder" : [
		{
	        "_id" : 1,
            "orderCode" : "order001",
            "customerId" : 1,
            "price" : 200
        }
    ]
}
{
	"_id" : 2,
	"name" : "customer2",
	"phone" : "13112345679",
	"address" : "test2",
	"customerOrder" : [
		{
			"_id" : 2,
			"orderCode" : "order002",
			"customerId" : 2,
			"price" : 400
		}
	]
}

示例二：联表查询订单信息以及对应的商品项

db.order.aggregate([
	{$lookup: {
		from: "orderItem",
		localField: "_id",
		foreignField: "orderId",
		as: "orderItem"
		}
	}
]).pretty()
// 结果如下:
{
	"_id" : 1,
	"orderCode" : "order001",
	"customerId" : 1,
	"price" : 200,
	"orderItem" : [
		{
			"_id" : 1,
			"productName" : "apples",
			"qutity" : 2,
			"orderId" : 1
		},
		{
			"_id" : 2,
			"productName" : "oranges",
			"qutity" : 2,
			"orderId" : 1
		},
		{
			"_id" : 3,
			"productName" : "mangoes",
			"qutity" : 2,
			"orderId" : 1
		}
	]
}
{
	"_id" : 2,
	"orderCode" : "order002",
	"customerId" : 2,
	"price" : 400,
	"orderItem" : [
		{
			"_id" : 4,
			"productName" : "apples",
			"qutity" : 2,
			"orderId" : 2
		},
		{
			"_id" : 5,
			"productName" : "oranges",
			"qutity" : 2,
			"orderId" : 2
		},
		{
			"_id" : 6,
			"productName" : "mangoes",
			"qutity" : 2,
			"orderId" : 2
		}
	]
}

回到目录…

2.6 $bucket 存储桶

将传入文档分类为组 (称为存储桶)，指定表达式和存储桶边界并输出文档到每个存储桶，结果至少包含一个输入文档。

$bucket 阶段的 RAM 限制为 100 MB。如果阶段超过此限制，会返回一个错误。

{
  $bucket: {
      groupBy: <expression>, //指定字段,用来对文档进行分组
      boundaries: [ <lowerbound1>, <lowerbound2>, ... ], //_id值，指定每个存储桶的边界
      default: <literal>, //可选的,指定边界外的_id值
      output: { //可选的, 除_id字段外,指定输出文档中要包含的字段
         <output1>: { <$accumulator expression> },
         ...
         <outputN>: { <$accumulator expression> }
      }
   }
}

示例：统计book收藏数在[0,10),[10,60),[60,80),[80,100),[100,+∞）区间内的文档数

db.book2.aggregate([{
	$bucket:{
		groupBy:"$favCount",
		boundaries:[0,10,60,80,100],
		default:"other",
		output:{"bookCount":{$sum:1}}
	}
}])
{ "_id" : 0, "bookCount" : 5 }
{ "_id" : 10, "bookCount" : 28 }
{ "_id" : 60, "bookCount" : 10 }
{ "_id" : 80, "bookCount" : 7 }

回到目录…

三、MapReduce

MapReduce 操作将大量的数据处理工作拆分成多个线程并行处理，然后将结果合并在一起。MongoDB 提供的 Map-Reduce 非常灵活，对于大规模数据分析也相当实用。

MapReduce 具有两个阶段：

将具有相同 Key 的文档数据整合在一起的 map 阶段
组合 map 操作的结果进行统计输出的 reduce 阶段

db.collection.mapReduce(
	function() {emit(key,value);}, //map函数,将数据拆分成键值对，交给reduce函数
	function(key,values) {return reduceFunction}, //reduce函数,根据键将值做统计运算
	{
		out: <collection>, //可选,将结果汇入指定表
		query: <document>, //可选,筛选数据的条件，筛选的数据送入map
		sort: <document>, //排序完后，送入map
		limit: <number>, //限制送入map的文档数
		finalize: <function>, //可选,修改reduce的结果后进行输出
		scope: <document>, //可选,指定map、reduce、finalize的全局变量
		jsMode: <boolean>, //可选，默认false。在mapreduce过程中是否将数据转换成bson格式
		verbose: <boolean>, //可选，是否在结果中显示时间，默认false
		bypassDocumentValidation: <boolean> //可选，是否略过数据校验
	}
)

在这里插入图片描述

示例：统计type为travel的不同作者的book文档收藏数

db.book2.mapReduce(
	function(){emit(this.author.name,this.favCount)},
	function(key,values){return Array.sum(values)},
	{
		query:{type:"travel"},
		out: "books_favCount"
	}
)
db.books_favCount.find()
{ "_id" : "xx007", "value" : 57 }
{ "_id" : "xx005", "value" : 95 }
{ "_id" : "xx004", "value" : 58 }

我们使用聚合管道也能实现， MongoDB5.0开始，map-reduce已被弃用。

db.book2.aggregate([
	{$match:{type:"travel"}},
	{$group:{_id:"$author.name",favCount:{$sum:"$favCount"}}}
])
{ "_id" : "xx007", "favCount" : 57 }
{ "_id" : "xx005", "favCount" : 95 }
{ "_id" : "xx004", "favCount" : 58 }