1. Introduction
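Summarizer provides summary statistics for a vector column of a DataFrame (a column whose values are ml.linalg Vector instances). The available metrics are the column-wise max, min, mean, sum, variance, std, and number of nonzeros, as well as the total count.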
2. Code walkthrough (mean and variance as an example)
package spark2.ml

import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer._
import org.apache.spark.sql.SparkSession

object MLSummary {
  // Silence Spark's INFO logging so that only the results are printed.
  Logger.getLogger("org").setLevel(Level.WARN)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName(this.getClass.getSimpleName)
      .config("spark.driver.maxResultSize", "2G")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Two rows, each holding a feature vector and a per-row weight.
    val data = Seq(
      (Vectors.dense(7.0, 3.0, 4.0), 1.0),
      (Vectors.dense(4.0, 6.0, 7.0), 2.0)
    )
    val df = data.toDF("features", "weight")

    // Weighted mean and variance, requested together through metrics(...).summary(...).
    val (meanVal, varianceVal) = df
      .select(metrics("mean", "variance").summary($"features", $"weight").as("summary"))
      .select("summary.mean", "summary.variance")
      .as[(Vector, Vector)]
      .first()
    println(meanVal)
    println(varianceVal)

    // Unweighted mean and variance, using the single-metric shortcut functions.
    val (meanVal2, varianceVal2) = df
      .select(mean($"features"), variance($"features"))
      .as[(Vector, Vector)]
      .first()
    println(meanVal2)
    println(varianceVal2)

    spark.stop()
  }
}
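To make the two queries easier to check by hand, here is a minimal plain-Scala sketch that recomputes the column-wise statistics for the same two rows without Spark. The object and function names (SummaryByHand, weightedMean, sampleVariance) are hypothetical and not part of any Spark API. The variance helper assumes the usual sample variance (dividing by n - 1) for the unweighted case; that assumption is worth confirming against the Spark documentation before relying on it.

```scala
// Plain-Scala cross-check of the example's statistics; no Spark required.
// All names here are illustrative, not part of the Spark API.
object SummaryByHand {
  // Same rows as the example: (feature vector, weight).
  val rows: Seq[(Array[Double], Double)] = Seq(
    (Array(7.0, 3.0, 4.0), 1.0),
    (Array(4.0, 6.0, 7.0), 2.0)
  )

  // Column-wise weighted mean: sum(w_i * x_i) / sum(w_i).
  def weightedMean(data: Seq[(Array[Double], Double)]): Array[Double] = {
    val totalWeight = data.map(_._2).sum
    val dim = data.head._1.length
    Array.tabulate(dim) { j =>
      data.map { case (x, w) => w * x(j) }.sum / totalWeight
    }
  }

  // Column-wise sample variance (divides by n - 1), ignoring weights.
  def sampleVariance(data: Seq[Array[Double]]): Array[Double] = {
    val n = data.length
    val dim = data.head.length
    Array.tabulate(dim) { j =>
      val m = data.map(_(j)).sum / n
      data.map(x => math.pow(x(j) - m, 2)).sum / (n - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    // Weighted mean with weights 1.0 and 2.0: prints [5.0,5.0,6.0]
    println(weightedMean(rows).mkString("[", ",", "]"))
    // Unweighted mean (all weights set to 1.0): prints [5.5,4.5,5.5]
    println(weightedMean(rows.map { case (x, _) => (x, 1.0) }).mkString("[", ",", "]"))
    // Unweighted sample variance: prints [4.5,4.5,4.5]
    println(sampleVariance(rows.map(_._1)).mkString("[", ",", "]"))
  }
}
```

The weighted mean differs from the unweighted one because the second row counts twice as much, which is the effect of passing the weight column to summary in the first query.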
3. Results and analysis
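The following metrics can be computed:
mean: a vector containing the coefficient-wise mean.
variance: a vector containing the coefficient-wise variance.
count: the total number of vectors seen.
numNonzeros: a vector containing the number of nonzero values for each coefficient.
max: the maximum of each coefficient.
min: the minimum of each coefficient.
normL2: the Euclidean norm of each coefficient.
normL1: the L1 norm of each coefficient (sum of the absolute values).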
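The metrics listed above can be requested in the same way as mean and variance. The following is a minimal sketch, assuming a Spark version in which Summarizer.metrics accepts the names "count", "max", "min" and "normL1" and returns count as a Long; the object name MoreMetrics and the helper summarize are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

// Illustrative only: MoreMetrics and summarize are not part of Spark.
object MoreMetrics {
  // Returns (count, max, min, normL1) for a small unweighted vector column.
  def summarize(spark: SparkSession): (Long, Vector, Vector, Vector) = {
    import spark.implicits._

    // Same feature vectors as the main example, without the weight column.
    val df = Seq(
      Vectors.dense(7.0, 3.0, 4.0),
      Vectors.dense(4.0, 6.0, 7.0)
    ).map(Tuple1.apply).toDF("features")

    // Request several metrics at once, then pull each field out of the struct by name.
    df.select(
        Summarizer.metrics("count", "max", "min", "normL1")
          .summary($"features").as("summary"))
      .select("summary.count", "summary.max", "summary.min", "summary.normL1")
      .as[(Long, Vector, Vector, Vector)]
      .first()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("MoreMetrics")
      .master("local[2]")
      .getOrCreate()

    val (count, maxVal, minVal, l1) = summarize(spark)
    println(s"count = $count")
    println(s"max = $maxVal")
    println(s"min = $minVal")
    println(s"normL1 = $l1")

    spark.stop()
  }
}
```

Requesting the metrics together through metrics(...) lets them share one aggregation over the column, whereas separate single-metric calls each add their own aggregate expression.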