Summary of Key Spark Concepts

Technology, 2022-07-13

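Contents

- Which Spark operators use shuffle?
- Transformation and Action operators
- The relationship between Job, Stage, and Task
- Wide and narrow dependencies
- Stage division
- Partition
- How to resolve data skew in Spark?
- Spark task scheduling
- Spark fault tolerance
  - Fault-tolerance principle
  - Lineage mechanism
  - Checkpoint
- Spark RDD persistence
- Spark vs Hadoop MapReduce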

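Which Spark operators use shuffle?

- Deduplication: distinct
- Aggregation: reduceByKey, groupByKey, aggregateByKey, etc.
- Sorting: sortBy, sortByKey
- Repartitioning: repartition, and coalesce when it is called with shuffle = true (without that flag, coalesce only merges partitions through a narrow dependency)
- Set and join operations: intersection, subtract, join, leftOuterJoin, etc. Note that union only concatenates the partitions of its inputs and does not shuffle.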

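Transformation and Action operators

- Transformation: derives a new RDD from an existing RDD. It is lazy: it only records the operation and does not execute immediately.
- Action: produces a result that is not an RDD. Reaching an action triggers execution right away. Examples: collect, count, first, take, saveAsTextFile.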
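The lazy behaviour described above can be observed with an accumulator. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the names LazyEvalDemo, run, and mapCalls are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyEvalDemo {
  // Returns (map calls seen before the action, sum, map calls seen after the action).
  def run(sc: SparkContext): (Long, Double, Long) = {
    // The accumulator counts how many times the map function actually runs.
    val calls = sc.longAccumulator("mapCalls")

    // Transformation: only the lineage is recorded, the function is not executed yet.
    val doubled = sc.parallelize(1 to 100, 4).map { x => calls.add(1); x * 2 }
    val before = calls.sum

    // Action: submits a job, which runs the map function on every element.
    val total = doubled.sum()
    (before, total, calls.sum)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("lazy-eval-demo").getOrCreate()
    val (before, total, after) = run(spark.sparkContext)
    println(s"map calls before action: $before, sum: $total, map calls after action: $after")
    spark.stop()
  }
}
```

The counter should stay at zero until sum() is called. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this is a way to observe laziness, not a reliable counting technique.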

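The relationship between Job, Stage, and Task

- Job: each action operator encountered in the program submits a job, which executes that action together with the chain of transformations leading up to it.
- Stage: a job consists of one or more stages, divided according to wide and narrow dependencies. Parent and child RDDs linked by a narrow dependency fall into the same stage; a wide dependency puts them in different stages. Stages with no dependency between them can run in parallel; stages that depend on each other must run in sequence.
- Task: a stage is split into multiple tasks that run in parallel; in general, one task processes the data of one partition.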

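Wide and narrow dependencies

- Wide dependency: the parent-to-child relationship is 1-to-n (one to many): one partition of the parent RDD may be used by multiple partitions of the child RDD. Wide dependencies are also called "shuffle dependencies" because they involve a shuffle. Examples: groupByKey, reduceByKey.
- Narrow dependency: the parent-to-child relationship is 1-to-1 or n-to-1: each partition of the parent RDD is used by at most one partition of the child RDD, while one child partition may draw on several parent partitions. Examples: map, filter, union.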
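Whether an operator introduces a wide dependency can be checked by walking the dependency graph of the resulting RDD. The following is a minimal sketch, assuming Spark is on the classpath; DependencyDemo and hasShuffle are hypothetical names, and the sample data is invented.

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DependencyDemo {
  // True if any dependency in the lineage of `rdd` is a wide (shuffle) dependency.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists {
      case _: ShuffleDependency[_, _, _] => true
      case narrow                        => hasShuffle(narrow.rdd)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("dependency-demo").getOrCreate()
    val sc = spark.sparkContext

    val nums  = sc.parallelize(1 to 100, 4)
    val pairs = nums.map(n => (n % 10, n))

    println(s"map:         ${hasShuffle(nums.map(_ + 1))}")
    println(s"union:       ${hasShuffle(nums.union(nums))}")
    println(s"coalesce(2): ${hasShuffle(nums.coalesce(2))}")
    println(s"repartition: ${hasShuffle(nums.repartition(8))}")
    println(s"reduceByKey: ${hasShuffle(pairs.reduceByKey(_ + _))}")
    println(s"join:        ${hasShuffle(pairs.join(pairs))}")

    spark.stop()
  }
}
```

With inputs that have no partitioner, as here, map, union, and coalesce(2) are expected to report false, and repartition, reduceByKey, and join to report true. When an input is already partitioned by the target partitioner, Spark can skip the shuffle for operators such as reduceByKey and join, so the result depends on the input as well as on the operator.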

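Stage division

Why does the DAG use wide dependencies as the boundary for splitting stages?

Tasks within a stage must be able to run in parallel, independently of one another. Parent and child RDDs linked by a wide dependency do not satisfy this condition: the child RDD cannot start computing until all partitions of the parent RDD have been computed, so the DAG is cut at each wide dependency. As a result, all RDDs inside one stage are related by narrow dependencies, meaning a child partition does not have to wait for every parent partition to finish. Computation within a stage is pipelined, and the parallelism of a stage is determined by the number of partitions of the last RDD in that stage.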

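Partition

Spark usually partitions data according to the Partitioner of the RDD. Spark currently ships two Partitioner implementations, HashPartitioner and RangePartitioner. You can also implement a custom Partitioner: extend the abstract class Partitioner and override numPartitions and getPartition(key: Any).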
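A custom Partitioner along the lines described above might look like the following. This is only a sketch, assuming Spark is on the classpath: FirstLetterPartitioner and its routing rule (by the first character of a String key) are invented for illustration.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: String keys that start with the same letter
// (case-insensitive) go to the same partition; other keys fall back to hashCode.
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0, "numPartitions must be positive")

  override def getPartition(key: Any): Int = key match {
    case null                    => 0
    case s: String if s.nonEmpty => java.lang.Math.floorMod(s.charAt(0).toLower.toInt, numPartitions)
    case other                   => java.lang.Math.floorMod(other.hashCode, numPartitions)
  }

  // Equality lets Spark recognise two RDDs as partitioned the same way.
  override def equals(other: Any): Boolean = other match {
    case p: FirstLetterPartitioner => p.numPartitions == numPartitions
    case _                         => false
  }

  override def hashCode: Int = numPartitions
}
```

A pair RDD can then be redistributed with pairs.partitionBy(new FirstLetterPartitioner(4)). Overriding equals and hashCode is not required by the abstract class, but it allows Spark to detect that two RDDs already share a partitioning and avoid an extra shuffle.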

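How to resolve data skew in Spark?

1. The simplest approach: increase the parallelism of the shuffle operation. This only mitigates the skew.
2. Skew caused by an aggregation shuffle: add a random prefix to each key, run a first (partial) aggregation, strip the prefix, then run a second, global aggregation.
3. Skew caused by a join shuffle: replace the reduce-side join with a map-side join, that is, broadcast the small dataset and perform the join inside a map operator. This avoids the shuffle entirely and applies when one of the RDDs is small.
4. Both RDDs are large: sample the skewed keys and split the join. Extract the records for the skewed keys from the skewed RDD and add random prefixes to them. Take the records with the same keys from the other RDD and combine each of them with every random prefix to form a new RDD (this expands that RDD). Join the two and strip the prefixes. Then join the remaining data, which contains no skewed keys, in the usual way. This applies when there are only a few skewed keys.
5. Both RDDs are large and there are many skewed keys: use a scheme similar to approach 4. Because there are too many skewed keys to split out and handle separately, the whole RDD has to be expanded: the larger table gets one of N random prefixes on each record, and the smaller table is replicated N times. The drawback is a very high demand on memory.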
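The random-prefix technique for aggregation skew described above can be sketched as a two-stage reduceByKey. This is a minimal illustration, assuming Spark is on the classpath, (String, Long) pairs, and a sum as the aggregation; SaltedAggregation, saltedSum, and the salts parameter are hypothetical names.

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

object SaltedAggregation {
  // Sums values per key in two stages so that a hot key is spread over `salts` sub-keys.
  def saltedSum(pairs: RDD[(String, Long)], salts: Int): RDD[(String, Long)] = {
    require(salts > 0, "salts must be positive")
    pairs
      // Stage 1: prepend a random numeric prefix, e.g. "hot" becomes "3_hot".
      .map { case (key, value) => (s"${Random.nextInt(salts)}_$key", value) }
      // Partial aggregation on the salted keys.
      .reduceByKey(_ + _)
      // Strip the prefix: everything after the first underscore is the original key.
      .map { case (salted, value) => (salted.substring(salted.indexOf('_') + 1), value) }
      // Stage 2: global aggregation on the original keys.
      .reduceByKey(_ + _)
  }
}
```

The second reduceByKey sees at most `salts` records per key from the first stage, so the hot key no longer concentrates all of its raw records in a single task. This only works for aggregations that can be combined in two steps, such as sums, counts, or maxima.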

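Spark task scheduling

- Driver: runs the main function of the user-submitted Application, creates the SparkContext object, builds a DAG (directed acyclic graph) from the dependencies between RDDs, and creates the DAGScheduler, TaskScheduler, and SchedulerBackend objects.
- DAGScheduler: responsible for stage-level scheduling. It splits the DAG into stages and hands each stage to the TaskScheduler packaged as a TaskSet.
- TaskScheduler: responsible for task-level scheduling. It sends the TaskSets received from the DAGScheduler to Executors according to the configured scheduling policy.
- SchedulerBackend: has multiple implementations, each interfacing with a different resource manager.
- HeartbeatReceiver: receives heartbeat messages from Executors and monitors whether they are alive.
- Task: the status of tasks running in the Executor thread pool is reported back to the TaskScheduler. When a task fails, the TaskScheduler retries it by sending it to an Executor again, by default up to 3 retries (spark.task.maxFailures = 4 attempts in total). If the task still fails, its TaskSet is aborted and the job fails. Stage-level retries are handled by the DAGScheduler: when a stage fails because shuffle output cannot be fetched, the DAGScheduler resubmits the TaskSet to the TaskScheduler, by default allowing up to 4 consecutive attempts per stage (spark.stage.maxConsecutiveAttempts). If the stage still fails after that, the job fails, and an unhandled job failure in turn fails the Application.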
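The task and stage retry limits mentioned above are controlled by configuration properties. The following sketch sets both to the values documented as defaults in recent Spark releases, assuming Spark is on the classpath; RetrySettings and the application name are hypothetical.

```scala
import org.apache.spark.SparkConf

object RetrySettings {
  // Builds a SparkConf that spells out the retry-related settings explicitly.
  def conf: SparkConf = new SparkConf()
    .setAppName("retry-settings-demo")
    // Failures of one task tolerated before giving up: 4 means 1 first attempt + 3 retries.
    .set("spark.task.maxFailures", "4")
    // Consecutive attempts allowed for a stage before it is aborted.
    .set("spark.stage.maxConsecutiveAttempts", "4")
}
```

The same properties can be passed on the command line with spark-submit --conf. In plain local mode Spark does not retry failed tasks unless the master is given as local[N, F], where F is the maximum number of task failures, so these limits mainly matter on a cluster.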

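Spark fault tolerance

Fault-tolerance principle

- Narrow dependency: if a partition of the child RDD is lost, only the corresponding parent partitions need to be recomputed. All of the recomputed data belongs to the lost child partition, so there is no redundant computation.
- Wide dependency: if a partition of the child RDD is lost, every partition of each parent RDD has to be recomputed in full. Not all of that data is destined for the lost partition; part of it belongs to child partitions that were not lost. This causes redundant computation, which is why recovery is more expensive for wide dependencies.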

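Lineage mechanism

An RDD records how it was derived from other RDDs through its lineage. Other systems rely on fine-grained backups or logs at the level of individual in-memory data updates. An RDD's lineage instead records coarse-grained transformations (filter, map, join, etc.), because logging updates at a fine granularity would be expensive. Lineage is essentially similar to a redo log in a database, only with a much coarser granularity.

By default, Spark uses the "logging the updates" approach: it tracks all the transformations that produced an RDD, that is, each RDD's lineage, and uses this record to recompute lost partitions.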

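Checkpoint

A checkpoint saves important intermediate data in the DAG to a highly available location (usually HDFS).

Why checkpoint? As the RDD dependency chain recorded in the lineage grows over time, the cost of fault recovery becomes too high. Checkpointing at an intermediate point cuts the dependency chain: if a node later fails and partitions are lost, recovery replays the lineage starting from the checkpointed RDD, which greatly reduces the overhead.

Differences between Checkpoint and Cache

- Cache stores the RDD in memory, but the RDD's dependency chain must be kept: if an executor goes down, the RDD partitions cached on it are lost and have to be recomputed through the dependency chain.
- Checkpoint saves the RDD to HDFS, which provides reliable multi-replica storage. Once the checkpoint has completed successfully, all dependencies on the preceding RDDs are dropped, that is, the dependency chain is cut.

Note: checkpoint does not save the RDD directly. It runs a separate job that recomputes the RDD and then saves it. It is therefore best to cache the RDD before checkpointing, so that the checkpoint can save the cached data instead of computing everything again from scratch, which can improve performance considerably.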

See the source code and its comments for details:

/**
 * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
 * directory set with `SparkContext#setCheckpointDir` and all references to its parent
 * RDDs will be removed. This function must be called before any job has been
 * executed on this RDD. It is strongly recommended that this RDD is persisted in
 * memory, otherwise saving it on a file will require recomputation.
 */
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}

How to perform a checkpoint? Example:

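// Assumes `sc` is an existing SparkContext, as in spark-shell.
// First set the checkpoint directory through the SparkContext; this creates the directory on HDFS
sc.setCheckpointDir("hdfs://hdfsPath/checkpointDir")
// Persist the RDD in memory; otherwise saving it to a file triggers a new job that recomputes it
val rdd = sc.parallelize(1 to 1000, 10).cache()
// Call checkpoint to mark this RDD for checkpointing
rdd.checkpoint()
// The RDD is only written to the HDFS directory once an action triggers a job
println(rdd.sum)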

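Make sure an action is executed after rdd.checkpoint(): every action calls sc.runJob(), and runJob calls rdd.doCheckpoint() as its last step. Reading the comments in the source code helps in understanding the points made earlier.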

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 * partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}

/**
 * Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
 * has completed (therefore the RDD has been materialized and potentially stored in memory).
 * doCheckpoint() is called recursively on the parent RDDs.
 */
private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) {
        if (checkpointAllMarkedAncestors) {
          // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
          // them in parallel.
          // Checkpoint parents first because our lineage will be truncated after we
          // checkpoint ourselves
          dependencies.foreach(_.rdd.doCheckpoint())
        }
        checkpointData.get.checkpoint()
      } else {
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}

Spark RDD persistence

…

Spark vs Hadoop MapReduce

…

* Continuously updated…
* If you spot any errors, please point them out. Thank you!
