The DAG is split into stages at wide dependencies. Why?
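The tasks within a stage must be able to run in parallel, and a parent and child RDD linked by a wide dependency cannot: the child RDD has to wait until every partition of the parent has been computed before it can start. The DAG is therefore cut at each wide dependency. As a result, all RDDs inside a stage are related by narrow dependencies, so a child partition can be computed as soon as the parent partitions it depends on are ready, without waiting for the whole parent RDD. Computation inside a stage is pipelined, and the stage's parallelism is determined by the number of partitions of the last RDD in that stage.
An RDD records how it was derived from other RDDs through its lineage. Other systems rely on fine-grained backups or logs of in-memory data updates. An RDD's lineage instead records coarse-grained transformations (filter, map, join, and so on), because logging updates at a fine granularity would be expensive. Lineage is essentially similar to a database redo log, only at a much coarser granularity.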
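The stage-cutting rule described above (start a new stage at every wide dependency, and take the parallelism from the last RDD in the stage) can be sketched with a toy model. The types `Dep`, `Node`, and the functions `cutStages` and `parallelism` are hypothetical names invented for this illustration; this is not how Spark's DAGScheduler is implemented, and it only handles a linear chain of RDDs.

```scala
// Toy model only: these types are hypothetical and are not Spark classes.
object StageCutSketch {
  sealed trait Dep
  case object Narrow extends Dep // e.g. map, filter
  case object Wide extends Dep   // e.g. reduceByKey, groupByKey (needs a shuffle)

  // One step in a linear chain of RDDs: its name, how it depends on the
  // previous step, and how many partitions it has.
  // The first node has no parent; its dep value is ignored.
  final case class Node(name: String, dep: Dep, partitions: Int)

  // Walk the chain in order and start a new stage at every wide dependency.
  def cutStages(chain: List[Node]): List[List[Node]] =
    chain.foldLeft(List.empty[List[Node]]) {
      case (Nil, node)                        => List(List(node))
      case (stages, node) if node.dep == Wide => stages :+ List(node)
      case (stages, node)                     => stages.init :+ (stages.last :+ node)
    }

  // Parallelism of a stage = partition count of the last RDD in that stage.
  def parallelism(stage: List[Node]): Int = stage.last.partitions

  def main(args: Array[String]): Unit = {
    val chain = List(
      Node("textFile", Narrow, 8),
      Node("map", Narrow, 8),
      Node("reduceByKey", Wide, 4),
      Node("filter", Narrow, 4)
    )
    cutStages(chain).zipWithIndex.foreach { case (stage, i) =>
      println(s"stage $i: ${stage.map(_.name).mkString(" -> ")} (tasks = ${parallelism(stage)})")
    }
  }
}
```

In this sketch the chain is cut once, in front of `reduceByKey`, so `textFile` and `map` are pipelined in one stage and `reduceByKey` and `filter` in the next.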
By default Spark takes the "logging the updates" approach: it tracks every transformation that produced an RDD, that is, each RDD's lineage, and uses that record to recompute lost partitions.
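To make the idea of recomputing a lost partition from lineage concrete, here is a minimal sketch in plain Scala. `Dataset`, `Source`, `Mapped`, and `Filtered` are hypothetical types made up for this example and are not Spark's RDD classes; the point is only that each node stores a coarse-grained transformation plus a reference to its parent, which is enough to rebuild any single partition.

```scala
// Toy model only: hypothetical types, not Spark's RDD implementation.
object LineageSketch {
  sealed trait Dataset {
    // Rebuild one partition by replaying the lineage from the source.
    def compute(partition: Int): Vector[Int]
    // Human-readable lineage, oldest step first.
    def lineage: List[String]
  }

  // Stable input data, already split into partitions.
  final case class Source(partitions: Vector[Vector[Int]]) extends Dataset {
    def compute(partition: Int): Vector[Int] = partitions(partition)
    def lineage: List[String] = List("source")
  }

  // A coarse-grained transformation: one function applied to a whole partition.
  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    def compute(partition: Int): Vector[Int] = parent.compute(partition).map(f)
    def lineage: List[String] = parent.lineage :+ "map"
  }

  final case class Filtered(parent: Dataset, p: Int => Boolean) extends Dataset {
    def compute(partition: Int): Vector[Int] = parent.compute(partition).filter(p)
    def lineage: List[String] = parent.lineage :+ "filter"
  }

  def main(args: Array[String]): Unit = {
    val source = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    val result = Filtered(Mapped(source, _ * 10), _ > 20)
    println(result.lineage.mkString(" -> "))
    // Pretend partition 1 was lost: only that partition is replayed.
    println(result.compute(1))
  }
}
```

Nothing about individual elements is logged here. Only the two transformations are recorded, which is what keeps the lineage cheap to maintain.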
A checkpoint saves important intermediate data in the DAG to highly available storage (usually HDFS).
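Why checkpoint? As the dependency chain recorded by lineage grows longer over time, fault recovery becomes too expensive. Checkpointing at an intermediate point cuts the chain: if a node later fails and partitions are lost, recovery replays the lineage starting from the checkpointed RDD rather than from the original source, which greatly reduces the cost.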
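The cost argument above can be illustrated by counting how many transformations would have to be replayed to rebuild a lost partition. This is a toy model under simplifying assumptions: `Node`, `Transformed`, `Checkpointed`, and `extend` are hypothetical names, every transformation is assumed to cost one replay step, and reading checkpointed data back from storage is counted as zero steps.

```scala
// Toy model only: hypothetical types that count replay steps, not Spark code.
object CheckpointCostSketch {
  sealed trait Node {
    // Number of transformations to replay in order to rebuild a lost partition.
    def replaySteps: Int
  }

  // Original input: nothing to replay.
  case object Source extends Node { val replaySteps: Int = 0 }

  // Data read back from reliable storage: the parents are forgotten,
  // so nothing before this point is replayed.
  case object Checkpointed extends Node { val replaySteps: Int = 0 }

  // One more transformation on top of a parent.
  final case class Transformed(parent: Node) extends Node {
    val replaySteps: Int = parent.replaySteps + 1
  }

  // Apply `steps` transformations on top of `start`.
  def extend(start: Node, steps: Int): Node =
    (1 to steps).foldLeft(start)((node, _) => Transformed(node))

  def main(args: Array[String]): Unit = {
    // 100 transformations with no checkpoint.
    val noCheckpoint = extend(Source, 100)
    // Same pipeline, but a checkpoint was taken after step 90.
    val withCheckpoint = extend(Checkpointed, 10)
    println(s"replay steps without checkpoint: ${noCheckpoint.replaySteps}")
    println(s"replay steps with checkpoint:    ${withCheckpoint.replaySteps}")
  }
}
```

In this model the recovery cost drops from 100 steps to 10, because everything before the checkpoint no longer needs to be replayed.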
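Differences between Checkpoint and Cache
Cache keeps the RDD in memory, but the RDD's dependency chain cannot be discarded: if an executor goes down, the partitions cached on it are lost and have to be recomputed through the dependency chain.
Checkpoint saves the RDD to HDFS, where it is stored reliably with multiple replicas. Once the checkpoint has completed successfully, all of the RDD's earlier dependencies are dropped, that is, the dependency chain is cut.
Note: checkpoint does not simply save the RDD as it is being computed. A separate job recomputes the RDD and only then saves it. It is therefore best to cache the RDD before checkpointing, so that the checkpoint job can write out the cached data directly instead of recomputing it from the beginning, which can improve performance considerably.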
See the source code and its comments:
    /**
     * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
     * directory set with `SparkContext#setCheckpointDir` and all references to its parent
     * RDDs will be removed. This function must be called before any job has been
     * executed on this RDD. It is strongly recommended that this RDD is persisted in
     * memory, otherwise saving it on a file will require recomputation.
     */
    def checkpoint(): Unit = RDDCheckpointData.synchronized {
      // NOTE: we use a global lock here due to complexities downstream with ensuring
      // children RDD partitions point to the correct parent partitions. In the future
      // we should revisit this consideration.
      if (context.checkpointDir.isEmpty) {
        throw new SparkException("Checkpoint directory has not been set in the SparkContext")
      } else if (checkpointData.isEmpty) {
        checkpointData = Some(new ReliableRDDCheckpointData(this))
      }
    }

How do you checkpoint in practice? An example:
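    // First set the checkpoint directory through the SparkContext; the directory is created on HDFS
    sc.setCheckpointDir("hdfs://hdfsPath/checkpointDir")
    // Cache the RDD in memory; otherwise the checkpoint job has to recompute it before writing it out
    val rdd = sc.parallelize(1 to 1000, 10).cache()
    // Call checkpoint to mark this RDD for checkpointing
    rdd.checkpoint()
    // Nothing is written until an action triggers a job; only then is the RDD saved to the HDFS directory
    println(rdd.sum())

Make sure an action runs after rdd.checkpoint(): every action calls sc.runJob(), and runJob calls rdd.doCheckpoint() as its last step. Reading the comments in the source helps in understanding the points made earlier.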
    /**
     * Run a function on a given set of partitions in an RDD and pass the results to the given
     * handler function. This is the main entry point for all actions in Spark.
     *
     * @param rdd target RDD to run tasks on
     * @param func a function to run on each partition of the RDD
     * @param partitions set of partitions to run on; some jobs may not want to compute on all
     * partitions of the target RDD, e.g. for operations like `first()`
     * @param resultHandler callback to pass each result to
     */
    def runJob[T, U: ClassTag](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        resultHandler: (Int, U) => Unit): Unit = {
      if (stopped.get()) {
        throw new IllegalStateException("SparkContext has been shutdown")
      }
      val callSite = getCallSite
      val cleanedFunc = clean(func)
      logInfo("Starting job: " + callSite.shortForm)
      if (conf.getBoolean("spark.logLineage", false)) {
        logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
      }
      dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
      progressBar.foreach(_.finishAll())
      rdd.doCheckpoint()
    }

    /**
     * Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
     * has completed (therefore the RDD has been materialized and potentially stored in memory).
     * doCheckpoint() is called recursively on the parent RDDs.
     */
    private[spark] def doCheckpoint(): Unit = {
      RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
        if (!doCheckpointCalled) {
          doCheckpointCalled = true
          if (checkpointData.isDefined) {
            if (checkpointAllMarkedAncestors) {
              // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
              // them in parallel.
              // Checkpoint parents first because our lineage will be truncated after we
              // checkpoint ourselves
              dependencies.foreach(_.rdd.doCheckpoint())
            }
            checkpointData.get.checkpoint()
          } else {
            dependencies.foreach(_.rdd.doCheckpoint())
          }
        }
      }
    }
…
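One way to observe the behavior described above is to inspect the RDD before and after the first action. The following is a sketch, assuming Spark is on the classpath and that running in local mode with a temporary local directory as the checkpoint directory is acceptable for experimentation (a real cluster would use a reliable location such as HDFS). The object name `CheckpointLineageCheck`, the app name, and the `run` helper are made up for this example; `isCheckpointed`, `getCheckpointFile`, and `toDebugString` are existing RDD methods.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointLineageCheck {
  // Returns (isCheckpointed before the action, isCheckpointed after the action).
  def run(): (Boolean, Boolean) = {
    // Assumption: local mode with 2 threads, for experimentation only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-lineage-check")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a temporary local directory stands in for HDFS here.
      sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)

      // Cache first so the checkpoint job can reuse the cached partitions.
      val rdd = sc.parallelize(1 to 1000, 10).map(_ * 2).cache()
      rdd.checkpoint() // only marks the RDD; nothing is written yet

      val before = rdd.isCheckpointed
      println(s"lineage before the action:\n${rdd.toDebugString}")

      rdd.count() // the action runs the job; doCheckpoint() is called at the end of runJob

      val after = rdd.isCheckpointed
      println(s"lineage after the action:\n${rdd.toDebugString}")
      println(s"checkpoint file: ${rdd.getCheckpointFile}")
      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"isCheckpointed before action: $before, after action: $after")
  }
}
```

Comparing the two `toDebugString` outputs should show the lineage being truncated once the checkpoint has been written, with the checkpointed data taking the place of the original parents.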
…
*To be continued… *If you spot any errors, please point them out. Thank you!