Chapter 5: Spark Core Programming: The Five Core Properties of an RDD

 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
  /**
   * Method description:
   *     1. Returns the array of Partition objects that make up this RDD.
   */
  protected def getPartitions: Array[Partition]
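`getPartitions` is protected, so user code normally reads the same information through the public `partitions` accessor. The following is a minimal sketch of that, assuming Spark is on the classpath and a local master (`local[2]`) is acceptable; the object and method names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionsDemo {
  // Returns the index of every partition of a small parallelized collection.
  def partitionIndices(): Seq[Int] = {
    // Assumption: a throwaway local context is fine for this illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitions-demo")
    val sc = new SparkContext(conf)
    try {
      // Ask for 4 partitions explicitly instead of relying on the default parallelism.
      val rdd = sc.parallelize(1 to 10, numSlices = 4)
      // `partitions` is the public view of what `getPartitions` produces.
      rdd.partitions.map(_.index).toList
    } finally sc.stop()
  }
}
```

Reading `partitions` does not launch a job; it only inspects the RDD's metadata.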


 *  - A function for computing each split
  /**
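   * Method description:
   *     1. Computes the given partition and returns its elements as an iterator.
   * Note:
   *     1. When Spark evaluates an RDD, it applies this function to each partition separately.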
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]
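To see how `getPartitions` and `compute` fit together, here is a sketch of a custom RDD that produces the integers `0 until n`. `SlicePartition`, `SliceRDD`, and `ComputeDemo` are hypothetical names introduced only for this example; Spark's built-in RDDs are implemented differently and with more care.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one contiguous slice [start, end) of the range.
class SlicePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD with no parents (Nil dependencies) that yields 0 until n.
class SliceRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {

  // Property 1: describe how the data set is split.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new SlicePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // Property 2: produce the elements of exactly one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[SlicePartition]
    (p.start until p.end).iterator
  }
}

object ComputeDemo {
  // Collects the custom RDD so that compute runs once per partition.
  def collectSlices(): List[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("compute-demo")
    val sc = new SparkContext(conf)
    try new SliceRDD(sc, n = 10, numSlices = 3).collect().toList
    finally sc.stop()
  }
}
```

Only these two methods are abstract in `RDD`; the remaining three properties have defaults, which is why this sketch can leave them out.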


 *  - A list of dependencies on other RDDs
  /**
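   * Method description:
   *     1. Returns this RDD's dependencies on its parent RDDs.
   * Note:
   *     1. An RDD encapsulates a computation model. When a job combines several computation models (RDDs), the RDDs must be linked through dependencies.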
   */
  protected def getDependencies: Seq[Dependency[_]] = deps
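The dependency list can be inspected through the public `dependencies` accessor. The sketch below, with made-up object and method names and a local master assumed, compares a transformation that keeps each partition in place (`mapValues`) with one that has to shuffle (`reduceByKey`).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DependenciesDemo {
  // Returns the dependency class names of a narrow and a shuffle transformation.
  def dependencyKinds(): (String, String) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dependencies-demo")
    val sc = new SparkContext(conf)
    try {
      val base = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2)
      // mapValues works partition by partition, so it depends one-to-one on its parent.
      val mapped = base.mapValues(_ + 1)
      // reduceByKey on an RDD without a partitioner must regroup the data by key.
      val reduced = mapped.reduceByKey(_ + _)
      (mapped.dependencies.head.getClass.getSimpleName,
       reduced.dependencies.head.getClass.getSimpleName)
    } finally sc.stop()
  }
}
```

No job is launched here; the dependency objects are part of the lineage that Spark records when each transformation is declared.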


 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  /** 
  *  Method description:
  *      1. Specifies the partitioner, if any; it is None by default.
  */
  @transient val partitioner: Option[Partitioner] = None
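As an illustration of when `partitioner` is set, the sketch below reads it at three points in a small lineage. The object and method names are hypothetical, and a local master is assumed.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerDemo {
  // Returns the partition count reported by `partitioner` at three points in a lineage.
  def partitionerSizes(): (Option[Int], Option[Int], Option[Int]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-demo")
    val sc = new SparkContext(conf)
    try {
      // A freshly parallelized pair RDD has no partitioner.
      val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3), numSlices = 2)
      // partitionBy records the partitioner that was used to place the keys.
      val hashed = pairs.partitionBy(new HashPartitioner(3))
      // A plain map may change the keys, so the result does not keep the partitioner.
      val remapped = hashed.map { case (k, v) => (k, v * 2) }
      (pairs.partitioner.map(_.numPartitions),
       hashed.partitioner.map(_.numPartitions),
       remapped.partitioner.map(_.numPartitions))
    } finally sc.stop()
  }
}
```

When the keys stay unchanged, `mapValues` is usually the better choice than `map`, because it preserves the parent's partitioner.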


 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
 *    an HDFS file)
  /**
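  *  Method description:
  *      1. Returns the preferred locations for computing the given partition, so that, depending on the state of the compute nodes, the task can be placed on a suitable node (ideally one close to the data).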
  */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
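Preferred locations can be read through the public `preferredLocations` method. The sketch below uses the `makeRDD` overload that accepts a location preference per element. The host names `host1` to `host3` are invented, and the object and method names are hypothetical. Since the code only reads metadata and runs no job, the hosts do not need to exist.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationsDemo {
  // Returns the preferred hosts of each partition of an RDD built with location hints.
  def locations(): List[Seq[String]] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("locations-demo")
    val sc = new SparkContext(conf)
    try {
      // Each element carries its own (made-up) list of preferred hosts.
      val data: Seq[(String, Seq[String])] = Seq(
        ("first", Seq("host1")),
        ("second", Seq("host2", "host3"))
      )
      // This makeRDD overload creates one partition per element.
      val rdd = sc.makeRDD[String](data)
      rdd.partitions.map(p => rdd.preferredLocations(p)).toList
    } finally sc.stop()
  }
}
```

These locations are hints for the scheduler rather than hard constraints: a task can still run elsewhere if no preferred node is available.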
