Hadoop (distributed file system)

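    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    class HDFSClient {
        public static void main(String[] args) throws Exception {
            Configuration configuration = new Configuration();
            // Connect to the NameNode at hadoop0:8020 as user "hood"; try-with-resources closes the client.
            try (FileSystem fileSystem = FileSystem.get(URI.create("hdfs://hadoop0:8020"), configuration, "hood")) {
                System.out.println(fileSystem.getUri());
            }
        }
    }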

MapReduceExample

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>

com.slatepencil.wordcount (jdk1.8.0_212)
com.slatepencil.flow (jdk1.8.0_212)
com.slatepencil.inputformat (jdk1.8.0_212)

Maven commands

mvn --version
mvn archetype:generate -DgroupId=com.slatepencil.app -DartifactId=demo -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false ¹
mvn -B archetype:generate -DgroupId=com.slatepencil.app -DartifactId=demo -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4
mvn compile²
mvn test
mvn test-compile ³
mvn package => ${basedir}/target ⁴
mvn install => ${user.home}/.m2/repository
mvn clean
mvn site

Maven Project Structure

my-app
|-- pom.xml
`-- src
    |-- main
    |   |-- java
    |   |   `-- com
    |   |       `-- mycompany
    |   |           `-- app
    |   |               `-- App.java
    |   `-- resources
    |       `-- META-INF
    |           `-- application.properties
    `-- test
        |-- java
        |   `-- com
        |       `-- mycompany
        |           `-- app
        |               `-- AppTest.java
        `-- resources
            `-- test.properties

MapTask

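Read phase: the MapTask uses the user-supplied RecordReader to parse key/value pairs one at a time out of its InputSplit.
Map phase: each parsed key/value pair is passed to the user-written map() function, which produces a series of new key/value pairs.
Collect phase: when map() finishes processing a record, it usually calls OutputCollector.collect() to emit the result. Inside that call, the generated key/value pair is assigned a partition (by calling the Partitioner) and written into a circular in-memory buffer.
Spill phase: when the circular buffer fills up, MapReduce writes its contents to local disk as a temporary file. Before the data is written, it is sorted locally and, where necessary, combined and compressed. A spill has three steps:
- Sort the buffered data with quicksort, first by partition number and then by key. After sorting, the data is clustered by partition, and all data within a partition is ordered by key.
- Write each partition's data, in ascending order of partition number, to the temporary file output/spillN.out in the task's working directory (N is the current spill count). If the user configured a Combiner, the data in each partition is aggregated once before it is written.
- Record each partition's metadata in the in-memory index structure SpillRecord: its offset in the temporary file, its size before compression, and its size after compression. If the in-memory index grows beyond 1 MB, the index is written to the file output/spillN.out.index.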
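
The sort order used in the first spill step can be sketched in plain Java, without Hadoop on the classpath. The class `SpillSortSketch` and its `Pair` type are hypothetical names for illustration. The real MapTask sorts serialized records inside the buffer with quicksort, whereas this sketch sorts Java objects with `List.sort`; only the resulting order (partition first, then key) is the point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SpillSortSketch {
    // Hypothetical stand-in for one key/value pair held in the ring buffer.
    static final class Pair {
        final int partition;
        final String key;
        final int value;

        Pair(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return partition + ":" + key + "=" + value;
        }
    }

    // Order by partition number first, then by key within a partition.
    static final Comparator<Pair> SPILL_ORDER =
            Comparator.comparingInt((Pair p) -> p.partition)
                      .thenComparing((Pair p) -> p.key);

    // Returns a sorted copy; the input list is left untouched.
    static List<Pair> sortForSpill(List<Pair> buffered) {
        List<Pair> sorted = new ArrayList<>(buffered);
        sorted.sort(SPILL_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Pair> buffered = Arrays.asList(
                new Pair(1, "pear", 1),
                new Pair(0, "kiwi", 1),
                new Pair(1, "apple", 1),
                new Pair(0, "fig", 1));
        // prints [0:fig=1, 0:kiwi=1, 1:apple=1, 1:pear=1]
        System.out.println(sortForSpill(buffered));
    }
}
```

After the sort, each partition occupies one contiguous run, which is what lets the second step write the partitions to the spill file one after another.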
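Combine phase: once all data has been processed, the MapTask merges all temporary files into a single large file, output/file.out, and generates the matching index file output/file.out.index, so that the task ends up with exactly one data file. The merge proceeds partition by partition. For each partition it uses multiple rounds of recursive merging: each round merges io.sort.factor (default 10) files and puts the resulting file back on the list of files to merge; the list is re-sorted and the process repeats until a single large file remains.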
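
The multi-round merge can be illustrated with in-memory lists standing in for spill files. This is a simplified sketch under stated assumptions: `MultiPassMergeSketch`, `mergeRuns`, and `mergeAll` are hypothetical names, each "file" is a sorted `List<String>`, and every round simply takes the smallest `factor` runs. The real merger works on on-disk segments per partition and may size its rounds differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class MultiPassMergeSketch {
    // Merge several sorted runs into one sorted run using a min-heap of cursors.
    static List<String> mergeRuns(List<List<String>> runs) {
        // Each heap entry is {run index, position inside that run}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] c) -> runs.get(c[0]).get(c[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] c = heap.poll();
            merged.add(runs.get(c[0]).get(c[1]));
            if (c[1] + 1 < runs.get(c[0]).size()) heap.add(new int[] {c[0], c[1] + 1});
        }
        return merged;
    }

    // Merge at most `factor` runs per round until one run remains.
    // Returns the number of rounds; `out` receives the final sorted run.
    static int mergeAll(List<List<String>> spills, int factor, List<String> out) {
        if (factor < 2) throw new IllegalArgumentException("factor must be at least 2");
        List<List<String>> pending = new ArrayList<>(spills);
        int rounds = 0;
        while (pending.size() > 1) {
            pending.sort(Comparator.comparingInt(List::size)); // smallest runs first
            int n = Math.min(factor, pending.size());
            List<List<String>> batch = new ArrayList<>(pending.subList(0, n));
            pending.subList(0, n).clear();
            pending.add(mergeRuns(batch)); // merged run rejoins the pending list
            rounds++;
        }
        if (!pending.isEmpty()) out.addAll(pending.get(0));
        return rounds;
    }

    public static void main(String[] args) {
        List<List<String>> spills = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            spills.add(Arrays.asList("k" + (i % 4), "m" + i)); // each run is already sorted
        }
        List<String> merged = new ArrayList<>();
        int rounds = mergeAll(spills, 10, merged);
        System.out.println(rounds + " rounds, " + merged.size() + " records");
    }
}
```

With 12 spills and a factor of 10, the first round merges 10 runs into one, leaving 3 runs, and the second round merges those into the final file.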

ReduceTask

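Copy phase: the ReduceTask remotely copies one piece of data from each MapTask. If a piece exceeds a certain size threshold it is written to disk; otherwise it is kept in memory.
Merge phase: while the remote copy is in progress, the ReduceTask runs two background threads that merge the in-memory data and the on-disk files, so that neither memory use nor the number of files on disk grows too large.
Sort phase: by MapReduce semantics, the input to the user-written reduce() function is a group of data aggregated by key. To bring records with the same key together, Hadoop uses a sort-based strategy. Because each MapTask has already sorted its own output locally, the ReduceTask only needs to run one merge sort over all the data.
Reduce phase: the reduce() function writes the computed results to HDFS.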
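
The Sort and Reduce phases can be sketched together: because every map output is already sorted by key, one merge pass brings equal keys next to each other, and reduce() is then called once per key group. The names below (`ReduceSideSketch`, `mergeAndReduce`, `kv`) are hypothetical, the runs are in-memory lists rather than fetched files, and the reduce logic is assumed to be a word-count style sum.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class ReduceSideSketch {
    // Hypothetical reduce(): sums the values of one key group.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Merge key-sorted runs (one per MapTask) and call reduce() once per distinct key.
    static Map<String, Integer> mergeAndReduce(List<List<Map.Entry<String, Integer>>> runs) {
        // Each heap entry is {run index, position inside that run}, ordered by key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] c) -> runs.get(c[0]).get(c[1]).getKey()));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[] {i, 0});
        }
        Map<String, Integer> output = new LinkedHashMap<>();
        String currentKey = null;
        List<Integer> group = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] c = heap.poll();
            Map.Entry<String, Integer> e = runs.get(c[0]).get(c[1]);
            if (currentKey != null && !currentKey.equals(e.getKey())) {
                output.put(currentKey, reduce(currentKey, group)); // key changed: close the group
                group = new ArrayList<>();
            }
            currentKey = e.getKey();
            group.add(e.getValue());
            if (c[1] + 1 < runs.get(c[0]).size()) heap.add(new int[] {c[0], c[1] + 1});
        }
        if (currentKey != null) output.put(currentKey, reduce(currentKey, group));
        return output;
    }

    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> runs = Arrays.asList(
                Arrays.asList(kv("a", 1), kv("c", 2)),
                Arrays.asList(kv("a", 3), kv("b", 1)));
        // prints {a=4, b=1, c=2}
        System.out.println(mergeAndReduce(runs));
    }
}
```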

Summary

InputFormat
- TextInputFormat (default): reads line by line; the key is the line's byte offset in the file and the value is the line's text
- KeyValueTextInputFormat: one record per line, split into key and value at the tab separator ("\t")
- NLineInputFormat: each split holds a fixed number of lines (N)
- CombineTextInputFormat: packs multiple small files into one split, which improves processing efficiency
- Custom InputFormat implementations are also supported
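
To make the first two record shapes concrete, here is a small plain-Java sketch of what a mapper would receive. It does not use Hadoop's classes; `LineOffsetSketch`, `records`, and `splitKeyValue` are hypothetical names, the input is assumed to be UTF-8 with "\n" line endings, and the handling of a line without a tab is an assumption labeled in the code.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class LineOffsetSketch {
    // TextInputFormat-style records: byte offset of each line -> line text.
    static Map<Long, String> records(String text) {
        Map<Long, String> out = new LinkedHashMap<>();
        int start = 0;
        long offset = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            boolean hasNewline = end >= 0;
            if (!hasNewline) end = text.length();
            String line = text.substring(start, end);
            out.put(offset, line);
            // Advance by the line's encoded length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + (hasNewline ? 1 : 0);
            start = end + 1;
        }
        return out;
    }

    // KeyValueTextInputFormat-style record: split at the first tab.
    // Assumption: a line without a tab becomes the key with an empty value.
    static String[] splitKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) return new String[] {line, ""};
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        // prints {0=hello world, 12=foo}
        System.out.println(records("hello world\nfoo\n"));
        String[] kv = splitKeyValue("name\tslatepencil");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```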
Mapper
- setup()
- map()
- cleanup()
Partitioner
- HashPartitioner (default): returns a partition number computed from the key's hash value and the number of reduce tasks: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
- Custom Partitioner
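
The default partition formula is short enough to try in isolation. The sketch below is plain Java, not Hadoop's `HashPartitioner` class, and `HashPartitionSketch` and `partitionFor` are hypothetical names. Masking with `Integer.MAX_VALUE` clears the sign bit so that a negative hash code cannot produce a negative partition number; the parentheses matter because `%` binds more tightly than `&` in Java.

```java
final class HashPartitionSketch {
    // Same arithmetic as the formula above: clear the sign bit, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (Object key : new Object[] {"a", "hadoop", -7, Integer.MIN_VALUE}) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Every key with the same hash code lands in the same partition, so all values for one key reach the same reducer.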
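Comparable (sorting)
- A custom object used as the output key must implement WritableComparable and override compareTo()
- Partial sort: each final output file is sorted internally
- Total sort: all data is sorted globally, usually with a single Reducer
- Secondary sort: the sort uses two conditions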
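
A secondary sort comes down to a compareTo() that checks a second field when the first one ties. The sketch below shows only that comparison logic in plain Java. `FlowKey` and its fields are hypothetical, and a real Hadoop key would additionally implement `write(DataOutput)` and `readFields(DataInput)` from WritableComparable, which are omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SecondarySortSketch {
    // Hypothetical composite key: total flow descending, then phone number ascending.
    static final class FlowKey implements Comparable<FlowKey> {
        final String phone;
        final long totalFlow;

        FlowKey(String phone, long totalFlow) {
            this.phone = phone;
            this.totalFlow = totalFlow;
        }

        @Override
        public int compareTo(FlowKey other) {
            int byFlow = Long.compare(other.totalFlow, totalFlow); // first condition, descending
            if (byFlow != 0) return byFlow;
            return phone.compareTo(other.phone);                   // second condition, ascending
        }

        @Override
        public String toString() {
            return phone + "=" + totalFlow;
        }
    }

    // Returns a sorted copy using the key's natural order.
    static List<FlowKey> sorted(List<FlowKey> keys) {
        List<FlowKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // prints [135=500, 137=200, 139=200]
        System.out.println(sorted(Arrays.asList(
                new FlowKey("139", 200),
                new FlowKey("135", 500),
                new FlowKey("137", 200))));
    }
}
```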
Combiner
- Improves job efficiency by reducing I/O transfer, but it must not change the result of the original business logic
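
The effect of a Combiner can be shown with a word count, where summing partial counts on the map side leaves the final totals unchanged. This is a plain-Java sketch with hypothetical names (`CombinerSketch`, `combine`, `reduce`); each `List<String>` stands in for the words emitted by one mapper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {
    // Combiner step: sum the counts of one mapper's words before the shuffle.
    static Map<String, Integer> combine(List<String> mapperWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : mapperWords) partial.merge(word, 1, Integer::sum);
        return partial;
    }

    // Reducer step: sum the partial counts received from every mapper.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The first mapper sends 2 records (a=2, b=1) instead of 3.
        Map<String, Integer> m1 = combine(Arrays.asList("a", "b", "a"));
        Map<String, Integer> m2 = combine(Arrays.asList("b", "a"));
        // prints {a=3, b=2}
        System.out.println(reduce(Arrays.asList(m1, m2)));
    }
}
```

A sum is safe to pre-aggregate; an average is not. If one mapper sees 1, 2, 3 and another sees 10, the true mean is 16 / 4 = 4, but averaging the per-mapper means gives (2 + 10) / 2 = 6.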
GroupingComparator
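- Groups keys on the reduce side. When the incoming key is a bean object and keys that agree on one or a few fields (but not on all fields) should reach the same reduce() call, use a grouping comparator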
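
The difference between the sort comparator and the grouping comparator can be sketched with two `java.util.Comparator` instances. This is an illustration only: `GroupingSketch`, `OrderKey`, and the order/price fields are hypothetical, and in a real job the grouping comparator would be a comparator class registered on the Job rather than a plain `Comparator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class GroupingSketch {
    // Hypothetical bean key with two fields.
    static final class OrderKey {
        final int orderId;
        final int priceCents;

        OrderKey(int orderId, int priceCents) {
            this.orderId = orderId;
            this.priceCents = priceCents;
        }

        @Override
        public String toString() {
            return orderId + "/" + priceCents;
        }
    }

    // Sort comparator: all fields (orderId ascending, then price descending).
    static final Comparator<OrderKey> SORT =
            Comparator.comparingInt((OrderKey k) -> k.orderId)
                      .thenComparing(Comparator.comparingInt((OrderKey k) -> k.priceCents).reversed());

    // Grouping comparator: orderId only, so keys that differ in price share a group.
    static final Comparator<OrderKey> GROUP =
            Comparator.comparingInt((OrderKey k) -> k.orderId);

    // Sort the keys, then start a new group whenever GROUP sees a different key.
    static List<List<OrderKey>> groups(List<OrderKey> keys) {
        List<OrderKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<OrderKey>> out = new ArrayList<>();
        OrderKey previous = null;
        for (OrderKey k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                out.add(new ArrayList<OrderKey>());
            }
            out.get(out.size() - 1).add(k);
            previous = k;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [[1/300, 1/100], [2/50]]
        System.out.println(groups(Arrays.asList(
                new OrderKey(2, 50),
                new OrderKey(1, 100),
                new OrderKey(1, 300))));
    }
}
```

Because the sort puts the highest price first within each order, the first key of every group is that order's most expensive item.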
Reducer
- setup()
- reduce()
- cleanup()
OutputFormat
- TextOutputFormat (default): writes each key/value pair as one line of the target file
- SequenceFileOutputFormat: its output serves as input to a subsequent MapReduce job; the format is compact and compresses well
- Custom OutputFormat

POM

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.slatepencil</groupId>
    <artifactId>hdfs</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>2.12.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.3</version>
        </dependency>
    </dependencies>
</project>

¹ Generates pom.xml, the Project Object Model (POM).
² The first time this (or any other) command runs, Maven needs to download all the plugins and related dependencies it needs to fulfill the command.
³ Compiles the test sources without executing the tests.
⁴ A SNAPSHOT version is the "development" version before the final "release" version; the SNAPSHOT is "older" than its release.
