Win10 + CentOS 7 + Hadoop Cluster Setup


I. Preparation

1. VMware Workstation Pro 16

    Download from the official site: https://www.vmware.com/

    License key: ZF3R0-FHED2-M80TY-8QYGC-NPKYF (if it no longer works, search for a current one)

2. Xshell and Xftp, downloaded from the official site (registration required)

3. Download CentOS (this guide uses CentOS 7) from a domestic mirror site such as Huawei, Alibaba, or Tsinghua:

    https://mirrors.tuna.tsinghua.edu.cn , https://developer.aliyun.com/mirror/ , https://mirrors.huaweicloud.com/

4. Required packages

    Hadoop 2.7.3      JDK 1.8.0

II. Install VMware Workstation Pro 16

III. Install Xshell and Xftp

IV. VMware Network Configuration

Why: putting the Windows host and the virtual machines on the same subnet lets a browser on Windows reach the Hadoop cluster's file-management page (the master:50070 page), and it also makes it easier to connect IDEA to the cluster later. Fixing the virtual machines' IP addresses simplifies all of the following steps.

1. In VMware, go to Edit -> Virtual Network Editor -> select the VMnet8 (NAT) adapter

2. Click "NAT Settings"

3. Click "DHCP Settings" for the VMnet8 adapter selected in step 1

4. Set the VMnet8 subnet address

5. Right-click VMnet8 -> Properties
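
With the VMnet8 subnet fixed, each VM can be given a static IP inside that subnet. Below is a minimal sketch of /etc/sysconfig/network-scripts/ifcfg-ens33 on master, assuming the NIC is named ens33, the subnet is 192.168.47.0/24, and the NAT gateway (from the NAT Settings dialog) is 192.168.47.2; adjust these to your own setup:

TYPE=Ethernet
BOOTPROTO=static           # fixed address instead of DHCP
NAME=ens33
DEVICE=ens33
ONBOOT=yes                 # bring the interface up at boot
IPADDR=192.168.47.131      # master's fixed IP in the VMnet8 subnet
NETMASK=255.255.255.0
GATEWAY=192.168.47.2       # the VMnet8 NAT gateway
DNS1=114.114.114.114

After editing, apply it with: systemctl restart network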

V. Create the Virtual Machine Instances

I created three virtual machines with the hostnames master, slave1 and slave2.

During creation, choose NAT as the network type.

To change the hostname, see: https://www.cnblogs.com/HusterX/articles/13425074.html
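
On CentOS 7 the hostname can also be set directly with hostnamectl; run the matching command on each VM:

hostnamectl set-hostname master    # on master; use slave1 / slave2 on the other two machines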

Turn off the firewall (or open the corresponding ports):

firewall-cmd --state
systemctl stop firewalld.service
systemctl disable firewalld.service
Disable firewalld
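
If you prefer to keep firewalld running, a hedged alternative is to open only the ports the cluster needs (the ports below are the Hadoop 2.x defaults used later in this guide; adjust them to your configuration):

firewall-cmd --permanent --add-port=9000/tcp     # NameNode RPC (fs.defaultFS)
firewall-cmd --permanent --add-port=50070/tcp    # NameNode web UI
firewall-cmd --permanent --add-port=8088/tcp     # YARN ResourceManager web UI
firewall-cmd --reload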

Add a user (a dedicated user for managing Hadoop; for a purely local setup you can skip this step and simply do everything as root)
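
A minimal sketch of creating such a user (the user name and the optional sudo grant are illustrative choices, not requirements of this guide):

useradd hadoop             # create the hadoop management user
passwd hadoop              # set its password interactively
usermod -aG wheel hadoop   # optional: allow sudo via the wheel group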

 UserAdd

Final layout

IP address       Hostname   Main roles
192.168.47.131   master     NameNode, ResourceManager
192.168.47.132   slave1     DataNode, NodeManager
192.168.47.130   slave2     DataNode, NodeManager

Edit the /etc/hosts file on master:

192.168.47.131 master
192.168.47.132 slave1
192.168.47.130 slave2
master's hosts

VI. CentOS 7 System Environment Setup

Perform the following steps after connecting to master with Xshell, or directly inside master.

PS: I do everything as root (if you use a hadoop user instead, watch out for permission issues).

1. Use Xftp to upload the JDK and Hadoop archives to a directory on master (this guide uses /opt).

2. Set up the Java environment

1. Check whether Java is already installed
   java -version

2. If OpenJDK is present, remove it
   list the packages:
   rpm -qa | grep openjdk
   remove them:
   rpm -e --nodeps [the relevant packages]

3. Extract the JDK archive uploaded to /opt
   tar -zxvf jdk1.8*****
   rename the directory:
   mv jdk1.8***** jdk8

4. Add the environment variables (as root)
   vim /etc/profile

   add:
   export JAVA_HOME=/opt/jdk8
   export PATH=$PATH:$JAVA_HOME/bin

5. Apply the changes
   source /etc/profile

6. Test
   java -version
CentOS 7 Java environment setup

3. Set up the Hadoop environment

1. Extract the Hadoop archive uploaded to /opt
   tar -zxf hadoop-2.7.3.tar.gz
   rename the directory:
   mv hadoop-2.7.3 hadoop

2. Configure the Hadoop environment variables (as root)
   vim /etc/profile

   add:
   export HADOOP_HOME=/opt/hadoop

   change the PATH line to:
   export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

3. Test
   hadoop version

[root@master ~]# hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.3.jar
CentOS 7 Hadoop environment setup

4. Passwordless SSH login

1. Generate a key pair
[root@master ~]# ssh-keygen -t rsa
The keys are created under /root/.ssh:
[root@master ~]# ls /root/.ssh/
id_rsa  id_rsa.pub

2. Add the public key to the authorized keys
[root@master ~]# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
[root@master ~]# ls /root/.ssh/
authorized_keys  id_rsa  id_rsa.pub

3. Set the permissions
[root@master ~]# chmod 600 /root/.ssh/authorized_keys

4. Repeat steps 1-3 on the other CentOS machines

5. Distribute the key to the other hosts in the cluster
    Usage: ssh-copy-id [-i [identity_file]] [user@]machine
[root@master ~]# ssh-copy-id  [user]@[IP]
Passwordless SSH setup
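
Once the keys have been distributed, a quick sanity check from master (hostnames as defined in /etc/hosts) is to run a command over SSH and confirm no password prompt appears:

ssh slave1 hostname    # should print "slave1" without asking for a password
ssh slave2 hostname    # should print "slave2"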

5. Create directories under /opt/hadoop/

   (1)  Create an hdfs directory

   (2)  Under hdfs, create the name, data and tmp directories, for example:
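
In concrete commands, and matching the dfs.namenode.name.dir, dfs.datanode.data.dir and hadoop.tmp.dir values used below:

mkdir -p /opt/hadoop/hdfs/name   # NameNode metadata
mkdir -p /opt/hadoop/hdfs/data   # DataNode block storage
mkdir -p /opt/hadoop/hdfs/tmp    # base for temporary files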

6. Hadoop configuration files

Add export JAVA_HOME=/opt/jdk8 to hadoop-env.sh:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
export JAVA_HOME=/opt/jdk8
hadoop-env.sh
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.47.131:9000</value>
    </property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/hdfs/name</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/hdfs/data</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
mapred-site.xml
slave1
slave2
slaves

PS: the contents of the slaves file depend on your own deployment; I deployed two slaves.

7. The configuration on master is now complete.

   Copy /opt/jdk8 and /opt/hadoop from master to the /opt directory on slave1 and slave2.

   Copy /etc/hosts and /etc/profile from master to the corresponding locations on slave1 and slave2, for example:
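
A hedged sketch of the copy, assuming root SSH access to the slaves and the paths used above:

scp -r /opt/jdk8 /opt/hadoop root@slave1:/opt/
scp -r /opt/jdk8 /opt/hadoop root@slave2:/opt/
scp /etc/hosts /etc/profile root@slave1:/etc/
scp /etc/hosts /etc/profile root@slave2:/etc/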

8. On slave1 and slave2, run

   source  /etc/profile

For reference, my /etc/profile file:

# /etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.

pathmunge () {
    case ":${PATH}:" in
        *:"$1":*)
            ;;
        *)
            if [ "$2" = "after" ] ; then
                PATH=$PATH:$1
            else
                PATH=$1:$PATH
            fi
    esac
}


if [ -x /usr/bin/id ]; then
    if [ -z "$EUID" ]; then
        # ksh workaround
        EUID=`/usr/bin/id -u`
        UID=`/usr/bin/id -ru`
    fi
    USER="`/usr/bin/id -un`"
    LOGNAME=$USER
    MAIL="/var/spool/mail/$USER"
fi

# Path manipulation
if [ "$EUID" = "0" ]; then
    pathmunge /usr/sbin
    pathmunge /usr/local/sbin
else
    pathmunge /usr/local/sbin after
    pathmunge /usr/sbin after
fi

HOSTNAME=`/usr/bin/hostname 2>/dev/null`
HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
    export HISTCONTROL=ignoreboth
else
    export HISTCONTROL=ignoredups
fi

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`/usr/bin/id -gn`" = "`/usr/bin/id -un`" ]; then
    umask 002
else
    umask 022
fi

for i in /etc/profile.d/*.sh /etc/profile.d/sh.local ; do
    if [ -r "$i" ]; then
        if [ "${-#*i}" != "$-" ]; then 
            . "$i"
        else
            . "$i" >/dev/null
        fi
    fi
done

unset i
unset -f pathmunge
export JAVA_HOME=/opt/jdk8
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
/etc/profile

VII. Start the Hadoop Cluster

1. Format the filesystem
hadoop namenode -format    (or: hdfs namenode -format)
(Note: this only needs to be run once; later startups must not format again
unless the master/slave configuration has changed.)

2. Start Hadoop (from the /opt/hadoop directory)
   sbin/start-all.sh

3. Verify with jps: check the processes on master
   [root@master hadoop]# jps
   28448 ResourceManager
   31777 Jps
   28293 SecondaryNameNode
   28105 NameNode
   check the processes on slave1
   [root@slave1 ~]# jps
   22950 Jps
   18665 NodeManager
   18558 DataNode

4. View in a browser
    http://master:50070
     If Windows and the virtual machines are on the same subnet, this page can be opened in a browser on Windows; if they are not, it can still be made reachable with extra configuration. It can also be opened in a browser inside master.

5. Stop the Hadoop cluster (from /opt/hadoop)
    sbin/stop-all.sh
Startup commands

VIII. Test with a Sample Job

1. In the /opt/hadoop directory
echo "this is a test case, loading, please wait a minit" >> test

2. Create the input directory with an hdfs command
   hadoop fs -mkdir /input

3. Put the test file into /input with an hdfs command
   hadoop fs -put test /input

4. Run the wordcount example that ships with Hadoop
   hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output

5. View the output
   hadoop fs -ls /output
   hadoop fs -cat /output/part-r-00000
TestCase

PS: these directories all live on HDFS, so you will not find them on the local disk. The output directory (/output) must not exist before the job runs. If the input changes and you want to rerun the job, delete both /input and /output and recreate them (or use two fresh directories), as shown below. If you ever need to run hadoop namenode -format again, be sure to delete the old logs, temporary files, etc. beforehand.
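
For example, rerunning the wordcount job after changing the input could look like this (a sketch reusing the /input and /output paths from above):

hadoop fs -rm -r /output    # remove the old result directory
hadoop fs -rm -r /input     # remove the old input directory
hadoop fs -mkdir /input
hadoop fs -put test /input  # upload the modified test file
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output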

 

Delete a directory
hadoop fs -rm -r [/targetDir]    (the older form hadoop fs -rmr still works in 2.7.x but is deprecated)

List the files in a target directory
hadoop fs -ls [/targetDir]

Put a local file onto HDFS
hadoop fs -put localFile remoteFilePath
HDFS commands

IX. Connect IDEA to the Hadoop Cluster and Run WordCount

See: https://www.cnblogs.com/HusterX/p/14162985.html
