Spark Cluster Environment Setup

Copying and Installing Spark

  1. Copy the local spark-3.1.1-bin-hadoop3.2.tgz into the /root directory of the master container
docker cp /path/to/spark-3.1.1-bin-hadoop3.2.tgz master:/root
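  • You can confirm the archive landed before unpacking (a quick check; master is the container name used above):
# List the copied archive inside the master container
docker exec master ls -lh /root/spark-3.1.1-bin-hadoop3.2.tgz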
  2. Extract Spark into the /opt/module directory
tar zxvf /root/spark-3.1.1-bin-hadoop3.2.tgz -C /opt/module/
  3. Rename the extracted directory to spark
mv /opt/module/spark-3.1.1-bin-hadoop3.2 /opt/module/spark
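  • Optional: later commands get shorter if SPARK_HOME is exported and its bin/sbin directories are added to PATH. A minimal sketch, assuming a bash shell in the master container (the rest of this guide keeps using full paths, so this step is not required):
# Hypothetical convenience setup; none of the commands below depend on it
echo 'export SPARK_HOME=/opt/module/spark' >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile
source /etc/profile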

Configuring Spark

  1. Spark's configuration files are all stored under ${SPARK_HOME}/conf
cd /opt/module/spark/conf/
  2. Copy the configuration file templates into working configuration files
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
cp workers.template workers
  3. Configure spark-defaults.conf
  • This file holds Spark's default runtime options

  • Uncomment some of the lines in the Example block at the bottom, so it reads:

# Example:
spark.master spark://master:7077
spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 1g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
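  • One pitfall: with spark.eventLog.enabled set to true but spark.eventLog.dir left commented out, Spark falls back to file:///tmp/spark-events and refuses to start an application if that directory is missing. A minimal fix, assuming you keep the local default rather than the HDFS path from the template:
# Create the default event log directory (needed on any node that runs a driver)
mkdir -p /tmp/spark-events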
  4. Configure spark-env.sh
  • This file holds Spark's environment variables

  • Add the following at the very top

export JAVA_HOME=/opt/module/jdk
export HADOOP_HOME=/opt/module/hadoop
export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop
export SPARK_MASTER_HOST=master
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
  • Specifies the JAVA_HOME path
  • Specifies the directory where Hadoop is installed
  • Specifies the directory containing the Hadoop configuration files
  • Specifies the directory containing the YARN configuration files
  • Specifies the Spark master's hostname
  • Specifies the Spark master's IP address (SPARK_MASTER_IP is a legacy alias of SPARK_MASTER_HOST, kept here for compatibility)
  • Specifies the Spark master's port
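  • Before moving on, it is worth a quick sanity check that the paths referenced in spark-env.sh actually exist in the container (adjust if your JDK or Hadoop live elsewhere):
# Verify the directories referenced by spark-env.sh
ls -d /opt/module/jdk /opt/module/hadoop /opt/module/hadoop/etc/hadoop
/opt/module/jdk/bin/java -version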
  5. Configure workers
  • List the hostname or IP address of every worker node
master
slave1
slave2
  6. Distribute Spark to slave1 and slave2
scp -r /opt/module/spark slave1:/opt/module/
scp -r /opt/module/spark slave2:/opt/module/
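  • Both scp above and the start-workers.sh script below reach the workers over SSH, so passwordless SSH from master to slave1 and slave2 must already work (it normally does if the Hadoop cluster was built the same way). A quick spot-check that the files landed:
# Confirm the distribution arrived and SSH needs no password
ssh slave1 ls /opt/module/spark/conf
ssh slave2 ls /opt/module/spark/conf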

Starting the Cluster

  • Run the following commands in the master container
/opt/module/spark/sbin/start-master.sh
/opt/module/spark/sbin/start-workers.sh
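  • Alternatively, sbin/start-all.sh launches the master and then every worker listed in conf/workers in one step. Once up, the master's web UI should show three workers on port 8080 (a quick check, assuming curl is available in the container):
# One-step alternative to the two commands above
/opt/module/spark/sbin/start-all.sh
# The master web UI listens on port 8080 by default
curl -s http://master:8080 | head -n 5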

Testing the Cluster

  1. Run the π-calculation example that ships with Spark
/opt/module/spark/bin/spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/module/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  • The output contains an approximate value of π (the exact result differs from run to run)
23/10/23 06:55:45 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
23/10/23 06:55:45 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.953 s
23/10/23 06:55:45 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
23/10/23 06:55:45 INFO YarnScheduler: Killing all running tasks in stage 0: Stage finished
23/10/23 06:55:45 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 2.008845 s
--- Pi is roughly 3.143515717578588 ---
23/10/23 06:55:45 INFO SparkContext: SparkContext is stopping with exitCode 0.
23/10/23 06:55:45 INFO SparkUI: Stopped Spark web UI at http://master:4040
23/10/23 06:55:45 INFO YarnClientSchedulerBackend: Interrupting monitor thread
23/10/23 06:55:45 INFO YarnClientSchedulerBackend: Shutting down all executors
23/10/23 06:55:45 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
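  • Since spark-defaults.conf already points spark.master at spark://master:7077, the same example can also be submitted to the standalone cluster instead of YARN (a variant sketch; the trailing 100 is SparkPi's optional partition count):
/opt/module/spark/bin/spark-submit \
  --master spark://master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/module/spark/examples/jars/spark-examples_2.12-3.1.1.jar 100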
  2. Use the jps command to check the Spark processes:
  • In the master container
8935 Master
9127 Worker
  • In the slave1 container
8127 Worker
  • In the slave2 container
9457 Worker
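  • When finished, shut the cluster down from the master container with the matching stop scripts:
# Stop all workers first, then the master (sbin/stop-all.sh does both)
/opt/module/spark/sbin/stop-workers.sh
/opt/module/spark/sbin/stop-master.sh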

References

Spark Configuration