Upgrading to Spark 3.3.0 on CDH 6.3.2

https://juejin.cn/post/7140053569431928845

Follow the post above for the base deployment; the steps below still need to be done on top of it.

Upload the tarball to the client machine where Spark 3 will be deployed:

# Extract the Spark 3 tarball into the CDH parcel's lib directory and rename it to spark3
tar -zxvf spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz -C /opt/cloudera/parcels/CDH/lib
cd /opt/cloudera/parcels/CDH/lib
mv spark-3.3.0-bin-3.0.0-cdh6.3.2/ spark3
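
A quick sanity check that the layout is what the later steps expect:

# The launcher scripts and the jars directory should both be present
ls /opt/cloudera/parcels/CDH/lib/spark3/bin/spark-shell
ls /opt/cloudera/parcels/CDH/lib/spark3/jars | head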

Set up the conf directory

cd /opt/cloudera/parcels/CDH/lib/spark3/conf
## Enable logging
mv log4j2.properties.template log4j2.properties
## Start from the existing spark-defaults.conf
cp /opt/cloudera/parcels/CDH/lib/spark/conf/spark-defaults.conf ./

# Edit spark-defaults.conf
vim /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-defaults.conf
# Remove spark.extraListeners, spark.sql.queryExecutionListeners and spark.yarn.jars
# Add: spark.yarn.jars=hdfs://ns1/user/spark/3versionJars/*

# Upload the Spark 3 jars to the HDFS path referenced by spark.yarn.jars
hadoop fs -mkdir -p hdfs://ns1/user/spark/3versionJars
cd /opt/cloudera/parcels/CDH/lib/spark3/jars
hadoop fs -put *.jar hdfs://ns1/user/spark/3versionJars
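
To confirm the upload landed where spark.yarn.jars points:

# The listed path must match the spark.yarn.jars value (minus the /* glob)
hadoop fs -ls hdfs://ns1/user/spark/3versionJars | head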

Tuning settings

spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.authenticate false (disable SASL authentication for the block transfer service)
spark.io.encryption.enabled false (disable I/O encryption)
spark.network.crypto.enabled false (disable AES-based RPC encryption)
spark.shuffle.service.enabled true (use the external shuffle service for more stable shuffles)
spark.shuffle.service.port 7337 (the external shuffle service is provided by the YARN NodeManager; default port 7337)
spark.shuffle.useOldFetchProtocol true (stay compatible with the old shuffle protocol to avoid errors)
spark.sql.cbo.enabled true (enable cost-based optimization, CBO, in place of rule-based optimization, RBO)
spark.sql.cbo.starSchemaDetection true (star-schema detection: decide whether a column is a table's primary key)
spark.sql.datetime.java8API.enabled false
spark.sql.sources.partitionOverwriteMode dynamic
spark.sql.orc.mergeSchema true (collect the ORC schema from all data files at load time)
spark.sql.parquet.mergeSchema false (set per your data; most of our cluster is Parquet, and collecting the schema from every file hurts performance, so the schema is read from a single Parquet file instead)
spark.sql.parquet.writeLegacyFormat true (compatibility with older clusters)
spark.sql.autoBroadcastJoinThreshold 1048576 (currently only effective for Hive metastore tables where ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run, and for file-based data source tables whose statistics are computed directly on the data files)
spark.sql.adaptive.enabled true (enable Spark AQE, adaptive query execution; its advantage is that execution plans can be adjusted dynamically, based on accurate statistics of intermediate results)
spark.sql.adaptive.forceApply false
spark.sql.adaptive.logLevel info
spark.sql.adaptive.advisoryPartitionSizeInBytes 256m (advisory partition size used when splitting skewed partitions and coalescing small ones; same meaning as spark.sql.adaptive.shuffle.targetPostShuffleInputSize)
spark.sql.adaptive.coalescePartitions.enabled true (coalesce small partitions, on by default; one of the main tuning levers)
spark.sql.adaptive.coalescePartitions.minPartitionSize 1m (minimum partition size after coalescing)
spark.sql.adaptive.coalescePartitions.initialPartitionNum 1024 (initial number of partitions before coalescing)
spark.sql.adaptive.fetchShuffleBlocksInBatch true (fetch shuffle blocks in batches rather than one by one; fetching all blocks of the same map task at once reduces I/O and improves performance)
spark.sql.adaptive.localShuffleReader.enabled true (use LocalShuffleReader when no shuffle is needed, e.g. after a SortMergeJoin is converted to a BroadcastJoin)
spark.sql.adaptive.skewJoin.enabled true (automatically handle skewed partitions in joins by splitting them)
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 128m
spark.sql.adaptive.skewJoin.skewedPartitionFactor 5 (a partition counts as skewed if it is larger than 5x the median partition size and larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes)
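
One way to apply these in bulk is a heredoc append (a sketch; keep only the settings you actually want, and note the parenthetical notes above are commentary, not valid spark-defaults.conf syntax):

# Append a subset of the tuning settings to the Spark 3 defaults file
cat >> /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-defaults.conf <<'EOF'
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 512m
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
spark.shuffle.useOldFetchProtocol true
spark.sql.adaptive.enabled true
EOF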

Copy the CDH cluster's spark-env.sh into /opt/cloudera/parcels/CDH/lib/spark3/conf:

cp /etc/spark/conf/spark-env.sh  /opt/cloudera/parcels/CDH/lib/spark3/conf
chmod +x /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh

# Edit spark-env.sh so SPARK_HOME points at the Spark 3 install
vim /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
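
Confirm the override is in place:

# Should show the edited export pointing at the spark3 directory
grep -n 'SPARK_HOME' /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh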

Copy the gateway node's hive-site.xml into the spark3/conf directory; no changes to it are needed:

# Copy the Hadoop client configs along with hive-site.xml
cp -r /etc/hadoop/conf/*.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/

# Oddly, symlinking instead of copying does not work; the attempt is kept below for reference
cd /opt/cloudera/parcels/CDH/lib/spark3/conf
ln -s /etc/hive/conf/hive-site.xml hive-site.xml
ln -s /etc/hive/conf/hdfs-site.xml hdfs-site.xml
ln -s /etc/hive/conf/core-site.xml core-site.xml
ln -s /etc/hive/conf/mapred-site.xml mapred-site.xml
ln -s /etc/hive/conf/yarn-site.xml yarn-site.xml
ln -s /etc/spark/conf/spark-defaults.conf spark-defaults.conf
ln -s /etc/spark/conf/spark-env.sh spark-env.sh
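
To verify the client configs actually landed in spark3/conf (a quick sanity check):

ls -l /opt/cloudera/parcels/CDH/lib/spark3/conf/*.xml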

Add a pyspark3 shortcut on the PATH:

cd /usr/local/bin
ln -s /opt/cloudera/parcels/CDH/lib/spark3/bin/pyspark pyspark3
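
A quick usage check (assuming /usr/local/bin is on the PATH; prints the session's Spark version, which should be 3.3.0):

echo 'print(spark.version)' | pyspark3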

Add a spark3-shell shortcut:

vim /opt/cloudera/parcels/CDH/bin/spark3-shell

#!/bin/bash
# Autodetect JAVA_HOME if not defined
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"
while [ -h "$SOURCE" ]
do
  SOURCE="$(readlink "$SOURCE")"
  # A relative symlink target is resolved against the directory it lives in
  [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
CDH_LIB_DIR=$BIN_DIR/../../CDH/lib
LIB_DIR=$BIN_DIR/../lib
export HADOOP_HOME=$CDH_LIB_DIR/hadoop
. $CDH_LIB_DIR/bigtop-utils/bigtop-detect-javahome
exec $LIB_DIR/spark3/bin/spark-shell "$@"
chmod +x /opt/cloudera/parcels/CDH/bin/spark3-shell

Usage: alternatives --install <link> <name> <path> <priority>

alternatives --install /usr/bin/spark3-shell spark3-shell /opt/cloudera/parcels/CDH/bin/spark3-shell 1
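
Finally, a quick smoke test of the new entry point:

# The alternatives entry should point at the wrapper, and the shell should report version 3.3.0
alternatives --display spark3-shell
spark3-shell --version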