Upgrading CDH 6.3.2 to Spark 3.3.0
https://juejin.cn/post/7140053569431928845
Follow the guide above for the base deployment; the steps below still need to be added.

Upload the tarball to the client machine where Spark 3 will be deployed:
```shell
tar -zxvf spark-3.3.0-bin-3.0.0-cdh6.3.2.tgz -C /opt/cloudera/parcels/CDH/lib
cd /opt/cloudera/parcels/CDH/lib
mv spark-3.3.0-bin-3.0.0-cdh6.3.2/ spark3
```
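As a quick optional sanity check (it only assumes Java is available on this host), the relocated distribution should be able to print its own version before any further configuration:

```shell
# Expect a version banner for Spark 3.3.0; the command exits immediately.
/opt/cloudera/parcels/CDH/lib/spark3/bin/spark-shell --version
```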
Configure the conf directory
```shell
cd /opt/cloudera/parcels/CDH/lib/spark3/conf
mv log4j2.properties.template log4j2.properties
cp /opt/cloudera/parcels/CDH/lib/spark/conf/spark-defaults.conf ./

# Edit spark-defaults.conf:
#   - remove spark.extraListeners, spark.sql.queryExecutionListeners and spark.yarn.jars
#   - add: spark.yarn.jars=hdfs://ns1/user/spark/3versionJars/*
vim /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-defaults.conf

# Upload the Spark 3 jars to the HDFS path that spark.yarn.jars points at
hadoop fs -mkdir -p hdfs://ns1/user/spark/3versionJars
cd /opt/cloudera/parcels/CDH/lib/spark3/jars
hadoop fs -put *.jar hdfs://ns1/user/spark/3versionJars
```
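It is worth verifying that the upload matches what spark.yarn.jars points at; a minimal check, assuming the ns1 nameservice used above:

```shell
# Expect one HDFS entry per jar under spark3/jars.
hadoop fs -ls hdfs://ns1/user/spark/3versionJars | tail -5
hadoop fs -ls hdfs://ns1/user/spark/3versionJars | wc -l
```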
Tuning settings
```properties
spark.kryoserializer.buffer.max 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
# Disable SASL authentication/encryption for the block transfer service
spark.authenticate false
# Disable I/O encryption
spark.io.encryption.enabled false
# Disable AES-based RPC encryption
spark.network.crypto.enabled false
# Enable the external shuffle service for more stable shuffles
spark.shuffle.service.enabled true
# The external shuffle service is provided by the YARN NodeManager; default port 7337
spark.shuffle.service.port 7337
# Stay compatible with the old shuffle fetch protocol to avoid errors
spark.shuffle.useOldFetchProtocol true
# Enable cost-based optimization (CBO) in place of purely rule-based optimization (RBO)
spark.sql.cbo.enabled true
# Star-schema detection: infer whether a column is a table's primary key
spark.sql.cbo.starSchemaDetection true
spark.sql.datetime.java8API.enabled false
spark.sql.sources.partitionOverwriteMode dynamic
# Collect the ORC schema from all data files
spark.sql.orc.mergeSchema true
# Set per workload: most of our tables are Parquet, and collecting the schema
# from every file hurts performance, so read it from a single Parquet file
spark.sql.parquet.mergeSchema false
# For compatibility with the old cluster
spark.sql.parquet.writeLegacyFormat true
# Currently only effective for Hive metastore tables where
# ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run, and for
# file-based data source tables whose statistics are computed directly from the data files
spark.sql.autoBroadcastJoinThreshold 1048576
# Enable Spark AQE (adaptive query execution): plans are adjusted dynamically
# at runtime, based on accurate statistics from intermediate results
spark.sql.adaptive.enabled true
spark.sql.adaptive.forceApply false
spark.sql.adaptive.logLevel info
# Advisory partition size when splitting skewed partitions or coalescing small
# ones; same meaning as spark.sql.adaptive.shuffle.targetPostShuffleInputSize
spark.sql.adaptive.advisoryPartitionSizeInBytes 256m
# Coalesce small partitions (on by default); one of the main tuning levers
spark.sql.adaptive.coalescePartitions.enabled true
# Minimum partition size after coalescing
spark.sql.adaptive.coalescePartitions.minPartitionSize 1m
# Initial number of partitions before coalescing
spark.sql.adaptive.coalescePartitions.initialPartitionNum 1024
# Fetch shuffle blocks in batches rather than one by one; batching the blocks
# of a single map task reduces I/O and improves performance
spark.sql.adaptive.fetchShuffleBlocksInBatch true
# Use LocalShuffleReader when no shuffle is needed,
# e.g. after a SortMergeJoin is converted to a BroadcastJoin
spark.sql.adaptive.localShuffleReader.enabled true
# Automatically split skewed partitions during joins
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 128m
# A partition counts as skewed if it is larger than 5x the median partition size
# and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
spark.sql.adaptive.skewJoin.skewedPartitionFactor 5
```
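Once a Spark 3 session starts, the effective values can be read back with SQL SET statements. A minimal sketch, assuming the spark3 install path above and a working YARN connection:

```shell
# SET <key> echoes the key together with its effective value in the session.
/opt/cloudera/parcels/CDH/lib/spark3/bin/spark-sql --master yarn -e "
  SET spark.sql.adaptive.enabled;
  SET spark.sql.cbo.enabled;
  SET spark.shuffle.service.port;
"
```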
Copy the CDH cluster's spark-env.sh to /opt/cloudera/parcels/CDH/lib/spark3/conf:
```shell
cp /etc/spark/conf/spark-env.sh /opt/cloudera/parcels/CDH/lib/spark3/conf
chmod +x /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh

# Edit spark-env.sh ...
vim /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh

# ... and set the following in it:
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark3
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
```
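To confirm the environment resolves as intended, sourcing the file and asking spark-submit for its version is enough (a sketch; no cluster access is required):

```shell
source /opt/cloudera/parcels/CDH/lib/spark3/conf/spark-env.sh
echo "$SPARK_HOME"                         # expect /opt/cloudera/parcels/CDH/lib/spark3
"$SPARK_HOME"/bin/spark-submit --version   # expect a Spark 3.3.0 banner
```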
Copy the gateway node's hive-site.xml into the spark3/conf directory; no changes are needed:
```shell
cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
```
```shell
cp -r /etc/hadoop/conf/*.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/
cp /etc/hive/conf/hive-site.xml /opt/cloudera/parcels/CDH/lib/spark3/conf/

# Oddly, using symlinks instead of copies does not work:
cd /opt/cloudera/parcels/CDH/lib/spark3/conf
ln -s /etc/hive/conf/hive-site.xml hive-site.xml
ln -s /etc/hive/conf/hdfs-site.xml hdfs-site.xml
ln -s /etc/hive/conf/core-site.xml core-site.xml
ln -s /etc/hive/conf/mapred-site.xml mapred-site.xml
ln -s /etc/hive/conf/yarn-site.xml yarn-site.xml
ln -s /etc/spark/conf/spark-defaults.conf spark-defaults.conf
ln -s /etc/spark/conf/spark-env.sh spark-env.sh
```
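To double-check that every file Spark 3 needs actually landed in conf, a small loop over the expected names (real copies, not symlinks, per the note above):

```shell
# Each ls should show a regular file, not a dangling link.
for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml hive-site.xml; do
  ls -l /opt/cloudera/parcels/CDH/lib/spark3/conf/"$f"
done
```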
Add a pyspark3 shortcut:

```shell
cd /usr/local/bin
ln -s /opt/cloudera/parcels/CDH/lib/spark3/bin/pyspark pyspark3
```
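A short smoke test of the shortcut (a sketch, assuming YARN is reachable from this gateway): pipe a trivial job into the REPL and let it exit.

```shell
# Counts 0..9 on the cluster; expect "10" in the output before the shell exits.
pyspark3 --master yarn <<'EOF'
print(spark.range(10).count())
EOF
```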
Add a spark3-shell shortcut:
```shell
vim /opt/cloudera/parcels/CDH/bin/spark3-shell
```
Script content:

```shell
#!/bin/bash

# Autodetect JAVA_HOME if not defined
# Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
SOURCE="${BASH_SOURCE[0]}"
BIN_DIR="$( dirname "$SOURCE" )"

# Resolve symlinks until we find the real location of this script
while [ -h "$SOURCE" ]
do
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$BIN_DIR/$SOURCE"
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
done
BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"

CDH_LIB_DIR=$BIN_DIR/../../CDH/lib
LIB_DIR=$BIN_DIR/../lib

export HADOOP_HOME=$CDH_LIB_DIR/hadoop

# Autodetect JAVA_HOME
. $CDH_LIB_DIR/bigtop-utils/bigtop-detect-javahome

exec $LIB_DIR/spark3/bin/spark-shell "$@"
```
```shell
chmod +x /opt/cloudera/parcels/CDH/bin/spark3-shell
```
Usage: alternatives --install <link> <name> <path> <priority>
```shell
alternatives --install /usr/bin/spark3-shell spark3-shell /opt/cloudera/parcels/CDH/bin/spark3-shell 1
```
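To confirm the registration took effect, alternatives can display the link group, and the wrapper should launch Spark 3 (a quick check, assuming all of the steps above completed):

```shell
alternatives --display spark3-shell   # shows the registered link and its priority
which spark3-shell                    # expect /usr/bin/spark3-shell
spark3-shell --version                # expect a Spark 3.3.0 banner
```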