Hive metastore metadata sync & invalid partition cleanup

ZhaoYingChao88 2024-06-17 10:29:37

Read the Hive metadata to get the list of tables, then run [MSCK] REPAIR TABLE table_identifier [{ADD|DROP|SYNC} PARTITIONS] against each of them in batch.
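The three modes differ in direction: ADD PARTITIONS registers partition directories that exist on storage but are missing from the metastore, DROP PARTITIONS removes metastore entries whose directories no longer exist, and SYNC PARTITIONS does both. As a minimal sketch, a small helper can build one statement per table and reject any other mode. `build_msck_stmt` and the `sales.orders` table are hypothetical names used only for illustration.

```shell
# Hypothetical helper: print one MSCK REPAIR statement for db.table.
# mode must be ADD, DROP or SYNC; it defaults to SYNC, the mode used in this post.
build_msck_stmt() {
    db="$1"; table="$2"; mode="${3:-SYNC}"
    case "$mode" in
        ADD|DROP|SYNC) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    printf 'MSCK REPAIR TABLE %s.%s %s PARTITIONS;\n' "$db" "$table" "$mode"
}

build_msck_stmt sales orders DROP
```

Validating the mode up front keeps a typo from reaching the generated SQL file.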

#!/usr/bin/env bash
# Build one SQL file per database with an MSCK REPAIR statement for every table.
dir_path="/tmp/hive_meta_clean"
mkdir -p "${dir_path}"
jdbc_url="jdbc:hive2://kyuubi.test-in.net:10015/"
database_list_file="${dir_path}/database_list.csv"
record_file="${dir_path}/record_list.log"
beeline -u "${jdbc_url}" -n hadoop -p hadoop --showHeader=false --outputformat=dsv -e "show databases;" > "${database_list_file}"
database_list=`cat "${database_list_file}"`

for database in $database_list; do
	echo "$database"
	table_list_file="${dir_path}/${database}_table_list.csv"
	beeline -u "${jdbc_url}" -n hadoop -p hadoop --showHeader=false --outputformat=dsv -e "show tables from ${database};" > "${table_list_file}"

	sql_file="${dir_path}/${database}_sync_partitions.sql"
	# start from an empty file so that reruns do not append duplicate statements
	: > "${sql_file}"
	table_list=`cat "${table_list_file}"`
	count=0
	for str in ${table_list}; do
		# each row is expected to be "|"-separated with the table name in the second field
		table=`echo "$str" | awk '{split($1, arr, "|"); print arr[2]}'`
		echo "MSCK REPAIR TABLE ${database}.${table} SYNC PARTITIONS;" >> "${sql_file}"
		count=$(($count+1))
	done
	echo "${database} table_count: $count" >> "${record_file}"
done

# Run every generated file in the background, then wait for all of them to finish.
for sql_file in "${dir_path}"/*.sql; do
	echo "$sql_file"
	hive -f "${sql_file}" &
done
wait
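The awk call in the loop takes the second `|`-separated field of each row. That matches a Spark SQL engine behind Kyuubi, whose `show tables` returns three columns (database, table name, temporary flag). HiveServer2 itself returns a single column, in which case the second field would be empty. The sketch below handles both layouts; `extract_table_name` is a hypothetical helper and the sample rows are made up.

```shell
# Hypothetical helper: pull the table name out of one dsv row.
# Two or more fields (db|table|isTemporary) -> second field; one field -> the row itself.
extract_table_name() {
    printf '%s\n' "$1" | awk -F'|' 'NF >= 2 { print $2; next } NF == 1 { print $1 }'
}

extract_table_name 'sales|orders|false'
extract_table_name 'orders'
```

Empty rows produce no output, so they can be skipped before a statement is generated.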

Running the cleanup script on its own

export HADOOP_CLIENT_OPTS=" -Xmx3192m"
export HADOOP_HEAPSIZE=1024

dir_path="/tmp/hive_meta_clean"
sql_record_file="${dir_path}/hive_clean_sql_record_list.log"
: > "$sql_record_file"
thread_count=0
for sql_file in "${dir_path}"/*.sql; do
	# launch every file in the background and record it
	echo "$sql_file"
	hive -f "${sql_file}" &
	echo "$sql_file" >> "$sql_record_file"
	thread_count=$(($thread_count+1))
	# after every 5 launches, wait for the whole batch before starting more
	if [ $(( $thread_count % 5 )) -eq 0 ]; then
		echo "wait this batch execute run over..."
		wait
		echo "this batch execute over."
	fi
done
wait
echo "execute all batch over."
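The batch limit can also be factored into a small function. This is only a sketch: `run_in_batches` and `fake_hive` are hypothetical names, and `fake_hive` stands in for `hive -f` so the example runs without a cluster. Every item is launched first, and only then does the function check whether the batch is full, so no item is skipped.

```shell
# Hypothetical helper: run "$cmd <item>" for every item, at most $batch at a time.
run_in_batches() {
    batch="$1"; cmd="$2"; shift 2
    started=0
    for item in "$@"; do
        "$cmd" "$item" &
        started=$((started + 1))
        if [ $((started % batch)) -eq 0 ]; then
            wait    # block until the current batch has finished
        fi
    done
    wait            # pick up the final, possibly partial, batch
}

# Stand-in for `hive -f`: only reports which file would be executed.
fake_hive() { echo "would run: $1"; }

run_in_batches 5 fake_hive a.sql b.sql c.sql d.sql e.sql f.sql
```

Within a batch the jobs run concurrently, so the order of the output lines is not fixed.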

Multi-threaded execution



dir_path="/tmp/hive_meta_clean"
sql_record_file="${dir_path}/hive_clean_sql_record_list.log"
exec_sql_record_file="${dir_path}/exec_hive_clean_sql_record_list.log"

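function exe_hive_sql(){
	local sql="$1"
	local sql_file="$2"
	local record_file="$3"
	echo "$sql $sql_file $record_file begin"
	if hive -e "$sql"; then
		echo "$sql $sql_file" >> "$record_file"
		echo "$sql $sql_file success!"
		# delete the finished statement from $sql_file; the statement is used as a
		# regex here, and concurrent jobs editing the same file can overwrite each
		# other, so treat $record_file as the reliable record of what succeeded
		sed -i "/${sql}/d" "$sql_file"
	else
		echo "$sql $sql_file failed!"
	fi
}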

: > "$sql_record_file"
thread_count=0
for sql_file in "${dir_path}"/*.sql; do
	# read through a redirect rather than "cat | while" so the loop runs in the
	# current shell: thread_count persists and the final wait sees every job
	while read -r sql; do
		[ -z "$sql" ] && continue
		echo "$sql"
		exe_hive_sql "$sql" "$sql_file" "$exec_sql_record_file" < /dev/null &
		echo "exe_hive_sql $sql $sql_file" >> "$sql_record_file"
		thread_count=$(($thread_count+1))
		if [ $(( $thread_count % 5 )) -eq 0 ]; then
			echo "wait this batch execute run over..."
			wait
			echo "this batch execute over."
		fi
	done < "$sql_file"
done
wait
echo "execute all batch over."
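Removing finished statements from a file that several background jobs share is easier to get right when it is done once, after the jobs have finished, rather than from inside each job. The sketch below assumes a success log that holds exactly one completed statement per line. That is a simplification, since the script above logs the statement followed by the file name. `prune_done` and the file names are hypothetical.

```shell
# Hypothetical helper: drop from $1 every line that appears verbatim in $2.
# -F treats patterns as fixed strings (the "." in db.table is not a regex),
# -x requires a whole-line match, -f reads the patterns from the log file.
prune_done() {
    pending="$1"; done_log="$2"
    [ -f "$done_log" ] || return 0
    tmp="${pending}.tmp"
    # grep exits 1 when nothing is left to print; that is not an error here.
    grep -vxFf "$done_log" "$pending" > "$tmp" || true
    mv "$tmp" "$pending"
}

demo_dir="$(mktemp -d)"
printf '%s\n' 'MSCK REPAIR TABLE db.t1 SYNC PARTITIONS;' \
              'MSCK REPAIR TABLE db.t2 SYNC PARTITIONS;' > "$demo_dir/pending.sql"
printf '%s\n' 'MSCK REPAIR TABLE db.t1 SYNC PARTITIONS;' > "$demo_dir/done.log"
prune_done "$demo_dir/pending.sql" "$demo_dir/done.log"
cat "$demo_dir/pending.sql"
rm -rf "$demo_dir"
```

After the call, only the statement for `db.t2` remains in the pending file, so a rerun retries just the statements that have not succeeded yet.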

Running a function in the background in shell

new()
{
     # \$\$ prints a literal "$$"; the unescaped $$ expands to a PID
     echo "func bkground pid by \$\$ is $$"
     while true
     do
           sleep 5
     done
}

echo "current script pid is $$"
new &
while true
do
    sleep 5
done

After starting the script above, ps shows two shell processes running:

[root@localhost ~]# ps -ef|grep test
root 26960 30678 0 12:05 pts/0 00:00:00 sh test.sh
root 26961 26960 0 12:05 pts/0 00:00:00 sh test.sh

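This confirms that a shell function can be run directly in the background, and that the background function runs as a new process.
Looking at the script's output, however, the function is running in the background, yet the process ID it reports is the same as the main process's: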

[root@localhost ..]# sh test.sh 
current script pid is 26960
func bkground pid by $$ is 26960

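The reason is that the new process is forked from the main shell, and $$ is defined to expand to the process ID of the original shell. A subshell keeps that value instead of recomputing it, so the background function reports the parent's ID. The behavior looks a lot like fork(), so I googled it: someone had asked whether shell has anything like the C function fork, and the answer was to use &.

So inside the background function, $$ does not give the function's own process ID. In bash, $BASHPID returns the ID of the process that is actually running the code.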
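The difference is easy to see by printing both values side by side. This sketch is bash-only, since `$BASHPID` is a bash variable (available from bash 4.0), and `show_pids` is a hypothetical function name.

```shell
#!/usr/bin/env bash
# $BASHPID is the PID of the process actually running the code,
# while $$ keeps the PID of the shell that started the script.
show_pids() {
    echo "\$\$=$$ BASHPID=$BASHPID"
}

show_pids        # main shell: both values are equal
show_pids &      # background function: BASHPID differs, $$ does not
wait
```

The second line of output is the one to check when a background function needs its own PID, for example to write a PID file.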

One last tip: to run a block of code in the background without defining a function, wrap it in braces and append &:

{ ....

  ....

} &
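As a small sketch of this pattern, the block below runs a brace group in the background and waits for it. The variable and file names are made up for illustration. Because the group runs in its own subshell, an assignment made inside it never reaches the parent shell.

```shell
# A brace group sent to the background runs in its own subshell.
result="unchanged"
out_file="$(mktemp)"

{
    result="changed in background"
    echo "$result" > "$out_file"
} &
bg_pid=$!        # PID of the background subshell

wait "$bg_pid"   # block until the group has finished
echo "parent sees: $result"
echo "file holds:  $(cat "$out_file")"
rm -f "$out_file"
```

The parent still sees `unchanged`, so a background block has to hand results back through a file (as here) or a pipe rather than through variables.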
