- Today at a glance
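- Environment topology (after enabling security)

| Node | New role | Sample principal |
| --- | --- | --- |
| node1 | KDC (Kerberos) | kadmin/admin@HADOOP.COM |
| node2 | YARN RM + queue management | yarn/node2@HADOOP.COM |
| node3 | Phoenix RS | hbase/node3@HADOOP.COM |

- Key concepts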
3.1 Kerberos flow
Client → AS: obtain a TGT
TGT → TGS: obtain a Service Ticket
Service Ticket → Hadoop NameNode: complete the SASL handshake
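
The three hops above can be sketched as a toy model. This is only an illustration of who issues what to whom: it has no encryption, timestamps, or session keys, and the `Ticket` class, the function names, and the `alice@HADOOP.COM` principal are hypothetical, not part of any Kerberos library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    client: str   # principal the ticket was issued to
    service: str  # service the ticket is valid for
    issuer: str   # "AS" or "TGS" in this toy model


def as_issue_tgt(client: str) -> Ticket:
    """Step 1: the Authentication Service hands the client a TGT."""
    return Ticket(client=client, service="krbtgt/HADOOP.COM", issuer="AS")


def tgs_issue_service_ticket(tgt: Ticket, service: str) -> Ticket:
    """Step 2: the TGS exchanges a valid TGT for a service ticket."""
    if tgt.issuer != "AS" or not tgt.service.startswith("krbtgt/"):
        raise PermissionError("not a TGT")
    return Ticket(client=tgt.client, service=service, issuer="TGS")


def namenode_accept(ticket: Ticket, own_principal: str) -> bool:
    """Step 3: the service accepts only tickets issued for its own principal."""
    return ticket.issuer == "TGS" and ticket.service == own_principal


tgt = as_issue_tgt("alice@HADOOP.COM")  # hypothetical user principal
st = tgs_issue_service_ticket(tgt, "hdfs/node1@HADOOP.COM")
print(namenode_accept(st, "hdfs/node1@HADOOP.COM"))
```

The point of the sketch is the ordering: a TGT presented directly to the NameNode is rejected, and only a ticket minted by the TGS for that exact service principal passes.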
3.2 YARN Capacity Scheduler
Hierarchical queues: root → prod/dev → sub-queues spark/flink
Preemption: triggered when prod has less than 10 % free resources and dev is using more than 20 %
User limit: yarn.scheduler.capacity.root.dev.user-limit-factor=0.5
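
The preemption rule and the user limit can be written down as two small functions. This is a sketch of the rule exactly as stated in these notes, not of the scheduler's internal algorithm (which works from each queue's guaranteed capacity). The function names and the 40 % queue capacity in the example are assumptions for illustration.

```python
def should_preempt(prod_free_pct: float, dev_used_pct: float,
                   free_threshold: float = 10.0,
                   dev_threshold: float = 20.0) -> bool:
    """Rule from the notes: prod free < 10 % and dev usage > 20 %."""
    return prod_free_pct < free_threshold and dev_used_pct > dev_threshold


def max_share_per_user(queue_capacity_pct: float, user_limit_factor: float) -> float:
    """user-limit-factor is a multiple of the queue's configured capacity
    that a single user may acquire."""
    return queue_capacity_pct * user_limit_factor


print(should_preempt(prod_free_pct=8.0, dev_used_pct=35.0))
print(should_preempt(prod_free_pct=8.0, dev_used_pct=15.0))
# Hypothetical: dev is configured with 40 % of the cluster.
print(max_share_per_user(queue_capacity_pct=40.0, user_limit_factor=0.5))
```

With a factor of 0.5, a single user in dev is capped at half of the queue's configured capacity, which is what keeps one job from filling the whole dev queue.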
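3.3 Phoenix indexes
Global index = a separate table; covered columns (INCLUDE) avoid lookups back to the data table
Local index = stored in the same Region as the data; efficient for prefix filters
Write amplification: every write goes to both the WAL and the index table; phoenix.index.wal.disabled=true can reduce this (at the cost of tolerating data loss on a crash)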
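
Why covered columns matter can be shown with plain dictionaries. This is a toy model, not Phoenix code: the `orders` rows, the `status` column, and both function names are made up, and a real Phoenix planner may fall back to scanning the data table instead of doing per-row lookups when a column is not covered.

```python
# Data table keyed by primary key; the rows and the "status" column are hypothetical.
orders = {
    "o1": {"user_id": "u1234", "amount": 30, "status": "paid"},
    "o2": {"user_id": "u5678", "amount": 12, "status": "new"},
    "o3": {"user_id": "u1234", "amount": 7, "status": "paid"},
}


def build_global_index(table, key_col, include):
    """Index 'table': key_col value -> list of (primary key, covered columns)."""
    index = {}
    for pk, row in table.items():
        covered = {c: row[c] for c in include}
        index.setdefault(row[key_col], []).append((pk, covered))
    return index


def query(index, table, key, columns):
    """Return (rows, lookups); lookups counts reads against the data table."""
    rows, lookups = [], 0
    for pk, covered in index.get(key, []):
        if all(c in covered for c in columns):
            rows.append({c: covered[c] for c in columns})
        else:
            lookups += 1  # column not in the index: go back to the data table
            rows.append({c: table[pk][c] for c in columns})
    return rows, lookups


idx_user = build_global_index(orders, "user_id", include=["amount"])
print(query(idx_user, orders, "u1234", ["amount"]))
print(query(idx_user, orders, "u1234", ["amount", "status"]))
```

The first query is answered from the index alone (zero lookups); the second needs one data-table read per matching row because `status` is not covered.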
3.4 Data warehouse zipper table (拉链表)
Close a chain: end_date = current day - 1; open a chain: start_date = current day
Deduplicate in Hive SQL with row_number() over (partition by user_id order by ts)
- Hands-on log
4.1 Kerberos installation
bash
# On node1
yum -y install krb5-server krb5-workstation
vim /var/kerberos/krb5kdc/kdc.conf # [realms] section: HADOOP.COM
kdb5_util create -s
systemctl enable --now krb5kdc kadmin
# Create the service principal and export its keytab
kadmin.local -q "addprinc -randkey hdfs/node1@HADOOP.COM"
kadmin.local -q "xst -k /etc/security/keytabs/hdfs.service.keytab hdfs/node1@HADOOP.COM"
# Distribute the keytab to the other nodes and chmod 400 it
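
After distributing a keytab it is worth confirming that the chmod 400 actually took effect. A minimal check, assuming a POSIX filesystem; `keytab_mode_ok` is a hypothetical helper, and the demonstration uses a throwaway file in a temp directory rather than a real keytab.

```python
import os
import stat
import tempfile


def keytab_mode_ok(path: str) -> bool:
    """True if the permission bits are exactly 0o400 (owner read-only)."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o400


# Demonstration on a placeholder file standing in for a keytab.
with tempfile.TemporaryDirectory() as tmp:
    fake_keytab = os.path.join(tmp, "hdfs.service.keytab")
    with open(fake_keytab, "wb") as f:
        f.write(b"placeholder")
    os.chmod(fake_keytab, 0o644)
    before = keytab_mode_ok(fake_keytab)
    os.chmod(fake_keytab, 0o400)
    after = keytab_mode_ok(fake_keytab)

print(before, after)
```

On a real node the same function could be pointed at /etc/security/keytabs/hdfs.service.keytab.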
4.2 Enabling security in Hadoop
xml
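-- Write 10 million rows
upsert into orders select ...
-- Query without an index
select * from orders where user_id='u1234'; -- 14.2 s, full table scan
-- Create a global index
CREATE INDEX idx_user ON ORDERS(user_id) INCLUDE(amount);
-- Same query: 1.1 s; EXPLAIN shows RANGE SCAN OVER idx_user
Write test with 100,000 rows:
– without the index: 6.8 s
– with the index: 7.3 s
Throughput drops by ≈ 7 %, which is acceptable.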
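
The ≈ 7 % figure follows directly from the two timings. A quick check of the arithmetic, using only the numbers recorded above (variable names are illustrative):

```python
rows = 100_000
t_without, t_with = 6.8, 7.3  # seconds, from the write test above

tp_without = rows / t_without  # rows per second without the index
tp_with = rows / t_with        # rows per second with the index
drop_pct = (1 - tp_with / tp_without) * 100

print(f"{tp_without:.0f} rows/s -> {tp_with:.0f} rows/s, drop {drop_pct:.1f} %")
```

Measured as throughput the drop is about 6.8 %; measured as extra wall-clock time (7.3 / 6.8) it is about 7.4 %. Both round to the 7 % quoted.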
4.5 Data warehouse layering
ODS layer (raw)
CREATE EXTERNAL TABLE ods.user_log(
user_id STRING,
event_type STRING,
ts BIGINT,
json STRING
) STORED AS TEXTFILE
LOCATION '/data/ods/user_log/';
load data inpath '/tmp/user_log_20250924.txt' into table ods.user_log;
DWD layer (cleansed, zipper table)
sql
WITH tmp AS (
SELECT *, row_number() over (partition by user_id order by ts desc) rn
FROM ods.user_log
)
INSERT OVERWRITE TABLE dwd.user_log_chain
SELECT user_id, event_type, ts, '2025-09-24' start_date, '9999-12-31' end_date
FROM tmp WHERE rn=1;
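
The statement above only opens chains for the current day. The close/open rule from section 3.4 (end_date = current day - 1 for the old row, start_date = current day for the new one) can be sketched in Python as follows. This is an in-memory illustration under assumed inputs, not the Hive job itself; `apply_zipper`, `OPEN_END`, and the sample rows are hypothetical.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # sentinel end_date for the currently open row


def apply_zipper(chain, changes, today):
    """chain: rows with user_id, event_type, start_date, end_date.
    changes: {user_id: latest event_type seen today}."""
    yesterday = (today - timedelta(days=1)).isoformat()
    result = []
    for row in chain:
        new_value = changes.get(row["user_id"])
        is_open = row["end_date"] == OPEN_END
        if is_open and new_value is not None and new_value != row["event_type"]:
            result.append({**row, "end_date": yesterday})  # close the old chain
        else:
            result.append(row)
    still_open = {r["user_id"] for r in result if r["end_date"] == OPEN_END}
    for user_id, value in changes.items():
        if user_id not in still_open:  # changed or brand-new user: open a chain
            result.append({"user_id": user_id, "event_type": value,
                           "start_date": today.isoformat(), "end_date": OPEN_END})
    return result


chain = [{"user_id": "u1", "event_type": "login",
          "start_date": "2025-09-20", "end_date": OPEN_END}]
updated = apply_zipper(chain, {"u1": "purchase", "u2": "login"}, date(2025, 9, 24))
for row in updated:
    print(row)
```

Here u1's old row is closed on 2025-09-23 and a new open row starts on 2025-09-24, while u2 simply gets its first open row. A user whose value did not change keeps the existing open row untouched.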
DWS layer (daily-active wide table)
-- dt is the partition column, so it is declared in PARTITIONED BY, not in the column list
CREATE TABLE dws.user_daily (
user_id STRING,
first_event STRING,
last_event STRING,
event_cnt INT
) PARTITIONED BY (dt STRING)
STORED AS ORC;
-- With a static partition, the SELECT list omits dt
INSERT OVERWRITE TABLE dws.user_daily PARTITION(dt='2025-09-24')
SELECT user_id, min(event_type), max(event_type), count(*)
FROM dwd.user_log_chain
WHERE start_date<='2025-09-24' AND end_date>='2025-09-24'
GROUP BY user_id;
Result: 1 GB raw → 350 MB wide table.
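4.6 Differential snapshots & rolling deletion
bash
# Take a differential snapshot on top of the last snapshot
$VMRUN snapshot node1.vmx "diff_$(date +%F)" -memory false -quiesce true
# Find the most recent full snapshot
LAST=$(find /backup -name "full_*.vmsn" -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
# Delete snapshot files older than 7 days
find /backup -name "*.vmsn" -mtime +7 -delete
Space savings:
full_20Sep 37 GB
diff_25Sep 14 GB (additions and changes only)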
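
The 7-day rolling deletion and the space saving can both be checked with a few lines of Python. This sketch mirrors the `-mtime +7` rule (age in whole days, fraction discarded, strictly greater than 7) on an in-memory listing; the file names and timestamps in `snapshots` are hypothetical, and only the 37 GB and 14 GB sizes come from the notes.

```python
from datetime import datetime, timedelta


def expired(snapshots, now, keep_days=7):
    """snapshots: {name: mtime}. Mimics find -mtime +N: whole days of age > N."""
    out = []
    for name, mtime in snapshots.items():
        age_days = (now - mtime) // timedelta(days=1)  # fraction discarded
        if age_days > keep_days:
            out.append(name)
    return sorted(out)


now = datetime(2025, 9, 25, 12, 0)
snapshots = {  # hypothetical listing of /backup
    "full_15Sep.vmsn": datetime(2025, 9, 15, 2, 0),
    "diff_17Sep.vmsn": datetime(2025, 9, 17, 18, 0),
    "full_20Sep.vmsn": datetime(2025, 9, 20, 2, 0),
}
print(expired(snapshots, now))

full_gb, diff_gb = 37, 14  # sizes recorded above
saved_pct = (1 - diff_gb / full_gb) * 100
print(f"differential is {saved_pct:.1f} % smaller than the full snapshot")
```

Note that the file from 17 Sep is 7 days and 18 hours old, which counts as 7 whole days and is therefore kept. The same rule applies to full snapshots, so a full snapshot older than the window is deleted along with everything else.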