Hive Tuning: MapJoin

2024-05-13 01:24:04

1. MapJoin (automatic conversion to a map join is on by default; hive.auto.convert.join has defaulted to true since Hive 0.11)

select /*+mapjoin(b)*/ a.xx,b.xxx from a left outer join b on a.id=b.id
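
One caveat about the hint form above: the Hive configuration documentation lists hive.ignore.mapjoin.hint with a default of true, in which case the MAPJOIN hint is ignored and the automatic conversion decides instead. A minimal sketch of making the hint take effect, assuming that default applies to your installation and reusing the placeholder names a, b, xx and xxx from the query above:

```sql
-- Assumption: hive.ignore.mapjoin.hint is true (the documented default),
-- so the hint only takes effect after switching it off for this session.
set hive.ignore.mapjoin.hint = false;

-- a, b, xx and xxx are the placeholder names from the query above.
-- In a LEFT OUTER JOIN the left table (a) is streamed, so b is the table to hash.
select /*+ MAPJOIN(b) */ a.xx, b.xxx
from a
left outer join b on a.id = b.id;
```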

2. Create the tables


CREATE EXTERNAL TABLE IF NOT EXISTS learn4.student1(
id STRING COMMENT "student ID",
name STRING COMMENT "student name",
age int COMMENT "age",
gender STRING COMMENT "gender",
clazz STRING COMMENT "class"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

load data local inpath "/usr/local/soft/hive-3.1.2/data/students.txt" INTO TABLE learn4.student1;


CREATE EXTERNAL TABLE IF NOT EXISTS learn4.score1(
id STRING COMMENT "student ID",
subject_id STRING COMMENT "subject ID",
score int COMMENT "score"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
load data local inpath "/usr/local/soft/hive-3.1.2/data/score.txt" INTO TABLE learn4.score1;
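
Because the map-join decision depends on how large Hive believes each table is, it can help to check the table sizes after loading. A minimal sketch using the two tables created above; whether the statistics were already gathered at load time depends on your configuration, so treat this as a check rather than a required step:

```sql
-- Gather basic table statistics (row count, total size) for both tables.
ANALYZE TABLE learn4.student1 COMPUTE STATISTICS;
ANALYZE TABLE learn4.score1 COMPUTE STATISTICS;

-- Inspect the result; numRows and totalSize appear under "Table Parameters".
DESCRIBE FORMATTED learn4.student1;
DESCRIBE FORMATTED learn4.score1;
```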

3. Create a table from a join query (CTAS)


CREATE TABLE mapJonTest AS SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;

Time taken to create the table:

INFO : Total MapReduce CPU Time Spent: 2 seconds 510 msec
INFO : Completed executing command(queryId=root_20240511090524_3a34bdda-4247-4af4-b686-d681856af110); Time taken: 19.199 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
No rows affected (20.707 seconds)

4. Show the execution plan with explain

explain SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;


For more detail, use explain extended:

explain extended SELECT max.name,min.subject_id,min.score FROM learn4.student1 max JOIN learn4.score1 min ON max.id = min.id;

| STAGE DEPENDENCIES: | -- dependencies between the execution stages
| Stage-4 is a root stage |   Stage-4 is the root stage, i.e. the stage that runs first
| Stage-3 depends on stages: Stage-4 |   Stage-3 depends on Stage-4
| Stage-0 depends on stages: Stage-3 |   Stage-0 depends on Stage-3, which in turn depends on Stage-4
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| $hdt$_1:min |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| $hdt$_1:min |
| TableScan |   TableScan: the table being scanned
| alias: min |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: string), subject_id (type: string), score (type: int) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 1385400 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 _col0 (type: string) |
| 1 _col0 (type: string) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: max |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: string), name (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 388080000 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator | -- the map join is chosen by default, without any hint or extra setting
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 _col0 (type: string) |
| 1 _col0 (type: string) |
| outputColumnNames: _col1, _col3, _col4 |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col1 (type: string), _col3 (type: string), _col4 (type: int) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 426888009 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Execution mode: vectorized |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
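
To see what the automatic conversion changes, the same query can be explained with the conversion switched off. This is a sketch only; no output is reproduced here because it was not part of the original run. On the MapReduce engine one would expect the Map Reduce Local Work stage and the Map Join Operator to disappear, replaced by a single Map Reduce stage whose Reduce Operator Tree contains a Join Operator.

```sql
-- Turn automatic map-join conversion off for this session only.
set hive.auto.convert.join = false;

explain
SELECT max.name, min.subject_id, min.score
FROM learn4.student1 max
JOIN learn4.score1 min ON max.id = min.id;

-- Restore the default afterwards.
set hive.auto.convert.join = true;
```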

5. MapJoin-related settings:

1) Enable or disable automatic conversion to MapJoin

set hive.auto.convert.join = true; -- enables automatic conversion (this is the default)

set hive.auto.convert.join = false; -- disables automatic conversion

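2) Size threshold separating small tables from large ones (by default, a table below 25,000,000 bytes, roughly 25 MB, is treated as small):

set hive.mapjoin.smalltable.filesize = 25000000; -- the default

set hive.mapjoin.smalltable.filesize = 10000000; -- lowers the threshold to about 10 MB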
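
Two related settings are worth knowing about, although the original post does not cover them: hive.auto.convert.join.noconditionaltask and hive.auto.convert.join.noconditionaltask.size. The Hive configuration documentation lists their defaults as true and 10000000. When the small side fits under that size, Hive converts the join directly into a map join without generating a conditional backup task, which is consistent with the plan in section 4, where Stage-3 depends directly on the local-work Stage-4. A minimal sketch for inspecting and setting these values in a session:

```sql
-- Print the current values (set <name>; without a value only displays it).
set hive.auto.convert.join;
set hive.mapjoin.smalltable.filesize;

-- Related settings; the values shown are the documented defaults.
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000000;
```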
