博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Hive 底层执行流程
阅读量:6620 次
发布时间:2019-06-25

本文共 15508 字,大约阅读时间需要 51 分钟。

Hive并不是简简单单写SQL,因为我们要进行层层调优,如果连Hive的内部运行机制都搞不清,那么hive对我们来说仅仅是一个黑箱,高效率的调优无从谈起,所以我们很有必要了解下Hive是如何将SQL转化为MapReduce任务的呢?

Hive 底层执行流程

我们以下面这个SQL为例

FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

整个编译过程分为六个阶段:

1.Antlr定义SQL的语法规则,完成SQL词法,语法解析,将SQL
转化为抽象语法树AST Tree
,分别是Antlr对SQL编译后自动生成的词法解析和语法解析类,在这两个类中进行复杂的解析。
例子中的AST tree为

ABSTRACT SYNTAX TREE:(TOK_QUERY (TOK_FROM (TOK_TABREF src))(TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4))))(TOK_GROUPBY (TOK_COLREF src key))))

2.遍历AST Tree,抽象出查询的基本组成单元QueryBlock

AST Tree 仍然非常复杂,不够结构化,不方便直接翻译为 MapReduce 程序, AST
Tree 转化为 (QB)就是将 SQL 进一部抽象和结构化。
AST Tree 生成 QueryBlock 的过程是一个递归的过程,先序遍历 AST Tree ,遇到不
同的Token 节点(理解为特殊标记),保存到相应的属性中,主要包含以下几个过程

TOK_QUERY => 创建 QB 对象,循环递归子节点TOK_FROM => 将表名语法部分保存到 QB 对象的 aliasToTabs 等属性中TOK_INSERT => 循环递归子节点TOK_DESTINATION => 将输出目标的语法部分保存在 QBParseInfo 对象的nameToDest 属性中TOK_SELECT => 分别将查询表达式的语法部分保存在 destToSelExpr 、destToAggregationExprs 、 destToDistinctFuncExprs 三个属性中TOK_WHERE => 将 Where 部分的语法保存在 QBParseInfo 对象的destToWhereExpr 属性中

3.遍历QueryBlock,翻译为执行操作树OperatorTree

Hive 最终生成的 MapReduce 任务, Map 阶段和 Reduce 阶段均由 Operator Tree
组成。逻辑操作符,就是在 Map 阶段或者 Reduce 阶段完成单一特定的操作。
基本的操作符包括
TableScanOperator、SelectOperator、FilterOperator、JoinOperator、GroupByOperator、ReduceSinkOperator
QueryBlock 生成 Operator Tree 就是遍历上一个过程中生成的 QB 和 QBParseInfo
对象的保存
语法的属性,包含如下几个步骤:

QB#aliasToSubq => 有子查询,递归调用QB#aliasToTabs => TableScanOperatorQBParseInfo#joinExpr => QBJoinTree => ReduceSinkOperator + JoinOperatorQBParseInfo#destToWhereExpr => FilterOperatorQBParseInfo#destToGroupby => ReduceSinkOperator +GroupByOperatorQBParseInfo#destToOrderby => ReduceSinkOperator + ExtractOperator

由于 Join/GroupBy/OrderBy 均需要在 Reduce 阶段完成,所以在生成相应操作的Operator 之前都会先生成一个 ReduceSinkOperator ,将字段组合并序列化为 Reduce Key/value,Partition Key

SQL例子翻译成OperatorTree

STAGE PLANS:  Stage: Stage-1    Map Reduce      Alias -> Map Operator Tree:        src            Reduce Output Operator              key expressions:                    expr: key                    type: string              sort order: +              Map-reduce partition columns:                    expr: rand()                    type: double              tag: -1              value expressions:                    expr: substr(value, 4)                    type: string      Reduce Operator Tree:        Group By Operator          aggregations:                expr: sum(UDFToDouble(VALUE.0))          keys:                expr: KEY.0                type: string          mode: partial1          File Output Operator            compressed: false            table:                input format: org.apache.hadoop.mapred.SequenceFileInputFormat                output format: org.apache.hadoop.mapred.SequenceFileOutputFormat                name: binary_table  Stage: Stage-2    Map Reduce      Alias -> Map Operator Tree:        /tmp/hive-zshao/67494501/106593589.10001          Reduce Output Operator            key expressions:                  expr: 0                  type: string            sort order: +            Map-reduce partition columns:                  expr: 0                  type: string            tag: -1            value expressions:                  expr: 1                  type: double      Reduce Operator Tree:        Group By Operator          aggregations:                expr: sum(VALUE.0)          keys:                expr: KEY.0                type: string          mode: final          Select Operator            expressions:                  expr: 0                  type: string                  expr: 1                  type: double            Select Operator              expressions:                    expr: UDFToInteger(0)                    type: int                    expr: 1                    type: double              File Output Operator                compressed: false                table:                    input format: org.apache.hadoop.mapred.TextInputFormat                    output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat                    serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe                    name: dest_g1  Stage: Stage-0    Move Operator      tables:            replace: true            table:                input format: org.apache.hadoop.mapred.TextInputFormat                output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat                serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe                name: dest_g1

4.Logical Optimizer进行OperatorTree变换,合并不必要的

使用,减少shuffle数据量。大部分逻辑层优化器通过变换 OperatorTree ,合并操作符,达到减少 MapReduce Job ,减少 shuffle 数据量的目的。
5.遍历OperatorTree,翻译为Task tree
OperatorTree 转化为 Task tree的过程分为下面几个阶段

  • 对输出表生成 MoveTask
  • 从 OperatorTree 的其中一个根节点向下深度优先遍历
  • ReduceSinkOperator 标示 Map/Reduce 的界限,多个 Job 间的界限
  • 遍历其他根节点,遇过碰到 JoinOperator 合并 MapReduceTask
  • 生成 StatTask 更新元数据
  • 剪断 Map 与 Reduce 间的 Operator 的关系

6. 对Task tree优化,生成最终的执行计划

7、执行

以上就是HiveSQL的底层执行流程

打印SQL运行相关信息

我们在开发中,可以使用下面这个语句来打印SQL语句的相关运行信息

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

注:我的版本是hive-1.1.0-cdh5.7.0,所以只可用三个可选属性,如果您版本比较高的话,可以去查阅对应属性

下面我对三种可选属性进行简单介绍

EXTENDED

EXTENDED:打印SQL解析成AST&Operator Tree最全面的信息

hive (g6_hadoop)> explain EXTENDED insert  OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;OKExplainABSTRACT SYNTAX TREE:TOK_QUERY   TOK_FROM      TOK_TABREF         TOK_TABNAME            g6_access_orc   TOK_INSERT      TOK_DESTINATION         TOK_TAB            TOK_TABNAME               g6_access_orc_explain      TOK_SELECT         TOK_SELEXPR            TOK_TABLE_OR_COL               domain         TOK_SELEXPR            TOK_FUNCTION               count               1            num      TOK_WHERE         >            TOK_TABLE_OR_COL               traffic            '99900'      TOK_GROUPBY         TOK_TABLE_OR_COL            domainSTAGE DEPENDENCIES:  Stage-1 is a root stage  Stage-0 depends on stages: Stage-1  Stage-2 depends on stages: Stage-0STAGE PLANS:  Stage: Stage-1    Map Reduce      Map Operator Tree:          TableScan            alias: g6_access_orc            Statistics: Num rows: 260326 Data size: 188215698 Basic stats: COMPLETE Column stats: NONE            GatherStats: false            Filter Operator              isSamplingPred: false              predicate: (traffic > 99900) (type: boolean)              Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE              Select Operator                expressions: domain (type: string)                outputColumnNames: domain                Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE                Group By Operator                  aggregations: count(1)                  keys: domain (type: string)                  mode: hash                  outputColumnNames: _col0, _col1                  Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE                  Reduce Output Operator                    key expressions: _col0 (type: string)                    sort order: +                    Map-reduce partition columns: _col0 (type: string)                    Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE                    tag: -1                    value expressions: _col1 (type: bigint)                    auto parallelism: false      Path -> Alias:        hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc [g6_access_orc]      Path -> Partition:        hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc           Partition            base file name: g6_access_orc            input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat            output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat            properties:              COLUMN_STATS_ACCURATE true              bucket_count -1              columns cdn,region,level,time,ip,domain,url,traffic              columns.comments               columns.types string:string:string:string:string:string:string:bigint              field.delim                 file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat              file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat              location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc              name g6_hadoop.g6_access_orc              numFiles 1              numRows 260326              rawDataSize 188215698              serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}              serialization.format                serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde              totalSize 8567798              transient_lastDdlTime 1557676635            serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat              properties:                COLUMN_STATS_ACCURATE true                bucket_count -1                columns cdn,region,level,time,ip,domain,url,traffic                columns.comments                 columns.types string:string:string:string:string:string:string:bigint                field.delim                     file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc                name g6_hadoop.g6_access_orc                numFiles 1                numRows 260326                rawDataSize 188215698                serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}                serialization.format                    serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde                totalSize 8567798                transient_lastDdlTime 1557676635              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde              name: g6_hadoop.g6_access_orc            name: g6_hadoop.g6_access_orc      Truncated Path -> Alias:        /g6_hadoop.db/g6_access_orc [g6_access_orc]      Needs Tagging: false      Reduce Operator Tree:        Group By Operator          aggregations: count(VALUE._col0)          keys: KEY._col0 (type: string)          mode: mergepartial          outputColumnNames: _col0, _col1          Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE          File Output Operator            compressed: false            GlobalTableId: 1            directory: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000            NumFilesPerFileSink: 1            Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE            Stats Publishing Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/            table:                input format: org.apache.hadoop.mapred.TextInputFormat                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                properties:                  COLUMN_STATS_ACCURATE true                  bucket_count -1                  columns domain,num                  columns.comments                   columns.types string:bigint                  field.delim |                  file.inputformat org.apache.hadoop.mapred.TextInputFormat                  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                  location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain                  name g6_hadoop.g6_access_orc_explain                  numFiles 1                  numRows 7                  rawDataSize 149                  serialization.ddl struct g6_access_orc_explain { string domain, i64 num}                  serialization.format |                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                  totalSize 156                  transient_lastDdlTime 1558661108                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                name: g6_hadoop.g6_access_orc_explain            TotalFiles: 1            GatherStats: true            MultiFileSpray: false  Stage: Stage-0    Move Operator      tables:          replace: true          source: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000          table:              input format: org.apache.hadoop.mapred.TextInputFormat              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat              properties:                COLUMN_STATS_ACCURATE true                bucket_count -1                columns domain,num                columns.comments                 columns.types string:bigint                field.delim |                file.inputformat org.apache.hadoop.mapred.TextInputFormat                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain                name g6_hadoop.g6_access_orc_explain                numFiles 1                numRows 7                rawDataSize 149                serialization.ddl struct g6_access_orc_explain { string domain, i64 num}                serialization.format |                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                totalSize 156                transient_lastDdlTime 1558661108              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe              name: g6_hadoop.g6_access_orc_explain  Stage: Stage-2    Stats-Aggr Operator      Stats Aggregation Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/Time taken: 1.359 seconds, Fetched: 198 row(s)

AUTHORIZATION

AUTHORIZATION :打印SQL运行相关权限

hive (g6_hadoop)> explain AUTHORIZATION insert  OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;OKExplainINPUTS:   g6_hadoop@g6_access_orcOUTPUTS:   g6_hadoop@g6_access_orc_explainCURRENT_USER:   hadoopOPERATION:   QUERYAUTHORIZATION_FAILURES:   No privilege 'Update' found for outputs { database:g6_hadoop, table:g6_access_orc_explain}  No privilege 'Select' found for inputs { database:g6_hadoop, table:g6_access_orc, columnName:domain}Time taken: 0.599 seconds, Fetched: 11 row(s)

DEPENDENCY

DEPENDENCY:打印SQL输入表的相关信息

hive (g6_hadoop)> explain DEPENDENCY insert  OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;Explain{"input_partitions":[],"input_tables":[{"tablename":"g6_hadoop@g6_access_orc","tabletype":"MANAGED_TABLE"}]}Time taken: 0.135 seconds, Fetched: 1 row(s)

转载于:https://blog.51cto.com/14309075/2399484

你可能感兴趣的文章
Git常用命令
查看>>
jQuery属性选择器,选择带[]的属性
查看>>
javascript操作cookie函数写法
查看>>
扫描pdf转换成excel软件
查看>>
一看你就懂,超详细java中的ClassLoader详解
查看>>
一个21点游戏的设计与实现
查看>>
DICOM医学图像处理:深入剖析Orthanc的SQLite,了解WADO & RESTful API
查看>>
Android SO库加载流程
查看>>
关于企业邮箱群发邮件的一些错误观点
查看>>
Eclipse上GIT插件EGIT使用手册之十一_Fetch和Rebase
查看>>
在 IntelliJ IDEA中怎样设置把一个工程当lib给另一个工程用
查看>>
杭州链家房产信息分析
查看>>
室内闪光灯初次尝试
查看>>
Java#HttpServletRequest
查看>>
secureCRT sz,rz的使用
查看>>
外观模式
查看>>
运维自动化工具ansible学习笔记
查看>>
Cookie利用神器之 CookieHacker
查看>>
flyway 使用
查看>>
linux性能
查看>>