Preface
In Hive SQL, a UDTF (user-defined table-generating function) turns one input row into multiple output rows. If you select a UDTF alongside other columns in the same SELECT clause, the query fails with the following error:
org.apache.hadoop.hive.ql.parse.SemanticException: UDTF's are not supported outside the SELECT clause, nor nested in expressions
The lateral view syntax described below is the standard way around this restriction.
I. Introduction
explode function: expands one row of data into multiple rows (one in, many out)
lateral view keyword: joins the exploded rows back to the other columns of the original table
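As a language-agnostic illustration (plain Python, not Spark), the combination of split, explode, and lateral view boils down to "split the string, then emit one (movie, actor) pair per element, keeping the original column alongside each element":

```python
# Plain-Python sketch of what explode + lateral view do conceptually.
# Each input row (movie_name, actors) yields one output row per actor.
rows = [
    ("movie1", "actor1,actor2"),
    ("movie2", "actor3,actor4"),
]

exploded = [
    (movie_name, actor)             # lateral view: keep the original column
    for movie_name, actors in rows
    for actor in actors.split(",")  # explode(split(actors, ','))
]

print(exploded)
# [('movie1', 'actor1'), ('movie1', 'actor2'), ('movie2', 'actor3'), ('movie2', 'actor4')]
```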
II. Usage
1. Data
import spark.implicits._

val data = Seq(
  ("电影1", "演员1,演员2"),
  ("电影2", "演员3,演员4")
)
val df = data.toDF("movie_name", "actors")
// Register a temporary view so the DataFrame can be queried by name in SQL
df.createOrReplaceTempView("movies")

val resultDF = spark.sql(
  """
    |select movie_name, el_actors as actor
    |from movies lateral view explode(split(`actors`, ',')) t as el_actors
    |""".stripMargin)
resultDF.show()
// Result:
+----------+-----+
|movie_name|actor|
+----------+-----+
|     电影1|演员1|
|     电影1|演员2|
|     电影2|演员3|
|     电影2|演员4|
+----------+-----+
The split function splits the actors field on commas; combined with explode and lateral view, this pattern is very handy.
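One detail worth watching (shown here in plain Python rather than SQL): the delimiter passed to split must match the stored data exactly. The sample data above has no space after the comma, so the query splits on ','; the examples further down split on ', ', which only works if the actors are actually stored with a space. A mismatched delimiter leaves stray whitespace in the exploded values:

```python
# The delimiter must match the stored format exactly.
assert "actor1,actor2".split(",") == ["actor1", "actor2"]

# Splitting "actor1, actor2" on "," alone leaves a leading space:
assert "actor1, actor2".split(",") == ["actor1", " actor2"]

# Splitting on ", " fixes that, but only when the space really exists:
assert "actor1, actor2".split(", ") == ["actor1", "actor2"]
```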
2. More examples
// Average rating per actor
// (assumes the movies table also has a rating column, and actors separated by ", ")
val actorAverageScoreData = spark.sql(
  """
    |select actor as name, round(avg(cast(rating as double)), 1) as value
    |from (
    |  select element as actor, rating
    |  from movies lateral view explode(split(`actors`, ', ')) t as element
    |) subquery
    |group by actor order by value desc
    |""".stripMargin)
// Number of appearances per actor
val averageData = spark.sql(
  """
    |select actor as name, count(*) as value
    |from (
    |  select element as actor
    |  from movies lateral view explode(split(`actors`, ', ')) t as element
    |) subquery
    |group by actor order by value desc
    |""".stripMargin)
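The counting query above can again be sketched in plain Python, with Counter standing in for group by + count(*). The sample rows here are hypothetical (an actor appearing in two movies, so the counts differ):

```python
from collections import Counter

# Hypothetical sample data: actor1 appears in both movies.
rows = [
    ("movie1", "actor1, actor2"),
    ("movie2", "actor1, actor3"),
]

# explode + group by actor + count(*), ordered by count descending
counts = Counter(
    actor
    for _, actors in rows
    for actor in actors.split(", ")
)
result = sorted(counts.items(), key=lambda kv: -kv[1])

print(result)
# [('actor1', 2), ('actor2', 1), ('actor3', 1)]
```

The rating-average variant is analogous: group the exploded (actor, rating) pairs by actor and average instead of count.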
III. Summary
I expect to use this pattern a lot, so I am writing it down for reference.