1. HDFS Read and Write Flows
1.1 HDFS Read Flow
RPC: Remote Procedure Call
- The HDFS client calls FileSystem.open(filePath), communicating with the NameNode (NN) over RPC; the NN returns the file's block list, and the client receives an FSDataInputStream object. If the block list is long, it is returned in several batches, and the steps below run once per batch.
- The client calls FSDataInputStream's read method and reads each block from the nearest DataNode (DN) holding it. When a read finishes, the data is verified; if the check passes, the connection to that DN is closed, and if it fails, the client reads from the next DN in the list, and so on, until the entire block list has been read.
- The client calls FSDataInputStream's close method to close the stream.
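The read path above can be sketched as a small simulation in plain Python (no Hadoop dependency; the function names and the `fetch` callback are illustrative stand-ins for the real DN read plus checksum verification):

```python
def read_block(block_id, datanodes, fetch):
    """Try each DataNode for a block in order of proximity; fall back on failure.

    `fetch(dn, block_id)` returns the block's bytes or raises IOError,
    standing in for the real DN read + checksum check.
    """
    for dn in datanodes:                 # nearest DN first
        try:
            return fetch(dn, block_id)   # read + checksum OK -> done with this block
        except IOError:
            continue                     # read/checksum failed -> try the next DN
    raise IOError(f"block {block_id} unreadable on all DataNodes")

def read_file(block_list, fetch):
    """block_list: [(block_id, [dn1, dn2, ...]), ...] as returned by the NN."""
    return b"".join(read_block(b, dns, fetch) for b, dns in block_list)
```

Each block is retried down its DN list independently, mirroring the per-block failover described above.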
1.2 HDFS Write Flow
- The HDFS client calls FileSystem.create(filePath), communicating with the NN over RPC. The NN checks whether the parent directory exists and whether the client has permission to create the file. If both checks pass, it creates the file entry, which at this point holds no data and has no blocks associated with it. Based on the file size, block size, and replication factor, the NN then computes how many blocks to write and which DN nodes should hold them, and returns this information to the client in an FSDataOutputStream object.
- The client calls FSDataOutputStream's write method. Following the NN's plan, the first replica of the first block is written to DN1, which forwards it to DN2, which forwards it to DN3. Once all three replicas are written, DN3 returns an ack (acknowledgment) to DN2; DN2, on receiving it, acks DN1; and DN1 acks the FSDataOutputStream, signalling that all three replicas of the first block are complete. The remaining blocks are written the same way.
- When every block has been written, the client calls FSDataOutputStream's close method to close the stream, then calls FileSystem.complete to tell the NN the file was written successfully.
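The replication pipeline and its reverse ack chain can be sketched in plain Python (an illustrative simulation, not the Hadoop API; the DN names are placeholders):

```python
def write_block_pipeline(data, pipeline):
    """Simulate HDFS's replication pipeline for one block.

    `pipeline` is the DN chain chosen by the NN, e.g. ["DN1", "DN2", "DN3"].
    Data is forwarded DN1 -> DN2 -> DN3; acks flow back DN3 -> DN2 -> DN1,
    and the last ack is what the client-side stream observes.
    Returns (replica stores, acks in the order the client-side chain sees them).
    """
    stores = {}
    for dn in pipeline:            # forward the replica along the chain
        stores[dn] = data
    acks = []
    for dn in reversed(pipeline):  # acks travel back toward the client
        acks.append(f"ack from {dn}")
    return stores, acks
```

Only after the final ack (from DN1 back to the stream) is the block considered fully replicated.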
2. rows between
2.1 Understanding It Through Exercises
Example: for each spu_id, compute the click_pv total over a 5-day window (the current day plus the 4 days after it):
SELECT
x.*,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING )
FROM
click AS x
1. For each spu_id, the sum of click_pv
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt )
FROM
click
2. For each spu_id, the sum of click_pv from the current date through all later dates
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING )
FROM
click
3. For each spu_id, the sum of click_pv from the current date back through all earlier dates
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )
FROM
click
4. For each spu_id, the sum of click_pv over the current date and the previous 2 days
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN 2 PRECEDING AND CURRENT ROW )
FROM
click
5. For each spu_id, the sum of click_pv over the current date and the next 2 days
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING )
FROM
click
6. For each spu_id, the sum of click_pv from 2 days before through 2 days after the current date
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING )
FROM
click
Note that in some databases a window function with ORDER BY but no explicit frame will raise an error, and even where it runs, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. a running total rather than the group total. So, returning to exercise 1, the frame should be written out explicitly:
1. For each spu_id, the sum of click_pv
SELECT
spu_id,
pt,
click_pv,
sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )
FROM
click
2.2 Summary
unbounded preceding: all preceding rows
n preceding: the n preceding rows
current row: the current row
n following: the n following rows
unbounded following: all following rows
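The frame clauses above can be checked with a small Python helper that computes a per-row `ROWS BETWEEN` sum over a list (an illustrative sketch; `None` stands for UNBOUNDED on either side):

```python
def rows_between_sum(values, preceding, following):
    """Per-row sum over a ROWS BETWEEN frame.

    preceding/following are row counts; None means UNBOUNDED on that side.
    rows_between_sum(v, 0, 0) is CURRENT ROW only;
    rows_between_sum(v, None, None) is the whole partition.
    """
    out = []
    n = len(values)
    for i in range(n):
        lo = 0 if preceding is None else max(0, i - preceding)       # frame start
        hi = n if following is None else min(n, i + following + 1)   # frame end (exclusive)
        out.append(sum(values[lo:hi]))
    return out
```

For example, on one partition's click_pv values ordered by pt, `rows_between_sum(v, None, 0)` reproduces the running total of the default frame, while `rows_between_sum(v, None, None)` reproduces the whole-partition total of exercise 1.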
3. Consecutive Logins
- Maximum number of consecutive login days per id. The key is to find a reference value: within a run of consecutive dates, the date minus its row number is constant, so grouping by that difference isolates each streak.
SELECT
id,
max( cnt )
FROM
(
SELECT
id,
date_sub,
count( 1 ) AS cnt
FROM
(
SELECT
id,
login_date,
row_number() over ( PARTITION BY id ORDER BY login_date ),
dayofyear( login_date )- row_number() over ( PARTITION BY id ORDER BY login_date ) AS date_sub
FROM
login
) as t1
GROUP BY
id,
date_sub
) AS x
GROUP BY
id;
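A quick Python check shows the limitation of this dayofyear variant (a sketch; `dayofyear` here mirrors the SQL function of the same name):

```python
from datetime import date

def dayofyear(d):
    """Day-of-year (1..366), mirroring SQL's dayofyear()."""
    return d.timetuple().tm_yday

# Two genuinely consecutive login dates that straddle a new year:
logins = [date(2022, 12, 31), date(2023, 1, 1)]
keys = [dayofyear(d) - rn for rn, d in enumerate(logins, start=1)]
# dayofyear resets to 1 on Jan 1, so the two rows get different keys
# and are wrongly counted as two separate streaks.
```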
However, this approach only works within a single year: dayofyear resets to 1 in January, so data that crosses a year boundary splits one streak into two. A more general method subtracts the row number, as a number of days, from the date itself:
SELECT
id,
max( cnt )
FROM
(
SELECT
id,
date_sub,
count( 1 ) AS cnt
FROM
(
SELECT
id,
login_date,
row_number() over ( PARTITION BY id ORDER BY login_date ),
DATE_SUB(
login_date,
INTERVAL row_number() over ( PARTITION BY id ORDER BY login_date ) day) AS date_sub
FROM
login
) as t1
GROUP BY
id,
date_sub
) as t2
GROUP BY
id;
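The general date-minus-row-number trick translates directly to Python (an illustrative stdlib-only sketch for one id's login dates):

```python
from datetime import date, timedelta
from itertools import groupby

def max_streak(login_dates):
    """Length of the longest run of consecutive dates.

    Sort the distinct dates; subtracting the row number (as days) from each
    date yields a constant key within any consecutive run, even across a
    year boundary.
    """
    days = sorted(set(login_dates))
    keyed = [(d - timedelta(days=rn), d) for rn, d in enumerate(days, start=1)]
    # groupby works here because runs are contiguous in the sorted order
    return max(sum(1 for _ in grp) for _, grp in groupby(keyed, key=lambda kv: kv[0]))
```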
With this version the result comes out correct.
- For each id, the maximum consecutive-login streak together with its start and end dates
SELECT
id,
max_login_date,
min_login_date,
cnt
FROM
(
SELECT
id,
max_login_date,
min_login_date,
cnt,
max( cnt ) over ( PARTITION BY id ) AS max_cnt
FROM
(
SELECT
id,
date_sub,
max( login_date ) AS max_login_date,
min( login_date ) AS min_login_date,
count( 1 ) AS cnt
FROM
(
SELECT
id,
login_date,
row_number() over ( PARTITION BY id ORDER BY login_date ),
date_sub( login_date, INTERVAL row_number() over ( PARTITION BY id ORDER BY login_date ) DAY ) AS date_sub
FROM
login
) as t1
GROUP BY
id,
date_sub
) AS t3
) AS t2
WHERE
cnt = max_cnt;