PySpark String Functions (Part 5)

Latest recommended article published on 2024-05-03 11:32:05

hejp_123

Category: spark  Tags: pyspark, functions, strings

Article link: https://blog.csdn.net/hejp_123/article/details/88033627


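1. String concatenation
2. String formatting
3. Finding the position of a substring
4. Extracting a substring
5. Regular expression extraction
6. Regular expression replacement
7. Other string functions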

1. String concatenation

from pyspark.sql.functions import concat, concat_ws
df = spark.createDataFrame([('abcd','123')], ['s', 'd'])

# 1. Plain concatenation
df.select(concat(df.s, df.d).alias('s')).show()
# abcd123

# 2. Concatenation with a separator
df.select(concat_ws('-', df.s, df.d).alias('s')).show()
# abcd-123

2. String formatting

from pyspark.sql.functions import format_string

df = spark.createDataFrame([(5, "hello")], ['a', 'b'])
df.select(format_string('%d %s', df.a, df.b).alias('v')).show()
# 5 hello

3. Finding the position of a substring

from pyspark.sql.functions import instr

df = spark.createDataFrame([('abcd',)], ['s',])
df.select(instr(df.s, 'b').alias('s')).show()
# 2

4. Extracting a substring

from pyspark.sql.functions import substring

df = spark.createDataFrame([('abcd',)], ['s',])
# substring(col, pos, len): pos is 1-based, len is the number of characters
df.select(substring(df.s, 1, 2).alias('s')).show()
# ab

5. Regular expression extraction

from pyspark.sql.functions import regexp_extract

df = spark.createDataFrame([('100-200',)], ['str'])
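# Raw strings keep \d from being treated as a Python escape sequence
df.select(regexp_extract('str', r'(\d+)-(\d+)', 1).alias('d')).show()
# '100'

df = spark.createDataFrame([('foo',)], ['str'])
df.select(regexp_extract('str', r'(\d+)', 1).alias('d')).show()
# '' (empty string, since the pattern does not match)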

6. Regular expression replacement

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([('100-200',)], ['str'])
# Every run of digits is replaced with '--'
df.select(regexp_replace('str', r'(\d+)', '--').alias('d')).collect()
# [Row(d='-----')]

7. Other string functions

Function	Purpose
repeat	Repeats a string a given number of times
split	Splits a string into an array around a pattern
