How to write name, occupation, and address in Python_How to identify relationships based on name and address, and then assign the same ID via linux comman or Pysp...

Latest recommended article published on 2022-07-07 09:00:41

weixin_39625305


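Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.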

Article link: https://blog.csdn.net/weixin_39625305/article/details/113981836


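As discussed in the comments, the basic idea is to partition the data so that records with the same LNAME+Address stay in the same partition, run Python code on each partition to generate a separate in-partition idx, and then merge these into the final id.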
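The scheme can be tried out without a cluster. The following is a minimal plain-Python sketch of the same three steps, not the Spark code used in this answer: `assign_ids`, the CRC32-based bucket choice, and the sample records are all assumptions made for illustration (Spark picks partitions with its own hash function).

```python
import zlib
from collections import defaultdict
from itertools import accumulate


def assign_ids(records, n_partitions=5):
    """Give records that share (LNAME, Address) the same id, numbering ids from 1."""
    # Step 1: route each record to a bucket by hashing its key, so equal keys
    # always land in the same bucket (CRC32 is only a stand-in for Spark's hash).
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["LNAME"], rec["Address"])
        pid = zlib.crc32(repr(key).encode("utf-8")) % n_partitions
        buckets[pid].append((key, rec))

    # Step 2: inside each bucket, number the distinct keys 1, 2, 3, ...
    local_idx = {}
    for pid, items in buckets.items():
        distinct = sorted({key for key, _ in items})
        local_idx[pid] = {key: i for i, key in enumerate(distinct, start=1)}

    # Step 3: offset of a bucket = number of distinct keys in all earlier buckets.
    pids = sorted(buckets)
    counts = [len(local_idx[pid]) for pid in pids]
    offsets = dict(zip(pids, [0] + list(accumulate(counts))[:-1]))

    # Final id = in-bucket idx + bucket offset.
    return [
        {**rec, "id": local_idx[pid][key] + offsets[pid]}
        for pid in pids
        for key, rec in buckets[pid]
    ]


sample = [
    {"FNAME": "David", "LNAME": "Lee", "Address": "B"},
    {"FNAME": "Marc", "LNAME": "Robert", "Address": "C"},
    {"FNAME": "Marc", "LNAME": "Robert", "Address": "C"},
    {"FNAME": "David", "LNAME": "Lee", "Address": "J"},
]
result = assign_ids(sample)
```

Because each bucket only needs to know how many distinct keys the earlier buckets hold, the per-bucket numbering can run independently and the ids still come out unique and contiguous.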

Note: I added some new rows to the sample records; see the output of df_new.show() below.

from pyspark.sql import Window, Row
from pyspark.sql.functions import coalesce, sum as fsum, col, max as fmax, lit, broadcast

# ...skip code to initialize the dataframe

# tweak the number of partitions N based on actual data size
N = 5

# Python function to iterate through the sorted list of elements in the same
# partition and assign an in-partition idx based on Address and LNAME.
def func(partition_id, it):
    idx, lname, address = (1, None, None)
    for row in sorted(it, key=lambda x: (x.LNAME, x.Address)):
        # bump idx whenever the (LNAME, Address) pair changes; the None check
        # skips the first row and still works when LNAME is an empty string
        if lname is not None and (row.LNAME != lname or row.Address != address):
            idx += 1
        yield Row(partition_id=partition_id, idx=idx, **row.asDict())
        lname = row.LNAME
        address = row.Address

# Repartition based on 'LNAME' and 'Address' and then run mapPartitionsWithIndex()
# function to create in-partition idx. Adjust N so that records in each partition
# should be small enough to be loaded into the executor memory:
df1 = df.repartition(N, 'LNAME', 'Address') \
        .rdd.mapPartitionsWithIndex(func) \
        .toDF()

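Get the number of unique rows cnt in each partition (based on Address+LNAME), which is the maximum idx in that partition, and then take the running sum of cnt over the preceding partitions as rcnt. The result is the dataframe df2 used in the join that follows. The code block for this step is missing from this copy of the article.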
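Given the imports at the top (`fmax`, `fsum`, `coalesce`, `lit`, `Window`) and the `cnt` and `rcnt` columns in the `df_new.show()` output, the df2 step plausibly looks like the following sketch. This is a reconstruction under those assumptions, not the author's original code; `build_df2` and `exclusive_running_sum` are hypothetical names, and the plain-Python helper is included only to check the arithmetic without Spark.

```python
from itertools import accumulate


def build_df2(df1):
    """Sketch: per-partition count of distinct keys (cnt) and its offset (rcnt)."""
    # Imported here so the rest of this snippet also runs without PySpark installed.
    from pyspark.sql import Window
    from pyspark.sql.functions import coalesce, lit, max as fmax, sum as fsum

    # All rows before the current one, ordered by partition_id. A window with no
    # partitionBy pulls everything into one partition, which is acceptable here
    # because the aggregated frame has only one row per Spark partition.
    w1 = Window.orderBy('partition_id').rowsBetween(Window.unboundedPreceding, -1)

    return (df1.groupby('partition_id')
               .agg(fmax('idx').alias('cnt'))   # cnt = largest idx in the partition
               .withColumn('rcnt', coalesce(fsum('cnt').over(w1), lit(0))))


def exclusive_running_sum(counts):
    """Plain-Python equivalent of rcnt: sum of all counts before each position."""
    return [total - c for c, total in zip(counts, accumulate(counts))]


# cnt values for partitions 0, 1, 2 and 4 in the article's output table
rcnt = exclusive_running_sum([3, 1, 1, 1])
```

The `coalesce(..., lit(0))` covers the first partition, whose window frame is empty and would otherwise sum to NULL.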

Join df1 with df2 and create the final id as idx + rcnt:

df_new = df1.join(broadcast(df2), on=['partition_id']).withColumn('id', col('idx')+col('rcnt'))

df_new.show()
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#|partition_id|Address|  D| DOB|FNAME|GENDER| LNAME|MNAME|idx|snapshot|cnt|rcnt| id|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+
#|           0|      B|  0|1990|David|     M|   Lee|  H M|  1|201211.0|  3|   0|  1|
#|           0|      J|  3|1991|David|     M|   Lee|   HM|  2|201211.0|  3|   0|  2|
#|           0|      D|  6|2000| Marc|     M|Robert|   MS|  3|201211.0|  3|   0|  3|
#|           1|      C|  3|2000| Marc|     M|Robert|    H|  1|201211.0|  1|   3|  4|
#|           1|      C|  6|1988| Marc|     M|Robert|    M|  1|201211.0|  1|   3|  4|
#|           2|      J|  6|1991|  66M|     F|   Rek| null|  1|201211.0|  1|   4|  5|
#|           2|      J|  6|1992|  66M|     F|   Rek| null|  1|201211.0|  1|   4|  5|
#|           4|      J|  2|1995|  66M|     F|  Rock|    J|  1|201211.0|  1|   5|  6|
#|           4|      J|  6|1990|  66M|     F|  Rock| null|  1|201211.0|  1|   5|  6|
#|           4|      J|  6|1990|  66M|     F|  Rock| null|  1|201211.0|  1|   5|  6|
#+------------+-------+---+----+-----+------+------+-----+---+--------+---+----+---+

df_new = df_new.drop('partition_id', 'idx', 'rcnt', 'cnt')

Notes: In practice, you will need to clean/normalize the column LNAME before using LNAME and Address as the uniqueness check. For example, use a separate column uniq_key that combines LNAME and Address as the unique key of the dataframe. Below is an example of some basic data-cleansing steps:

from pyspark.sql.functions import coalesce, lit, concat_ws, upper, regexp_replace, trim

#(1) convert NULL to '': coalesce(col, '')
#(2) concatenate LNAME and Address using NULL char '\x00' or '\0'
#(3) convert to uppercase: upper(text)
#(4) remove all non-[word/whitespace/NULL_char]: regexp_replace(text, r'[^\x00\w\s]', '')
#(5) convert consecutive whitespaces to a SPACE: regexp_replace(text, r'\s+', ' ')
#(6) trim leading/trailing spaces: trim(text)
df = (df.withColumn('uniq_key',
    trim(
        regexp_replace(
            regexp_replace(
                upper(
                    concat_ws('\0', coalesce('LNAME', lit('')), coalesce('Address', lit('')))
                ),
                r'[^\x00\s\w]+',
                ''
            ),
            r'\s+',
            ' '
        )
    )
))

Then, in the code, replace 'LNAME' and 'Address' with uniq_key to find the idx.
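To see what steps (1) to (6) do to concrete values, here is a plain-Python mirror of the same pipeline. It is only an illustration: `make_uniq_key` and the sample strings are hypothetical, and `re.ASCII` is an assumption meant to approximate the default behaviour of the Java regex engine behind `regexp_replace`.

```python
import re


def make_uniq_key(lname, address):
    """Plain-Python mirror of steps (1)-(6) used to build uniq_key."""
    # (1) NULL -> '' and (2) join the two fields with the NUL character
    text = "\0".join(part if part is not None else "" for part in (lname, address))
    text = text.upper()                                       # (3) uppercase
    text = re.sub(r"[^\x00\s\w]+", "", text, flags=re.ASCII)  # (4) drop punctuation
    text = re.sub(r"\s+", " ", text, flags=re.ASCII)          # (5) collapse whitespace
    return text.strip(" ")                                    # (6) trim spaces


key_a = make_uniq_key("O'Brien", "12  Main St.")
key_b = make_uniq_key("  obrien", "12 main st")
```

Both inputs normalize to the same key, so the two spellings would receive the same idx. The NUL separator keeps the boundary between LNAME and Address visible, so ("AB", "C") and ("A", "BC") do not collide.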

As cronoik mentioned in the comments, you can also try one of the Window rank functions to calculate the in-partition idx. For example:

from pyspark.sql.functions import spark_partition_id, dense_rank

# use dense_rank to calculate the in-partition idx
w2 = Window.partitionBy('partition_id').orderBy('LNAME', 'Address')

df1 = df.repartition(N, 'LNAME', 'Address') \
        .withColumn('partition_id', spark_partition_id()) \
        .withColumn('idx', dense_rank().over(w2))

Once you have df1, use the same method as above to calculate df2 and df_new. This should be faster than using mapPartitionsWithIndex(), which is basically an RDD-based method.
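The reason dense_rank can stand in for the hand-written loop is that both number the distinct (LNAME, Address) pairs of a partition in sorted order, with no gaps. A small plain-Python sketch of that ranking rule, using a hypothetical helper and made-up keys:

```python
def dense_rank(keys):
    """Rank each key by its position among the sorted distinct keys, starting at 1."""
    rank_of = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]


# (LNAME, Address) pairs that could share one partition
partition_keys = [("Robert", "D"), ("Lee", "J"), ("Lee", "B"), ("Lee", "J")]
ranks = dense_rank(partition_keys)
```

Equal keys get equal ranks and the largest rank equals the number of distinct keys, which is exactly the property the cnt and rcnt step relies on.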

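For your real data, adjust N to fit the actual data size. This N only affects the initial partitioning; after the dataframe join, the number of partitions will be reset to the default (200). You can adjust this with spark.sql.shuffle.partitions, for example when you initialize the spark session:

spark = SparkSession.builder \
    ....
    .config("spark.sql.shuffle.partitions", 500) \
    .getOrCreate()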
