最近在做数据采集过程中,有一些数据采集下来之后是用逗号分割开的,需要拆分成两条,类似:
id | location | university | degree_name | programme_en | url |
---|---|---|---|---|---|
25564 | Perth,Albany | Western Australia | Master of Curatorial Studies | Curatorial Studies in Fine Arts | https://study.uwa.edu.au/courses/master-of-curatorial-studies-in-fine-arts |
这是一所学校的一个专业但是location中有两个值,所以需要拆分为两条,但在采集过程中没有完成拆分,需要用python脚本进行拆分。具体代码如下:
import numpy as np
import pandas as pd
import pymysql
from sqlalchemy import create_engine
conn = pymysql.connect(host='localhost',port = 3306,user='root',passwd='123456',db ='major_test',charset='utf8')
#将数据表的dataframe转换为拆分后的dataframe
def tidy_split(df,column,sep=','):
indexes = []
new_values = []
for i,presplit in enumerate(df[column].astype(str)):
values = presplit.split(sep)
for value in values:
value = value.strip()
indexes.append(i)
new_values.append(value)
new_df = df.iloc[indexes,:].copy()
new_df[column] = new_values
new_df.index = [i for i in range(new_df.shape[0])]
return new_df
#将数据库转化为dataframe格式
def read_table(cur,sql_order):
try:
cur.execute(sql_order)
data = cur.fetchall()
col_result = cur.description
columns = []
for i in range(len(col_result)):
columns.append(col_result[i][0])
frame = pd.DataFrame(list(data),columns=columns)
except Exception as e:
frame = pd.DataFrame()
return frame
#执行方法,可修改column和sep参数,叠加修改可以修改engine,column和sep参数并且修改生成的新表名称
def main():
cur = conn.cursor()
sql_order = 'select * from tmp_school_uk_yuke_xinzeng'
result = read_table(cur, sql_order)
new_result = tidy_split(result, column='location', sep=',')
engine = create_engine('mysql+pymysql://root:123456@127.0.0.1:3306/major_test?charset=utf8')
new_result.to_sql('tmp_ok', engine)
conn.commit()
if __name__ == '__main__':
main()