Summary of the Lab Data Processing Task (20200314)

Latest recommended article published on 2022-03-14 18:44:54

CCH²¹

Category: Python data analysis. Tags: python, regular expressions, data analysis

Article link: https://blog.csdn.net/qq_45554010/article/details/104908663

If you would like the dataset and code, click here.

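Task Description

Basic requirements
Write the data from the sample file to the output file in the format shown in the output sample below. Note that every 暂无数据 ("no data available") in the input file is written to the output file as 暂无, and every None is written as NULL. The sample file contains 240 records in total.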
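The two substitution rules above can be sketched as a small helper. This is only an illustration: the name `normalize_placeholder` is hypothetical and does not appear in the original script, which applies the rules inline.

```python
def normalize_placeholder(value: str) -> str:
    """Map the input placeholders to the spellings required in the output."""
    text = value.strip()
    if text == 'None':
        return 'NULL'   # rule 2: None -> NULL
    if text == '暂无数据':
        return '暂无'    # rule 1: 暂无数据 -> 暂无
    return text         # anything else is passed through, stripped


print(normalize_placeholder(' None'))
print(normalize_placeholder('暂无数据'))
print(normalize_placeholder('随时入住'))
```

Any value that is not one of the two placeholders is returned unchanged apart from surrounding whitespace.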
Input file sample
A sample of the data in the sample file ori_data:

Tue Mar 19 16:23:02 2019,杭州租房网 >  萧山租房 >  钱江世纪城租房 >   佳境天城人合苑租房  , 合租·佳境天城人合苑4室1厅, 2430元/月(季付价), 公寓 独立卫生间 近地铁 押一付一 随时看房 , 合租 4室1厅2卫 16㎡ 朝南  房屋信息  基本信息 发布：12天前 入住：随时入住   租期：暂无数据 看房：随时可看   楼层：5/18层 电梯：暂无数据   车位：暂无数据 用水：暂无数据   用电：暂无数据 燃气：暂无数据   采暖：暂无数据  ,None, 地址和交通距离地铁2号线-振宁路329m, end 
Tue Mar 19 16:23:02 2019,杭州租房 >  滨江租房 >  浦沿租房 >   朗诗寓·东信大道店租房  , None,  朗诗寓·东信大道店  2550元/月起  ,None, None, None, None, 地址和交通, end

Output file sample
A sample of the data in the output file dea_data:

Tue Mar 19 杭州  萧山  钱江世纪城  佳境天城人合苑 2430元/月 16㎡ 4室 2卫 1厅 朝南 5/18层 NULL 随时入住 暂无 暂无 暂无 暂无 暂无 暂无
Tue Mar 19 杭州  滨江  浦沿  朗诗寓·东信大道店 2550元/月 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

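Task Analysis

The first thing I noticed is that the sample file uses Unix (LF) line endings, so line endings and file encoding may need some attention when the file is opened.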
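One way to check both points before parsing is to look at the raw bytes. The sketch below is an assumption about how such a check could be done, not something the original script does; `describe_bytes` is a hypothetical helper, and the demo uses an in-memory sample rather than the real ori_data file.

```python
def describe_bytes(raw: bytes) -> dict:
    """Count line endings and test whether the bytes decode as UTF-8."""
    crlf = raw.count(b'\r\n')
    lf = raw.count(b'\n') - crlf  # bare LF only, CRLF pairs excluded
    try:
        raw.decode('utf-8')
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return {'crlf': crlf, 'lf': lf, 'utf8': is_utf8}


# In-memory stand-in for one line of the sample file
sample = '杭州租房网,None, end\n'.encode('utf-8')
print(describe_bytes(sample))
```

For the real file, the same function could be fed the result of `open('ori_data', 'rb').read()`.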
Next, it is easy to see that the fields on each line are separated by commas, which suggests processing the file with Python's csv module.
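As a quick illustration of what csv.reader produces for one of the sample lines, the sketch below parses the second sample record from an in-memory string (using io.StringIO here is my choice for a self-contained demo; the script itself reads from the file).

```python
import csv
import io

# Second record from the input sample, held in memory for the demo
sample = ('Tue Mar 19 16:23:02 2019,杭州租房 >  滨江租房 >  浦沿租房 >   朗诗寓·东信大道店租房  ,'
          ' None,  朗诗寓·东信大道店  2550元/月起  ,None, None, None, None, 地址和交通, end\n')

rows = list(csv.reader(io.StringIO(sample, newline='')))
fields = rows[0]

print(len(fields))         # number of comma-separated fields
print(fields[0])           # timestamp column
print(fields[-1].strip())  # trailing marker column
```

Note that csv.reader keeps the leading spaces of each field by default, which is why the later steps call strip() or split() before comparing values.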
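Before the main processing, the fields that are of no use should be removed first, which keeps the later steps easier to follow.
The output sample has a fixed format, so we can import Python's re module and use regular expressions to pick out the required values.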
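The regular-expression step can be sketched as follows. The helper `first_match` is hypothetical, and the patterns use `\d+` (at least one digit), which is slightly stricter than the `\d*` patterns in the script below; the text in `info` is a shortened copy of the detail field from the first sample record.

```python
import re


def first_match(pattern: str, text: str, default: str = 'NULL') -> str:
    """Return the first match of pattern in text, or default when there is none."""
    match = re.search(pattern, text)
    return match.group() if match else default


info = ' 合租 4室1厅2卫 16㎡ 朝南  房屋信息  基本信息 发布：12天前 入住：随时入住   租期：暂无数据 看房：随时可看   楼层：5/18层 电梯：暂无数据'

print(first_match(r'\d+㎡', info))       # floor area
print(first_match(r'\d+室', info))       # number of bedrooms
print(first_match(r'\d+/\d+层', info))   # floor
print(first_match(r'\d+~\d+年', info))   # lease term, absent in this record
```

Returning a default instead of calling `.group()` directly avoids an AttributeError when a value is missing from a record.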

Source Code and Brief Explanation

The explanation of the program is given as comments in the code below.

#!/usr/bin/env python3

import csv
import re

with open('ori_data', mode='r', encoding='utf-8', newline='') as csv_in_file:
    with open('dea_data_output', mode='w', encoding='utf-8', newline='') as out_file:
        filereader = csv.reader(csv_in_file)
        for row_list in filereader:
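            # Build the string that will be written to the output file
            out_str = ''

            # Drop the fields no output column needs: the trailing "end" marker,
            # the listing title and the tag field
            row_list.pop()
            row_list.pop(2)
            row_list.pop(3)

            # Normalise the first three columns: date, location, price
            # Keep weekday, month and day (e.g. "Tue Mar 19") rather than hard-coding one date
            row_list[0] = ' '.join(row_list[0].split()[:3])
            row_list[1] = ''.join(row_list[1].split())
            row_list[1] = row_list[1].replace('>', '').replace('租房', '  ').replace('网', '').rstrip()
            row_list[2] = re.search(r'(\d*)元/月', row_list[2]).group()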

            # Whatever the row length, only the first four columns are needed,
            # so drop everything after them
            del row_list[4:]

            # Rewrite the last column (the free-text listing details)
            if row_list[-1].strip() == 'None':
                row_list[-1] = 'NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL'
            else:
                s = ''
                # Floor area
                s += re.search(r'(\d*㎡)', row_list[-1]).group()
                s += ' '
                # Bedrooms, bathrooms, living rooms
                s += re.search(r'(\d*室)', row_list[-1]).group()
                s += ' '
                s += re.search(r'(\d*卫)', row_list[-1]).group()
                s += ' '
                s += re.search(r'(\d*厅)', row_list[-1]).group()
                s += ' '
                # Orientation
                s += re.search(r'朝\w', row_list[-1]).group()
                s += ' '
                # Floor: NULL when absent or when the current floor is missing (e.g. "/18层")
                floor = re.search(r'(\d*/\d*层)', row_list[-1])
                if floor is None or floor.group()[0] == '/':
                    s += 'NULL'
                else:
                    s += floor.group()
                s += ' '
                # Lease term
                lease = re.search(r'(\d*~\d*年)', row_list[-1])
                s += lease.group() if lease else 'NULL'
                s += ' '
                # Move-in date
                s += '随时入住' if '随时入住' in row_list[-1] else 'NULL'
                s += ' '
                # Elevator, parking, water, electricity, gas, heating: look for
                # "label：value" and write the mapped value. The NULL fallback is an
                # addition that keeps the columns aligned if no known value matches.
                attributes = (
                    ('电梯', (('有', '有'), ('无', '无'), ('暂无数据', '暂无'))),
                    ('车位', (('免费', '免费'), ('租用', '租用'), ('暂无数据', '暂无'))),
                    ('用水', (('民水', '民水'), ('商水', '商水'), ('暂无数据', '暂无'))),
                    ('用电', (('民电', '民电'), ('商电', '商电'), ('暂无数据', '暂无'))),
                    ('燃气', (('有', '有'), ('无', '无'), ('暂无数据', '暂无'))),
                    ('采暖', (('自采暖', '自采暖'), ('集中供暖', '集中'), ('暂无数据', '暂无'))),
                )
                values = []
                for label, choices in attributes:
                    for raw, shown in choices:
                        if label + '：' + raw in row_list[-1]:
                            values.append(shown)
                            break
                    else:
                        values.append('NULL')
                s += ' '.join(values)
                row_list[-1] = s

            # Assemble the output line from the four remaining columns
            out_str += row_list[0] + ' ' + row_list[1] + ' ' + row_list[2] + ' ' + row_list[3] + '\n'

            # Write the line to the output file
            out_file.write(out_str)
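
After running the script, a simple sanity check on the output can catch rows whose columns have shifted. The sketch below is an assumption based on the two output samples, where each line splits into 22 whitespace-separated tokens (3 for the date, 4 for the location, 1 for the price, 14 for the details); records with a different number of location levels would not fit this count. `has_expected_shape` is a hypothetical helper.

```python
EXPECTED_TOKENS = 22  # assumption taken from the two output samples


def has_expected_shape(line: str, expected: int = EXPECTED_TOKENS) -> bool:
    """Return True when the line splits into the expected number of tokens."""
    return len(line.split()) == expected


row1 = ('Tue Mar 19 杭州  萧山  钱江世纪城  佳境天城人合苑 2430元/月 16㎡ 4室 2卫 1厅 '
        '朝南 5/18层 NULL 随时入住 暂无 暂无 暂无 暂无 暂无 暂无')
row2 = 'Tue Mar 19 杭州  滨江  浦沿  朗诗寓·东信大道店 2550元/月 ' + ' '.join(['NULL'] * 14)

print(has_expected_shape(row1))
print(has_expected_shape(row2))
```

The same function could be applied to every line of the output file, together with a check that the file contains 240 lines.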