My First Python Project: Converting a Disorganized Text File into a Neatly Structured CSV File

So I decided to learn Python. Turns out this computer programming language isn’t so hard (well, until I got this project! :P ).

Within seconds, I fell in love with its easy, crisp syntax and its automatic indentation while writing. I was mesmerized when I learned that data structures like lists, tuples and dictionaries could be created and initialized dynamically with a single line (like so: list_name = [] ).

Moreover, the values held in these could be accessed with or without the use of numeric indexes. This makes the code highly readable, as the index can be replaced by an English word of one’s choice (a dictionary key, for instance).

Well, enough said about the language. Let me show you what the project demanded.

My brother gave me this project. He came across a text file containing thousands of words. Many of the words shared almost the same meaning. Each word had its definition and an example sentence next to it, but not in an organized manner. There were spaces and newlines between a word and its sentence, and some words were missing a definition or an example. Below are the snippets of the text file which I’m talking about:

He wanted the text to be uniform. For that, he needed me to neatly group all words of similar meaning under a topic. He told me that this could be achieved by capturing all the data in the text into a dictionary in the following format:

and then writing them into a CSV (Comma Separated Values) File.

He asked if I could take this up as my first project, now that I had learned the fundamentals. I was thrilled to work out the logic and so I instantly agreed. When asked about the deadline, he gave me a decent time of 2 days to finish.

Alas, I ended up taking double the amount of time, for I struggled to debug the written code properly. Frankly, if it hadn’t been for my brother’s short visits to my room to look at the progress and hint at the wrong assumptions I had made while writing the conditions, the project would have taken me an eternity :P

I began by creating mini tasks within the program which I sought to finish before building up the entire program. These were as listed below:

1. Forming a Regex to match a number and the word next to it

I examined the text file and noticed that every topic (herein referred to as ‘key’ ) had a number preceding it. So, I wrote a few lines of code for making a regex (regular expression — a powerful tool to extract text) of the pattern as follows:

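The original snippet is not reproduced here, so the following is only a minimal sketch of what such a regex might look like. The sample lines and the exact key layout (a number, an optional dot, then a single word) are assumptions made for illustration, not the real file's format.

```python
import re

# Hypothetical sample lines; the real text file's layout is an assumption.
sample_lines = [
    "1. Happiness",
    "joy = a feeling of great pleasure",
    "2. Anger",
    "3. Fear/Dread",
]

# Assumed pattern: digits, an optional dot, whitespace, then one word.
key_pattern = re.compile(r"^(\d+)\.?\s+(\w+)$")

keys = []
for line in sample_lines:
    match = key_pattern.match(line)
    if match:
        keys.append((int(match.group(1)), match.group(2)))

# The key containing '/' is not matched by this simple pattern.
print(keys)
```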

However, when I ran this I got an error, a UnicodeDecodeError to be exact. At the time I took it to mean that I didn’t have access to the text file; what it actually signals is that Python could not decode the file’s bytes using the encoding it assumed. I looked it up on https://stackoverflow.com and, after a long search with no luck, my brother came and found a solution. The error was rectified as follows:
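The actual fix is not shown in this post. A common remedy for a UnicodeDecodeError is to pass the file's encoding explicitly to open(); the sketch below assumes the file was UTF-8, which may not match the real one, and writes its own temporary file so that it runs on its own.

```python
import os
import tempfile

# Write a small UTF-8 file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("1. Café\n")
    sample_path = tmp.name

# Passing the encoding explicitly avoids relying on the platform default.
with open(sample_path, encoding="utf-8") as text_file:
    content = text_file.read()

os.remove(sample_path)
print(content)
```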

Still, I didn’t get the desired output. This was because some keys contained slashes (‘/’) or spaces (‘ ’), which my regex couldn’t match. I decided to improve the regex later and wrote a comment next to it as a reminder.

2. Obtaining a list of lines as strings from the text file

For this, I wrote just 1 line of code and fortunately, no errors showed up.

However, I obtained an unclean list. It contained newlines (‘\n’) and spaces (‘ ’). I then sought to refine the list as follows:
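As a rough sketch of this cleaning step: the raw list below stands in for what a call such as readlines() would return, and its contents are invented for illustration. Stripping each line and dropping the ones that end up empty gives a clean list.

```python
# Hypothetical raw lines, as a file's readlines() might return them.
raw_lines = [
    "1. Happiness\n",
    "\n",
    "  joy = great pleasure \n",
    " \n",
    "Example: She wept for joy.\n",
]

# Strip surrounding whitespace and drop lines that are empty afterwards.
clean_lines = [line.strip() for line in raw_lines if line.strip()]

print(clean_lines)
```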

3. Extracting words, meanings, and example sentences separately and adding them to corresponding lists

This was by far the hardest part to do as it involved proper logic and proper judgment by pattern recognition.

Interestingly, while glancing over the text file, I noticed more patterns. Every word had its meaning on the same line, separated by an ‘=’ sign. Also, every example was preceded by the keyword ‘Example’ and a ‘:’ sign.

I thought of making use of regex again, but I found an alternative and more elegant solution: slicing the line (now a string in the list) according to the placement of the symbols. Slicing is another cool feature in Python. I wrote the code as follows:

The above code reads almost like English. For every line in the clean list, it checks whether it has a ‘=’ or a ‘:’ sign. If it does, then the index of the sign is found and slicing is done accordingly.

In the first ‘if’, the part before the ‘=’ is stored in the variable ‘word’ and the part after it is stored in ‘meaning’. Similarly, for the second ‘if’ (an ‘elif’, short for ‘else if’, in this case), the part after ‘:’ is stored in ‘example’. After each iteration, the word, meaning and example sentence are stored in the corresponding lists. In this way, the whole data can be extracted.
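The description above can be sketched roughly as follows. This is a reconstruction from the prose, not the original code; the sample lines are invented, and the list names words, meanings and examples are assumptions.

```python
# Hypothetical clean lines in the two formats described above.
clean_lines = [
    "joy = a feeling of great pleasure",
    "Example: She wept for joy.",
]

words, meanings, examples = [], [], []

for line in clean_lines:
    if "=" in line:
        sign_index = line.index("=")
        # Everything before '=' is the word, everything after is the meaning.
        words.append(line[:sign_index].strip())
        meanings.append(line[sign_index + 1:].strip())
    elif ":" in line:
        sign_index = line.index(":")
        # Everything after ':' is the example sentence.
        examples.append(line[sign_index + 1:].strip())

print(words, meanings, examples)
```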

So far so good. But, I noted that the extraction was to be done in a manner such that every word (and its aspects) of the particular key had to be accumulated together as one value for the key. This meant it was required to store each word, meaning, and example inside a tuple. Each tuple was to be stored inside a single list which would represent itself as the value for a particular key. This is depicted below:

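To make the target shape concrete, here is a small sketch of such a dictionary. The topic name and entries are invented for illustration; only the structure (a key mapped to a list of word, meaning, example tuples) follows the description above.

```python
# Each key maps to a list of (word, meaning, example) tuples.
thesaurus = {
    "Happiness": [
        ("joy", "a feeling of great pleasure", "She wept for joy."),
        ("delight", "great pleasure", "He laughed with delight."),
    ],
}

# Tuple unpacking gives each part a readable name while looping.
for word, meaning, example in thesaurus["Happiness"]:
    print(f"{word}: {meaning} ({example})")
```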

For this, I planned to collect each word, meaning and sentence of each key inside a separate list enclosed by another list, say key-list. Again, the picture will tell you more precisely:

To do this, I added the following code to the one which I wrote for slicing:

Unfortunately, this code’s logic (the else part) turned out to be wrong. I wrongly assumed that only 2 conditions (‘=’ and ‘:’) existed in the text. There were many exceptions which I failed to notice. I ended up wasting hours debugging possible errors in the logic. I had assumed that the complete text file followed the same pattern, but that was simply not the case.

Unable to make progress, I moved on to the next part of the program. I thought I could use some help from my brother after completing the other parts. :P

To be continued…

4. Creating values for keys using the zip function and parameter unpacking

At this point, I wasn’t entirely sure what I would do even after achieving the above configuration of lists. During one of my brother’s tech talks I had learned about parameter unpacking and about the zip() function, which literally zips together the lists passed to it, like so:
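A small illustration of the two features, using made-up lists: zip() pairs up items at the same position, and the * operator unpacks a list into separate arguments, which lets zip() undo its own work.

```python
words = ["joy", "delight"]
meanings = ["great pleasure", "high satisfaction"]

# zip() pairs up the items found at the same position in each list.
pairs = list(zip(words, meanings))

# The * operator unpacks the list of pairs into separate arguments,
# so zipping again splits the pairs back into two tuples.
unzipped_words, unzipped_meanings = zip(*pairs)

print(pairs)
print(unzipped_words, unzipped_meanings)
```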

So I thought I could somehow combine those two features to achieve the desired result. After a bit of to-ing and fro-ing, testing the features and working on dummy lists, I succeeded. I created a separate file (beta) for this task, the snippet of which is given below:

The working of the above code can be figured out by having a look at the output:

The zip() function zips the corresponding lists or values within the lists and encloses them in a tuple. The tuples inside the lists are then converted to lists for unpacking and further zipping. Finally, the desired output is obtained.

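The beta file itself is not shown, so this is only a sketch of the idea on dummy data. It assumes each key has a "key-list" holding three parallel lists (words, meanings, examples), and it zips the unpacked key-list directly rather than converting tuples to lists along the way, which is a simplification of the steps described above.

```python
# Hypothetical dummy data: one key-list per key, each holding
# parallel lists of words, meanings and examples.
grouped = [
    [
        ["joy", "delight"],
        ["great pleasure", "high satisfaction"],
        ["She wept for joy.", "He laughed with delight."],
    ],
    [
        ["rage"],
        ["violent anger"],
        ["He shook with rage."],
    ],
]

# Unpack each key-list into zip() to get (word, meaning, example) tuples.
values = [list(zip(*key_list)) for key_list in grouped]

print(values)
```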

I felt much relieved, for the code worked this time. I was happy that I could manipulate the would-be extracted data and mold it into the required format. I copied the code to the main file on which I was working and modified the variable names accordingly. Now all that was left to do was to assign values to the keys in the dictionary (and of course the extraction part!).

5. Assigning values to the keys in the dictionary

I arrived at this solution after some experimentation with the code:

This produced the desired output as follows:

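As an illustration of this step (not the original solution, which is not reproduced here), one simple way to pair keys with their values is to zip the two lists and hand the result to dict(). The key names and entries below are invented.

```python
# Hypothetical keys and the per-key values built in the previous step.
keys = ["Happiness", "Anger"]
values = [
    [("joy", "great pleasure", "She wept for joy.")],
    [("rage", "violent anger", "He shook with rage.")],
]

# dict() accepts (key, value) pairs, which zip() produces directly.
thesaurus = dict(zip(keys, values))

print(thesaurus["Anger"])
```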

The program was almost done. The main problem lay in the data extraction part.

… continuation from section 3

After hours and hours of debugging, I grew more and more frustrated as to why the damn thing didn’t work. I called my brother and he gave me a subtle hint about the assumptions I had made while defining the conditional loops and if-else clauses. We scrutinized the text file and noticed that some words had examples in two lines instead of one.

According to my code logic, since there is no ‘:’ sign in the second line (nor an ‘=’ sign, for that matter), the contents of that line would not be treated as part of the example. As a result, the line would fall through to the final ‘else’ branch and execute the code written in it. Considering all this, I modified the code as below:

Here, hasNumbers() is a function which checks whether a given line has numbers in it. I defined it as follows:

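The original definition is not shown; a function with this behavior could be written along these lines. Only the name hasNumbers comes from the post, the body is an assumption.

```python
def hasNumbers(line):
    """Return True if the line contains at least one digit."""
    return any(character.isdigit() for character in line)


print(hasNumbers("12 Anger"))
print(hasNumbers("joy = great pleasure"))
```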

What this does is collect the second line of the example if all other conditions fail, combine it with the first line, and then add it to the corresponding list as before.
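The idea of joining a two-line example can be sketched as below. Note that this sketch looks back and appends to the previous example, whereas the post's own code reportedly checks the next line, so treat it as a variation on the technique rather than the original logic. The sample lines are invented.

```python
def hasNumbers(line):
    """Return True if the line contains at least one digit."""
    return any(character.isdigit() for character in line)


# Hypothetical lines where one example spills onto a second line.
clean_lines = [
    "joy = great pleasure",
    "Example: She wept for joy",
    "when she heard the news.",
    "2 Anger",
]

examples = []
for line in clean_lines:
    if "=" in line:
        continue  # word/meaning lines are handled elsewhere
    elif ":" in line:
        examples.append(line[line.index(":") + 1:].strip())
    elif not hasNumbers(line) and examples:
        # No '=', no ':' and no number: treat it as a continuation
        # of the most recent example.
        examples[-1] = examples[-1] + " " + line

print(examples)
```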

To my disappointment, this didn’t work and instead showed an error that the index was out of range. I was dumbstruck, as every line of code seemed to be logically correct in my view.

After hours of madness, my brother showed me a way to fetch the line numbers where the error occurred. One of the main skills in programming is the ability to debug a program: to check methodically for possible errors and keep the work flowing.

Interestingly, the following addition to the code reported that the error occurred at around line number 1750 of the text file.

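The addition itself is not reproduced here. One common way to find the offending line, shown as a sketch with a hypothetical parse_line helper and invented data, is to loop with enumerate() and catch the exception so the line number can be reported.

```python
def parse_line(line):
    # Hypothetical parser: assumes every line contains an '=' sign,
    # and raises IndexError when it does not.
    parts = line.split("=")
    return parts[0].strip(), parts[1].strip()


clean_lines = ["joy = great pleasure", "a line with no separator"]

failed_at = None
for line_number, line in enumerate(clean_lines, start=1):
    try:
        parse_line(line)
    except IndexError as error:
        # Record and report where the assumption broke down.
        failed_at = line_number
        print(f"Error on line {line_number}: {line!r} ({error})")
        break
```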

This meant that the program worked well up to that line number and that my code was correct up to that point! The problems lay in my wrong assumptions and in the heterogeneity of the text file.

This time around, I noticed that some keys did not sit next to their numbers, which caused problems in the logic flow. I rectified the mistakes by further modifying the code as follows:

This worked well till line 4428 of the text file but crashed right after. I checked that line number in the text file itself, but that didn’t help much. Then I realized, much to my happiness, that it must be the last line. The whole program worked on the clean list, which was devoid of newlines and spaces. I printed the last line of the clean list and compared it with the last line of the text file. They matched!

I was extremely happy to know this, as it meant the program had run through to the end. The only reason it crashed was that after the last sentence none of the code made sense. My conditionals were designed to check the next line as well as the current line every time. Since there was no line after the last line, it crashed.

So I wrote an additional line of code to cover that up:

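The extra line is not shown, but a guard of this kind usually amounts to checking that a next line exists before peeking at it. The sketch below uses invented data and assumes every line that gets processed contains a ':' sign.

```python
# Hypothetical example lines; the first example spans two lines.
clean_lines = [
    "Example: She wept for joy",
    "when she heard the news.",
    "Example: He shook with rage.",
]

examples = []
skip_next = False
for index, line in enumerate(clean_lines):
    if skip_next:
        skip_next = False
        continue
    example = line[line.index(":") + 1:].strip()
    # Guard: only look at the next line if there is one.
    has_next = index + 1 < len(clean_lines)
    if has_next and ":" not in clean_lines[index + 1]:
        example += " " + clean_lines[index + 1]
        skip_next = True
    examples.append(example)

print(examples)
```

Without the has_next check, the lookup on the final iteration would raise an IndexError, which matches the crash described above.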

Everything worked now. Finally! Now all I had to do was assign the keys to their corresponding values, and that’s that! I took a break at this moment, considering that my project was finally over. I would add some final touches to it later.

But before taking a break, I decided to wrap each piece of code in its own function so as to make the code look neat. I was already having a lot of trouble navigating up and down the lines of code. So I decided to take a break after doing this.

However, after doing so, the program started giving variable scope errors. I realized that this was because variables defined inside a function cannot be accessed directly from outside it, as they live in the function’s local namespace. Unwilling to make further changes due to that lame error, I decided to revert to the same code with which I had been hitting my head from the start.
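The scope problem can be reproduced in a few lines. In this sketch (function and variable names are invented) the list built inside the function is not visible outside it, and returning the value is the usual way to hand it back to the caller.

```python
def clean(raw_lines):
    # 'cleaned' exists only inside this function's local namespace.
    cleaned = [line.strip() for line in raw_lines if line.strip()]
    return cleaned  # return the result instead of relying on a global


result = clean(["joy\n", "\n", " anger \n"])

try:
    print(cleaned)  # not defined at module level
    leaked = True
except NameError:
    leaked = False

print(result, leaked)
```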

However, to my utter disbelief, the program didn’t work in the same way as it did before. In fact, it didn’t work at all! I simply couldn’t figure out the reason (and I still can’t!). I was utterly depressed for the rest of the day. It was like experiencing a nightmare even before falling asleep!

Fortunately and miraculously, the code worked the next day after I made some careful changes. I made sure that I made many beta files (for each change made) thereafter so as to avoid such unnecessary chaos.

After a few more hours, I was finally able to complete my program (but not until I had consumed 4 full days). I made a few more changes, such as:

i) modifying the ‘hasNumbers’ function into a ‘hasNumbersDot’ function and dropping the regex I made earlier in the program. This matched the keys more reliably, as it made fewer assumptions and hence ran into fewer exceptions. The code for it is as follows:

ii) replacing the regex condition and the code for obtaining keys from the clean list.

iii) combining the ‘if’ conditions in the ‘examples extraction’ part.

iv) materializing the code for dictionary key assignment.
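The code for ‘hasNumbersDot’ from item i) is not reproduced in this post. As a guess at what such a check might do, the sketch below assumes that a key line starts with a number followed by a dot (for example "12. Anger"); that format, and the body of the function, are assumptions.

```python
def hasNumbersDot(line):
    """Return True if the line starts with digits followed by a dot."""
    # partition() splits at the first dot; 'dot' is empty if none exists.
    prefix, dot, _rest = line.partition(".")
    return bool(dot) and prefix.strip().isdigit()


print(hasNumbersDot("12. Anger"))
print(hasNumbersDot("joy = great pleasure."))
```

A line such as "3.5 is a number" would also pass this check, so a real file might need a stricter rule.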

Also, after some trial and error, I was able to convert the data obtained into a beautifully structured CSV file:

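For readers curious how such a dictionary can be written out, here is a minimal sketch using the standard csv module. The column names and data are invented, and an in-memory buffer stands in for a real file opened with open(path, "w", newline="", encoding="utf-8").

```python
import csv
import io

# Hypothetical data in the dictionary format described earlier.
thesaurus = {
    "Happiness": [("joy", "a feeling of great pleasure", "She wept for joy.")],
    "Anger": [("rage", "violent, uncontrollable anger", "He shook with rage.")],
}

buffer = io.StringIO()  # stands in for a real output file
writer = csv.writer(buffer)
writer.writerow(["topic", "word", "meaning", "example"])
for topic, entries in thesaurus.items():
    for word, meaning, example in entries:
        # Fields containing commas are quoted automatically by csv.writer.
        writer.writerow([topic, word, meaning, example])

csv_text = buffer.getvalue()
print(csv_text)
```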

You can check out the GitHub repository on my profile to view the full code for the program, including the text file and the CSV file.

Overall, it was a great experience. I got to learn so much out of this project. I also gained more confidence in my skills. Despite some unfortunate events (programming involves such things :P), I was finally able to complete the given task.

One last thing! Recently, I came across a hilarious meme regarding the stages of debugging which is so relatable to my experience that I can’t resist sharing. xD

Thanks for making it all the way until here (even if you skipped most of it to check out the final result :P).

Translated from: https://www.freecodecamp.org/news/my-first-python-project-converting-a-disorganized-text-file-into-a-neatly-structured-csv-file-21f4c6af502d/
