与python一起玩的乐趣与excel电子表格的真实工作-CSDN博客

I recently had a request from my lovely ex-engineer-now-artist girlfriend (yes, the same girlfriend) that she wanted to update some Cost-Of-Goods-Sold values in an Excel spreadsheet based on the value of another column in the same row.

最近，我可爱的前工程师现任女友(是的，同一位女友)向我提出了一个要求，她希望根据Excel表格中另一列的值来更新Excel电子表格中的商品销售成本值。同一行。

Now, before you say “Why not just run a macro?”, let me explain. No, let me sum up. The spreadsheet was from an outside vendor source, and had to be re-downloaded periodically from the same source. There were about 5000 rows of data. The reference values were based on a limited number of items (about 50 different possibilities) each of which was associated with a specific price, which could change. And, a replacement file had to be re-uploaded to the same place once her changes were made. So, having a separate file to maintain her reference prices would be easier and cleaner in the long run.

现在，在您说“为什么不运行宏？”之前，请允许我解释一下。不，我总结一下。该电子表格来自外部供应商，必须定期从同一来源重新下载。大约有5000行数据。参考值基于有限数量的商品(大约50种不同的可能性)，每个商品都与可能改变的特定价格相关联。并且，在进行更改后，必须将替换文件重新上传到同一位置。因此，从长远来看，拥有一个单独的文件来维持她的参考价格将更加容易和清洁。

As I had mentioned in my last article “Some Basics And Some Real Work Using Google App Script”, I had just finished a Python course that had the usual read-a-csv-write-json simplistic scenarios. Feeling very confident in my ability to figure out anything that came my way, I hitched up my britches and pitched in.

正如我在上一篇文章“ 使用Google App脚本的一些基础知识和实际工作 ”中提到的那样，我刚刚完成了一个Python课程，该课程具有通常的read-a-csv-write-json简单场景。我对自己能找到解决办法的能力感到非常自信，于是我搭上了马裤，伸了个懒腰。

Of course the first thing to decide was, how was I going to read the data? I asked her to save the Excel file to a flat CSV. That way I could use my newly-acquired Python skills to slurp the data in, twiddle the column, then write out a new CSV all ready for her to import into Excel. Seems pretty easy, right?

当然，要决定的第一件事是，我将如何读取数据？我要求她将Excel文件保存为平面CSV。这样，我可以使用新获得的Python技能将数据插入其中，旋转列，然后写出新的CSV文件，准备将其导入Excel。看起来很简单，对不对？

The first roadblock was when I tried to read the CSV file she had exported. Python complained that it found characters that weren’t UTF-8 encoded. After a little sniffing around on the web, I found a reference to a handy library called “chardet” to help detect what character set this thing was in. However, there were very few examples that matched my requirement, so I delved into the documentation. I spent a goodly amount of time playing with it, but the best I could do was to get a low-confidence response that the file was in Thai. Let’s just say that I went down the code page/charset rabbit hole a lot further than I intended (or should have…but it was so interesting!). There had to be an easier way.

第一个障碍是当我尝试读取她导出的CSV文件时。 Python抱怨说它发现了不是UTF-8编码的字符。经过一番网上嗅探之后，我找到了一个名为“ chardet ”的便捷库的引用，以帮助检测该字符集所在的字符集。但是，很少有符合我要求的示例，因此我深入研究了文档。我花了很多时间来玩它，但是我能做的最好的就是得到一个关于文件是泰语的低信度回复。可以说，我比预期的要差很多(或应该有……但是很有趣！)。必须有一种更简单的方法。

Next, I researched into whether there were any libraries that were specifically designed to read Excel files. It was no surprise that there were several. The first I tried was “xlrd”, and it worked great. Very simple to use, I was able to treat the incoming rows as a list with attributes, it looked promising. However, I had an issue where one of the columns I was interested in had come out as a formula rather than the actual value. Not a winner. Additionally, I could not update the data. I’d have to use a separate library called “xlwt” to create an entirely new object and populate THAT so I could write out the file. Again, there had to be an easier way.

接下来，我研究了是否有专门用于读取Excel文件的库。毫不奇怪。我尝试的第一个是“ xlrd ”，效果很好。使用非常简单，我能够将传入的行视为具有属性的列表，看起来很有希望。但是，我有一个问题，我感兴趣的一列是公式而不是实际值。不是赢家。此外，我无法更新数据。我必须使用一个单独的库“ xlwt ”来创建一个全新的对象并填充该对象，这样我才能写出该文件。同样，必须有一种更简单的方法。

I finally wound up using a library called “openpyxl”. It was possible to set an option when I opened the file to load the workbook, access the sheet and force the formulas to evaluate and supply the actual end value.

我最终使用了一个名为“ openpyxl ”的库。当我打开文件以加载工作簿，访问工作表并强制公式求值并提供实际最终值时，可以设置一个选项。

import openpyxl as xl
wb = xl.load_workbook("myFile.xlsx", data_only=True)
sheet = wb.worksheets[0]

Huzzah! It looked like a viable approach.

晕！看起来是可行的方法。

Now, after I had successfully proven to myself that I could read the spreadsheet file, I concentrated on how I was going to provide the “decode” values. As in, I had to use the value in one spreadsheet column, do a lookup, and provide a value to put into a different spreadsheet column. Sounds like it’s tailor-made for JSON.

现在，在我向自己成功证明自己可以读取电子表格文件之后，我集中于如何提供“解码”值。在这种情况下，我必须在一个电子表格列中使用该值，进行查找，然后提供一个值以放入另一个电子表格列中。听起来像是为JSON量身定制的。

Fortunately, JSON is pretty simple to read and load into a dictionary object.

幸运的是，JSON非常易于读取并加载到字典对象中。

import json
with open("cogsprices.json") as json_file:
 prices = json.load(json_file)

Here’s a subset of the entire JSON file:

这是整个JSON文件的子集：

{
  "8x8 inch / Black": 25.9,
  "8x8 inch / Cherry": 25.9,
  "8x8 inch / Natural": 25.9,
  "8x8 inch / Walnut": 25.9,
  "8x8 inch / White": 25.9,
  "10x10 inch / Black": 27.9,
  "10x10 inch / Cherry": 27.9,
  "10x10 inch / Natural": 27.9,
  "10x10 inch / Walnut": 27.9,
  "10x10 inch / White": 27.9
}

Once you have both of those objects, comparing them is a pretty simple iteration through the worksheet object.

一旦拥有了这两个对象，就可以通过工作表对象进行比较，从而比较它们。

for row in range(1, sheet.max_row):
    doStuff

In this case, I wanted to determine whether or not there was a match between the value in a specific column, and the JSON values I had read into a dictionary from the other file. Python has a handy “in” keyword for use with dictionaries to see if a key exists:

在这种情况下，我想确定特定列中的值与我从其他文件读入字典的JSON值之间是否匹配。 Python有一个方便的“ in”关键字，可与字典一起使用以查看键是否存在：

variant_name = sheet.cell(row=row, column=4).value
    if variant_name in prices and sheet.cell(row=row, column=6).value != prices[variant_name]:
        sheet.cell(row=row, column=6).value = prices[variant_name]

So, I determined I could read a spreadsheet, read a JSON file with lookup values, compare the values and update my rows in memory from the lookup. Great!

因此，我决定可以读取电子表格，读取具有查找值的JSON文件，比较这些值并通过查找更新内存中的行。大！

From a workflow standpoint, having hard-coded file names wasn’t really the best option. Keeping in mind that the lookup JSON data was supposed to be “cost of goods sold” for a set of products, it was entirely likely that the values would change over time, so I wanted to be able to have an archive. And, the spreadsheet that was supplied would likely have a different name each time it was provided. So, I wanted to be able to supply command-line arguments for both of these. It turns out there’s a wonderful Python library called “argparse” that gives you a ton of flexibility.

从工作流的角度来看，具有硬编码的文件名并不是真正的最佳选择。请记住，查找JSON数据应该是一组产品的“销售成本”，所以这些值很可能会随时间变化，因此我希望能够拥有一个存档。并且，提供的电子表格每次提供时都可能具有不同的名称。因此，我希望能够为这两个参数提供命令行参数。事实证明，有一个很棒的Python库叫做argparse ，它为您提供了极大的灵活性。

Creating a parser is pretty simple.

创建解析器非常简单。

parser = argparse.ArgumentParser(description='Add COGS to a products spreadsheet from a prices file.')

The “description” parameter supplies some text that will display if you run the script with a “-h” argument. More on that later.

如果您使用“ -h”参数运行脚本，“ description”参数将提供一些文本。以后再说。

Once you have the parser object, you can pretty much add as many arguments as you like. You have the option of setting them as required, whether there’s a default value, and a host of others (here’s a link to the documentation).

拥有解析器对象后，几乎可以添加任意数量的参数。您可以选择是否根据需要设置它们，是否有默认值以及许多其他值(这里是文档的链接)。

The first argument I wanted was to specify where to find the prices file. The statement looked like this:

我想要的第一个参数是指定在哪里找到价格文件。该语句如下所示：

parser.add_argument('-p', "--pricesfile", help='Prices file. Must be valid json.', nargs='?', default='cogsprices.json')

The first two parameters are what can be used on the command line as either a short version or a long version. Some people like to use the more verbose one for readability and self-documentation. The “help” parameter supplies some text that will display if you run the script with a “-h” argument. The “nargs=” parameter indicates that there could be zero or one value supplied. The “default” parameter supplies the value to be used if the parameter/value is not supplied on the command line.

前两个参数可以在命令行中用作短版本或长版本。有些人喜欢使用更冗长的代码来提高可读性和自我证明。如果您使用“ -h”参数运行脚本，“ help”参数将提供一些文本。 “ nargs =”参数表示可以提供零或一个值。如果命令行中未提供参数/值，则“默认”参数将提供要使用的值。

Next, I wanted to give the name of the input spreadsheet. This one would have to be supplied, not optional and no default value would be accepted.

接下来，我想输入输入电子表格的名称。必须提供此值，而不是可选的，并且不接受默认值。

parser.add_argument('-i', "--inputfile", help='Product file. Must be valid .xlsx file.', required=True)

Finally, I wanted to be able to test this without necessarily generating an output spreadsheet, so I needed a “flag” style parameter.

最后，我希望能够进行测试而不必生成输出电子表格，因此我需要一个“标志”样式参数。

parser.add_argument('-t', "--test", help='Allows you to run without generating a new file.', action="store_true")

The “action” parameter in the above tells the parser to basically hold a boolean value for later use.

上面的“ action”参数告诉解析器基本上保留一个布尔值供以后使用。

But how to access these values? First, you create an object variable to hold them.

但是如何访问这些值？首先，创建一个对象变量来保存它们。

args = parser.parse_args()

Once you have this object, you can reference the various values using “dot” notation. For example, here’s how we open the JSON file:

一旦有了该对象，就可以使用“点”表示法引用各种值。例如，这是我们打开JSON文件的方式：

with open(args.pricesfile) as json_file:

Notice that the reference after the “dot” is the same as the verbose argument name from above. Depending on what kind of value you supply, you can use these as numbers, as strings, or as booleans:

请注意，“点”之后的引用与上方的详细参数名称相同。根据您提供的值的类型，可以将它们用作数字，字符串或布尔值：

if args.test:
    doSomething
else:
    doSomethingElse

So, once I determine that this is not a test, I can save the on-memory workbook with a new name that I assemble from today’s date and time:

因此，一旦确定这不是测试，就可以使用从今天的日期和时间收集来的新名称保存内存工作簿：

newFileName = 'new_file_' + datetime.now().strftime("%Y%m%d%H%M%S") + '.xlsx'
    wb.save(filename=newFileName)

Now, since I want to be able to run this script and get an idea of how many prices are actually being updated, I put in some counters and a print statement:

现在，由于我希望能够运行此脚本并了解实际上正在更新多少价格，因此我添加了一些计数器和打印语句：

…
print(f'Starting at: {datetime.now().strftime("%m/%d/%Y %H:%M:%S")}')
…
for row in range(1, sheet.max_row):
…
    totalRowsCount += 1
    if variant_name in prices and sheet.cell(row=row, column=6).value != prices[variant_name]:
…
        changedRowsCount += 1
…
if args.test:
    testStatus='NOT saved'
else:
    newFileName = 'new_file_' + datetime.now().strftime("%Y%m%d%H%M%S") + '.xlsx'
    wb.save(filename=newFileName)
    testStatus='saved as ' + newFileName
    
print(f'Ended at: {datetime.now().strftime("%m/%d/%Y %H:%M:%S")}\nFile {testStatus}\nChanged {changedRowsCount} out of {totalRowsCount} rows.')

This is the end of the script.

脚本到此结束。

However, from a workflow standpoint, there are a couple of other things you may want to do.

但是，从工作流程的角度来看，您可能还需要做其他几件事。

When you execute this from the Linux command line with the — help argument, you get this:

在Linux命令行上使用-help参数执行此操作时，将得到以下信息：

rick@mylinuxmint: ~$python3 setCOGS.py --help
usage: setCOGS2.py [-h] [-p [PRICESFILE]] -i INPUTFILE [-t]
Add COGS to a products spreadsheet from a prices file.
optional arguments:
 -h, --help show this help message and exit
 -p [PRICESFILE], --pricesfile [PRICESFILE]
 Prices file. Must be valid json.
 -i INPUTFILE, --inputfile INPUTFILE
 Product Prices file. Must be valid .xlsx file.
 -t, --test Allows you to run without generating a new file.

If you exclude any parameters you get this:

如果排除任何参数，则会得到以下信息：

rick@mylinuxmint: ~$python3 setCOGS.py
usage: setCOGS.py [-h] [-p [PRICESFILE]] -i INPUTFILE [-t]
setCOGS.py: error: the following arguments are required: -i/ - inputfile

This is expected, since we said the — inputfile parameter is required above.

这是预料之中的，因为我们在上面说了– inputfile参数。

If you specify all the arguments, it looks like this:

如果指定所有参数，则如下所示：

rick@mylinuxmint: ~$python3 setCOGS.py -p cogsprices.json -i cogs-8–12–2020.xlsx -t
Starting at: 08/14/2020 19:11:39
Ended at: 08/14/2020 19:11:39
File NOT saved
Changed 1148 out of 4639 rows.

So it gives us when we started, when it ended, that the file was NOT saved since this is a test, and how many rows would have been changed.

因此，它使我们从开始到结束都没有保存文件，因为这是一个测试，并且将更改多少行。

Finally, when we do this for real:

最后，当我们真正执行此操作时：

rick@mylinuxmint: ~$python3 setCOGS.py -p cogsprices.json -i cogs-8–12–2020.xlsx
Starting at: 08/14/2020 19:13:26
Ended at: 08/14/2020 19:13:27
File saved as new_file_20200814191327.xlsx
Changed 1148 out of 4639 rows.

And, voila. I have a file I can send back to my girlfriend for re-uploading, and I’ve saved her a bunch of work.

还有，瞧。我有一个文件，可以发回给我的女朋友重新上传，并且为她节省了很多工作。

结论 (Conclusion)

Python is great for manipulating data, and the plethora of libraries means you probably won’t have to “re-invent the wheel” when you need something specific done. Our particular script accepts parameters, it is documented, it runs fast, and it’s simple to use. This is a classic scenario of data wrangling that turns into something we could consider “production ready”. Here’s the complete script…only 30 lines, including print statements!

Python非常适合处理数据，并且大量的库意味着您可能不需要“重新发明轮子”，而需要做一些特定的事情。我们的特定脚本接受参数，已记录在案，运行速度很快，并且易于使用。这是数据争用的经典场景，它变成了我们可以认为“生产就绪”的东西。这是完整的脚本……仅30行，包括打印语句！

#!/usr/bin/env python3
import argparse
import json
import openpyxl as xl
from datetime import datetime
parser = argparse.ArgumentParser(description='Add COGS to a products spreadsheet from a prices file.')
parser.add_argument('-p', "--pricesfile", help='Prices file. Must be valid json.', nargs='?', default='cogsprices.json')
parser.add_argument('-i', "--inputfile", help='Products file. Must be valid .xlsx file.', required=True)
parser.add_argument('-t', "--test", help='Allows you to run without generating a new file.', action="store_true")
args = parser.parse_args()
with open(args.pricesfile) as json_file:
    prices = json.load(json_file)
print(f'Starting at: {datetime.now().strftime("%m/%d/%Y %H:%M:%S")}')
changedRowsCount = 0
totalRowsCount = 0
wb = xl.load_workbook(args.inputfile, data_only=True)
sheet = wb.worksheets[0]
for row in range(1, sheet.max_row):
    variant_name = sheet.cell(row=row, column=4).value
    totalRowsCount += 1
    if variant_name in prices and sheet.cell(row=row, column=6).value != prices[variant_name]:
        sheet.cell(row=row, column=6).value = prices[variant_name]
        changedRowsCount += 1
if args.test:
    testStatus='NOT saved'
else:
    newFileName = 'new_file_' + datetime.now().strftime("%Y%m%d%H%M%S") + '.xlsx'
    wb.save(filename=newFileName)
    testStatus='saved as ' + newFileName
print(f'Ended at: {datetime.now().strftime("%m/%d/%Y %H:%M:%S")}\nFile {testStatus}\nChanged {changedRowsCount} out of {totalRowsCount} rows.')

结尾： (Coda:)

Later that evening, my lovely ex-engineer-now-artist girlfriend said “Oh, no! Oh dear! I have some price changes already, and some new products to include!”

那天晚上晚些时候，我可爱的前工程师现在的女友说：“哦，不！噢亲爱的！我已经有一些价格变化，并且包括一些新产品！”

“Not to worry, my dear…all I have to do is include them in the cogsprices.json file and run it again!”

“不用担心，亲爱的……我所要做的就是将它们包含在cogsprices.json文件中，然后再次运行！”

<将行添加到JSON文件中，重新运行脚本>

“My hero!”

“我的英雄！”

Okay, so maybe it didn’t really quite happen like that, but you get the idea. We’ve abstracted the things we KNOW will be dynamically changing so our script doesn’t have to change. Plus, we’ve included some handy help to give us a hint about how to run this script later on (after we’ve forgotten how to run it). And, in a pinch, we could even supply this script to someone else so they could do it by themselves and we wouldn’t ever have to touch it again.

好的，也许它并没有真正发生过，但是您明白了。我们已经抽象了我们知道将动态更改的内容，因此我们的脚本不必更改。另外，我们提供了一些方便的帮助，以提示我们以后如何运行此脚本(在我们忘记了如何运行它之后)。而且，在紧要关头，我们甚至可以将此脚本提供给其他人，以便他们可以自己完成此脚本，而不必再去碰它了。

Thanks for reading!

谢谢阅读！