oracle导入银行对账单_BankClassify：银行对帐单条目的简单自动分类

最新推荐文章于 2024-06-15 09:35:52 发布

cumei1658

最新推荐文章于 2024-06-15 09:35:52 发布

阅读量438

点赞数

文章标签： python 机器学习人工智能深度学习 java

原文链接：https://www.pybloggers.com/2018/05/bankclassify-simple-automatic-classification-of-bank-statement-entries/

版权

oracle导入银行对账单

This is another entry in my ‘Previously Unpublicised Code’ series – explanations of code that has been sitting on my Github profile for ages, but has never been discussed publicly before. This time, I’m going to talk about BankClassify a tool for classifying transactions on bank statements into categories like Supermarket, Eating Out and Mortgage automatically. It is an interactive command-line application that looks like this:

这是我的“以前未公开的代码”系列中的另一个条目–多年来，我一直在我的Github个人资料中使用对代码的解释，但以前从未公开讨论过。这次，我将讨论BankClassify这个工具，该工具可将银行对帐单上的交易自动分类为Supermarket，Eating Out和Mortgage等类别。它是一个交互式命令行应用程序，如下所示：

For each entry in your bank statement, it will guess a category, and let you correct it if necessary – learning from your corrections.

对于银行对帐单中的每个条目，它将猜测一个类别，并在必要时让您更正-从更正中学习。

I’ve been using this tool for a number of years now, as I never managed to find another tool that did quite what I wanted. I wanted to have an interactive classification process where the computer guessed a category for each transaction but you could correct it if it got it wrong. I also didn’t want to be restricted in what I could do with the data once I’d categorised it – I wanted a simple CSV output, so I could just analyse it using pandas. BankClassify meets all my needs.

我已经使用该工具很多年了，因为我一直没有找到另一个可以满足我要求的工具。我希望有一个交互式的分类过程，在该过程中，计算机会为每笔交易猜测一个类别，但是如果输入错误，则可以进行更正。我也不想在对数据进行分类后就受到限制，我想要一个简单的CSV输出，因此我可以使用熊猫进行分析。 BankClassify满足我的所有需求。

If you want to use BankClassify as it is written at the moment then you’ll need to be banking with Santander – as it can only important text-format data files downloaded from Santander Online Banking at the moment. However, if you’ve got a bit of Python programming ability (quite likely if you’re reading this blog) then you can write another file import function, and use the rest of the module as-is. To get going, just look at the README in the repository.

如果您想使用目前编写的BankClassify，则需要使用Santander进行银行业务-因为它目前只能从Santander Online Banking下载重要的文本格式数据文件。但是，如果您具有Python编程能力（如果您正在阅读此博客，则很有可能），则可以编写另一个文件导入功能，并按原样使用模块的其余部分。要开始，只需查看资源库中的自述文件。

So, how does this work? Well it uses a Naive Bayesian classifier – a very simple machine learning tool that is often used for spam filtering (see this excellent article by Paul Graham introducing its use for spam filtering). It simply splits text into tokens (more on this later) and uses training data to calculate probabilities that text containing each specific token belongs in each category. The term ‘naive’ is used because of various naive, and probably incorrect, assumptions which are made about independence between features, using a uniform prior distribution and so on.

那么，这是如何工作的呢？好吧，它使用了朴素的贝叶斯分类器-一个非常简单的机器学习工具，通常用于垃圾邮件过滤（请参阅Paul Graham的这篇出色的文章，介绍垃圾邮件过滤的用法）。它只是将文本拆分为标记（稍后会详细介绍），并使用训练数据来计算包含每个特定标记的文本属于每个类别的概率。使用“天真”一词是因为各种天真的假设，而且这些假设可能是错误的，这些假设是关于要素之间的独立性，使用统一的先验分布等的。

Creating a Naive Bayesian classifier in Python is very easy, using the textblob package. There is a great tutorial on building a classifier using textblob here, but I’ll run quickly through my code anyway:

使用textblob包，在Python中创建朴素贝叶斯分类器非常容易。有使用textblob建立一个分类有很大的教程在这里，但我会很快通过我的代码一定要运行：

First we load all the previous data from the aptly-named AllData.csv file, and pass it to the _get_training function to get the training data from this file in a format acceptable to textblob. This is basically a list of tuples, each of which contains (text, classification). In our case, the text is the description of the transaction from the bank statement, and the classification is the category that we want to assign it to. For example ("CARD PAYMENT TO SHELL TOTHILL,2.04 GBP, RATE 1.00/GBP ON 29-08-2013", "Petrol"). We use the _extractor function to split the text into tokens and generate ‘features’ from these tokens. In our case this is simply a function that splits the text by either spaces or the '/' symbol, and creates a boolean feature with the value True for each token it sees.

首先，我们从恰当命名的AllData.csv文件中加载所有先前的数据，并将其传递给_get_training函数，以使用textblob可接受的格式从该文件中获取训练数据。这基本上是一个元组列表，每个元组都包含（文本，分类） 。在我们的例子中，文本是银行对帐单中交易的描述，分类是我们要为其分配的类别。例如（“向卡托尔支付信用卡付款，2.04英镑，2013年8月29日汇率为1.00 / GBP”，“汽油”） 。我们使用_extractor函数将文本拆分为标记，并根据这些标记生成“功能”。在我们的例子中，这只是一个函数，它用空格或'/'符号分割文本，并为它看到的每个标记创建一个布尔值，其值为True 。

Now we’ve got the classifier, we read in the new data (_read_santander_file) and the list of categories (_read_categories) and then get down to the classification (_ask_with_guess). The classification just calls the classifier.classify method, giving it the text to classify. We then do a bit of work to nicely display the list of categories (I use colorama to do nice fonts and colours in the terminal) and ask the user whether the guess is correct. If it is, then we just save the category to the output file – but if it isn’t we call the classifier.update function with the correct tuple of (text, classification), which will update the probabilities used within the classifier to take account of this new information.

现在我们有了分类器，我们读入新数据（ _read_santander_file ）和类别列表（ _read_categories ），然后进入分类（ _ask_with_guess ）。分类仅调用classifier.classify方法，并为其提供要分类的文本。然后，我们做了一些工作以很好地显示类别列表（我使用colorama在终端中创建漂亮的字体和颜色），并询问用户猜测是否正确。如果是，那么我们只将类别保存到输出文件中；但是如果不是，我们使用正确的元组（text，classification）调用classifier.update函数，这将更新分类器中使用的概率此新信息的帐户。

That’s pretty-much it – all of the rest of the code is just plumbing that joins all of this together. This just shows how easy it is to produce a useful tool, using a simple machine learning technique.

差不多了–其余所有代码只是将所有这些结合在一起。这只是说明使用简单的机器学习技术来生产有用的工具是多么容易。

Just as a brief aside, you can do interesting things with the classifier object, like ask it to tell you what the most informative features are:

除了简要介绍之外，您还可以使用分类器对象做一些有趣的事情，例如询问它告诉您最有用的功能是什么：

Most Informative Features
                      IN = True           Cheque : nan    =      6.5 : 1.0
              UNIVERSITY = True           Cheque : nan    =      6.5 : 1.0
                 PAYMENT = None           Cheque : nan    =      6.5 : 1.0
                COWHERDS = True           Eating : nan    =      6.5 : 1.0
                    CARD = None           Cheque : nan    =      6.5 : 1.0
                  CHEQUE = True           Cheque : nan    =      6.5 : 1.0
        TICKETOFFICESALE = True           Travel : nan    =      6.5 : 1.0
             SOUTHAMPTON = True           Cheque : nan    =      6.5 : 1.0
                   CRAFT = True            Craft : nan    =      4.3 : 1.0
                     LTD = True            Craft : nan    =      4.3 : 1.0
                   HOBBY = True            Craft : nan    =      4.3 : 1.0
                    RATE = None           Cheque : nan    =      2.8 : 1.0
                     GBP = None           Cheque : nan    =      2.8 : 1.0
              SAINSBURYS = True           Superm : nan    =      2.6 : 1.0
                WAITROSE = True           Superm : nan    =      2.6 : 1.0