Building an Analytics Data Pipeline in Python

If you’ve ever wanted to work with streaming data, or data that changes quickly, you may be familiar with the concept of a data pipeline. Data pipelines allow you transform data from one representation to another through a series of steps. Data pipelines are a key part of data engineering, which we teach in our new Data Engineer Path.

A common use case for a data pipeline is figuring out information about the visitors to your web site. If you’re familiar with Google Analytics, you know the value of seeing real-time and historical information on visitors. In this blog post, we’ll use data from web server logs to answer questions about our visitors.

If you’re unfamiliar, every time you visit a web page, such as the Dataquest Blog, your browser is sent data from a web server. To host this blog, we use a high-performance web server called Nginx. Here’s how the process of you typing in a URL and seeing a result works:

The process of sending a request from a web browser to a server.

First, the client sends a request to the web server asking for a certain page. The web server then loads the page from the filesystem and returns it to the client (the web server could also dynamically generate the page, but we won’t worry about that case right now). As it serves the request, the web server writes a line to a log file on the filesystem that contains some metadata about the client and the request. This log enables someone to later see who visited which pages on the website at what time, and perform other analysis.

Here are a few lines from the Nginx log for this blog:

X.X.X.X - - [09/Mar/2017:01:15:59 +0000] "GET /blog/assets/css/jupyter.css HTTP/1.1" 200 30294 "http://www.dataquest.io/blog/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36 PingdomPageSpeed/1.0 (pingbot/2.0; +http://www.pingdom.com/)"
X.X.X.X - - [09/Mar/2017:01:15:59 +0000] "GET /blog/assets/js/jquery-1.11.1.min.js HTTP/1.1" 200 95786 "http://www.dataquest.io/blog/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36 PingdomPageSpeed/1.0 (pingbot/2.0; +http://www.pingdom.com/)"
X.X.X.X - - [09/Mar/2017:01:15:59 +0000] "GET /blog/assets/js/markdeep.min.js HTTP/1.1" 200 58713 "http://www.dataquest.io/blog/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36 PingdomPageSpeed/1.0 (pingbot/2.0; +http://www.pingdom.com/)"
X.X.X.X - - [09/Mar/2017:01:15:59 +0000] "GET /blog/assets/js/index.js HTTP/1.1" 200 3075 "http://www.dataquest.io/blog/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36 PingdomPageSpeed/1.0 (pingbot/2.0; +http://www.pingdom.com/)"
X.X.X.X - - [09/Mar/2017:01:16:00 +0000] "GET /blog/atom.xml HTTP/1.1" 301 194 "-" "UniversalFeedParser/5.2.1 +https://code.google.com/p/feedparser/"
X.X.X.X - - [09/Mar/2017:01:16:01 +0000] "GET /blog/feed.xml HTTP/1.1" 200 48285 "-" "UniversalFeedParser/5.2.1 +https://code.google.com/p/feedparser/"

Each request is a single line, and lines are appended in chronological order, as requests are made to the server. The format of each line is the Nginx combined format, which looks like this internally:

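For reference, Nginx's built-in combined log format is defined roughly as follows (note that the built-in variable is spelled $http_referer; this snippet reproduces the standard Nginx definition rather than this blog's exact configuration):

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';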

Note that the log format uses variables like $remote_addr, which are later replaced with the correct value for the specific request. Here are descriptions of each variable in the log format:

  • $remote_addr – the ip address of the client making the request to the server. For the first line in the log, this is X.X.X.X (we removed the ips for privacy).
  • $remote_user – if the client authenticated with basic authentication, this is the user name. Blank in the first log line.
  • $time_local – the local time when the request was made. 09/Mar/2017:01:15:59 +0000 in the first line.
  • $request – the type of request, and the URL that it was made to. GET /blog/assets/css/jupyter.css HTTP/1.1 in the first line.
  • $status – the response status code from the server. 200 in the first line.
  • $body_bytes_sent – the number of bytes sent by the server to the client in the response body. 30294 in the first line.
  • $http_referrer – the page that the client was on before sending the current request. http://www.dataquest.io/blog/ in the first line.
  • $http_user_agent – information about the browser and system of the client. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/53.0.2785.143 Chrome/53.0.2785.143 Safari/537.36 PingdomPageSpeed/1.0 (pingbot/2.0; +http://www.pingdom.com/) in the first line.

The web server continuously adds lines to the log file as more requests are made to it. Occasionally, a web server will rotate a log file that gets too large, and archive the old data.

As you can imagine, companies derive a lot of value from knowing which visitors are on their site, and what they’re doing. For example, realizing that users who use the Google Chrome browser rarely visit a certain page may indicate that the page has a rendering issue in that browser. Another example is in knowing how many users from each country visit your site each day. It can help you figure out what countries to focus your marketing efforts on. At the simplest level, just knowing how many visitors you have per day can help you understand if your marketing efforts are working properly.

In order to calculate these metrics, we need to parse the log files and analyze them. In order to do this, we need to construct a data pipeline.

Thinking About The Data Pipeline

Here’s a simple example of a data pipeline that calculates how many visitors have visited the site each day:

Getting from raw logs to visitor counts per day.

As you can see above, we go from raw log data to a dashboard where we can see visitor counts per day. Note that this pipeline runs continuously – when new entries are added to the server log, it grabs them and processes them. There are a few things you’ve hopefully noticed about how we structured the pipeline:

  • Each pipeline component is separated from the others, and takes in a defined input, and returns a defined output.
    • Although we don’t show it here, those outputs can be cached or persisted for further analysis.
  • We store the raw log data to a database. This ensures that if we ever want to run a different analysis, we have access to all of the raw data.
  • We remove duplicate records. It’s very easy to introduce duplicate data into your analysis process, so deduplicating before passing data through the pipeline is critical.
  • Each pipeline component feeds data into another component. We want to keep each component as small as possible, so that we can individually scale pipeline components up, or use the outputs for a different type of analysis.

Now that we’ve seen how this pipeline looks at a high level, let’s implement it in Python.

Processing And Storing Webserver Logs

In order to create our data pipeline, we’ll need access to webserver log data. We created a script that will continuously generate fake (but somewhat realistic) log data. Here’s how to follow along with this post:

  • Clone this repo.
  • Follow the README to install the Python requirements.
  • Run python log_generator.py.

After running the script, you should see new entries being written to log_a.txt in the same folder. After 100 lines are written to log_a.txt, the script will rotate to log_b.txt. It will keep switching back and forth between files every 100 lines.

Once we’ve started the script, we just need to write some code to ingest (or read in) the logs. The script will need to:

  • Open the log files and read from them line by line.
  • Parse each line into fields.
  • Write each line and the parsed fields to a database.
  • Ensure that duplicate lines aren’t written to the database.

The code for this is in the store_logs.py file in this repo if you want to follow along.

In order to achieve our first goal, we can open the files and keep trying to read lines from them.

The below code will:

  • Open both log files in reading mode.
  • Loop forever.
    • Figure out where the current character being read for both files is (using the tell method).
    • Try to read a single line from both files (using the readline method).
    • If neither file had a line written to it, sleep for a bit then try again.
      • Before sleeping, set the reading point back to where we were originally (before calling readline), so that we don’t miss anything (using the seek method).
    • If one of the files had a line written to it, grab that line. Recall that only one file can be written to at a time, so we can’t get lines from both files.

f_a = open(LOG_FILE_A, 'r')
f_b = open(LOG_FILE_B, 'r')
while True:
    where_a = f_a.tell()
    line_a = f_a.readline()
    where_b = f_b.tell()
    line_b = f_b.readline()

    if not line_a and not line_b:
        time.sleep(1)
        f_a.seek(where_a)
        f_b.seek(where_b)
        continue
    else:
        if line_a:
            line = line_a
        else:
            line = line_b

Once we’ve read in the log file, we need to do some very basic parsing to split it into fields. We don’t want to do anything too fancy here – we can save that for later steps in the pipeline. You typically want the first step in a pipeline (the one that saves the raw data) to be as lightweight as possible, so it has a low chance of failure. If this step fails at any point, you’ll end up missing some of your raw data, which you can’t get back!

In order to keep the parsing simple, we’ll just split on the space ( ) character then do some reassembly:

Parsing log files into structured fields.

In the below code, we:

  • Take a single log line, and split it on the space character ( ).
  • Extract all of the fields from the split representation.
    • Note that some of the fields won’t look “perfect” here – for example the time will still have brackets around it.
  • Initialize a created variable that stores when the database record was created. This will enable future pipeline steps to query data.
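
A minimal sketch of that parsing step might look like the following; the helper name parse_line and the exact field handling are assumptions based on the description above, so treat this as an illustration of the split-and-reassemble idea rather than the actual code in store_logs.py:

from datetime import datetime

def parse_line(line):
    # Split the raw combined-format line on the space character.
    split_line = line.split(" ")
    if len(split_line) < 12:
        return []
    remote_addr = split_line[0]
    # Reassemble the timestamp; it will still have brackets around it at this stage.
    time_local = split_line[3] + " " + split_line[4]
    request_type = split_line[5]
    request_path = split_line[6]
    status = split_line[8]
    body_bytes_sent = split_line[9]
    http_referer = split_line[10]
    http_user_agent = " ".join(split_line[11:])
    # Record when this database row is being created, so later steps can query by time.
    created = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return [remote_addr, time_local, request_type, request_path,
            status, body_bytes_sent, http_referer, http_user_agent, created]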

We also need to decide on a schema for our SQLite database table and run the needed code to create it. Because we want this component to be simple, a straightforward schema is best. We’ll use the following query to create the table:

CREATE TABLE IF NOT EXISTS logs (
  raw_log TEXT NOT NULL UNIQUE,
  remote_addr TEXT,
  time_local TEXT,
  request_type TEXT,
  request_path TEXT,
  status INTEGER,
  body_bytes_sent INTEGER,
  http_referer TEXT,
  http_user_agent TEXT,
  created DATETIME DEFAULT CURRENT_TIMESTAMP
)

Note how we ensure that each raw_log is unique, so we avoid duplicate records. Also note how we insert all of the parsed fields into the database along with the raw log. There’s an argument to be made that we shouldn’t insert the parsed fields, since we can easily compute them again. However, adding them to fields makes future queries easier (we can select just the time_local column, for instance), and it saves computational effort down the line.

Keeping the raw log helps us in case we need some information that we didn’t extract, or if the ordering of the fields in each line becomes important later. For these reasons, it’s always a good idea to store the raw data.

Finally, we’ll need to insert the parsed records into the logs table of a SQLite database. Choosing a database to store this kind of data is very critical. We picked SQLite in this case because it’s simple, and stores all of the data in a single file. If you’re more concerned with performance, you might be better off with a database like Postgres.

In the below code, we:

  • Connect to a SQLite database.
  • Instantiate a cursor to execute queries.
  • Put together all of the values we’ll insert into the table (parsed is a list of the values we parsed earlier)
  • Insert the values into the database.
  • Commit the transaction so it writes to the database.
  • Close the connection to the database.
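
A sketch of that step follows; DB_NAME and write_to_database are assumed names, and INSERT OR IGNORE is one way to let the UNIQUE constraint on raw_log silently skip duplicates (the actual store_logs.py may handle this differently):

import sqlite3

DB_NAME = "db.sqlite"  # assumed database file name

def write_to_database(line, parsed):
    conn = sqlite3.connect(DB_NAME)
    cur = conn.cursor()
    # The raw log line plus the parsed fields, in the column order of the schema above.
    args = [line] + parsed
    cur.execute("INSERT OR IGNORE INTO logs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", args)
    conn.commit()
    conn.close()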

We just completed the first step in our pipeline! Now that we have deduplicated data stored, we can move on to counting visitors.

Enjoying this post? Learn data science with Dataquest!

  • Learn from the comfort of your browser.
  • Work with real-life data sets.
  • Build a portfolio of projects.

Counting Visitors With A Data Pipeline

We can use a few different mechanisms for sharing data between pipeline steps:

  • Files
  • Databases
  • Queues

In each case, we need a way to get data from the current step to the next step. If we point our next step, which is counting ips by day, at the database, it will be able to pull out events as they’re added by querying based on time. Although we’ll gain more performance by using a queue to pass data to the next step, performance isn’t critical at the moment.

We’ll create another file, count_visitors.py, and add in some code that pulls data out of the database and does some counting by day.

We’ll first want to query data from the database. In the below code, we:

  • Connect to the database.
  • Query any rows that have been added after a certain timestamp.
  • Fetch all the rows.

def get_lines(time_obj):
    conn = sqlite3.connect(DB_NAME)
    cur = conn.cursor()
    cur.execute("SELECT remote_addr,time_local FROM logs WHERE created > ?", [time_obj])
    resp = cur.fetchall()
    return resp

We then need a way to extract the ip and time from each row we queried. The below code will:

  • Initialize two empty lists.
  • Pull out the time and ip from the query response and add them to the lists.
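
A sketch of that extraction step (the real count_visitors.py may organize this slightly differently):

def get_time_and_ip(lines):
    ips = []
    times = []
    for line in lines:
        ips.append(line[0])                # remote_addr column
        times.append(parse_time(line[1]))  # time_local column, parsed into a datetime
    return ips, times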

You may note that we parse the time from a string into a datetime object in the above code. The code for the parsing is below:

def parse_time(time_str):
    time_obj = datetime.strptime(time_str, '[%d/%b/%Y:%H:%M:%S %z]')
    return time_obj

Once we have the pieces, we just need a way to pull new rows from the database and add them to an ongoing visitor count by day. The below code will:

  • Get the rows from the database based on a given start time to query from (we get any rows that were created after the given time).
  • Extract the ips and datetime objects from the rows.
  • If we got any lines, assign start time to be the latest time we got a row. This prevents us from querying the same row multiple times.
  • Create a key, day, for counting unique ips.
  • Add each ip to a set that will only contain unique ips for each day.
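
A sketch of that logic, assuming unique_ips and counts are dictionaries defined at the top of count_visitors.py and get_lines, get_time_and_ip, and parse_time are the functions shown above (the bookkeeping around start_time is an assumption based on the description):

unique_ips = {}
counts = {}

def count_visitors(start_time):
    lines = get_lines(start_time)
    ips, times = get_time_and_ip(lines)
    if len(times) > 0:
        # Only query rows newer than the latest one we've seen.
        start_time = times[-1]
    for ip, time_obj in zip(ips, times):
        day = time_obj.strftime("%d-%m-%Y")  # key for grouping unique ips by day
        if day not in unique_ips:
            unique_ips[day] = set()
        unique_ips[day].add(ip)
    return start_time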

This code will ensure that unique_ips will have a key for each day, and the values will be sets that contain all of the unique ips that hit the site that day.

After sorting out ips by day, we just need to do some counting. In the below code, we:

  • Assign the number of visitors per day to counts.
  • Extract a list of tuples from counts.
  • Sort the list so that the days are in order.
  • Print out the visitor counts per day.

for k, v in unique_ips.items():
    counts[k] = len(v)

count_list = counts.items()
count_list = sorted(count_list, key=lambda x: x[0])
for item in count_list:
    print("{}: {}".format(*item))

We can then take the code snippets from above so that they run every 5 seconds:

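A sketch of that wrapper, assuming count_visitors returns the updated start time as in the sketch above and start_time has been initialized earlier:

import time

while True:
    start_time = count_visitors(start_time)
    # Recompute and print the per-day counts from the snippets above.
    for k, v in unique_ips.items():
        counts[k] = len(v)
    count_list = sorted(counts.items(), key=lambda x: x[0])
    for item in count_list:
        print("{}: {}".format(*item))
    time.sleep(5)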

Pulling The Pipeline Together

We’ve now taken a tour through a script to generate our logs, as well as two pipeline steps to analyze the logs. In order to get the complete pipeline running:

  • Clone the analytics_pipeline repo from Github if you haven’t already.
  • Follow the README.md file to get everything setup.
  • Execute log_generator.py.
  • Execute store_logs.py.
  • Execute count_visitors.py.

After running count_visitors.py, you should see the visitor counts for the current day printed out every 5 seconds. If you leave the scripts running for multiple days, you’ll start to see visitor counts for multiple days.

Congratulations! You’ve setup and run a data pipeline. Let’s now create another pipeline step that pulls from the database.

Adding Another Step To The Data Pipeline

One of the major benefits of having the pipeline be separate pieces is that it’s easy to take the output of one step and use it for another purpose. Instead of counting visitors, let’s try to figure out how many people who visit our site use each browser. This will make our pipeline look like this:

We now have one pipeline step driving two downstream steps.

As you can see, the data transformed by one step can be the input data for two different steps. If you want to follow along with this pipeline step, you should look at the count_browsers.py file in the repo you cloned.

In order to count the browsers, our code remains mostly the same as our code for counting visitors. The main difference is in us parsing the user agent to retrieve the name of the browser. In the below code, you’ll notice that we query the http_user_agent column instead of remote_addr, and we parse the user agent to find out what browser the visitor was using:

def get_lines(time_obj):
    conn = sqlite3.connect(DB_NAME)
    cur = conn.cursor()
    cur.execute("SELECT time_local,http_user_agent FROM logs WHERE created > ?", [time_obj])
    resp = cur.fetchall()
    return resp

def get_time_and_ip(lines):
    browsers = []
    times = []
    for line in lines:
        times.append(parse_time(line[0]))
        browsers.append(parse_user_agent(line[1]))
    return browsers, times

def parse_user_agent(user_agent):
    browsers = ["Firefox", "Chrome", "Opera", "Safari", "MSIE"]
    for browser in browsers:
        if browser in user_agent:
            return browser
    return "Other"

We then modify our loop to count up the browsers that have hit the site:

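That change might look roughly like this, reusing the structure of the visitor-counting loop (browser_counts and count_browsers are assumed names; the repo's count_browsers.py may differ):

browser_counts = {}

def count_browsers(start_time):
    lines = get_lines(start_time)
    browsers, times = get_time_and_ip(lines)
    if len(times) > 0:
        start_time = times[-1]
    for browser in browsers:
        if browser not in browser_counts:
            browser_counts[browser] = 0
        browser_counts[browser] += 1  # count every request made with that browser
    return start_time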

Once we make those changes, we’re able to run python count_browsers.py to count up how many browsers are hitting our site.

We’ve now created two basic data pipelines, and demonstrated some of the key principles of data pipelines:

  • Making each step fairly small.
  • Passing data between pipelines with defined interfaces.
  • Storing all of the raw data for later analysis.

Extending Data Pipelines

After this post, you should understand how to create a basic data pipeline. In the next post(s), we’ll cover some of the more advanced data pipeline topics, such as:

  • Handling errors
  • Creating redundancy
  • Scaling components
  • Increasing throughput
  • Adding more complex steps

In the meantime, feel free to extend the pipeline we implemented. Here are some ideas:

  • Can you make a pipeline that can cope with much more data? What if log messages are generated continuously?
  • Can you geolocate the IPs to figure out where visitors are?
  • Can you figure out what pages are most commonly hit?

Translated from: https://www.pybloggers.com/2017/03/building-an-analytics-data-pipeline-in-python/
