Data Processing Pipeline for a Personal Memex

“Wholly new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.” — Vannevar Bush

Introduction

I’m not really sure what Vannevar Bush meant by “amplification”, so what I’m going to present is just a guess. I think it’s a good guess, though, because an important aspect of a personal memex is that it’s a tool for reifying implicit parts of information and exposing them as concrete structures. The explicit structures can then be queried, viewed, modified, recombined, etc. to create higher-level structures. Creating higher-level structures isn’t possible if the lower-level structures are not explicit, so to “amplify” data / information / knowledge one must first make its implicit structure explicit and convert it into concrete data structures amenable to computational analysis.

In this post we are going to take some data (basically a folder of files) and “amplify” its implicit aspects with CouchDB and Apache Tika so that we can query it in ways that would not have been possible otherwise; in other words, we are going to amplify our data and make its implicit structures explicit.

This post is also available as a Notion document with better code layout, so if you want to see the code I recommend viewing it there.

If you’re following along with the series of posts on building a personal memex, this is the second post in that series. The first one is here.

A Pipeline for Amplification

We’re going to need some code but not that much, because Apache Tika will do most of the heavy lifting; all we need to do is coordinate the process and re-structure the output. From a high level, we’re going to take a file (any file that Apache Tika can process) and extract the content that is implicit in it. Implicit here just means it’s not readily accessible: if you have a PNG image with some text in it, the text is obvious to you, but it’s not expressed in a machine-readable way as a text file, which is a long-winded way of saying the text is “implicit” and not amenable to programmatic manipulation. So we’re going to extract the implicit data and put it alongside the original file so that it can be accessed and amplified by other parts and components of the memex.

The code itself is not very complicated: it’s a simple pipeline that takes a file, passes it to Apache Tika to extract the metadata and text content, and then writes all the pieces to disk as either plain text documents or JSON files. I’ll present the code in chunks along with an explanation of what is happening. The goal isn’t to make everything perfectly understandable but to show you enough that if you wanted to build your own script you could use what I presented as a starting point.

So with that out of the way let’s get to the code.

Amplifying Data

There is some basic configuration boilerplate that we need to get out of the way so let’s do that first.

require 'json'
require 'pry'
require_relative './lib/config.rb'
require 'digest'
require 'fileutils'
require 'tmpdir' # Needed for Dir::mktmpdir further down.

# Grab the configuration.
configuration = MemexConfiguration::Config
couchdb_config = configuration["couchdb"]
couchdb_auth = "#{couchdb_config["username"]}:#{couchdb_config["password"]}"
couchdb_url = "#{couchdb_auth}@#{couchdb_config["url"]}"
couchdb_db = couchdb_config["database"]
db_url = "#{couchdb_url}/#{couchdb_db}"
storage = configuration["storage"]["root"]
processed_location = configuration["storage"]["processed"]
tika_config = configuration["tika"]
metadata_url = File.join(tika_config["url"], tika_config["metadata"]["url"])
metadata_headers = tika_config["metadata"]["headers"].map do |(k, v)|
  "--header '#{k.capitalize}: #{v}'"
end.join(' ')
text_url = File.join(tika_config["url"], tika_config["text"]["url"])
text_headers = tika_config["text"]["headers"].map do |(k, v)|
  "--header '#{k.capitalize}: #{v}'"
end.join(' ')

What we’re doing here is extracting all the configuration parameters we need because the rest of the code is going to use them in one way or another. At some point I will go back and refactor the pieces and hide / encapsulate the details better, but having it explicit here is helpful because it’s easy to see the major components relevant to this part of the pipeline. The major pieces are the variables for working with the file system (storage, processed_location) and the URLs for talking to CouchDB (db_url) and Apache Tika (metadata_url, text_url).
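
I haven’t shown lib/config.rb here, but to make the rest of the code easier to follow, below is a hypothetical sketch of the structure that MemexConfiguration::Config is assumed to hold. Every value is an example only (the host names, ports, credentials, and endpoint paths depend on where your CouchDB and Tika servers actually run), but the keys match the ones the script reads above.

module MemexConfiguration
  # Hypothetical configuration. All values below are placeholders; adjust them to your setup.
  Config = {
    "couchdb" => {
      "username" => "admin",            # example credentials
      "password" => "secret",
      "url"      => "localhost:5984",   # scheme-less so "user:password@" can be prepended for curl
      "database" => "memex"
    },
    "storage" => {
      "root"      => "/home/me/memex/inbox",     # where new files land and get picked up
      "processed" => "/home/me/memex/processed"  # where the amplified pieces end up
    },
    "tika" => {
      "url"      => "http://localhost:9998",  # Tika server
      "metadata" => { "url" => "meta", "headers" => { "accept" => "application/json" } },
      "text"     => { "url" => "tika", "headers" => { "accept" => "text/plain" } }
    }
  }
end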

Now that we have the required configuration parameters we can start processing some files from the configured storage location.

Dir[File.join(storage, '*')].each do |memex_note|
  next if File.directory?(memex_note)
  # Much of what we do relies on the SHA256 digest so we compute it ahead of everything else.
  digest = Digest::SHA2.file(memex_note)
  digest << memex_note
  digest = digest.to_s
  file_stat = File::Stat.new(memex_note)
  # We need a temporary directory as the working directory for the files we will generate.
  tmpdir = Dir::mktmpdir("tmp-#{digest}", storage)
  begin
    # ... (the processing steps shown below go here)
  rescue StandardError => e
    STDERR.puts "There was an error when processing #{memex_note}: #{e.message}. Processed files are in #{tmpdir}."
  end
end

What we’re doing here is iterating through all the files that are stored in a specific location (storage) and calculating each file's SHA256 digest along with extracting file stats and creating a temporary directory where we will be doing all the work. The temporary directory is useful because we want easy access to all the files we will be generating in case something goes wrong. It's easier to debug issues when each sub-part of the pipeline is cleanly isolated in a specific folder.

Next we are going to extract metadata and text from the file and store them in the temporary folder. If nothing goes wrong then we will move the files to their permanent home, index the data with CouchDB, and clean up the temporary directory.

# Copy the file to the temporary folder.
staging_path = File.join(tmpdir, digest)
FileUtils.cp(memex_note, staging_path)
# Grab the metadata, write it to disk, and then parse the JSON because we need it
# for generating the CouchDB document.
metadata = `curl -s -T #{staging_path} #{metadata_url} #{metadata_headers}`
File.open("#{staging_path}.metadata", "w") { |f| f.write(metadata) }
metadata = JSON.parse(metadata)
# Grab the text content and write it to the temporary directory.
text = `curl -s -T #{staging_path} #{text_url} #{text_headers}`
File.open("#{staging_path}.txt", "w") { |f| f.write(text) }
# We now have everything we need to generate the CouchDB document so let's make it.
couchdb_document = {
  _id: digest,
  year: file_stat.ctime.year,
  month: file_stat.ctime.month,
  day: file_stat.ctime.day,
  content_type: metadata["Content-Type"],
  file_name: File.basename(memex_note),
  file_type: `file #{staging_path}`.strip.split(': ')[-1]
}
# Write the CouchDB document to disk.
File.open("#{staging_path}.couchdb.json", "w") { |f| f.write(couchdb_document.to_json) }
# At this point we have everything we need as far as CouchDB and files are concerned so
# we use `rsync` to move the files into the processed folder, add the document to CouchDB,
# and finally clean up the temporary folder.
`rsync -rtza #{File.join(tmpdir, '/')} #{processed_location}`
if $?.exitstatus > 0
  STDERR.puts "Something went wrong during rsync processing of #{tmpdir}. Leaving the folder intact and moving on."
  # Do not block on errors. Just move on and process the next note.
  next
end
# At this point all the files are in place so we just need to put the document
# in the database and delete the temporary folder.
couchdb_response = JSON.parse(`curl -s -X PUT #{File.join(db_url, digest)} --data @#{staging_path}.couchdb.json`)
if !couchdb_response["ok"]
STDERR.puts "Something went wrong when uploading the document from #{staging_path} to CouchDB: #{couchdb_response.to_json}"
# If something goes wrong we leave the folder as is and move on to the next item.
# Usually it is because of conflicts so the easiest way to fix the problem is to delete the existing entry
# and let the indexing script re-try.
next
end
# If everything was successful then clean up the temporary directory and delete the original file.
FileUtils.rm_f(memex_note)
FileUtils.remove_entry(tmpdir)

That’s a lot of code in one go so I’ll just highlight the most important parts.

Extracting the metadata:

# Grab the metadata, write it to disk, and then parse the JSON because we need it
# for generating the CouchDB document.
metadata = `curl -s -T #{staging_path} #{metadata_url} #{metadata_headers}`
File.open("#{staging_path}.metadata", "w") { |f| f.write(metadata) }

Extracting the text with Tesseract through Apache Tika:

# Grab the text content and write it to the temporary directory.
text = `curl -s -T #{staging_path} #{text_url} #{text_headers}`
File.open("#{staging_path}.txt", "w") { |f| f.write(text) }

Generating the CouchDB document:

# We now have everything we need to generate the CouchDB document so let's make it.
couchdb_document = {
  _id: digest,
  year: file_stat.ctime.year,
  month: file_stat.ctime.month,
  day: file_stat.ctime.day,
  content_type: metadata["Content-Type"],
  file_name: File.basename(memex_note),
  file_type: `file #{staging_path}`.strip.split(': ')[-1]
}
# Write the CouchDB document to disk.
File.open("#{staging_path}.couchdb.json", "w") { |f| f.write(couchdb_document.to_json) }

The rest of the code is just doing basic error handling and then moving files around and cleaning up the temporary directory where we did all the work.

And that’s pretty much it. After this script runs, each file is amplified into 3 + 1 pieces (the +1 because we also keep a copy of the original file alongside the amplified pieces). Each piece is a structure that reifies an implicit aspect of the original file: metadata, text content, file stats, content type, etc.
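
Concretely, if the processed location is /home/me/memex/processed (an example path from the configuration sketch earlier), a single note leaves behind something like the following, plus the matching document in CouchDB keyed by the same digest:

/home/me/memex/processed/<digest>                  # the copy of the original file
/home/me/memex/processed/<digest>.metadata         # raw Tika metadata (JSON)
/home/me/memex/processed/<digest>.txt              # extracted text content
/home/me/memex/processed/<digest>.couchdb.json     # the document we uploaded to CouchDB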

Summary

So what did we accomplish? We built a basic pipeline with the help of Ruby, CouchDB, and Apache Tika for amplifying our data with structures that were implicit and not amenable to easy programmatic access and manipulation. By making the implicit structure explicit and indexing parts of it with CouchDB we have given ourselves the pieces necessary for building higher level structures that would not have been possible otherwise.
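
As a small taste of what the explicit structure buys us, here is a minimal query sketch against the documents we just created. It assumes CouchDB 2.x or newer (which is when the Mango /_find endpoint was added), reuses the db_url variable from the configuration section, and uses illustrative selector values; the selector fields are exactly the ones the script writes. Without a matching Mango index CouchDB falls back to a full scan (and warns you), which is fine at personal-memex scale.

# Find all PNG notes from July 2020 (illustrative values only).
query = { selector: { year: 2020, month: 7, content_type: "image/png" } }.to_json
response = JSON.parse(`curl -s -X POST #{File.join(db_url, "_find")} --header 'Content-Type: application/json' --data '#{query}'`)
response["docs"].each { |doc| puts doc["file_name"] }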

In the next post I will go over another aspect of the personal memex pipeline that integrates the data we extracted / amplified into a form that MeiliSearch understands. MeiliSearch will allow us to search the amplified data so that we can start thinking about how to build associative trails by leveraging MeiliSearch as a fuzzy search engine.

Translated from: https://medium.com/@davidk01/data-processing-pipeline-for-a-personal-memex-c9b2725cd02
