Lessons learned while processing Wikipedia with Apache Spark


by Siddhesh Rane



Apache Spark is an open-source fault-tolerant cluster-computing framework that also supports SQL analytics, machine learning, and graph processing.


It works by splitting your data into partitions, and then processing those partitions in parallel on all the nodes in the cluster. If any node goes down, it reassigns the task of that node to a different node and hence provides fault tolerance.


Being up to 100x faster than Hadoop MapReduce for some workloads has made it hugely popular for Big Data processing. Spark is written in Scala and runs on the JVM, but the good news is that it also provides APIs for Python and R, as well as C#. It is well documented, with examples that you should check out.


When you are ready to give it a try, this article will guide you from download and setup through to performance tuning. My tiny Spark cluster performed 100 million string matches over all the articles in Wikipedia — in less than two hours.


It’s when you get past the tutorials and do some serious work that you realize all the hassles of the tech stack you are using. Learning through mistakes is the best way to learn. But sometimes you are just short on time and wish you knew every possible thing that could go wrong.


Here, I describe some of the problems that I faced when starting with Spark, and how you can avoid them.


How to get started

Download the Spark binary that comes with packaged Hadoop dependencies

If you set out to download Spark, you'll notice that there are various binaries available for the same version. Spark advertises that it does not need Hadoop, so you might download the user-provided-hadoop version, which is smaller in size. Don't do that.


Although Spark does not use Hadoop’s MapReduce framework, it does have dependencies on other Hadoop libraries like HDFS and YARN. The without-hadoop version is for when you already have Hadoop libraries provided elsewhere.
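If in doubt, grab the build that bundles Hadoop. As a rough sketch, fetching and unpacking it from the Apache archive looks like this (the version and mirror below are only examples; pick whatever release is current):

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz
cd spark-2.3.0-bin-hadoop2.7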


Use the standalone cluster mode, not Mesos or YARN

Once you have tested the built-in examples on a local cluster, and ensured that everything is installed and working properly, proceed to set up your cluster.
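For example, the distribution ships a run-example helper, and spark-submit can target a local master; something along these lines should work (the class and jar in the second command are placeholders for your own app):

# run a bundled example on the local machine
bin/run-example SparkPi 10

# or run your own jar locally on all cores
bin/spark-submit --master "local[*]" --class com.example.MyApp myapp.jar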


Spark gives you three options: Mesos, YARN, and standalone.


The first two are resource allocators which control your replica nodes. Spark has to request them to allocate its own instances. As a beginner, don’t increase your complexity by going that way.


The standalone cluster is the easiest to set up. It comes with sensible defaults, like using all your cores for executors. It is part of the Spark distribution itself and has a sbin/start-all.sh script that can bring up the primary as well as all the replicas listed in conf/slaves using ssh.
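As a rough sketch of that setup (the hostnames are placeholders, and the primary needs passwordless ssh access to each replica):

# conf/slaves on the primary: one replica hostname or IP per line
node1
node2
node3

# bring up the primary plus every replica listed in conf/slaves
sbin/start-all.sh

The primary's web UI on port 8080 then shows which replicas have joined.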


Mesos/YARN are separate programs that are used when your cluster isn't just a Spark cluster. Also, they don't come with sensible defaults: executors don't use all cores on the replicas unless explicitly specified.


You also have the option of a high availability mode using ZooKeeper, which keeps a list of backup primaries in case any primary fails. If you are a beginner, you are likely not handling a thousand-node cluster where the risk of node failure is significant. You are more likely to set up a cluster on a managed cloud platform like Amazon's or Google's, which already takes care of node failures.


You don't need high availability with cloud infrastructure or a small cluster

I had my cluster set up in a hostile environment where human factors were responsible for power failures and nodes going off the grid. (Basically, my college computer lab, where diligent students turn off the machines and careless students pull out LAN cables.) I could still pull it off without high availability by carefully choosing the primary node. You wouldn't have to worry about that.


Check the Java version you use to run Spark

One very important aspect is the Java version you use to run Spark. Normally, a later version of Java works with something compiled for older releases.


But with Project Jigsaw, modularity introduced stricter isolation and boundaries in Java 9, which breaks certain things that use reflection. With Spark 2.3.0 running on Java 9, I got illegal reflective access errors. Java 8 had no issues.


This will definitely change in the near future, but keep that in mind until then.
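A quick way to check which Java Spark will pick up, and to pin it to Java 8 if several JDKs are installed, is something like this (the JAVA_HOME path is only an example for a Debian-style layout):

# see which Java the shell resolves to
java -version

# pin Spark to a specific JDK by setting JAVA_HOME in conf/spark-env.sh on every node
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> conf/spark-env.sh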


Specify the primary URL exactly as is. Do not resolve domain names to IP addresses, or vice versa

The standalone cluster is very sensitive about URLs used to resolve primary and replica nodes. Suppose you start the primary node like below:


> sbin/start-master.sh

and your primary's web UI is up at localhost:8080


By default, your PC's hostname is chosen as the primary URL address. My hostname x360 resolves to localhost, but starting a replica with the first command below will not work.


# does not work
> sbin/start-slave.sh spark://localhost:7077

# works
> sbin/start-slave.sh spark://x360:7077

This works, and our replica has been added to the cluster.


Our replica has an IP address in the 172.17.x.x subnet, which is actually the subnet set up by Docker on my machine.


The primary can communicate with this replica because both are on the same machine. But the replica cannot communicate with other replicas on the network, or a primary on a different machine, because its IP address is not routable.


As with the primary above, a replica on a machine without a primary will take the machine's hostname as its address. When you have identical machines, all of them end up using the same hostname as their address. This creates a total mess, and no node can communicate with another.


So the above commands would change to:


# start master
> sbin/start-master.sh -h $myIP

# start slave
> sbin/start-slave.sh -h $myIP spark://<masterIP>:7077

# submit a job
> SPARK_LOCAL_IP=$myIP bin/spark-submit ...

where myIP is the IP address of the machine which is routable between the cluster nodes. It is more likely that all nodes are on the same network, so you can write a script which will set myIP on each machine.


# assume all nodes are in the 10.1.26.x subnet
siddhesh@master:~$ myIP=`hostname -I | tr " " "\n" | grep 10.1.26. | head`

Flow of the code

So far we have set up our cluster and seen that it is functional. Now it's time to code. Spark is quite well documented and comes with lots of examples, so it's very easy to get started with coding. What is less obvious is how the whole thing works, which results in some very hard-to-debug errors at runtime. Suppose you coded something like this:


class SomeClass {
  static SparkSession spark;
  static LongAccumulator numSentences;

  public static void main(String[] args) {
    spark = SparkSession.builder()
                        .appName("Sparkl")
                        .getOrCreate();                          (1)
    numSentences = spark.sparkContext()
                        .longAccumulator("sentences");           (2)
    spark.read()
         .textFile(args[0])
         .foreach(SomeClass::countSentences);                    (3)
  }

  static void countSentences(String s) { numSentences.add(1); }  (4)
}

(1) create a Spark session
(2) create a long accumulator to keep track of job progress
(3) traverse the file line by line, calling countSentences for each line
(4) add 1 to the accumulator for each sentence

The above code works on a local cluster but will fail with a null pointer exception when run on a multi-node cluster. Both spark and numSentences will be null on the replica machines.


To solve this problem, encapsulate all initialized states in non-static fields of an object. Use main to create the object and defer further processing to it.


What you need to understand is that the code you write is run by the driver node exactly as is, but what the replica nodes execute is a serialized job that Spark gives them. Your classes will be loaded by the JVM on the replica.


Static initializers will run as expected, but functions like main won't, so static values initialized in the driver won't be seen on the replicas. I am not sure how the whole thing works, and I am only inferring from experience, so take my explanation with a grain of salt. So your code now looks like this:


class SomeClass {
  SparkSession spark;                                (1)
  LongAccumulator numSentences;
  String[] args;

  SomeClass(String[] args) { this.args = args; }

  public static void main(String[] args) {
    new SomeClass(args).process();                   (2)
  }

  void process() {
    spark = SparkSession.builder().appName("Sparkl").getOrCreate();
    numSentences = spark.sparkContext().longAccumulator("sentences");
    spark.read().textFile(args[0]).foreach(this::countSentences);   (3)
  }

  void countSentences(String s) { numSentences.add(1); }
}

(1) make the fields non-static
(2) create an instance of the class and then execute the Spark jobs
(3) referencing this in the foreach lambda brings the object into the closure of accessible objects, so it gets serialized and sent to all replicas

Those of you who are programming in Scala might use Scala objects which are singleton classes and hence may never come across this problem. Nevertheless, it is something you should know.


Submit app and dependencies

There is more to the coding than the above, but before going further, you need to submit your application to the cluster. Unless your app is extremely trivial, chances are you are using external libraries.


When you submit your app jar, you also need to tell Spark the dependent libraries that you are using, so it will make them available on all nodes. It is pretty straightforward. The syntax is:


bin/spark-submit --packages groupId:artifactId:version,...
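For instance, a submission that pulls Jena ARQ from Maven Central might look roughly like this (the master URL, class, jar, and version are illustrative):

bin/spark-submit --master spark://<masterIP>:7077 \
  --packages org.apache.jena:jena-arq:3.7.0 \
  --class com.example.Sparkl sparkl.jar /opt/data/wikiArticles.txt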

I have had no issues with this scheme. It works flawlessly. I generally develop on my laptop and then submit jobs from a node on the cluster. So I need to transfer the app and its dependencies to whatever node I ssh into.


Spark looks for dependencies in the local Maven repo, then the central repo and any repos you specify using the --repositories option. It is a little cumbersome to sync all that on the driver and then type out all those dependencies on the command line, so I prefer all dependencies packaged in a single jar, called an uber jar.


Use Maven shade plugin to generate an uber jar with all dependencies so job submitting becomes easier

Just include the following lines in your pom.xml


<build>
 <plugins>
  <plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-shade-plugin</artifactId>
   <version>3.0.0</version>
   <configuration>
    <artifactSet>
     <excludes>
      <exclude>org.apache.spark:*</exclude>
     </excludes>
    </artifactSet>
   </configuration>
   <executions>
    <execution>
     <phase>package</phase>
     <goals>
      <goal>shade</goal>
     </goals>
    </execution>
   </executions>
  </plugin>
 </plugins>
</build>

When you build and package your project, the default distribution jar will have all dependencies included.


As you submit jobs, the application jars get accumulated in the work directory and fill up over time.


Set spark.worker.cleanup.enabled to true in conf/spark-defaults.conf


This option is false by default and is applicable to the stand-alone mode.
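A minimal sketch of the relevant lines (the interval and TTL values are examples; the defaults may already suit you):

# conf/spark-defaults.conf
spark.worker.cleanup.enabled    true
# check every 30 minutes; delete data of applications finished more than 7 days ago
spark.worker.cleanup.interval   1800
spark.worker.cleanup.appDataTtl 604800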


Input and Output files

This was the most confusing part, and the hardest to diagnose.


Spark supports reading/writing of various sources such as hdfs, ftp, jdbc or local files on the system when the protocol is file:// or missing. My first attempt was to read from a file on my driver. I assumed that the driver would read the file, turn it into partitions, and then distribute those across the cluster. Turns out it doesn’t work that way.


When you read a file from the local filesystem, ensure that the file is present on all the worker nodes at exactly the same location. Spark does not implicitly distribute files from the driver to the workers.


So I had to copy the file to every worker at the same location. The location of the file was passed as an argument to my app. Since the file was located in the parent folder, I specified its path as ../wikiArticles.txt. This did not work on the worker nodes.


Always pass absolute file paths for reading

It could be a mistake on my side, but I know that the file path made it as-is into the textFile function, and it caused "file not found" errors.
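In other words, something like the following, where the same absolute path has to exist on the driver and on every worker (the path is an example, and spark is the session from the earlier snippet):

// the file must be present at this exact absolute path on every worker node
Dataset<String> articles = spark.read().textFile("/opt/data/wikiArticles.txt");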


Spark supports common compression schemes, so most gzipped or bzipped text files will be uncompressed before use. It might seem that compressed files will be more efficient, but do not fall for that trap.


Don't read from compressed text files, especially gzip. Uncompressed files are faster to process.

Gzip cannot be uncompressed in parallel like bzip2, so nodes spend the bulk of their time uncompressing large files.


It is a hassle to make the input files available on all workers. You can instead use Spark’s file broadcast mechanism. When submitting a job, specify a comma separated list of input files with the --files option. Accessing these files requires SparkFiles.get(filename). I could not find enough documentation on this feature.


To read a file broadcast with the --files option, use SparkFiles.get(<onlyFileNameNotFullPath>) as the pathname in read functions.


So a file submitted as --files /opt/data/wikiAbstracts.txt would be accessed as SparkFiles.get("wikiAbstracts.txt"). This returns a string which you can use in any read function that expects a path. Again, remember to specify absolute paths.
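Putting the two halves together, a hedged sketch looks like this (the paths, class, and jar names are examples):

# ship the file alongside the job
bin/spark-submit --files /opt/data/wikiAbstracts.txt --class com.example.Sparkl sparkl.jar

// inside the job, resolve the node-local copy by file name only
String path = org.apache.spark.SparkFiles.get("wikiAbstracts.txt");
Dataset<String> abstracts = spark.read().textFile(path);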


Since my input file was 5GB gzipped, and my network was quite slow at 12MB/s, I tried to use Spark's file broadcast feature. But the decompression itself was taking so long that I manually copied the file to every worker instead. If your network is fast enough, you can use uncompressed files. Alternatively, use an HDFS or FTP server.


Writing files follows the same semantics as reading. I was saving my DataFrame to a csv file on the local filesystem. Again, I assumed that the results would be sent back to the driver node. That didn't work for me.


When a DataFrame is saved to a local file path, each worker saves its computed partitions to its own disk. No data is sent back to the driver.

I was only getting a fraction of the results I was expecting. Initially I had misdiagnosed this problem as an error in my code. Later I found out that each worker was storing its computed results on its own disk.
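If you do want everything in one place, either write to storage that all nodes share, or, for small result sets, pull the rows back into the driver explicitly. A rough sketch, where results stands for the DataFrame being saved and the output locations are examples:

// option 1: write to shared storage such as HDFS (namenode address is an example)
results.write().csv("hdfs://namenode:9000/output/matches");

// option 2: for small results only, collect into the driver process and handle them there
// (needs java.util.List and org.apache.spark.sql.Row)
List<Row> rows = results.collectAsList();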


Partitions

The number of partitions you make affects the performance. By default, Spark will make as many partitions as there are cores in the cluster. This is not always optimal.


Keep an eye on how many workers are actively processing tasks. If too few, increase the number of partitions.


If you read from a gzipped file, Spark creates just one partition, which will be processed by only one worker. That is also one reason why gzipped files are slow to process. I have observed slower performance with a small number of large partitions compared to a large number of small partitions.


It’s better to explicitly set the number of partitions while reading data.


You may not have to do this when reading from HDFS, as Hadoop files are already partitioned.
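A hedged sketch of setting the count explicitly, continuing the earlier Java example (the numbers are arbitrary; tune them to your cluster):

// repartition right after reading, e.g. a few partitions per executor core
Dataset<String> lines = spark.read().textFile(args[0]).repartition(64);

// or, with the RDD API, request a minimum number of partitions up front
// (org.apache.spark.api.java.JavaSparkContext / JavaRDD)
JavaRDD<String> rdd = new JavaSparkContext(spark.sparkContext()).textFile(args[0], 64);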


Wikipedia and DBpedia

There are no gotchas here, but I thought it would be good to make you aware of alternatives. The entire Wikipedia XML dump is 14GB compressed and 65GB uncompressed. Most of the time you only want the plain text of the articles, but the dump is in MediaWiki markup, so it needs some preprocessing. There are many tools available for this in various languages. Although I haven't used them personally, I am pretty sure it must be a time-consuming task. But there are alternatives.


If all you want is the Wikipedia article plaintext, mostly for NLP, then download the dataset made available by DBpedia.


I used the full article dump (NIF Context) available at DBpedia (direct download from here). This dataset gets rid of unwanted stuff like tables, infoboxes, and references. The compressed download is 4.3GB in the turtle format. You can convert it to TSV like so:
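The exact command depends on the dump, but a rough, untested sketch that assumes one N-Triples-style triple per line (the file names and the regex are assumptions) would be:

# extract (article URI, article text) pairs from the nif:isString triples
bzcat nif_context_en.ttl.bz2 \
  | grep 'nif-core#isString' \
  | sed -E 's/^<([^>]*)> <[^>]*> "(.*)"@en \.$/\1\t\2/' \
  > wikiAbstracts.tsv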


Similar datasets are available for other properties like page links, anchor texts, and so on. Do check out DBpedia.


A word about databases

I never quite understood why there is a plethora of databases, all so similar, and why, on top of that, people buy database licenses. Until this project I hadn't seriously used any. I had only ever used MySQL and Apache Derby.


For my project I used a SPARQL triple store database, Apache Jena TDB, accessed over a REST API served by Jena Fuseki. This database would give me RDF urls, labels, and predicates for all the resources mentioned in the supplied article. Every node would make a database call and only then would proceed with further processing.


My workload had become IO bound, as I could see near 0% CPU utilization on worker nodes. Each partition of the data would result in two SPARQL queries. In the worst case scenario, one of the two queries was taking 500–1000 seconds to process. Thankfully, the TDB database relies on Linux’s memory mapping. I could map the whole DB into RAM and significantly improve performance.


If you are IO bound and your database can fit into RAM, run it in memory.

I found a tool called vmtouch which shows what percentage of the database directory has been mapped into memory. This tool also allows you to explicitly map files or directories into RAM and optionally lock them so they won't get paged out.
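Roughly, the usage looks like this (the TDB directory path is an example):

# report how much of the database directory is already resident in memory
vmtouch -v /data/tdb

# touch every page to pull the files into the page cache and lock them there (-d keeps it running in the background)
vmtouch -vtld /data/tdb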


My 16GB database could easily fit into my 32GB RAM server. This boosted query performance by orders of magnitude, to 1–2 seconds per query. Using a rudimentary form of database load balancing based on partition number, I could cut my execution time in half by using two SPARQL servers instead of one.
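The balancing was nothing fancy. A minimal sketch of the idea, assuming two Fuseki endpoints (the URLs are placeholders), is to pick the endpoint from the id of the partition currently being processed:

// choose one of the SPARQL endpoints based on the current partition (runs inside the task)
String[] endpoints = {
    "http://sparql1:3030/wiki/sparql",
    "http://sparql2:3030/wiki/sparql"
};
String endpoint = endpoints[org.apache.spark.TaskContext.getPartitionId() % endpoints.length];
// ... run this partition's SPARQL queries against 'endpoint'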


Conclusion

I truly enjoyed distributed computing on Spark. Without it I could not have completed my project. It was quite easy to take my existing app and have it run on Spark. I would definitely recommend that anyone give it a try.


Originally published at siddheshrane.github.io.


Translated from: https://www.freecodecamp.org/news/processing-wikipedia-with-spark-542213bd4365/
