Scala tutorial: scala.io.Source, accessing files, flatMap, mutable Maps

Latest recommended article published on 2024-03-27 19:58:41

danpie3295

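This tutorial shows how to use Scala's `scala.io.Source` to access the file system, including counting the words in a file, processing it line by line, and streaming its contents. The examples demonstrate how to compute word counts for a file with flatMap and a mutable Map, and the tutorial also covers reading a file from a URL. It discusses when immutable and mutable Maps are appropriate, the benefits of iterators, and the importance of not loading an entire file into memory at once.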


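Preface

This is Part 8 of a Scala tutorial series for beginners. There are other posts on this blog, and you can find links to them and to other resources on the links page of the computational linguistics course I am creating them for. You can also find this and other tutorial series on the "JCG Java Tutorials" page.

This tutorial is about accessing the file system in order to work with text files. The previous tutorial showed how to build a Map containing the counts of each word type in a given text. However, it assumed that the text was available in a String variable, and typically we want to learn about files that exist on the file system or on the Internet. This tutorial shows how to read file contents into Scala for processing, either by building a single String for the file or by consuming the file line by line in a streaming fashion. Along the way, mutable Maps are introduced as a way to count words without reading the entire file into memory.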

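Counting the words in a file

As an example, we'll use the complete text of The Adventures of Sherlock Holmes from Project Gutenberg. Download it, put it in a directory, and start the Scala REPL in that directory. To access files we'll use the Source class, so first we need to import it.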

scala> import scala.io.Source
import scala.io.Source

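Source provides several ways of interacting with files and making their contents accessible in your Scala programs. The fromFile method is the one you will need most often.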

scala> Source.fromFile("pg1661.txt")
res3: scala.io.BufferedSource = non-empty iterator

This creates a BufferedSource, from which you can easily get all of the file's contents as a String.

scala> val holmes = Source.fromFile("pg1661.txt").mkString
holmes: String =
"Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<...many more lines...>
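
The REPL examples here never close the Source they open, which is harmless in a short session but leaves a file handle open in a longer-running program. As a minimal sketch, assuming Scala 2.13 or later (for scala.util.Using) and a UTF-8 encoded file, reading a whole file and closing it afterwards could look like this. The names ReadWholeFile and slurp are hypothetical and are not part of the tutorial.

```scala
import scala.io.Source
import scala.util.Using

object ReadWholeFile {
  // Read an entire file into one String, then close the underlying file handle.
  // Assumption: the file is UTF-8 encoded; pass a different encoding name if not.
  def slurp(path: String): String =
    Using.resource(Source.fromFile(path, "UTF-8"))(source => source.mkString)
}
```

With this in scope, `val holmes = ReadWholeFile.slurp("pg1661.txt")` would play the same role as the `mkString` call above.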

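With that, you can do the same thing shown in Tutorial 7 to get the word counts (except that here we split on runs of whitespace rather than on a single space).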

scala> val counts = holmes.split("\\s+").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(wood-work, -> 1, "Pray, -> 1, herself. -> 2, stern-post -> 1, "Should -> 1, incident -> 8, serious -> 14, earth--" -> 2, sinister -> 10, comply -> 7, breaks -> 1, forgotten -> 3, precious -> 10, 'It -> 3, compliment -> 2, suite, -> 1, "DEAR -> 1, summarise. -> 1, "Done -> 1, fine.' -> 1, lover -> 5, of. -> 2, lead. -> 1, plentiful -> 1, 'Lone -> 4, malignant -> 1, terrible -> 14, rate -> 1, mole -> 1, assert -> 1, lights -> 2, Stevenson, -> 1, submitted -> 4, tap. -> 1, beard, -> 1, band--a -> 1, force! -> 1, snow -> 7, Produced -> 2, ask, -> 1, purchasing -> 1, Hall, -> 1, wall. -> 5, remarked -> 32, laughing -> 4, member." -> 1, 30,000 -> 2, Redistributing -> 1, coat, -> 6, "'One -> 2, 'band,' -> 1, relapsed -> 1, apol...

scala> counts("Holmes")
res2: Int = 197

scala> counts("Watson")
res3: Int = 4

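In case you find it odd that Watson occurs only four times, remember that we split on whitespace, which means that in a sentence like the following, the token of interest is Watson," (with its trailing comma and quotation mark) rather than Watson.

"You could not possibly have come at a better time, my dear Watson," he said cordially.

Looking up that token and other variants shows that the stories contain many more tokens that include Watson.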

scala> counts("Watson,\"")
res4: Int = 19

scala> counts("Watson,")
res5: Int = 40

scala> counts("Watson.")
res6: Int = 10

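Of course, the real problem is that tokenizing on whitespace is too crude. To do this properly, you generally need either a good hand-built tokenizer (one that keeps tokens like e.g. and Mr. and Yahoo! intact while splitting punctuation off of most words) or a machine-learned tokenizer trained on hand-tokenized data. For an example of the latter, see the tokenizer in the Apache OpenNLP toolkit, which includes pretrained models for English.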
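
To make the difference concrete, here is a rough sketch of a slightly less crude tokenizer based on a regular expression. It is neither the hand-built nor the machine-learned tokenizer described above: the pattern and the names CrudeTokenizer, tokenize, and counts are assumptions made for illustration only, and the pattern still breaks abbreviations such as Mr. apart. It uses map rather than mapValues because mapValues returns a lazy view in Scala 2.13.

```scala
object CrudeTokenizer {
  // Hypothetical pattern: a run of letters or digits, optionally joined by
  // single internal apostrophes or hyphens (as in "wood-work" or "Gutenberg's").
  private val Word = """[\p{L}\p{N}]+(?:['-][\p{L}\p{N}]+)*""".r

  // All matches of the pattern, in order; punctuation around words is dropped.
  def tokenize(text: String): List[String] = Word.findAllIn(text).toList

  // Word counts over the tokens, using the groupBy pattern from Tutorial 7.
  def counts(text: String): Map[String, Int] =
    tokenize(text).groupBy(identity).map { case (word, occurrences) => (word, occurrences.length) }
}
```

With a tokenizer along these lines, the tokens Watson, and Watson. and Watson," would all be counted under the single key Watson.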

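Working line by line

Often you need to work through a file line by line rather than reading the whole thing in as a single String as we did above. For example, you may need to process each line differently, so having everything in a single String is not especially convenient. Or you may be working with a large file that does not fit easily into memory (which is where it ends up when you read the whole thing into a String). You can obtain the lines of a file as an Iterator[String], in which each item is one line of the file, by using the getLines method.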

scala> Source.fromFile("pg1661.txt").getLines
res4: Iterator[String] = non-empty iterator

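This Iterator is ready for you to consume lines, but it does not read the whole file into memory right away. Instead, it is buffered so that you can consume each line on demand, in effect asking for more lines as they are read from disk. You can think of it as streaming the file into your Scala program, much as modern audio and video content is streamed to your computer: it is never stored in full, but is delivered in pieces to where it is needed, when it is needed.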
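
One way to see this on-demand behavior is to count how many lines are actually pulled through the Iterator. The following is a small sketch, not part of the original tutorial: it reads from an in-memory String via Source.fromString so that it runs without a file, and the names LazyLines and firstLines are made up for the example.

```scala
import scala.io.Source

object LazyLines {
  // Returns the first n lines together with the number of lines that were
  // actually pulled through the mapping step.
  def firstLines(text: String, n: Int): (List[String], Int) = {
    var pulled = 0
    val lines = Source.fromString(text).getLines().map { line =>
      pulled += 1 // side effect used only to make the laziness visible
      line
    }
    val taken = lines.take(n).toList
    (taken, pulled)
  }
}
```

Calling `LazyLines.firstLines("a\nb\nc\nd", 2)` pulls only two of the four lines through the mapping step, because nothing downstream asks for the rest.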

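Of course, Iterators share a lot with sequence data structures like Lists: once you have an Iterator, you can use foreach, for, map, and so on with it. So, to print out all the lines in the file, we can do the following.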

scala> Source.fromFile("pg1661.txt").getLines.foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle
<...many more lines...>

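That produces a lot of output, but it shows how easily you can create your own Scala implementation of the Unix cat program: just save the following line in a file called cat.scala:

scala.io.Source.fromFile(args(0)).getLines().foreach(println)

Then call it with a file name to list that file's contents.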

$ scala cat.scala pg1661.txt

Back in the REPL, seeing the whole file is less than ideal. If you just want to see the start of the file, use the take method on the Iterator before the foreach.

scala> Source.fromFile("pg1661.txt").getLines.take(5).foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included

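The take method is generally useful for any sequence, and it is the complement of the drop method, as shown in the following examples on a simple List[Int].

scala> val numbers = (1 to 10).toList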
numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3)
res12: List[Int] = List(1, 2, 3)

scala> numbers.drop(3)
res13: List[Int] = List(4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3) ::: numbers.drop(3)
res14: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
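
Since take and drop are complementary, the standard library also offers splitAt, which returns both halves in one call. The snippet below is a small illustration added here, with TakeDrop, front, and back as made-up names.

```scala
object TakeDrop {
  val numbers: List[Int] = (1 to 10).toList

  // splitAt(3) returns the pair (numbers.take(3), numbers.drop(3)).
  val (front, back) = numbers.splitAt(3)
}
```

Here front holds the first three elements and back holds the remaining seven, so `front ::: back` rebuilds the original List.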

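Word counting line by line, first try

Now that we have seen how to read a file and start working with it line by line, how can we count the number of times each word occurs? Recall from Tutorial 7 and earlier that the starting point is to have a sequence (Array, List, etc.) of Strings in which each element is a word token. To get started, we can simply use the toList method on the Iterator[String] obtained from getLines.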

scala> val holmes = Source.fromFile("pg1661.txt").getLines.toList
holmes: List[String] = List(The Project Gutenberg EBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle, (#15 in our series by Sir Arthur Conan Doyle), "", Copyright laws are changing all over the world. Be sure to check the, copyright laws for your country before downloading or redistributing, this or any other Project Gutenberg eBook., "", This header should be the first thing seen when viewing this Project, Gutenberg file.  Please do not remove it.  Do not change or edit the, header without written permission., "", Please read the "legal small print," and other information about the, eBook and Project Gutenberg at the bottom of this file.  Included is, important information about your specific rights and restrictions in, how the file may be used.  You can also find ou...

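Now we have the contents of the file as a List[String] and can go on to do useful things with it. For example, we can map each line (a String) to a sequence of Strings by splitting it on spaces.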

scala> val listOfListOfWords = Source.fromFile("pg1661.txt").getLines.toList.map(x => x.split(" ").toList)
listOfListOfWords: List[List[java.lang.String]] = List(List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle), List(""), List(This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with), List(almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or), List(re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included), List(with, this, eBook, or, online, at, www.gutenberg.net), List(""), List(""), List(Title:, The, Adventures, of, Sherlock, Holmes), List(""), List(Author:, Arthur, Conan, Doyle), List(""), List(Posting, Date:, April, 18,, 2011, [EBook, #1661]), List(First, Posted:, November, 29,, 2002), List(""), List(Language:, English), List(""), List(""), List(***, START, OF, THIS, PRO...

And, as we saw in Tutorial 7, when we have a List of Lists, we can use flatten to create one big List.

scala> val listOfWords = listOfListOfWords.flatten
listOfWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project, Gut...

However, by now you may have realized that this is the map-then-flatten pattern we saw earlier, which means we can use flatMap instead.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.toList.flatMap(x => x.split(" "))
flatMappedWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project,...
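
The equivalence is easy to check on a few lines held in memory. The sketch below uses three sample lines modeled on the file's header. It also shows where the empty tokens in the output above come from: splitting on a single space yields an empty String for each doubled space and for each blank line. The object and value names are invented for the example.

```scala
object FlatMapSketch {
  val lines: List[String] = List(
    "almost no restrictions whatsoever.  You may copy it,",
    "",
    "Title: The Adventures"
  )

  // map followed by flatten ...
  val viaMapFlatten: List[String] = lines.map(line => line.split(" ").toList).flatten

  // ... gives the same tokens as a single flatMap.
  val viaFlatMap: List[String] = lines.flatMap(line => line.split(" "))

  // Splitting on runs of whitespace and filtering removes the empty tokens.
  val nonEmpty: List[String] = lines.flatMap(line => line.split("\\s+")).filter(_.nonEmpty)
}
```

Both viaMapFlatten and viaFlatMap contain the same 13 tokens, two of them empty, while nonEmpty keeps the 11 real ones.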

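But something here should be bothering you: wasn't (part of) the idea not to read in all the lines at once? Indeed, by calling toList on the Iterator above, we read the entire file into memory. However, we can do without the toList step and simply flatMap the Iterator directly, which gives us a new Iterator over tokens rather than lines.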

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" "))
flatMappedWords: Iterator[java.lang.String] = non-empty iterator

Now, if we want to count, we can convert to a List at that point and apply the groupBy and mapValues trick we have already seen (output omitted).

scala> val counts = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" ")).toList.groupBy(x=>x).mapValues(x=>x.length)

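Oops: it works, but we have once again brought the whole file into memory, because the List created by toList holds every token in the file. Next, we'll see how to get around this by using a mutable Map.

Word counting by streaming with an Iterator and using a mutable Map

So far in these tutorials I have stuck almost exclusively to immutable data structures, except where mutable ones arose due to context (like the Arrays that came up with the toString method). It is good to use immutable data structures when possible, but there are times when mutable ones are more convenient and perhaps more appropriate.

With the immutable Maps seen in the previous tutorial, you cannot change the value assigned to a key, nor can you add a new key.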

scala> val lettersToNumbers = Map("A"->1, "B"->2, "C"->3)
lettersToNumbers: scala.collection.immutable.Map[java.lang.String,Int] = Map(A -> 1, B -> 2, C -> 3)

scala> lettersToNumbers("A") = 4
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("A") = 4

scala> lettersToNumbers("D") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("D") = 5

There is another kind of Map, scala.collection.mutable.Map, which does allow this behavior.

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val mutableLettersToNumbers = mutable.Map("A"->1, "B"->2, "C"->3)
mutableLettersToNumbers: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, B -> 2, A -> 1)

scala> mutableLettersToNumbers("A") = 4

scala> mutableLettersToNumbers("D") = 5

scala> mutableLettersToNumbers
res4: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 5, B -> 2, A -> 4)

It also offers a convenient way to increment the count associated with a key, using the += method.

scala> mutableLettersToNumbers("D") += 5

scala> mutableLettersToNumbers
res6: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 10, B -> 2, A -> 4)

However, we cannot use this approach with a key that does not exist yet.

scala> mutableLettersToNumbers("E") += 1
java.util.NoSuchElementException: key not found: E
<...stacktrace...>

Fortunately, we can provide a default value. Here is an example that starts a new Map with a default value of 0.

scala> val counts = mutable.Map[String,Int]().withDefault(x=>0)
counts: scala.collection.mutable.Map[String,Int] = Map()

scala> counts("Z") += 1

scala> counts("Y") += 1

scala> counts("Z") += 1

scala> counts
res11: scala.collection.mutable.Map[String,Int] = Map(Z -> 2, Y -> 1)

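Note: when you start out with some values already in the Map, Scala can infer the key and value types, but when you initialize an empty Map you must declare the key and value types explicitly.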
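
It may help to spell out what the += line does. As far as I understand the desugaring, `counts("Z") += 1` is shorthand for `counts("Z") = counts("Z") + 1`, that is, a lookup (which falls back to the default) followed by an update. The sketch below illustrates this, along with two related points: reading a missing key does not store it, and getOrElse gives the same effect on a Map without a default. The names DefaultCounts, missing, storedAfterRead, and plain are invented for the example.

```scala
import scala.collection.mutable

object DefaultCounts {
  val counts: mutable.Map[String, Int] = mutable.Map[String, Int]().withDefault(_ => 0)

  // Shorthand for counts.update("Z", counts("Z") + 1); the first lookup yields 0.
  counts("Z") += 1
  counts("Z") += 1

  // Reading a missing key returns the default but does not add the key.
  val missing: Int = counts("Q")
  val storedAfterRead: Boolean = counts.contains("Q")

  // The same kind of update on a Map without a default, using getOrElse.
  val plain: mutable.Map[String, Int] = mutable.Map.empty[String, Int]
  plain("Z") = plain.getOrElse("Z", 0) + 1
}
```

After this runs, counts maps Z to 2 and contains no entry for Q, and plain maps Z to 1.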

With that, here is how we can use flatMap plus a mutable Map to count the words in a text without reading the whole text into memory.

import scala.collection.mutable
val counts = mutable.Map[String, Int]().withDefault(x => 0)
for (token <- scala.io.Source.fromFile("pg1661.txt").getLines().flatMap(x => x.split("\\s+")))
  counts(token) += 1

Having created the counts Map this way, we can convert it to an immutable Map with the toMap method once we are done adding elements.

scala> val fixedCounts = counts.toMap
fixedCounts: scala.collection.immutable.Map[String,Int] = Map(wood-work, -> 1,
<...output truncated...>

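Now we cannot modify the values in fixedCounts, which has advantages in many contexts: for example, we will not accidentally clobber values or add unwanted keys, and it has (positive) implications for parallel processing.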

scala> fixedCounts("Holmes") = 0
<console>:13: error: value update is not a member of scala.collection.immutable.Map[String,Int]
fixedCounts("Holmes") = 0
^
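
The pieces of this section can be put together in one small function. What follows is a sketch under a few assumptions rather than the tutorial's own code: the names StreamingWordCount, countTokens, and top are invented, and unlike the loop above it skips the empty tokens produced by blank lines. The mutable Map stays private to the function, and callers only ever see the immutable result.

```scala
import scala.collection.mutable

object StreamingWordCount {
  // Count whitespace-separated tokens from any Iterator of lines, holding only
  // the current line and the counts in memory.
  def countTokens(lines: Iterator[String]): Map[String, Int] = {
    val counts = mutable.Map[String, Int]().withDefault(_ => 0)
    for (token <- lines.flatMap(line => line.split("\\s+")) if token.nonEmpty)
      counts(token) += 1
    counts.toMap // hand back an immutable Map
  }

  // The n most frequent tokens, highest count first, ties broken alphabetically.
  def top(counts: Map[String, Int], n: Int): List[(String, Int)] =
    counts.toList.sortBy { case (token, count) => (-count, token) }.take(n)
}
```

For the Holmes file, usage would be something like `StreamingWordCount.countTokens(scala.io.Source.fromFile("pg1661.txt").getLines())`, followed by `top(..., 10)` to list the most frequent tokens.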

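Reading a file from a URL

It turns out that scala.io.Source can do much more than read from files. Another example is reading from a URL to access a file on the Internet, using the fromURL method.

val holmesUrl = """http://www.gutenberg.org/cache/epub/1661/pg1661.txt"""
for (line <- Source.fromURL(holmesUrl).getLines())
  println(line)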

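This is probably not what you want if you intend to analyze the same file again and again; in that case, just download the file and work with it locally. But it can be very useful if you are following links within pages (for example, when working with Wikipedia or Twitter data) and need to read the contents of URLs on the fly.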
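
When reading URLs on the fly, you often want only the first few lines and want the connection closed afterwards. Here is a minimal sketch of that, assuming Scala 2.13 or later (for scala.util.Using) and UTF-8 content. The names UrlLines and firstLines are hypothetical.

```scala
import scala.io.Source
import scala.util.Using

object UrlLines {
  // Fetch at most n lines from a URL, then close the underlying stream.
  // Assumption: the content is UTF-8 encoded.
  def firstLines(url: String, n: Int): List[String] =
    Using.resource(Source.fromURL(url, "UTF-8")) { source =>
      source.getLines().take(n).toList
    }
}
```

The result is materialized as a List before the Source is closed, because an Iterator returned from inside the block could no longer be read afterwards. The same function also accepts file: URLs, which is handy for trying it out without a network connection.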

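Using (up) the Source

A final note about the Iterators obtained from Source.fromFile and Source.fromURL: you can only iterate through them once! This is part of what makes them more efficient: they do not hold all of the text in memory. So, do not be surprised if you see the following behavior.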

scala> val holmesIterator = Source.fromFile("pg1661.txt").getLines
holmesIterator: Iterator[String] = non-empty iterator

scala> holmesIterator.foreach(println)

Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

<...many lines of output...>

This Web site includes information about Project Gutenberg-tm,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.

scala> holmesIterator.foreach(println)

<...nothing output!...>

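So, the Iterator is used up! If you want to go through the file again, you need to start a new Iterator just as you did the first time. The neat thing about staying with Iterators and not converting to a List (thereby bringing everything into memory) is that each mapping operation we perform on an Iterator applies only to the item currently being looked at, so we never need to read the whole file into memory.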
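
One simple way to get a new Iterator whenever you need one is to wrap its creation in a def instead of storing it in a val. The sketch below, added for illustration, uses an in-memory String via Source.fromString so that it runs without the Holmes file. With a real file you would call Source.fromFile inside the def. All names in it are made up.

```scala
import scala.io.Source

object FreshIterator {
  val text = "first line\nsecond line\nthird line"

  // A def, not a val: every call builds a new Source and a new Iterator.
  def lines(): Iterator[String] = Source.fromString(text).getLines()

  val once: Iterator[String] = lines()
  val firstPass: List[String] = once.toList   // consumes the Iterator
  val secondPass: List[String] = once.toList  // nothing left to read
  val freshPass: List[String] = lines().toList
}
```

Here firstPass and freshPass each contain all three lines, while secondPass is empty because once has already been used up.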

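Of course, if you have a reasonably small file to work with, you should feel perfectly free to turn it into a List and work with it that way; it is often more convenient to do so, since you can use the groupBy and mapValues pattern.

Reference: First steps in Scala for beginning programmers, Part 8, from our JCG partner Jason Baldridge at the Bcompose blog.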
