
Our team at MyHeritage built a powerful genealogy platform and maintains a huge database numbering 12.5 billion historical records that allow people to learn about their ancestors. Users can search this database to discover new information about their families and find photographs featuring their relatives. MyHeritage’s Record Matching technology automatically notifies users when historical records match information in their family trees, saving them the need to actively search the archives.


One of the biggest collections in this historical record database is a structured index of the data extracted from U.S. City Directories. In this article I’m going to explain why this project was highly complex, why city directories are important, and how we solved technical problems faced by our engineers.


City directories are books that contain listings of residents, streets, and businesses, and indicate their location in a certain city. They may be arranged alphabetically, geographically, or in other ways. The data from city directories is a goldmine for genealogy research since it allows users to reconstruct the personal history of individuals in their family.


City Directories. Photo courtesy of Florida Memory, State Archives of Florida

These books were issued by many different publishers and across many different cities throughout the years. What if we could extract the names of all the individuals listed in tens of thousands of books, and then sort or match them into groups representing a certain family? We could find year ranges when a person lived at a particular address, their occupational history, and even infer marriage and death events.


Residential directory example, image by the author

A data index of city directories would add massive value to MyHeritage’s Record Matching system, which automatically finds matches between our users’ family trees and historical records. However, converting thousands of city directories into structured data is a complex task.


To structure the data, we needed to identify each logical record in the book, infer the surname, and then break each record into components. Inferring the surname is not trivial: when a group of people share a surname, publishers commonly printed a special ditto character in its place to save paper.


Record decomposition, image by the author

In addition, we had to handle numerous abbreviations, combine line breaks, extract family members, get rid of advertisements and other unwanted data, and make the process resistant to OCR mistakes (OCR processing is described below). Adding extra complexity to this project was the fact that there was no uniform standard for data published in the city directories — each publisher had their own vision of how to represent the data, and sometimes one publisher chose to do it in different ways throughout the years.


The whole project was broken into five major parts:


  1. Book digitization

  2. Record extraction

  3. Data labeling

  4. Post processing

  5. Consolidation


Book digitization

The source material for the books arrived on microfilm. An individual microfilm roll is about 25–30 meters long. In total, there were 12,966 rolls of 35 mm microfilm, which produced 12,027,174 digital images. Taped end to end, the rolls would stretch roughly 325–390 kilometers.


Microfilm. Photo by Mike Mansfield

An additional part of the data arrived on 6,290 microfiche sheets (flat sheets of film). The microfilm and microfiche were digitized using special equipment, which gave us more than 12 million JPEG 2000 images occupying 160 TB. Almost all of our images are “2-up” (two book pages per frame), as that is how they were initially microfilmed. Images were optimized for maximum OCR quality: 300–400 DPI, 8-bit grayscale, high contrast.


Photo by Mike Mansfield

Our next step was to recognize the text in the images using OCR. This was done by our team in Utah on a cluster of OCR servers. The cluster processed all the images and produced data in the ALTO (Analyzed Layout and Text Object) XML schema. As a result, we received a set of XML files with recognized words, combined into rows and blocks by their coordinates on a page.
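To make the shape of this data concrete, here is a minimal sketch of reading text lines and their coordinates from an ALTO file. It is not our production code: the sample document is invented, and the sketch assumes only the standard ALTO elements `TextLine` and `String` with their `CONTENT`, `HPOS`, and `VPOS` attributes. Tag names are compared without the namespace so the same code tolerates different ALTO versions.

```python
import xml.etree.ElementTree as ET

# Invented sample: one block with one line of two recognized words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1">
      <TextLine>
        <String CONTENT="Baruch" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="12"/>
        <String CONTENT="Albt" HPOS="165" VPOS="200" WIDTH="40" HEIGHT="12"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_lines(alto_xml):
    """Return one dict per TextLine: joined text plus the first word's position."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [child for child in elem if local_name(child.tag) == "String"]
        if not words:
            continue
        lines.append({
            "text": " ".join(w.get("CONTENT", "") for w in words),
            "x": float(words[0].get("HPOS", 0)),
            "y": float(words[0].get("VPOS", 0)),
        })
    return lines


print(read_lines(SAMPLE_ALTO))
```

The `x` value of each line is what the later indentation analysis works with.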


OCR’d regions, image by the author

This OCR’d data was stored next to images and we proceeded to the next important step — keying of metadata. One of the challenges we faced was that the books were microfilmed with no clear division — books could start or finish in the middle of the film. We worked together with a third-party vendor to find and tag the front cover and back cover of each book to segment the data. Furthermore, the vendor tagged the positions of residential directories in the books, abbreviation tables and book metadata. For this subproject, we prepared a set of vendor-specific images with reduced dimensions and more aggressive JPEG compression.


Once that work was complete, we had everything split and were ready for the next step, which was triggered automatically once a book was uploaded to a designated S3 bucket.


Record Extraction

We developed a record extraction algorithm that processed the books sequentially. It generally worked as follows: when a book was first processed, the algorithm chose a specific number of pages from the residential directory to form the learning set, which was then analyzed to compute weights for the later steps. Specifically, it analyzed the OCR output on the learning set pages: font sizes, block sizes, and the average number of lines in the blocks and columns. The statistical data collected allowed us to reject advertisement blocks and outline valuable regions.
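The idea of learning "typical" layout statistics and rejecting outliers can be sketched as follows. This is a simplified illustration, not the actual algorithm: it uses font size only (the real analysis also considered block sizes and line counts), and the `font_size` field and the 25% tolerance are assumptions made for the example.

```python
from statistics import median


def learn_font_size(learning_blocks):
    """Typical font size on the learning set; the median resists a few ad blocks."""
    return median(block["font_size"] for block in learning_blocks)


def split_blocks(blocks, typical_size, tolerance=0.25):
    """Keep blocks whose font size is within `tolerance` of the typical size."""
    kept, rejected = [], []
    for block in blocks:
        deviation = abs(block["font_size"] - typical_size) / typical_size
        (kept if deviation <= tolerance else rejected).append(block)
    return kept, rejected


# Hypothetical learning set: mostly listing text, one large advertisement block.
learning = [{"font_size": 8}, {"font_size": 8}, {"font_size": 9}, {"font_size": 24}]
typical = learn_font_size(learning)
kept, rejected = split_blocks(learning, typical)
print(typical, len(kept), len(rejected))
```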


Rejected (in red) and permitted (in blue) text areas, image by the author

After analyzing the learning set, we processed the whole book. Some text lines and blocks were rejected, and the rest were assembled into effective columns.


A single book could contain anywhere from dozens to hundreds of pages. It’s important to find the correct learning set, since the statistics collected on it affect all further processing. If the learning set consists of advertisement pages only, processing of the whole book will fail. Our algorithm therefore tries to pick pages that resemble one another for the learning set; advertising-rich pages, where font size varies widely, usually do not.


Once we rejected the garbage and defined effective columns, we needed to learn the row structure, since it differs from book to book. For further computation it’s important to align rotated pages (rotation sometimes happens during digitization). Generally, we didn’t need to rotate whole images; we only wanted to align the coordinates of text rows mathematically. We did this by fitting a linear regression to the row coordinates and rotating the fitted line, together with all of its points, until it was level.
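A minimal sketch of this coordinate-only alignment is shown below. It assumes each row is given as a list of `(x, y)` points (for example, one per word) with at least two distinct x values; the function names are illustrative, not taken from our code.

```python
import math


def fit_slope(points):
    """Ordinary least-squares slope of y on x, plus the centroid of the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx, mean_x, mean_y


def level_row(points):
    """Rotate the points about their centroid so the fitted line becomes horizontal."""
    slope, cx, cy = fit_slope(points)
    angle = -math.atan(slope)  # rotate by the opposite of the fitted tilt
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        (cx + (x - cx) * cos_a - (y - cy) * sin_a,
         cy + (x - cx) * sin_a + (y - cy) * cos_a)
        for x, y in points
    ]


# A row tilted by a slope of 0.05 ends up with a constant y coordinate.
tilted = [(0, 0), (100, 5), (200, 10)]
print(level_row(tilted))
```

Only coordinates are transformed here, which is far cheaper than resampling a 300–400 DPI image.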


Aligning column using linear regression, image by the author

The next step was to clean up ditto indicators (if they were used in a book). Ditto indicators are special markers, used by publishers to save printing costs, that stand in for the surname of a given record. We eliminated them for two main reasons: OCR makes numerous errors on these symbols, and removing them makes rows from different books more consistent.


Primary lines (blue), secondary lines (yellow), line breaks (green), image by the author

We then needed to classify resulting rows by their horizontal indents. There are a few possible indentation configurations, but the standard configuration is shown in the example above.


We classified the rows into primary lines, secondary lines, and line breaks. In specific books or columns, secondary lines, line breaks, or both could be absent. A primary line is a record that starts with a surname. A secondary line is a record that inherits the previous surname but represents another person. A line break is a continuation of the previous record.


To achieve this, we ran clustering algorithms on the relative x-coordinates of rows from the learning set.


Rows clusterization by x-indents, image by the author

These calculations revealed the indentation values used in the book, which we then took into account when processing all of its lines.
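As an illustration, one-dimensional clustering of indents can be as simple as grouping sorted offsets that lie close together. The sketch below is a stand-in for the clustering we actually ran: the `max_gap` threshold is an assumption, and so is the mapping of the leftmost cluster to primary lines, the next to secondary lines, and the deepest to line breaks.

```python
def cluster_indents(offsets, max_gap=5.0):
    """Group sorted x-offsets into clusters separated by more than `max_gap`."""
    clusters = []
    for value in sorted(offsets):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    # Represent each cluster by its mean indent.
    return [sum(c) / len(c) for c in clusters]


def classify_row(offset, centers):
    """Label a row by the nearest learned indent (assumed order, see above)."""
    labels = ["primary", "secondary", "line_break"]
    nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - offset))
    return labels[min(nearest, len(labels) - 1)]


# Hypothetical relative x-offsets of rows from a learning set.
centers = cluster_indents([0, 1, 2, 10, 11, 30, 31, 29])
print(centers, classify_row(9, centers))
```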


Then, knowing the indentation, we could classify the lines throughout the book. After classification, all line breaks were concatenated with the previous lines, and all secondary lines inherited the surnames from the primary lines.
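The concatenation and surname inheritance can be sketched like this. The sketch assumes rows are already classified, that ditto indicators have been stripped from secondary lines, and that the surname is the first token of a primary line; the second person in the sample data is invented.

```python
def build_records(rows):
    """Merge classified rows into full records.

    `rows` is a list of (kind, text) pairs, where kind is one of
    "primary", "secondary", or "line_break".
    """
    records = []
    surname = None
    for kind, text in rows:
        if kind == "primary":
            surname = text.split()[0]  # assumption: surname is the first token
            records.append(text)
        elif kind == "secondary":
            # Secondary lines inherit the surname of the last primary line.
            records.append(f"{surname} {text}" if surname else text)
        elif kind == "line_break" and records:
            # A line break continues the previous record.
            records[-1] = f"{records[-1]} {text}"
    return records


rows = [
    ("primary", "Baruch Albt L (Rose) slsmn Wolf & Co"),
    ("line_break", "h2000 Washn"),
    ("secondary", "Benj clk r2000 Washn"),  # invented example row
]
print(build_records(rows))
```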


Row processing, image by the author

At the final stage of record extraction, we performed OCR cleanup: we searched for and replaced specific known OCR errors. We also assigned a unique identifier to each row, which was used in later steps. The results were stored on S3 and passed to the next module.


Extracted record example, image by the author

Data labeling

Once we had raw records, we needed to break them into logical components that reflect people’s names, spouses, occupations, workplaces, addresses, and so on.


Raw record example: Baruch Albt L (Rose) slsmn Wolf & Co h2000 Washn, image by the author

The first thing we tried was parsing the strings with regular expressions. Regular expressions are very helpful for identifying patterns that follow a certain structure. The problem was that we had no clear structure here: thousands of books used different formats, and there was no chance of covering everything with regexes. Perhaps we could process the records with a sequence of conditions? We could tokenize strings, apply conditional statements, analyze letter case, special characters, and so on. In practice, this approach leads to an unmaintainable tangle of conditionals in the code. The situation was compounded by the fact that we could never expect clean records. Because there are always some OCR errors, real records often look like this:


OCR errors in extracted records, image by the author

It became very inefficient to parse and process such records with a conditional approach.
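The fragility is easy to demonstrate. The pattern below is written for this article to fit the single raw record shown earlier; the group names are my own labels, and the corrupted variant is an invented example of typical OCR damage (a brace instead of a parenthesis, the letter O instead of zero).

```python
import re

# Hand-written for one record layout: surname, given names, (spouse),
# occupation, employer, then "h" + house number and street.
RECORD_RE = re.compile(
    r"(?P<surname>[A-Z][a-z]+) "
    r"(?P<given>[A-Za-z ]+?) "
    r"\((?P<spouse>[A-Za-z]+)\) "
    r"(?P<occupation>[a-z]+) "
    r"(?P<employer>.+) "
    r"h(?P<house_number>\d+) "
    r"(?P<street>.+)"
)


def parse_record(line):
    """Return the named components, or None if the line does not fit the layout."""
    match = RECORD_RE.fullmatch(line)
    return match.groupdict() if match else None


clean = "Baruch Albt L (Rose) slsmn Wolf & Co h2000 Washn"
damaged = "Baruch Albt L (Rose} slsmn Wolf & Co h2OOO Washn"  # invented OCR errors

print(parse_record(clean))
print(parse_record(damaged))  # the whole record is lost, not just two characters
```

A single misread character makes the entire match fail, and every other book layout would need its own pattern.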


Was there a better way to extract the data? Yes. One of the best ways to reach this goal is a specific class of machine learning algorithms named Conditional Random Fields (CRFs). CRFs can be used to predict any sequence in which multiple variables depend on each other. They are commonly used to label or parse sequential data in natural language processing (POS tagging, shallow parsing, named entity recognition), in biological sequence analysis (gene finding), and in computer vision (object recognition, image segmentation).


A CRF takes the previous context in a sequence into account when making a prediction on a data point. This makes it easier to find patterns in sentence structure and even to classify corrupted words correctly.
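In practice, a CRF sees each token as a set of features, including features of its neighbors. The sketch below shows one plausible feature function in the per-token dictionary style that many CRF toolkits accept. The specific features are assumptions chosen for illustration and are not the feature set we used.

```python
def token_features(tokens, i):
    """Describe token i by its own shape and by its immediate neighbors."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "in_parentheses": token.startswith("(") and token.endswith(")"),
        "position": i,
    }
    # Context features let the model use what came before and after.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["is_first"] = True
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["is_last"] = True
    return features


def record_features(record):
    """Turn a raw record into one feature dict per whitespace-separated token."""
    tokens = record.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


for feats in record_features("Baruch Albt L (Rose) slsmn Wolf & Co h2000 Washn"):
    print(feats)
```

Because features such as position and neighbors do not depend on exact spelling, a token damaged by OCR can still receive the right label.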


Stanford NER classifier based on CRF

CRF is a supervised machine learning algorithm, so it requires a training set to produce its predictions. We decided to build a small training set from each of the books. During this process, human operators trained models by assigning a label to each word of the training set, such as name, occupation, address, phone number, and others. We built a system with a convenient interface for labeling large amounts of data.


Model training interface, image by the author

The quality of the labeling was fundamental, since it served as the input data for the machine learning algorithms. A potential error in the labeling could dramatically affect the results of this process. The goal was to build a model for each book to split records into components with a CRF algorithm as best we could.


Operators manually trained models for each of the books. For model quality control we built a self-validation system. The idea was simple: take N records labeled by an operator and split them into T records for a training dataset and V = N - T records for a validation dataset.


A model trained on the training dataset was used to predict labels for the records in the validation dataset, and the predictions were compared with the operator’s labels. Based on this we calculated an approximate score of each model’s quality. Combined with human QA, it gave us a reliable quality measure for trained models.
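The self-validation idea can be sketched in a few lines. The split below is a plain shuffled hold-out, and the score is token-level accuracy. Both are assumptions made for illustration, since the exact metric is not specified here, and the model itself is left out: `predicted` stands for whatever labels a trained model returns.

```python
import random


def split_labeled(records, train_size, seed=0):
    """Shuffle N labeled records and split them into T training and N - T validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def token_accuracy(predicted, expected):
    """Share of tokens whose predicted label equals the operator's label."""
    total = correct = 0
    for pred_labels, true_labels in zip(predicted, expected):
        for pred, true in zip(pred_labels, true_labels):
            total += 1
            correct += pred == true
    return correct / total if total else 0.0


train, validation = split_labeled(list(range(10)), train_size=7)
print(len(train), len(validation))

# Hypothetical predictions for two validation records.
print(token_accuracy([["name", "occupation"], ["name"]],
                     [["name", "address"], ["name"]]))
```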


Once the model was trained and validated, the whole book was processed by a CRF module implemented in Python and deployed on an EC2 cluster. After this step we had each record broken into labeled components. This data was stored on S3, and the next pipeline step was triggered by an SQS message while the operator started work on the next book.


Post Processing

Once records were broken into components by CRF, that was still not enough to build a valuable searchable index. We still needed to run a sequence of computations in a post-processing module, which is a Java application deployed on an auto-scaling EC2 cluster.


In post-processing we expanded abbreviations (common and book-specific), extracted spouses, inferred the surnames of spouses, determined death records, and more.


Each book has its own abbreviation table that was originally keyed manually.


Sample abbreviations table, image by the author

It was important to expand abbreviations during this step since abbreviations are context-dependent, and the same abbreviation in an address and in a name could have different meanings. At this stage we had already deconstructed records into logical parts, and we knew the context of each part. Many other genealogy-specific transformations and error corrections occurred during this step. As a result, we had structured records: a list of individuals from a certain book with their first names and surnames, the relationships between them, addresses, occupations, and other details.
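Context-dependent expansion amounts to keeping a separate lookup table per component label. The production module is a Java application; the sketch below uses Python only for brevity, and both the table contents and the label names are illustrative rather than taken from a real book.

```python
# Illustrative tables: one dictionary per component label.
ABBREVIATIONS = {
    "given_name": {"Albt": "Albert", "Wm": "William"},
    "occupation": {"slsmn": "salesman", "clk": "clerk"},
    "address": {"Washn": "Washington", "N": "North"},
}


def expand(components, tables=ABBREVIATIONS):
    """Expand each token using only the table that matches its label."""
    expanded = []
    for label, token in components:
        expanded.append((label, tables.get(label, {}).get(token, token)))
    return expanded


# "N" is an initial inside a name but a direction inside an address.
print(expand([("given_name", "N"), ("address", "N"), ("occupation", "slsmn")]))
```

Running the expansion before labeling would force a single meaning on "N", which is why it belongs after the CRF step.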


Data extracted from the record, image by the author

Consolidation

An interesting fact about city directories is that the books were published periodically over many years, and the same person could appear repeatedly in different publications. In other words, if John Smith lived in Atlanta for 20 years, he would probably appear in every Atlanta residential directory issued during that period. Books were often published every two years, which gives us good overlap and redundancy in the data.


Consolidated records with death events extracted, image by the author

This idea led us to implement an algorithm that runs over the range of books from a given city and combines matched individuals into consolidated records.


The consolidation algorithm is a complex system that produced matches between records from different books and discovered people’s life events. The system needed to process a large amount of data and was designed to scale horizontally using the MapReduce model. It ran as the final step of the pipeline on an Amazon EC2 cluster.
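The MapReduce shape of the problem can be illustrated in miniature. In this sketch the map step emits a grouping key per record and the reduce step merges each group. The exact-match key on city and name is a deliberate oversimplification (real matching has to tolerate spelling variants and OCR errors), and the field names are assumptions.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit a (key, record) pair; records with equal keys meet in reduce."""
    key = (record["city"], record["surname"].lower(), record["given_name"].lower())
    return key, record


def reduce_group(key, records):
    """Reduce step: merge one person's yearly entries into a consolidated record."""
    ordered = sorted(records, key=lambda r: r["year"])
    return {"key": key, "years": [r["year"] for r in ordered], "records": ordered}


def consolidate(records):
    """Single-process stand-in for the shuffle phase between map and reduce."""
    groups = defaultdict(list)
    for key, value in map(map_record, records):
        groups[key].append(value)
    return [reduce_group(key, group) for key, group in groups.items()]


sample = [
    {"city": "Atlanta", "surname": "Smith", "given_name": "John", "year": 1922},
    {"city": "Atlanta", "surname": "Smith", "given_name": "John", "year": 1920},
    {"city": "Atlanta", "surname": "Jones", "given_name": "Ann", "year": 1920},
]
for person in consolidate(sample):
    print(person["key"], person["years"])
```

Because each key can be reduced independently, the groups can be spread across as many machines as needed.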


Consolidated record details, image by the author

During this process, the system automatically discovered an individual’s genealogy events, hidden at first glance, but visible by analyzing changes in the record data. For example, it can detect the year range when a certain person lived at a given address. It can detect approximate marriage events, when a spouse appears for the first time; furthermore, it is able to infer death events, when a wife becomes a widow in a sequence of consolidated records.
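The event inference described here can be sketched as simple scans over a person's entries ordered by year. The entry fields (`address`, `spouse`, `widow`) and the sample data are invented for the example, and the real system applies more careful rules than "first year seen".

```python
def address_ranges(entries):
    """Collapse consecutive entries with the same address into year ranges."""
    ranges = []
    for entry in sorted(entries, key=lambda e: e["year"]):
        if ranges and ranges[-1]["address"] == entry["address"]:
            ranges[-1]["to"] = entry["year"]
        else:
            ranges.append({"address": entry["address"],
                           "from": entry["year"], "to": entry["year"]})
    return ranges


def marriage_circa(entries):
    """Approximate marriage year: the first year a spouse is listed."""
    years = [e["year"] for e in entries if e.get("spouse")]
    return min(years) if years else None


def death_circa(entries):
    """Approximate death year: the first year the spouse is listed as a widow."""
    years = [e["year"] for e in entries if e.get("widow")]
    return min(years) if years else None


# Invented entries for one household across four directories.
entries = [
    {"year": 1912, "address": "12 Oak", "spouse": None},
    {"year": 1914, "address": "12 Oak", "spouse": "Mary"},
    {"year": 1916, "address": "40 Elm", "spouse": "Mary"},
    {"year": 1959, "address": "40 Elm", "spouse": None, "widow": True},
]
print(address_ranges(entries), marriage_circa(entries), death_circa(entries))
```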


Example of consolidation from 31 books, image by the author

In the example above, the algorithm consolidated 31 records from different books published between 1912 and 1959 into a single genealogy record. The system detected that Alfred and Mary Albert married circa 1914. Albert worked as a conductor, carpenter, and motorman during his life and died circa 1959.


Putting It All Together

All components described above were deployed on AWS infrastructure and formed a pipeline that allowed us to modify the algorithms and immediately re-run processing on any part of the data.


City Directories Pipeline diagram, image by the author

Consolidated records, produced in the last step of the pipeline, represent structured searchable genealogy data that is loaded and indexed into MyHeritage SuperSearch™, a genealogy search engine that contains 12.5 billion historical records.


With the described pipeline we were able to process 25,000 public U.S. city directories published between 1860 and 1960. The resulting collection comprises 545 million aggregated records that have been consolidated from 1.3 billion records. Its coverage is so broad that almost anyone researching family history in the U.S. during those years is likely to find their ancestors listed.


City Directories search form, image by the author

If you liked the article, check out another major project we accomplished: extracting 290 million individuals from U.S. Yearbooks


