The Classic Vector Space Model

Description, Advantages and Limitations of the Classic Vector Space Model

Global Information

Unlike the Term Count Model, Salton's Vector Space Model [1] incorporates local and global information

Eq 1: wi = tfi × log(D/dfi)

where

  • tfi = term frequency (term counts) or number of times a term i occurs in a document. This accounts for local information.
  • dfi = document frequency or number of documents containing term i
  • D = number of documents in a database.

The dfi/D ratio is the probability of selecting a document containing a queried term from a collection of documents. This can be viewed as a global probability over the entire collection. Thus, the log(D/dfi) term is the inverse document frequency, IDFi, and accounts for global information. The following figure illustrates the relationship between local and global frequencies in an ideal database collection consisting of five documents D1, D2, D3, D4, and D5. Only three documents contain the term "CAR". Querying the system for this term gives an IDF value of log(5/3) = 0.2218.

[Figure: a collection of five documents, D1 through D5, of which three contain the term "CAR"]
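As a quick check of that number, here is a minimal Python sketch (base-10 logarithms, as in the rest of this article):

    import math

    D = 5   # documents in the hypothetical collection
    df = 3  # documents containing the term "CAR"

    idf = math.log10(D / df)  # IDF = log(D/df)
    print(round(idf, 4))      # 0.2218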

Self-Similarity Elements

Those of us specialized in applied fractal geometry recognize the self-similar nature of this figure up to certain scales. Note that collections consist of documents, documents consist of passages, and passages consist of sentences. Thus, for a term i in a document j we can speak of collection frequencies (Cf), term frequencies (tf), passage frequencies (Pf), and sentence frequencies (Sf).

Eq 2(a, b, c): [equations not recovered; frequency-based weights defined at different scales, with Eq 2(b) corresponding to the document-level weight of Eq 1]

Eq 2(b) is implicit in Eq 1. Models that attempt to associate term weights with frequency values must take into consideration the scaling nature of relevancy. Certainly, the so-called "keyword density" ratio promoted by many search engine optimizers (SEOs) is not in this category.
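As a rough sketch of these scale-specific counts for a single document, under one possible reading (Pf and Sf are taken here as the number of passages and sentences containing the term, and the passage/sentence splitting rules below are simplifying assumptions of mine, not part of the model):

    def scale_frequencies(term, document):
        # Term frequency: raw count of the term in the whole document
        term = term.lower()
        tf = document.lower().split().count(term)
        # Passage frequency: passages (blank-line-separated blocks) containing the term
        passages = [p for p in document.split("\n\n") if p.strip()]
        pf = sum(1 for p in passages if term in p.lower().split())
        # Sentence frequency: sentences (split on periods) containing the term
        sentences = [s for s in document.replace("\n", " ").split(".") if s.strip()]
        sf = sum(1 for s in sentences if term in s.lower().split())
        return tf, pf, sf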

Vector Space Example

To understand Eq 1, let us use a trivial example. To simplify, let us assume we deal with a basic term vector model in which we

  1. do not take into account WHERE the terms occur in documents.
  2. use all terms, including very common terms and stopwords.
  3. do not reduce terms to root terms (stemming).
  4. use raw frequencies for terms and queries (unnormalized data).

I'm presenting the following example, courtesy of Professors David Grossman and Ophir Frieder, from the Illinois Institute of Technology [2]. This is one of the best examples on term vector calculations available online.

  • By the way, Dr. Grossman and Dr. Frieder are the authors of the authoritative book Information Retrieval: Algorithms and Heuristics. Originally published in 1997, a new edition is now available through Amazon.com [3]. This is must-read literature for graduate students, search engineers, and search engine marketers. The book focuses on the real workings behind IR systems and search algorithms.

Suppose we query an IR system with the query "gold silver truck". The database collection consists of three documents (D = 3) with the following content:

D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Retrieval results are summarized in the following table.

    Terms      Counts, tfi        dfi  D/dfi  IDFi    Weights, wi = tfi × IDFi
               Q   D1  D2  D3                         Q       D1      D2      D3
    a          0   1   1   1      3    3/3    0       0       0       0       0
    arrived    0   0   1   1      2    3/2    0.1761  0       0       0.1761  0.1761
    damaged    0   1   0   0      1    3/1    0.4771  0       0.4771  0       0
    delivery   0   0   1   0      1    3/1    0.4771  0       0       0.4771  0
    fire       0   1   0   0      1    3/1    0.4771  0       0.4771  0       0
    gold       1   1   0   1      2    3/2    0.1761  0.1761  0.1761  0       0.1761
    in         0   1   1   1      3    3/3    0       0       0       0       0
    of         0   1   1   1      3    3/3    0       0       0       0       0
    shipment   0   1   0   1      2    3/2    0.1761  0       0.1761  0       0.1761
    silver     1   0   2   0      1    3/1    0.4771  0.4771  0       0.9542  0
    truck      1   0   1   1      2    3/2    0.1761  0.1761  0       0.1761  0.1761

The tabular data is based on Dr. Grossman's example. I have added the last four columns to illustrate all term weight calculations. Let's analyze the raw data, column by column.

  1. Columns 1 - 5: First, we construct an index of terms from the documents and determine the term counts tfi for the query and each document Dj.
  2. Columns 6 - 8: Second, we compute the document frequency dfi for each term. Since IDFi = log(D/dfi) and D = 3, this calculation is straightforward.
  3. Columns 9 - 12: Third, we take the tf*IDF products and compute the term weights. These columns can be viewed as a sparse matrix in which most entries are zero. (A short code sketch of these three steps follows this list.)
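The following Python sketch reproduces the three steps above (base-10 logarithms, raw counts, no stemming and no stopword removal, as stated earlier); the variable names are mine, not part of the original example:

    import math

    docs = {
        "D1": "shipment of gold damaged in a fire",
        "D2": "delivery of silver arrived in a silver truck",
        "D3": "shipment of gold arrived in a truck",
    }
    query = "gold silver truck"
    D = len(docs)  # 3 documents in the collection

    # Columns 1-5: index of terms and raw term counts for the query and each document
    terms = sorted(set(" ".join(docs.values()).split()))
    tf = {name: {t: text.split().count(t) for t in terms} for name, text in docs.items()}
    tf["Q"] = {t: query.split().count(t) for t in terms}

    # Columns 6-8: document frequency df and IDF = log(D/df) for each term
    df = {t: sum(1 for name in docs if tf[name][t] > 0) for t in terms}
    idf = {t: math.log10(D / df[t]) for t in terms}

    # Columns 9-12: term weights w = tf * IDF for the query and each document
    w = {name: {t: tf[name][t] * idf[t] for t in terms} for name in ("Q", "D1", "D2", "D3")}

    print(round(idf["gold"], 4))        # 0.1761
    print(round(w["D2"]["silver"], 4))  # 0.9542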

Now we treat the weights as coordinates in the vector space, effectively representing documents and the query as vectors. To find out which document vector is closer to the query vector, we resort to the similarity analysis introduced in Part 2.

Similarity Analysis

First, for each document and the query, we compute the vector lengths (zero-weight terms ignored)

    |D1| = sqrt(0.4771² + 0.4771² + 0.1761² + 0.1761²) = 0.7192
    |D2| = sqrt(0.1761² + 0.4771² + 0.9542² + 0.1761²) = 1.0955
    |D3| = sqrt(0.1761² + 0.1761² + 0.1761² + 0.1761²) = 0.3522
    |Q|  = sqrt(0.1761² + 0.4771² + 0.1761²) = 0.5382

Next, we compute all dot products (zero products ignored)

    Q · D1 = 0.1761 × 0.1761 = 0.0310
    Q · D2 = 0.4771 × 0.9542 + 0.1761 × 0.1761 = 0.4552 + 0.0310 = 0.4862
    Q · D3 = 0.1761 × 0.1761 + 0.1761 × 0.1761 = 0.0310 + 0.0310 = 0.0620

Now we calculate the similarity values

    sim(Q, D1) = Q · D1 / (|Q| × |D1|) = 0.0310 / (0.5382 × 0.7192) = 0.0801
    sim(Q, D2) = Q · D2 / (|Q| × |D2|) = 0.4862 / (0.5382 × 1.0955) = 0.8246
    sim(Q, D3) = Q · D3 / (|Q| × |D3|) = 0.0620 / (0.5382 × 0.3522) = 0.3271

Finally we sort and rank the documents in descending order according to the similarity values

Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
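Continuing the sketch above (reusing the w and terms dictionaries), the lengths, dot products and cosine ranking can be reproduced as follows:

    # Vector lengths |V| = sqrt(sum of squared weights); zero weights contribute nothing
    length = {name: math.sqrt(sum(x * x for x in w[name].values()))
              for name in ("Q", "D1", "D2", "D3")}

    # Dot products Q · Dj; products involving a zero weight drop out automatically
    dot = {name: sum(w["Q"][t] * w[name][t] for t in terms) for name in ("D1", "D2", "D3")}

    # Cosine similarities and the final ranking in descending order
    sim = {name: dot[name] / (length["Q"] * length[name]) for name in ("D1", "D2", "D3")}
    for rank, (name, s) in enumerate(sorted(sim.items(), key=lambda kv: kv[1], reverse=True), 1):
        print(f"Rank {rank}: {name} = {s:.4f}")
    # D2 ranks first, then D3, then D1, as above; the last decimal place can differ
    # slightly from the hand calculation, which rounds intermediate values.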

Observations

This example illustrates several facts. First, very frequent terms such as "a", "in", and "of" tend to receive a low weight, a value of zero in this case. Thus, the model correctly predicts that very common terms, occurring in many documents of a collection, are not good discriminators of relevancy. Second, this reasoning is based on global information, i.e., the IDF term, which is precisely why this model is better than the term count model discussed in Part 2. Third, instead of calculating individual vector lengths and dot products, we can save computational time by applying the similarity function directly

Eq 3: sim(Q, Dj) = Σi (wi,Q × wi,j) / ( sqrt(Σi wi,Q²) × sqrt(Σi wi,j²) )

Of course, we still need to know individual tf and IDF values.
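As a minimal sketch, assuming the query and each document are represented as dictionaries mapping a term to its tf × IDF weight, Eq 3 can be applied in one step:

    import math

    def sim(q, d):
        # Eq 3: dot product of the weight vectors over the product of their lengths
        dot = sum(wq * d.get(t, 0.0) for t, wq in q.items())
        return dot / (math.sqrt(sum(x * x for x in q.values())) *
                      math.sqrt(sum(x * x for x in d.values())))

Note that only terms present in both vectors contribute to the numerator, while the denominator still requires the full document vector; this is the limitation discussed in the next section.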

Limitations of the Model

As a basic model, the term vector scheme discussed here has several limitations. First, it is computationally intensive: scoring a collection requires a great deal of processing time. Second, each time we add a new term into the term space we need to recalculate all vectors. As pointed out by Lee, Chuang and Seamons [4], computing the length of a document vector (the second term in the denominator of Eq 3) requires access to every term in that document, not just the terms specified in the query.

Other limitations include

  1. Long Documents: Very long documents make similarity measures difficult (vectors with small dot products and high dimensionality)
  2. False negative matches: documents with similar content but different vocabularies may result in a poor inner product. This is a limitation of keyword-driven IR systems.
  3. False positive matches: improper wording, prefix/suffix removal or parsing can result in spurious hits
    (falling, fall + ing; therapist, the + rapist, the + rap + ist; Marching, March + ing; GARCIA, GAR + CIA). This is a pre-processing limitation, not strictly a limitation of the vector model.
  4. Semantic content: Systems for handling semantic content may need to use special tags (containers)

We can improve the model by

  1. getting a set of keywords that are representative of each document.
  2. eliminating all stopwords and very common terms ("a", "in", "of", etc).
  3. stemming terms to their roots (a short sketch of items 2 and 3 follows this list).
  4. limiting the vector space to nouns and a few descriptive adjectives and verbs.
  5. using compact signature files or moderately sized inverted files.
  6. using theme mapping techniques.
  7. computing subvectors (passage vectors) in long documents
  8. not retrieving documents below a defined cosine threshold
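A minimal sketch of items 2 and 3, assuming a hand-picked stopword list and a deliberately naive suffix-stripping rule in place of a real stemmer (both are illustrative assumptions, not a standard recipe):

    STOPWORDS = {"a", "an", "the", "in", "of", "and", "or", "to"}

    def naive_stem(word):
        # Crude suffix removal for illustration only; production systems use a
        # proper algorithm (e.g. Porter stemming) to limit spurious conflations.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(text):
        # Lowercase, drop stopwords, reduce remaining terms to rough root forms
        return [naive_stem(w) for w in text.lower().split() if w not in STOPWORDS]

    print(preprocess("Delivery of silver arrived in a silver truck"))
    # ['delivery', 'silver', 'arriv', 'silver', 'truck']

The resulting term lists are shorter and the vector space smaller, at the cost of the preprocessing errors noted in the limitations above.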

On Polysemy and Synonymity

A main disadvantage of this and all term vector models is that terms are assumed to be independent (i.e. no relation exists between the terms). Often this is not the case. Terms can be related by

  1. Polysemy; i.e., terms can be used to express different things in different contexts (e.g. driving a car and driving results). Thus, some irrelevant documents may have high similarities because they may share some words from the query. This affects precision.
  2. Synonymity; i.e., terms can be used to express the same thing (e.g. car insurance and auto insurance). Thus, the similarity of some relevant documents with the query can be low just because they do not share the same terms. This affects recall.

Of the two, synonymity tends to have the more detrimental effect on term vector scores.

Acknowledgements

The author thanks Professors David Grossman and Ophir Frieder, from the Illinois Institute of Technology, for allowing him to use information from their Vector Space Implementation graduate lectures.

The author also thanks Gupta Uddhav and Do Te Kien from the University of San Francisco for referencing this resource in their PERSONAL WEB NEIGHBORHOOD project.

Next: Vector Models based on Normalized Frequencies
Prev: The Term Counts Model
References
  1. Salton, Gerard. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
  2. Grossman, David and Frieder, Ophir. Vector Space Implementation (graduate lecture notes), Illinois Institute of Technology.
  3. Grossman, David and Frieder, Ophir. Information Retrieval: Algorithms and Heuristics. Kluwer International Series in Engineering and Computer Science, 461.
  4. Lee, Chuang and Seamons. Document Ranking and the Vector-Space Model.
