WordStat help file: the "Extraction of Topics" section (English)

Extraction of Topics



The Topic Extraction feature of WordStat attempts to uncover the hidden thematic structure of a text collection by combining natural language processing with statistical analysis. The main statistical procedure used for topic extraction in WordStat is factor analysis. Technically, the extraction starts by computing a word-by-document frequency matrix or, alternatively, by segmenting documents into smaller chunks and computing a word-by-segment frequency matrix. Once this matrix is obtained, a factor analysis with Varimax rotation is computed in order to extract a small number of factors. All words with a factor loading higher than a specific criterion are then retrieved as part of the extracted topic. In hierarchical cluster analysis a word may appear in only one cluster, whereas topic modeling using factor analysis may associate a word with more than one factor, a characteristic that more realistically represents the polysemous nature of some words as well as the multiple contexts of word usage.
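
As a rough illustration of the procedure described above, the following Python sketch computes word-by-word correlations from a small segment-by-word frequency matrix, extracts two factors, and applies a Varimax rotation. It is not WordStat's implementation: the principal-component style of extraction, the function names `varimax` and `extract_loadings`, and the toy counts are all assumptions made for the example.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate an (items x factors) loading matrix (Varimax)."""
    n_items, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = (rotated ** 2).sum(axis=0)
        # Standard SVD-based update for the Varimax criterion.
        target = rotated ** 3 - rotated * (column_ss / n_items)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        if s.sum() < previous * (1 + tol):
            break
        previous = s.sum()
    return loadings @ rotation

def extract_loadings(counts, n_topics):
    """counts: (segments x words) frequencies -> (words x n_topics) loadings.

    Assumption: factors are taken from the eigen-decomposition of the
    word correlation matrix (principal-component extraction).
    """
    corr = np.corrcoef(counts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    top = np.argsort(eigvals)[::-1][:n_topics]
    unrotated = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
    return varimax(unrotated)

# Hypothetical toy data: 8 segments (rows) by 6 words (columns).
words = ["tax", "budget", "deficit", "spending", "school", "teacher"]
counts = np.array([
    [3, 2, 3, 2, 0, 0],
    [2, 3, 2, 3, 0, 0],
    [0, 0, 0, 0, 3, 2],
    [0, 0, 0, 0, 2, 3],
    [3, 3, 2, 2, 2, 3],
    [2, 2, 3, 3, 3, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

loadings = extract_loadings(counts, n_topics=2)
for word, row in zip(words, loadings):
    print(word, np.round(row, 2))
```

In this toy matrix the four budget-related words co-occur with each other and the two school-related words co-occur with each other, so each group should load strongly on its own factor. The sign of a factor is arbitrary, which is why the absolute value of the loading is usually what one inspects.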

 

The current implementation of the topic-modeling procedure has a limit of 2,500 words or content categories. (We are working on ways to increase the capability to at least twice this amount.) To ensure the stability of the factoring solution, low-frequency items should preferably be excluded. It is thus strongly recommended to remove any word occurring fewer than 10 times in smaller data sets, and ideally fewer than 30 to 50 times in larger ones. Stemming, lemmatization, or the creation of a categorization dictionary may also be used to group words or phrases, including less frequent ones, prior to the topic extraction.
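
The frequency recommendation above amounts to a simple filter applied before the analysis. The sketch below is only an illustration: the function name `select_vocabulary`, the whitespace tokenization, and the sample texts are assumptions, while the default thresholds (10 occurrences, 2,500 items) come from the paragraph above.

```python
from collections import Counter

def select_vocabulary(documents, min_count=10, max_items=2500):
    """Return words occurring at least min_count times, most frequent first,
    capped at max_items entries."""
    counts = Counter(
        word for text in documents for word in text.lower().split()
    )
    frequent = [w for w, c in counts.most_common() if c >= min_count]
    return frequent[:max_items]

# Tiny hypothetical corpus, so a low threshold is used for the demonstration.
docs = ["tax tax budget", "tax school"]
print(select_vocabulary(docs, min_count=2))  # ['tax']
```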

 

WordStat provides the following analysis options to control the topic modeling process:

 

Segmentation - This option allows one to specify whether the data to be used for topic modeling will be based on the co-occurrence of words in the same document, or whether they will be based on co-occurrence within paragraphs or sentences. The choice of segmentation should ideally reflect how topics are being distributed in a typical document and across documents, as well as the objective of the analysis. When the text collection consists of long documents containing multiple topics (such as long political speeches) and one needs to identify all topics in order to compare their relative frequencies, then performing a segmentation by paragraph or by sentence may be more sensitive than computing co-occurrences by documents. Alternatively, if one attempts to differentiate documents by identifying domains or disciplines, or to identify the dominant issue of documents, then performing the analysis at the document level may be more appropriate. When analyzing responses to open-ended questions, which may include several topics listed in a single paragraph, segmenting by sentence may also result in a more precise extraction of the various topics they contain.

 

No. Topics - Setting this option allows one to specify how many topics to extract.

 

Loading - This option sets the minimum factor loading a word must reach in order to be retained in the factor solution. By default, this value is set to 0.4. Increasing the cutoff value will reduce the number of words, keeping only the more representative ones, while reducing it may include words that are somewhat less characteristic of the extracted topic.
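
To make the effect of the Loading option concrete, here is a minimal sketch of how a cutoff could turn a loading matrix into keyword lists. The function name `topics_from_loadings` and the loading values are hypothetical, and the comparison uses `>=` on the raw loading, which is an assumption about how the cutoff is applied.

```python
def topics_from_loadings(words, loadings, cutoff=0.4):
    """Map factor number -> words whose loading reaches the cutoff,
    sorted by descending loading. Factors with no such word are omitted."""
    topics = {}
    n_factors = len(loadings[0])
    for factor in range(n_factors):
        kept = [
            (word, row[factor])
            for word, row in zip(words, loadings)
            if row[factor] >= cutoff
        ]
        if kept:
            kept.sort(key=lambda pair: pair[1], reverse=True)
            topics[factor + 1] = [word for word, _ in kept]
    return topics

# Hypothetical loadings: "bank" is associated with both factors.
words = ["bank", "loan", "river", "water", "tree"]
loadings = [
    [0.62, 0.45],
    [0.81, 0.05],
    [0.10, 0.77],
    [0.02, 0.58],
    [0.20, 0.30],
]

print(topics_from_loadings(words, loadings))
# {1: ['loan', 'bank'], 2: ['river', 'water', 'bank']}
print(topics_from_loadings(words, loadings, cutoff=0.6))
# {1: ['loan', 'bank'], 2: ['river']}
```

Raising the cutoff from 0.4 to 0.6 drops "water" and the second occurrence of "bank", which mirrors the trade-off described for this option.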

 

Once the options have been set, click the button to perform the analysis. Please note that extracting topics on more than a few hundred words can take several minutes. Once extracted, the TOPICS page should look like this:

The table to the left contains the following information:

 

NO - Shows the factor number. Please note that some factor numbers may be omitted if none of their items attained the factor-loading cutoff criterion. When factors are merged by the user, this column displays the numbers of all factors that have been merged together.

NAME - WordStat uses an algorithm to automatically provide a label for the extracted topic. This label may be edited by clicking the button.

KEYWORDS - Lists all keywords meeting the factor-loading cutoff criterion, in descending order of factor loading.

% VAR - Shows the percentage of variance explained. Please note that the smaller the segment chosen, the lower the percentage.

FREQ - Displays the total frequency of all items listed in the keywords column.

CASES - Shows the number of cases containing at least one of the items listed in the keywords column.

% CASES - Displays the percentage of cases with at least one of the items listed in the keywords column.
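
For readers unfamiliar with the % VAR column, the usual factor-analytic definition is the sum of squared loadings on a factor divided by the number of items, expressed as a percentage. The sketch below applies that textbook formula to hypothetical loadings; whether WordStat computes the column in exactly this way is an assumption, and the function name `percent_variance` is made up for the example.

```python
def percent_variance(loadings):
    """Percentage of total variance explained by each factor, assuming
    standardized items (each item contributes a variance of 1)."""
    n_items = len(loadings)
    n_factors = len(loadings[0])
    return [
        100.0 * sum(row[j] ** 2 for row in loadings) / n_items
        for j in range(n_factors)
    ]

# Hypothetical loadings for four words on two factors.
example_loadings = [
    [0.8, 0.0],
    [0.6, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]
pct = percent_variance(example_loadings)
print([round(p, 1) for p in pct])  # [25.0, 12.5]
```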

 

 

Topic Modeling Buttons:

 

Allows one to delete the topic on the selected row.

 

Click to merge a topic into another one. First select the row containing the first topic to be merged, and then click this button. A dialog box will appear with a list of all other topics. Select the second topic and click OK.

 

To rename a topic, first select the topic and then click this button. Type the new name and click OK.

 

To retrieve segments associated with a topic, select it and click this button. All text segments containing at least two keywords of the selected topic will be retrieved and presented in a table format. You may, however, change both the type of segments retrieved (paragraphs, sentences, or full documents) and the minimum number of topic words needed for retrieval.

 

Allows one to perform co-occurrence analysis of all the extracted topics including clustering and multidimensional scaling, and to create proximity plots as well as link charts. For more information on the various features available, see the Co-Occurrence Page topic.

 

 

Allows one to perform full crosstabulation analysis of all the displayed topics with structured data, to apply statistical analysis, and to create various charts such as correspondence plots, heatmaps, bubble charts, and bar charts.  For more information on the various features available for crosstabulation analysis, see the Crosstab Page topic.

 

 

Stores the extracted topics currently displayed into a new categorization dictionary where folders at the first level correspond to different topics, and where each of those folders contains the associated words. A dialog box allows one to save

 

 

Press this button to append a copy of the topic table to the Report Manager. A descriptive title will be provided automatically. To edit this title or to enter a new one, hold down the SHIFT key while clicking this button (for more information on the Report Manager, see the Report Management Feature topic).

 

 

Allows one to store the topic table to disk in various formats, including Excel, tab- and comma-delimited files, plain text, HTML, XML, SPSS, or Stata files.

 

 

Allows one to print a copy of the displayed chart.
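
The retrieval rule described above for the segment-retrieval button (segments containing at least two keywords of the selected topic, with an adjustable segment type and minimum) can be sketched as follows. This is an illustration only: the function name `retrieve_segments`, the regular expressions used to split sentences and paragraphs, and the choice to count distinct keywords rather than total occurrences are all assumptions.

```python
import re

def retrieve_segments(text, keywords, unit="sentence", min_keywords=2):
    """Return segments of `text` containing at least `min_keywords`
    distinct topic keywords."""
    if unit == "paragraph":
        segments = re.split(r"\n\s*\n", text)
    elif unit == "sentence":
        segments = re.split(r"(?<=[.!?])\s+", text)
    elif unit == "document":
        segments = [text]
    else:
        raise ValueError(f"unknown unit: {unit}")

    keyset = {k.lower() for k in keywords}
    results = []
    for segment in segments:
        tokens = set(re.findall(r"[a-z']+", segment.lower()))
        if len(tokens & keyset) >= min_keywords:
            results.append(segment.strip())
    return results

sample = (
    "The tax plan raises the budget deficit. "
    "Schools reopen on Monday. "
    "The budget is tight."
)
topic = ["tax", "budget", "deficit"]

print(retrieve_segments(sample, topic))
# ['The tax plan raises the budget deficit.']
print(retrieve_segments(sample, topic, min_keywords=1))
# ['The tax plan raises the budget deficit.', 'The budget is tight.']
```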

 

 

Using the Right Panel

The right side of this table is a panel that allows one to look at the distribution of the selected topic among values of up to two structured variables. One may display this distribution using either a vertical bar chart, a horizontal bar chart or a line chart, by clicking on the corresponding button. Four statistics may also be represented on those charts:

 

Case Occurrence - number of cases in this subgroup containing at least one of these words.

Category Percent - percentage of cases in this subgroup containing at least one of these words.

Word Frequency - total number of these words in this subgroup.

Rate per 10,000 Words - rate of words in this subgroup per 10,000 words.
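
The four statistics listed above can be computed per subgroup along the following lines. This is a hedged sketch rather than WordStat's own calculation: the function name `subgroup_statistics`, the whitespace tokenization, and the use of the subgroup's total token count as the denominator of the rate are assumptions.

```python
from collections import defaultdict

def subgroup_statistics(cases, keywords):
    """cases: list of (subgroup_label, text) pairs.
    Returns the four chart statistics for each subgroup."""
    keyset = {k.lower() for k in keywords}
    totals = defaultdict(
        lambda: {"n_cases": 0, "hit_cases": 0, "hits": 0, "tokens": 0}
    )
    for group, text in cases:
        tokens = text.lower().split()
        hits = sum(1 for token in tokens if token in keyset)
        t = totals[group]
        t["n_cases"] += 1
        t["hit_cases"] += 1 if hits else 0
        t["hits"] += hits
        t["tokens"] += len(tokens)
    return {
        group: {
            "case_occurrence": t["hit_cases"],
            "category_percent": 100.0 * t["hit_cases"] / t["n_cases"],
            "word_frequency": t["hits"],
            "rate_per_10000": (
                10000.0 * t["hits"] / t["tokens"] if t["tokens"] else 0.0
            ),
        }
        for group, t in totals.items()
    }

# Hypothetical cases tagged with a structured variable (groups "A" and "B").
data = [
    ("A", "tax budget plan today"),
    ("A", "weather report is fine"),
    ("B", "budget talks"),
]
stats = subgroup_statistics(data, ["tax", "budget"])
print(stats["A"])
# {'case_occurrence': 1, 'category_percent': 50.0,
#  'word_frequency': 2, 'rate_per_10000': 2500.0}
```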

Right-clicking anywhere in the chart areas displays a popup menu that allows one to edit the chart, save it to disk or in the Report Manager, or to copy it to the clipboard. Clicking a specific bar or a data point of a line chart also allows one to retrieve text segments associated with the selected class and containing words of the selected topic.

 

 

 
