Comparing Brazilian and US university theses using natural language processing

by Déborah Mesquita

People are more likely to consider a thesis that’s written by a student at a top-ranked University as better than a thesis produced by a student at a University with low (or no) status.

But in what way are the works different? What can the students from non-famous Universities do to produce better work and become more well-known?

I was curious to answer these questions, so I decided to explore two things only: the themes of the works and their nature. Measuring the quality of a university is something very complex, and is not my goal here. We will analyze a number of Undergraduate theses using natural language processing. We’ll extract keywords using tf-idf and classify the theses using Latent Semantic Indexing (LSI).

The data

Our dataset has abstracts of Undergraduate Computer Science Theses from Federal University of Pernambuco (UFPE), located in Brazil, and from Carnegie Mellon University, located in the United States. Why Carnegie Mellon? Because it was the only University where I could find a list of theses produced by students who were at the end of their Undergraduate degree program.

The Times Higher Education World University Rankings says that Carnegie Mellon has the 6th best Computer Science program, while UFPE is not even in this ranking. Carnegie Mellon ranks 23rd in the World University Ranking, and UFPE is around 801st.

All works were produced between the years of 2002 and 2016. Each thesis has the following information:

  • title of the thesis
  • abstract of the thesis
  • year of the thesis
  • university where the thesis was produced

Theses from Carnegie Mellon can be found here and theses from Federal University of Pernambuco can be found here.

Step 1 — Investigating the themes of the theses

Extracting keywords

To get the themes of the thesis, we will use a well known algorithm called tf-idf.

tf-idf

What tf-idf does is to penalize words that appear a lot in a document and at the same time appear a lot in other documents. If this happens, the word is not a good pick to characterize this text (as the word could also be used to characterize all the texts). Let’s use an example to understand this better. We have two documents:

Document 1:

| Term   | Term Count |
|--------|------------|
| this   | 1          |
| is     | 1          |
| a      | 2          |
| sample | 1          |

And Document 2:

| Term    | Term Count |
|---------|------------|
| this    | 1          |
| is      | 1          |
| another | 2          |
| example | 3          |

First let’s see what’s going on. The word this appears 1 time in both documents. This could mean that the word is kind of neutral, right?

On the other hand, the word example appears 3 times in Document 2 and 0 times in Document 1. Interesting.

Now let’s apply some math. We need to compute two things: TF (Term Frequency) and IDF (Inverse Document Frequency).

The equation for TF is:

TF(t) = (Number of times that term t appears in the document) / (Total number of terms in the document)

So for terms this and example, we have:

TF('this',    Document 1) = 1/5 = 0.2
TF('example', Document 1) = 0/5 = 0
TF('this',    Document 2) = 1/7 = 0.14
TF('example', Document 2) = 3/7 = 0.43

The equation for IDF is:

IDF(t) = log(Total number of documents / Number of documents where term t is present)

Why do we use a logarithm here? Because tf-idf is a heuristic.

The intuition was that a query term which occurs in many documents is not a good discriminator, and should be given less weight than one which occurs in few documents, and the measure was an heuristic implementation of this intuition. — Stephen Robertson

As usεr11852 explains here:

The aspect emphasised is that the relevance of a term or a document does not increase proportionally with term (or document) frequency. Using a sub-linear function (the logarithm) therefore helps dumped down (sic) this effect. …The influence of very large or very small values (e.g. very rare words) is also amortised. — usεr11852

Using the equation for IDF, we have:

IDF('this',   Documents) = log(2/2) = 0
IDF('example',Documents) = log(2/1) = 0.30

And finally, the TF-IDF:

TF-IDF('this',    Document 2) = 0.14 x 0    = 0
TF-IDF('example', Document 2) = 0.43 x 0.30 = 0.13
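
As a sanity check, here is a minimal sketch that reproduces these numbers in plain Python. It uses a base-10 logarithm, which is what the values above correspond to; the two toy documents are the ones from the tables:

```python
from math import log10

doc1 = "this is a a sample".split()
doc2 = "this is another another example example example".split()
docs = [doc1, doc2]

def tf(term, doc):
    # share of the document's terms that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (total documents / documents containing the term), base 10 here
    containing = sum(1 for d in docs if term in d)
    return log10(len(docs) / containing)

for term in ("this", "example"):
    print(term, round(tf(term, doc2) * idf(term, docs), 2))
# this 0.0
# example 0.13
```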

I used the 4 words with the highest tf-idf scores for each thesis. I did this using CountVectorizer and TfidfTransformer from scikit-learn.

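A minimal sketch of that step, assuming the abstracts are already loaded into a list of strings (the abstracts below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# illustrative abstracts; the real input is the abstracts from both universities
abstracts = [
    "A study of graph algorithms for robot motion planning.",
    "Improving software project management in small businesses.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(abstracts)        # raw term counts per abstract
tfidf = TfidfTransformer().fit_transform(counts)    # tf-idf weights
terms = vectorizer.get_feature_names_out()

for row in tfidf.toarray():
    top = row.argsort()[::-1][:4]                   # indices of the 4 highest scores
    print([terms[i] for i in top])
```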

You can see the Jupyter notebook with the code here.

With 4 keywords for each thesis, I used the WordCloud library to visualize the words for each University.

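A small sketch of that visualization, assuming `keywords_per_thesis` holds the 4 keywords extracted for each thesis of one university (the variable name and sample keywords are illustrative):

```python
from wordcloud import WordCloud

keywords_per_thesis = [
    ["object", "robot", "algorithm", "visual"],
    ["software", "project", "business", "process"],
]

text = " ".join(word for keywords in keywords_per_thesis for word in keywords)
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("keywords_cloud.png")  # one image per university
```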

Topic Modeling

Another strategy I used to explore the themes from theses of both Universities was topic modeling with Latent Semantic Indexing (LSI).

Latent Semantic Indexing

This algorithm gets data from tf-idf and uses matrix decomposition to group documents in topics. We will need some linear algebra to understand this, so let’s start.

Singular Value Decomposition (SVD)

First we need to define how to do this matrix decomposition. We will use Singular Value Decomposition (SVD). Given a matrix M of dimensions m x n, M can be described as:

M = UDV*

Where U and V* are orthonormal bases (V* represents the transpose of matrix V). A basis is orthonormal when two things hold (normal + orthogonal):

  • all vectors are of length 1
  • all vectors are mutually orthogonal (they make an angle of 90°)

D is a diagonal matrix (the entries outside the main diagonal are all zero).

To get a sense of how all of this works together we will use the brilliant geometric explanation from this article by David Austin.

Let’s say we have a matrix M:

M = | 3 0 |
    | 0 1 |

We can take a point (x,y) in the plane and transform it into another point using matrix multiplication:

| 3 0 | . | x | = | 3x |
| 0 1 |   | y |   | y  |

The effect of this transformation is shown below:

As we can see, the plane is horizontally stretched by a factor of 3, while there is no vertical change.

Now, if we take another matrix, M’:

M' = | 2 1 |
     | 1 2 |

The effect is:

It is not so clear how to simply describe the geometric effect of the transformation. However, let’s rotate our grid through a 45 degree angle and see what happens.

We see now that this new grid is transformed in the same way that the original grid was transformed by the diagonal matrix: the grid is stretched by a factor of 3 in one direction.

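We can check this numerically; a quick sketch with numpy (the singular values of the symmetric matrix M' are 3 and 1, matching the stretch factors):

```python
import numpy as np

M_prime = np.array([[2, 1],
                    [1, 2]])
U, s, Vt = np.linalg.svd(M_prime)
print(s)  # [3. 1.] -> stretched by 3 in one direction, unchanged in the other
```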

Now let’s use some definitions. M is a diagonal matrix (the entries outside the main diagonal are all zero) and both M and M’ are symmetric (if we take the columns and use them as new rows, we get the same matrix).

Multiplying by a diagonal matrix results in a scaling effect (a linear transformation that enlarges or shrinks objects by a scale factor).

The effect we saw (the same result for both M and M’) is a very special situation that results from the fact that the matrix M’ is symmetric. If we have a symmetric 2 x 2 matrix, it turns out that we may always rotate the grid in the domain so that the matrix acts by stretching and perhaps reflecting in the two directions. In other words, symmetric matrices behave like diagonal matrices. — David Austin

“This is the geometric essence of the singular value decomposition for 2 x 2 matrices: for any 2 x 2 matrix, we may find an orthogonal grid that is transformed into another orthogonal grid.” — David Austin

We will express this fact using vectors: with an appropriate choice of orthogonal unit vectors v1 and v2, the vectors Mv1 and Mv2 are orthogonal.

We will use n1 and n2 to denote unit vectors in the direction of Mv1 and Mv2. The lengths of Mv1 and Mv2 — denoted by σ1 and σ2 — describe the amount that the grid is stretched in those particular directions.

Now that we have a geometric essence, let’s go back to the formula:

M = UDV*

  • U is a matrix whose columns are the vectors n1 and n2 (unit vectors of the ‘new’ grid, in the direction of Mv1 and Mv2)

  • D is a diagonal matrix whose entries are σ1 and σ2 (the length of each vector)

  • V* is a matrix whose columns are v1 and v2 (vectors of the ‘old’ grid)

Now that we understand a little about how SVD works, let’s see how LSI makes use of the technique to group texts. As Ian Soboroff shows on his Information Retrieval course slides:

  • U is a matrix for transforming new documents

  • D is the diagonal matrix that gives relative importance of dimensions (we will talk more about these dimensions in a minute)

  • V* is a representation of M in k dimensions

To see how this works we will use document titles from two domains (Human Computer Interaction and Graph Theory). These examples are from the paper An Introduction to Latent Semantic Analysis.

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: System and human system engineering testing of EPS
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: A survey

The first step is to create a matrix with the number of times each term appears:

| term      | c1 | c2 | c3 | m1 | m2 | m3 |
|-----------|----|----|----|----|----|----|
| human     | 1  | 0  | 1  | 0  | 0  | 0  |
| interface | 1  | 0  | 0  | 0  | 0  | 0  |
| computer  | 1  | 1  | 0  | 0  | 0  | 0  |
| user      | 0  | 1  | 0  | 0  | 0  | 0  |
| system    | 0  | 1  | 2  | 0  | 0  | 0  |
| survey    | 0  | 1  | 0  | 0  | 0  | 1  |
| trees     | 0  | 0  | 0  | 1  | 1  | 0  |
| graph     | 0  | 0  | 0  | 0  | 1  | 1  |
| minors    | 0  | 0  | 0  | 0  | 0  | 1  |

Decomposing the matrix we have this (you can use this online tool to apply the SVD):

# U Matrix (to transform new documents)
-0.386  0.222 -0.096 -0.458  0.357 -0.105
-0.119  0.055 -0.434 -0.379  0.156 -0.040
-0.345 -0.062 -0.615 -0.089 -0.264  0.135
-0.226 -0.117 -0.181  0.290 -0.420  0.175
-0.760  0.218  0.493  0.133 -0.018  0.044
-0.284 -0.498 -0.176  0.374  0.033 -0.311
-0.013 -0.321  0.289 -0.571 -0.582 -0.386
-0.069 -0.621  0.185 -0.252  0.236  0.675
-0.057 -0.382  0.005  0.085  0.453 -0.485

Matrix that gives relative importance of dimensions:

# D Matrix (relative importance of dimensions)
2.672 0.000 0.000 0.000 0.000 0.000
0.000 1.983 0.000 0.000 0.000 0.000
0.000 0.000 1.625 0.000 0.000 0.000
0.000 0.000 0.000 1.563 0.000 0.000
0.000 0.000 0.000 0.000 1.263 0.000
0.000 0.000 0.000 0.000 0.000 0.499

Representation of M in k dimensions (in this case, we have k documents):

Mk个维度中的表示形式(在这种情况下,我们有k个文档):

# V* Matrix (representation of M in k dimensions)
-0.318 -0.604 -0.713 -0.005 -0.031 -0.153
 0.108 -0.231  0.332 -0.162 -0.475 -0.757
-0.705 -0.294  0.548  0.178  0.291  0.009
-0.593  0.453 -0.122 -0.365 -0.527  0.132
 0.197 -0.531  0.254 -0.461 -0.274  0.572
-0.020  0.087 -0.033 -0.772  0.580 -0.242
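
These matrices can also be reproduced with numpy; a minimal sketch (the signs of individual columns may flip relative to the output above, which is a normal ambiguity of the SVD):

```python
import numpy as np

# term-document matrix from the table above (rows: terms, columns: c1..c3, m1..m3)
M = np.array([
    [1, 0, 1, 0, 0, 0],  # human
    [1, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0],  # computer
    [0, 1, 0, 0, 0, 0],  # user
    [0, 1, 2, 0, 0, 0],  # system
    [0, 1, 0, 0, 0, 1],  # survey
    [0, 0, 0, 1, 1, 0],  # trees
    [0, 0, 0, 0, 1, 1],  # graph
    [0, 0, 0, 0, 0, 1],  # minors
])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.round(s, 3))  # singular values -> the diagonal of the D matrix above
```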

Okay, we have the matrices. But now the matrix is not 2 x 2. Do we really need the amount of dimensions that this term-document matrix has? Are all dimensions important features for each term and each document?

Let’s go back to the example of David Austin. Let’s say now we have M’’:

M'' = | 1 1 |
      | 2 2 |

Now M’’ is no longer a symmetric matrix. For this matrix, the value of σ2 is zero. On the grid, the result of the multiplication is:

This means that if a value on the main diagonal of D is zero, that term does not appear in the decomposition of M.

In this way, we see that the rank of M, which is the dimension of the image of the linear transformation, is equal to the number of non-zero values. — David Austin

What LSI does is to change the dimensionality of the terms.

In the original matrix terms are k-dimensional (k is the number of documents). The new space has lower dimensionality, so the dimensions are now groups of terms that tend to co-occur in the same documents. — Ian Soboroff

Now we can go back to the example. Let’s create a space with two dimensions. For this we will use only two values of the diagonal matrix D:

# D2 Matrix
2.672 0.000 0.000 0.000 0.000 0.000
0.000 1.983 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000

As Alex Thomo explains in this tutorial, terms are represented by the row vectors of U2 x D2 (U2 is U with only 2 dimensions) and documents are represented by the column vectors of D2 x V2* (V2* is V* with only 2 dimensions). We multiply by D2 because D is the diagonal matrix that gives relative importance of dimensions, remember?

Then we calculate the coordinates of each term and each document through these multiplications. The result is:

human     = (-1.031,  0.440)
interface = (-0.318,  0.109)
computer  = (-0.922, -0.123)
user      = (-0.604, -0.232)
system    = (-2.031, -0.232)
survey    = (-0.759, -0.988)
trees     = (-0.035, -0.637)
graph     = (-0.184, -1.231)
minors    = (-0.152, -0.758)

c1 = (-0.850,  0.214)
c2 = (-1.614, -0.458)
c3 = (-1.905,  0.658)
m1 = (-0.013, -0.321)
m2 = (-0.083, -0.942)
m3 = (-0.409, -1.501)

Using matplotlib to visualize this, we have:

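A sketch of how those coordinates and the scatter plot could be produced, continuing from the `U, s, Vt` decomposition sketched above (labels and colors are illustrative; axis orientation may differ because of SVD sign flips):

```python
import matplotlib.pyplot as plt
import numpy as np

k = 2
term_coords = U[:, :k] * s[:k]               # rows of U2 x D2: one 2-D point per term
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T  # columns of D2 x V2*: one 2-D point per document

terms = ["human", "interface", "computer", "user", "system",
         "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "m1", "m2", "m3"]

fig, ax = plt.subplots()
ax.scatter(term_coords[:, 0], term_coords[:, 1], color="gray")
ax.scatter(doc_coords[:, 0], doc_coords[:, 1], color="red")
for name, (x, y) in zip(terms, term_coords):
    ax.annotate(name, (x, y))
for name, (x, y) in zip(docs, doc_coords):
    ax.annotate(name, (x, y))
plt.show()
```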

Cool, right? The vectors in red are Human Computer Interaction documents and the blue ones are of Graph Theory documents.

What about the choice of the number of dimensions?

The number of dimensions retained in LSI is an empirical issue. Because the underlying principle is that the original data should not be perfectly regenerated but, rather, an optimal dimensionality should be found that will cause correct induction of underlying relations, the customary factor-analytic approach of choosing a dimensionality that most parsimoniously represent the true variance of the original data is not appropriate. — Source

The measure of similarity computed in the reduced dimensional space is usually, but not always, the cosine between vectors.

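For reference, a tiny sketch of that cosine measure applied to the 2-D document vectors listed above:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

c1 = np.array([-0.850,  0.214])
c3 = np.array([-1.905,  0.658])
m3 = np.array([-0.409, -1.501])

print(round(cosine(c1, c3), 2))  # ~1.0: both are Human Computer Interaction titles
print(round(cosine(c1, m3), 2))  # ~0.02: documents from different topics
```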

And now we can go back to the dataset with theses from the Universities. I used the lsi model from gensim. I did not find many differences between the works of the Universities (all seemed to belong to the same cluster). The topic that most differentiated the works of the Universities was this one:

y topic:[('object', 0.29383227033104375), ('software', -0.22197520420133632), ('algorithm', 0.20537550622495102), ('robot', 0.18498675015157251), ('model', -0.17565360130127983), ('project', -0.164945961528315), ('busines', -0.15603883815175643), ('management', -0.15160458583774569), ('process', -0.13630070297362168), ('visual', 0.12762128292042879)]

Visually we have:

In the image the y topic is on the y-axis. We can see that Carnegie Mellon theses are more associated with ‘object’, ‘robot’, and ‘algorithm’ and the theses from UFPE are more associated with ‘software’, ‘project’, and ‘business’.

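A minimal sketch of how such a model can be built with gensim, assuming the abstracts are already tokenized (the tokens below are illustrative, not the real data):

```python
from gensim import corpora, models

texts = [
    ["object", "robot", "algorithm", "visual"],
    ["software", "project", "business", "management"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corpus)              # LSI is usually fed tf-idf weights
corpus_tfidf = tfidf[corpus]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
print(lsi.print_topics(num_topics=2, num_words=5))

# coordinates of the first document in the LSI topic space
print(lsi[corpus_tfidf[0]])
```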

You can see the Jupyter notebook with the code here.

Step 2 — Investigating the nature of the works

I always had the impression that in Brazil students produce many theses that are literature reviews, while at other universities such theses are rare. To check, I analyzed the titles of the theses.

Usually when a thesis is a literature review the word ‘study’ appears in the title. I then took all the titles of all the theses and checked the words that appear the most, for each University. The results were:

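A small sketch of that count, assuming the titles are loaded per university (the titles and the stop word list here are illustrative):

```python
from collections import Counter

titles_ufpe = ["Um estudo sobre métricas de software", "Estudo de caso de gestão de projetos"]
titles_cmu = ["Learning robot manipulation from demonstrations", "Approximation algorithms for graph problems"]

stopwords = {"a", "an", "and", "the", "of", "for", "from", "de", "um", "uma", "sobre"}

def top_words(titles, n=10):
    words = [w for title in titles for w in title.lower().split() if w not in stopwords]
    return Counter(words).most_common(n)

print(top_words(titles_ufpe))
print(top_words(titles_cmu))
```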

You can see the Jupyter notebook with the code here.

Findings

The sense I got from this simple analysis was that the themes of the works did not differ much. But it was possible to visualize what seems to be the specialties of each institution. The Federal University of Pernambuco produces more work related to projects and business and Carnegie Mellon produces more work related to robots and algorithms. In my view, this difference of specialties is not something bad, it simply shows that each university is specialized in certain areas.

A takeaway was that in Brazil we need to produce different works instead of just doing literature review.

Something important that I realized while doing the analysis (and that did not come from the findings of the analysis itself) was that only having the best thesis is not enough. I started the analysis trying to identify why they produce better works than us and what we can do to get there and become more well known. But I felt that maybe one way to get there is simply to show more of our work and to exchange more knowledge with them. The reason is that this can force us to produce more relevant articles and improve with feedback.

I also think that this is for everyone, both for university students and for us professionals alike. This quote sums it up well:

“It’s not enough to be good. In order to be found, you have to be findable.” — Austin Kleon

And that’s it, thank you for reading!

If you found this article helpful, it would mean a lot if you share it with friends. Follow me for more articles about Data Science and Machine Learning.

Originally published at https://www.freecodecamp.org/news/comparing-brazilian-and-us-university-theses-using-natural-language-processing-47196a2f9d64/
