Contact Tracing Using Less Than 30 Lines of Python Code


weixin_26755331


Original article: https://towardsdatascience.com/contact-tracing-using-less-than-30-lines-of-python-code-6c5175f5385f


Contact tracing is the process of identifying people who have come into contact with someone who has tested positive for a contagious disease, such as measles, HIV, or COVID-19. During a pandemic, performing contact tracing correctly can help reduce the number of people who get infected and speed up treatment for those who already are. Doing so can help save many lives.

Technology can help automate contact tracing, producing more efficient and accurate results than if the procedure were performed manually. One technology that can help is Machine Learning, and more precisely, clustering. Clustering is a subclass of Machine Learning algorithms that groups data points sharing some characteristics into clusters based on those characteristics.

There are various clustering algorithms, such as K-means, Mean-Shift, Spectral Clustering, BIRCH, DBSCAN, and many more. Many of these algorithms fall into one of three categories:

Density-based Clustering: Clusters are formed based on the density of the region. Examples of this type: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).
Hierarchical-based Clustering: Clusters are formed using a tree-type structure, where existing clusters are used to create new ones. Examples of this type: CURE (Clustering Using Representatives) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies).
Partitioning-based Clustering: Clusters are formed by partitioning the input data into k clusters. Examples of this type: K-means and CLARANS (Clustering Large Applications based upon Randomized Search).

For contact tracing, we need a density-based clustering algorithm. The reason is that diseases are transmitted when an infected person comes into contact with others, so more crowded (denser) areas will have more cases than less crowded ones.
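To make the idea of density concrete, here is a minimal sketch on made-up 2-D points (the coordinates, `eps`, and `min_samples` values are illustrative assumptions, not taken from the article's dataset). Points that sit close together end up in the same cluster, while an isolated point is labeled as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups of points plus one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense group B
    [10.0, 0.0],                          # isolated point
])

# Plain Euclidean distance is enough for this toy example.
toy_labels = DBSCAN(eps=0.5, min_samples=2).fit(points).labels_
print(toy_labels)  # [ 0  0  0  1  1  1 -1]
```

The label -1 marks the isolated point as noise; no number of clusters had to be chosen in advance.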

To trace the movement of infected people, scientists often use GPS datasets that contain information about the time and location of a person in any given timeframe. The location data is often represented as longitude and latitude coordinates.


Contact Tracing Algorithm

To build a contact tracing algorithm, we need three steps:

Get location data of different users within a specific time and place.
Apply a density-based clustering algorithm on the data.
Use the clusters to predict infected people.

So, let’s get started…


We will write Python code that uses the DBSCAN clustering algorithm to predict who might get infected because they came into contact with an infected person.

Step №1: Obtain data.

Unfortunately, we can’t obtain real-life GPS location data. So, we will build a mock dataset to apply our algorithm to. For this article, I used a Mock Data Generator to generate a JSON dataset containing 100 entries of the locations of 10 users. If you want to try another dataset, make sure the following conditions apply:

There is more than one entry for each user.
The users are close to each other in distance and within the same timeframe (for example, a day or a specific number of hours).
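As a rough illustration of these two conditions, the sketch below builds a tiny hand-made dataset and checks it. Only the column names (`User`, `Latitude`, `Longitude`) come from the code in this article; the records themselves and the helper variable names are hypothetical:

```python
import pandas as pd

# Hypothetical mock records; only the column names come from the article's code.
records = [
    {"User": "William", "Latitude": 13.1501, "Longitude": 77.6002},
    {"User": "William", "Latitude": 13.1503, "Longitude": 77.6005},
    {"User": "John", "Latitude": 13.1502, "Longitude": 77.6003},
    {"User": "John", "Latitude": 13.1510, "Longitude": 77.6011},
    {"User": "Doreen", "Latitude": 13.1504, "Longitude": 77.6004},
    {"User": "Doreen", "Latitude": 13.1520, "Longitude": 77.6020},
]
mock_df = pd.DataFrame(records)

# Condition 1: every user should have more than one entry.
entries_per_user = mock_df.groupby("User").size()
users_with_single_entry = entries_per_user[entries_per_user < 2].index.tolist()

# Condition 2 (rough check): all points should lie within a small area.
lat_span = mock_df["Latitude"].max() - mock_df["Latitude"].min()
lon_span = mock_df["Longitude"].max() - mock_df["Longitude"].min()

print(users_with_single_entry, lat_span, lon_span)
```

An empty list for `users_with_single_entry` and small latitude and longitude spans suggest the dataset is suitable for the steps that follow.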

First, let’s import all the libraries that we will use. We will need Pandas and Sklearn to process the data and Pygal to display it.


import pandas as pd
import pygal
from sklearn.cluster import DBSCAN

Note: in case you don’t have one of these libraries, you can use pip to install each of them from the command line. Moreover, if you're using Jupyter Notebook, you need to add this cell to display Pygal plots:


from IPython.display import display, HTML
base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

Now, we can load our dataset and show the first 5 rows to understand how it is built.


dataFrame = pd.read_json(r"Location_Of_Your_Dataset\MOCK_DATA.json")
dataFrame.head()

To understand the data more, we will plot it using the Pygal scatter plot. We can extract the different locations of each user and store that in a dictionary and then use this dictionary to plot the data.


# Group every (latitude, longitude) pair by user so each user becomes one plot series
disp_dict = {}
for index, row in dataFrame.iterrows():
    disp_dict.setdefault(row['User'], []).append((row['Latitude'], row['Longitude']))

# Scatter plot (no connecting lines), one series per user
xy_chart = pygal.XY(stroke=False)
for user, locations in sorted(disp_dict.items()):
    xy_chart.add(user, locations)
display(HTML(base_html.format(rendered_chart=xy_chart.render(is_unicode=True))))

Running this code, we get…


Step №2: Apply DBSCAN algorithm.

Awesome. Now that we have our dataset, we can apply the clustering algorithm to it and then use the result to predict potential infections. To accomplish that, we will use the DBSCAN algorithm.

The DBSCAN algorithm views clusters as areas of high density separated by regions of low density. Because of this, clusters found by DBSCAN can be of any shape, as opposed to k-means, which assumes that all clusters are convex shaped.


Sklearn has a predefined DBSCAN algorithm; all you need to know to use it is three parameters:

eps: The maximum distance between two points for them to be considered neighbors in the same cluster. In our case, we will use the distance recommended by the CDC, which is 6 feet (or 0.0018288 kilometers).
min_samples: The minimum number of samples in a neighborhood (including the point itself) for a point to count as a core point of a cluster. For large, noisy datasets, increase this number.
metric: The distance metric used between the data points. Sklearn has many distance metrics, such as Euclidean, Manhattan, and Minkowski. For our case, however, we need a distance measure that describes distance on a sphere (the Earth). The metric for that is called haversine.
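One detail worth checking when using the haversine metric: scikit-learn expects each point as [latitude, longitude] in radians and measures distance in radians on a unit sphere. The following is a minimal sketch, not the article's own code, of how a distance in kilometers could be converted accordingly. The coordinates, the `EARTH_RADIUS_KM` constant, and the variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius; an assumption of this sketch
safe_distance_km = 0.0018288  # 6 feet expressed in kilometers

# Hypothetical coordinates in degrees: the first two points are about 1 m apart,
# the third is roughly 1 km away from both.
coords_deg = np.array([
    [13.000000, 77.000000],
    [13.000009, 77.000000],
    [13.010000, 77.000000],
])

radian_model = DBSCAN(
    eps=safe_distance_km / EARTH_RADIUS_KM,  # kilometers -> radians on the unit sphere
    min_samples=2,
    metric="haversine",
    algorithm="ball_tree",
).fit(np.radians(coords_deg))                # degrees -> radians

print(radian_model.labels_)  # [ 0  0 -1]
```

With this conversion, the two points that are about one meter apart share a cluster, and the distant point is labeled as noise.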

We can now apply our model to the dataset.


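import numpy as np

safe_distance = 0.0018288  # a radial distance of 6 feet in kilometers
# Note: scikit-learn's haversine metric interprets coordinates and eps in radians
model = DBSCAN(eps=safe_distance, min_samples=2, metric='haversine').fit(dataFrame[['Latitude', 'Longitude']])
# Boolean mask marking which rows are core samples
core_samples_mask = np.zeros_like(model.labels_, dtype=bool)
core_samples_mask[model.core_sample_indices_] = True
labels = model.labels_
# Store each row's cluster label (-1 means noise) next to its coordinates
dataFrame['Cluster'] = model.labels_.tolist()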

Applying the model with these parameters leads to 18 clusters. We can display these clusters using this piece of code…


from pygal.style import LightenStyle

# Group every (latitude, longitude) pair by cluster label
disp_dict_clust = {}
for index, row in dataFrame.iterrows():
    disp_dict_clust.setdefault(row['Cluster'], []).append((row['Latitude'], row['Longitude']))
print(len(disp_dict_clust))  # number of distinct labels, including -1 (noise) if present

# One series per cluster, drawn in shades of a single base color
dark_lighten_style = LightenStyle('#F35548')
xy_chart = pygal.XY(stroke=False, style=dark_lighten_style)
for cluster, locations in disp_dict_clust.items():
    xy_chart.add(str(cluster), locations)
display(HTML(base_html.format(rendered_chart=xy_chart.render(is_unicode=True))))

Once the algorithm finishes, any data points that do not belong to a cluster are labeled as noise, that is, cluster -1. In this dataset, you will often find that every user has points in the -1 cluster in addition to other clusters.
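Because -1 is a label rather than a real cluster, it can be useful to count it separately. A small sketch, using a hypothetical label array of the kind DBSCAN returns (the values are made up for illustration):

```python
import numpy as np

# Hypothetical label array of the kind DBSCAN returns; -1 marks noise.
example_labels = np.array([0, 0, 1, -1, 1, 2, -1, 2])

unique_labels = set(example_labels.tolist())
n_clusters = len(unique_labels - {-1})        # real clusters only
n_noise = int((example_labels == -1).sum())   # points labeled as noise

print(n_clusters, n_noise)  # 3 2
```

Counting the distinct labels without removing -1 would report one cluster more than actually exists whenever noise is present.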

Step №3: Predict infected people.

If we have the name of an infected person, we can use it to get all the clusters this person is a part of. From there, we can see the other people in these clusters. These people will have a higher probability of getting infected than people outside these clusters.

Obtain all clusters a specific person belongs to

Given a name inputName, for example William, we want all the clusters that William is a part of.

inputName = "William"
inputNameClusters = set()
# Collect the cluster label of every row that belongs to inputName
for i in range(len(dataFrame)):
    if dataFrame['User'][i] == inputName:
        inputNameClusters.add(dataFrame['Cluster'][i])

After executing this code, inputNameClusters will be {2, 4, 5, -1}.

Get people within a specific cluster.

Now, we want the other people who belong to this specific set of clusters.

infected = set()
for cluster in inputNameClusters:
    if cluster != -1:  # skip noise, which is not a real cluster
        namesInCluster = dataFrame.loc[dataFrame['Cluster'] == cluster, 'User']
        for i in range(len(namesInCluster)):
            name = namesInCluster.iloc[i]
            if name != inputName:
                infected.add(name)

In both of these snippets, I am using sets so that repeated clusters and names are stored only once, which avoids extra if-else statements to check whether a value was already added.

Voilà, the code will return {'Doreen', 'James', 'John'}, which means those three people are potentially infected because they came into contact with William at some time and place.

I have put the core code into a function that takes a dataframe and a user’s name, performs contact tracing for this user, and finally prints the potentially infected people. The function first checks whether inputName is valid; if not, it raises an assertion error. On top of that, it's fewer than 30 lines of code!

The full code for the contact tracing function:

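The author's original function is not reproduced here. As a stand-in, the following is a sketch of what such a function might look like, assembled from the steps described in this article. The function name `get_potential_contacts`, the `EARTH_RADIUS_KM` constant, and the toy dataframe are assumptions; unlike the snippets above, this sketch converts the coordinates and `eps` to radians, which is what scikit-learn's haversine metric expects, and it returns the names instead of printing them:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def get_potential_contacts(df, input_name, safe_distance_km=0.0018288):
    """Return the users who share a DBSCAN cluster with input_name."""
    # Validate the name first, as the article describes.
    assert input_name in df["User"].values, "User doesn't exist"

    # Cluster all locations; coordinates and eps are converted to radians.
    coords = np.radians(df[["Latitude", "Longitude"]].to_numpy())
    labels = DBSCAN(
        eps=safe_distance_km / EARTH_RADIUS_KM,
        min_samples=2,
        metric="haversine",
    ).fit(coords).labels_
    clustered = df.assign(Cluster=labels)

    # Clusters the given user belongs to, ignoring noise (-1).
    own = clustered.loc[clustered["User"] == input_name, "Cluster"]
    own_clusters = set(own.tolist()) - {-1}

    # Everyone else who appears in one of those clusters.
    in_same_cluster = clustered["Cluster"].isin(list(own_clusters))
    others = clustered.loc[in_same_cluster & (clustered["User"] != input_name), "User"]
    return set(others.tolist())


# Hypothetical data: William and John are about 1 m apart; the other points are far away.
toy_df = pd.DataFrame({
    "User": ["William", "John", "Doreen", "William", "Doreen"],
    "Latitude": [13.000000, 13.000009, 13.010000, 13.020000, 13.030000],
    "Longitude": [77.0] * 5,
})
print(get_potential_contacts(toy_df, "William"))  # {'John'}
```

Returning a set rather than printing keeps the function easy to reuse; a caller can still print the result, as the last line does.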

Conclusion

Contact tracing is one of the ways we can use technology to save people’s lives and provide them with treatment as soon as possible. Government and medical personnel often have access to the GPS locations of some patients. The process we walked through in this article is fundamentally the same as the one they follow to identify potential infections. Luckily, thanks to libraries like Sklearn, we can use predefined models on our datasets and obtain results with a few lines of code.