Static Charts Are So 2019: Data Science

Original article: https://medium.com/analytics-vidhya/static-charts-are-so-2019-data-science-48149551fab5

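Summary: This article shows how to create animated charts with data science tools, including bar chart races, line chart races, and Hans Rosling charts. Using Flourish, the author walks through how each chart is built and what it is good for, stressing how well animated charts convey time, scale, and story. The author also cautions that although these charts are relatively easy to make, choosing the right data and understanding where each chart fits is essential, and that overusing them leads to information overload.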

Using Data Science to: Create animated charts

Introduction

When I started my Data Science journey, someone told me that the key to success would be learning to learn. Now that I’m further along, I realize that they hit the nail on the head.

The truth is that there are a ton of complex tools, concepts, and technologies to learn and understand. But by the time you’ve learned them, there are already new tools coming out that are faster, better, and easier.

Now that I feel good about my foundation, anytime I come across an exciting new tool or application, I try to learn it as quickly as possible. So when I started to notice animated charts popping up everywhere, I knew it was time to jump in and figure out how they are made.

I knew from past experience that some of this could be done manually with a JavaScript library called D3.js, but that this approach would be pretty time-consuming. I did a little more research and found out that many people were using a tool called Flourish.

After checking it out, I have to say, Flourish blew me away. It’s fast, flexible, super easy to use, and free (as long as you don’t mind making your data public). I decided to spend this week exploring its animated options. Below is a breakdown of some of my favorite charts they had to offer.

Bar Chart Race

Military Spending by Nation

Overview

First and foremost I wanted to build an animated bar chart end to end, meaning that I wanted to use novel data that I would have to extract, clean, and upload. I thought that military spending could be an interesting topic, and luckily I was able to find some data on just that. I’ve made the cleaned dataset available here.

What I learned in this process is that once the data is properly formatted, building the animation is extremely easy. Of course, data cleaning is a huge task on its own, but with Flourish, that’s basically 95% of the work.

Overall, I think that this is a super powerful visualization. It captures time and scale, and really helps tell a story.

What it’s good for

This is a very flexible chart type and is great for any examples with 1 dynamic value measured at regular time intervals. Here are some potential examples:

- Average MPG of cars by manufacturer, per year.
- Greenhouse gas emissions by country, per month.
- Market cap of companies, per day.

Data Structure

The data here is pretty straightforward, and this chart works great for two-dimensional data where one of the fields is time.

One thing to call out here is that each country is a row and each year is a column. Most of the data I found that worked well for this type of chart was laid out the other way around, so prepping it required transposing and consolidating by country.
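
To make the reshaping step concrete, here is a minimal sketch in plain Python. It assumes the raw data arrives as (country, year, value) records, which is only a guess at what the "reversed" layout looks like, and the function name `long_to_wide`, the country names, and the numbers are all made up for illustration. It consolidates duplicate entries per country and produces one row per country with one column per year.

```python
from collections import defaultdict


def long_to_wide(records):
    """Pivot (country, year, value) records into one row per country
    and one column per year."""
    totals = defaultdict(lambda: defaultdict(float))
    for country, year, value in records:
        # Consolidate: several records for the same country/year are summed.
        totals[country][year] += value

    years = sorted({year for _, year, _ in records})
    header = ["Country"] + [str(y) for y in years]
    # Missing country/year combinations are left blank rather than guessed.
    rows = [
        [country] + [totals[country].get(y, "") for y in years]
        for country in sorted(totals)
    ]
    return header, rows


# Hypothetical sample data, not taken from the military spending dataset.
records = [
    ("Atlantis", 2018, 10.0),
    ("Atlantis", 2019, 12.5),
    ("Freedonia", 2018, 3.0),
    ("Freedonia", 2019, 4.0),
    ("Freedonia", 2019, 1.0),
    ("Sylvania", 2019, 2.0),
]
header, rows = long_to_wide(records)
```

The resulting `header` and `rows` could then be written out with the standard `csv` module and uploaded. Whether blanks or zeros are the right fill for missing years depends on the dataset, so treat that choice as an assumption too.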

Overall

I really like this chart type and it was incredibly easy to build once the data was clean. It’s fast and responsive, and tells a great story without relying on a bunch of “chart junk” like busy legends or lengthy descriptions.

There are two major downsides. First, you can really only display 2D data, and one of those fields must be time. Second, the chart does not convey any sort of “history”, so changes over time are lost.

I would use this chart to tell a simple story of a rise to dominance in a specific category that can be represented by a single value.

Line Chart Race

Human Freedom Index: Top 20 GDP Countries, 2008–2017

Overview

I moved on to Line Chart Races next, and decided to do this with a novel dataset as well. I came across a Human Freedom Index broken down by year for all the countries in the world. I’ve cleaned up the data and have made it available here.

The main benefit of this chart format is that it shows a “history”, so we can look back and see change over time. However, the downside is that the scores have to be very tightly bundled; otherwise the scale is destroyed. (You could use logarithmic scaling on the axes, but that is not intuitive for most viewers.)

What it’s good for

This is a great tool if you are dealing with rankings or scores on a fixed scale that vary over time (ex: 1–10 or 1–100). Some potential examples could be:

- Character popularity in a show across each episode.
- World leader approval ratings by week during the pandemic.

Data Structure

Again, this is designed for 2D data where one of the dimensions is time. As mentioned above, the values here should be tightly clustered and measured at regular time intervals.

I originally built this with every country I could find data for, 170 in total. But that produced an extremely slow, choppy chart that was incomprehensible. This seems to work best with 10–25 entities being compared.
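
Trimming the dataset to a manageable number of entities can be sketched as below. This is an illustration rather than the method used for the chart above: it assumes the scores live in a dict keyed by country and that a second dict holds the value to rank by (for example GDP, as in the chart title). The names `keep_top_n`, `scores`, and `gdp` and all the numbers are hypothetical.

```python
def keep_top_n(scores_by_country, rank_by, n=20):
    """Keep only the n countries with the largest ranking value.

    Countries without any scores are ignored before ranking, so the
    result holds n entries whenever enough scored countries exist.
    """
    candidates = [c for c in rank_by if c in scores_by_country]
    keep = sorted(candidates, key=rank_by.get, reverse=True)[:n]
    return {country: scores_by_country[country] for country in keep}


# Made-up yearly scores and ranking values, purely for illustration.
scores = {
    "Atlantis": [7.1, 7.2],
    "Freedonia": [5.0, 5.5],
    "Sylvania": [8.0, 8.1],
    "Ruritania": [6.0, 6.2],
}
gdp = {"Atlantis": 300, "Freedonia": 100, "Sylvania": 200, "Elbonia": 500}

top_two = keep_top_n(scores, gdp, n=2)
```

Here "Elbonia" has the largest ranking value but no scores, so it is skipped and the two kept countries are "Atlantis" and "Sylvania".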

Overall

This format is great for the right kind of data. However, finding good, clean data for it was much harder than I had anticipated. This format still tells a story, and makes it really easy to formulate questions (What happened with Brazil?).

The biggest downside is that this still only conveys information on one dynamic datapoint, and that those values have to be tightly clustered.

I would use this to compare rankings or scores over time, especially if you want to tie movements in the metrics to key events that could explain them.

Hans Rosling Chart

GDP+Population+Life Expectancy by Nation.

Overview

First off, if this chart looks familiar, it’s because it was made famous by its creator, Hans Rosling, in this video. For this, I used a prebuilt dataset from Flourish comparing GDP, Life Expectancy, and Population.

This chart is very powerful in that it conveys many more dimensions of data at once than the previous two formats. Here we have 3 dynamic data points (GDP, Life Expectancy, Population) along with a static data point (Region), all animated against time. This allows us to visualize much more complex phenomena.

What it’s good for

This chart type is very powerful for data that has multiple dimensions that all change and presumably interact over time. Some potential examples are:

- Sales, Profit, and # Employees of different companies over time.
- Education, Social Services, and Crime Rate of different states over time.

Data Structure

The data here is a bit more complex making it harder to get data in the right format. The key here is that there does still have to be a time component which is captured by the animation, but that still leaves space for up to 3 more dynamic data points, and 2 fixed data points (Ex: Region).

The version above contains all countries and is a little too crowded and slow in my opinion, so I would probably try to limit the total number of unique entities to 25–50 if I were using this on a novel dataset.
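
Since the data prep is the hard part here, a rough sketch of combining several metrics may help. It assumes each metric is available as a dict keyed by (country, year) and that region is a per-country lookup. The exact column layout Flourish expects is not described in this post, so the output shape, the function name `build_bubble_rows`, and the sample values are all assumptions made for illustration.

```python
def build_bubble_rows(gdp, life_expectancy, population, region):
    """Combine three time-varying metrics and one static field into
    one record per (country, year).

    Only keys present in all three metrics are kept, so every record
    has a complete set of values.
    """
    shared_keys = gdp.keys() & life_expectancy.keys() & population.keys()
    rows = []
    for country, year in sorted(shared_keys):
        key = (country, year)
        rows.append({
            "country": country,
            "region": region.get(country, "Unknown"),  # static field
            "year": year,                               # animation axis
            "gdp": gdp[key],
            "life_expectancy": life_expectancy[key],
            "population": population[key],
        })
    return rows


# Made-up values for two fictional countries.
gdp = {("Atlantis", 2000): 1.0, ("Atlantis", 2001): 1.1, ("Freedonia", 2000): 0.5}
life_expectancy = {("Atlantis", 2000): 70.0, ("Atlantis", 2001): 70.5, ("Freedonia", 2000): 65.0}
population = {("Atlantis", 2000): 10, ("Freedonia", 2000): 4}
region = {"Atlantis": "Oceania"}

bubble_rows = build_bubble_rows(gdp, life_expectancy, population, region)
```

Dropping incomplete records is one design choice among several. Interpolating the gaps is another, but it changes what the chart claims, so it should be a deliberate decision.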

Overall

This chart type has huge potential for the right dataset. If you want to visualize something with multiple influences, this is a great choice, but don’t expect to knock it out in a half hour. The data prep process for this is definitely more involved, and it takes more than just a couple of vlookups in Excel.

This chart type can also be overwhelming on its own, so I think it would be great in a live presentation, but would be confusing if it was just included in a deck.

Geographical Point Map

Worldwide Earthquake Data

Overview

This final animated chart type is totally different from the others, as it is designed to deal with geospatial data.

While this chart looks cool, I’m not personally the biggest fan as it is difficult to understand what is going on without significantly more context. This could be a great chart if you were dealing with very specific data and the audience was well versed on the subject matter, but for the casual observer, this chart on its own would need significantly more information.

You can also display different event types as different colors, so, done properly, this could show the interaction of multiple effects on a single chart.

What it’s good for

This chart type is good if you want to show global occurrences of some discrete event over time, particularly if there is a scale or severity to the event type. Some examples could be:

- Volcanic eruptions
- UFO sightings
- Fast food chain openings

Data Structure

The core of the data is still pretty straightforward, but the key difference here is that you need longitude and latitude information for every event. By contrast, there are many other geospatial chart types (Projection Maps) that only require data such as a state or country.

Beyond what is shown, there are many other fields that can be used if you want to display color-coded events by type.
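
Because every event needs usable coordinates, a quick validation pass before uploading can save some frustration. The sketch below assumes each event is a dict with "latitude" and "longitude" keys. Those field names, the function name `valid_events`, and the sample events are hypothetical and not taken from the earthquake dataset.

```python
def valid_events(events):
    """Keep events whose coordinates are present, numeric, and in range."""
    kept = []
    for event in events:
        try:
            lat = float(event["latitude"])
            lon = float(event["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric coordinates
        # NaN fails both comparisons, so it is dropped here as well.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            kept.append({**event, "latitude": lat, "longitude": lon})
    return kept


# Made-up events for illustration.
events = [
    {"type": "quake", "latitude": "35.7", "longitude": "139.7", "magnitude": 5.1},
    {"type": "quake", "latitude": "", "longitude": "10.0", "magnitude": 4.0},
    {"type": "quake", "latitude": 95.0, "longitude": 20.0, "magnitude": 6.2},
    {"type": "quake", "longitude": 20.0, "magnitude": 3.3},
]
clean = valid_events(events)
```

Extra fields such as the event type or magnitude pass through untouched, so they remain available for color coding or sizing.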

Overall

This could be a great format for a very specific use case. However, I personally think it would be the wrong choice for 99.999% of data sets. When building data visualizations, some people think that a complex chart is a better chart, but I disagree.

This is a complex chart, but in most cases I think the complexity here would actually be counterproductive to conveying information.

Feel free to add it to your arsenal, but if you find yourself forcing a dataset to fit into the right format to make this work, you may want to consider if there is a better chart type available for what you’re trying to convey.

Conclusion

In my opinion, the goal of a visualization should be to convey a large amount of information as clearly and concisely as possible. Back in the days of print, Edward Tufte, who literally wrote the book on visualization, famously developed the “Data-Ink Ratio” for evaluating the quality of a chart.

Today, we’re no longer concerned about how much ink we use, but we are very concerned with how much time and attention any activity consumes. I would argue that updating Tufte’s formula for today’s world would probably result in something like a “Data-Attention Ratio”, with the quality of the chart measured by the information conveyed divided by the amount of attention it consumes.
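
Written out side by side, the two ratios look like this. The first is Tufte's definition. The second is only the informal proposal from the paragraph above put into symbols, not an established metric, and neither of its terms has a defined unit.

```latex
\[
\text{Data-Ink Ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
\qquad
\text{Data-Attention Ratio} = \frac{\text{information conveyed}}{\text{attention consumed}}
\]
```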

That being said, I think the charts displayed above could be extremely valuable or extremely counterproductive. In most cases, you are asking for undivided attention for 15–20 seconds, which is a lot for a single visualization.

If you have data that makes sense for these formats, then by all means, go for it. But be very aware that the improper use of these charts could just end up wasting your audience’s time.

To wrap this up, the question I set out to answer this week was if I could use Data Science to create animated charts. The answer here is definitely yes, and it was much easier than I had anticipated. However, because it is so easy, I can foresee a day when someone forwards out a deck with 40 slides of animated charts and expects you to watch and understand them all.

Please. Don’t be that someone.

Originally published at https://thaddeus-segura.com on September 7, 2020.
