I have worked with people from comparative literature on the development of social networks in novels based on who talks to whom, and we’ve used that for testing theories in comparative literature.
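As a rough illustration of the "who talks to whom" idea above, here is a minimal sketch in Python. It assumes the dialogue has already been reduced to (speaker, listener) pairs; the function names and the toy character data are hypothetical and are not taken from the lecture.

```python
from collections import Counter

def build_talk_network(exchanges):
    """Count how often each unordered pair of characters talks to each other."""
    edges = Counter()
    for speaker, listener in exchanges:
        if speaker != listener:  # ignore a character talking to themselves
            edges[frozenset((speaker, listener))] += 1
    return edges

def degree(edges):
    """Number of distinct conversation partners for each character."""
    partners = {}
    for pair in edges:
        a, b = tuple(pair)
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {name: len(others) for name, others in partners.items()}

# Hypothetical toy data: each tuple is (speaker, listener) for one line of dialogue.
exchanges = [
    ("Elizabeth", "Darcy"),
    ("Darcy", "Elizabeth"),
    ("Elizabeth", "Jane"),
    ("Jane", "Bingley"),
]
network = build_talk_network(exchanges)
degrees = degree(network)
```

Building one such network per chapter and comparing them would be one way to watch the network develop over the course of a novel, though that is only one possible design.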

I’m now working with some social scientists who are interested in how we can better prepare people for disaster and how we know what works, and they typically use surveys. Instead, we’re looking at whether we can analyze Twitter as a disaster approaches, automatically determine the emotions of people ahead of time, and then correlate that with news about the disaster to determine what is said in the news that most helps people react in the way that we want them to.

In my group, we like to collaborate with scientists and other types of scholars to help them solve their problems with data. We look at things like recommendation systems where the data are people consuming items, either buying items, or clicking on items, or reading items. And what we want to do is find patterns of consumption so that we can form predictions about what somebody might like that they didn’t otherwise see.
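One very simple way to turn patterns of consumption into predictions is to count which items are consumed together. The sketch below is an assumption-laden toy, not the speaker's method: the function names, the scoring rule, and the shopping histories are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence(histories):
    """For every item, count how many users consumed it together with each other item."""
    counts = defaultdict(Counter)
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(user, histories, counts, k=3):
    """Suggest up to k unseen items that co-occur most with what the user already consumed."""
    seen = set(histories[user])
    scores = Counter()
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Sort by descending score, breaking ties alphabetically so results are stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

# Hypothetical consumption data: user -> items bought, clicked, or read.
histories = {
    "ann": ["bread", "butter", "jam"],
    "bob": ["bread", "butter"],
    "cat": ["bread", "jam", "tea"],
    "dan": ["bread"],
}
counts = cooccurrence(histories)
suggestions_for_bob = recommend("bob", histories, counts, k=2)
```

The same co-occurrence table can also be read on its own as a crude summary of purchasing patterns across the whole population.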

As a side effect, we also want to understand patterns of purchasing behavior across a population, which we can do by exploring our fitted data science models.

For example, analyzing consumption data, in addition to being interesting to recommendation system enthusiasts, is interesting to economists who want to understand consumption at a higher level. My work now still involves text and finding hidden patterns in text, but also has branched out to uncovering useful predictive and exploratory patterns in large-scale, high-dimensional data, including things like text, neuroscience data, genetics data, and social network data, and helping to define and build the tools for modern data science.

Much of my work is in the health space, and we’re interested in finding out what makes us ill at a population level and what keeps us healthy at a population level.

And those can be very broad questions and target systems.

They can also be individual-level questions in terms of health behaviors that improve an individual’s health, whether it’s diet, or physical activity, or other activities of those types. The data sets range from individual-level data sets from wearable devices, to survey data collected in the context of a study, to genomic information on individuals. It’s also important to keep in mind that, as individuals, there are other things that affect our health: environmental factors such as exposure levels to toxins, the built environment, and access to parks. Those are also very important sources of data that are used often in public health.

So can you look at social media (we have a social worker working on campus today who is looking at social media) and can you predict gang violence?

And it turns out, actually, and it’s very interesting, that there are even products already out in the world that the police are using to look at social media feeds, as well as long profiles of criminals, and to understand block by block what the threat situation is as police engage in a specific situation.

My work in data science has been applications of machine learning to problems coming from a particular domain.

So for most of my career, that’s meant reframing questions from biology as machine learning tasks. More recently, I’ve been working with people at the New York Times who have some very interesting problems and lots of interesting data.

And it’s been very educational seeing how problems from the newsroom, or from running a modern digital business where you gather abundant data, can be framed in terms of machine learning tasks, how you can communicate the results of that machine learning model to people, and how you can help people solve real-world problems with their own data.

In biomedical informatics, we’re interested in understanding the effect of interventions in health care on the outcomes that we observe in patients.

And a few examples include drug safety surveillance.

So once drugs go through regulatory approval and are being used in patients in hospitals, we’re interested in observing the effects of those drugs in the real world when they’re applied to these patients.

And so if there are safety signals that did not appear in clinical trials, for example, we analyze the data that come from spontaneous reports, from the electronic health record, and from other sources to identify safety signals that might have been missed in initial studies.

Recently, we performed a study as part of a collaborative called the Observational Health Data Science and Informatics Collaborative.

That is an international collaborative that involves individuals from academia, as well as from industry and government.

The goal of that collaborative is to have a federated network of observational health data sets that reach into the hundreds of millions to billions of patient records.

In that study, we looked at the heterogeneity in how patients with particular conditions are treated.

In particular, we looked at depression, diabetes, and hypertension.

And one of the things that was surprising is that there was a wide variety of ways in which patients with those particular conditions were treated.

In my own field, we’re using data science to understand the complexity of cities.

Cities are intersections of the built environment, intersections of people, and intersections of the natural environment. They’re very, very complicated. And it’s very difficult to understand them without the application of data science to the data that we gather about how cities function on a daily basis.

We not only have publishers, we have people, everyday people, who can post their own experience on social media, whether it’s an online discussion forum or Twitter.

And so we have streaming data that comes in as unedited posts made by people. This happens not only in text but also in video. So if we look at YouTube, it’s one of the largest volumes of data per day that gets added to the internet. And again, that’s everyday people who are adding their videos.

It’s not that there is more information. I guess there is more information. But there’s also more storage and more devices out there that are observing the information and collecting it.

One of the big challenges, I think, for data science is how to use that kind of observational data to answer questions about the world. This is a type of question that traditional statistics and traditional machine learning can’t answer.

I think what’s really the source of much more data is the stuff that’s invisible to us because it’s so pervasive in our lives that we just forget it’s there, which is the ubiquity of personal computing.

And by that I don’t mean PCs the way that we were using the word in the ’80s. I mean the fact that in my pocket is a supercomputer that’s keeping track of all my communications and, for many people, where they are, and beaming that out to Facebook or Google or other companies.

So it’s just the pervasiveness of really high-powered computation and the way that all of our lives are becoming streams of data.

So all our personal interactions, all our commerce, all our ingestion of news media, those are all becoming digital.

I think it’s the advance of technology. And that’s technology that allows us to collect data in a very efficient and cost-efficient way, and then computers being able to store the data, integrate the data, and retrieve it in a sensible way.

So the fact that our capacity has now increased opens the door to answering pressing questions by allowing us to collect more and more data.

So what’s happening is there are new sensors.

So every day we’re inventing new sensors, and those sensors are being deployed. They’re being deployed on our bodies. We have Fitbits, we have cell phones that can do all kinds of things. They have GPS, they have accelerometers, they can detect motion.

And then we have all kinds of sensors being deployed all across our cities: cameras, motion detectors, sensors picking up the signals of cell phones. So every day there are new sensors with new data being collected. In addition to that, we’re all on Facebook and Twitter, and there’s all of what we call digital exhaust being created. Every time you go online and you click, and you do something online, that is creating your own digital exhaust around you.

Increasingly, people are recognizing the value of having data. Things that are measured are able to be analyzed. Things that are analyzed are better able to be managed.

And in health care, in particular, we have an explosion in the use of electronic health records. A few years ago, we saw very little adoption of electronic health records. But thanks to changes in regulation and thanks to a younger, more computer-literate generation of physicians, we have an explosion in the use of electronic health records.

Those electronic health records are producing vast quantities of data that can be analyzed to improve the practice of health care, and also to observe exactly what is going on in health care today.

Certainly the increased computing power is having a huge role. Many data science algorithms do require an awful lot of computational power. Computers are fast enough now that we can do a lot of that stuff. For example, Bayesian methods in statistics are very popular now. They weren’t so popular in the past because they’re computer-intensive. Certain types of simulations with very large data sets would traditionally have taken a very long time to run. Now, with the computational power that we have, we can run those algorithms on these large data sets very efficiently.
What else is driving it? Well, of course, the web. The web is creating huge amounts of data every minute, every second. There are huge amounts of data being processed and created that need to be analyzed. In other areas like the Human Genome Project, astrophysics, and economics, there are massive amounts of data becoming available for answering very interesting questions in those fields. So all of these things are coming together at the same time, and now is the time of data science.

Well, I think we’ve been collecting data for centuries. And the amount of data that we’ve been collecting has only been growing year by year. In the 1980s, people were already collecting data about the type of products that we were buying in supermarkets and stores, for example. But what’s happened over the last few decades is that our ability to store those data has grown. And with that, we have also improved our ability to analyze the data. So the data are becoming more and more useful to us. And so the storage and analysis of data is one thing that’s driving the exponential growth of data. Another thing is that we’ve actually been able to develop cheaper, low-powered sensors, and so what we can measure is also increasing. So it’s the intersection of multiple things that is making our society a really, really data-rich world.

1.4 Lecture: What role does data visualization play in Data Science?

Data visualization is important for people who want to explore their data, to get some idea of what it contains, and therefore, perhaps, to develop some intuitions about how they would go about solving a problem from that data.

What features are important? What kinds of data do they have? Visualization is also really important when we’re looking at the output of data science systems, where, for example, one output might be clusters of events that happened in the news.

And we might want to visualize: what were those events? What were they about? And if we have some way to do that, we can get a much better idea of what the system has been able to produce.

Data visualization and data summarization are, in general, about creating useful exploratory statistics. A statistic is a function of data, and a visualization, then, is a kind of statistic. In comes data, out comes a thing as a function of that data. These are essential pieces of the data science picture. They’re essential to understanding these data that we are now collecting and observing. And it’s a wide open area.
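To make the "in comes data, out comes a thing" idea concrete, here is a small sketch in Python. Both functions take the same data and return something computed from it: one returns numbers, the other returns a crude text picture. The function names and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

def summarize(data):
    """A numeric summary: a statistic in the usual sense."""
    return {"n": len(data), "mean": mean(data), "median": median(data)}

def text_histogram(data, bin_width):
    """A visualization as a function of data: one line of '#' marks per non-empty bin."""
    counts = {}
    for x in data:
        start = (x // bin_width) * bin_width  # left edge of the bin containing x
        counts[start] = counts.get(start, 0) + 1
    return [
        f"{start}-{start + bin_width - 1}: {'#' * counts[start]}"
        for start in sorted(counts)
    ]

# Hypothetical integer measurements.
data = [1, 2, 2, 3, 7, 8, 12]
summary = summarize(data)
picture = text_histogram(data, bin_width=5)
for line in picture:
    print(line)
```

The histogram shows at a glance that most values sit in the lowest bin, which the mean alone would not reveal.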
My work involves brain imaging, where we actually look for brain characteristics that are associated with diseases such as schizophrenia, major depressive disorder, and Parkinson’s disease. And data visualization means so much for interpretation because, often, we’re targeting or trying to extract the proverbial needle in the haystack. And exactly what that means from a clinical perspective relies heavily on data visualization for interpretation.

So to be able to take things from a mathematical space, which is fairly abstract, convert them, speak to a clinician, and map the areas of the brain that are affected is absolutely critical.
I think data visualization’s always, in a way, the first step. Before you start really tackling models and algorithms, you really need to try to look at your data and see if there’s something that you missed. And so data visualization has come a long way. It’s not just about making pretty pictures; it’s also about inspiring the scientist and telling them what some valuable models are that they could explore next, once they’ve visualized the data.

So much of what’s done in industry is providing dashboards, for example. So there are dashboards running in hospitals, figuring out the data analytics, how they’re measuring patients coming in, coming out. At the end of the day, we are visual, and you need to be able to take what you do and visualize it. So especially in industry and in applications, people won’t understand the benefit that you’re giving them if you don’t have visualization tools. We depend on being able to see patterns in data, to understand the rhythms in data, to see what’s normal and what’s not.

I think that there are two domains in which data visualization is particularly important. Data visualization is important to the data scientists themselves. In order to truly understand the data that they’re analyzing, data scientists should visualize their data in various ways in order to expose their minds to the data. The human brain is a fantastic data science tool in and of itself, and for data scientists to expose their data to their own minds in various ways provides them with insights into what kinds of hypotheses they might want to test against their data.
Secondly, data visualization is important for the communication of the results of data science to a general audience. The visualization of results is important for translating what might be interpretable only to a data scientist into what is palatable to a general audience.

Data visualization is an interesting topic. When you see a great visualization, it does several things. Number one is, it makes it very easy to understand what’s going on. So a great visualization, for me, can explain an awful lot and answer many questions in a very short space of time. Another thing that data visualization does, at least the best data visualizations that I’ve seen, is that they introduce new questions. You’ll see something and you’ll go, oh wow, I wonder why this is happening, or is there an explanation for this? Last year I saw a great data visualization which showed taxi rides in New York City. You could look at each individual taxi ride on a map of New York City, and it was fascinating to see what the taxi drivers do, how long rides last, tips, and so on. Just watching that creates a host of interesting questions. So I think the best visualization will introduce new questions that are of interest, but will also explain what the designer of the visualization wanted to show you in the first place.

I mean, the first thing that we all do with a data set, when we’re doing exploratory data analysis, is try and visualize the data in multiple ways. Our brain is able to identify patterns in data pretty easily. So we can start to look at what type of analysis might be appropriate for that data set. And when we’ve gone through the analysis of the data set, uncovered results, and been able to solve new problems, then presenting what we found to the general public, or to an audience that needs to understand what we found, is also done via data visualization. So it’s both the initial exploration of the data set and the presentation of the results to the people that need to understand what we’ve learned.






