1
00:00:00,000 --> 00:00:07,000
Translator: Andrea McDonough
Reviewer: Jessica Ruby
2
00:00:31,085 --> 00:00:33,847
Big data is an elusive concept.
3
00:00:35,987 --> 00:00:38,675
It represents an amount of digital information,
4
00:00:38,675 --> 00:00:40,845
which is uncomfortable to store,
5
00:00:40,845 --> 00:00:41,973
transport,
6
00:00:41,973 --> 00:00:43,851
or analyze.
7
00:00:43,851 --> 00:00:45,766
Big data is so voluminous
8
00:00:45,766 --> 00:00:48,474
that it overwhelms the technologies of the day
9
00:00:48,474 --> 00:00:50,899
and challenges us to create the next generation
10
00:00:50,899 --> 00:00:54,004
of data storage tools and techniques.
11
00:00:59,557 --> 00:01:01,336
So, big data isn't new.
12
00:01:01,336 --> 00:01:03,694
In fact, physicists at CERN have been wrangling
13
00:01:03,694 --> 00:01:08,093
with the challenge of their ever-expanding big data for decades.
14
00:01:09,431 --> 00:01:11,754
Fifty years ago, CERN's data could be stored
15
00:01:11,754 --> 00:01:13,506
in a single computer.
16
00:01:13,506 --> 00:01:15,660
OK, so it wasn't your usual computer,
17
00:01:15,660 --> 00:01:17,077
this was a mainframe computer
18
00:01:17,077 --> 00:01:19,387
that filled an entire building.
19
00:01:21,494 --> 00:01:22,663
To analyze the data,
20
00:01:22,663 --> 00:01:25,611
physicists from around the world traveled to CERN
21
00:01:25,611 --> 00:01:28,637
to connect to the enormous machine.
22
00:01:31,075 --> 00:01:33,928
In the 1970s, our ever-growing big data
23
00:01:33,928 --> 00:01:36,678
was distributed across different sets of computers,
24
00:01:36,678 --> 00:01:38,708
which mushroomed at CERN.
25
00:01:38,708 --> 00:01:40,150
Each set was joined together
26
00:01:40,150 --> 00:01:42,678
in dedicated, homegrown networks.
27
00:01:42,678 --> 00:01:44,464
But physicists collaborated without regard
28
00:01:44,464 --> 00:01:46,413
for the boundaries between sets,
29
00:01:46,413 --> 00:01:49,302
hence needed to access data on all of these.
30
00:01:49,302 --> 00:01:51,287
So, we bridged the independent networks together
31
00:01:51,287 --> 00:01:54,379
in our own CERNET.
32
00:01:54,379 --> 00:01:57,227
In the 1980s, islands of similar networks
33
00:01:57,227 --> 00:01:58,771
speaking different dialects
34
00:01:58,771 --> 00:02:01,311
sprang up all over Europe and the States,
35
00:02:01,311 --> 00:02:04,402
making remote access possible but torturous.
36
00:02:04,402 --> 00:02:06,546
To make it easy for our physicists across the world
37
00:02:06,546 --> 00:02:08,951
to access the ever-expanding big data
38
00:02:08,951 --> 00:02:10,744
stored at CERN without traveling,
39
00:02:10,744 --> 00:02:12,043
the networks needed to be talking
40
00:02:12,043 --> 00:02:13,413
with the same language.
41
00:02:13,413 --> 00:02:17,208
We adopted the fledgling internetworking standard from the States,
42
00:02:17,208 --> 00:02:18,584
followed by the rest of Europe,
43
00:02:18,584 --> 00:02:20,752
and we established the principal link at CERN
44
00:02:20,752 --> 00:02:23,255
between Europe and the States in 1989,
45
00:02:23,255 --> 00:02:26,041
and the truly global internet took off!
46
00:02:28,580 --> 00:02:30,371
Physicists could easily then access
47
00:02:30,371 --> 00:02:32,183
the terabytes of big data
48
00:02:32,183 --> 00:02:33,846
remotely from around the world,
49
00:02:33,846 --> 00:02:35,225
generate results,
50
00:02:35,225 --> 00:02:37,520
and write papers in their home institutes.
51
00:02:37,520 --> 00:02:39,021
Then, they wanted to share their findings
52
00:02:39,021 --> 00:02:40,813
with all their colleagues.
53
00:02:40,813 --> 00:02:42,416
To make this information sharing easy,
54
00:02:42,416 --> 00:02:45,358
we created the web in the early 1990s.
55
00:02:45,358 --> 00:02:47,196
Physicists no longer needed to know
56
00:02:47,196 --> 00:02:48,833
where the information was stored
57
00:02:48,833 --> 00:02:51,402
in order to find it and access it on the web,
58
00:02:51,402 --> 00:02:53,536
an idea which caught on across the world
59
00:02:53,536 --> 00:02:55,912
and has transformed the way we communicate
60
00:02:55,912 --> 00:02:57,580
in our daily lives.
61
00:03:00,226 --> 00:03:01,633
During the early 2000s,
62
00:03:01,633 --> 00:03:03,623
the continued growth of our big data
63
00:03:03,623 --> 00:03:06,914
outstripped our capability to analyze it at CERN,
64
00:03:06,914 --> 00:03:10,499
despite having buildings full of computers.
65
00:03:10,499 --> 00:03:12,805
We had to start distributing the petabytes of data
66
00:03:12,805 --> 00:03:14,387
to our collaborating partners
67
00:03:14,387 --> 00:03:17,139
in order to employ local computing and storage
68
00:03:17,139 --> 00:03:19,974
at hundreds of different institutes.
69
00:03:19,974 --> 00:03:22,269
In order to orchestrate these interconnected resources
70
00:03:22,269 --> 00:03:24,313
with their diverse technologies,
71
00:03:24,313 --> 00:03:26,064
we developed a computing grid,
72
00:03:26,064 --> 00:03:27,640
enabling the seamless sharing
73
00:03:27,640 --> 00:03:30,068
of computing resources around the globe.
74
00:03:30,068 --> 00:03:34,459
This relies on trust relationships and mutual exchange.
75
00:03:34,459 --> 00:03:36,752
But this grid model could not be transferred
76
00:03:36,752 --> 00:03:39,036
out of our community so easily,
77
00:03:39,036 --> 00:03:41,330
where not everyone has resources to share
78
00:03:41,330 --> 00:03:43,206
nor could companies be expected
79
00:03:43,206 --> 00:03:45,959
to have the same level of trust.
80
00:03:45,959 --> 00:03:48,254
Instead, an alternative, more business-like approach
81
00:03:48,254 --> 00:03:50,090
for accessing on-demand resources
82
00:03:50,090 --> 00:03:51,798
has been flourishing recently,
83
00:03:51,798 --> 00:03:53,466
called cloud computing,
84
00:03:53,466 --> 00:03:55,342
which other communities are now exploiting
85
00:03:55,342 --> 00:03:57,342
to analyze their big data.
86
00:03:57,342 --> 00:04:00,329
It might seem paradoxical for a place like CERN,
87
00:04:00,329 --> 00:04:01,900
a lab focused on the study
88
00:04:01,900 --> 00:04:05,071
of the unimaginably small building blocks of matter,
89
00:04:05,071 --> 00:04:08,448
to be the source of something as big as big data.
90
00:04:08,448 --> 00:04:10,530
But the way we study the fundamental particles,
91
00:04:10,530 --> 00:04:13,143
as well as the forces by which they interact,
92
00:04:13,143 --> 00:04:15,246
involves creating them fleetingly,
93
00:04:15,246 --> 00:04:17,614
colliding protons in our accelerators
94
00:04:17,614 --> 00:04:19,041
and capturing a trace of them
95
00:04:19,041 --> 00:04:21,314
as they zoom off near light speed.
96
00:04:21,314 --> 00:04:22,308
To see those traces,
97
00:04:22,308 --> 00:04:25,756
our detector, with 150 million sensors,
98
00:04:25,756 --> 00:04:28,231
acts like a really massive 3-D camera,
99
00:04:28,231 --> 00:04:30,341
taking a picture of each collision event -
100
00:04:30,341 --> 00:04:32,891
that's up to 14 million times per second.
101
00:04:32,891 --> 00:04:35,424
That makes a lot of data.
102
00:04:37,194 --> 00:04:39,353
But if big data has been around for so long,
103
00:04:39,353 --> 00:04:41,980
why do we suddenly keep hearing about it now?
104
00:04:41,980 --> 00:04:43,691
Well, as the old metaphor explains,
105
00:04:43,691 --> 00:04:46,479
the whole is greater than the sum of its parts,
106
00:04:46,479 --> 00:04:50,256
and this is no longer just science that is exploiting this.
107
00:04:50,256 --> 00:04:51,860
The fact that we can derive more knowledge
108
00:04:51,860 --> 00:04:54,190
by joining related information together
109
00:04:54,190 --> 00:04:55,741
and spotting correlations
110
00:04:55,741 --> 00:04:59,132
can inform and enrich numerous aspects of everyday life,
111
00:04:59,132 --> 00:05:00,160
either in real time,
112
00:05:00,160 --> 00:05:02,451
such as traffic or financial conditions,
113
00:05:02,451 --> 00:05:04,206
in short-term evolutions,
114
00:05:04,206 --> 00:05:06,333
such as medical or meteorological,
115
00:05:06,333 --> 00:05:08,058
or in predictive situations,
116
00:05:08,058 --> 00:05:11,078
such as business, crime, or disease trends.
117
00:05:13,369 --> 00:05:16,432
Virtually every field is turning to gathering big data,
118
00:05:16,432 --> 00:05:18,769
with mobile sensor networks spanning the globe,
119
00:05:18,769 --> 00:05:21,056
cameras on the ground and in the air,
120
00:05:21,056 --> 00:05:24,067
archives storing information published on the web,
121
00:05:24,067 --> 00:05:26,196
and loggers capturing the activities
122
00:05:26,196 --> 00:05:28,895
of Internet citizens the world over.
123
00:05:28,895 --> 00:05:31,486
The challenge is on to invent new tools and techniques
124
00:05:31,486 --> 00:05:33,439
to mine these vast stores,
125
00:05:33,439 --> 00:05:35,240
to inform decision making,
126
00:05:35,240 --> 00:05:37,496
to improve medical diagnosis,
127
00:05:37,496 --> 00:05:39,706
and otherwise to answer needs and desires
128
00:05:39,706 --> 00:05:43,663
of tomorrow's society in ways that are unimagined today.