Building next-generation sustainability workloads with open data

a knowledge base that will be like having the world's museums, libraries and scientists all on a single device. Just imagine the possibilities in research, conservation and beyond with a tool like that available.

Good afternoon. My name is Sam Bydlon. I'm a solutions architect at AWS and part of the worldwide specialist organization's global impact compute team. The mission of my team is to improve the lives of humans, other species and the environment with the power of AWS technologies and advanced computing. And it is this mission that led us to work with the Natural History Museum in London, or NHM for short, one of the world's great museums and a champion of pushing our society towards a more sustainable future, one in which both people and the planet can thrive.

I'm joined by Dr. Vincent Smith, head of digital, data and informatics at NHM. In this session, Vince and I are going to discuss the sustainability workloads of tomorrow, with a particular focus on the role that open, publicly available data will play, not only in the construction of those workloads, but as a concept around which we can form outputs and deliverables.

Thanks Sam. Good afternoon, everybody. So today we're gonna be focusing on the role of open data in our collective missions and how we can really push back the boundaries of open data by working together.

Now, the Natural History Museum and AWS might seem like unlikely bedfellows. We're two quite different organizations with different visions and missions; we were even born in quite different centuries. But we do share a common goal of going beyond just making open data more accessible to making open data more useful and more insightful. And it's through our partnership that we've been working on one of the Natural History Museum's big open data sets, building tooling on top of that open data to aid sustainability research. And the sustainability research challenge that we're trying to address?

Well, it's a big one: the global biodiversity crisis, and how, essentially, we can try and fix that with big open data. So on this map is pretty much the sum total of every observation of every species on the planet. Every one of those dots that you can see corresponds to a particular species at a particular place and a particular point in time. And some of the most important dots on that map come from natural history collections like mine. The reason they're important is because they're pretty much the only source of historical information about where species were. And if we know where species were, and we know where species are now, we can start to predict where they're going to be in the future under different scenarios of environmental change.

Now, at the moment, our ability to solve this crisis is hampered by only having a partial picture of this map. But with the help of AWS, we're building ways to fill out that map and to link new data sets together, and that's what we're gonna be talking about today.

This session is going to be broken into four parts. First, I'm going to touch on the interconnection of sustainability and open data, including some of the challenges that come with integrating data into sustainability workloads and some ways that you can leverage AWS to solve those challenges.

Second, I'm going to turn it over to Vince to discuss NHM's vision of one of these next-generation workloads, the Planetary Knowledge Base, and the incredible work that's happening at NHM to bring this vision to life, in particular through the massive effort of digitizing and organizing a global-scale natural science collection that dates back multiple centuries.

Third, we're going to take a deep technical dive into a critical component of the Planetary Knowledge Base, the underlying knowledge graph, and the work that NHM and AWS did together to prototype a solution for this knowledge graph using services such as Amazon Neptune, Neptune ML and AWS Glue.

And finally, we're going to wrap up with a look to the future and a few simple calls to action. But let's begin by exploring the connection between sustainability and open data.

Solving some of the world's great sustainability challenges, such as climate change, excessive pollution and biodiversity loss, is going to require massive amounts of data. But working with sustainability data often comes with its own set of challenges and questions, such as: where do I find the data I need? Am I going to have to pay to access that data? How can I share the data that I or my organization has collected with others? And finally, how do I access the storage, compute and other technologies I need to take my data sets to the next level and make them even more useful in order to solve these sustainability challenges?

As a global community, we're going to need to break down the barriers to research and progress and exploit the discovery and innovation that AWS services and the power of open data can give us. So what is open data? Open data is any data that's publicly available, without license or subscription fee, to anyone around the world for any purpose. Many of tomorrow's sustainability workloads are going to rely on open data sets either as a fundamental input or ingredient, or as an output, meaning that these workloads are going to be designed to update, augment or create brand new open data sets themselves.

AWS is committed to supporting the open data community and the democratization of data access. Through programs such as AWS Marketplace, AWS Data Exchange and the Registry of Open Data on AWS, organizations can share their data globally with a large and growing community of scientists, researchers, developers and companies.

Currently, AWS is home to over 450 open data sets that you or your organization could start working with today. And since open data is hosted in the AWS cloud, end users of that data spend less time on data acquisition and more time developing the innovations we need to push sustainability solutions forward. Those end users can also leverage AWS's broad suite of compute, analytics and machine learning services, just to name a few, reducing some of the undifferentiated heavy lifting of actually getting data onto the cloud.

On the screen here, you'll see a screenshot of what AWS Marketplace and Data Exchange look like in practice. You can see that by filtering for open data sets, you can easily browse, identify and start working with open data sets today. But not all open data is suitable for sustainability workloads, and that's why we created the Amazon Sustainability Data Initiative, or ASDI.

ASDI is a program within the Open Data on AWS initiative that hosts over 180 data sets and is aimed at accelerating research and development in sustainability fields, lowering the cost and time associated with acquiring and analyzing large sustainability data sets. These data sets span areas such as satellite imagery, agriculture, weather and climate, and biodiversity, and are managed and provided by leading institutions such as NOAA, Digital Earth Africa, Esri and Maxar.

Now, I'd like everyone to take a close look at the categories on screen right now. Is your organization solving a problem in one or more of these areas that you need data for? If the answer is yes, ASDI might have a data set for you. I encourage everyone over the next few weeks to take a look at the data sets that ASDI has to offer. Maybe you'll find a data set that you've been looking for. Maybe you'll find a data set that inspires your next sustainability-focused workload. Maybe you or your organization has a data set that you've been wanting to share with the world but just haven't had the right medium to do so. The only thing that looking will cost you is a few moments of your time.

Now, as Vince mentioned earlier, the future of open data is one that aims to move beyond simply taking data sets and making them publicly available and into the realm of making these data sets more useful and usable by end users. For instance, by providing an environment for end users to contribute and vet updates, corrections and augmentations to open data sets, or by building machine learning models on top of these data sets that can then be made widely available for inference, thereby reducing the need for individuals and organizations to build, train and deploy their own machine learning solutions.

With over 200 fully featured services across areas such as databases, analytics, compute and machine learning, the AWS cloud is a great place to build the workloads that are going to give rise to this next generation of open data sets. But what does this process look like in reality? To help answer that question, I'm going to turn it over to Vince to talk about how NHM is working towards building a next-generation open data set called the Planetary Knowledge Base.

Thanks Sam. So in this next section, then we're gonna be doing a little bit of a deeper dive looking at why big open data is so important to the natural science collection community and how with the help of AWS, we're transforming the management of that data.

So first and foremost, biodiversity is absolutely essential to the health of this planet. A recent World Economic Forum report showed that roughly half of global GDP, something like $44 trillion a year, is either highly or moderately dependent upon nature. So whether it's climate change, emerging disease, food security or finding rare elements, things like lithium to power our next generation of cars or our laptop batteries, for example, the natural world really matters.

And at the moment, we've got a pretty severe problem, because we are losing species at quite an alarming rate, something like 1,000 times the natural background rate. That equates to at least 1 to 5 species a year, that we know about, going extinct. And in fact, for some groups that rate is higher; for some groups we'll probably lose 30 to 50% of those species by about 2050. And this is fundamentally restructuring our relationship with nature and the systems that we absolutely depend upon for our resources.

But the other key point to make here is that it's not all doom and gloom. This problem can be solved with enough political will and enough effort. If we set aside the right parcels of land, if we plant the right kinds of trees in the right kinds of places to suck carbon out of the atmosphere, if we consume and make resources in a more sustainable way, then we can fix this problem. We can create, in effect, a future where both people and the planet thrive.

But taking the right kinds of actions requires data, and it requires a lot of data. So earlier this year, my institution, along with 72 other major natural history museums across the globe, published a paper in Science that was a first attempt at, in effect, an inventory of our respective collections across 28 countries. And together we totted up that we had something in the region of 1.1 billion specimens in those collections.

And one of the really interesting facts about that is that the bulk of that collection comes from the last 250 years. That's a really important period, because it's the time of the agricultural revolution, the time of the industrial revolution, and most recently of all the time when climate change has really started to bite. And the signatures of each of those are written into our collections, if we can extract and unlock the data from the specimens we hold.

Also, by bringing all of this data together, we can start to see insights about how our respective collections align, where there are gaps, and where there are areas of inconsistency or imbalance in our data. So for example, one of the really positive things is that insects make up something like 50% of all our collections, which is a good thing, because roughly half of all the diversity of life that we've described is an insect. But we have a very small proportion of our staff, less than 10%, that work on those insects. Also, many of the big challenges that we have with biodiversity are often present in the hyper-biodiverse countries of the global south, but most of the data is locked up in collections in the global north. So obviously, digitization has a huge part to play in terms of improving access to that data and delivering fresh insights from it.

Now, this study also revealed a really quite profound problem in terms of how we, within our community, manage our data. For the most part, much of that data is highly fragmented and locked up in a series of institutional data silos. Take taxonomic names, for example: there are 20 million taxonomic names that often end up being repeated across these different databases. And similarly, all sorts of geospatial information, molecular information and data from scientific publications are pushed and fragmented into these different silos.

And that's illustrated by this little cartoon graphic we have here. So if you take that little weevil, that specimen down at the bottom, typically what happens to a museum specimen is that over the course of time people curate the data associated with it. Someone might come along and re-describe it as a different species or re-identify it. Someone will come along and add some molecular information, for example. Another person might add some images. But again, all of that data resides in separate, fragmented databases.

So what we clearly need to do is move towards an integrated knowledge base, something we're calling the Planetary Knowledge Base, where, in effect, across the natural science collections community we can co-curate that data, stand on each other's understanding that we're all generating within our respective institutions, and move a lot faster in terms of the science that we do with this data.

But also critically, we can add to this data much more effectively if we're working in an integrated way. And one of the big ways that we add to our collection is through the digitization of the data associated with it. And by digitization I mean, fundamentally, taking pictures and capturing data which is often written on a series of handwritten labels associated with those specimens.

Now in my institution alone we're digitizing something in the region of about 350,000 specimens a year. And it's an enormous effort to manage the data and deliver the data workflows that extract that data and plug that into our databases. But with the help of the knowledge base that we're building, we're building integrated workflows that aid the transcription of that data, bring that data together and also aid in the quality control of that data.

But the knowledge base also isn't just about internal efficiencies within our community. By aggregating that data we can start to build a set of services targeted at particular segments of the research community that use all this information, these knowledge products as I call them. We're gonna talk a little bit about that at the back end of this talk.

But just for the moment I wanna do a little bit of a deeper dive now on this digitization challenge because I think this gets to the heart of some of the scale, if you like, of the challenge that we have with natural science collections.

So natural science collections are really huge in probably every conceivable dimension. They range from millions of insects, many of which are perhaps smaller than a full stop on a printed page, to dinosaur femurs about 3 meters long.

They cover all sorts of periods of time. So the oldest specimens that we might have in our collection are meteorites, about 4.5 billion years old. And that runs right the way through to specimens we're collecting right now in some obscure parts of the world, places like the ocean floor.

We typically add to our collection at the rate of something like about 100,000 specimens a year in the case of the Natural History Museum. But the big problem is, as I mentioned previously, most of the data associated with that historical collection is locked up usually on a series of often handwritten labels.

And this is exemplified by this little bug that is skewered with a pin sticking through it, and you can see that series of labels on it. We have 30 million of those in our collection, and all the data that we need to get at is stuck on that pin.

So over the course of the last three years we've been developing computer vision and machine learning systems to help with the extraction of that data from our collection. And that's what you can see on this slide here. And more recently also we're beginning to integrate robotics into that process too.

So this device that you can see on the left is one of our insect digitization stations. What we would do is typically take one of our specimens, one of our insects, we put that pinned insect in that device. A series of photographs are taken around that almost instantaneously.

Then, using computer vision and machine learning, the label regions are segmented out and stitched together into one integrated view. And then we can use various automated processes to transcribe the information that's on those labels.

And as I say, we're now starting to use robotics to help facilitate that process. So in some cases we're picking and placing our specimens in our digitization systems. In other cases we're actually attaching cameras to the robot and then using that to speed up our digitization.

And using this sort of technology, at least for some of our groups, we've taken a process that would typically take about 3.5 minutes per specimen and, in some cases, got that down to about 1.5 seconds through our digitization systems. And we're really scaling up this process.

So what you can see on this slide are various shots of these digitization facilities, almost industrial, climate-controlled, pest-proof warehouses. We're using conveyor belt systems to speed up the transfer of those specimens, and we're also now integrating robotics.

And once we've got that digitized record, that record acts as a kind of stub across which we can hang all the other kinds of information that we need in our knowledge base. So things like genomic data for example, various molecular bits of information, various information about the ecological associations of those species, things like what it feeds on, what it parasitizes, whether it's a predator.

We can also integrate all the scientific literature, which is also being digitized and can be hung off the system too. So this knowledge base then becomes an amazing place where we can integrate all of that information.

And I'm gonna hand back to Sam to talk a little bit about the technicalities of how we've built our knowledge base.

Thank you Vince. So now we're going to take a deep technical dive into one of those critical components of this Planetary Knowledge Base, and that's an underlying knowledge graph.

So in this section we're going to discuss what a knowledge graph is, how we used AWS services to bring that knowledge graph to life, and the steps that we took to build a machine learning solution called a graph neural network on top of that knowledge graph, in order to make the PKB's data even more useful to potential end users such as biodiversity researchers.

So as Vince mentioned, the Planetary Knowledge Base is a multi-component system that involves the large-scale transcription and digitization of specimen samples. But in order to fully harmonize the data that comes from these processes with data from other sources such as pre-existing open data sets, there needs to be a central data structure and database where all this data can be brought together.

In the case of the Planetary Knowledge Base that data foundation is a graph database and a knowledge graph. A knowledge graph is a way of structuring data such that it represents a network of nodes, sometimes also called entities or vertices, and the relationships between those nodes.

By using a knowledge graph we can bring together existing data, newly digitized data and data that will be collected and digitized in the future all into a single usable database.

Now the best way to build and store knowledge graphs is by using a graph database. A graph database is a purpose-built database for building, storing and navigating the relationships in highly interconnected data sets.

For graph databases on AWS, Amazon Neptune is an excellent choice. Amazon Neptune is a service that provides fully managed graph databases that are highly scalable, durable and available. And for workloads that are spiky or unpredictable, including workloads under development, Amazon Neptune offers a cost-effective serverless option that scales the database's underlying compute up and down in response to the workload demands of the database, reducing some of that undifferentiated heavy lifting associated with capacity planning and ensuring that costs are optimized during the times where the database's workload is not at its peak.

Amazon Neptune also integrates with Amazon SageMaker, which is one of AWS's go-to services for building and deploying machine learning solutions for any use case. This integration provides helpful tools such as Neptune ML for building machine learning solutions on top of graph databases including knowledge graphs and graph neural networks.

Amazon Neptune users can also leverage Neptune notebooks which are Jupyter notebooks hosted in Amazon SageMaker that come preloaded with the latest version of the open source Neptune graph notebook project which makes it easy to connect graph notebooks to graph databases and provides a suite of so-called "magic commands" or simple Python commands that make everything from connecting to databases to loading data to building and deploying machine learning solutions on graph databases quick and simple.

Data can be loaded into Amazon Neptune in common file formats such as CSV files, and the databases are integrated with popular open graph frameworks and APIs such as Apache TinkerPop and openCypher.

So it should come as no surprise that for the Planetary Knowledge Base graph database, we chose to build with Amazon Neptune. But where did we start in terms of the data?

Now, in early conversations with the Natural History Museum's artificial intelligence team, the team had identified an open data set that they wanted to use as the foundation of their graph database. That data set is the Global Biodiversity Information Facility's, or GBIF's, species occurrence data set.

Now GBIF itself is actually an international network of organizations and scientists set up and funded by governments around the world, with the aim of providing anyone anywhere open access to data about all types of life on Earth. Their flagship data set, the species occurrence data set, consists of over 2.5 billion species occurrence records from around the world published by over 2100 individual institutions in the form of over 90,000 unique data sets.

And then all of these data sets are taken by the GBIF international network and harmonized around a set of best practices and common data standards into a single usable data set. That data set is then distributed in cloud optimized formats and comes with open source tooling that makes it easy to create, manage and share biodiversity related data sets.

And it just so happened that the species occurrence dataset was already available on the AWS cloud in a public Amazon S3 bucket at no cost via the Amazon Sustainability Data Initiative. It was actually a really cool moment when we all connected the dots and realized the data set that we wanted to start with was already available on the AWS cloud.

So now we knew what data set we wanted to build around. So next, we needed to understand the broader structure of a knowledge graph we wanted to build.

So I want to remind everyone that the definition of a knowledge graph is a way of structuring data such that it represents a network of nodes and the relationships between those nodes. So what we needed to do was identify the types of nodes and the types of relationships that we wanted our knowledge graph to represent.

Fortunately, NHM's artificial intelligence team had an idea of what they wanted their knowledge graph to look like. And it all starts with the fundamental node type - the specimen. An example of a specimen node could be a butterfly that was collected in Brazil in 1960 or a mesquite plant collected at the Grand Canyon just this past summer.

Now, initially the data that we built our knowledge graph with was the GBIF species occurrence data set. But the long-term vision is to take the specimen data that flows from those large-scale digitization processes that Vince described earlier and feed it into the Planetary Knowledge Base's graph database and knowledge graph.

But specimen data on its own doesn't tell the full picture of a natural science collection, the human impacts on those collections, or the emerging spatial and temporal trends that we can expect in global biodiversity. For instance, we know that people are involved: scientists and researchers go out and collect specimen samples and bring them home, and those samples are then recorded, also by people, in natural science collections. So we knew that we needed a person entity and relationships between the person and specimen entities to represent these collected-by and recorded-by relationships.

We also know that institutions play a role and that people are affiliated with institutions, just as Vince is affiliated with the Natural History Museum in London, but also that institutions are credited as the discovering institution of individual specimens. So we needed relationships that represented all of that information.

We also wanted to include geographic information at the country level. So think back to that butterfly specimen that was collected in Brazil: there are millions of specimen samples that come from Brazil, or from any individual country. So we needed a way to connect country nodes such as Brazil or the UK with the individual specimen samples that come from there, and we created an origin country relationship between these two types of nodes.

We also know that institutions have a relationship with the countries they're located in, just as NHM is located in the UK. And finally, we needed a taxon node and a relationship between the specimen nodes and the taxon nodes that represents the taxonomic determination of the individual specimen samples, such as kingdom, phylum, species, et cetera.

Now, some of this data we can get from the species occurrence data set: that's the specimen nodes and the taxon nodes, denoted by the green circles here. But some of that data we needed to go out and collect and curate ourselves, namely the person, institution and country nodes, and this painstaking effort was undertaken by NHM's artificial intelligence team. We're very thankful that they were willing to do that.
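
To make that structure concrete, here's a minimal sketch of how a few of these nodes and relationships could be created with Gremlin from a Neptune graph notebook. The node labels, edge labels, property keys and IDs (specimen, ORIGIN_COUNTRY, and so on) are illustrative assumptions for this talk, not the PKB's actual schema.

    %%gremlin
    g.addV('specimen').property(id, 'specimen-001').
        property('scientificName', 'Ornithoptera priamus').
        property('eventDate', '1960-05-14').as('s').
      addV('country').property(id, 'country-BR').property('name', 'Brazil').as('c').
      addV('institution').property(id, 'institution-NHM').
        property('name', 'Natural History Museum, London').as('i').
      addV('taxon').property(id, 'taxon-lepidoptera').property('name', 'Lepidoptera').as('t').
      addE('ORIGIN_COUNTRY').from('s').to('c').
      select('s').addE('DISCOVERING_INSTITUTION').from('s').to('i').
      select('s').addE('HAS_TAXON').from('s').to('t')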

Great. So now we know what data set we wanted to start with, and we know what we wanted our knowledge graph to look like. So now we can actually dive into how we brought this knowledge graph to life using AWS services.

So the first thing to note is that in order to load data into a graph database, it needs to be formatted such that it complies with an open graph framework such as Apache TinkerPop or openCypher. Now, exactly how to format data in what's called Gremlin load format, which is the format we used and is associated with Apache TinkerPop, is beyond the scope of this talk. But an important thing to note is that the raw GBIF species occurrence data set was not formatted such that it could be read into a graph database.

So this is step one for us: a classic ETL, or extract, transform and load, job. In order to solve this problem, and with simplicity in mind, we used AWS Glue, a serverless data integration service that allows you to define and schedule ETL jobs without having to set up and manage the underlying compute needed to run those jobs. Specifically, we set up an AWS Glue job that grabbed the raw GBIF species occurrence data set from its public Amazon S3 bucket, transformed and formatted it into Gremlin load format, which is the format that we needed to read into our graph database, and saved those files as CSV files in a separate Amazon S3 bucket that we could then start working with. We also scheduled this job to run every month, because the GBIF species occurrence data set gets updated every month with new records that flow in from around the world.
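
As a rough illustration of the shape of such a job, not NHM's actual pipeline, the sketch below is a Glue PySpark script that reads the public GBIF occurrence parquet data, renames a few columns into Gremlin-load-format vertex and edge columns, and writes CSVs to a staging bucket. The snapshot path, bucket names and column choices are placeholders and assumptions.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Minimal Glue ETL sketch with placeholder paths; a real job would handle many
    # more node and edge types, plus data cleaning and typing, than shown here.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)
    spark = glue_context.spark_session

    # Public GBIF occurrence snapshot (placeholder path; pick a real monthly snapshot).
    occurrences = spark.read.parquet(
        "s3://gbif-open-data-us-east-1/occurrence/YYYY-MM-01/occurrence.parquet/"
    )

    # Specimen vertices in Gremlin load format: vertex files need ~id and ~label
    # columns, plus (optionally typed) property columns.
    specimen_vertices = occurrences.select(
        F.concat(F.lit("specimen-"), F.col("gbifid").cast("string")).alias("~id"),
        F.lit("specimen").alias("~label"),
        F.col("species").alias("scientificName:String"),
        F.col("countrycode").alias("countryCode:String"),
    )

    # Origin-country edges: edge files need ~id, ~from, ~to and ~label columns.
    origin_country_edges = occurrences.filter(F.col("countrycode").isNotNull()).select(
        F.concat(F.lit("origin-"), F.col("gbifid").cast("string")).alias("~id"),
        F.concat(F.lit("specimen-"), F.col("gbifid").cast("string")).alias("~from"),
        F.concat(F.lit("country-"), F.col("countrycode")).alias("~to"),
        F.lit("ORIGIN_COUNTRY").alias("~label"),
    )

    # Write the CSVs to the staging bucket that the Neptune bulk loader will read.
    specimen_vertices.write.mode("overwrite").option("header", True).csv(
        "s3://example-pkb-staging/vertices/specimen/"
    )
    origin_country_edges.write.mode("overwrite").option("header", True).csv(
        "s3://example-pkb-staging/edges/origin_country/"
    )
    job.commit()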

So now we have our data formatted so that it can be read into a graph database, and we want to get it loaded. But before we can load our data into a graph database, we have to set a graph database up. You can set up an Amazon Neptune graph database through either the AWS console or the command line interface, but the fastest and easiest way to get started with Amazon Neptune and graph databases is by using the official AWS CloudFormation template for Amazon Neptune and Neptune ML, which will set up the cluster instances, the necessary IAM roles and the graph notebooks on your behalf.

Also, for this specific project, we leveraged Amazon Neptune's serverless offering that I touched on earlier, because there were no clear capacity or workload demand requirements. This freed us from having to manage and optimize the underlying compute capacity and ensured that our costs were optimized when we weren't working on the database or it didn't have high traffic.
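
If you prefer to launch that stack programmatically rather than from the console, it might look something like the sketch below. The stack name, region and template URL are placeholders, not the official template location, so check the Neptune ML documentation for the current quick-start template for your region.

    import boto3

    # Sketch only: replace TemplateURL with the official Neptune / Neptune ML
    # quick-start template for your region, and adjust the region and stack name.
    cloudformation = boto3.client("cloudformation", region_name="eu-west-2")
    cloudformation.create_stack(
        StackName="pkb-neptune-dev",
        TemplateURL="https://example-bucket.s3.amazonaws.com/neptune-ml-quickstart.json",
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )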

So now we've used that CloudFormation template and set up our graph database, and we can get our data loaded. There are a couple of different ways you can load data into a graph database. You can load it piece by piece using Gremlin commands, called steps in the Gremlin query language, and you can do that in a scripted fashion too. But the fastest and easiest way to load large amounts of data is by using a Neptune graph notebook and what's called the Neptune bulk loader.

You can initiate the Neptune bulk loader through a Neptune graph notebook using one of our handy Neptune magic commands, the %load command, which comes built into Neptune graph notebooks. A Neptune bulk loader call is made up of several parameters, but many of these parameters come with usable defaults, so I want to draw your attention to four parameters that you'll probably need to set if you start working with this widget. The first of those parameters is the source: this is just the Amazon S3 URI of the bucket where the formatted data that you want to load into your database lives.

The next parameter is format, a drop-down menu where all you're telling the bulk loader is the file type of the files you're interested in loading, for instance CSV. Third, you have the region parameter, which is the region where your Neptune database cluster lives; and pro tip, the region that your database cluster lives in needs to be the same as the region of your source S3 bucket.

And finally, you need to provide a load ARN, which is an IAM role that gives the Neptune database permission to reach into that source Amazon S3 bucket and grab the formatted data. You can also accomplish this through the command line if you're more comfortable working there. So for instance, here I have a curl command, and what you'll notice is that the parameters you need to pass this command are identical to those from the Neptune bulk loader widget, with the added requirement that you need to provide the port number and URL of the Amazon Neptune database endpoint.
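
A reconstruction of the kind of curl call being described, sending a request to Neptune's bulk loader REST endpoint, might look like this; the cluster endpoint, bucket, IAM role ARN and region are placeholders.

    # Placeholder values throughout; the loader listens on the cluster endpoint, port 8182.
    curl -X POST -H 'Content-Type: application/json' \
        https://your-neptune-cluster.cluster-xxxxxxxx.eu-west-2.neptune.amazonaws.com:8182/loader -d '
        {
          "source": "s3://example-pkb-staging/vertices/specimen/",
          "format": "csv",
          "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
          "region": "eu-west-2",
          "failOnError": "FALSE"
        }'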

Cool. So now we have data loaded into our graph database, and it's possible for end users to start querying that data. So next, we're gonna examine what it looks like to query a graph database in practice using a Neptune graph notebook. In the following demonstration, we're going to imagine that we're a scientist at NHM who is preparing for an upcoming research expedition to one of our field sites in Fiji.

So what we want to do is use the Planetary Knowledge Base to understand what types of specimens we might encounter in Fiji, but also the types of specimens that were recorded by our colleagues at NHM in past field expeditions to this site.

So here we are in our graph notebook. And the first thing that we want to do is make sure that our graph notebook is connected to our graph database. So we're gonna use another one of our handy Neptune magic commands to check the status of the graph database, and we're gonna get a signal back that tells us that our graph database is up, healthy and ready to be queried.
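
In a Neptune graph notebook, that health check is a single cell; the output shown in the comment is abridged and indicative only.

    # %status is the built-in notebook magic that reports the cluster's health.
    %status
    # Abridged example output: {'status': 'healthy', 'startTime': '...', 'dbEngineVersion': '...'}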

What we're going to do next is we're going to use the Gremlin querying language to devise a set of queries that allows us to get information out of our database. The first query that we're going to issue is intended to return a set of paths that all emanate from a country node with the name attribute of Fiji. Now a path in a graph database is a collection of nodes and edges or relationships that form a continuous and unique traversal through our graph.
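
Such a query might look like the sketch below in a %%gremlin cell. The labels here (country, ORIGIN_COUNTRY and so on) follow the illustrative schema sketched earlier rather than the PKB's actual property names, and the limit just keeps the result set and visualization manageable.

    %%gremlin
    g.V().hasLabel('country').has('name', 'Fiji').
      inE('ORIGIN_COUNTRY').outV().
      outE().inV().
      path().
      limit(25)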

So here you can see that the query is gonna return path information to us in the form of text. But we're also gonna be able to visualize that query by clicking on the graph tab that you see here. And when we click into this graph tab, we can now start exploring our data in a different way. This gives us a powerful new dimension to understand, examine and explore the patterns in our data.

So we can see here that we can click through our nodes and our edges to understand how they all connect to each other. We can see species that were discovered in Fiji and the taxon and discovering institutions of those individual species. For instance, here we can see a species that was actually discovered by someone at the Natural History Museum in London. So this again gives us a really powerful way of seeing our data and understanding the patterns.

So let's issue a second query now. The next query is going to be similar to our first, but we're going to add that we only want to see species that have been discovered by NHM in London. So we can see we're adding that here. And when we issue this query, again we're gonna get a set of paths back, and we can view it both in text form in our console and, now, in a new graph. What you can see here is that all of the paths contained in our new graph query are ones that contain a species node, the Fiji country node and the NHM institution node.
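
Under the same assumed labels, adding that constraint might look something like this, filtering the specimens' discovering institution to NHM before returning the paths.

    %%gremlin
    g.V().hasLabel('country').has('name', 'Fiji').
      inE('ORIGIN_COUNTRY').outV().
      outE('DISCOVERING_INSTITUTION').inV().
        has('name', 'Natural History Museum, London').
      path().
      limit(25)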

So these are all species that have been collected by NHM. And as our imagined scientist, we're going to use this to spur conversations with our colleagues, ask questions, and hopefully understand what people have done in Fiji in the past so we can better prepare ourselves for our own expedition.

So querying our graph database has a lot of utility. But NHM wanted to take their graph database even further and build a machine learning solution called a graph neural network on top of it.

If you're not familiar with graph neural networks, the important thing to know is that as these large, highly connected, graph-style data sets grow, for instance once they get to the size of 2.5 billion records, it gets harder and harder to extract valuable information from them using queries based solely on human intuition.

In the case of the PKB, this means that exercises such as identifying new species, finding missing links or understanding dynamic processes such as species migration are made much more efficient with algorithmic assistance, such as that provided by machine learning models. We orchestrated the training and deployment of the PKB's graph neural network through a Neptune graph notebook and the Neptune ML feature.

Neptune ML is a feature powered by Amazon SageMaker and the open source Deep Graph Library, and it's intended to make it easy to build machine learning solutions on top of Neptune graph databases. But before we can actually get to training our machine learning model, we first need to export the data from our graph database into a separate S3 bucket so that it can be utilized in a model training workflow.

Fortunately, this is a really simple process and can be accomplished with a single line of code using one of our handy Neptune magic commands specific to export jobs.
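
In a Neptune graph notebook that export step might look roughly like the two cells below, following the pattern used in the Neptune ML sample notebooks. The export service host, S3 path and parameter values are placeholders and assumptions, so check them against the current Neptune ML documentation.

    # First notebook cell: define where the export should go (placeholder values
    # throughout); the second cell hands this configuration to the export magic.
    export_service_host = "your-neptune-export-service.example.com"
    export_params = {
        "command": "export-pg",                        # export the property graph
        "outputS3Path": "s3://example-pkb-ml/export",  # placeholder staging location
        "params": {"endpoint": "your-neptune-cluster-endpoint"},
    }

    %%neptune_ml export start --export-url {export_service_host} --export-iam --wait --store-to export_results
    ${export_params}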

Next, we can actually get to building our graph neural network. Training and deploying a graph neural network with Neptune ML involves three distinct stages, which can be accomplished end to end in a Neptune graph notebook. In each of these stages, what we're going to do is set a few key parameters that define the properties of the job and then pass those parameters into a single Neptune magic command per stage.

So what are those stages?

Stage one: we need to set up a data processing job that's going to take that exported data sitting in S3 and perform a series of mappings, transformations and train-test splitting, such that the data can be read by the Deep Graph Library. Some of the parameters that we need to set here include the input and output S3 locations of our formatted data and the compute instance types we want this job to run on. Then we can pass all of this information to our single Neptune magic command, and that takes care of the rest.
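
A sketch of that data processing call, with placeholder job IDs and S3 locations, and parameter flags modelled on the Neptune ML sample notebooks rather than taken from the PKB project, might look like this:

    # Placeholder IDs and S3 paths; verify flag names against the Neptune ML docs.
    processing_params = """
    --config-file-name training-data-configuration.json
    --job-id pkb-dataprocessing-001
    --s3-input-uri s3://example-pkb-ml/export
    --s3-processed-uri s3://example-pkb-ml/processed
    """

    %neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}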

Stage two is when we actually get to training our graph neural network. This is also as simple as a single Neptune magic call, but behind the scenes what's happening is we're spinning up a SageMaker processing job that generates a model training strategy and launches a hyperparameter tuning job.

Some of the parameters that we need to set in this stage include the hyperparameters that we want to explore and the number of unique combinations of hyperparameters that we want to run; each of those unique combinations is going to represent its own training job. Then we pass these parameters into our Neptune magic command for training our machine learning model.

Neptune ML takes care of the rest, including the identification of the best training job, which we can then deploy. This process assumes that we're going to use one of the built-in Deep Graph Library models, but it's also possible to bring and train your own custom model using the same Neptune ML infrastructure and Neptune magic commands. And this is exactly what NHM's artificial intelligence team is doing: training their own custom model using the Neptune ML infrastructure.
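
The training stage follows the same pattern; in the sketch below, the job IDs, instance type, output location and HPO counts are placeholders, and the flag names are modelled on the Neptune ML sample notebooks.

    # Placeholder values; --max-hpo-number is the number of hyperparameter
    # combinations to try, each of which becomes its own training job.
    training_params = """
    --job-id pkb-training-001
    --data-processing-id pkb-dataprocessing-001
    --instance-type ml.p3.2xlarge
    --s3-output-uri s3://example-pkb-ml/training
    --max-hpo-number 8
    --max-hpo-parallel 2
    """

    %neptune_ml training start --wait --store-to training_results {training_params}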

So after we've trained our graph neural network, we can deploy it such that it can be used for inference. Again, this is as simple as a single Neptune magic command, and we can then deploy our model to a SageMaker endpoint. But an important thing to remember here is that by using this process we're creating what's called a persistent endpoint, meaning that our endpoint and our model are going to be up, running and available for inference, but they're also going to be incurring charges until we actually go in and delete that endpoint.
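
That deployment step, again sketched with placeholder IDs modelled on the Neptune ML sample notebooks, looks like this; remember to delete the endpoint when you no longer need it.

    # Placeholder IDs; this creates a persistent SageMaker endpoint that incurs
    # charges until it is deleted.
    endpoint_params = """
    --id pkb-endpoint-001
    --model-training-job-id pkb-training-001
    """

    %neptune_ml endpoint create --wait --store-to endpoint_results {endpoint_params}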

So if you do follow this process, just keep that in mind. And there we have it: in just a few steps, we've trained and deployed a graph neural network built from our graph database data. Now that our model has been deployed, end users can use that model to make predictions. Inference is also made easy with Neptune ML, because the deployed graph models can be queried using something called Gremlin inference queries, which automate the process of preparing inference data, supplying that inference data to the model endpoint and returning the results.

I'm going to show you an example of one of these queries in an upcoming slide. But before I get there, I want to first discuss the three types of inference that Neptune ML supports.

The first type is classification. So classification is where we're trying to predict a categorical variable of one of our nodes or edges of our graph. So for instance, maybe we have a specimen node that has a missing origin country. So we want to use our machine learning model to predict that missing origin country of our specimen node.

The second type is regression, very similar in nature to classification except that instead of trying to predict a categorical variable, we're trying to predict a numerical variable. So imagine that same specimen sample, but instead of a missing origin country, it's missing its discovery latitude. We can use the other information that we have about that specimen node, and maybe even its connections or relationships with a country node, to predict the missing discovery latitude.

The third type of inference that Neptune ML supports is link prediction. This is where we're trying to predict a destination node given a source node and an outgoing edge label, a.k.a. a relationship type. An example of this type of inference would be trying to predict the discovering institution of a specimen node that otherwise doesn't have a discovering institution relationship with any institution node.

So let's take a look at one of these queries; I chose the example of classification. This query could be run from one of your Neptune graph notebooks, and if you're familiar with Gremlin queries, or you remember back to our demo when I issued a couple of standard Gremlin queries, you'll notice that Gremlin inference queries and the standard Gremlin queries used for querying the database are really similar.

What we're doing in this query is first identifying our model endpoint, indicating which model we actually want to use to make predictions. Then we're issuing a query in which we say that we want this to be a classification inference query, we identify a specific specimen node, and for that specimen node we ask it to return the top three most probable results predicting the country code, or origin country, of that specimen node.
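
A sketch of such a node classification inference query is shown below; the endpoint ID, vertex ID and property name are placeholders, while the Neptune#ml predicates are the documented way Gremlin inference queries reference the model endpoint, the classification task and the top-k limit.

    %%gremlin
    g.with("Neptune#ml.endpoint", "pkb-endpoint-001").
      V('specimen-001').
      properties('countryCode').
        with("Neptune#ml.classification").
        with("Neptune#ml.limit", 3).
      value()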

So if we issue this query, what's going to happen is that Neptune ML is going to package the request, issue it to our model endpoint and return our results, which are going to be the top three most probable values for what we're trying to predict.

OK. So in the last few minutes of this talk, I just wanna talk a little bit about who we think is going to be using the knowledge base and what its real scientific value is. And I think, broadly speaking, there are four key use cases, if you like, for us here.

So the first is purely as a discovery tool. The siloed, fragmented nature of our data at the moment means that just the act of bringing that data together, and being able to run faceted searches across it, really creates a fantastic tool for bridging the physical divide between where our specimens and our data are, where our problems are, and the scientists that are using them.

Secondly, the knowledge graph and the knowledge base become an amazing tool for improving our records: for identifying outliers and inconsistencies in our data that we can then fix. Traditionally, the way that we work is that there are lots of scientists manually curating that data. Increasingly, those scientists are actually curating the models and rule sets that look after the data for us. That's a big change in how we work, but it's going to make us much more efficient.

The next big thing is that the knowledge base helps us quantify change. We can start to see patterns in our data and see correlates of those patterns to try and explain them. And that's a huge step forward for us.

And lastly, by linking these data sets together, we can start to make predictions about what's going to happen, say for example, to biodiversity of particular groups of species or the extinction risk of certain species under various scenarios of environmental change.

So I'm just gonna give you one quick example of some early results from the knowledge base. And this is an example looking at the way that we, within our community, describe new species.

So typically we describe new species at the rate of something like 18,000 to 20,000 a year; that's roughly, on average, what gets described. But there's a problem there, because if we carry on at that rate, it's gonna take a further 300 years to describe what we think are the 6 million species that are still out there awaiting description.

Now, the good news is that many of those new species are not necessarily found by going out into hyper-diverse countries and doing fieldwork. They're actually just sitting in our collections right now, simply unrecognized as being new. And the knowledge base, because it's so good at identifying those outliers, potentially gives us the means to identify putative new species that are already sitting in our collection.

And what's really exciting to me is that because we're busy digitizing all the scientific literature that goes along with those descriptions of species too, we can start to use generative AI, trained on that corpus of scientific literature that we already have, to even describe those new species for us.

So this has the potential to massively accelerate the way that we describe biodiversity. Ultimately, then, this knowledge base and the knowledge graph that we're building are kind of like a model of the natural world, one which is really gonna revolutionize the impact of our collections and unlock the value that's trapped in all those physical specimens at the moment.

In effect, we're building a kind of global early warning system where we can start to see things like where the next pandemic might be coming from, by looking at the interface between humans and the vectors of disease and where those are likely to become zoonotic diseases infecting humans.

We can start to build conservation indicators telling us which species are most at risk of extinction, or we can start to define better policies to minimize biodiversity loss.

Well, I'm gonna leave you today with a simple question: what can open data do for you? I encourage everyone over the next few weeks to take a few minutes of your time and explore the open data sets that the Amazon Sustainability Data Initiative has to offer. You can scan the QR code on the screen right now and it'll take you right to the home page.

Hopefully that time will be used wisely, giving you something you can use right away, or at least inspiring your next workload. I also encourage everyone to talk with your account teams and your solutions architects about how you could integrate open data sets into your current or future workloads.

And finally, if you're interested in learning more about how other customers are using open data sets, or about new open data sets themselves, there are a lot of blogs out there describing the data sets and the workloads built on top of the Amazon Sustainability Data Initiative, so I highly encourage you to check out those blogs.

So with that, we want to thank you for your time and attention today.
