AWS open source strategy and contributions for PostgreSQL

All right, so let's jump right in with a quick introduction. I'm Jonathan Katz. In addition to being a product manager at AWS, I'm an active contributor to the Postgres project; I'm on the core team today.

We'll talk about a topic that's very near and dear to me, and that's contributing to Postgres - contributing to a very well-known open source database.

So, by a quick show of hands, how many people are using Postgres today? Cool, makes a lot of sense. What I want to talk about today is how we at AWS are contributing to Postgres, but also what contributing to Postgres looks like in general - and by extension, how you contribute to an open source project that has been around for a while.

And we'll look at some recent contributions, how they got there, and their overall impact throughout different aspects of the Postgres community.

So to get started: one reason it's important for AWS to contribute to open source is that we want to do it in a sustainable way, in order to help projects continue to grow and sustain themselves regardless of who's contributing to them. And that's important.

Think about projects like Postgres and Linux and Python that have been around a long time, the communities they've built around them, and the continued support of those communities to make sure they keep prospering and delivering awesome features and great user experiences.

And this gets into this number right here: 37. What's so special about this number? And it's like, "Jonathan, we don't care how old you are." The reason is that this is also the age of Postgres. Postgres has been around a long time - it started at the University of California, Berkeley, in 1986, and has sustained contributions to this day, which means it has a rich history.

So if you want to contribute to a project that's been around a long time, you need to understand a bit of its history and how everything works before you just dive in and start contributing.

Meanwhile, you might have also noticed that AWS has been contributing a lot more to Postgres over the past few years. Take the past three Postgres releases: Postgres 16 was released in September, and AWS helped contribute to around 19% of the features in a variety of ways - it could have been authoring them, helping to review them, or playing a significant part in getting the feature committed.

And the big question is: why is AWS all of a sudden interested in helping Postgres? Well, there's a variety of reasons. The first is that it's important to AWS itself. I mean, how many of you in here today are using Aurora and RDS to run Postgres? Yeah, quite a few. So it's important that we help Postgres continue to add new features and grow, because this ultimately helps you when you're running Postgres on Aurora and RDS. It's also important to Amazon itself - Amazon uses Postgres to run many of its operations. So making sure that we're making sustainable contributions to Postgres just helps in general.

It also helps us understand what you're doing and building with Postgres. One of my favorite things to do is just talk to you and understand: what are you doing, what are you building, what can we do to make Postgres better? And to me, that's very exciting.

A lot of what we're gonna talk about today comes directly from conversations I've had with folks over the past several years, and from helping to relay those back to the community so it understands how important they are. That also lets us do something like look around corners - be able to understand what's coming.

I think one of the biggest waves I've seen in the past year is generative AI - I mean, you can't walk anywhere here without talking about generative AI. We've even seen this come to Postgres in a variety of ways, whether it's storing vector embeddings or trying to write automatically generated SQL queries aside from the ones that ORMs currently generate. It's important to try to understand what's coming so we can help position the Postgres project to support it.

So - I started off saying there's a rich history to Postgres - how do you contribute to an established community? Now, I don't want to scare folks off here. You certainly can come in and make small to medium-sized contributions without necessarily understanding the whole ethos of the Postgres community. But if you want to make a meaningful, sustained contribution, it helps to understand how to do that the right way.

The first thing is that you need to work with existing community processes. You might have your own engineering or contribution processes where you are, but if you come into a community that has already established its processes and say, "No, you need to do it this way, you need to use this tool, you need to have this whole PR system set up," you might not make many friends very quickly - because you need to earn trust.

So first, when you come in, learn the rules: understand how a patch review works, how a Commitfest works, whatever the cycle may be. By doing that, and by starting to establish a presence in the community, you earn trust. People realize, "OK, they're in it for the long haul. This isn't just a drive-by contribution, and they're not trying to impose their will on the community - they're really in it to work with us." That lets you start thinking long term and bringing in better practices that might ultimately evolve the community and help change some of the processes as need be. You establish the voice to influence things such as which priority features to work on, or ways to grow new contributors and expand the community.

This process can work with any open source community as you try to establish and grow a contribution process.

So that leads us into what we're talking about today, specifically around Postgres. First, we'll do a brief overview of how you can contribute to Postgres, because there are a lot of ways that go beyond just contributing to the database. We'll then talk about what AWS has been doing recently to help contribute to Postgres, as well as deep dive into some of our recent contributions. And finally, we'll talk about what's next: where do we go? How do we keep this database that's been around for nearly 40 years growing and evolving, and make sure you're very much enjoying running your database workloads on Postgres?

So, an overview of contributing to Postgres. How many of you recognize what this is? Yeah, this is Slonik, the Postgres logo. The reason I'm starting off with this is that even this logo elicits an emotion: you recognize it as being related to Postgres; you see an elephant, and an elephant never forgets, right? Kind of like your data. It also encapsulates the ethos of contributing to Postgres - and understanding that ethos helps you contribute successfully.

The first thing about the Postgres community ethos, as told by myself, is that you want to be equally fair to everyone. And that's kind of an interesting wording, right? What do you mean, "be equally fair to everyone"? Well, first off, the community is very intent on treating everyone's opinions equally. It doesn't matter if you've been contributing for 30 years or one year - we want to give weight to everything that folks are saying. Granted, a senior contributor to the core Postgres code base - certainly their words carry a lot of weight, but that doesn't mean it's necessarily the final answer for how things will be solved. Likewise, when organizing events, treat all sponsors equally, at the same level. Being equally fair to everyone is part of what's kept so many folks involved in Postgres contributing for decades. Which leads to the next one.

No one entity can control more than 50% of the community. This has been very much a part of Postgres for the past 25 years: the community doesn't want to see one organization become too influential. Even when we've seen consolidation of companies in the Postgres space in recent years, the community has done things to make sure that not too many key decision makers are all in one place. That balance is very important to help keep Postgres free and open source.

Next: strive for perfection. We know it's impossible to be perfect, but keep in mind Postgres is storing critical data. You don't wanna lose data, you don't want data to get corrupted, and Postgres has a hard-earned reputation of being a very reliable data store. So when things get committed to the project, you want them to be as close to perfect as possible. That's why some things have taken many years to be added to Postgres - logical replication, a feature we'll go into in greater depth in this presentation, is a very good example of that.

Next: you gotta be able to have backbone, and disagree and commit. Now, you might recognize this as an Amazon leadership principle, but it actually holds true in the Postgres community. How many folks have read the pgsql-hackers mailing list? The hackers list is where a lot of the discussion over the development of Postgres happens, and you definitely need to have backbone there. Folks are very respectful on it, but you're gonna see some very strong opinions from people who have very strong opinions. When it comes time to make a decision, though, folks will disagree and commit: they may not agree with the decision, but they'll support it and see it through.

Finally, the community works together towards a common goal. It doesn't matter what company you work for - everyone wants Postgres to be better. And we're gonna see examples throughout this talk where folks in different organizations have collaborated on features because they know it's gonna help the greater good, which is everyone who's using Postgres.

Quickly: how's the Postgres community organized? We could probably do a 50-minute session just on this, so I'll cover it briefly. There's a core team. The core team is not a central governing body; I'd say it's more of a dispute resolution group - if there's no way to solve something through regular community processes, the core team will help resolve it. You have committers: the folks who have privileges to commit code to the Postgres core engine. There are 31 of those folks in the world, which is actually relatively small when you consider the scope and size of the project, but it takes many years to earn a commit bit. Then you have all sorts of folks helping with all other aspects of the project: writing patches, reviewing patches, translating - which is a huge effort - as well as everything involved with advocacy: talking about Postgres, creating educational materials, et cetera.

So Postgres does have a well-defined development cycle, and we'll cover it real briefly. New development always begins in July, with something called the July Commitfest. Commitfest is a Postgres-specific term: it's when folks really focus on committing patches to the project, and what committers are really doing is reviewing and determining whether a patch is ready - but there's a lot of development going on during this period. There are a bunch of these up front: from July through January, you have a Commitfest every other month. In February there's a check-in - a developer meeting that's typically held at the FOSDEM open source conference in Brussels - and this is where triage and prioritization happen: what is actually going into the feature release? Then there's the final Commitfest, which is maybe the most stressful time of the year if you're a committer, because right after it you have feature freeze. Once you hit feature freeze, no new features go into Postgres for that upcoming major release. That is a hard and fast rule: you will almost never see additional functionality added after this; typically you see functionality removed. So that whole period is nine months - nine months of solid development, building out new features in Postgres. That leads to May, which is one of my favorite months, because we typically have an update release for the supported versions, and we have beta 1, which is where you see all the new functionality going into the next release of Postgres, plus a developer meeting to help plan the release after that. At this point in the development cycle, you're starting to work on multiple new things at once: the community is focused on stabilizing the betas while also starting to develop new things - because remember, new development begins in July, so the next Commitfest starts while you're stabilizing the upcoming release.

And finally, in September, it all comes to fruition. You have the GA, there's a new version of Postgres available as you're building out the next version after that. All this takes 14 months. But what's nice is that there's a predictable cycle of when features get released.

And remember, in the Postgres community we strive for perfection - so when we GA a release, we're feeling very confident about the features we're putting in your hands.

Now, that's just one aspect of contributing to Postgres - the core database engine. There's a whole world of other ways you can contribute; contributing to Postgres is not just contributing to the database itself.

There are extensions, which add functionality to Postgres without actually having to fork Postgres or create a brand new copy of it. You have drivers - this is likely your primary interface into Postgres, using something like the JDBC driver or Psycopg.

And there are the governance aspects of the project. Extensions are a great way to contribute to Postgres. There are thousands of them in the world - honestly, I don't even know how many Postgres extensions there are, and new ones pop up every day.

Just in RDS alone, there are over 90 extensions, and there are ways to add more extensions to RDS through Trusted Language Extensions, which we'll touch on briefly in a little bit. But you can see that this is just a small subset of all the extensions available for Postgres.

So, if you want to find extensions: first, in Postgres itself, there are already extensions available. These are optional modules that you can enable on your instance. And pretty much everyone in this room - everyone who's running RDS in this room - is using pg_stat_statements, because it's the primary way you gather statistics about what queries are being run within your database.
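As a quick sketch of what that looks like in practice - column names follow the pg_stat_statements documentation (`total_exec_time` is the name since PostgreSQL 13; older versions call it `total_time`):

```sql
-- pg_stat_statements must be listed in shared_preload_libraries first
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 5 statements by cumulative execution time
SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

On RDS, `shared_preload_libraries` is managed through the instance's parameter group rather than postgresql.conf.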

There are all sorts of other useful extensions within Postgres as well, based on what data type support you need or whether you're connecting to other databases - and there are extensions for utilities.

All of these are various ways to manage different aspects of Postgres: pg_cron, which lets you schedule jobs directly in Postgres; pg_partman, which handles partition management; pg_hint_plan, which we'll talk a bit about later and which is great if you're migrating from databases that support hints; and pg_audit, which does - well - audit tracking of your logs.
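To make the pg_cron point concrete, here's a minimal sketch using its `cron.schedule`/`cron.unschedule` API - the job name and schedule are just examples:

```sql
CREATE EXTENSION pg_cron;

-- Run a nightly VACUUM at 3:00 AM, using standard cron syntax
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');

-- Inspect and remove scheduled jobs
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-vacuum');
```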

There are extensions that let you connect to remote data sources. The MSSQL and Oracle FDWs are ways you can query those databases directly from Postgres.

tds_fdw works with SQL Server and Sybase, and log_fdw is a way for you to query your Postgres logs directly from a SQL interface.
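The flow is the same as with any foreign data wrapper: define a server, a user mapping, and foreign tables. A hedged sketch with tds_fdw - the hostname, credentials, and table names here are made up, and option names follow the tds_fdw README:

```sql
CREATE EXTENSION tds_fdw;

-- Hypothetical SQL Server instance
CREATE SERVER mssql_svr
  FOREIGN DATA WRAPPER tds_fdw
  OPTIONS (servername 'mssql.example.com', port '1433', database 'sales');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER mssql_svr
  OPTIONS (username 'report_user', password 'secret');

CREATE FOREIGN TABLE orders_remote (order_id int, total numeric)
  SERVER mssql_svr
  OPTIONS (table_name 'dbo.orders');

-- Now query SQL Server data with ordinary Postgres SQL
SELECT count(*) FROM orders_remote;
```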

There are replication extensions: different ways of copying data between Postgres data sources, or of reading changes streaming in real time.

We recently announced pg_active, an extension that enables active-active replication directly from Postgres. And the other extensions here help you administer different ways of logically replicating data out of Postgres to downstream sources.

There are extensions to build more extensions, particularly procedural languages. Take PL/Rust: did you know you can write Rust functions and execute them directly from Postgres? It's pretty cool. And finally, search extensions - in some ways, these add new data types and ways of searching over them.
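As a small taste of PL/Rust - in my reading of its docs, arguments arrive as `Option<T>` and the function body returns a `Result`, so a sketch looks roughly like this:

```sql
CREATE EXTENSION plrust;

CREATE FUNCTION strlen(name TEXT) RETURNS int
LANGUAGE plrust AS
$$
    // Rust body: unwrap the nullable argument and return its length
    Ok(Some(name.unwrap().len() as i32))
$$;

SELECT strlen('postgres');
```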

PostGIS is one very popular one: it turns Postgres into a geospatial database and is incredibly popular in the geospatial world. pgvector, which we'll talk a bit more about later today, has become incredibly popular due to the rise of folks wanting to query vector data directly from their Postgres database.
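A minimal pgvector sketch - the extension is named `vector`, and `<->` is its Euclidean-distance operator:

```sql
CREATE EXTENSION vector;

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,1,1]'), ('[2,2,2]');

-- Exact nearest-neighbor search by Euclidean distance
SELECT id FROM items ORDER BY embedding <-> '[1,1,2]' LIMIT 1;

-- Optional approximate index for larger tables (pgvector 0.5+)
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
```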

So then there are drivers. Drivers are key - in some ways, they're the unsung heroes of the Postgres community, because they allow you to work in your native programming language, connect to Postgres, and manipulate your data. And there are a lot of them.

Let's start with Java: you have the JDBC driver as well as the R2DBC driver. Take JavaScript: there are multiple drivers available to connect to Postgres. Python - I found four that let you do exactly that. Rust, there are some; Go, there are some; Ruby, there are some; C++, et cetera, et cetera, et cetera.

I'll tell you, it was a lot of fun trying to animate this slide to get all the drivers in there. And this doesn't even include ORMs, which are an abstraction layer on top of drivers. If you want the full list of drivers available for Postgres, there's a wiki page for that, which constantly gets updated. Drivers are a great way to help contribute to Postgres, because any functionality you add to a driver helps developers work with Postgres more easily.

And then there's governance. There's a variety of ways to help grow Postgres without necessarily writing code. There are NPOs that help promote events and various educational activities around Postgres, as well as user groups and other committees that help support the general structure of the project - whether it's the security committee or one of the many others. Now, for a lot of the committees, it does take some time to earn the trust to be a part of them.

But there are ways to get active in the project to help earn that trust, including volunteering for the NPOs and different user groups.

So with that said, let's start diving into contributions and what AWS has been doing over the past several years.

So as I mentioned, AWS contributed around 19% of the features to Postgres 16 and was the second largest contributor. It's something we take great pride in, because, as I mentioned up top, Postgres is important to us in many different ways - as important as it is as an open source project.

So the first question is: well, OK, how did you do that? How did you build up all these contributions?

Well, first, it was just general involvement in the project. AWS was recognized as a major sponsor of the project, and this kind of recognition from the Postgres project is earned - you can't buy your way into it. Every year, the community reviews a list of sponsors - folks contributing to the project and how their employers help align them to contribute. And if they see those earned contributions - patch reviews, writing code, being involved in the various governance aspects - you can get that recognition.

This also extends to working on Postgres extensions - no pun intended - because Postgres extensions help drive more folks to being able to use Postgres.

So we do have a dedicated effort contributing to Postgres. We have a team of contributors whose entire focus is contributing to Postgres, whether it's the core database engine, extensions, or the keep-the-lights-on work of the community. And it does help to have folks solely focused on it, particularly for an open source project that's been around a long time.

There are a lot of moving parts to building a database, and it does require full-time focus if you want to make sure that database succeeds. Now, I do want to note that if you go back 20 years, you had very few folks - maybe only one person - who could focus on Postgres full time. It was a hobby project, and that's been a big shift over the past 20 years.

Postgres has gone from being a hobbyist project to being used for mission-critical workloads. And the good news is that there are folks treating it as such and focusing on it full time, to make sure we can offer the best possible support for it.

There are all sorts of different ways to contribute, as mentioned, and as AWS we focus on all these different areas beyond just code contributions. I do want to call out systems - systems administration. Postgres manages pretty much all of its own infrastructure, and it's in a whole variety of different places.

And the reason things work - the reason you're able to, say, download Postgres from the Postgres website - is because you have folks focused on systems administration, and that's important. You wanna make sure all the resources needed to develop the project are up and running.

Now, there are certainly different ways this may evolve through the years. But again, this is one of those parts of the Postgres ethos that helps keep it free and open source.

So, if you want to bucket where some of our direct contributions to the Postgres project have come from, I like to think about it in terms of performance, operability, and security.

Performance, I think we all understand: how quickly are you able to get a query returned? How quickly can you index something?

Operability is how you administrate your data - moving things around, doing a lot of your maintenance operations. And security is making sure data stays safe and secure.

And one specific contribution I want to call out is the last one - security - preventing a CVE before it happened. During the beta period, one of our engineers noticed an issue with a new feature that would certainly have become a CVE: there was a permission check that wasn't being performed correctly, and someone could have escalated privileges. Catching that before it became a CVE I consider to be huge, because, firstly, I actually have to deal with CVEs as part of my role in the Postgres community, and it takes work.

So anytime you prevent a security issue from occurring, it's certainly a good thing. But I also want to emphasize the importance of beta testing. You saw that there's a beta period available from May to September; anything you can do to test Postgres against production-like workloads, or to test out new features, helps - because the community can harden the release before GA to make sure problems such as CVEs don't actually occur.

Now, in the specific case of security issues, I know it's impossible to prevent all of them. But that's why thorough testing certainly helps.

So let's deep dive a little bit. One feature group I really wanna deep dive on today is logical replication. How many folks are familiar with what logical replication is? OK.

So for those of you who aren't: logical replication is a way of streaming changes out of a database in a form that can be interpreted by other systems. What do I mean by that? There's another type of replication - physical replication - which basically replicates, block by block on disk, what changes are occurring, and then sends those downstream.

But not every system can understand those block-by-block changes. There's a lot you have to keep exact: the operating system, the file system, et cetera. With logical replication, any system programmed to understand the protocol can replay those changes.

So if Postgres is logically replicating something, the downstream system doesn't necessarily have to be a Postgres database. It could be a Python script that understands how those changes are being replicated. And from that Python script, you can do all sorts of things: extract-transform-load jobs, building real-time dashboards, and so on.
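In Postgres-to-Postgres form, this is the publication/subscription API; for a non-Postgres consumer, you'd read from a replication slot instead. A sketch - the connection string, slot, and table names are placeholders:

```sql
-- On the publishing database:
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On a subscribing Postgres database:
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=publisher.example.com dbname=app user=repl'
  PUBLICATION orders_pub;

-- Or, for that Python-script-style consumer: create a slot with a
-- decoding plugin and pull the changes as text
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);
```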

I actually used it as a way to create, effectively, a hacky incremental materialized view, where I was applying changes in real time to a table for which I had to do a complex aggregate query. Logical replication - well, logical decoding, its foundation - came out in Postgres 9.4, which is about 10 years ago now, and it's been evolving ever since. I'd say it's been a slow evolution.

First in Postgres itself, there's already extensions available. And these are, these are optional modules that you can enable on your, on your instance. And pretty much everyone in this room or everyone who's running a RDS in this room, you're using pg_stat_statements because this is the primary way you gather statistics about what queries are being run within your database.

There's all sorts of useful extensions as well just within Postgres based upon what data type support you need or if you're connecting to other databases or programming other things, there's extensions for utilities.

Um all of these are, you know, very, you know, are various ways to be able to manage different aspects of Postgres. It could be pg_cron which lets you schedule jobs directly in Postgres. Pg_partman that lets you uh handle partition management pg_hint_plan, which actually we'll talk a bit about later, which is great if you're migrating from databases that support hints and pg_audit, which does well audit tracking of your logs.

There's extensions that let you connect to remote data sources. Uh the MSSQL and Oracle FDWs are ways that you can query said databases directly from Postgres.

Uh TDS FDW works with uh SQL Server and Sybase. A Mock FDW is a way for you to query your Postgres logs directly from an SQL interface.

There's replication extensions, different ways of being able to copy data between different uh Postgres data sources or things that can uh read things streaming in real time.

Uh we recently announced uh pg_active. Uh it's an extension that enables active active replication directly from Postgres. And all these other extensions help help uh you administrate different ways of logically replicating data out of Postgres to downstream sources.

There's extensions to build more extensions uh particularly in procedural languages. Uh take PL/Rust, did you know you can write Rust functions and execute them directly from Postgres? It's pretty cool and finally search extensions and you know, in some ways know these extensions add new data types and ways of searching over them.

PostGIS is one very popular one, it turns Postgres into a geospatial database and is, you know, incredibly popular in the, the geospatial world pg_vector will talk a bit more about later today. It's become incredibly popular due to the rise of folks wanting to query vector data directly from their Postgres database.

So then there's drivers, drivers are, are a key drivers in some ways, are the unsung heroes of the Postgres community because they allow you to work with your native programming language and connect to Postgres and be able to, to manipulate them. And there's a lot of them.

Now, let's start with Java. You have the JDBC driver or as well as the R2DBC driver and take JavaScript. There's multiple drivers available in JavaScript to connect to Postgres python. I found for that allow you to do exactly that rust there, rust, there's some go, there's some ruby, there's some C++ et cetera, et cetera, et cetera, et cetera.

I'll tell you it was a lot of fun to try to animate this slide to uh get all the drivers in there. Oh, no. And this doesn't even include ORMs which are an abstraction layer on top of drivers. If you want to get the full list of drivers available for Post, there's, there's a wiki page for that which constantly gets updated and these are a way to help contribute to Post because any functionality you add to a driver helps enable developers to more easily work with Postgres.

And then there's governance. There's a variety of ways to help grow Postgres without necessarily writing code. There's MPOs that help promote events and various educational activities around Postgres as well as user groups and other committees that help to, you know, support the general uh structure of the project, whether it's the security committee or um uh the, there's a lot of them now, a lot of the committees, it does take some time to earn trust to be, to be a part of those committees.

But there are ways to uh to get active in the project to help earn that trust, including, you know, volunteering for the MPOs and different user groups.

So that said, let's start diving into, into contributions and what AWS has been doing over, you know, the past several years.

So as I mentioned, AWS uh contributed around 19% of the features to Postgres 16 and was the second largest contributor. And it's something that we take great pride in because it mentioned up top, it's, you know, Postgres you know, as important as it is an open source project, it is important to us, you know, in many different ways.

So the first question is, well, ok, you know, how did you do that? How did you build up all these contributions?

Well, first, it was just general involvement and being involved in the project. AWS was recognized as a major sponsor of the project. This kind of recognition from the Postgres project is earned. You can't buy your way into it that every year, uh the community reviews a list of sponsors or a list of folks contributing to the project and how their employers are you know, help align them to contribute. And if they see those earned contributions during the patch reviews, writing, writing code and being involved in the various governance aspects, you can get that recognition.

This also extends to working on Postgres extensions, no pun intended, because Postgres extensions do help drive more folks toward being able to use Postgres.

So we do have a dedicated effort contributing to Postgres. We have a team of contributors whose sole focus is contributing to Postgres, whether it's the core database engine, extensions, or the keep-the-lights-on work of the community. And it does help to have folks solely focused on it, particularly for an open source project that's been around a long time.

There are a lot of moving parts to building a database, and it does require full-time focus if you want to make sure that database succeeds. Now, I do want to note that if you go back 20 years, you had very few folks, or maybe only one person, who could focus on Postgres full time. It was a hobby project, and that's been a big shift over the past 20 years.

Postgres has gone from being a hobbyist project to being used for mission-critical workloads. And the good news is that there are folks who are treating it as such and focused on it full time, to make sure we can offer the best possible support for it.

There are all sorts of different ways to contribute, as mentioned, and as AWS, we focus on all these different areas beyond just code contributions. I do want to call out systems administration: Postgres manages pretty much all of its own infrastructure, and it's in a whole variety of different places.

And the reason things work, the reason you're able to, say, download Postgres from the Postgres website, is that you have folks who are focused on systems administration, and that's important. You want to make sure all the resources needed to develop the project are up and running.

There are certainly different ways this may evolve through the years. But again, this is one of those parts of the Postgres ethos that helps keep it free and open source.

So, if you want to bucket where some of our recent direct contributions to Postgres have come from, I like to think about it in terms of performance, operability, and security.

Performance, I think we all understand: how quickly are you able to get a query returned? How quickly can you index something?

Operability is how you administer your data: being able to move things around or do a lot of your maintenance operations. And security is making sure data stays safe and secure.

And one specific contribution I want to call out is the last one, security: preventing a CVE before it happened. During the beta period, one of our engineers noticed an issue with a new feature that would certainly have become a CVE: there was a permission check that wasn't being performed correctly, and someone could have escalated privileges. Catching that before it became a CVE I consider to be huge, because, firstly, I actually have to deal with CVEs as part of my role in the Postgres community, and it takes work.

So anytime you prevent a security issue from occurring, that's certainly a good thing. But I also want to emphasize the importance of beta testing. You saw that there's a beta period available from May to September; anything you can do to test Postgres against production-like workloads or try out new features helps, because the community can harden the release before GA to make sure problems such as CVEs don't actually occur.

Now, in the specific case of security issues, I know it's impossible to prevent all of them. But that's why thorough testing certainly helps.

So let's deep dive a little bit. One feature group I really want to deep dive on today is logical replication. How many folks are familiar with what logical replication is? OK.

So for those of you who aren't: logical replication is a way of streaming changes out of a database in a way that can be interpreted by other systems. What I mean by that is that there's another type of replication, physical replication, which is basically replicating, block by block on disk, what changes are occurring and then sending those downstream.

But not every system can understand those block-by-block changes. There are a lot of things you need to keep exact: you have to keep your operating system exact, you have to keep your file system exact, et cetera. With logical replication, other systems that are programmed to understand that protocol can redo those changes.

So when Postgres is logically replicating something, the downstream system doesn't necessarily need to be a Postgres database. It could be a Python script that understands how those changes are being replicated. And from that Python script, you can do all sorts of things: you can do extract-transform-load jobs, or you can build some real-time dashboards from that data.
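As an illustration of the kind of Python consumer I'm describing, here's a minimal sketch that parses the textual change format emitted by Postgres's `test_decoding` output plugin. The line format is simplified, and the table and column names are made up; a real consumer would read these lines from a replication connection rather than a string.

```python
import re

# Simplified test_decoding-style line, e.g.:
#   table public.users: INSERT: id[integer]:1 name[text]:'alice'
LINE_RE = re.compile(r"^table (?P<table>\S+): (?P<op>INSERT|UPDATE|DELETE): (?P<cols>.*)$")
COL_RE = re.compile(r"(\w+)\[[^\]]+\]:('[^']*'|\S+)")

def parse_change(line):
    """Parse one change line into (table, operation, {column: value}), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # BEGIN/COMMIT markers and anything else we don't handle
    cols = {name: value.strip("'") for name, value in COL_RE.findall(m.group("cols"))}
    return m.group("table"), m.group("op"), cols

change = parse_change("table public.users: INSERT: id[integer]:1 name[text]:'alice'")
print(change)  # ('public.users', 'INSERT', {'id': '1', 'name': 'alice'})
```

From a parsed change like this, an ETL job or dashboard updater would dispatch on the operation and do whatever it needs with the column values.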

I actually used it as a way to create, effectively, a hacky incremental materialized view, where I was updating changes in real time in a table for which I had to run some complex aggregate query. Now, logical decoding, the mechanism underlying logical replication, came out in Postgres 9.4, which is about 10 years ago now, and it's been evolving ever since. I'd say it's been a slow evolution.

First, there was enabling the mechanism to perform logical replication, and now there are all sorts of features around it that help you decide what you're logically replicating, where you're logically replicating to, and that continue to improve its performance. The reason is that it's in the critical path now: you might have multiple databases pushing changes into a hub that then get pulled into a data warehouse, with all sorts of different points along the way. Talking to our customers, we've heard a lot of demand for improving things around logical replication, because it is an important part of real-time analytics systems, or of pushing data across different regions.

So if you want to think about what features are needed to support logical replication beyond the base support for it, the first is performance. Performance is what I've been hearing about the most lately: how do we make things faster with logical replication, particularly as we've seen it used more and more over the past several releases?

There have been different features to help support this, for example, streaming changes from large transactions from your publisher to your subscriber. Previously, it was all or nothing: if you had a long-running transaction, let's say it's taking a minute to complete, all those changes would be held up before being sent to your subscriber. I believe it was Postgres 14 that added the ability to stream it. So as the transaction is running on your publisher, the changes are being sent to the subscriber. And though they aren't being applied on the subscriber yet, they are all being sent over, so that when the commit occurs, you can immediately begin replaying the transaction on the subscriber without having to send all the data over first. It saves time and actually reduces the burden on your primary instance. So performance is important.
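As a sketch of how this streaming behavior is enabled (the subscription, publication, and connection details here are hypothetical), it's an option on the subscription:

```sql
-- Stream large in-progress transactions to the subscriber as they run,
-- instead of waiting for the commit (available since Postgres 14).
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=app'
    PUBLICATION orders_pub
    WITH (streaming = on);
```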

So is flexibility. When you have physical replication, typically that's an all-or-nothing system: you're using it to keep an effectively exact copy of your primary database. With logical replication, you can choose to replicate a subset of changes. For example, let's say you have a few tables that you want to use in a real-time report downstream somewhere. You don't want to replicate all the changes in your database, only the changes from those tables. That's the flexibility aspect of logical replication. And finally, safety: these are changes to your data, this is your data. You want to make sure you're able to protect it, and make sure that when you're replicating things to a downstream system, or even within your publisher itself, you're not introducing anything that could potentially allow an attacker to escalate privileges or leak information that you don't want leaked.
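The table-level flexibility I just mentioned can be sketched like this (table and filter names are made up for illustration): a publication can cover just the tables, and since Postgres 15 even just the rows, that you care about.

```sql
-- Publish only the tables needed for the downstream report.
CREATE PUBLICATION report_pub FOR TABLE orders, customers;

-- Or, since Postgres 15, publish only a subset of rows from a table.
CREATE PUBLICATION us_orders_pub FOR TABLE orders WHERE (region = 'us-east');
```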

So, Postgres 16. Here's a selection of changes that were committed to Postgres 16, in no particular order: logical replication from standbys, which we're going to dive deep into, and parallel apply of large transactions, which we'll also dive into. I do want to touch on the bottom two, since we're not going to dive into them in the next couple of slides. Apply as table owner: this goes into the bucket of safety features, or security features. Just to back up a second: when you apply a change, that's basically saving it on the local instance, and the subscriber is the instance receiving that change. So when you apply that change, you're basically saying, OK, the subscriber has now saved that change. Before Postgres 16, those changes were applied as the subscription owner.

Now, the subscription owner is often your database superuser, and your database superuser can do anything. Part of the problem was exactly that: your database superuser can do anything, and there are things that can occur as part of applying a change, for example, executing a trigger. A trigger is basically a piece of user-provided code, and when you have user-provided code, the user can put things in that code that might do something like, hey, alter privileges, grant superuser to myself. Now, by default, Postgres has a lot of safeguards to make sure users aren't doing that, and in fact, you should be checking your user-provided code regardless. But during the apply process, if you're applying these changes as a superuser, then that code will execute with superuser privileges, and suddenly you might be granting an unprivileged user superuser privileges. So that had to change, and that's what apply as table owner does. Your table owner is likely a lesser-privileged user, based upon your design, and when changes are applied as the table owner, they're executed with those lower-level privileges, helping to prevent escalations like that.
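In Postgres 16 this behavior is controlled by the subscription's `run_as_owner` option; the subscription, publication, and connection names here are hypothetical:

```sql
-- Postgres 16 default: apply changes as the owner of each target table,
-- so trigger code runs with the table owner's (lower) privileges.
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app'
    PUBLICATION app_pub
    WITH (run_as_owner = false);
```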

Finally, use btree index on apply. Typically, when you do logical replication, you want to do this on primary keys; that can be a very fast lookup, or it can be a lookup on a unique index. There might be cases where that's not possible, and when it's not, Postgres basically has to do a sequential scan through the table to determine how to apply that change, which can be a very costly process on a bigger table. Being able to use a btree index on the apply is basically a shortcut: if you don't have a primary key, Postgres can look things up in the btree and apply that change much more quickly.
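A sketch of that scenario, with made-up table and column names: on a Postgres 16 subscriber, this kicks in for tables that have no primary key and use `REPLICA IDENTITY FULL`.

```sql
-- On the publisher: a table without a primary key needs REPLICA IDENTITY FULL
-- so that UPDATEs and DELETEs can be replicated at all.
ALTER TABLE events REPLICA IDENTITY FULL;

-- On a Postgres 16 subscriber: a plain btree index on a replicated column
-- lets the apply worker look rows up instead of sequentially scanning.
CREATE INDEX events_event_id_idx ON events (event_id);
```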

Now, I do want to dive into this a bit. First, this slide: my two-year-old called it the butterfly, and I like that name, because it really shows how folks understand the community ethos here of a rising tide lifting all ships. We're all going to work together to help make Postgres better, because all the companies on this slide helped each other in different ways on these different features. For example, an engineer from EDB proposed the apply-as-table-owner patch, and folks at these other organizations helped make sure the design made sense as we put it into Postgres 16. Having everyone work together, particularly on a very complex set of changes, allows Postgres to keep getting better, and that is a core part of the community ethos. I want to deep dive on replication from standbys, because this was actually one of the most requested features I've heard about, even since I joined AWS.

So just real quick: typically, when you set up replication, in this case physical replication, you have a primary and you copy changes to a standby. In this case, we also have a logical replica, so we're replicating to two different places: the purple is physical replication, the yellow is logical replication. Typically, you might have users interacting with the logical replica; let's say it's some kind of real-time reporting system. Now, your primary tends to be your busiest system, the one accepting your reads and writes, so there's a lot of traffic going to it. We can already see here that on our primary we have users interacting, making changes, reading data from it, and we're sending all these changes to the standby and the replica. All of that takes work; the primary is doing a lot of work. And if you have a very busy system, that's going to take a lot of time.

The standbys tend to be a lot less busy, if you even have traffic on them at all. So there's an opportunity here to take some load off the primary and replicate changes from the standby, and as you can see, we can alleviate the load on the primary by doing exactly that. Now, here's what's very interesting about this feature: it started in 2016. This is one of those things where, yeah, of course, it's an obvious idea, or it might be an obvious idea, but it started in 2016 and kind of stalled out. A major effort began in 2018 and stalled out, and this continued on and off for years, because it actually was very challenging to design, and it really took a concerted effort to get it through. This is something we helped sponsor in a major way; the primary author came from AWS, but it was a team effort, requiring reviews from multiple organizations and a lot of people working on it. And actually, for the Postgres 16 release, it was literally committed hours before feature freeze, thanks to a core team member and committer from Microsoft. Notice that this took seven years to get in. I'm not saying every feature should take seven years; one, there are certainly a lot that don't take that long, and two, we always try to get better and faster, but we want to do it right. There are a lot of things that are challenging about this feature, likely enough for a whole talk in itself. But this shows the general effort the community put behind it to ultimately get it in.
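A hypothetical setup sketch of what this enables in Postgres 16 (slot, subscription, and host names are made up; it assumes `wal_level = logical` on the primary and `hot_standby_feedback = on` for the standby):

```sql
-- On the standby itself, you can now create a logical replication slot:
SELECT pg_create_logical_replication_slot('reporting_slot', 'pgoutput');

-- And a subscriber can point its connection string at the standby
-- instead of the primary, offloading the decoding work:
CREATE SUBSCRIPTION reporting_sub
    CONNECTION 'host=standby.example.com dbname=app'
    PUBLICATION reporting_pub;
```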

Another one we were active in was parallel apply of large transactions. So, real quick: in logical replication, there are two things that occur when you're sending information from, let's say, a primary to a logical replica. You actually have two processes set up, one on each side: a sender and a receiver. A change comes in, the sender encodes it into the logical format and sends it to the receiver, which then commits it to the logical replica. So it's pretty straightforward: you have a transaction, and it ultimately gets sent through a pipeline. The problem is that there are several places that can bottleneck. Let's say you have a large transaction and the receiver is taking time to apply it; it starts stalling transactions on the primary, which can start accumulating. This runs several risks, but the primary one is disk exhaustion, where you accumulate so many changes on your primary that you actually run out of disk, not because your database is full, but because you haven't been able to replay all the transactions on the logical replica.

So this is where parallel apply comes in: if you have very large transactions, which can take a lot more time to apply, the receiver can work on saving them to the logical replica in parallel, and that helps speed up the process. Now, this was a collaborative effort between us and Fujitsu. Fujitsu actually proposed this, based upon what they were seeing from their customers, and asked us to help with the design review, as one of our engineers is an expert in this area. We went back and forth; Fujitsu helped implement it, and we helped with the code review. There was all sorts of collaboration at each step. But this is another example of how, when you're contributing to any open source project, you do everything better together: you work with each other so that everyone can take care of the steps they're good at when contributing a feature.
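In Postgres 16 this is opted into per subscription (the names here are illustrative), and the number of workers is capped by a subscriber-side setting:

```sql
-- Apply large streamed transactions using parallel workers on the subscriber.
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=app'
    PUBLICATION orders_pub
    WITH (streaming = parallel);

-- The pool of workers is bounded by this subscriber-side setting:
SHOW max_parallel_apply_workers_per_subscription;
```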

Now, as I mentioned, there are other ways you can contribute to Postgres, and two areas that AWS has also focused on are drivers and extensions. I'm just going to touch upon a little bit of that today; as I mentioned on the other slide, there are a lot more projects we've been involved in, but there are two things of note I want to talk about: first, the JDBC driver, which is the primary way Java applications connect to Postgres, and second, two extensions, ph plan and pgvector.

So first, let's talk about the JDBC driver. I don't know exactly how popular the JDBC driver is, but the rumor is that it's very popular. In fact, I once gave a talk with one of the primary authors of the JDBC driver, and when folks found out that was his role, he got a standing ovation in a room of 100 people. I couldn't believe it. Drivers are important; this is why I try to emphasize that they're critical in application building. If you're building any application, you're using some sort of driver. There's a recent contribution to the JDBC driver which I personally was very excited about: binary support for the vector data type. Vectors are very popular in generative AI, and a common flow is that I might be in.
