AWS open source strategy and contributions for PostgreSQL

All right, so let's jump right in with a quick introduction. I'm Jonathan Katz. In addition to being a product manager at AWS, I'm an active contributor to the Postgres project; I'm on the core team today.

We'll talk about a topic that's very near and dear to me, and that's contributing to Postgres - contributing to a very well-known open source database.

So, by a quick show of hands, how many people are using Postgres today? Cool, makes a lot of sense. What I want to talk about today is how we at AWS are contributing to Postgres, but also what contributing to Postgres looks like in general - and by extension, how you contribute to an open source project that has been around for a while.

And we'll look at some recent contributions, how they got there, and their overall impact throughout different aspects of the Postgres community.

So to get started: one reason it's important for AWS to contribute to open source is that we want to do it in a sustainable way, in order to help projects continue to grow and sustain themselves regardless of who's contributing to them. And that's important.

Think about projects like Postgres and Linux and Python that have been around a long time, the communities they've built around them, and the continued support of those communities to make sure they keep prospering and delivering awesome features and great user experiences.

And this gets into this number right here: 37. What's so special about this number? And it's like, "Jonathan, we don't care how old you are." The reason is that this is also the age of Postgres. Postgres has been around a long time - it started at the University of California, Berkeley, in 1986, and has sustained contributions to this day, which means it has a rich history.

So if you want to contribute to a project that's been around a long time, you need to understand a bit of its history and how everything works before you just dive in and start contributing.

Meanwhile, you might have also noticed that AWS has been contributing a lot more to Postgres over the past few years. Take the past three Postgres releases: Postgres 16 was released in September, and AWS helped contribute to around 19% of the features in a variety of ways - it could have been authoring them, helping to review them, or playing a significant part in getting the feature committed.

And the big question is: why is AWS all of a sudden interested in helping Postgres? Well, there's a variety of reasons. The first is that it's important to AWS itself. I mean, how many of you in here today are using Aurora and RDS to run Postgres? Yeah, quite a few. So it's important that we help Postgres continue to add new features and grow, because this ultimately helps you when you're running Postgres on Aurora and RDS. It's also important to Amazon itself - Amazon uses Postgres to run many of its operations. So making sure that we're making sustainable contributions to Postgres just helps in general.

It also helps us understand what you're doing and building with Postgres. One of my favorite things to do is just talk to you and understand: what are you doing, what are you building, what can we do to make Postgres better? And to me, that's very exciting.

A lot of what we're gonna talk about today comes directly from conversations I've had with folks over the past several years, and from helping to relay those back to the community so it understands how important they are. That also lets us do something like look around corners - be able to understand what's coming.

I think one of the biggest waves I've seen in the past year is generative AI - I mean, you can't walk anywhere here without talking about generative AI. We've even seen this come to Postgres in a variety of ways, whether it's storing vector embeddings or trying to write automatically generated SQL queries aside from the ones that ORMs currently generate. It's important to try to understand what's coming so we can help position the Postgres project to support it.

So - I started off saying there's a rich history to Postgres - how do you contribute to an established community? Now, I don't want to scare folks off here. You certainly can come in and make small to medium-sized contributions without necessarily understanding the whole ethos of the Postgres community. But if you want to make a meaningful, sustained contribution, it helps to understand how to do that the right way.

The first thing is that you need to work with existing community processes. You might have your own engineering or contribution processes where you are, but if you come into a community that has already established its processes and say, "No, you need to do it this way, you need to use this tool, you need to have this whole PR system set up," you might not make many friends very quickly - because you need to earn trust.

So first, when you come in, learn the rules: understand how a patch review works, how a Commitfest works, whatever the cycle may be. By doing that, and by starting to establish a presence in the community, you earn trust. People realize, "OK, they're in it for the long haul. This isn't just a drive-by contribution, and they're not trying to impose their will on the community - they're really in it to work with us." That lets you start thinking long term and bringing in better practices that might ultimately evolve the community and help change some of the processes as need be. You establish the voice to influence things such as which priority features to work on, or ways to grow new contributors and expand the community.

This process can work with any open source community as you try to establish and grow a contribution process.

So that leads us into what we're talking about today, specifically around Postgres. First, we'll do a brief overview of how you can contribute to Postgres, because there are a lot of ways that go beyond just contributing to the database. We'll then talk about what AWS has been doing recently to help contribute to Postgres, as well as deep dive into some of our recent contributions. And finally, we'll talk about what's next: where do we go? How do we keep this database that's been around for nearly 40 years growing and evolving, and make sure you're very much enjoying running your database workloads on Postgres?

So, an overview of contributing to Postgres. How many of you recognize what this is? Yeah, this is Slonik, the Postgres logo. The reason I'm starting off with this is that even this logo elicits an emotion: you recognize it as being related to Postgres; you see an elephant, and an elephant never forgets, right? Kind of like your data. It also encapsulates the ethos of contributing to Postgres - and understanding that ethos helps you contribute successfully.

The first thing about the Postgres community ethos, as told by myself, is that you want to be equally fair to everyone. And that's kind of an interesting wording, right? What do you mean, "be equally fair to everyone"? Well, first off, the community is very intent on treating everyone's opinions equally. It doesn't matter if you've been contributing for 30 years or one year - we want to give weight to everything that folks are saying. Granted, a senior contributor to the core Postgres code base - certainly their words carry a lot of weight, but that doesn't mean it's necessarily the final answer for how things will be solved. Likewise, when organizing events, treat all sponsors equally, at the same level. Being equally fair to everyone is part of what's kept so many folks involved in Postgres contributing for decades. Which leads to the next one.

No one entity can control more than 50% of the community. This has been very much a part of Postgres for the past 25 years: the community doesn't want to see one organization become too influential. Even when we've seen consolidation of companies in the Postgres space in recent years, the community has done things to make sure that not too many key decision makers are all in one place. That balance is very important to help keep Postgres free and open source.

Next: strive for perfection. We know it's impossible to be perfect, but keep in mind Postgres is storing critical data. You don't wanna lose data, you don't want data to get corrupted, and Postgres has a hard-earned reputation of being a very reliable data store. So when things get committed to the project, you want them to be as close to perfect as possible. That's why some things have taken many years to be added to Postgres - logical replication, a feature we'll go into in greater depth in this presentation, is a very good example of that.

Next: you gotta be able to have backbone, and disagree and commit. Now, you might recognize this as an Amazon leadership principle, but it actually holds true in the Postgres community. How many folks have read the pgsql-hackers mailing list? The hackers list is where a lot of the discussion over the development of Postgres happens, and you definitely need to have backbone there. Folks are very respectful on it, but you're gonna see some very strong opinions from people who have very strong opinions. When it comes time to make a decision, though, folks will disagree and commit: they may not agree with the decision, but they'll support it and see it through.

Finally, the community works together towards a common goal. It doesn't matter what company you work for - everyone wants Postgres to be better. And we're gonna see examples throughout this talk where folks in different organizations have collaborated on features because they know it's gonna help the greater good, which is everyone who's using Postgres.

Quickly: how's the Postgres community organized? We could probably do a 50-minute session just on this, so I'll cover it briefly. There's a core team. The core team is not a central governing body; I'd say it's more of a dispute resolution group - if there's no way to solve something through regular community processes, the core team will help resolve it. You have committers: the folks who have privileges to commit code to the Postgres core engine. There are 31 of those folks in the world, which is actually relatively small when you consider the scope and size of the project, but it takes many years to earn a commit bit. Then you have all sorts of folks helping with all other aspects of the project: writing patches, reviewing patches, translating - which is a huge effort - as well as everything involved with advocacy: talking about Postgres, creating educational materials, et cetera.

So Postgres does have a well-defined development cycle, and we'll cover it real briefly. New development always begins in July, with something called the July Commitfest. Commitfest is a Postgres-specific term: it's when folks really focus on committing patches to the project, and what committers are really doing is reviewing and determining whether a patch is ready - but there's a lot of development going on during this period. There are a bunch of these up front: from July through January, you have a Commitfest every other month. In February there's a check-in - a developer meeting that's typically held at the FOSDEM open source conference in Brussels - and this is where triage and prioritization happen: what is actually going into the feature release? Then there's the final Commitfest, which is maybe the most stressful time of the year if you're a committer, because right after it you have feature freeze. Once you hit feature freeze, no new features go into Postgres for that upcoming major release. That is a hard and fast rule: you will almost never see additional functionality added after this; typically you see functionality removed. So that whole period is nine months - nine months of solid development, building out new features in Postgres. That leads to May, which is one of my favorite months, because we typically have an update release for the supported versions, and we have beta 1, which is where you see all the new functionality going into the next release of Postgres, plus a developer meeting to help plan the release after that. At this point in the development cycle, you're starting to work on multiple new things at once: the community is focused on stabilizing the betas while also starting to develop new things - because remember, new development begins in July, so the next Commitfest starts while you're stabilizing the upcoming release.

And finally, in September, it all comes to fruition. You have the GA, there's a new version of Postgres available as you're building out the next version after that. All this takes 14 months. But what's nice is that there's a predictable cycle of when features get released.

And remember, in the Postgres community we strive for perfection - so when we GA a release, we're feeling very confident about the features we're putting in your hands.

Now, that's just one aspect of contributing to Postgres - the core database engine. There's a whole world of other ways you can contribute; contributing to Postgres is not just contributing to the database itself.

There are extensions, which add functionality to Postgres without actually having to fork Postgres or create a brand new copy of it. You have drivers - this is likely your primary interface into Postgres, using something like the JDBC driver or Psycopg.

And there are the governance aspects of the project. Extensions are a great way to contribute to Postgres. There are thousands of them in the world - honestly, I don't even know how many Postgres extensions there are, and new ones pop up every day.

Just in RDS alone, there are over 90 extensions, and there are ways to add more extensions to RDS through Trusted Language Extensions, which we'll touch on briefly in a little bit. But you can see that this is just a small subset of all the extensions available for Postgres.

So, if you want to find extensions: first, in Postgres itself, there are already extensions available. These are optional modules that you can enable on your instance. And pretty much everyone in this room - everyone who's running RDS in this room - is using pg_stat_statements, because it's the primary way you gather statistics about what queries are being run within your database.
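As a quick sketch of what that looks like in practice - column names follow the pg_stat_statements documentation (`total_exec_time` is the name since PostgreSQL 13; older versions call it `total_time`):

```sql
-- pg_stat_statements must be listed in shared_preload_libraries first
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 5 statements by cumulative execution time
SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

On RDS, `shared_preload_libraries` is managed through the instance's parameter group rather than postgresql.conf.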

There are all sorts of other useful extensions within Postgres as well, based on what data type support you need or whether you're connecting to other databases - and there are extensions for utilities.

All of these are various ways to manage different aspects of Postgres: pg_cron, which lets you schedule jobs directly in Postgres; pg_partman, which handles partition management; pg_hint_plan, which we'll talk a bit about later and which is great if you're migrating from databases that support hints; and pg_audit, which does - well - audit tracking of your logs.
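To make the pg_cron point concrete, here's a minimal sketch using its `cron.schedule`/`cron.unschedule` API - the job name and schedule are just examples:

```sql
CREATE EXTENSION pg_cron;

-- Run a nightly VACUUM at 3:00 AM, using standard cron syntax
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');

-- Inspect and remove scheduled jobs
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-vacuum');
```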

There are extensions that let you connect to remote data sources. The MSSQL and Oracle FDWs are ways you can query those databases directly from Postgres.

tds_fdw works with SQL Server and Sybase, and log_fdw is a way for you to query your Postgres logs directly from a SQL interface.
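The flow is the same as with any foreign data wrapper: define a server, a user mapping, and foreign tables. A hedged sketch with tds_fdw - the hostname, credentials, and table names here are made up, and option names follow the tds_fdw README:

```sql
CREATE EXTENSION tds_fdw;

-- Hypothetical SQL Server instance
CREATE SERVER mssql_svr
  FOREIGN DATA WRAPPER tds_fdw
  OPTIONS (servername 'mssql.example.com', port '1433', database 'sales');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER mssql_svr
  OPTIONS (username 'report_user', password 'secret');

CREATE FOREIGN TABLE orders_remote (order_id int, total numeric)
  SERVER mssql_svr
  OPTIONS (table_name 'dbo.orders');

-- Now query SQL Server data with ordinary Postgres SQL
SELECT count(*) FROM orders_remote;
```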

There are replication extensions: different ways of copying data between Postgres data sources, or of reading changes streaming in real time.

We recently announced pg_active, an extension that enables active-active replication directly from Postgres. And the other extensions here help you administer different ways of logically replicating data out of Postgres to downstream sources.

There are extensions to build more extensions, particularly procedural languages. Take PL/Rust: did you know you can write Rust functions and execute them directly from Postgres? It's pretty cool. And finally, search extensions - in some ways, these add new data types and ways of searching over them.
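As a small taste of PL/Rust - in my reading of its docs, arguments arrive as `Option<T>` and the function body returns a `Result`, so a sketch looks roughly like this:

```sql
CREATE EXTENSION plrust;

CREATE FUNCTION strlen(name TEXT) RETURNS int
LANGUAGE plrust AS
$$
    // Rust body: unwrap the nullable argument and return its length
    Ok(Some(name.unwrap().len() as i32))
$$;

SELECT strlen('postgres');
```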

PostGIS is one very popular one: it turns Postgres into a geospatial database and is incredibly popular in the geospatial world. pgvector, which we'll talk a bit more about later today, has become incredibly popular due to the rise of folks wanting to query vector data directly from their Postgres database.
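A minimal pgvector sketch - the extension is named `vector`, and `<->` is its Euclidean-distance operator:

```sql
CREATE EXTENSION vector;

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,1,1]'), ('[2,2,2]');

-- Exact nearest-neighbor search by Euclidean distance
SELECT id FROM items ORDER BY embedding <-> '[1,1,2]' LIMIT 1;

-- Optional approximate index for larger tables (pgvector 0.5+)
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
```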

So then there are drivers. Drivers are key - in some ways, they're the unsung heroes of the Postgres community, because they allow you to work in your native programming language, connect to Postgres, and manipulate your data. And there are a lot of them.

Let's start with Java: you have the JDBC driver as well as the R2DBC driver. Take JavaScript: there are multiple drivers available to connect to Postgres. Python - I found four that let you do exactly that. Rust, there are some; Go, there are some; Ruby, there are some; C++, et cetera, et cetera, et cetera.

I'll tell you, it was a lot of fun trying to animate this slide to get all the drivers in there. And this doesn't even include ORMs, which are an abstraction layer on top of drivers. If you want the full list of drivers available for Postgres, there's a wiki page for that, which constantly gets updated. Drivers are a great way to help contribute to Postgres, because any functionality you add to a driver helps developers work with Postgres more easily.

And then there's governance. There's a variety of ways to help grow Postgres without necessarily writing code. There are NPOs that help promote events and various educational activities around Postgres, as well as user groups and other committees that help support the general structure of the project - whether it's the security committee or one of the many others. Now, for a lot of the committees, it does take some time to earn the trust to be a part of them.

But there are ways to get active in the project to help earn that trust, including volunteering for the NPOs and different user groups.

So with that said, let's start diving into contributions and what AWS has been doing over the past several years.

So as I mentioned, AWS contributed around 19% of the features to Postgres 16 and was the second largest contributor. It's something we take great pride in, because, as I mentioned up top, Postgres is important to us in many different ways - as important as it is as an open source project.

So the first question is: well, OK, how did you do that? How did you build up all these contributions?

Well, first, it was just general involvement in the project. AWS was recognized as a major sponsor of the project, and this kind of recognition from the Postgres project is earned - you can't buy your way into it. Every year, the community reviews a list of sponsors - folks contributing to the project and how their employers help align them to contribute. And if they see those earned contributions - patch reviews, writing code, being involved in the various governance aspects - you can get that recognition.

This also extends to working on Postgres extensions - no pun intended - because Postgres extensions help drive more folks to being able to use Postgres.

So we do have a dedicated effort contributing to Postgres. We have a team of contributors whose entire focus is contributing to Postgres, whether it's the core database engine, extensions, or the keep-the-lights-on work of the community. And it does help to have folks solely focused on it, particularly for an open source project that's been around a long time.

There are a lot of moving parts to building a database, and it does require full-time focus if you want to make sure that database succeeds. Now, I do want to note that if you go back 20 years, you had very few folks - maybe only one person - who could focus on Postgres full time. It was a hobby project, and that's been a big shift over the past 20 years.

Postgres has gone from being a hobbyist project to being used for mission-critical workloads. And the good news is that there are folks treating it as such and focusing on it full time, to make sure we can offer the best possible support for it.

There are all sorts of different ways to contribute, as mentioned, and as AWS we focus on all these different areas beyond just code contributions. I do want to call out systems - systems administration. Postgres manages pretty much all of its own infrastructure, and it's in a whole variety of different places.

And the reason things work - the reason you're able to, say, download Postgres from the Postgres website - is because you have folks focused on systems administration, and that's important. You wanna make sure all the resources needed to develop the project are up and running.

Now, there are certainly different ways this may evolve through the years. But again, this is one of those parts of the Postgres ethos that helps keep it free and open source.

So, if you want to bucket where some of our direct contributions to the Postgres project have come from, I like to think about it in terms of performance, operability, and security.

Performance, I think we all understand: how quickly are you able to get a query returned? How quickly can you index something?

Operability is how you administrate your data - moving things around, doing a lot of your maintenance operations. And security is making sure data stays safe and secure.

And one specific contribution I want to call out is the last one - security - preventing a CVE before it happened. During the beta period, one of our engineers noticed an issue with a new feature that would certainly have become a CVE: there was a permission check that wasn't being performed correctly, and someone could have escalated privileges. Catching that before it became a CVE I consider to be huge, because, firstly, I actually have to deal with CVEs as part of my role in the Postgres community, and it takes work.

So anytime you prevent a security issue from occurring, it's certainly a good thing. But I also want to emphasize the importance of beta testing. You saw that there's a beta period available from May to September; anything you can do to test Postgres against production-like workloads, or to test out new features, helps - because the community can harden the release before GA to make sure problems such as CVEs don't actually occur.

Now, in the specific case of security issues, I know it's impossible to prevent all of them. But that's why thorough testing certainly helps.

So let's deep dive a little bit. One feature group I really wanna deep dive on today is logical replication. How many folks are familiar with what logical replication is? OK.

So for those of you who aren't: logical replication is a way of streaming changes out of a database in a form that can be interpreted by other systems. What do I mean by that? There's another type of replication - physical replication - which basically replicates, block by block on disk, what changes are occurring, and then sends those downstream.

But not every system can understand those block-by-block changes. There's a lot you have to keep exact: the operating system, the file system, et cetera. With logical replication, any system programmed to understand the protocol can replay those changes.

So if Postgres is logically replicating something, the downstream system doesn't necessarily have to be a Postgres database. It could be a Python script that understands how those changes are being replicated. And from that Python script, you can do all sorts of things: extract-transform-load jobs, building real-time dashboards, and so on.
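In Postgres-to-Postgres form, this is the publication/subscription API; for a non-Postgres consumer, you'd read from a replication slot instead. A sketch - the connection string, slot, and table names are placeholders:

```sql
-- On the publishing database:
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On a subscribing Postgres database:
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=publisher.example.com dbname=app user=repl'
  PUBLICATION orders_pub;

-- Or, for that Python-script-style consumer: create a slot with a
-- decoding plugin and pull the changes as text
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);
```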

I actually used it as a way to create, effectively, a hacky incremental materialized view, where I was applying changes in real time to a table for which I had to do a complex aggregate query. Logical replication - well, logical decoding, its foundation - came out in Postgres 9.4, which is about 10 years ago now, and it's been evolving ever since. I'd say it's been a slow evolution.

First in Postgres itself, there's already extensions available. And these are, these are optional modules that you can enable on your, on your instance. And pretty much everyone in this room or everyone who's running a RDS in this room, you're using pg_stat_statements because this is the primary way you gather statistics about what queries are being run within your database.

There's all sorts of useful extensions as well just within Postgres based upon what data type support you need or if you're connecting to other databases or programming other things, there's extensions for utilities.

Um all of these are, you know, very, you know, are various ways to be able to manage different aspects of Postgres. It could be pg_cron which lets you schedule jobs directly in Postgres. Pg_partman that lets you uh handle partition management pg_hint_plan, which actually we'll talk a bit about later, which is great if you're migrating from databases that support hints and pg_audit, which does well audit tracking of your logs.

There's extensions that let you connect to remote data sources. Uh the MSSQL and Oracle FDWs are ways that you can query said databases directly from Postgres.

Uh TDS FDW works with uh SQL Server and Sybase. A Mock FDW is a way for you to query your Postgres logs directly from an SQL interface.

There's replication extensions, different ways of being able to copy data between different uh Postgres data sources or things that can uh read things streaming in real time.

Uh we recently announced uh pg_active. Uh it's an extension that enables active active replication directly from Postgres. And all these other extensions help help uh you administrate different ways of logically replicating data out of Postgres to downstream sources.

There's extensions to build more extensions uh particularly in procedural languages. Uh take PL/Rust, did you know you can write Rust functions and execute them directly from Postgres? It's pretty cool and finally search extensions and you know, in some ways know these extensions add new data types and ways of searching over them.

PostGIS is one very popular one, it turns Postgres into a geospatial database and is, you know, incredibly popular in the, the geospatial world pg_vector will talk a bit more about later today. It's become incredibly popular due to the rise of folks wanting to query vector data directly from their Postgres database.

So then there's drivers, drivers are, are a key drivers in some ways, are the unsung heroes of the Postgres community because they allow you to work with your native programming language and connect to Postgres and be able to, to manipulate them. And there's a lot of them.

Now, let's start with Java. You have the JDBC driver or as well as the R2DBC driver and take JavaScript. There's multiple drivers available in JavaScript to connect to Postgres python. I found for that allow you to do exactly that rust there, rust, there's some go, there's some ruby, there's some C++ et cetera, et cetera, et cetera, et cetera.

I'll tell you it was a lot of fun to try to animate this slide to uh get all the drivers in there. Oh, no. And this doesn't even include ORMs which are an abstraction layer on top of drivers. If you want to get the full list of drivers available for Post, there's, there's a wiki page for that which constantly gets updated and these are a way to help contribute to Post because any functionality you add to a driver helps enable developers to more easily work with Postgres.

And then there's governance. There's a variety of ways to help grow Postgres without necessarily writing code. There's MPOs that help promote events and various educational activities around Postgres as well as user groups and other committees that help to, you know, support the general uh structure of the project, whether it's the security committee or um uh the, there's a lot of them now, a lot of the committees, it does take some time to earn trust to be, to be a part of those committees.

But there are ways to uh to get active in the project to help earn that trust, including, you know, volunteering for the MPOs and different user groups.

So that said, let's start diving into, into contributions and what AWS has been doing over, you know, the past several years.

So as I mentioned, AWS uh contributed around 19% of the features to Postgres 16 and was the second largest contributor. And it's something that we take great pride in because it mentioned up top, it's, you know, Postgres you know, as important as it is an open source project, it is important to us, you know, in many different ways.

So the first question is, well, ok, you know, how did you do that? How did you build up all these contributions?

Well, first, it was just general involvement and being involved in the project. AWS was recognized as a major sponsor of the project. This kind of recognition from the Postgres project is earned. You can't buy your way into it that every year, uh the community reviews a list of sponsors or a list of folks contributing to the project and how their employers are you know, help align them to contribute. And if they see those earned contributions during the patch reviews, writing, writing code and being involved in the various governance aspects, you can get that recognition.

This also extends to working on Postgres extensions, no pun intended, because Postgres extensions do help drive more folks toward being able to use Postgres.

So we do have a dedicated effort contributing to Postgres. We have a team of contributors whose sole focus is contributing to Postgres, whether it's the core database engine, extensions, or the keep-the-lights-on work of the community. And it does help to have folks solely focused on it, particularly for an open source project that's been around a long time.

There are a lot of moving parts to building a database, and it does require full-time focus if you want to make sure that database succeeds. Now, I do want to note that if you go back 20 years, you had very few folks, or maybe only one person, who could focus on Postgres full time. It was a hobby project, and that's been a big shift over the past 20 years.

Postgres has gone from being a hobbyist project to being used for mission-critical workloads. And the good news is that there are folks who are treating it as such and focused on it full time, to make sure we can offer the best possible support for it.

There are all sorts of different ways to contribute, as mentioned, and as AWS, we focus on all these different areas beyond just code contributions. I do want to call out systems administration: Postgres manages pretty much all of its own infrastructure, and it's in a whole variety of different places.

And the reason things work, the reason you're able to, say, download Postgres from the Postgres website, is that you have folks who are focused on systems administration, and that's important. You want to make sure all the resources needed to develop the project are up and running.

There are certainly different ways this may evolve through the years. But again, this is one of those parts of the Postgres ethos that helps keep it free and open source.

So, if you want to bucket where some of our recent direct contributions to Postgres have come from, I like to think about it in terms of performance, operability, and security.

Performance, I think we all understand: how quickly are you able to get a query returned? How quickly can you index something?

Operability is how you administer your data: being able to move things around or do a lot of your maintenance operations. And security is making sure data stays safe and secure.

And one specific contribution I want to call out is the last one, security: preventing a CVE before it happened. During the beta period, one of our engineers noticed an issue with a new feature that would certainly have become a CVE: there was a permission check that wasn't being performed correctly, and someone could have escalated privileges. Catching that before it became a CVE I consider to be huge, because, firstly, I actually have to deal with CVEs as part of my role in the Postgres community, and it takes work.

So anytime you prevent a security issue from occurring, that's certainly a good thing. But I also want to emphasize the importance of beta testing. You saw that there's a beta period available from May to September; anything you can do to test Postgres against production-like workloads or try out new features helps, because the community can harden the release before GA to make sure problems such as CVEs don't actually occur.

Now, in the specific case of security issues, I know it's impossible to prevent all of them. But that's why thorough testing certainly helps.

So let's deep dive a little bit. One feature group I really want to deep dive on today is logical replication. How many folks are familiar with what logical replication is? OK.

So for those of you who aren't: logical replication is a way of streaming changes out of a database in a way that can be interpreted by other systems. What I mean by that is that there's another type of replication, physical replication, which is basically replicating, block by block on disk, what changes are occurring and then sending those downstream.

But not every system can understand those block-by-block changes. There are a lot of things you need to keep exact: you have to keep your operating system exact, you have to keep your file system exact, et cetera. With logical replication, other systems that are programmed to understand that protocol can redo those changes.

So when Postgres is logically replicating something, the downstream system doesn't necessarily need to be a Postgres database. It could be a Python script that understands how those changes are being replicated. And from that Python script, you can do all sorts of things: you can do extract-transform-load jobs, or you can build some real-time dashboards from that data.
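As an illustration of the kind of Python consumer I'm describing, here's a minimal sketch that parses the textual change format emitted by Postgres's `test_decoding` output plugin. The line format is simplified, and the table and column names are made up; a real consumer would read these lines from a replication connection rather than a string.

```python
import re

# Simplified test_decoding-style line, e.g.:
#   table public.users: INSERT: id[integer]:1 name[text]:'alice'
LINE_RE = re.compile(r"^table (?P<table>\S+): (?P<op>INSERT|UPDATE|DELETE): (?P<cols>.*)$")
COL_RE = re.compile(r"(\w+)\[[^\]]+\]:('[^']*'|\S+)")

def parse_change(line):
    """Parse one change line into (table, operation, {column: value}), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # BEGIN/COMMIT markers and anything else we don't handle
    cols = {name: value.strip("'") for name, value in COL_RE.findall(m.group("cols"))}
    return m.group("table"), m.group("op"), cols

change = parse_change("table public.users: INSERT: id[integer]:1 name[text]:'alice'")
print(change)  # ('public.users', 'INSERT', {'id': '1', 'name': 'alice'})
```

From a parsed change like this, an ETL job or dashboard updater would dispatch on the operation and do whatever it needs with the column values.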

I actually used it as a way to create, effectively, a hacky incremental materialized view, where I was updating changes in real time in a table for which I had to run some complex aggregate query. Now, logical decoding, the mechanism underlying logical replication, came out in Postgres 9.4, which is about 10 years ago now, and it's been evolving ever since. I'd say it's been a slow evolution.

First, there was enabling the mechanism to perform logical replication, and now there are all sorts of features around it that help you decide what you're logically replicating, where you're logically replicating to, and that continue to improve its performance. The reason is that it's in the critical path now: you might have multiple databases pushing changes into a hub that then get pulled into a data warehouse, with all sorts of different points along the way. Talking to our customers, we've heard a lot of demand for improving things around logical replication, because it is an important part of real-time analytics systems, or of pushing data across different regions.

So if you want to think about what features are needed to support logical replication beyond the base support for it, the first is performance. Performance is what I've been hearing about the most lately: how do we make things faster with logical replication, particularly as we've seen it used more and more over the past several releases?

There have been different features to help support this, for example, streaming changes from large transactions from your publisher to your subscriber. Previously, it was all or nothing: if you had a long-running transaction, let's say it's taking a minute to complete, all those changes would be held up before being sent to your subscriber. I believe it was Postgres 14 that added the ability to stream it. So as the transaction is running on your publisher, the changes are being sent to the subscriber. And though they aren't being applied on the subscriber yet, they are all being sent over, so that when the commit occurs, you can immediately begin replaying the transaction on the subscriber without having to send all the data over first. It saves time and actually reduces the burden on your primary instance. So performance is important.
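As a sketch of how this streaming behavior is enabled (the subscription, publication, and connection details here are hypothetical), it's an option on the subscription:

```sql
-- Stream large in-progress transactions to the subscriber as they run,
-- instead of waiting for the commit (available since Postgres 14).
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=app'
    PUBLICATION orders_pub
    WITH (streaming = on);
```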

So is flexibility. When you have physical replication, typically that's an all-or-nothing system: you're using it to keep an effectively exact copy of your primary database. With logical replication, you can choose to replicate a subset of changes. For example, let's say you have a few tables that you want to use in a real-time report downstream somewhere. You don't want to replicate all the changes in your database, only the changes from those tables. That's the flexibility aspect of logical replication. And finally, safety: these are changes to your data, this is your data. You want to make sure you're able to protect it, and make sure that when you're replicating things to a downstream system, or even within your publisher itself, you're not introducing anything that could potentially allow an attacker to escalate privileges or leak information that you don't want leaked.
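The table-level flexibility I just mentioned can be sketched like this (table and filter names are made up for illustration): a publication can cover just the tables, and since Postgres 15 even just the rows, that you care about.

```sql
-- Publish only the tables needed for the downstream report.
CREATE PUBLICATION report_pub FOR TABLE orders, customers;

-- Or, since Postgres 15, publish only a subset of rows from a table.
CREATE PUBLICATION us_orders_pub FOR TABLE orders WHERE (region = 'us-east');
```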

So, Postgres 16. Here's a selection of changes that were committed to Postgres 16, in no particular order: logical replication from standbys, which we're going to dive deep into, and parallel apply of large transactions, which we'll also dive into. I do want to touch on the bottom two, since we're not going to dive into them in the next couple of slides. Apply as table owner: this goes into the bucket of safety features, or security features. Just to back up a second: when you apply a change, that's basically saving it on the local instance, and the subscriber is the instance receiving that change. So when you apply that change, you're basically saying, OK, the subscriber has now saved that change. Before Postgres 16, those changes were applied as the subscription owner.

Now, the subscription owner is often your database superuser, and your database superuser can do anything. Part of the problem was exactly that: your database superuser can do anything, and there are things that can occur as part of applying a change, for example, executing a trigger. A trigger is basically a piece of user-provided code, and when you have user-provided code, the user can put things in that code that might do something like, hey, alter privileges, grant superuser to myself. Now, by default, Postgres has a lot of safeguards to make sure users aren't doing that, and in fact, you should be checking your user-provided code regardless. But during the apply process, if you're applying these changes as a superuser, then that code will execute with superuser privileges, and suddenly you might be granting an unprivileged user superuser privileges. So that had to change, and that's what apply as table owner does. Your table owner is likely a lesser-privileged user, based upon your design, and when changes are applied as the table owner, they're executed with those lower-level privileges, helping to prevent escalations like that.
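In Postgres 16 this behavior is controlled by the subscription's `run_as_owner` option; the subscription, publication, and connection names here are hypothetical:

```sql
-- Postgres 16 default: apply changes as the owner of each target table,
-- so trigger code runs with the table owner's (lower) privileges.
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app'
    PUBLICATION app_pub
    WITH (run_as_owner = false);
```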

Finally, use btree index on apply. Typically, when you do logical replication, you want to do this on primary keys; that can be a very fast lookup, or it can be a lookup on a unique index. There might be cases where that's not possible, and when it's not, Postgres basically has to do a sequential scan through the table to determine how to apply that change, which can be a very costly process on a bigger table. Being able to use a btree index on the apply is basically a shortcut: if you don't have a primary key, Postgres can look things up in the btree and apply that change much more quickly.
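A sketch of that scenario, with made-up table and column names: on a Postgres 16 subscriber, this kicks in for tables that have no primary key and use `REPLICA IDENTITY FULL`.

```sql
-- On the publisher: a table without a primary key needs REPLICA IDENTITY FULL
-- so that UPDATEs and DELETEs can be replicated at all.
ALTER TABLE events REPLICA IDENTITY FULL;

-- On a Postgres 16 subscriber: a plain btree index on a replicated column
-- lets the apply worker look rows up instead of sequentially scanning.
CREATE INDEX events_event_id_idx ON events (event_id);
```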

Now, I do want to dive into this a bit. First, this slide: my two-year-old called it the butterfly, and I like that name, because it really shows how folks understand the community ethos here of a rising tide lifting all ships. We're all going to work together to help make Postgres better, because all the companies on this slide helped each other in different ways on these different features. For example, an engineer from EDB proposed the apply-as-table-owner patch, and folks at these other organizations helped make sure the design made sense as we put it into Postgres 16. Having everyone work together, particularly on a very complex set of changes, allows Postgres to keep getting better, and that is a core part of the community ethos. I want to deep dive on replication from standbys, because this was actually one of the most requested features I've heard about, even since I joined AWS.

So just real quick: typically, when you set up replication, in this case physical replication, you have a primary and you copy changes to a standby. In this case, we also have a logical replica, so we're replicating to two different places: the purple is physical replication, the yellow is logical replication. Typically, you might have users interacting with the logical replica; let's say it's some kind of real-time reporting system. Now, your primary tends to be your busiest system, the one accepting your reads and writes, so there's a lot of traffic going to it. We can already see here that on our primary we have users interacting, making changes, reading data from it, and we're sending all these changes to the standby and the replica. All of that takes work; the primary is doing a lot of work. And if you have a very busy system, that's going to take a lot of time.

The standbys tend to be a lot less busy, if you even have traffic on them at all. So there's an opportunity here to take some load off the primary and replicate changes from the standby, and as you can see, we can alleviate the load on the primary by doing exactly that. Now, here's what's very interesting about this feature: it started in 2016. This is one of those things where, yeah, of course, it's an obvious idea, or it might be an obvious idea, but it started in 2016 and kind of stalled out. A major effort began in 2018 and stalled out, and this continued on and off for years, because it actually was very challenging to design, and it really took a concerted effort to get it through. This is something we helped sponsor in a major way; the primary author came from AWS, but it was a team effort, requiring reviews from multiple organizations and a lot of people working on it. And actually, for the Postgres 16 release, it was literally committed hours before feature freeze, thanks to a core team member and committer from Microsoft. Notice that this took seven years to get in. I'm not saying every feature should take seven years; one, there are certainly a lot that don't take that long, and two, we always try to get better and faster, but we want to do it right. There are a lot of things that are challenging about this feature, likely enough for a whole talk in itself. But this shows the general effort the community put behind it to ultimately get it in.
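A hypothetical setup sketch of what this enables in Postgres 16 (slot, subscription, and host names are made up; it assumes `wal_level = logical` on the primary and `hot_standby_feedback = on` for the standby):

```sql
-- On the standby itself, you can now create a logical replication slot:
SELECT pg_create_logical_replication_slot('reporting_slot', 'pgoutput');

-- And a subscriber can point its connection string at the standby
-- instead of the primary, offloading the decoding work:
CREATE SUBSCRIPTION reporting_sub
    CONNECTION 'host=standby.example.com dbname=app'
    PUBLICATION reporting_pub;
```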

Another one we were active in was parallel apply of large transactions. So, real quick: in logical replication, there are two things that occur when you're sending information from, let's say, a primary to a logical replica. You actually have two processes set up, one on each side: a sender and a receiver. A change comes in, the sender encodes it into the logical format and sends it to the receiver, which then commits it to the logical replica. So it's pretty straightforward: you have a transaction, and it ultimately gets sent through a pipeline. The problem is that there are several places that can bottleneck. Let's say you have a large transaction and the receiver is taking time to apply it; it starts stalling transactions on the primary, which can start accumulating. This runs several risks, but the primary one is disk exhaustion, where you accumulate so many changes on your primary that you actually run out of disk, not because your database is full, but because you haven't been able to replay all the transactions on the logical replica.

So this is where parallel apply comes in: if you have very large transactions, which can take a lot more time to apply, the receiver can work on saving them to the logical replica in parallel, and that helps speed up the process. Now, this was a collaborative effort between us and Fujitsu. Fujitsu actually proposed this, based upon what they were seeing from their customers, and asked us to help with the design review, as one of our engineers is an expert in this area. We went back and forth; Fujitsu helped implement it, and we helped with the code review. There was all sorts of collaboration at each step. But this is another example of how, when you're contributing to any open source project, you do everything better together: you work with each other so that everyone can take care of the steps they're good at when contributing a feature.
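In Postgres 16 this is opted into per subscription (the names here are illustrative), and the number of workers is capped by a subscriber-side setting:

```sql
-- Apply large streamed transactions using parallel workers on the subscriber.
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary.example.com dbname=app'
    PUBLICATION orders_pub
    WITH (streaming = parallel);

-- The pool of workers is bounded by this subscriber-side setting:
SHOW max_parallel_apply_workers_per_subscription;
```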

Now, as I mentioned, there are other ways you can contribute to Postgres, and two areas that AWS has also focused on are drivers and extensions. I'm just going to touch upon a little bit of that today; as I mentioned on the other slide, there are a lot more projects we've been involved in, but there are two things of note I want to talk about: first, the JDBC driver, which is the primary way Java applications connect to Postgres, and second, two extensions, ph plan and pgvector.

So first, let's talk about the JDBC driver. I don't know exactly how popular the JDBC driver is, but the rumor is that it's very popular. In fact, I once gave a talk with one of the primary authors of the JDBC driver, and when folks found out that was his role, he got a standing ovation in a room of 100 people. I couldn't believe it. Drivers are important; this is why I try to emphasize that they're critical in application building. If you're building any application, you're using some sort of driver. There's a recent contribution to the JDBC driver which I personally was very excited about: binary support for the vector data type. Vectors are very popular in generative AI, and a common flow is that I might be in.
