Feature management: De-risking migrations and releases for PowerSchool

all right, gonna go ahead and get started. thank you, everybody, for joining us today. we're going to be talking about feature management and how we can de-risk releases and migrations with LaunchDarkly, specifically with our friends from PowerSchool.

so my name is Peter McCarron. i'm a senior technical marketing engineer for the LaunchDarkly team. that's a fancy way of saying i work on developer experience: i try to show you all how LaunchDarkly works and how we can help you.

i'm joined today by Adam Hisley. he's a principal architect at PowerSchool, and he'll have the bulk of the presentation, showing you all the cool stuff they did both internally and using LaunchDarkly.

but before we get into that, we have this question we're all trying to answer: why is releasing software so hard? ultimately, that's why we come to conferences like this. we want to figure out why it's so hard for us to release software, and then also accumulate a bunch of swag. but ultimately, the reason we fear software releases is because they're fraught with risk.

we've all been in that process where we push something new and something goes wrong, and ultimately downtime is expensive. if you have an outage, your customers are likely seeing an error and getting frustrated. it's going to lead to lost revenue. it's a problem all around.

additionally, if you're grouping everything into a single release, your developers end up spending more time debugging code than actually pushing new features. because everything ships in one batch, you're trying to find a needle in a haystack when things go wrong. and that leads to the attitude that the status quo kind of works, right? the idea that we don't need to introduce new systems or new tools because things are working the way they're supposed to. but that leads to stagnation and technical debt, and ultimately this whole process is the reason we don't like to release software.

and if you think about it, this is probably what the release process looks like for a lot of folks: you have your development team in the middle, they release a new feature, they trigger the CI/CD pipeline, everything builds and deploys, and then a bug happens, right? we find an issue. if something goes wrong, whoever happens to be on call gets a nice PagerDuty alert at 2 a.m., everybody's unhappy, and they spend their weekend doing triage, fielding support tickets, working in a Slack channel. then you have to make a choice: am i going to roll back that feature, or am i going to find the fix and redeploy it? this whole process slows everything down. so what we end up doing is saying we're not going to release all the time; we're going to release maybe once or twice a year.

and the challenges holding folks back are things like unexpected bugs. we've all had that "it works on my machine" experience: in my local environment everything was great, but then i pushed to production and something i didn't catch failed. the other problem is that it's scary to push new software when i don't have a control in place to limit the impact radius of the people affected by those changes. it's an all-or-nothing thing: i release the software, everybody sees the change, and that becomes a problem if something goes wrong. and then once the feature is released, how do i actually know the business impact is measurable? how do we know what we released is actually helping our customers? setting up that kind of testing is complex and can be challenging. and then you take this whole piece and split it between two different release processes: the mobile lifecycle and the regular development lifecycle. there's a fracture between those two, where mobile has all of these other complications on top of everything you're trying to solve in a general release.

and that's what we're trying to solve for at LaunchDarkly. we want to help you de-risk releases: feature flags allow you to release frequently and often, with stability. but we also want to help you control that impact radius, so you feel empowered to roll out your changes to targeted folks and deliver nice customized experiences based on who your users are. then, once that feature gets released, you can run experiments against it and create measurable impact, to understand: is this change helping me, is it hurting me, or do i need to go in a different direction? and again, we unify the mobile lifecycle experience so we can do the same things for our mobile apps as well as our web apps.

and so what that looks like now is this: the developer is no longer at the center, constantly getting paged with errors and issues. the developer works on new code, they deploy the feature, and we use LaunchDarkly to create these targeted releases. adam is going to talk in his presentation about how they use that to control the release process. if we find an error, we have an automatic kill switch that turns that flag off. nobody has to get paged, nobody has to work over the weekend; we can resolve it on monday. we can then broaden our audience, so once we've moved past our testing or early access phase, we can make it available to everybody, and then we can measure the performance through experimentation. now LaunchDarkly is at the center handling those things, and your developers are focusing more on the code they want to build and the features their customers love.
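that flow can be sketched in a few lines. this is a minimal, hypothetical illustration: the in-memory flag store, flag key, and checkout functions are stand-ins for the LaunchDarkly SDK's flag evaluation, and in a real setup the kill switch would typically be flipped through LaunchDarkly's dashboard, API, or an automated trigger rather than inside the request path:

```python
# Hypothetical in-memory flag store standing in for the LaunchDarkly SDK.
FLAGS = {"new-checkout": True}

def flag_enabled(flag_key: str, default: bool = False) -> bool:
    return FLAGS.get(flag_key, default)

def new_checkout() -> str:
    # The freshly released code path -- here it ships with a bug.
    raise RuntimeError("bug in the new release")

def legacy_checkout() -> str:
    return "legacy checkout"

def render_checkout() -> str:
    if flag_enabled("new-checkout"):
        try:
            return new_checkout()
        except Exception:
            # Kill switch: flip the flag off and fall back to the old path,
            # instead of paging whoever is on call at 2 a.m.
            FLAGS["new-checkout"] = False
    return legacy_checkout()
```

the point of the pattern is that the broken code stays deployed but dark: once the flag is off, every subsequent request takes the legacy path until the fix ships.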

and so the LaunchDarkly platform handles a number of things. if you're curious about what's on here, i know it's an eye chart; come talk to us at the booth and we're happy to answer all these questions. today we're going to focus primarily on that increasing release velocity and stability section.

i've said a lot of words about all this stuff, and you might be saying, ok, what does that actually look like in practice? that's where i want to invite my friend adam up to talk about what it looks like at PowerSchool. so adam, handing it over to you.

all right, thank you. i'll put this down over here. thanks, peter.

all right, let's go ahead and start. so, hey everyone, i'm Adam Hisley, happy to be sharing some lessons learned from my team, Naviance, within PowerSchool. i'm going to talk today about a specific kind of feature flag change, one i hope folks are generally interested in, which is about pulling pieces of a monolithic application out into a service-oriented architecture. one of the things i really love about LaunchDarkly, and feature-flag-oriented development in general, is that it's applicable to a wide variety of problems. you can use feature flags for individual features on a screen and releasing them, you can use them for ops flags, and even for general bug fixes. but one that my team has found a lot of value with is this situation where we need to make a big, somewhat ambitious architectural change in our system: how can we roll that out and release it with high confidence and low risk?

ok. so if you're not familiar with PowerSchool, or have never heard of Naviance before, know that we're an ed tech company serving millions of students; our customers are schools and districts, primarily in north america. and if i go to this next slide here, you can see that PowerSchool in general is in over 13,000 schools and district organizations across 90-plus countries and territories, and the whole PowerSchool ecosystem serves 45 million students. the reason i think this slide is so important is that it hopefully underscores the importance to our team of having really strong, high-confidence rollouts, because our software is used by ed tech customers who are very tuned in. there's usually a cadence to their lifecycle: they want to do activities at particular periods during the school season. if we have a release we don't have high confidence in, or if we miss a key target, that could have a real negative impact on the students who use our software. so it's very important to us that we get this right, and that we can ship with high confidence and low risk.

so i'm the chief architect for Naviance, which is one of the product lines within PowerSchool and, as the earlier slide mentioned, serves around 10 million students. i'm really going to focus there for the concepts in this talk, although all of PowerSchool uses LaunchDarkly, and all of PowerSchool does use feature flags.

and Naviance has had what i consider to be a pretty common enterprise application journey; i imagine this story will sound familiar to a lot of folks here. it started with a small engineering team and a monolithic architecture back in 2002. the product was met with success, schools and districts loved using Naviance, and so over time the team size grew. as naturally happens, as time went on we started getting really interested in cloud-based technologies. today Naviance is entirely on AWS; we retired our last data center dependency just a couple of years ago, which is awesome. and as you continue to grow and scale up, the engineering leadership within Naviance really starts to focus on this concept of: how do we maintain high developer velocity at enterprise scale? how can we keep releasing software to our customers quickly, at high velocity, as our customer base has grown from tens of thousands to hundreds of thousands and into the millions of users?

ok. so i mentioned earlier that i'm going to talk here about moving from a monolithic system with a lot of history into a more service-oriented architecture. before i jump into that, i wanted to include this one slide, because anytime i say the words monolith and microservice, i tend to find there's a bit of a straw man or caricature that comes up, and i want to dispel that a little before i go on. when i bring up these terms, i kind of always hear: on the left i have this example of a monolithic application, and it's basically a jenga tower.

it's so difficult to touch that if you change any piece of it, the whole thing might fall over; super risky, stay away at all costs. and on the right i have the Netflix service ecosystem, from one of the talks by the Netflix team, which is kind of the opposite extreme: the most extreme example of service-oriented architecture, from one of the most complicated software systems out there today.

and sometimes people will trot this out as well to say: oh, service-oriented architecture, if you start breaking apart your monolith, this is what you're going to end up with, this crazy amount of complexity; it's going to be totally impossible to manage.

and before i start talking about how we break apart a monolith into services, i just want to underscore that i don't personally think there's a lot of value in those more extreme comparisons. my team's experience, and honestly my experience across my entire career, is that most software systems exist in some kind of hybrid state between those two architectures. usually there's a monolithic system, often the one you started with, and there's a trend over time, as your team size grows, to become more service-oriented. so i'm going to describe something that i think is a good fit for that kind of reality, and we'll talk a little about why that is. if we dive into what actually happens as a team looks at breaking apart a monolithic application, this slide tries to capture a bit of a heuristic, again based on my experience: in the beginning, when you're scaling up your team from small to maybe medium, adding more developers is fine. you increase the size of your team and your output increases somewhat linearly; you get a good bang for your buck.

but eventually you start hitting these inflection points, where you realize that having more and more people contributing to the same code base, the same monolithic system, starts to fall off in terms of the gains you're getting in output.

and under the hood, what i think most people who maintain a monolithic system realize, and what you can see pretty clearly, is that while you can continue to increase the size of your team, the individual productivity of each member of that team eventually starts to go down.

the reasons for this are probably the subject of their own hour-long talk, so i'm not going to go super deep into it, but i think it's hopefully somewhat intuitive to folks.

for example, just the fact that teams are going to be stepping on each other's toes: you'll have code conflicts, you'll have folks who want to push a dependency update in your monolith at the same time someone else wants to roll out a new feature. you get a lot of cross-communication and, basically, a lot of velocity reduction as the team size increases.

ok. so what does this mean for my team on Naviance? obviously it means we want to continue with this hybrid architecture and keep moving. but i think there are a couple of other interesting points to make here, especially this first one: if you believe, as i do, that there's this growing reduction in individual developer productivity, then there's an important implication, which is that you can actually realize pretty significant gains just by incrementally refactoring your monolithic architecture. if the trend is always worst at the tail end of the graph, then backing up the tail end of that graph a little bit is going to give you your biggest bang for your buck. you don't have to completely rewrite your monolith to get great gains; you can pick a specific part of your monolith, go there, do an incremental refactoring, and then figure out what to do next.

and the last point i'll mention here is probably obvious at this point, but what are we moving towards? what we want to end up with is a loosely coupled architecture, where independent, high-velocity teams don't have to do a ton of cross-team orchestration in order to deliver new features to our customers.

ok. all right. so if you're along for the ride this far, and if this sounds great to you, the next question is the actual important one: how do you get started? what are you going to do?

this is something that's easy to say, but of course there are key challenges when you're trying to ship and release an architectural refactoring.

the first one is something our team definitely believes in: we never want architectural changes, if we can help it, to prevent us from delivering new features and experiences to our customers. sometimes you hear this kind of sentiment; i've certainly worked at prior companies where this was the case, where if you're doing a big architectural refactoring, no one wants to touch the product during that time. they just say, ok, let's just focus on doing this architectural change.

my personal opinion is that i actually want to do the exact opposite: i want to use an architectural refactoring as an opportunity to deliver new functionality and new experiences to my customers. i'll talk a little later about why i think that's a good idea.

the other thing is that when you start thinking about doing this architecture work, you immediately start thinking about releases. for me, the maxim here is that i really want to avoid the quote-unquote big bang: i don't want to ship a brand new software architecture to my customers all at once. that's obviously going to be very risky, for a lot of the reasons peter described earlier. so how do we avoid that? the last two points i'll mention here are specific to this use case of wanting to touch an older system. older systems have less reliable regression testing, and they also have lots of data; in Naviance's case, 20 years of it in some places. for us, that represents a pretty significant concern, in the sense that you might not have great test coverage and you might have lost some domain understanding over the years. so when you release a new iteration, with a new architecture and a new data model, you might get surprised; you might learn something you didn't anticipate in your lower environments. and obviously you want to have a plan for that.

ok. so something my team has done over the years, which has effectively become a kind of blueprint for us for doing this type of architectural modification, again iteratively and opportunistically, is to focus on four key pieces, which we'll look at in the following slides. the first is the concept of a micro frontend. the second is a domain API, which is basically an independent back end that we break out. the next piece is a new database and a data migration strategy; and yes, we do do that. some teams i know might hesitate to move data, and as i mentioned on my last slide, data is hard.

but i really think that's super important here, because as i mentioned earlier, what we want is a loosely coupled architecture. if you don't have an independent data store, you're always going to have that area of tight coupling, and it's always going to be a risk for you, in some cases a pretty severe one. so we do that as well. and then the last piece that ties it all together is a very specific tiered release strategy, where we really think from the beginning about how we're going to roll this out to our customers, how we're going to effectively deliver this new architecture.

the diagram on the right: we'll see a drill-down into this in a moment, but at a high level, at a 10,000-foot view, what it shows is that we're using LaunchDarkly, and specifically feature flags within LaunchDarkly, to take our user base and cut them over from a legacy experience, imagine a 1.0 experience created a while ago, to our new user experience on our new architecture. we'll look a little more at how that happens in depth.

so going down the line of the pieces of that blueprint, the first is this concept of micro frontends, and this is an industry term you may have heard before. if you haven't, the high-level concept is that you want to take your user interface and break it out into a loosely coupled component, so that your UI is not a build-time dependency of your monolith.

for some people this may be obvious, of course you should do this; but i've definitely worked with teams in the past who built an awesome, loosely coupled service-oriented architecture, but when it came time to modify their user interface, well, we left that in the monolith; we didn't want to break that out.

if you do that, the risk is that you can really undermine the value proposition of your architectural migration, because you can basically be left with all of the roadblocks of shipping your monolith. your user experience, the thing you really care about, is still in your monolith's path to production.

micro frontends basically give you a way to pull that front end out and make it stand alone.

i don't have time in this talk to go through the details of exactly how we do this; it can be as simple as importing a JavaScript file at runtime instead of bundling it in your build process. if you're really curious about this, find me after the talk and i'm happy to share more details.

but if you think about how this works in terms of enablement and rollout, this diagram gives a quick look at it. our customers go into our monolithic application, log in, and reach a point where it may be time to show them a new UI, a new user experience. our monolith code is going to check a feature flag in LaunchDarkly; that's the "enabled" check right there. and if it's enabled for the current user, we're going to send them over to the micro frontend experience.

in addition, and this part is important, the monolith may also want to use that feature flag evaluation itself, for example to show a couple of new navigation elements; maybe there's a banner saying, hey, there's a new experience, come check it out. so the monolith may use this feature flag as well. but when it comes time to actually show the user the functionality, they're going to see a front end embedded within that monolithic application, and that front end is going to call our domain API on the back end.
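that routing decision can be sketched like this. the district names, view identifiers, and the `new_experience_enabled` helper are hypothetical stand-ins for a real LaunchDarkly flag evaluation with per-user targeting rules:

```python
# Hypothetical stand-in for evaluating a LaunchDarkly flag with
# targeting rules; here the rollout is keyed on the user's district.
NEW_UI_DISTRICTS = {"district-42"}

def new_experience_enabled(user: dict) -> bool:
    return user.get("district") in NEW_UI_DISTRICTS

def route_request(user: dict) -> dict:
    """Monolith decides, per user, whether to embed the micro frontend."""
    if new_experience_enabled(user):
        return {
            "view": "micro-frontend",   # standalone bundle loaded at runtime
            "api": "domain-api",        # the new independent back end
            "show_banner": True,        # monolith reacts to the same flag
        }
    return {"view": "legacy", "api": "monolith", "show_banner": False}
```

a single evaluation drives both decisions: which front end the user sees, and whether the monolith itself shows the "come check it out" affordances.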

one last thing i'll point out here: you can see that the domain API in this case and the monolithic application both talk to LaunchDarkly, and that's a key point.

it may not be immediately obvious why that's important. we want to make sure that as we're rolling out features to customers, both the monolithic application and the domain API are respecting those feature flag settings. basically, this helps prevent a scenario where your customers are comparing bookmarks with each other: hey, i just saw this new experience, can you get to this? oh yeah, sure, here's the URL. we still want to make sure that if you're not in the cohort that currently has access to this new experience, the domain API on the back end will block that traffic and say, hey, sorry, you have to come back later when this is available for you. and it's really important to have a central place where we can orchestrate this, because again, we want these to be loosely coupled architectures; we don't want our monolith and our domain API blasting tons of calls at each other to figure this stuff out.

we're very happy to offload that work to LaunchDarkly and just have it take care of that for us.
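the bookmark-sharing scenario is why the back end re-checks the flag for itself. here's a hypothetical sketch, with the cohort check standing in for the same LaunchDarkly flag the monolith evaluates:

```python
# The domain API evaluates the same flag the monolith does, so a shared
# URL can't bypass the rollout. Cohort membership here is hypothetical;
# in practice it would come from the flag's targeting rules in LaunchDarkly.
ROLLOUT_COHORT = {"district-42"}

def in_cohort(user: dict) -> bool:
    return user.get("district") in ROLLOUT_COHORT

def domain_api_handler(user: dict) -> tuple:
    """Back-end guard: block users who aren't in the rollout yet."""
    if not in_cohort(user):
        # User followed a bookmarked URL but isn't in the cohort yet.
        return (403, "this experience isn't available for your district yet")
    return (200, {"payload": "new-experience data"})
```

because both services consult one source of truth instead of each other, the flag check adds no coupling between the monolith and the domain API.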

i mentioned data and data migrations. this slide shows basically an AWS service diagram of how we do most of our data migrations for these kinds of projects.

we're at AWS re:Invent here, so i want to preface this slide by saying that AWS has probably released two new data migration tools by the end of this talk, and i certainly don't want anyone to assume this slide is the only way to do a data migration, or even necessarily the best architecture for one. but it's one that's been very effective for our teams.

effectively, we're big fans of serverless. what you have on the left is a data export from our monolithic database, where we're using AWS Lambda to create data export files and put them on S3. then we're using EventBridge to fire an event, and on the other side, in our domain-based application, there's an EventBridge listener and a Step Function orchestrating quite a lot of work, sometimes, to import that file.

And this might be the point, on the right, where you do some data model changes; you might be doing some transforms on your data. You might be building a new user experience where you want to transform certain data points, and so we're using Step Functions for that here. Again, there are lots of different database migration technologies; by all means, choose the one your team likes best. But the thing i really want to call out, first off, is that once again, as we saw in the previous slide, we have these two systems that we really want to be loosely coupled; there's a nice, clean event boundary between our export and import systems.

But we need the ability to make sure that the feature flag controlling whether this should be happening for you as a customer is respected equally on both sides of the equation. So we have this loosely coupled system, but we need a centrally organized way to manage the release and rollout, ideally on a per-customer basis. We use LaunchDarkly for that, and it works great. It's a central place where we can align all of these concerns: the micro frontend i talked about earlier, the domain API, and our data migration can all rely on LaunchDarkly to tell them whose data should be migrated and who should see this user experience, all in one place.
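The export/import handshake, with the flag respected on both sides, can be sketched like this. Everything here is a stand-in under stated assumptions: a list plays the role of the EventBridge bus, dicts play the role of S3 objects and the new database, and the per-tenant set stands in for LaunchDarkly's per-customer targeting:

```python
MIGRATION_ENABLED = {"tenant-a"}   # hypothetical per-tenant flag state
EVENT_BUS = []                     # stands in for EventBridge
NEW_DB = {}                        # stands in for the domain database

def export_tenant(tenant: str, rows: list) -> None:
    """Lambda-style export: write a file and emit an event -- if flagged on."""
    if tenant not in MIGRATION_ENABLED:
        return
    EVENT_BUS.append({"tenant": tenant,
                      "s3_key": f"exports/{tenant}.json",
                      "rows": rows})

def import_listener() -> None:
    """Step Function-style import: re-check the flag before importing."""
    while EVENT_BUS:
        event = EVENT_BUS.pop(0)
        if event["tenant"] not in MIGRATION_ENABLED:
            continue
        # data-model transforms would happen here
        NEW_DB[event["tenant"]] = [dict(r, migrated=True)
                                   for r in event["rows"]]
```

Checking the flag on both the export and import sides means neither system has to ask the other whether a tenant is in the rollout.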

And i mentioned before that there are different ways to slice and dice this architecture. But here is the thing i would call out, based on my team's experience, that really leads to designing robust data migrations and having high confidence when it's time to roll them out to your customers. The first line here is something we ask all of our teams when we're building these systems: can i migrate data for a single customer, or a segment of customers, at a time?

This may not be something your team initially thinks of when building this kind of data migration. I think the first impulse is: well, i've got this database here, i'm going to move all the data over there, and hopefully that'll be it and i'll be done. But in reality, what often happens, especially on older systems, is that you end up in a scenario where there are data peculiarities specific to certain customers. If a certain customer has been using the product for 15 years, and they've set up their reporting in a very specific way, or they're used to their metrics looking a certain way, it's very often the case that you might need to deploy a bug fix just for that single customer. So having this capability in your data migration, being able to rerun it on a per-tenant basis, baking that in, is really going to give you a lot of flexibility. Hopefully you'll never need it, but if you do, it's there, and it's a really great thing to have baked in from the beginning.

Another question is: can i rerun my data migrations easily and rapidly? The architecture i showed earlier, being keyed entirely on events, along with this data file concept, does make that easy for us. Another thing you really want to be ready for from the beginning: if we deploy a bug, and, say, a beta-testing customer sees some issue and reports it, and you're able to ship a change, you don't want to wait a week to then say, ok, we think we fixed it, reach out to the customer again and see if it works. You really want a quick turnaround time; that's going to give you high confidence that your fix actually works, and it's going to lead to a better customer experience.
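One way to get that quick turnaround, sketched here under the assumption that the source data is still available, is to make each tenant's migration an idempotent rebuild from source, so a same-day fix can be verified by simply rerunning it for that one tenant. The helper and field names are hypothetical:

```python
def migrate_tenant(db: dict, tenant: str, source_rows: list, transform) -> None:
    """Rerunnable: each run rebuilds the tenant's data from source wholesale."""
    db[tenant] = [transform(row) for row in source_rows]

# first run ships with a bug: grades are left as strings
source = [{"student": "s1", "grade": "90"}]
db = {}
migrate_tenant(db, "district-42", source, lambda r: dict(r))

# a fix lands the same day; rerun the migration for just that tenant
migrate_tenant(db, "district-42", source,
               lambda r: dict(r, grade=int(r["grade"])))
```

Because the run replaces the tenant's data rather than appending to it, rerunning never leaves partial or duplicated state behind.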

The bullet points at the end here reinforce some of the things i've said: when you run this type of re-architecture in production, you really want to expect the unexpected, and you want an architecture that's flexible enough to give your team some powerful tools to say, ok, take this person out of the data migration, put them back on the old experience; i deployed a bug fix today, so now remigrate their data, and now put them back in. This is something we plan for from the beginning.

All right. So i've mentioned this release strategy a couple of times, this concept that we're slicing and dicing the rollout. What does that actually look like for us? This is actually something that all of PowerSchool generally follows: we tend to release our software, not just big architectural refactorings but features in general, following this pattern.

The first line item there is something we call internal evaluation, and peter mentioned this awesome thing called testing in production. If you don't know what that means: the ability to have an isolated space in your production environment, away from your customers, where you can flip on a feature flag and test something, is something we've found tremendously valuable. Naviance is a multi-tenant SaaS application, which means that when we push our code, it's available to all our customers unless it's behind a feature flag. What that translates to for us is that we actually have, in our production environment, dummy school districts and dummy schools with completely fake, randomized data, purely for the purposes of our own internal testing. When we roll out a new feature, we can enable it in an internal-only school in production, and we can throw user acceptance testing as well as functional testing at it. The goal is to really improve your team's confidence that when you flip it on for the first real customer, everything should be exactly the same: it's the same exact infrastructure, the same exact code, and you've already validated it.

The next step after internal evaluation is beta, or early access. This is where we usually get a collection of early adopters: schools and districts who are really interested in getting their hands on some new functionality we've rolled out. This is a great thing to have, because these are customers who are usually highly interested in what you're doing; they're going to give you a lot of feedback, they really want to work with you, and they may be invested in the change you're making, and so on.

If you remember, earlier I mentioned that I don't like to make architectural refactoring completely independent from product changes, and this is the reason why. It is very hard to get interested early adopters to try your new release if it doesn't have any new features for them. If it's just an invisible lift-and-shift of architecture, you're not going to get any early adopters; they'll wait as long as they can, and you'll end up doing everything behind the scenes. It's really nice to have these interested beta and early access customers, and it's also the right thing for the customer, of course, because they're getting more features and more functionality.

The third item: after we've exited this beta or early access period, we enter controlled availability, and this is where we use the power of LaunchDarkly at its full potential. We create user segments, usually for schools and districts, although LaunchDarkly gives you a lot of capabilities here: we've also done state-based rollouts, and we've even done individual-tenant targeting. Basically, we create a go-to-market plan where we roll this functionality out segment by segment, switching from the legacy experience to the new one. There are a ton of advantages to this, but if you think about ed tech in general, you can probably see why it's especially important for us. Schools and districts work on a seasonal cycle, and it isn't the same across all of them; a lot of them have very specific plans at very specific times of the year for when they want to use your software. Being able to accommodate that, to work with our customers and ship new features at specific times, has proven incredibly valuable. It lets us have a release that really meets our customers' needs, with the nice side effect of rolling out code slowly, so we can get on top of bugs faster in a smaller, more controlled population before we go fully live.

The last step here is general availability. This is the end of controlled availability, when you've effectively flagged on the new experience for all of your customers. It may seem like the job's done at that point, but there's one key thing I really want to highlight, because it's a lot of value add: when you hit general availability, a feature-flagged architecture allows you to do quick and easy rollbacks.

So what happens is, when you're in general availability, there will be customers who may not have seen your new experience yet. There may be schools and districts who were simply too busy; your team tried their best, but those customers only get to it a few months in. And as soon as they do, they raise an alarm: hey, something here isn't working for us, this broke our reporting, we need some help. Having that feature flag in place is not the first line of defense, but it's a great last line of defense. If you absolutely have to roll something back, restoring the old behavior for maybe a week while you push a bug fix out, and then flipping back over, is still something we take advantage of in very rare cases, even in general availability.

OK. This slide gives a couple of statistics about what moving to this new architecture looks like for us; I thought it might be helpful.

The goal on the engineering side is really to build out, in this new architecture, a slimmer distributed system where we can take the opportunity to level up our technology practices. You see a lot of things there that are consistent across our applications of this pattern. A loosely coupled system gives us the ability to make significant engineering changes, which is great for our velocity, great for developer happiness, and great for our ability to deliver features to our customers.

OK. So, having applied this pattern over several years now, there are a couple of things I'd call out about why it has worked for us.

First off, the fact that it's fundamentally iterative is huge. We're not trying to boil the ocean and reduce a monolith into ten services in one giant project. This is something you can do opportunistically, piece by piece, identifying the right domain, the right module within your monolith, not just for your engineering team but also for your business. And it's very flexible: you're not locked into one giant five-year project or something crazy like that.

Another important lesson for us: when my team started thinking this way, we realized that a good software architecture was in fact synonymous with a strong rollout plan. Before we really went all in on feature flags and started thinking this way, there was a kind of unfortunate isolation, where you've got your architects and engineers designing the software system, and then someone over here figuring out the go-to-market strategy and how to tell customers this is coming. Those two worlds don't interact all that much, right?

By doing it this way, what we realized is that engineers actually want to be thinking about the rollout process, especially if it gives us higher confidence in our releases, an easier path to bug fixes, and better ways to earn customer satisfaction. People are happy to work this way.

Also, thinking about this from the beginning forces us to design higher-quality architectures, because from day one we're asking: how will we do a rollback? What if we migrate the data in this way? Oh, if you do it that way, you won't be able to roll back. Having those kinds of design conversations early and up front means you don't have to have them at the eleventh hour when someone calls you with a massive production bug.

So this has really become synonymous with our architecture design process. And the last point here: I can't say enough about internal testing in production and how much extra confidence it gives you. Of course, we also do full automation testing in a lower QA environment, end-to-end integration testing, and all of that. But in my opinion there is no substitute for being able to say: when I turn this on for a customer, all I'm doing is flipping a feature flag. I'm not pushing new infrastructure, I'm not deploying new code. The decoupling of deployment from release is a tremendous value add, and it really improves your confidence when delivering to your customers.

OK. So I mentioned that my team has been doing this for a while on Naviance, since 2019, which is when we started with LaunchDarkly; we started this type of re-architecting a little bit after that. The team has built out nine domain services, specifically using the pattern where we pull modules out of our monolithic application. We've of course done other development since then, but this is really focused there.

And I'll call back to earlier, when I pushed against the caricature of monolith versus microservice. I don't think this means monolithic architecture is bad, and the proof is that our monolith commit rate has actually doubled since we started this journey. In fact, it's a little better than that: it's up even more year over year.

The point I want to make is that doing this kind of incremental refactoring has improved our ability to work on our monolithic codebase and to push code and features out to our customers. Maybe that seems a little unintuitive at first, but in practice I don't think it's surprising at all. Basically, we've taken features that would have been distractions for the teams who understand our monolithic codebase best,

moved them into loosely coupled architectures where other teams can work on them independently, and now our teams are really focused on improving the code that is in our monolith, not orchestrating massive release discussions or handing out best-practices guidance to other teams.

So they're just able to do more work. Alongside that, one statistic from 2022: we actually deleted 150,000 lines of code from our monolithic codebase as a direct result of these refactorings, where we were able to pull a system out and make it independent. I am 100% sure that improves our commit rate one way or another: less code you have to think about, and it's easier to keep the system in your head.

These are things we're really happy about. And of course, as I've mentioned, LaunchDarkly has been a key part of this process for us. They've been there every step of the way, and we use LaunchDarkly for more than just this type of architectural change.

We're up to hundreds of feature flags that we're using to manage all kinds of different things we're shipping, and we use user segments heavily: last I checked, we had over 20 for our most recent release.

OK, all right. This is my last slide before I toss it back to Peter for a little demo.

I would definitely be remiss if I didn't talk about LaunchDarkly, this kind of developer experience, and team sentiment. So what you're seeing here is the outcome, and I apologize, the text is probably a little small, so I promise I'll read it.

What you're seeing here is the outcome of our annual developer sentiment survey. This is something we do on Naviance: we send a survey out to all of our engineers every year asking them to rank all of the tools we're using. That's not just vendor tools like LaunchDarkly; it also covers things like the AWS CDK and TypeScript, everything. Every year since we started doing that survey, LaunchDarkly has ranked in the top three, and sometimes number one, in developer sentiment for tooling and experience.

I could talk for a long time about why I think that is, but the important point, the reason I share this, is that I'm sure it has been a critical factor in our success here. The ability to have released these domain services, completed these refactorings, and shipped them to our customers successfully comes down to the fact that our engineers enjoy working this way.

They like building systems this way, they like having access to LaunchDarkly, and they like being given a powerful tool that gives them a lot of capability.

And that's been reflected, like I said, every year we've done the sentiment survey. All right. With that, I'm going to pass the baton back to Peter, who is going to give a demo of what this actually looks like: an example of doing a migration using LaunchDarkly. And I think, Peter, you're even going to show some functionality that was newly released this year, so I'm actually going to be watching as well. I'm very interested.

Thanks, Adam. All right, I always love seeing this stuff in the real world. That's always good. So we're going to flip over our monitors here, and you're going to see a very cluttered screen. That's going to make sense in one second, I promise.

I went ahead and recorded a video, because I'm not tempting the live demo gods on this one. So what we're looking at here, in this convoluted mess: we have a production cluster of a Kubernetes environment, that's what's running in the top-left terminal; we have a testing cluster; we have the applications running in each of those clusters, those are the smaller windows; and then we have our LaunchDarkly environment.

Now, one of the things Adam talked about was this idea of releasing incrementally, having that release process. One of the things we heard from customers is: if I'm going to release a new feature, how do I know whether I'm releasing it in my test environment or my production environment? How do I keep track of that in LaunchDarkly, and how do I keep track of the statuses?

You'll notice here that I'm in my test environment, where I'm going to release this new capability, and you may be seeing something you haven't seen before: this little test cluster stage. Remember, I talked about that test cluster; then we have production, and then we have this releasing view. At Galaxy, we introduced a new capability called release pipelines, and release pipelines are a way for you to track the status of your flags across the different environments you release your software through.

What you're seeing in this big long list is the custom release pipeline that I created, our very fancy release pipeline. It has two stages, our testing cluster and our production cluster, and those are all the different flags you saw on the home screen, along with their current statuses.

So we can see whether we've actually released a flag into an environment, what percentage we've rolled it out to, and which variations we're serving. This gives us the visibility to know when features are getting released to our different areas.

We're going to click into our release-new-website flag here in just a moment, when past Peter gets there. Again, we're in our test environment, and we're going to release our new capability. We release our new website, we see it pop up at the top there, and it's for a new to-do list we're going to be using, where we say we're going to store more tasks in the cloud.

But you'll notice the bottom cluster hasn't changed yet, because we're not actually releasing the software there. We're using a different SDK key; that one is tied to our production environment. This lets me release to the testing cluster, make sure everything's working properly, and when we get the QA thumbs-up, we hit Complete.
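The "different SDK key" point is worth making concrete: each cluster's deployment is wired to its own LaunchDarkly environment via its own key, so a flag change in Test can never leak into Production. A minimal sketch, assuming the keys arrive via environment variables (the variable names and key values here are made up for illustration):

```javascript
// Pick the SDK key for whichever cluster this deployment is running in.
function sdkKeyForCluster(env) {
  const keys = {
    test: process.env.LD_SDK_KEY_TEST,
    production: process.env.LD_SDK_KEY_PROD,
  };
  const key = keys[env];
  if (!key) throw new Error(`no LaunchDarkly SDK key configured for "${env}"`);
  return key;
}

// In the real app, this key is what the server SDK would be initialized
// with, e.g. something along the lines of:
//   const ld = require('@launchdarkly/node-server-sdk');
//   const client = ld.init(sdkKeyForCluster(process.env.CLUSTER));

process.env.LD_SDK_KEY_TEST = 'sdk-test-xxxx'; // placeholder for the demo
console.log(sdkKeyForCluster('test')); // "sdk-test-xxxx"
```

Because the key is per-environment configuration rather than code, the same container image runs in both clusters and still sees the right flag values.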

Now we're telling the rest of our teammates that we know production is ready, but you see the flag is still off in that production environment. This is a visibility thing. This is where we can have that communication so that all teams are aligned, like we showed previously, or like Adam was talking about with his team.

So now we turn this on for our production cluster, and we say the website is ready. I know it looks really fancy in a very small box, but trust me, it looks cooler later on.

We turn that on, we see it get released, and we say: OK, good, we're seeing the same behaviors, the same information we saw previously. So we're ready to mark this as released, and we get a nice little message that says, hey, congrats, we've released, and it tells you when the feature was released.

So using release pipelines, we know exactly what the status of our flags is in each of our environments, and we can make sure our features are getting released the correct way. This also lets us make different changes and run different testing variations simultaneously.

Say we go back to our test environment. You know, we're at re:Invent, we have this wonderful purple-orange gradient, maybe our text should look like that. But again, notice it changed only in our test cluster, not in our production cluster. That's the separation piece.

So as you go through your release process, you can use different LaunchDarkly environments to release features. Let's clean that up a little bit; now we'll have just a dual screen, so you don't have to squint as much.

We're in our test cluster; again, we see our nice little re:Invent gradient text. One of the other things Adam talked about was using information about your users to control access to new features.

We have this new web form we're going to launch. The one we have right now is a very basic to-do list, so let's see the current state. It's very complex, so be careful to follow along: this is our to-do list, and our first task is to say hello to re:Invent. So "hello re:Invent" is the very first task we want to complete. Then we cut it out of the top box, paste it in the other one, and we get a nice little check mark. Very fancy. That seems to be working well.

But I think we can probably do a little better; we can find a different system that gives us more. So we're going to turn on that new web form to see what we can improve. Maybe we should be moving our back end so that we're actually storing those tasks somewhere else, but we want to make sure it's for the right users.

So we test it first with a cohort. We can create rules using those user segments and say that only our early access users get to see the new web form. We turn that on, then go back to our application here. Hm, got to click out of it. I open this up, and notice I put in a different ID; you see all the information is still there.

We're all still sharing the same web form. We'll refresh our page here, excuse me. Then I'm going to use my admin login, because I'm one of the early access users. Now I have a different web form, and this web form is actually talking to a different database. So I'm going to open up a different tab here to show you our new database back end, which is a local Postgres database running in that same Kubernetes cluster.

And now we have our completed checkbox, our ID, our timestamps. So again, we add our previous task of saying hello to re:Invent, submit it, grab our data, and there we go. Now we're storing our information in the Postgres database instead of just using a couple of text fields in the box. We think that's probably a better thing, so we can check that task off at this point.

Now, though, like Adam was talking about, we may want to move our systems eventually. Right now we're using a Postgres database that lives in that same Kubernetes cluster. It's working fine, but pods tend to crash, and we may not want our data stored in something that fragile. Maybe we're looking for something more scalable, so we're going to switch over to, say, DynamoDB.

So we'll show that we have DynamoDB as an option. In the past, what we could have done is use different flag logic to change our API. Just to demonstrate: we check our to-dos in Postgres, and that shows the information is stored there; if we switch over to the Dynamo table, it won't actually be there.

So here's the idea when we initiate the migration. We now have a new flag type called migration flags, and migration flags use the LaunchDarkly SDK to create stages of migration. This particular one is a four-stage migration; we offer two-, four-, and six-stage migrations, and each variation does a different thing. In the Off variation of our four-stage flag, we use the old system entirely, so we're storing in Postgres. Shadow means we start writing data to our new database, but we're still only reading from the old one.

Live means we start reading from our new database, but we keep writing to our old one as well, just for safety. And finally, Complete is when we fully shift everything over. The advantage of doing it this way is that we can run consistency checks, get latency results, and find out about error rates. It gives us much more control over what our users are seeing, and we can do it as a percentage rollout.
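The four stage semantics just described (Off, Shadow, Live, Complete) amount to a small routing table: the current stage decides which store receives writes and which store serves reads. Here's a self-contained stand-in in plain JavaScript, not the actual migration-flag SDK, with two Maps playing the roles of the demo's two databases:

```javascript
// Stage routing for a four-stage migration:
//   off:      write old only,  read old
//   shadow:   write both,      read old
//   live:     write both,      read new
//   complete: write new only,  read new
function makeMigration(oldStore, newStore) {
  return {
    stage: 'off',
    write(key, value) {
      if (this.stage !== 'complete') oldStore.set(key, value); // old until complete
      if (this.stage !== 'off') newStore.set(key, value);      // new from shadow on
    },
    read(key) {
      const src = (this.stage === 'live' || this.stage === 'complete') ? newStore : oldStore;
      return src.get(key);
    },
  };
}

const postgres = new Map(); // stand-ins for the two databases in the demo
const dynamo = new Map();
const migration = makeMigration(postgres, dynamo);

migration.stage = 'shadow';
migration.write('task-1', 'hello re:Invent');
console.log(postgres.get('task-1')); // "hello re:Invent" (written to both)
console.log(dynamo.get('task-1'));   // "hello re:Invent"
console.log(migration.read('task-1')); // still served from the old store

migration.stage = 'live';
console.log(migration.read('task-1')); // now served from the new store
```

Because Shadow writes to both stores before Live flips reads over, the new store already contains everything written during Shadow by the time it becomes authoritative, which is exactly what the demo shows next.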

These cohorts are groups we can define by email address, school district, or whatever context kind you want to use. So now we'll change this so that we're writing to our new database but still reading results from the old one. I'll update this, and we'll go back in and add a couple more tasks. OK?

So if I refresh the data here, you'll see "starting our task" in Postgres; that's the store that's currently serving us. I'll enter another one, adding another to-do, and again we see it show up only on the Postgres side. The reason, again, is that we're writing to the new database but only retrieving from the old one. I'm putting these extra tasks in because we're going to illustrate that point in just a moment.

Once we see there's no information showing from the other side yet, we go back to our migration flag. One of the things I love about these migration flags is that you get metrics automatically: we track things like errors and latency, and you can add your own custom consistency checks. I don't have one here, because we're deliberately writing different data. And unsurprisingly, calling the local Kubernetes database is a little faster than going out to the cloud, but that's OK, we knew that would happen, and we're seeing a lower error rate.
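A custom consistency check like the one mentioned here is conceptually just "read from both stores and compare," sampled while both are being written. A hypothetical sketch, with Maps again standing in for the two databases:

```javascript
// Compare a sample of keys across the old and new stores and report the
// agreement rate plus any keys that disagree (e.g. not yet backfilled).
function consistencyCheck(oldStore, newStore, keys) {
  let matches = 0;
  const mismatches = [];
  for (const key of keys) {
    if (oldStore.get(key) === newStore.get(key)) matches += 1;
    else mismatches.push(key);
  }
  return { rate: matches / keys.length, mismatches };
}

const pg = new Map([['a', 1], ['b', 2]]);
const ddb = new Map([['a', 1], ['b', 99]]); // 'b' hasn't been backfilled yet
console.log(consistencyCheck(pg, ddb, ['a', 'b'])); // { rate: 0.5, mismatches: [ 'b' ] }
```

A dropping mismatch rate over time is the signal that it's safe to advance the migration to the next stage.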

Now we're ready to move this to Live and see what happens when we switch over our systems. We update the flag, add another task, and retrieve the data, and you'll notice the two tasks we wrote earlier are in the Dynamo table. They were written to the new database during the shadow stage, and now that we've moved to Live, we're retrieving that same information from there.

This is how, again, we make sure it's a smooth transition, and if anything goes wrong, we can always roll it back afterwards. Now that we feel pretty good about it, we'll add a to-do where we say goodbye to Postgres; eventually we'll be able to deprecate that pod. We refresh this, and you might be surprised to see: oh no, our data is gone. That's to be expected: we called the same API, but we're no longer reading from our Postgres database, only from Dynamo.

Now, the information is still there; if I brought up psql I could show you there's actually still data in Postgres, but we're not pulling it anymore. So if everything looks good, we can switch this over to 100% Complete and save. Then we flip the table view, we say we're fully on Dynamo, and we see that everything is working correctly.

I know this is not the craziest migration story that has ever happened, but it illustrates the point: we're trying to make this simpler. Now we're going to go into the codebase to prove that I didn't just store everything in a text field. In the application, this is what it looks like: this is the Node server file, and we're using the Node server SDK.

What we've added is this new capability where we set a config definition with these LD migration options. Within that, there are four function fields we can use: writeOld, writeNew, readOld, and readNew, the four operations you'd expect as part of those stages. We define the logic for what should happen when each of those functions gets called.

Then we initialize this migration as part of the SDK initialization. This block is also where we can add things like consistency checks and make sure our error rates are being tracked. Latency tracking and error tracking default to true, but you can set them explicitly, and this is where any other custom configuration goes.

Then, as you see, we initialize the client like we always have, and we initialize the migration piece. Now, in our route logic, instead of doing all that branching ourselves, we just call migration.write with the name of the migration flag we're using and the information we're sending, and the SDK automatically knows which of those functions it should call. Same thing with migration.read.
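Putting the pieces together, the route logic described here, four handler fields plus a single migration.write or migration.read call, looks roughly like the sketch below. The dispatcher is a self-contained stand-in for what the LaunchDarkly Node server SDK does when it evaluates the migration flag; exact SDK function names and signatures may differ, so treat this as illustrative, not as the SDK's API:

```javascript
const oldDb = new Map(); // stand-in for Postgres
const newDb = new Map(); // stand-in for DynamoDB

// The four handler fields named in the talk: each one knows how to talk
// to exactly one backend, and nothing else.
const migrationOptions = {
  writeOld: async (task) => oldDb.set(task.id, task.text),
  writeNew: async (task) => newDb.set(task.id, task.text),
  readOld: async (id) => oldDb.get(id),
  readNew: async (id) => newDb.get(id),
};

// Stand-in dispatcher: evaluate the migration flag for this context
// (here faked by stageForContext), then route to the right handlers.
function createMigration(options, stageForContext) {
  return {
    async write(flagKey, context, payload) {
      const stage = stageForContext(flagKey, context);
      if (stage !== 'complete') await options.writeOld(payload);
      if (stage !== 'off') await options.writeNew(payload);
    },
    async read(flagKey, context, id) {
      const stage = stageForContext(flagKey, context);
      return stage === 'live' || stage === 'complete'
        ? options.readNew(id)
        : options.readOld(id);
    },
  };
}

async function main() {
  // Route logic only ever sees migration.write / migration.read.
  const migration = createMigration(migrationOptions, () => 'shadow');
  await migration.write('todo-backend-migration', { key: 'admin' }, { id: 't1', text: 'goodbye Postgres' });
  console.log(await migration.read('todo-backend-migration', { key: 'admin' }, 't1'));
  // served from the old store while in shadow
}
main();
```

The payoff is the same one described in the talk: the route handlers never branch on the migration state themselves, so advancing or rolling back a stage is purely a flag change.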

So that's how we do the migrations. I'm really excited about this. I know that was a very quick overview; if you have additional questions, feel free to come find us. The last thing I have to do in this demo: it wouldn't be a solid demo without a shameless promotional plug, so of course we're going to use a feature flag to set that up. We turn our flag on, and it tells you to come see us at booth 784. I had to do that.

Ultimately, though, that is how our flags work. We're trying to make it easier to follow your own release processes and easier to migrate systems. If you have additional questions on how that was all set up, definitely come find us. And it looks like we're going to have a little bit of time for some Q&A, which is great, a little shock there.

All right. So as Toggle's thumbs-up said, we are at booth 784. Thank you all for watching. We can take a moment to answer any questions you might have, and if we drift a little bit over, we're happy to meet you outside.
