Let's say you want to delete a column from a database table. You would first push code that no longer uses that column, then wait until that push is finished, and then delete the column.
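The ordering matters: drop the column first and the old code still running on some servers will crash. A toy sketch of the sequence, using an in-memory "schema" set and made-up handler names (nothing here is from a real system):

```python
# Illustrative only: a set standing in for the table's columns.
columns = {"id", "email", "legacy_flag"}

def handle_request_v1():
    # Old code still reads the column; dropping it now would break this.
    assert "legacy_flag" in columns

def handle_request_v2():
    # New code ignores `legacy_flag` entirely.
    assert "id" in columns

# 1. Push the new code everywhere and wait for the rollout to finish.
handler = handle_request_v2

# 2. Only then is the column truly unused, so it is safe to drop.
columns.discard("legacy_flag")
handler()  # still works: nothing left reads the dropped column
```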
If you want to rename a column, it's trickier. The right way to do this would be roughly:
- Create a new column with the new name.
- Push code that duplicates writes to both columns, but still reads only from the old column.
- Run a query to copy all data from the old column into the new column. At this point the columns are identical and will be maintained as identical because of the duplicate writes.
- Push code that switches reads to use the new column.
- Push code that stops the duplicate writes and just writes to the new column.
- Drop the old column.
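The steps above can be sketched as a minimal simulation, with an in-memory "table" of dicts standing in for the database; the column names and step functions are illustrative, not code from the post:

```python
# Step 1: the "table" with the old column; the new column doesn't exist yet.
table = [{"old_name": "alice"}, {"old_name": "bob"}]

# Step 2: duplicate writes hit both columns; reads still use the old one.
def write(row, value):
    row["old_name"] = value
    row["new_name"] = value

def read(row):
    return row["old_name"]

# Step 3: one-off backfill copies existing data into the new column.
# Dual writes keep the two columns identical from here on.
for row in table:
    row.setdefault("new_name", row["old_name"])

# Step 4: a redeploy switches reads to the new column.
def read(row):
    return row["new_name"]

# Step 5: another redeploy stops the duplicate writes.
def write(row, value):
    row["new_name"] = value

# Step 6: drop the old column.
for row in table:
    row.pop("old_name", None)
```

Each redefinition of `read`/`write` stands in for a separate code push; in a real rollout every server must be running one step's code before the next step starts.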
This generalizes to any kind of data migration. You probably wouldn't go through the overhead for this just to rename a column, but imagine changing from one data format to a more compressed representation: it's the same process.
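For a concrete instance of that format change, here is a hypothetical sketch where the old column stores JSON text and the new one stores a zlib-compressed encoding; during the dual-write window every write produces both (column names and helpers are invented for illustration):

```python
import json
import zlib

def encode_old(value):
    # Old format: plain JSON text.
    return json.dumps(value)

def encode_new(value):
    # New format: zlib-compressed JSON bytes.
    return zlib.compress(json.dumps(value).encode())

def decode_new(blob):
    return json.loads(zlib.decompress(blob))

# During the dual-write window, each write stores both representations;
# reads still use the old column until the backfill finishes.
row = {}
payload = {"user": 1, "tags": ["a", "b"]}
row["data"] = encode_old(payload)      # old column, still read from
row["data_v2"] = encode_new(payload)   # new column, write-only for now

assert decode_new(row["data_v2"]) == payload
```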
Do all new engineers know this? No, they learn it as necessary. Often we just tolerate some errors during the restart window and don't go through a process like this.
Doesn't this make developers' lives difficult? Occasionally, but there is no way around it. This is a consequence of running a service that never stops. It's not feasible to make code deployment atomic when you have hundreds of servers and don't want downtime. And if a data migration will take time to run, you'd need this process even if you could deploy code atomically. I'd also suggest that it's not that difficult: because of all the continuous deployment infrastructure, our pushes are really lightweight. We always have the option of taking a service down to do a migration and avoid some of this overhead, and we have done that on a few occasions.
Doesn't this make the code look dirty? Temporarily, while the process is happening, it does; but at the end, a few hours later, the code is back to being clean.