Read through Amazonian Agility (https://medium.com/frontira/amazonian-agility-e3720ff004f7), which describes how the company Amazon applies agile principles in its work. Pay special attention to the alignment matrix towards the end of the article. A text version of this matrix is available in the Word document CS 250 12 Principles of Agile Business Manifesto Matrix Text Version. For additional information about Amazon’s process for developing software, check out the optional reading, How is Software Developed at Amazon? (a transcript is available as a Word document).
Then address the following:
· What is the “Two Pizza Rule” and how did it help Amazon?
· From the alignment matrix chart, choose one of the 12 agile principles and explain how Amazon applies that principle in their work.
· Choose a second agile principle from the alignment matrix chart. How would you apply this principle to your SNHU Travel project?
12 Principles of Agile Business Manifesto Matrix Text Version
The following is a text version of the matrix of the 12 Principles of Agile Business Manifesto as they apply to Amazon. Each list item below represents one of the principles. The bullets underneath each list item refer to Amazon’s application of the principle.
1. Primary focus is on customer need facilitated by constant improvement of customer experience.
· Customer Review Tool
· Two Pizza Team Model
· Customer Obsession
2. Strategies and tactics are highly adaptive, responsive, and change is welcomed.
· Risk Acceptance
· Startup Mentality
· Flexible Technological Architecture
3. Iterative, sprint methods deliver customer value through continuous progress and momentum.
· 10-15 day product cycles
· Customer Review Tool
· Startup Mentality
4. Effective cross-functional collaboration with a clear intent is supported.
· Filtered Customer-based data
· Kaizen Method
· Lack of Silos
5. Motivated individuals, empowered teams, flexible, trusted working environment and comfort with failure.
· Kaizen Method
· Value opinions of all employees
· Risk Acceptance
6. Bureaucracy and politics are minimized, co-location and face-to-face communication maximized, wherever possible.
· Two Pizza Team Model
· Startup Mentality
· Customer Focus
7. Working outputs are the optimum measure of progress and success.
· Value quality of products
· Value quality of customer experience
· Ability to minimize price and delivery times
8. Support relentless and sustainable innovation and progress. Change is constant, and the pace never slows.
· Continuous support of innovation
· Experimentation seen as critical to success
· Internal Reflection
· Kaizen Method
9. Technical excellence and good design are central to maintaining pace and agility.
· Agile Architecture
· Lean Cloud
· API
· Enterprise Service Bus
10. Minimize wasted effort, duplication and resources.
· Lean Cloud
· Agile Architecture
· Two Pizza Team Model
11. The best results emerge from small teams with a high degree of autonomy.
· Two Pizza Team Model
· Employee Empowerment
· Kaizen Method
12. Continuous improvement is achieved through embedded reflection time, behaviors and cultures that support learning.
· Kaizen Method
Continuous Improve
CS-250 Fireside Chat: DevOps at Amazon with Ken Exner, GM of AWS Developer
Tools – AWS Online Tech Talks CC
[00:00:00.00] [MUSIC PLAYING]
[00:00:10.90] AARON SCHWAM: Hi. Thank you so much for
joining us today for our Fireside Chat on DevOps at Amazon. My name is Aaron
Schwam. I’ll be your host. I lead product marketing for the AWS developer
tools. With us today is our special guest, Ken Exner, who is the GM of the AWS
developer tools. Ken, thank you for joining us.
[00:00:25.93] KEN EXNER: Thank you, Aaron.
[00:00:26.63] AARON SCHWAM: Can you please introduce
yourself for the audience.
[00:00:29.21] KEN EXNER: Sure. Hi. I’m Ken Exner. I manage
developer tools for Amazon, which includes both internal developer tools, so
everything we use inside of Amazon for software development, as well as
external developer tools, the developer tools that are available for AWS
customers.
[00:00:45.96] So on the internal side,
it’s all the things that software developers at Amazon use, from source code
management systems, to testing tools, deployment tools, some of the operational
tools that we use. And this is across all of Amazon. So not
just AWS, but the retail business and AWS.
[00:01:03.83] On the external side, it’s many of the same
tools that had been externalized for AWS customers, as well as tools like
Cloud9 (an IDE), the SDKs, Command Line tools, Elastic Beanstalk, X-Ray– a lot
of the developer-focused services in AWS.
[00:01:21.83] AARON SCHWAM: Awesome. So let’s actually start
on the internal tools, the things that people don’t see or hear about as much.
So as the owner of the internal tools, how does that affect your day to day in
terms of do you have any influence over the way Amazon delivers software, how
we train people, et cetera?
[00:01:39.23] KEN EXNER: OK. Sure. So internally, we refer
to this as builder tools. So externally, we’re known as developer tools. But
inside of Amazon we’re affectionately known as builder tools. Amazon likes to
refer to people that work at Amazon as builders. So we are builder tools.
[00:01:56.61] In addition to providing the tools that we use
across Amazon for how we develop and release software, we’re also responsible
for how we onboard, train, communicate, and continue the education of
developers at Amazon. So for example, we work with developers in onboarding, so
that when they start at Amazon they go through an SDE boot camp, which is sort
of an immersive training program that allows them to get up to speed with how
to develop the Amazon way very quickly– how to use our tools, how to
understand what the Amazon processes are like. And this is sort of an immersive
two-day boot camp that is meant to be intensive.
[00:02:37.61] We’re also
responsible for a lot of the developer training and internal conferences. So we
have an internal developer conference called DevCon, which is kind of like our
internal re:Invent. It’s where we communicate with developers. We do it three
times a year in different locations around the world, because we have multiple
locations. But it allows us to continue the ongoing training and development
and communication with developers at Amazon.
[00:03:04.40] AARON SCHWAM: Awesome. OK. So keeping that
builder tools hat on, I’d like to start with a story some of our customers are
familiar with, but we at least want to start with kind of the Amazon story for
how Amazon embraced DevOps. So kind of at a high level, can you walk through
what that looked like, and then we’ll dive in.
[00:03:21.41] KEN EXNER: Sure, sure. So the story usually
begins like in the early 2000s– so 2001, 2002. Everyone sort of knows Amazon
as being a DevOps culture. But this wasn’t always true. Back in the early
2000s, we were more of a traditional organizational structure.
[00:03:40.20] We had more of a
functional hierarchy, a functional matrix organization. We had development
teams. We had testing teams. We had operational teams. And they all reported up
to their leadership, so different silos of an organization.
[00:03:57.39] And we also had more
of a monolithic architecture. So not like what we have today, but it was more
of a monolithic style architecture. And what I mean by
that is that it was one big deployable unit. So it was– basically it was C–
it was a Perl Mason code on top of C++ on top of an Oracle database.
[00:04:18.50] And we had a bunch of developers all
clobbering away at this system– it was called the Obidos system– and trying
to push out changes in a release train model. So if you’re not familiar with a
release train model, it’s basically– it’s a manual release process where you
get on the release train, you maybe do one a week, one every two weeks. You
manually QA everything, and then you get out one deployment a week or every two
weeks.
[00:04:45.44] It was a very
cumbersome, slow process. And one of the things that we started to realize is that
we were not able to move faster. We kept throwing more developers at the
problem, and that didn’t help. What we found is, you know, double the size of
the development team, and we weren’t getting double the productivity, because
we were bottlenecked by the architecture and by the organizational structure.
So that led to some of the changes that became the foundation for DevOps at
Amazon.
[00:05:14.81] AARON SCHWAM: OK. So that was kind of the
impetus, was we just weren’t seeing the return on investment of trying to move
faster. So at a high level, what were the things that we did to change our
structure and accelerate software delivery?
[00:05:30.45] KEN EXNER: Well, the big thing was– and I’ll
call it decomposing for agility. So we wanted to decompose not only the
organizational structure, but also the architecture. If you’re familiar with
Conway’s law, it states that the architecture will follow the organizational
pattern.
[00:05:50.26] So what we had was a
monolithic organizational structure, and we had a monolithic architecture. So
we wanted to decompose it and go to smaller pieces that could move in a more
agile fashion. So we did two things. We broke apart the monolithic
architecture, and we broke apart the monolithic organizational structure. So
I’ll talk about each of them.
[00:06:12.48] For decomposing the
monolithic architecture– back then, service-oriented architecture, SOA, was
the popular thing, the pattern du jour. Everyone was moving towards
service-oriented architectures. And we said, we’re
going to do the same thing. Amazon wanted to move towards a service-oriented
architecture.
[00:06:35.64] Today we would call
this a microservice, but essentially the same thing. We wanted to go from this
big monolithic architecture to a bunch of smaller services that each had their
own interfaces, and these services could communicate with these other services
via standard interfaces.
[00:06:53.10] The important part
here was that not only was it trying to decompose the architecture into smaller
pieces, but you had a decoupling that allowed them to communicate with each
other via standard interfaces. So I could deploy some changes to my service, as
long as I maintained a backwards compatibility with my interfaces– interface
was the contract– I could allow other teams to communicate with me. I could be
able to move as quickly as I wanted to.
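The idea of the interface as a contract can be sketched in code. The following is a minimal illustration, not Amazon's tooling: the `Operation` model, the `is_backward_compatible` helper, and the order-service example are all hypothetical names invented here. It assumes a change is safe for existing callers when no operation is removed and no operation starts requiring a parameter that old callers never sent.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Operation:
    """One operation in a service's published interface (hypothetical model)."""
    name: str
    required_params: FrozenSet[str]


def is_backward_compatible(old: List[Operation], new: List[Operation]) -> bool:
    """Return True if every caller of the old interface still works against the new one."""
    new_by_name = {op.name: op for op in new}
    for op in old:
        replacement = new_by_name.get(op.name)
        if replacement is None:
            return False  # an operation that callers rely on was removed
        if not replacement.required_params <= op.required_params:
            return False  # the new version demands a parameter old callers never sent
    return True


v1 = [Operation("get_order", frozenset({"order_id"}))]
# v2 adds a new operation and leaves get_order untouched: safe to deploy.
v2 = v1 + [Operation("cancel_order", frozenset({"order_id"}))]
# v3 makes get_order require a new parameter: existing callers would break.
v3 = [Operation("get_order", frozenset({"order_id", "region"}))]

print(is_backward_compatible(v1, v2))
print(is_backward_compatible(v1, v3))
```

Under these assumptions, a team can add operations freely and deploy on its own schedule, while a change like `v3` would need a new operation or an optional parameter instead.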
[00:07:20.97] So we broke apart the
monolith into these smaller pieces. But we also realized we needed to change
the organizational structure so that we could organize differently. Because we were, again, going from this big monolithic architecture
to a bunch of smaller pieces. How were we going to organize around these
hundreds or thousands of different services?
[00:07:40.62] We could no longer have this big
development team and a big testing team. So what we realized is we wanted to
have more of a decentralized organizational structure– small teams owning
small services. And if we had these small teams owning small services, they
could be fully decoupled and able to move independently.
[00:08:00.78] And that’s what took us down this path of
DevOps, moving towards these small teams that owned everything end to end,
owned these services end to end. We called them two-pizza teams at the time.
And this has become part of Amazon folklore. The idea of a two-pizza team was
simply that you had a team that was meant to be small– small enough that you
could feed them with two pizzas.
[00:08:27.33] AARON SCHWAM: Clearly, I wasn’t on these
teams.
[00:08:28.88] KEN EXNER: No, no. You’d be your own two-pizza
team. But initially, we were focused on small. Again, we were going from this
big architecture to these small agile pieces, so we wanted everything to be
small.
[00:08:41.86] But that wasn’t as important as having these
teams be autonomous and independent. We wanted them to be able to move
independently, and you wanted them to have ownership. You wanted them to have
end-to-end ownership of their service.
[00:08:57.72] They were not just
responsible for the development of that service, they were responsible for the
operations. They were responsible for the testing. They were responsible for
the backlog and the roadmap, talking to customers. They owned that service end
to end.
[00:09:11.43] So essentially, the transformation was big
monolithic architecture to a bunch of smaller services, and more of a
functional hierarchy organizational structure that got decomposed into a bunch
of small, autonomous, two-pizza teams.
[00:09:25.28] AARON SCHWAM: So clearly, one of the big
lessons here is decomposing for agility and the two-pizza teams. So what are the
other major learnings that we had in our own DevOps journey?
[00:09:35.83] KEN EXNER: Sure. In addition to decomposing
for agility, I think another big part of the DevOps story for Amazon is around
tooling and automation. When we used to have a centralized ops team, they would
do all the deployments. As we broke apart into a bunch of different teams
owning their own development, they had to own their own release process as
well.
[00:09:59.35] So when that
happened, this is essentially how the builder tools team at Amazon was formed.
We were essentially trying to automate that release process, because we had a
bunch of teams that now owned their own deployment and release for their own
services.
[00:10:13.37] So we wanted to make
sure that they were following best practices, and they didn’t have to rely on
the best practices of these skilled experts. So how are we going to make sure
that these teams can deploy like experts, but across thousands of different teams?
[00:10:27.70] And that took us down
the path of building tools to automate the entire release process. It started
with deployment. So how do you deploy to the retail site without any customer
impact? But it slowly kept going further until we had full CI/CD.
[00:10:43.48] AARON SCHWAM: Can I ask how you decided where
to start, and how you kind of went through the process?
[00:10:48.20] KEN EXNER: Well, I think like a lot of
businesses, we had started doing CI. So we automated the source and build
integration, and we just kept going beyond that. And we said, well, the entire
release process is a manual workflow, where you go from source to build to a
bunch of preproduction environments. You do a bunch of testing that’s manual, and
then you deploy.
[00:11:10.92] We said, let’s just
turn all of that into automation. Amazon loves automation. Everything we do is
about, how do you automate a fulfillment center? How do you automate different
processes? So this to us was just another process that could be fully
automated.
[00:11:26.55] Now at first it was a
little bit scary. People were a little bit apprehensive about this idea that
you would commit a change to a source code repository and then it would flow
automatically to production. But then what we started to realize was that
anything that we could manually do as part of a release process, that we could
manually test or manually inspect, we could put them into automation.
[00:11:52.34] And that could happen every single time, no
matter how big or how small that deployment is, so that you know
every single time you do a deployment we would do the same tests. And you would
have this increasing battery of different tests that would happen every single
time.
[00:12:08.88] Initially, this
started with a lot of the integration test suite– you would start putting it
as part of a release process. And then it started going across other types of
testing– browser and web-based testing, things like Selenium. Then it started
going into load testing, automated load testing, as part of every single
deployment. And we kept doing more and more.
[00:12:29.97] And slowly over time we started to get fairly
sophisticated with our release process, so that every single time we were
deploying, we were going through a number of different types of testing.
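The "same battery of tests on every deployment" idea described above might look roughly like the sketch below. It is an illustration only: the registry, the check names, and the placeholder lambdas are assumptions made for this example and do not represent Amazon's internal tools.

```python
from typing import Callable, Dict, List

# Each check takes a build identifier and returns True when the build passes.
Check = Callable[[str], bool]

# Hypothetical registry of checks; every name and body here is a placeholder.
RELEASE_CHECKS: Dict[str, Check] = {}


def register(name: str, check: Check) -> None:
    """Add a check to the battery that runs on every deployment."""
    RELEASE_CHECKS[name] = check


def failed_checks(build_id: str) -> List[str]:
    """Run the full battery against a build and return the names of failed checks."""
    return [name for name, check in RELEASE_CHECKS.items() if not check(build_id)]


# The battery grows over time: integration tests first, then browser, then load.
register("integration", lambda build_id: True)
register("browser", lambda build_id: True)
register("load", lambda build_id: not build_id.endswith("-slow"))

print(failed_checks("build-101"))
print(failed_checks("build-102-slow"))
```

Because every registered check runs regardless of how large or small the change is, adding a check once protects all later deployments.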
[00:12:39.18] AARON SCHWAM: So how did you actually know
that you were kind of making progress and improving?
[00:12:44.49] KEN EXNER: Well, we measure everything. So at
Amazon we love to measure everything. We monitor everything. We measure
everything. And one of the things that we saw was that not only were we able to
push out changes more frequently, we were also having a higher quality product.
[00:13:01.50] We were able to have
fewer problems in production. So we had higher quality and faster release
cycles. So we were able to release more and get higher quality product out of
the release process.
[00:13:14.70] AARON SCHWAM: Awesome. So you mentioned CI.
You mentioned a lot of the sophisticated testing. It seems like there’s a
really scary kind of hump to get over there in terms of like continuous
deployment, which I know we practice at Amazon. Can you talk a little bit about
that and how we got over that fear?
[00:13:31.29] KEN EXNER: Sure. So the entire release process
is kind of a pessimistic process. We call them pessimistic deployments, because
we’re constantly trying to find reasons to fail the deployment. So if a test
fails in a preproduction environment, it’ll stop that, roll it back to its last
known good state. And then the developer has to go figure out what had happened
and we push that release again.
[00:13:57.60] But if at any point in the release process
there’s something that causes the system to either– alarms go off because CPU
spikes or a test fails, it’s going to stop that, automatically roll it back,
and then go back to the last known good state. So it’s constantly looking for
reasons to fail this deployment, either in preproduction environments where
it’s running a bunch of tests, or as it slowly fans this out into production.
[00:14:27.98] So when we deploy
into production, you typically deploy to one box in one AZ in one region. Let
it sit there for some period of time, bake there for some period of time. Also,
run a bunch of transactional tests– like has every single piece of that code
been exercised in production?
[00:14:47.67] If there are any
problems, it’s going to stop it, automatically roll it back, and go to its last
known good state. If it succeeds, it’s then going to start fanning out to the
rest of that AZ slowly, and then it’s going to go to the rest of that region
slowly, and then it’s going to fan out across the different regions.
[00:15:03.54] But throughout this entire process, all the
tests, all the automation, are always looking for reasons to stop it and throw
it out. So it’s a very pessimistic process. And if you can get through all
that, you usually have a good release.
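The pessimistic, staged rollout described above can be sketched as a loop that widens the deployment one wave at a time and rolls back on the first failed check. This is a simplified illustration under stated assumptions: the wave names, the `pessimistic_deploy` function, and the `is_healthy` callback (a stand-in for bake time, alarms, and tests) are hypothetical and not taken from Amazon's systems.

```python
from typing import Callable, List, Tuple

# Hypothetical rollout waves, narrowest first, mirroring the order described above.
WAVES = ["one box", "rest of the AZ", "rest of the region", "other regions"]


def pessimistic_deploy(
    new_version: str,
    last_known_good: str,
    is_healthy: Callable[[str, str], bool],
) -> Tuple[str, List[str]]:
    """Fan a release out wave by wave; roll back on the first failed check.

    Returns the version left running and a log of what happened.
    """
    log: List[str] = []
    for wave in WAVES:
        log.append(f"deploy {new_version} to {wave}")
        # Stand-in for bake time, alarms, and tests run against this wave.
        if not is_healthy(new_version, wave):
            log.append(f"check failed in {wave}; roll back to {last_known_good}")
            return last_known_good, log
    log.append(f"{new_version} is fully deployed")
    return new_version, log


# Example: a release that looks fine on one box but trips an alarm at AZ scale.
def fails_at_az(version: str, wave: str) -> bool:
    return wave != "rest of the AZ"


running, events = pessimistic_deploy("v2", "v1", fails_at_az)
print(running)
for event in events:
    print(event)
```

The design choice worth noting is that the loop never needs a reason to continue, only the absence of a reason to stop, which is what makes the process pessimistic.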
[00:15:17.23] AARON SCHWAM: Very interesting. So that covers
kind of the CI and the CD. But what about like some of the most sensitive, or
what about like security? How do you maintain security in this model?
[00:15:30.43] KEN EXNER: Sure. Security is sort of managed
throughout the entire process. I think it starts first with the developers
needing to think like security engineers. And this is an important part of
Amazon’s culture. We want these engineers to not only be developers, but we
want them to be operators, and we want them to be security experts.
[00:15:54.76] Someone doesn’t come
to Amazon knowing much about DevOps or whatever, they typically come to Amazon
with a CS degree. They know engineering. And we end up having to teach them how
to think like security engineers, teaching them how to think like architects,
how to think like testers, because they’re going to be doing all these
different functions.
[00:16:14.78] So when it comes to
security, one of the first things a developer is going to do when they’re
starting a new project is they’re going to work on a threat model. They’re
going to design their architecture, and they’re going to work on a threat
model. And they’re going to review that threat model with a security engineer.
And that security engineer is going to give them feedback, and it’s going to–
they’re going to own their own security.
[00:16:36.07] So they’re going to
be responsible for thinking like a security engineer. Because they’re closest
to this, they are usually the people that are most apt and able to find
problems with the software. So we want them to think like a security engineer.
[00:16:51.22] Once they’ve done
that, they’re going to start doing their development, and they’re going to
submit their code for code review. And that code is going to go to one or more
peers. Teams can configure how they want to do peer
reviews. And their peers are going to give them code reviews that get done
before it gets committed.
[00:17:16.06] We’re also going to
do a bunch of automated checks as part of that code review process. So there’s
a bunch of static analysis that happens as part of that code review process. So
again, we can look for security issues through static analysis.
[00:17:28.15] Once it goes out to the build process, we have
additional static analysis that happens as part of the build process. Then as
we start to go out into the release pipeline, there’s a bunch of other
additional checks that happen. There’s a bunch of canary monitors that run
positive and negative security checks against that deployment before it goes
out. And we have a bunch of checks that verify the integrity of that pipeline.
[00:17:51.88] AARON SCHWAM: And these are all instituted
centrally or by each team?
[00:17:55.54] KEN EXNER: Both. So there are some things that
we can institute centrally that affect all of Amazon, or at least all of AWS
and all of retail, and there are some things that can be mandated or governed
at a local level.
[00:18:13.70] So one of the things
that we realized along the way is that we could inspect a pipeline. And if we
could inspect a pipeline, we could determine whether or not it was following
best practices. And if we could do that, and we could describe those best
practices, we could allow people to create rules that govern the shape and
structure and contents of a pipeline.
[00:18:38.08] So this allows me, as
an organizational leader, to have rules for my teams. So I can create a rule
that says that every new commit that gets deployed has to have at least 70%
code coverage, for example. So if I want to have a rule that I’m not going to
let anything deploy unless it’s 70% code coverage, meaning unit test coverage,
I can enforce that for my organization.
[00:19:05.89] Or AWS-wide– we have
certain rules that are either because of compliance or governance or security,
we want to have rules that cover every single deployment at Amazon. An example
I like to use is we don’t like people deploying to every region at the same
time.
[00:19:22.21] So if a team changes
their pipeline so that they’re deploying to every single region as a first step
in a pipeline, that’s a bad practice. We don’t want them to do that. We can
automatically stop that deployment because they’ve broken that policy, that
rule. And this applies to every team in AWS. So we’re able to create these
rules that stop deployments, either at a team level, organizational level, or
company-wide.
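The two example rules mentioned above (a 70% unit test coverage floor set by an organization, and a company-wide ban on deploying to every region as a first step) can be sketched as functions that inspect a pipeline description. Everything named here is hypothetical: the `Pipeline` shape, the rule functions, and the three-region list are assumptions for illustration, not Amazon's policy engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

ALL_REGIONS = frozenset({"us-east-1", "us-west-2", "eu-west-1"})  # illustrative subset


@dataclass
class Pipeline:
    """Hypothetical description of a team's release pipeline."""
    team: str
    unit_test_coverage: float   # fraction of code covered, 0.0 to 1.0
    stages: List[List[str]]     # regions targeted by each stage, in order


# A rule inspects a pipeline and returns a violation message, or None if it passes.
Rule = Callable[[Pipeline], Optional[str]]


def min_coverage_70(pipeline: Pipeline) -> Optional[str]:
    if pipeline.unit_test_coverage < 0.70:
        return "unit test coverage is below 70%"
    return None


def no_all_regions_first(pipeline: Pipeline) -> Optional[str]:
    if pipeline.stages and set(pipeline.stages[0]) >= ALL_REGIONS:
        return "first stage deploys to every region at once"
    return None


def blocking_violations(
    pipeline: Pipeline, org_rules: List[Rule], company_rules: List[Rule]
) -> List[str]:
    """Evaluate company-wide rules first, then organization rules."""
    results = (rule(pipeline) for rule in [*company_rules, *org_rules])
    return [message for message in results if message is not None]


good = Pipeline("orders", 0.82, [["us-east-1"], ["us-west-2", "eu-west-1"]])
bad = Pipeline("search", 0.55, [["us-east-1", "us-west-2", "eu-west-1"]])

print(blocking_violations(good, [min_coverage_70], [no_all_regions_first]))
print(blocking_violations(bad, [min_coverage_70], [no_all_regions_first]))
```

A deployment would proceed only when the returned list is empty; keeping organization and company rules in separate lists reflects the different levels at which such rules are described as being set.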
[00:19:47.90] AARON SCHWAM: So there must be some sort of–
I mean, that’s a lot for a team to take on. So there must be some sort of like
templates or best practices that you’re sharing or vending out to the
organization. How does that work?
[00:19:58.87] KEN EXNER: Well, in general, we want to give
teams the best practice pipeline by default. So we have templates for how to
get started. We model all of this in code so that you can set things up the
right way by default.
[00:20:12.89] So a lot of the internal scaffolding tools,
and a lot of the templates that we use for setting up a pipeline in all the
different resources that hang off that pipeline, we have best practice
templates for getting started. Super important. That
helps developers get off on the right foot.
[00:20:32.09] But then by having
this inspection capability where we can inspect changes to that, we can make
sure that they don’t change it in a way that’s bad. People don’t always know
all the best practices. There’s a lot of lessons
learned through the history of Amazon through a lot of mistakes that we’ve
made.
[00:20:50.73] We can take all those learnings and bake them
into policies, bake them into not only the best practices that are a part
of the template, but the best practices that are enforced on that pipeline as
policy. And this way, developers don’t have to feel bad that they didn’t know
that you shouldn’t deploy to every region at the same time, because there’s a
policy that stops them from doing that.
[00:21:12.11] So initially, this, again, sounded a
bit draconian at first, like we were going to start blocking things. But it’s
actually been very liberating for us, because it allows us to make sure that
we’re maintaining best practices, and developers don’t have to make mistakes
and learn those the hard way.
[00:21:31.43] AARON SCHWAM: So I can see how that would be
very reassuring for engineers. They don’t have to be the experts of everything.
So you were talking about teaching. So I’d love to hear more about
kind of the cultural aspects of this that we know is really important.
[00:21:45.78] KEN EXNER: Yeah. So as I was saying, DevOps is
not something you can get a degree in. You get a degree in CS. And if you’re
going to expect developers–
[00:21:56.18] AARON SCHWAM: Or marketing.
[00:21:56.79] KEN EXNER: Or marketing. If you’re going to
expect developers or marketing to own not only development, but testing and
operations, everything else, you need to teach them those other things. Some of
that happens just through training and normal educational activities that we
do. But we have found that the most effective way to teach people is to make
them do it and review it with people.
[00:22:21.51] So the example I gave before about security
threat modeling, and then reviewing that with the security engineer, we do the
same kind of thing for other things. Another example would be, when you begin a
project, you’re also going to do a design. That design is not going to be done
by an architect. We call them principal engineers at Amazon.
[00:22:41.78] But an architect is not going to do
architecture. An engineer on the team is going to do that, and then they go
review that with a principal engineer or with an architect. The role of a
principal engineer or architect at Amazon is to review and teach. It is not to
actually do the architecture.
[00:22:58.28] Same thing with the
security engineers. Their job is not to do the threat model, it is to review
it. Because you want the engineers on the team to be doing
this themselves. Same thing with testing. You
might do a test plan at the beginning of a project, and you will review that with
an expert in testing.
[00:23:16.94] The tester is not
going to do your testing or do your test plan. They’re going to review it. Because we want the team to own their entire process– the
architecture, the security, the testing– they own all of it. And you
want them to develop expertise in all of it.
[00:23:32.58] So we spend a lot of time– it’s very
time-consuming– but we spend a lot of time teaching, because we want the engineer to know the security of their system. We want
them to know the testing. We want them to know the operations. So it’s an
important part of this.
[00:23:48.00] I think the other thing that’s pretty
important at Amazon is if something is important to the business, it’s
important that the leadership model that and show that. A great example of this
is security and operations. Operations is critically
important at Amazon.
[00:24:04.39] And you know this
because the leadership demonstrate that. They spend a lot of time actively
engaged in operational reviews, going through ops meetings, looking at pages of
graphs and giving feedback on the status of operations for each team.
[00:24:23.09] In a typical week, I
will spend at least six hours in ops meetings. And it’s not uncommon for directors
and VPs to spend a lot of time doing operational reviews. Andy Jassy, Charlie
Bell in database leadership– they spend a lot of time reviewing operations and
reviewing security. And when you come into Amazon, you see your leadership
spending this much time focused on operations and security, you learn very
quickly that it’s important. So I think it’s important that businesses
understand that if you want everyone in your business to take something
seriously, it helps that leadership does too.
[00:25:00.42] AARON SCHWAM: Yeah. So Charlie Bell actually
hosts a weekly meeting on this, does he not?
[00:25:03.95] KEN EXNER: Yeah. So yeah.
Every Wednesday we spend a couple hours– all of the AWS leadership team goes
into an ops meeting with Charlie Bell, and we review COEs, which are sort of a
five whys, RCA-type process, where we look at issues that have happened, and we
root-cause them and discuss what happened and learn from them.
[00:25:28.52] But we also spin a wheel. We have a wheel
spinning where it lands on a service, and then that service has to present
their ops, which is presenting their dashboards.
[00:25:40.70] AARON SCHWAM: So essentially, you have to be
ready.
[00:25:42.11] KEN EXNER: You have to be ready. So every team
in AWS has to be ready. At any given week, you could be asked to present your
dashboards. And you’re expected to know everything that happened.
[00:25:53.15] So we’ll pore through
all these different graphs. We’ll look at every single blip and ask what
happened. And I should be able to explain what had happened. So in order for
that to happen, I have to have my ops meeting so that I am able to dive deep
and understand what had happened for each of my services.
[00:26:12.12] AARON SCHWAM: That must’ve been fun in the
early days when there were only 10 services.
[00:26:16.15] KEN EXNER: We used to be able to fit in a
small room at a small table, but it’s gotten to be a very big event.
[00:26:22.79] AARON SCHWAM: Yeah. OK. So stepping back from
the engineering and the operations, we’ve got all these distributed two-pizza
teams all over Amazon. How do you handle kind of the planning and the
prioritization when we’ve got such a decentralized org structure?
[00:26:37.85] KEN EXNER: Is it some big master chess game
where we plan it? No. What we’ve actually found is that the best way to plan is
bottoms up. And the reason why is that the teams who are closest to product are
also closest to the customer.
[00:26:54.05] They’re the ones who
are working with customers on a day-to-day basis. They’re the ones who are not
only operating these services, but working with customers and fielding their
questions. They know what the customer wants. So rather than have this big
architecture where we sort of plan things in a grand way, what we do is a
bottoms-up planning process, because we want the people who are closest to the
customers to be telling us what we should be doing.
[00:27:17.40] So we have an annual
planning process. It’s called OP1, OP2– Operating Plan. And what it’s– it’s
fairly unique. I’ve not seen something like this in the industry. But at every
level of the organization you write a six-page doc that talks about what you’re
going to do over the next year, what your plan is going to be if you have flat
resources, and what you would do if you had incremental resources. And you
present your business plan, and you have six pages to do it. And then this
happens at every level of the organization.
[00:27:47.82] So for my
organization, I get a bunch of these six-page docs that I review with these
teams. And then I take all of this, and then I turn it into a six-page doc that
I then present to Andy Jassy and Andy Jassy’s leadership team. Andy Jassy does
the same thing. He turns all of AWS into one six-page document that he then
presents to Jeff Bezos and the S-team, and then resources flow down.
[00:28:12.54] But what is happening in this process is
you’re allowing everyone who is closest to the customer to have input into the
planning process. And the layers of management are essentially acting as a–
rationalizing the different requests, and making judgment about what we should
be focused on.
[00:28:30.71] But you’re still taking all the input. You’re
not silencing that. You’re pulling the input from the customer, from the people
closest to the customer. And then you present that all the way up, and then
resources flow down. So that’s how we plan things on an annual basis.
[00:28:47.90] After that, because
we want these different teams to still look and feel and operate like startups,
we manage them like startups. We ask them to take goals, so they come up with
goals given the resources that we’ve given them– this is your headcount, these
are different resources.
[00:29:05.78] We ask them to sign
up for goals, and then we track those goals over the course of the year. And
the role of management is essentially to act sort of as a board of directors,
managing their different startups.
[00:29:18.32] Each of these
different startups has their goals and their metrics. And we meet once a month
or once a week to review how those teams are doing against those goals and
those metrics. But the decision-making, what the team is doing, is being
decided on by that team. And our job is just to manage them like a board of
directors.
[00:29:38.99] AARON SCHWAM: Awesome. So you’ve talked about
basically three things. We had decompose for agility.
We had automation and CI/CD. And we talked about kind of a culture on the back
end here. So what are the biggest challenges or downsides that come with kind
of this model and the way we operate?
[00:29:57.14] KEN EXNER: There’s a few. Well, one of the
problems that we’ve had is when you have a bunch of autonomous independent teams,
communication and consistency can be hard. So even though our planning process
tries to find duplicates– like this team over here is going to be– is working
on the same thing as this team over here, through the planning process we
usually try to find the most egregious cases of that.
[00:30:24.11] It still happens. We
still end up with two versions of the same thing in completely different parts
of the company. Because you don’t have as much visibility
across the entire company. So that happens.
[00:30:37.52] One of the quotes I
love from Jeff Bezos is, "I’d rather have two of something than none of
something." And the idea is that we accept that. We accept that it’s going
to be an imperfect model, because you’re going to end up with duplication.
[00:30:51.44] But we rationalize it afterwards. We fix it
afterwards. We don’t want to slow things down in order to get things completely
aligned. So we accept a little bit of that.
[00:31:01.07] The other problem
with this model is it’s hard to have consistency. You don’t only have
duplication, you have challenges of consistency. But what we’ve found is that
that’s usually just a factoring problem. You can usually solve that by creating
another two-pizza team that factors out that problem and turns into another
service or component that then drives consistency. So it is difficult, but
we’ve found ways to cope with it.
[00:31:28.76] AARON SCHWAM: What about cooperation across
teams or organizations?
[00:31:35.30] KEN EXNER: Yeah. If every team has agency, they
own their own business decisions. There is often this problem of how you
convince that team to do something that you need them to do. And it’s
essentially like a bunch of different startups, and you have to convince each
of the teams–
[00:31:52.70] AARON SCHWAM: Be compelling.
[00:31:53.27] KEN EXNER: You have to make a compelling
argument that this is something that we should be doing. Usually, what we do is
anything that’s going to be driven across all these teams,
we try to do as part of the annual planning process. There are some things that
are going to be sort of top-down.
[00:32:06.77] We are going to
decide as a business we are going to go into a new region. So we ask all teams
to plan for that. There are sometimes things that originate in one team that
they have an ask of a different team. And if you can get that into the
planning process, you allow the team to plan for it. Make the case– get them
to plan for it.
[00:32:24.57] AARON SCHWAM: So one other kind of challenge I
was curious about is scaling DevOps teams. So how does that work? Can every
team be a DevOps team? Would love your thoughts.
[00:32:34.58] KEN EXNER: Sure. Well, we have found that most
teams that do software development can be DevOps teams. It gets a little harder
when you get a little bit further from software development into things like IT
and stuff like that. But if you’re doing software development, we have found
most teams can be DevOps teams.
[00:32:51.11] We scale them by
splitting them apart– mitosis. We have a team that’s owning
something. It gets a little bit bigger. We split it apart. This team owns this
service. This team owns this service. And we just keep splitting them apart.
[00:33:06.08] EC2, for example, began as
one two-pizza team. It’s no longer one two-pizza team. It seems much
bigger than that. And it just keeps breaking apart into a bunch of smaller
services, and a bunch of smaller teams owning those services. So we just keep
splitting things apart.
[00:33:21.59] Usually, it reflects
the architecture, because you don’t want to have one– EC2 is not one big
monolithic service. It’s a bunch of different services. So you have these
smaller teams owning smaller services, and you keep splitting things apart.
[00:33:34.67] AARON SCHWAM: All right. Ken,
thank you so much for joining us today. We will reserve time for
Q&A. So we’re going to pause now and then take your questions from the
chat.
[00:33:50.03] So first up, we have
a monolithic architecture. How should we think about moving to the
microservices model?
[00:33:57.17] KEN EXNER: Sure. So I’ve seen companies do
this– one of three models. So there is sort of the Big Bang approach. We’re
going to make a top-down decision that we’re going to break apart the monolith
and do it all at once. This is kind of what Amazon did.
[00:34:14.45] Jeff Bezos and the leadership team decided we
are going to do this. And it’s going to take us a few years, but we’re just
going to do it. That works– sometimes it doesn’t, but that’s one model.
[00:34:25.86] Another model I’ve
seen is businesses that basically just take anything that’s net new, sort of
anything that’s green field they’re going to do the new way. So leave
everything behind that’s monolithic, and anything that’s new develop
it as a microservice in the way you want it to be going forward. And then pull
the business in that direction over time, so net new.
[00:34:50.81] The third way I’ve
seen this happen is carving it apart piece by piece. So slowly over time,
saying we’re going to carve this piece off and turn it into a microservice, and
then slowly chip away at the monolith.
[00:35:05.36] I don’t know which is the best pattern, but
I’ve seen companies succeed and fail in any of those three different models. It
really is just about execution. You can do it all at once, you can sort of chip
away at it, or you can do something net new. Any of those models can work.
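
The third model, carving the monolith apart piece by piece, can be sketched as a small routing layer that sends carved-out paths to a new service and everything else to the monolith. This approach is often called the strangler fig pattern. The sketch below is only an illustration under stated assumptions, not how Amazon did it: the handler functions, the "/orders" prefix, and the route helper are all hypothetical names.

```python
from typing import Callable, Dict


def monolith_handle(path: str) -> str:
    """Stand-in for the existing monolith (hypothetical)."""
    return f"monolith handled {path}"


def orders_service_handle(path: str) -> str:
    """Stand-in for a newly carved-out microservice (hypothetical)."""
    return f"orders service handled {path}"


# Path prefixes that have already been carved off the monolith.
# Each new microservice adds one entry here; the monolith shrinks over time.
CARVED_OUT: Dict[str, Callable[[str], str]] = {
    "/orders": orders_service_handle,
}


def route(path: str, carved: Dict[str, Callable[[str], str]]) -> str:
    """Send a request to a carved-out service if one owns the prefix,
    otherwise fall back to the monolith."""
    for prefix, handler in carved.items():
        if path.startswith(prefix):
            return handler(path)
    return monolith_handle(path)


if __name__ == "__main__":
    print(route("/orders/42", CARVED_OUT))
    print(route("/cart", CARVED_OUT))
```

Callers keep using the same entry point while ownership of each prefix moves, which is what allows the monolith to be chipped away gradually.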
[00:35:21.67] AARON SCHWAM: OK. Next
question. We release a big batch of features every couple of months. How
would it look to do releases a couple of times a day?
So I’m kind of interpreting this question as like long-running features.
[00:35:36.49] KEN EXNER: Yeah. Yeah. I think this is– so
AWS does a lot of releases, but we don’t do like one every second– well, not
yet. But we are actually pushing changes like every couple of seconds.
[00:35:52.96] But what happens is not all those changes get
surfaced as features. Sometimes if it’s just a bug fix, we can just push that
out. Sometimes if it’s sort of an additive feature, we can push it behind a
feature flag. We use feature flags a lot at Amazon.
[00:36:10.08] AARON SCHWAM: What’s a feature flag?
[00:36:10.66] KEN EXNER: A feature flag is basically
wrapping the change in code that has a flag that you flip when you’re ready to
release it. So the change is sort of pushed out to production, but it’s not
customer visible until you change the flag, until you flip the configuration.
Then it exposes the feature.
[00:36:29.74] AARON SCHWAM: So it’s literally just not read
in the code file? Or it is and [INAUDIBLE]–
[00:36:34.75] KEN EXNER: It is, but it’s wrapped in a flag
in a configuration file so that it doesn’t become public until the
configuration is changed. But it is pushed out to production. That is very
common.
[00:36:47.83] The other thing is
sometimes it’s an additive feature that doesn’t get exposed immediately. But we
can just sort of push that out, and then when we’re ready to expose it, we’ll
just expose it. But the idea is we always want to be pushing things out
regularly.
[00:37:02.59] Typically, a team
will push things– do 5 to 10 different deployments a day, constantly
releasing. But as long as you can wrap things in a feature flag, or if it’s sort
of a net new thing that doesn’t change the existing experience, you can be
pushing these things out very regularly.
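
The feature-flag idea described above can be sketched in a few lines. This is a minimal illustration, assuming the flags live in a JSON configuration; it is not Amazon's actual mechanism, and the flag name new_checkout_banner and the functions load_flags and render_page are hypothetical.

```python
import json

# Hypothetical flag configuration. In practice this would come from a
# configuration file or service that can change without a code deployment.
CONFIG_JSON = '{"new_checkout_banner": false}'


def load_flags(raw: str) -> dict:
    """Parse the flag configuration from JSON text."""
    return json.loads(raw)


def render_page(flags: dict) -> str:
    """Build the page. The new code path is already deployed, but it is
    only customer visible once its flag is switched on."""
    parts = ["existing experience"]
    if flags.get("new_checkout_banner", False):
        parts.append("new banner")
    return " + ".join(parts)


if __name__ == "__main__":
    flags = load_flags(CONFIG_JSON)
    print(render_page(flags))            # flag off: feature hidden
    flags["new_checkout_banner"] = True  # flip the configuration
    print(render_page(flags))            # flag on: feature exposed
```

Defaulting a missing flag to off means code can ship to production before any configuration for it exists, and the release becomes a configuration change rather than a deployment.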
[00:37:20.98] AARON SCHWAM: OK. That was very interesting.
So we’ve got a couple more here. Next one– our security team is hesitant. What
have you learned to help to get the security team onboard with this model?
That’s a good one.
[00:37:32.02] KEN EXNER: Yeah. So I think security orgs, you
should be a little bit concerned about developers having all the control. But
I’ve seen a lot of that change. As security organizations start to realize that
by having a lot of automation, you can ensure a lot more governance and a lot
more controls.
[00:37:52.19] So there’s the whole
DevSecOps movement right now. And what DevSecOps is is about injecting security
into the release process. What I described earlier, how Amazon does things, we
have a lot more security testing, a lot more security controls that are part of
the release process. That can happen every single time. So we have a lot more
controls in place because we have all this automation.
[00:38:18.26] So I think as most security orgs start to
realize this, that they can get more consistency by moving towards DevOps and
towards this full automation, they actually get a lot more comfortable. Because
they can actually ensure that processes are being followed every single time a
deployment happens, that you’re not relying on manual processes.
[00:38:38.33] So if I can have a given positive/negative
security test that runs every single time a change happens and that’s part of
every release process, I’m going to get a lot more comfortable with it. So I
think security orgs are starting to naturally start to embrace DevOps, as they
start to realize that DevOps is also DevSecOps, it’s about injecting security
into the release process, they’re getting a lot more comfortable.
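
The point about running a positive and a negative security test on every change can be sketched as a release gate. This is a hedged illustration only, not a description of Amazon's pipeline: the access rule in is_request_allowed, the check names, and the release_gate and deploy functions are all hypothetical.

```python
from typing import Callable, Dict, List


def is_request_allowed(role: str, action: str) -> bool:
    """Hypothetical access-control rule under test: only admins may delete."""
    if action == "delete":
        return role == "admin"
    return True


def positive_test() -> bool:
    """Positive test: an authorized request must succeed."""
    return is_request_allowed("admin", "delete")


def negative_test() -> bool:
    """Negative test: an unauthorized request must be rejected."""
    return not is_request_allowed("guest", "delete")


SECURITY_CHECKS: Dict[str, Callable[[], bool]] = {
    "admin_can_delete": positive_test,
    "guest_cannot_delete": negative_test,
}


def release_gate(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    return [name for name, check in checks.items() if not check()]


def deploy(checks: Dict[str, Callable[[], bool]]) -> str:
    """Deploy only if every security check passes on this change."""
    failures = release_gate(checks)
    if failures:
        return "blocked: " + ", ".join(failures)
    return "deployed"


if __name__ == "__main__":
    print(deploy(SECURITY_CHECKS))
```

Because the gate runs inside the deployment path itself, the checks execute on every release instead of depending on someone remembering a manual step.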
[00:39:06.73] AARON SCHWAM: OK. Next up.
Are these teams really only all just software developers?
[00:39:13.99] KEN EXNER: That’s a good question. No. So when
we moved towards these two-pizza teams at Amazon, we actually said, OK, we’re
going to have just general SDEs.
[00:39:25.21] AARON SCHWAM: Software Development Engineers.
[00:39:26.08] KEN EXNER: Software Development Engineers. And
we tried that for a while. And then we started to realize that most teams did
have some other specialist requirements. A good example of this is front-end
developers or webdevs. A good webdev is going to be much better than a
well-intentioned back-end developer.
[00:39:49.21] So we realized that
there was room for specialization, and that these teams, like startups, should
have a mix of different skills. There’s going to be a lot of generalist SDEs–
Software Development Engineers. But there’s typically going to also be a webdev
or a front-end engineer, product manager, a doc writer, a systems engineer.
[00:40:14.70] So we start having
different functions within these two-pizza teams. Most of it tends to be
software development engineers, but where there is specialization, we’ve
learned that it helps to have specialists. So have a product manager. Don’t
have an engineer try to be a PM. Don’t have an
engineer trying to be a marketer. So we started in–
[00:40:36.35] AARON SCHWAM: And don’t have a marketer try to
be an engineer.
[00:40:38.00] KEN EXNER: Don’t have a marketer try to be an
engineer. We started to realize that it actually helped to have a little bit of
diversity in these teams.
[00:40:45.62] AARON SCHWAM: Thank you so much. That’s all
the time we have. We really appreciate that you made the time to join us today.
To learn more, we’ve got a couple different resources for you. The first is our
DevOps page– aws.amazon.com/devops. It’s got a lot of definitional stuff that
we talked about today, but we’re also pushing a lot more content to that page
this year.
[00:41:03.17] We also have lots of
re:Invent videos and things of that nature up on YouTube. And of course, this
talk that you saw today will be up on YouTube in a couple of days. Thank you.