- Problem statement: what kind of problem do the authors address, and why is it important?
- Approach & Design: briefly describe the approach designed by the authors
- Strengths and Weaknesses: list the strengths and weaknesses, in your opinion
- Evaluation: how did the authors evaluate the performance of the proposed scheme? What kind of workload was designed and used?
- Conclusion: your own judgement.
Experiences Building PlanetLab
Larry Peterson, Andy Bavier, Marc E. Fiuczynski, Steve Muir
Department of Computer Science
Princeton University
Abstract. This paper reports our experiences building
PlanetLab over the last four years. It identifies the re-
quirements that shaped PlanetLab, explains the design
decisions that resulted from resolving conflicts among
these requirements, and reports our experience imple-
menting and supporting the system. Due in large part
to the nature of the “PlanetLab experiment,” the discus-
sion focuses on synthesis rather than new techniques, bal-
ancing system-wide considerations rather than improving
performance along a single dimension, and learning from
feedback from a live system rather than controlled exper-
iments using synthetic workloads.
1 Introduction
PlanetLab is a global platform for deploying and eval-
uating network services [21, 3]. In many ways, it has
been an unexpected success. It was launched in mid-
2002 with 100 machines distributed to 40 sites, but to-
day includes 700 nodes spanning 336 sites and 35 coun-
tries. It currently hosts 2500 researchers affiliated with
600 projects. It has been used to evaluate a diverse set of
planetary-scale network services, including content dis-
tribution [33, 8, 24], anycast [35, 9], DHTs [26], robust
DNS [20, 25], large-file distribution [19, 1], measurement
and analysis [30], anomaly and fault diagnosis [36], and
event notification [23]. It supports the design and evalua-
tion of dozens of long-running services that transport an
aggregate of 3-4TB of data every day, satisfying tens of
millions of requests involving roughly one million unique
clients and servers.
To deliver this utility, PlanetLab innovates along two
main dimensions:
• Novel management architecture. PlanetLab ad-
ministers nodes owned by hundreds of organiza-
tions, which agree to allow a worldwide community
of researchers—most complete strangers—to access
their machines. PlanetLab must manage a complex
relationship between node owners and users.
• Novel usage model. Each PlanetLab node should
gracefully degrade in performance as the number of
users grows. This gives the PlanetLab community
an incentive to work together to make best use of its
shared resources.
In both cases, the contribution is not a new mechanism or
algorithm, but rather a synthesis (and full exploitation) of
carefully selected ideas to produce a fundamentally new
system.
Moreover, the process by which we designed the sys-
tem is interesting in its own right:
• Experience-driven design. PlanetLab’s design
evolved incrementally based on experience gained
from supporting a live user community. This is
in contrast to most research systems that are de-
signed and evaluated under controlled conditions,
contained within a single organization, and evalu-
ated using synthetic workloads.
• Conflict-driven design. The design decisions that
shaped PlanetLab were responses to conflicting re-
quirements. The result is a comprehensive architec-
ture based more on balancing global considerations
than improving performance along a single dimen-
sion, and on real-world requirements that do not al-
ways lend themselves to quantifiable metrics.
One could view this as a new model of system design, but
of course it isn’t [6, 27].
This paper identifies the requirements that shaped the
system, explains the design decisions that resulted from
resolving conflicts among these requirements, and reports
our experience building and supporting the system. A
side-effect of the discussion is a fairly complete overview
of PlanetLab’s current architecture, but the primary goal
is to describe the design decisions that went into building
PlanetLab, and to report the lessons we have learned in
the process. For a comprehensive definition of the Plan-
etLab architecture, the reader is referred to [22].
2 Background
This section identifies the requirements we understood at
the time PlanetLab was first conceived, and sketches the
high-level design proposed at that time. The discussion
includes a summary of the three main challenges we have
faced, all of which can be traced to tensions between the
requirements. The section concludes by looking at the
relationship between PlanetLab and similar systems.
2.1 Requirements
PlanetLab’s design was guided by five major require-
ments that correspond to objectives we hoped to achieve
as well as constraints we had to live with. Although we
recognized all of these requirements up-front, the follow-
ing discussion articulates them with the benefit of hind-
sight.
(R1) It must provide a global platform that supports
both short-term experiments and long-running services.
Unlike previous testbeds, a revolutionary goal of Planet-
Lab was that it support experimental services that could
run continuously and support a real client workload. This
implied that multiple services be able to run concurrently
since a batch-scheduled facility is not conducive to a
24×7 workload. Moreover, these services (experiments)
should be isolated from each other so that one service
does not unduly interfere with another.
(R2) It must be available immediately, even though
no one knows for sure what “it” is. PlanetLab faced a
dilemma: it was designed to support research in broad-
coverage network services, yet its management (control)
plane is itself such a service. It was necessary to deploy
PlanetLab and start gaining experience with network ser-
vices before we fully understood what services would be
needed to manage it. As a consequence, PlanetLab had
to be designed with explicit support for evolution. More-
over, to get people to use PlanetLab—so we could learn
from it—it had to be as familiar as possible; researchers
are not likely to change their programming environment
to use a new facility.
(R3) We must convince sites to host nodes running
code written by unknown researchers from other organi-
zations. PlanetLab takes advantage of nodes contributed
by research organizations around the world. These nodes,
in turn, host services on behalf of users from other re-
search organizations. The individual users are unknown
to the node owners, and to make matters worse, the ser-
vices they deploy often send potentially disruptive pack-
ets into the Internet. That sites own and host nodes, but
trust PlanetLab to administer them, is unprecedented at
the scale PlanetLab operates. As a consequence, we must
correctly manage the trust relationships so that the risks
to each site are less than the benefits they derive.
(R4) Sustaining growth depends on support for auton-
omy and decentralized control. PlanetLab is a world-
wide platform constructed from components owned by
many autonomous organizations. Each organization must
retain some amount of control over how their resources
are used, and PlanetLab as a whole must give geographic
regions and other communities as much autonomy as pos-
sible in defining and managing the system. Generally,
sustaining such a system requires minimizing centralized
control.
(R5) It must scale to support many users with mini-
mal resources. While a commercial variant of PlanetLab
might have cost recovery mechanisms to provide resource
guarantees to each of its users, PlanetLab must operate
in an under-provisioned environment. This means con-
servative allocation strategies are not practical, and it is
necessary to promote efficient resource sharing. This in-
cludes both physical resources (e.g., cycles, bandwidth,
and memory) and logical resources (e.g., IP addresses).
Note that while the rest of this paper discusses the
many tensions between these requirements, two of them
are quite synergistic. The requirement that we evolve
PlanetLab (R2) and the need for decentralized control
(R4) both point to the value of factoring PlanetLab’s man-
agement architecture into a set of building block compo-
nents with well-defined interfaces. A major challenge of
building PlanetLab was to understand exactly what these
pieces should be.
To this end, PlanetLab originally adopted an organiz-
ing principle called unbundled management, which ar-
gued that the services used to manage PlanetLab should
themselves be deployed like any other service, rather than
bundled with the core system. The case for unbundled
management has three arguments: (1) to allow the sys-
tem to more easily evolve; (2) to permit third-party de-
velopers to build alternative services, enabling a software
bazaar, rather than rely on a single development team
with limited resources and creativity; and (3) to permit
decentralized control over PlanetLab resources, and ulti-
mately, over its evolution.
2.2 Initial Design
PlanetLab supports the required usage model through dis-
tributed virtualization—each service runs in a slice of
PlanetLab’s global resources. Multiple slices run con-
currently on PlanetLab, where slices act as network-wide
containers that isolate services from each other. Slices
were expected to enforce two kinds of isolation: resource
isolation and security isolation, the former concerned
with minimizing performance interference and the latter
concerned with eliminating namespace interference.
At a high-level, PlanetLab consists of a centralized
front-end, called PlanetLab Central (PLC), that remotely
manages a set of nodes. Each node runs a node man-
ager (NM) that establishes and controls virtual machines
(VM) on that node. We assume an underlying virtual ma-
chine monitor (VMM) implements the VMs. Users create
slices through operations available on PLC, which results
in PLC contacting the NM on each node to create a local
VM. A set of such VMs defines the slice.
We initially elected to use a Linux-based VMM due to
Linux’s high mind-share [3]. Linux is augmented with
Vservers [16] to provide security isolation and a set of
schedulers to provide resource isolation.
2.3 Design Challenges
Like many real systems, what makes PlanetLab interest-
ing to study—and challenging to build—is how it deals
with the constraints of reality and conflicts among re-
quirements. Here, we summarize the three main chal-
lenges; subsequent sections address each in more detail.
First, unbundled management is a powerful design
principle for evolving a system, but we did not fully un-
derstand what it entailed nor how it would be shaped by
other aspects of the system. Defining PlanetLab’s man-
agement architecture—and in particular, deciding how to
factor management functionality into a set of independent
pieces—involved resolving three main conflicts:
• minimizing centralized components (R4) yet main-
taining the necessary trust assumptions (R3);
• balancing the need for slices to acquire the resources
they need (R1) yet coping with scarce resources
(R5);
• isolating slices from each other (R1) yet allowing
some slices to manage other slices (R2).
Section 3 discusses our experiences evolving PlanetLab’s
management architecture.
Second, resource allocation is a significant challenge
for any system, and this is especially true for PlanetLab,
where the requirement for isolation (R1) is in conflict
with the reality of limited resources (R5). Part of our ap-
proach to this situation is embodied in the management
structure described in Section 3, but it is also addressed
in how scheduling and allocation decisions are made on a
per-node basis. Section 4 reports our experience balanc-
ing isolation against efficient resource usage.
Third, we must maintain a stable system on behalf of
the user community (R1) and yet evolve the platform to
provide long-term viability and sustainability (R2). Sec-
tion 5 reports our operational experiences with Planet-
Lab, and the lessons we have learned as a result.
2.4 Related Systems
An important question to ask about PlanetLab is whether
its specific design requirements make it unique, or if our
experiences can apply to other systems. Our response is
that PlanetLab shares “points of pain” with three simi-
lar systems—ISPs, hosting centers, and the GRID—but
pushes the envelope relative to each.
First, PlanetLab is like an ISP in that it has many
points-of-presence and carries traffic to/from the rest of
the Internet. Like ISPs (but unlike hosting centers and the
GRID), PlanetLab has to provide mechanisms that can
be used to identify and stop disruptive traffic. PlanetLab
goes beyond traditional ISPs, however, in that it has to
deal with arbitrary (and experimental) network services,
not just packet forwarding.
Second, PlanetLab is like a hosting center in that its
nodes support multiple VMs, each on behalf of a differ-
ent user. Like a hosting center (but unlike the GRID or
ISPs), PlanetLab has to provide mechanisms that enforce
isolation between VMs. PlanetLab goes beyond hosting
centers, however, because it includes third-party services
that manage other VMs, and because it must scale to large
numbers of VMs with limited resources.
Third, PlanetLab is like the GRID in that its resources
are owned by multiple autonomous organizations. Like
the GRID (but unlike an ISP or hosting center), PlanetLab
has to provide mechanisms that allow one organization to
grant users at another organization the right to use its re-
sources. PlanetLab goes far beyond the GRID, however,
in that it scales to hundreds of “peering” organizations by
avoiding pair-wise agreements.
PlanetLab faces new and unique problems because it
is at the intersection of these three domains. For exam-
ple, combining multiple independent VMs with a single
IP address (hosting center) and the need to trace disrup-
tive traffic back to the originating user (ISP) results in
a challenging problem. PlanetLab’s experiences will be
valuable to other systems that may emerge where any of
these domains intersect, and may in time influence the
direction of hosting centers, ISPs, and the GRID as well.
3 Slice Management
This section describes the slice management architecture
that evolved over the past four years. While the discus-
sion includes some details, it primarily focuses on the de-
sign decisions and the factors that influenced them.
3.1 Trust Assumptions
Given that PlanetLab sites and users span multiple orga-
nizations (R3), the first design issue was to define the un-
derlying trust model. Addressing this issue required that
we identify the key principals, explicitly state the trust as-
sumptions among them, and provide mechanisms that are
consistent with this trust model.
Over 300 autonomous organizations have contributed
nodes to PlanetLab (they each require control over the
nodes they own) and over 300 research groups want to
deploy their services across PlanetLab (the node own-
ers need assurances that these services will not be dis-
ruptive). Clearly, establishing 300×300 pairwise trust
relationships is an unmanageable task, but it is well-
understood that a trusted intermediary is an effective way
to manage such an N×N problem.
PLC is one such trusted intermediary: node owners
trust PLC to manage the behavior of VMs that run on
their nodes while preserving their autonomy, and re-
searchers trust PLC to provide access to a set of nodes
that are capable of hosting their services. Recognizing
this role for PLC, and organizing the architecture around
it, is the single most important aspect of the design beyond
the simple model presented in Section 2.2.

Figure 1: Trust relationships among principals.
With this backdrop, the PlanetLab architecture recog-
nizes three main principals:
• PLC is a trusted intermediary that manages nodes
on behalf a set of owners, and creates slices on those
nodes on behalf of a set of users.
• An owner is an organization that hosts (owns) Plan-
etLab nodes. Each owner retains ultimate control
over their own nodes, but delegates management of
those nodes to the trusted PLC intermediary. PLC
provides mechanisms that allow owners to define re-
source allocation policies on their nodes.
• A user is a researcher that deploys a service on a set
of PlanetLab nodes. PlanetLab users are currently
individuals at research organizations (e.g., universi-
ties, non-profits, and corporate research labs), but
this is not an architectural requirement. Users create
slices on PlanetLab nodes via mechanisms provided
by the trusted PLC intermediary.
Figure 1 illustrates the trust relationships between node
owners, users, and the PLC intermediary. In this figure:
1. PLC expresses trust in a user by issuing it credentials
that let it access slices. This means that the user
must adequately convince PLC of its identity (e.g.,
affiliation with some organization or group).
2. A user trusts PLC to act as its agent, creating slices
on its behalf and checking credentials so that only
that user can install and modify the software running
in its slice.
3. An owner trusts PLC to install software that is able
to map network activity to the responsible slice.
This software must also isolate resource usage of
slices and bound/limit slice behavior.
4. PLC trusts owners to keep their nodes physically se-
cure. It is in the best interest of owners to not cir-
cumvent PLC (upon which they depend for accurate
policing of their nodes). PLC must also verify that ev-
ery node it manages actually belongs to an owner
with which it has an agreement.
Given this model, the security architecture includes the
following mechanisms. First, each node boots from an
immutable file system, loading (1) a boot manager pro-
gram, (2) a public key for PLC, and (3) a node-specific
secret key. We assume that the node is physically secured
by the owner in order to keep the key secret, although a
hardware mechanism such as TCPA could also be lever-
aged. The node then contacts a boot server running at
PLC, authenticates the server using the public key, and
uses HMAC and the secret key to authenticate itself to
PLC. Once authenticated, the boot server ensures that the
appropriate VMM and the NM are installed on the node,
thus satisfying the fourth trust relationship.
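For concreteness, the following Python sketch illustrates the HMAC-based authentication step described above; the message format, field names, and secret handling are hypothetical and do not reflect the actual boot manager code.

    import hmac
    import hashlib

    def sign_boot_request(node_id: str, nonce: str, node_secret: bytes) -> str:
        # Node side: key an HMAC with the node-specific secret so PLC can
        # verify that the request comes from a node it knows.
        message = f"{node_id}:{nonce}".encode()
        return hmac.new(node_secret, message, hashlib.sha1).hexdigest()

    def verify_boot_request(node_id: str, nonce: str, digest: str,
                            secrets_db: dict) -> bool:
        # PLC side: recompute the HMAC and compare in constant time.
        expected = sign_boot_request(node_id, nonce, secrets_db[node_id])
        return hmac.compare_digest(expected, digest)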
Second, once PLC has vetted an organization through
an off-line process, users at the site are allowed to cre-
ate accounts and upload their private keys. PLC then
installs these keys in any VMs (slices) created on be-
half of those users, and permits access to those VMs via
ssh. Currently, PLC requires that new user accounts
are authorized by a principal investigator associated with
each site—this provides some degree of assurance that
accounts are only created by legitimate users with a con-
nection to a particular site, thus satisfying the first trust
relationship.
Third, PLC runs an auditing service that records in-
formation about all packet flows coming out of the node.
The auditing service offers a public, web-based interface
on each node, through which anyone that has received un-
wanted network traffic from the node can determine the
responsible users. PLC archives this auditing information
by periodically downloading the audit log.
3.2 Virtual Machines and Resource Pools
Given the requirement that PlanetLab support long-lived
slices (R1) and accommodate scarce resources (R5), the
second design decision was to decouple slice creation
from resource allocation. In contrast to a hosting cen-
ter that might create a VM and assign it a fixed set of
resources as part of an SLA, PlanetLab creates new VMs
without regard for available resources—each such VM is
given a fair share of the available resources on that node
whenever it runs—and then expects slices to engage one
or more brokerage services to acquire resources.
To this end, the NM supports two abstract objects: vir-
tual machines and resource pools. The former is a con-
tainer that provides a point-of-presence on a node for a
slice. The latter is a collection of physical and logical
resources that can be bound to a VM. The NM supports
operations to create both objects, and to bind a pool to a
VM for some fixed period of time. Both types of objects
are specified by a resource specification (rspec), which is
a list of attributes that describe the object. A VM can run
as soon as it is created, and by default is given a fair share
of the node’s unreserved capacity. When a resource pool
is bound to a VM, that VM is allocated the corresponding
resources for the duration of the binding.
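A minimal Python sketch of the two abstractions is shown below; the class and method names are illustrative rather than the actual NM interface.

    import time

    class NodeManager:
        # Illustrative only: tracks VMs, resource pools, and time-limited
        # bindings between them, per the description above.
        def __init__(self):
            self.vms = {}       # slice name -> rspec for its VM
            self.pools = {}     # pool name  -> rspec for the resources
            self.bindings = []  # (slice name, pool name, expiry timestamp)

        def create_vm(self, slice_name, rspec):
            # A VM can run as soon as it is created; by default it gets a
            # fair share of the node's unreserved capacity.
            self.vms[slice_name] = rspec

        def create_pool(self, pool_name, rspec):
            # Invoked only by the node owner for "root" pools (Section 3.2).
            self.pools[pool_name] = rspec

        def bind(self, slice_name, pool_name, duration_s):
            # The VM holds the pool's resources until the binding expires.
            self.bindings.append((slice_name, pool_name,
                                  time.time() + duration_s))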
Global management services use these per-node opera-
tions to create PlanetLab-wide slices and assign resources
to them. Two such service types exist today: slice cre-
ation services and brokerage services. These services can
be separate or combined into a single service that both
creates and provisions slices. At the same time, different
implementations of brokerage services are possible (e.g.,
market-based services that provide mechanisms for buy-
ing and selling resources [10, 14], and batch scheduling
services that simply enforce admission control for use of
a finite resource pool [7]).
As part of the resource allocation architecture, it was
also necessary to define a policy that governs how re-
sources are allocated. On this point, owner autonomy
(R4) comes into play: only owners are allowed to invoke
the “create resource pool” operation on the NM that runs
on their nodes. This effectively defines the one or more
“root” pools, which can subsequently be split into sub-
pools and reassigned. An owner can also directly allocate
a certain fraction of its node’s resources to the VM of a
specific slice, thereby explicitly supporting any services
the owner wishes to host.
3.3 Delegation
PlanetLab’s management architecture was expected to
evolve through the introduction of third-party services
(R2). We viewed the NM interface as the key feature,
since it would support the many third-party creation and
brokerage services that would emerge. We regarded PLC
as merely a “bootstrap” mechanism that could be used to
deploy such new global management services, and thus,
we expected PLC to play a reduced role over time.
However, experience showed this approach to be
flawed. This is for two reasons, one fundamental and
one pragmatic. First, it failed to account for PLC’s cen-
tral role in the trust model of Section 3.1. Maintaining
trust relationships among participants is a critical role
played by PLC, and one not easily passed along to other
services. Second, researchers building new management
services on PlanetLab were not interested in replicating
all of PLC's functionality. Instead of using PLC to boot-
strap a comprehensive suite of management services, re-
searchers wanted to leverage some aspects of PLC and
replace others.
To accommodate this situation, PLC is today struc-
tured as follows. First, each owner implicitly assigns all
of its resources to PLC for redistribution. The owner can
override this allocation by granting a set of resources to
a specific slice, or divide resources among multiple bro-
kerage services, but by default all resources are allocated
to PLC.
Second, PLC runs a slice creation service—called
pl_conf—on each node. This service runs in a stan-
dard VM and invokes the NM interface without any ad-
ditional privilege. It also exports an XML-RPC interface
by which anyone can invoke its services. This is impor-
tant because it means other brokerage and slice creation
services can use pl_conf as their point-of-presence on
each node rather than have to first deploy their own slice.
Originally, the PLC/pl_conf interface was private as we
expected management services to interact directly with
the node manager. However, making this a well-defined,
public interface has been a key to supporting delegation.
Third, PLC provides a front-end—available either as
a GUI or as a programmatic interface at www.planet-
lab.org—through which users create slices. The PLC
front-end interacts with pl_conf on each node with the
same XML-RPC interface that other services use.
Finally, PLC supports two methods by which slices
are actually instantiated on a set of nodes: direct and
delegated. Using the direct method, the PLC front-end
contacts pl_conf on each node to create the correspond-
ing VM and assign resources to it. Using delegation, a
slice creation service running on behalf of a user con-
tacts PLC for a ticket that encapsulates the right to create
a VM or redistribute a pool of resources. A ticket is a
signed rspec; in this case, it is signed by PLC. The agent
then contacts pl_conf on each node to redeem this ticket,
at which time pl_conf validates it and calls the NM to
create a VM or bind a pool of resources to an existing
VM. The mechanisms just described currently support
two slice creation services (PLC and Emulab [34], the
latter uses tickets granted by the former), and two bro-
kerage services (Sirius [7] and Bellagio [2], the first of
which is granted capacity as part of a root resource allo-
cation decision).
Note that the delegated method of slice creation is
push-based, while the direct method is pull-based. With
delegation, a slice creation service contacts PLC to re-
trieve a ticket granting it the right to create a slice, and
then performs an XML-RPC call to pl_conf on each node.
For a slice spanning a significant fraction of PlanetLab’s
nodes, an implementation would likely launch multiple
such calls in parallel. In contrast, PLC uses a polling
approach: each pl_conf contacts PLC periodically to re-
trieve a set of tickets for the slices it should run.
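The two instantiation paths can be sketched as follows; the XML-RPC endpoints, method names, and ticket fields are hypothetical, and only the overall control flow (a PLC-signed rspec redeemed at each node's pl_conf, or pulled periodically by pl_conf itself) follows the description above.

    import xmlrpc.client

    PLC_URL = "https://www.planet-lab.org/api/"   # illustrative endpoint

    def instantiate_slice(rspec):
        # Stub: in the real system this invokes the node manager.
        print("creating VM described by", rspec)

    def pull_loop(node_id, signature_ok):
        # Pull-based (direct) path: pl_conf periodically fetches the
        # definitive list of slices for this node from PLC.
        plc = xmlrpc.client.ServerProxy(PLC_URL)
        for ticket in plc.GetSliceTickets(node_id):   # hypothetical method
            if signature_ok(ticket):                  # check PLC's signature
                instantiate_slice(ticket["rspec"])

    def delegated_create(slice_name, nodes, ticket):
        # Push-based (delegated) path: a slice creation service redeems a
        # PLC-signed ticket at pl_conf on each node, possibly in parallel.
        for node in nodes:
            proxy = xmlrpc.client.ServerProxy(f"https://{node}/pl_conf/")
            proxy.RedeemTicket(ticket)                # hypothetical method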
While the push-based approach can create a slice in
less time, the advantage of pull-based approach is that
it enables slices to persist across node reinstalls. Nodes
cannot be trusted to have persistent state since they are
completely reinstalled from time to time due to unrecov-
erable errors such as corrupt local file systems. The pull-
based strategy views all nodes as maintaining only soft
state, and gets the definitive list of slices for that node
from PLC. Therefore, if a node is reinstalled, all of its
slices are automatically recreated. Delegation makes it
possible for others to develop alternative slice creation
semantics—for example, a “best effort” system that ig-
nores such problems—but PLC takes the conservative
approach because it is used to create slices for essential
management services.
3.4 Federation
Given our desire to minimize the centralized elements of
PlanetLab (R4), our next design decision was to make it
possible for multiple independent PlanetLab-like systems
to co-exist and federate with each other. Note that this
issue is distinct from delegation, which allows multiple
management services to co-exist within a single Planet-
Lab.
There are three keys to enabling federation. First, there
must be well-defined interfaces by which independent in-
stances of PLC invoke operations on each other. To this
end, we observe that our implementation of PLC natu-
rally divides into two halves: one that creates slices on
behalf of users and one that manages nodes on behalf of
owners, and we say that PLC embodies a slice authority
and a management authority, respectively. Correspond-
ing to these two roles, PLC supports two distinct inter-
faces: one that is used to create and control slices, and
one that is used to boot and manage nodes. We claim
that these interfaces are minimal, and hence, define the
“narrow waist” of the PlanetLab hourglass.
Second, supporting multiple independent PLCs im-
plies the need to name each instance. It is PLC in
its slice authority role that names slices, and its name
space must be extended to also name slice authori-
ties. For example, the slice cornell.cobweb is implicitly
plc.cornell.cobweb, where plc is the top-level slice au-
thority that approved the slice. (As we generalize the slice
name space, we adopt “.” instead of “_” as the delimiter.)
Note that this model enables a hierarchy of slice author-
ities, which is in fact already the case with plc.cornell,
since PLC trusts Cornell to approve local slices (and the
users bound to them).
This generalization of the slice naming scheme leads
to several possibilities:
• PLC delegates the ability to create slices to regional
slice authorities (e.g., plc.japan.utokyo.ubiq);
• organizations create “private” PlanetLabs (e.g.,
epfl.chawla) that possibly peer with each other, or
with the “public” PlanetLab; and
• alternative “root” naming authorities come into exis-
tence, such as one that is responsible for commercial
(for-profit) slices (e.g., com.startup.voip).
The third of these is speculative, but the first two scenar-
ios have already happened or are in progress, with five
private PlanetLabs running today and two regional slice
authorities planned for the near future. Note that there
must be a single global naming authority that ensures all
top-level slice authority names are unique. Today, PLC
plays that role.

Service           Lines of Code   Language
Node Manager      2027            Python
Proper            5752            C
pl_conf           1975            Python
Sirius            850             Python
Stork             12803           Python
CoStat + CoMon    1155            C
PlanetFlow        5932            C

Table 1: Source lines of code for various management services
The third key to federation is to design pl_conf so that
it is able to create slices on behalf of many different slice
authorities. Node owners allocate resources to the slice
authorities they want to support, and configure pl_conf to
accept tickets signed by slice authorities that they trust.
Note that being part of the “public” PlanetLab carries the
stipulation that a certain minimal amount of capacity be
set aside for slices created by the PLC slice authority, but
owners can reserve additional capacity for other slice au-
thorities and for individual slices.
3.5 Least Privilege
We conclude our description of PlanetLab’s management
architecture by focusing on the node-centric issue of how
management functionality has been factored into self-
contained services, moved out of the NM and isolated in
their own VMs, and granted minimal privileges.
When PlanetLab was first deployed, all management
services requiring special privilege ran in a single root
VM as part of a monolithic node manager. Over time,
stand-alone services have been carved off of the NM
and placed in their own VMs, multiple versions of some
services have come and gone, and new services have
emerged. Today, there are five broad classes of manage-
ment services. The following summarizes one particular
“suite” of services that a user might engage; we also iden-
tify alternative services that are available.
Slice Creation Service: pl_conf is the default slice cre-
ation service. It requires no special privilege: the
node owner creates a resource pool and assigns it to
pl_conf when the node boots. Emulab [34] offers
an alternative slice creation service that uses tickets
granted by PLC and redeemed by pl_conf.
Brokerage Service: Sirius [7] is the most widely used
brokerage service. It performs admission control on
a resource pool set aside for one-hour experiments.
Sirius requires no special privilege: pl_conf allo-
cates a sub-pool of resources to Sirius. Bellagio [2]
and Tycoon [14] are alternative market-based bro-
kerage services that are initialized in the same way.
Monitoring Service: CoStat is a low-level instrumen-
tation program that gathers data about the state of
the local node. It is granted the ability to read
/proc files that report data about the underlying
VMM, as well as the right to execute scripts (e.g.,
ps and top) in the root context. Multiple ad-
ditional services—e.g., CoMon [31], PsEPR [5],
SWORD [18]—then collect and process this infor-
mation on behalf of users. These services require no
additional privilege.
Environment Service: Stork [12] deploys, updates, and
configures services and experiments. Stork is
granted the right to mount the file system of a client
slice, which Stork then uses to install software pack-
ages required by the slice. It is also granted the right
to mark a file as immutable, so that it can safely be
shared among slices without any slice being able to
modify the file. Emulab and AppManager [28] pro-
vide alternative environment services without extra
privilege; they simply provide tools for uploading
software packages.
Auditing Service: PlanetFlow [11] is an auditing ser-
vice that logs information about packet flows, and
is able to map externally visible network activity to
the responsible slice. PlanetFlow is granted the right
to run ulogd in the root context to retrieve log in-
formation from the VMM.
The need to grant narrowly-defined privileges to cer-
tain management services has led us to define a mecha-
nism called Proper (PRivileged OPERation) [17]. Proper
uses an ACL to specify the particular operations that can
be performed by a VM that hosts a management service,
possibly including argument constraints for each opera-
tion. For example, the CoStat monitoring service gath-
ers various statistics by reading /proc files in the root
context, so Proper constrains the set of files that can be
opened by CoStat to only the necessary directories. For
operations that affect other slices directly, such as mount-
ing the slice’s file system or executing a process in that
slice, Proper also allows the target slice to place addi-
tional constraints on the operations that can be performed
e.g., only a particular directory may be mounted by Stork.
In this way we are able to operate each management ser-
vice with a small set of additional privileges above a nor-
mal slice, rather than giving out coarse-grained capabili-
ties such as those provided by the standard Linux kernel,
or co-locating the service in the root context.
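The flavor of these per-operation, per-argument constraints can be illustrated with a small Python sketch; the slice names, operation names, and paths below are invented and do not correspond to Proper's actual ACL format.

    ACL = {
        # service slice     -> operation   -> constraint on its argument
        "princeton_costat": {
            "open_file": lambda path: path.startswith("/proc/virtual/"),
        },
        "arizona_stork": {
            "mount_dir": lambda path: path.startswith("/vservers/"),
        },
    }

    def allowed(service, operation, arg):
        # Grant the operation only if the service holds the privilege and
        # the argument satisfies the associated constraint.
        constraint = ACL.get(service, {}).get(operation)
        return constraint is not None and constraint(arg)

    # The monitoring service may read VMM statistics but nothing else.
    assert allowed("princeton_costat", "open_file", "/proc/virtual/info")
    assert not allowed("princeton_costat", "open_file", "/etc/shadow")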
Finally, Table 1 quantifies the impact of moving func-
tionality out of the NM in terms of lines-of-code. The
LOC data is generated using David A. Wheeler’s SLOCCount.
Note that we show separate data for Proper and
the rest of the node manager; Proper’s size is in part a
function of its implementation in C.
One could argue that these numbers are conservative,
as there are additional services that this list of manage-
ment services employ. For example, CoBlitz is a large
file transfer mechanism that is used by Stork and Emulab
to disseminate large files across a set of nodes. Similarly,
a number of these services provide a web interface that
must run on each node, which would greatly increase the
size of the TCB if the web server itself had to be included
in the root context.
4 Resource Allocation
One of the most significant challenges for PlanetLab has
been to maximize the platform’s utility for a large user
community while dealing with the reality of limited re-
sources. This challenge has led us to a model of weak
resource isolation between slices. We implement this
model through fair sharing of CPU and network band-
width, simple mechanisms to avoid the worst kinds of in-
terference on other resources like memory and disk, and
tools to give users information about resource availability
on specific nodes. This section reports our experiences
with this model in practice, and describes some of the
techniques we’ve adopted to make the system as effec-
tive and stable as possible.
4.1 Workload
PlanetLab supports a workload mixing one-off experi-
ments with long-running services. A complete character-
ization of this workload is beyond the scope of this paper,
but we highlight some important aspects below.
CoMon—one of the performance-monitoring services
running on PlanetLab—classifies a slice as active on a
node if it contains a process, and live if, in the last five
minutes, it used at least 0.1% (300ms) of the CPU. Fig-
ure 2 shows, by quartile, the number of active and live
slices across PlanetLab during the past year. Each graph
shows five lines; 25% of PlanetLab nodes have values
that fall between the first and second lines, 25% between
the second and third, and so on.
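The classification rule amounts to the following check, shown here as a small Python sketch with an illustrative input format.

    WINDOW_S = 300          # five-minute window
    LIVE_FRACTION = 0.001   # 0.1% of the CPU

    def classify(num_processes, cpu_seconds_in_window):
        # "Active" means the slice has a process on the node; "live" means
        # it also used at least 0.1% of the CPU (300ms) in the window.
        active = num_processes > 0
        live = active and cpu_seconds_in_window >= LIVE_FRACTION * WINDOW_S
        return active, live

    assert classify(3, 0.3) == (True, True)    # 300ms over five minutes
    assert classify(3, 0.1) == (True, False)   # active but not live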
Looking at each graph in more detail, Figure 2(a) illus-
trates that the number of active slices on most PlanetLab
nodes has grown steadily. The median active slice count
has increased from 40 slices in March 2005 to the mid-
50s in April 2006, and the maximum number of active
slices has increased from 60 to 90. PlanetLab nodes can
support large numbers of mostly idle slices because each
VM is very lightweight. Additionally, the data shows that
75% of PlanetLab nodes have consistently had at least 40
active slices during the past year.

Figure 2: Active and live slices on PlanetLab, by quartile. (a)
Active slices (slices with a process); (b) live slices (slices
using >0.1% CPU).
Figure 2(b) shows the distribution of live slices. Note
that at least 50% of PlanetLab nodes consistently have a
live slice count within two of the median. Additional data
indicates that this is a result of three factors. First, some
monitoring slices (like CoMon and PlanetFlow) are live
everywhere, and so create a lower bound on the number
of live slices. Second, most researchers do not appear
to greedily use more nodes than they need; for example,
only 10% of slices are deployed on all nodes, and 60% are
deployed on less than 50 nodes. We presume researchers
are self-organizing their services and experiments onto
disjoint sets of nodes so as to distribute load, although
there are a small number of popular nodes that support
over 25 live slices. Third, the slices that are deployed on
all nodes are not live on all of them at once. For instance,
in April 2006 we observed that CoDeeN was active on
436 nodes but live on only 269. Robust (and adaptive)
long-running services are architected to dynamically bal-
ance load to less utilized nodes [33, 26].
Of course we did not know what PlanetLab’s work-
load would look like when we made many early design
decisions. As reported in Section 2.2, one such decision
was to use Linux+Vservers as the VMM, primarily be-
cause of the maturity of the technology. Since this time,
alternatives like Xen have advanced considerably, but we
have not felt compelled to reconsider this decision. A key
reason is that PlanetLab nodes run up to 25 live VMs,
and up to 90 active VMs, at a time. This is possible
because we could build a system that supports resource
overbooking and graceful degradation on a framework of
Vserver-based VMs. In contrast, Xen allocates specific
amounts of resources, such as physical memory and disk,
to each VM. For example, on a typical PlanetLab node
with 1GB memory, Xen can support only 10 VMs with
100MB memory each, or 16 with 64MB memory. There-
fore, it’s not clear how a PlanetLab based on Xen could
support our current user base. Note that the management
architecture presented in the previous section is general
enough to support multiple VM types (and a Xen proto-
type is running in the lab), but resource constraints make
it likely that most PlanetLab slices will continue to use
Vservers for the foreseeable future.
4.2 Graceful Degradation
PlanetLab’s usage model is to allow as many users on a
node as want to use it, enable resource brokers that are
able to secure guaranteed resources, and gracefully de-
grade the node’s performance as resources become over
utilized. This section describes the mechanisms that sup-
port such behavior and evaluates how well they work.
4.2.1 CPU
The earliest version of PlanetLab used the standard Linux
CPU scheduler, which provided no CPU isolation be-
tween slices: a slice with 400 Java threads would get 400
times the CPU of a slice with one thread. This situation
occasionally led to collapse of the system and revealed
the need for a slice-aware CPU scheduler.
Fair share scheduling [32] does not collapse under
load, but rather supports graceful degradation by giving
each scheduling container proportionally fewer cycles.
Since mid-2004, PlanetLab’s CPU scheduler has per-
formed fair sharing among slices. During that time, how-
ever, PlanetLab has run three distinct CPU schedulers: v2
used the SILK scheduler [3], v3.0 introduced CKRM (a
community project in its early stages), and v3.2 (the cur-
rent version) uses a modification of Vserver’s CPU rate
limiter to implement fair sharing and reservations. The
question arises, why so many CPU schedulers?
The answer is that, for the most part, we switched CPU
schedulers for reasons other than scheduling behavior.
We switched from SILK to CKRM to leverage a com-
munity effort and reduce our code maintenance burden.
However, at the time we adopted it, CKRM was far from
production quality and the stability of PlanetLab suffered
as a result. We then dropped CKRM and wrote another
CPU scheduler, this time based on small modifications to
the Vservers code that we had already incorporated into
the PlanetLab kernel. This CPU scheduler gave us the ca-
pability to provide slices with CPU reservations as well
as shares, which we lacked with SILK and CKRM.

Figure 3: CPU % available on PlanetLab, by quartile.

Perhaps more importantly, the scheduler was more robust, so
PlanetLab’s stability dramatically improved, as shown in
Section 5. We are solving the code maintenance problem
by working with the Vservers developers to incorporate
our modifications into their main distribution.
The current (v3.2) CPU scheduler implements fair
sharing and work-conserving CPU reservations by over-
laying a token bucket filter on top of the standard Linux
CPU scheduler. Each slice has a token bucket that accu-
mulates tokens at a specified rate; every millisecond, the
slice that owns the running process is charged one token.
A slice that runs out of tokens has its processes removed
from the runqueue until its bucket accumulates a mini-
mum amount of tokens. This filter was already present
in Vservers, which used it to put an upper bound on the
amount of CPU that any one VM could receive; we sim-
ply modified it to provide a richer set of behaviors.
The rate that tokens accumulate depends on whether
the slice has a reservation or a share. A slice with a reser-
vation accumulates tokens at its reserved rate: for exam-
ple, a slice with a 10% reservation gets 100 tokens per
second, since a token entitles it to run a process for one
millisecond. The default share is actually a small reser-
vation, providing the slice with 32 tokens every second,
or 3% of the total capacity.
The main difference between reservations and shares
occurs when there are runnable processes but no slice has
enough tokens to run: in this case, slices with shares are
given priority over slices with reservations. First, if there
is a runnable slice with shares, tokens are given out fairly
to all slices with shares (i.e., in proportion to the number
of shares each slice has) until one can run. If there are
no runnable slices with shares, then tokens are given out
fairly to slices with reservations. The end result is that the
CPU capacity is effectively partitioned between the two
classes of slices: slices with reservations get what they’ve
reserved, and slices with shares split the unreserved ca-
pacity of the machine proportionally.
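The accounting can be summarized by the following sketch; the rates (one token per millisecond of CPU, 32 tokens per second for a default share) follow the text, while the refill interface and hysteresis threshold are illustrative.

    class SliceBucket:
        # 1000 tokens per second corresponds to 100% of the CPU, since a
        # token entitles a slice to run a process for one millisecond.
        def __init__(self, reserved_fraction=None, capacity=1000,
                     min_tokens=50):
            if reserved_fraction is not None:
                self.rate = 1000 * reserved_fraction  # e.g. 10% -> 100 tokens/s
            else:
                self.rate = 32                        # default share (~3%)
            self.tokens = capacity
            self.capacity = capacity
            self.min_tokens = min_tokens

        def refill(self, seconds):
            self.tokens = min(self.capacity, self.tokens + self.rate * seconds)

        def charge(self, ran_ms):
            # Charged one token per millisecond the slice's process runs.
            self.tokens -= ran_ms

        def runnable(self):
            # A slice that runs dry stays off the runqueue until it has
            # accumulated a minimum amount of tokens again (simplified).
            return self.tokens >= self.min_tokens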
CoMon indicates that the average PlanetLab node has
its CPU usage pegged at 100% all the time. However,
fair sharing means that an individual slice can still obtain
a significant percentage of the CPU.

Figure 4: CDF of memory consumed when a slice is reset.

Figure 3 shows,
by quartile, the CPU availability across PlanetLab, ob-
tained by periodically running a spinloop in the CoMon
slice and observing how much CPU it receives. The data
shows large amounts of CPU available on PlanetLab: at
least 10% of the CPU is available on 75% of nodes, at
least 20% CPU on 50% of nodes, and at least 40% CPU
on 25% of nodes.
4.2.2 Memory
Memory is a particularly scarce resource on PlanetLab,
and we were faced with choosing between four de-
signs. One is the default Linux behavior, which either
kernel panics or randomly kills a process when memory
becomes scarce. This clearly does not result in grace-
ful degradation. A second is to statically allocate a fixed
amount of memory to each slice. Given that there are up
to 90 active VMs on a node, this would imply an imprac-
tically small 10MB allocation for each VM on the typical
node with 1GB of memory. A third option is to explic-
itly allocate memory to live VMs, and reclaim memory
from inactive VMs. This implies the need for a control
mechanism, but globally synchronizing such a mecha-
nism across PlanetLab (i.e., to suspend a slice) is prob-
lematic at fine-grained time scales. The fourth option is
to dynamically allocate memory to VMs on demand, and
react in a more predictable way when memory is scarce.
We elected the fourth option, implementing a simple
watchdog daemon, called pl_mom, that resets the slice
consuming the most physical memory when swap has al-
most filled. This penalizes the memory hog while keep-
ing the system running for everyone else.
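The policy reduces to the following selection rule, sketched here with illustrative slice names and an assumed swap-usage threshold.

    SWAP_THRESHOLD = 0.90   # treat swap as "almost filled" beyond this point

    def pick_victim(swap_used_fraction, rss_by_slice):
        # rss_by_slice maps slice name -> resident memory in MB; the slice
        # consuming the most physical memory is reset.
        if swap_used_fraction < SWAP_THRESHOLD or not rss_by_slice:
            return None
        return max(rss_by_slice, key=rss_by_slice.get)

    # Example: the 600MB slice is the memory hog and would be reset.
    assert pick_victim(0.95, {"princeton_codeen": 180,
                              "mit_logger": 600}) == "mit_logger"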
Although pl_mom was noticeably active when first
deployed—as users learned to not keep log files in mem-
ory and to avoid default heap sizes—it now typically re-
sets an average of 3-4 VMs per day, with higher rates dur-
ing heavy usage (e.g., major conference deadlines). For
example, 200 VMs were reset during the two week run-
up to the OSDI deadline. We note, however, that roughly
one-third of these resets were on under-provisioned nodes
(i.e., nodes with less than 1GB of memory).
Figure 5: Memory availability (MB) on PlanetLab, by quartile.

Figure 4 shows the cumulative distribution function of
how much physical memory individual VMs were con-
suming when they were reset between November 2004
and April 2006. We note that about 10% of the resets
(corresponding largely to the first 10% of the distribution)
occurred on nodes with less than 1GB memory, where
memory pressure was tighter. In over 80% of all resets, the
VM had allocated at least 128MB. Half of all resets occurred
when the slice was using more than 400MB of memory,
which on a shared platform like PlanetLab indicates ei-
ther a memory leak or poor experiment design (e.g., a
large in-memory logfile).
Figure 5 shows CoMon’s estimate of how many MB of
memory are available on each PlanetLab node. CoMon
estimates available memory by allocating 100MB, touch-
ing random pages periodically, and then observing the
size of the in-memory working set over time. This serves
as a gauge of memory pressure, since if physical memory
is exhausted and another slice allocates memory, these
pages would be swapped out. The CoMon data shows
that a slice can keep a 100MB working set in memory
on at least 75% of the nodes (since only the minimum
and first quartile line are really visible), so it appears that
there is not as much memory pressure on PlanetLab as we
expected. This also reinforces our intuition that pl_mom
resets slices mainly on nodes with too little memory or
when the slice’s application has a memory leak.
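The probe can be approximated by the following Linux-specific Python sketch: allocate a 100MB buffer, touch random pages periodically, and read the process's resident set from /proc; the exact measurement details are illustrative.

    import os
    import random
    import time

    BUF_MB = 100
    PAGE = os.sysconf("SC_PAGE_SIZE")

    def resident_mb():
        # Second field of /proc/self/statm is the resident set in pages.
        with open("/proc/self/statm") as f:
            rss_pages = int(f.read().split()[1])
        return rss_pages * PAGE / (1024 * 1024)

    def probe(rounds=10, touches=1000, interval=1.0):
        buf = bytearray(BUF_MB * 1024 * 1024)
        for _ in range(rounds):
            for _ in range(touches):
                buf[random.randrange(len(buf))] = 1   # keep pages warm
            time.sleep(interval)
            # Under memory pressure, other slices' allocations push these
            # pages to swap and the resident set shrinks.
            print(f"resident working set: {resident_mb():.0f} MB")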
4.2.3 Bandwidth
Hosting sites can cap the maximum rate at which the
local PlanetLab nodes can send data. PlanetLab fairly
shares the bandwidth under the cap among slices, using
Linux’s Hierarchical Token Bucket traffic filter [15]. The
node bandwidth cap allows sites to limit the peak rate at
which nodes send data so that PlanetLab slices cannot
completely saturate the site’s outgoing links.
The sustained rate of each slice is limited by the
pl_mom watchdog daemon. The daemon allows each
slice to send a quota of bytes each day at the node’s cap
rate, and if the slice exceeds its quota, it imposes a much
smaller cap for the rest of the day. For example, if the
slice’s quota is 16GB/day, then this corresponds to a sus-
tained rate of 1.5Mbps; once the slice sends more than
16GB, it is capped at 1.5Mbps until midnight GMT.

Figure 6: Sustained network rates on PlanetLab, by quartile. (a)
Transmit bandwidth in Kb/s; (b) receive bandwidth in Kb/s.

The
goal is to allow most slices to burst data at the node’s cap
rate, but prevents slices that are sending large amounts of
data from badly abusing the site’s local resources.
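The quota arithmetic used above (16GB/day corresponding to roughly 1.5Mbps) and the capping rule can be sketched as follows; the function names are illustrative.

    def sustained_mbps(gb_per_day):
        # Convert a daily byte quota into its equivalent sustained rate.
        return gb_per_day * 8 * 1000 / 86400      # 16 GB/day -> ~1.5 Mbps

    def current_cap_mbps(bytes_sent_today, quota_gb, node_cap_mbps):
        if bytes_sent_today < quota_gb * 1e9:
            return node_cap_mbps                  # may still burst at node cap
        return sustained_mbps(quota_gb)           # throttled until midnight GMT

    assert round(sustained_mbps(16), 1) == 1.5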
There are two weaknesses of PlanetLab’s bandwidth
capping approach. First, some sites pay for bandwidth
based on the total amount of traffic they generate per
month, and so they need to control the node’s sustained
bandwidth rather than the peak. As mentioned, pl_mom
limits sustained bandwidth, but it operates on a per-slice
(rather than per-node) basis and cannot currently be con-
trolled by the sites. Second, PlanetLab does not currently
cap incoming bandwidth. Therefore, PlanetLab nodes
can still saturate a bottleneck link by downloading large
amounts of data. We are currently investigating ways to
fix both of these limitations.
Figure 6 shows, by quartile, the sustained rates at
which traffic is sent and received on PlanetLab nodes
since January 2006. These are calculated as the sums of
the average transmit and receive rates for all the slices
of the machine over the last 15 minutes. Note that the
y axis is logarithmic, and the Minimum line is omit-
ted from the graph. The typical PlanetLab node trans-
mits about 1Mb/s and receives 500Kb/s, corresponding
to about 10.8GB/day sent and 5.4GB/day received. These
numbers are well below the typical node bandwidth cap
of 10Mb/s. On the other hand, some PlanetLab nodes do
actually have sustained rates of 10Mb/s both ways.
Figure 7: Disk usage, by quartile, on PlanetLab.
4.2.4 Disk
PlanetLab nodes do not provide permanent storage: data
is not backed up, and any node may be reinstalled without
warning. Services adapt to this environment by treating
disk storage as a cache and storing permanent data else-
where, or else replicating data on multiple nodes. Still,
a PlanetLab node that runs out of disk space is essen-
tially useless. In our experience, disk space is usually ex-
hausted by runaway log files written by poorly-designed
experiments. This problem was mitigated, but not en-
tirely solved, by the introduction of per-slice disk quotas
in June 2005. The default quota is 5GB, with larger quo-
tas granted on a case-by-case basis.
Figure 7 shows, by quartile, the disk utilization on
PlanetLab. The visible dip shortly after May 2005 is
when quotas were introduced. We note that, though disk
utilization grows steadily over time, 75% of PlanetLab
nodes still have at least 50% of their disk space free. Some
PlanetLab nodes do occasionally experience full disks,
but most are old nodes that do not meet the current sys-
tem requirements.
4.2.5 Jitter
CPU scheduling latency can be a serious problem for
some PlanetLab applications. For example, in a packet
forwarding overlay, the time between when a packet ar-
rives and when the packet forwarding process runs will
appear as added network latency to the overlay clients.
Likewise, many network measurement applications as-
sume low scheduling latency in order to produce pre-
cisely spaced packet trains. Many measurement applica-
tions can cope with latency by knowing which samples to
trust and which must be discarded, as described in [29].
Scheduling latency is more problematic for routing over-
lays, which may have to drop incoming packets.
A simple experiment indicates how scheduling latency
can affect applications on PlanetLab. We deploy a packet
forwarding overlay, constructed using the Click modular
software router [13], on six PlanetLab nodes co-located
at Abilene PoPs between Washington, D.C. and Seattle.
Figure 8: RTT CDF for the network, and for the overlay with and
without SCHED_RR.

Our experiment then uses ping packets to compare the
RTT between the Seattle and D.C. nodes on the network
and on the six-hop overlay. Each of the six PlanetLab
nodes running our overlay had load averages between 2
and 5, and between 5 and 8 live slices, during the ex-
periment. We observe that the network RTT between the
two nodes is a constant 74ms over 1000 pings, while the
overlay RTT varies between 76ms and 135ms. Figure 8
shows the CDF of RTTs for the network (leftmost curve)
and the overlay (rightmost curve). The overlay CDF has
a long tail that is chopped off at 100ms in the graph.
There are several reasons why the overlay could have
its CPU scheduling latency increased, including: (1) if
another task is running when a packet arrives, Click must
wait to forward the packet until the running task blocks
or exhausts its timeslice; (2) if Click is trying to use more
than its “fair share”, or exceeds its CPU guarantee, then
its token bucket CPU filter will run out of tokens and
it will be removed from the runqueue until it acquires
enough tokens to run; (3) even though the Click process
may be runnable, the Linux CPU scheduler may still de-
cide to schedule a different process; and (4) interrupts
and other kernel activity may preempt the Click process
or otherwise prevent it from running.
We can attack the first three sources of latency using
existing scheduling mechanisms in PlanetLab. First, we
give the overlay slice a CPU reservation to ensure that
it will never run out of tokens during our experiment.
Second, we use chrt to run the Click process on each
machine with the SCHED RR scheduling policy, so that
it will immediately jump to the head of the runqueue
and preempt any running task. The Proper service de-
scribed in Section 3.5 enables our slice to run the privi-
leged chrt command on each PlanetLab node.
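On a node where the slice has been granted the necessary privilege, the same effect can be achieved programmatically; the sketch below uses Python's wrapper around sched_setscheduler(2), with a placeholder PID and priority.

    import os

    def make_realtime(pid: int, priority: int = 1) -> None:
        # Equivalent in effect to: chrt -r -p <priority> <pid>
        # Moves the process into the SCHED_RR class so it preempts
        # ordinary time-shared tasks as soon as it becomes runnable.
        os.sched_setscheduler(pid, os.SCHED_RR, os.sched_param(priority))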
The middle curve in Figure 8 shows the results of re-
running our experiment with these new CPU scheduling
parameters. The overhead of the Click overlay, around
3ms, is clearly visible as the difference between the two
left-most curves. In the new experiment, about 98% of
overlay RTTs are within 3ms of the underlying network
RTT, and 99% are within 6ms. These CPU scheduling
mechanisms are employed by PL-VINI, the VINI (VIr-
tual Network Infrastructure) prototype implemented on
PlanetLab, to reduce the latency that an overlay network
incurs as an artifact of CPU scheduling delay [4].
We note two things. First, the obstacle to making
this solution available on PlanetLab is primarily one of
policy—choosing which slices should get CPU reserva-
tions and bumps to the head of the runqueue, since it
is not possible to reduce everyone’s latency on a heav-
ily loaded system. We plan to offer this option to short-
term experiments via the Sirius brokerage service, but
long-running routing overlays will need to be handled on
a case-by-case basis. Second, while our approach can
provide low latency to the Click forwarder in our exper-
iment 99% of the time, it does not completely solve the
latency problem. We hypothesize that the remaining CPU
scheduling jitter is due to the fourth source of latency
identified earlier, i.e., kernel activity. If so, we may be
able to further reduce it by enabling kernel preemption, a
feature already available in the Linux 2.6 kernel.
4.2.6 Remarks
Note that only limited conclusions can be drawn from the
fact that there is unused capacity available on PlanetLab
nodes. Users are adapting to the behavior of the system
(including electing to not use it) and they are writing ser-
vices that adapt to the available resources. It is impossible
to know how many resources would have been used, even
by the same workload, had more been available. How-
ever, the data does document that PlanetLab’s fair share
approach is behaving as expected.
5 Operational Stability
The need to maintain a stable system, while at the same
time evolving it based on user experience, has been a ma-
jor complication in designing PlanetLab. This section
outlines the general strategies we adopted, and presents
data that documents our successes and failures.
5.1 Strategies
There is no easy way to continually evolve a system that
is experiencing heavy use. Upgrades are potentially dis-
ruptive for at least two reasons: (1) new features intro-
duce new bugs, and (2) interface changes force users to
upgrade their applications. To deal with this situation, we
adopted three general strategies.
First, we kept PlanetLab’s control plane (i.e., the ser-
vices outlined in Section 3) orthogonal from the OS. This
meant that nearly all of the interface changes to the sys-
tem affected only those slices running management ser-
vices; the vast majority of users were able to program to
a relatively stable Linux API. In retrospect this is an ob-
vious design principle, but when the project began, we
believed our main task was to define a new OS interface
tailored for wide-area services. In fact, the one example
where we deviated from this principle—by changing the
socket API to support safe raw sockets [3]—proved to be
an operational disaster because the PlanetLab OS looked
enough like Linux that any small deviation caused dis-
proportionate confusion.
Second, we leveraged existing software wherever
possible. This was for three reasons: (1) to improve the
stability of the system; (2) to lower the barrier-to-entry
for the user community; and (3) to reduce the amount of
new code we had to implement and maintain. This last
point cannot be stressed enough. Even modest changes
to existing software packages have to be tracked as those
packages are updated over time. In our eagerness to reuse
rather than invent, we made some mistakes, the most no-
table of which is documented in the next subsection.
Third, we adopted a well-established practice of rolling
new releases out incrementally. This is for the obvious
reason—to build confidence that the new release actually
worked under realistic loads before updating all nodes—
but also for a reason peculiar to PlanetLab: some long-running services maintain persistent data repositories, and doing so depends on a minimum number of replicas remaining available at any time. Updates that reboot nodes must therefore happen incrementally if long-running storage services are to survive.
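As a sketch of the constraint this imposes, the following hypothetical helper partitions nodes into reboot batches so that only a bounded fraction of nodes is down at once; the batch-size parameter and the update hooks are our own illustrative assumptions.

    def rollout_batches(nodes, max_down_fraction=0.1):
        """Partition nodes into reboot batches so that no more than
        max_down_fraction of all nodes is rebooting at any time, giving
        replicated storage services a chance to keep enough copies alive."""
        batch_size = max(1, int(len(nodes) * max_down_fraction))
        return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]

    # Hypothetical usage: push the release one batch at a time, waiting
    # for each batch to come back online before starting the next.
    # for batch in rollout_batches(all_nodes):
    #     upgrade_and_reboot(batch)   # placeholder for the real update step
    #     wait_until_online(batch)    # placeholder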
Note that while few would argue with these
principles—and it is undoubtedly the case that we would
have struggled had we not adhered to them—our experi-
ence is that many other factors (some unexpected) had a
significant impact on the stability of the system. The rest
of this section reports on these operational experiences.
5.2 Node Stability
We now chronicle our experience operating and evolving
PlanetLab. Figure 9 illustrates the availability of Plan-
etLab nodes from September 2004 through April 2006,
as inferred from CoMon. The bottom line indicates the
PlanetLab nodes that have been up continuously for the
last 30 days (stable nodes), the middle line is the count
of nodes that came online within the last 30 days, and
the top line is all registered PlanetLab nodes. Note that
the difference between the bottom and middle lines repre-
sents the “churn” of PlanetLab over a month’s time; and
the difference between the middle and top lines indicates
the number of nodes that are offline. The vertical lines in
Figure 9 are important dates, and the letters at the top of
the graph let us refer to the intervals between the dates.
[Figure 9: Node Availability. Counts of stable nodes (up > 30 days), nodes active in the last 30 days, and all registered nodes, September 2004 through April 2006. Vertical lines mark important dates, dividing the graph into intervals: A: run-up to NSDI ’05 deadline; B: after NSDI ’05 deadline; C: 3.0 rollout begins; D: 3.0 stable release; E: 3.1 stable release; F: 3.2 rollout begins; G: 3.2 stable release.]

There have clearly been problems providing the community with a stable system. Figure 9 illustrates several reasons for this:

• Sometimes instability stems from the community stressing the system in new ways. In Figure 9, interval A is the run-up to the NSDI ’05 deadline. During this time, heavy use combined with memory leaks in some experiments caused kernel panics due to Out-of-Memory errors. This is the common behavior of Linux when the system runs out of memory and swap space. pl_mom (Section 4.2.2) was introduced in response to this experience.
• Software upgrades that require a reboot obviously affect the set of stable nodes (e.g., intervals C and D), but installing buggy software has a longer-term effect on stability. Interval C shows the release of PlanetLab 3.0. Although the release had undergone extensive off-line testing, it had bugs, and a relatively long period of instability followed. PlanetLab was still usable during this period, but nodes rebooted at least once per month.
• The pl_mom watchdog is not perfect. There is a slight dip in the number of stable core nodes (bottom line) in interval D, when about 50 nodes were rebooted because of a slice with a fast memory leak; as memory pressure was already high on those nodes, pl_mom could not detect and reset the slice before the nodes ran out of memory. (A sketch of this style of check follows the list.)
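To make that failure mode concrete, here is a minimal sketch of the style of check a watchdog like pl_mom performs; the polling interval, the swap threshold, and the per-slice accounting hooks are our own illustrative assumptions, not the actual pl_mom implementation.

    import time

    SWAP_THRESHOLD = 0.85   # assumed: act once swap is 85% used
    POLL_SECONDS = 30       # assumed polling interval

    def swap_used_fraction():
        """Fraction of swap space in use, read from /proc/meminfo."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":")
                info[key] = int(value.split()[0])   # values are in kB
        total = info["SwapTotal"]
        return 0.0 if total == 0 else (total - info["SwapFree"]) / total

    def watchdog(memory_per_slice, reset_slice):
        """Reset the largest memory consumer when swap is nearly gone.
        memory_per_slice() and reset_slice() are hypothetical hooks into
        the node manager; the real pl_mom does its own accounting."""
        while True:
            if swap_used_fraction() > SWAP_THRESHOLD:
                usage = memory_per_slice()          # {slice_name: bytes}
                reset_slice(max(usage, key=usage.get))
            time.sleep(POLL_SECONDS)

As the bullet above notes, a fast enough leak can still exhaust memory between polls, which is exactly what happened on the roughly 50 affected nodes.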
We note, however, that the 3.2 software release from late
December 2005 is the best so far in terms of stability:
as of February 2006, about two-thirds of active Planet-
Lab nodes have been up for at least a month. We at-
tribute most of this to abandoning CKRM in favor of VServers' native resource management framework and a new CPU scheduler.
One surprising fact to emerge from Figure 9 is that a
lot of PlanetLab nodes are dead (denoted by the differ-
ence between the top and middle lines). Research orga-
nizations gain access to PlanetLab simply by hooking up
two machines at their local site. This formula for growth
has worked quite well: a low barrier-to-entry provided
the right incentive to grow PlanetLab. However, there
have never been well-defined incentives for sites to keep
nodes online. Providing such incentives is obviously the
right thing to do, but we note that the majority of
the off-line nodes are at sites that no longer have active
slices—and at the time of this writing only 12 sites had
slices but no functioning nodes—so it’s not clear what
incentive will work.
Now that we have reached a fairly stable system, it be-
comes interesting to study the “churn” of nodes that are
active yet are not included in the stable core. We find it
useful to differentiate between three categories of nodes:
those that came up that day (and stayed up), those that
went down that day (and stayed down), and those that
have rebooted at least once. Our experience is that, on
a typical day, about 2 nodes come up, about 2 nodes go
down, and about 4 nodes reboot. On 10% of days, at least
6 nodes come up or go down, and at least 8 nodes reboot.
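These categories are straightforward to compute from two consecutive daily snapshots of node liveness plus a per-node boot counter; the snapshot format below is assumed for illustration (CoMon exposes equivalent data).

    def classify_churn(up_yesterday, up_today, boots_yesterday, boots_today):
        """Classify nodes by daily churn.
        up_*    : sets of node names that were up in each snapshot
        boots_* : dicts mapping node name -> cumulative boot count
        Returns (came_up, went_down, rebooted) as sets of node names."""
        came_up = up_today - up_yesterday
        went_down = up_yesterday - up_today
        rebooted = {
            node for node in (up_today & up_yesterday)
            if boots_today.get(node, 0) > boots_yesterday.get(node, 0)
        }
        return came_up, went_down, rebooted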
Looking at the archives of the support and
planetlab-users mailing lists, we are able to iden-
tify the most common reasons nodes come up or go down:
(1) a site takes its nodes offline to move them or change
their network configuration, (2) a site takes its nodes
offline in response to a security incident, (3) a site accidentally changes a network configuration that renders its
nodes unreachable, or (4) a node goes offline due to a
hardware failure. The last is the most common reason
for nodes being down for an extended period of time; the
third reason is the most frustrating aspect of operating a
system that embeds its nodes in over 300 independent IT
organizations.
Understanding the relative frequency of different sorts
of site events may be important for designers of other
large-scale distributed systems; this is a topic for further
study.
5.3 Security Complaints
Of the operational issues that PlanetLab faces, respond-
ing to security complaints is perhaps the most interesting,
if only because of what they say about the current state of
the Internet. We comment on three particular types of
complaints.
The most common complaints are the result of IDS
alerts. One frequent scenario corresponds to a perceived
DoS attack. These are sometimes triggered by a poorly
designed experiment (in which case the responsible re-
searchers are notified and expected to take corrective ac-
tion), but they are more likely to be triggered by totally
innocent behavior (e.g., 3 unsolicited UDP packets have
triggered the threat of legal action). In other cases, the
alerts are triggered by simplistic signatures for malware that could not be running in our Linux-based environment. In general, we observe that any traffic that devi-
ates from a rather narrow range of acceptable behavior is
increasingly viewed as suspect, which makes innovating
with new types of network services a challenge.
An increasingly common type of complaint comes
from home users monitoring their firewall logs. They see
connections to PlanetLab nodes that they do not recog-
nize, assume PlanetLab has installed spyware on their
machines, and demand that it be removed. In reality,
they have unknowingly used a service (e.g., a CDN) that
has imposed itself between them and a server. Receiving
packets from a location service that also probes the client
to select the most appropriate PlanetLab node to service
a request only exacerbates the situation [35, 9]. The take-
away is that even individual users are becoming increas-
ingly security-sensitive (if less security-sophisticated than
their professional counterparts), which makes the task of
deploying alternative services increasingly problematic.
Finally, PlanetLab nodes are sometimes identified as
the source or sink of illegal content. In reality, the con-
tent is only cached on the node by a slice running a CDN
service, but an overlay node looks like an end node to
the rest of the Internet. PlanetLab staff use PlanetFlow
to identify the responsible slice, which in turn typically
maintains a log that can be used to identify the ultimate
source or destination. This information is passed along
to the authorities, when appropriate. While many hosting
sites are justifiably gun-shy about such complaints, the
main lesson we have learned is that trying to police con-
tent is not a viable solution. The appropriate approach
is to be responsive and cooperative when complaints are
raised.
6 Discussion
Perhaps the most fundamental issue in PlanetLab’s de-
sign is how to manage trust in the face of pressure to de-
centralize the system, where decentralization is motivated
by the desire to (1) give owners autonomous control over
their nodes and (2) give third-party service developers the
flexibility they need to innovate.
At one end of the spectrum, individual organizations
could establish bilateral agreements with those organiza-
tions that they trust, and with which they are willing to
peer. The problem with such an approach is that reaching
the critical mass needed to foster a large-scale deploy-
ment has always proved difficult. PlanetLab started at the
other end of the spectrum by centralizing trust in a sin-
gle intermediary—PLC—and it is our contention that do-
ing so was necessary to get PlanetLab off the ground.
To compensate, the plan was to decentralize the system
through two other means: (1) users would delegate the
right to manage aspects of their slices to third-party ser-
vices, and (2) owners would make resource allocation de-
cisions for their nodes. This approach has had mixed suc-
cess, but it is important to ask if these limitations are fun-
damental or simply a matter of execution.
With respect to owner autonomy, all sites are allowed
to set bandwidth caps on their nodes, and sites that have
contributed more than the two-node minimum required
to join PlanetLab are allowed to give excess resources
to favored slices, including brokerage services that redis-
tribute those resources to others. In theory, sites are also
allowed to blacklist slices they do not want running lo-
cally (e.g., because they violate the local AUP), but we
have purposely not advertised this capability in an effort
to “unionize” the research community: take all of our ex-
periments, even the risky ones, or take none of them. (As
a compromise, some slices voluntarily do not run on cer-
tain sites so as to not force the issue.) The interface by
which owners express their wishes is clunky (and some-
times involves assistance from the PlanetLab staff), but
there does not seem to be any architectural reason why
this approach cannot provide whatever control over re-
source allocation that owners require (modulo meeting
requirements for joining PlanetLab in the first place).
With respect to third-party management services, suc-
cess has been much more mixed. There have been some
successes—Stork, Sirius, and CoMon being the most no-
table examples—but this issue is a contentious one in the
PlanetLab developer community. There are many pos-
sible explanations, including there being few incentives
and many costs to providing 24/7 management services;
users preferring to roll their own management utilities
rather than learn a third-party service that doesn’t exactly
satisfy their needs; and the API being too much of a mov-
ing target to support third-party development efforts.
While these possibilities provide interesting fodder for
debate, there is a fundamental issue of whether the cen-
tralized trust model impacts the ability to deploy third-
party management services. For those services that re-
quire privileged access on a node (see Section 3.5) the an-
swer is yes—the PLC support staff must configure Proper
to grant the necessary privilege(s). While in practice such
privileges have been granted in all cases that have not vio-
lated PlanetLab’s underlying trust assumptions or jeopar-
dized the stability of the operational system, this is clearly
a limitation of the architecture.
Note that choice is not just limited to what manage-
ment services the central authority approves, but also
to what capabilities are included in the core system—
e.g., whether each node runs Linux, Windows, or Xen.
Clearly, a truly scalable system cannot depend on a sin-
gle trusted entity making these decisions. This is, in fact,
the motivation for evolving PlanetLab to the point that it
can support federation. To foster federation we have put
together a software distribution, called MyPLC, that al-
lows anyone to create their own private PlanetLab, and
potentially federate that PlanetLab with others (including
the current “public” PlanetLab).
This returns us to the original issue of centralized ver-
sus decentralized trust. The overriding lesson of Plan-
etLab is that a centralized trust model was essential to
achieving some level of critical mass—which in turn al-
lowed us to learn enough about the design space to define
a candidate minimal interface for federation—but that it
is only by federating autonomous instances that the sys-
tem will truly scale. Private PlanetLabs will still need bi-
lateral peering agreements with each other, but there will
also be the option of individual PlanetLabs scaling inter-
nally to non-trivial sizes. In other words, the combination
of bilateral agreements and trusted intermediaries allows
for flexible aggregation of trust.
7 Conclusions
Building PlanetLab has been a unique experience. Rather
than leveraging a new mechanism or algorithm, it has re-
quired a synthesis of carefully selected ideas. Rather than
being based on a pre-conceived design and validated with
controlled experiments, it has been shaped and proven
through real-world usage. Rather than being designed to
function within a single organization, it is a large-scale
distributed system that must be cognizant of its place in
a multi-organization world. Finally, rather than having to
satisfy only quantifiable technical objectives, its success
has depended on providing various communities with the
right incentives and being equally responsive to conflict-
ing and difficult-to-measure requirements.
Acknowledgments
Many people have contributed to PlanetLab. Timothy
Roscoe, Tom Anderson, and Mic Bowman have provided
significant input to the definition of its architecture. Sev-
eral researchers have also contributed management ser-
vices, including David Lowenthal, Vivek Pai and Ky-
oungSoo Park, John Hartman and Justin Cappos, and Jay
Lepreau and the Emulab team. Finally, the contributions
of the PlanetLab staff at Princeton—Aaron Klingaman,
Mark Huang, Martin Makowiecki, Reid Moran, Faiyaz
Ahmed, Brian Jones, and Scott Karlin—have been im-
measurable.
We also thank the anonymous referees, and our shep-
herd, Jim Waldo, for their comments and help in improv-
ing this paper.
This work was funded in part by NSF Grants CNS-
0520053, CNS-0454278, and CNS-0335214.
References
[1] ANNAPUREDDY, S., FREEDMAN, M. J., AND MAZIERES, D. Shark: Scal-
ing File Servers via Cooperative Caching. In Proc. 2nd NSDI (Boston, MA,
May 2005).
[2] AUYOUNG, A., CHUN, B., NG, C., PARKES, D., SHNEI-
DMAN, J., SNOEREN, A., AND VAHDAT, A. Bellagio: An
Economic-Based Resource Allocation System for PlanetLab.
http://bellagio.ucsd.edu/about.php.
[3] BAVIER, A., BOWMAN, M., CULLER, D., CHUN, B., KARLIN, S., MUIR,
S., PETERSON, L., ROSCOE, T., SPALINK, T., AND WAWRZONIAK, M.
Operating System Support for Planetary-Scale Network Services. In Proc.
1st NSDI (San Francisco, CA, Mar 2004).
[4] BAVIER, A., FEAMSTER, N., HUANG, M., PETERSON, L., AND REX-
FORD, J. In VINI Veritas: Realistic and Controlled Network Experimenta-
tion. In Proc. SIGCOMM 2006 (Pisa, Italy, Sep 2006).
[5] BRETT, P., KNAUERHASE, R., BOWMAN, M., ADAMS, R., NATARAJ,
A., SEDAYAO, J., AND SPINDEL, M. A Shared Global Event Propaga-
tion System to Enable Next Generation Distributed Services. In Proc. 1st
WORLDS (San Francisco, CA, Dec 2004).
[6] CLARK, D. D. The Design Philosophy of the DARPA Internet Protocols.
In Proc. SIGCOMM ’88 (Stanford, CA, Aug 1988), pp. 106–114.
[7] DAVID LOWENTHAL. Sirius: A Calendar Service for PlanetLab.
http://snowball.cs.uga.edu/dkl/pslogin.php.
[8] FREEDMAN, M. J., FREUDENTHAL, E., AND MAZIERES, D. Democratiz-
ing content publication with Coral. In Proc. 1st NSDI (San Francisco, CA,
Mar 2004).
[9] FREEDMAN, M. J., LAKSHMINARAYANAN, K., AND MAZIERES, D. OA-
SIS: Anycast for Any Service. In Proc. 3rd NSDI (San Jose, CA, May
2006).
[10] FU, Y., CHASE, J., CHUN, B., SCHWAB, S., AND VAHDAT, A. SHARP:
An Architecture for Secure Resource Peering. In Proc. 19th SOSP (Lake
George, NY, Oct 2003).
[11] HUANG, M., BAVIER, A., AND PETERSON, L. PlanetFlow: Maintaining
Accountability for Network Services. ACM SIGOPS Operating Systems
Review 40, 1 (Jan 2006).
[12] JUSTIN CAPPOS AND JOHN HARTMAN. Stork: A Software Package Management Service for PlanetLab. http://www.cs.arizona.edu/stork.
[13] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., AND KAASHOEK,
M. F. The Click Modular Router. ACM Transactions on Computer Systems
18, 3 (Aug 2000), 263–297.
[14] LAI, K., RASMUSSON, L., ADAR, E., SORKIN, S., ZHANG, L., AND
HUBERMAN, B. A. Tycoon: An Implementation of a Distributed Market-
Based Resource Allocation System. Tech. Rep. arXiv:cs.DC/0412038, HP
Labs, Palo Alto, CA, USA, Dec. 2004.
[15] LINUX ADVANCED ROUTING AND TRAFFIC CONTROL.
http://lartc.org/.
[16] LINUX VSERVERS PROJECT.
http://linux-vserver.org/.
[17] MUIR, S., PETERSON, L., FIUCZYNSKI, M., CAPPOS, J., AND HART-
MAN, J. Privileged Operations in the PlanetLab Virtualised Environment.
SIGOPS Operating Systems Review 40, 1 (2006), 75–88.
[18] OPPENHEIMER, D., ALBRECHT, J., PATTERSON, D., AND VAHDAT, A.
Distributed Resource Discovery on PlanetLab with SWORD. In Proc. 1st
WORLDS (San Francisco, CA, 2004).
[19] PARK, K., AND PAI, V. S. Scale and Performance in the CoBlitz Large-File
Distribution Service. In Proc. 3rd NSDI (San Jose, CA, May 2006).
[20] PARK, K., PAI, V. S., PETERSON, L. L., AND WANG, Z. CoDNS: Im-
proving DNS Performance and Reliability via Cooperative Lookups. In
Proc. 6th OSDI (San Francisco, CA, Dec 2004), pp. 199–214.
[21] PETERSON, L., ANDERSON, T., CULLER, D., AND ROSCOE, T. A
Blueprint for Introducing Disruptive Technology into the Internet. In Proc.
HotNets–I (Princeton, NJ, Oct 2002).
[22] PETERSON, L., BAVIER, A., FIUCZYNSKI, M., MUIR, S., AND ROSCOE,
T. PlanetLab Architecture: An Overview. Tech. Rep. PDN–06–031, Plan-
etLab Consortium, Apr 2006.
[23] RAMASUBRAMANIAN, V., PETERSON, R., AND SIRER, E. G. Corona: A
High Performance Publish-Subscribe System for the World Wide Web. In
Proc. 3rd NSDI (San Jose, CA, May 2006).
[24] RAMASUBRAMANIAN, V., AND SIRER, E. G. Beehive: O(1) Lookup Per-
formance for Power-Law Query Distributions in Peer-to-Peer Overlays. In
Proc. 1st NSDI (San Francisco, CA, Mar 2004), pp. 99–112.
[25] RAMASUBRAMANIAN, V., AND SIRER, E. G. The Design and Imple-
mentation of a Next Generation Name Service for the Internet. In Proc.
SIGCOMM 2004 (Portland, OR, Aug 2004), pp. 331–342.
[26] RHEA, S., GODFREY, B., KARP, B., KUBIATOWICZ, J., RATNASAMY,
S., SHENKER, S., STOICA, I., AND YU, H. OpenDHT: A Public DHT
Service and its Uses. In Proc. SIGCOMM 2005 (Philadelphia, PA, Aug
2005), pp. 73–84.
[27] RITCHIE, D. M., AND THOMPSON, K. The UNIX Time-Sharing System.
Communications of the ACM 17, 7 (Jul 1974), 365–375.
[28] RYAN HUEBSCH. PlanetLab Application Manager.
http://appmanager.berkeley.intel-research.net/.
[29] SPRING, N., BAVIER, A., PETERSON, L., AND PAI, V. S. Using PlanetLab
for Network Research: Myths, Realities, and Best Practices. In Proc. 2nd
WORLDS (San Francisco, CA, Dec 2005).
[30] SPRING, N., WETHERALL, D., AND ANDERSON, T. Scriptroute: A Public
Internet Measurement Facility. In Proc. 4th USITS (Seattle, WA, Mar 2003).
[31] VIVEK PAI AND KYOUNGSOO PARK. CoMon: A Monitoring Infrastruc-
ture for PlanetLab. http://comon.cs.princeton.edu.
[32] WALDSPURGER, C. A., AND WEIHL, W. E. Lottery Scheduling: Flexible
Proportional-Share Resource Management. In Proc. 1st OSDI (Monterey,
CA, Nov 1994), pp. 1–11.
[33] WANG, L., PARK, K., PANG, R., PAI, V. S., AND PETERSON, L. L. Reli-
ability and Security in the CoDeeN Content Distribution Network. In Proc.
USENIX ’04 (Boston, MA, Jun 2004).
[34] WHITE, B., LEPREAU, J., STOLLER, L., RICCI, R., GURUPRASAD, S.,
NEWBOLD, M., HIBLER, M., BARB, C., AND JOGLEKAR, A. An Inte-
grated Experimental Environment for Distributed Systems and Networks.
In Proc. 5th OSDI (Boston, MA, Dec 2002), pp. 255–270.
[35] WONG, B., SLIVKINS, A., AND SIRER, E. G. Meridian: A Lightweight
Network Location Service without Virtual Coordinates. In Proc. SIGCOMM
2005 (Philadelphia, PA, Aug 2005).
[36] ZHANG, M., ZHANG, C., PAI, V. S., PETERSON, L. L., AND WANG,
R. Y. PlanetSeer: Internet Path Failure Monitoring and Characterization in
Wide-Area Services. In Proc. 6th OSDI (San Francisco, CA, Dec 2004),
pp. 167–182.