next up previous

Axioms of system administration

To begin a formal discussion, we need to establish a frame of reference, i.e. the ground rules for the discussion. In this section, a basic fundament is proposed with the aim of striking a balance between reality and suitability for analysis. It is also necessary to partially limit the scope of the discussion to avoid unnecessary complication. Although the aim of this presentation is not precise mathematical rigor, it is the aim to indicate that such a rigor is possible and to indicate how it can be provided. A secondary aim is to communicate the key elements of the discussion to a more theoretical audience; for these reasons, the language adopted is one which is meant to build bridges between system administration and more mathematical disciplines. Readers are asked to keep an open mind with regard to use of terms however, since technical disciplines often use words in meanings which are specific to those disciplines, and this could lead to confusion.

A computer system is analogous to a community[4] composed of many interacting and competing players: i.e. users and administrators. It can only properly be discussed in terms of the aims and activities of this collective and of individual members of that community. Not all the members of a community share the same objectives, as a general rule. Traugott and Huddleston have pointed out[5] that it is often pertinent to view a local computer community as a single virtual machine, rather than as a conglomeration of individual hosts. In this paper, the term computer system will be used to refer to the collective hosts of a local domain, or some appropriate logical unit of networked computers. It is taken for granted that there may be internal competition for resources and even conflict between competing parties.

In order to formulate a theory of system administration we must establish a set of possible goals, procedures and obstructions and state them in formal terms. The aims and intentions of each computer system are different; usually they are prescribed by a system policy, i.e. a formal statement of intent and allowed practice. The aim is then to postulate or derive strategies which best achieve those goals, given the essential constraints. From this viewpoint, one expects the language of constrained competition to play a role in a theory of system administration. Even if one could frame such a theory in formal terms, what would be the purpose of such an exercise? The principal benefit of such an attempt is to create a rigid protocol for discussing system administration, which is general enough to cover most of the actual problems and possibilities, but which is stringent enough to prevent its perversion by parties with vested interests in proving a certain point of view.

There is a number of stages in this programme. To begin with, one needs some basic axioms which all parties agree on, propositions which define the aims of system administration. Next one needs to abstract a model of a computer system which is sufficient to capture the dynamical interaction between all of the players, but which is sufficiently simple to be surrendered for analysis. Here we shall suppose that a computer's resources (memory, CPU, disk etc) are divided into two parts,

$\displaystyle R = R_c \oplus R_m,$     (1)

i.e. a part which determines the behavioural configuration parameters C of the system working resources, and a remainder part (the working resources themselves) which users of the system can change as a normal part of their interaction with the system. This remainder part can be observed over an appropriate time scale, giving a set of measurements $\overline M$ which indicate how the system is being used.

The different possible configurations of the system resources $\lbrace C
\rbrace$ are made up from the independent operation types $\lbrace T \rbrace$which lead to these configurations. From this definition one needs to be sure that a unique description is possible, i.e. eliminate points of contention about the description itself. Finally, one must be sure that the description is sufficiently complete, i.e. that there exists a mapping between policy and system configuration which is as complete as the problem itself. The purpose of this section is to introduce the key players in this description, in advance of a fuller description in the coming sections.

In order to state the purpose of system administration, we may take the basic tenet or principle to be the following:

Basic assumption 1   The requirements and constraints of any computer system are defined at any time t by an implementable system policy P(t). This policy determines the actions or rules of play for a system administrator, but not necessarily the actions of users. It includes a specification of which and how many users are allowed to access the system.

The policy P(t) is not usually a continuous function of time, but may change catastrophically (in the mathematical sense) over a time scale which is much longer than the time scale over which users act and make changes to the system. The nature of this policy is not yet determined.

In order to make a policy implementable, it must be possible to relate it to a complete configuration instruction for the system C(t), through rules and constraints. These rules and constraints could be issued verbally to users, or could be programmed into configuration files of software components which form the system. A single complete configuration instruction for the system can be thought of as being a sum of two parts:

$\displaystyle C = C_r \oplus C_u,$     (2)

a specification of resource configurations Cr which describe how the software and hardware landscape is configured, and a specification of user configurations Cu, which describes who is allowed to do what with the resources (this includes remote, network users who access services through local agents). A specification of user configurations Cu(numbers of users and their rights to resources) could easily be separated from system policy conceptually, but it is convenient to view the policy as a complete specification of the system plus its intended and actual usage. The meaning of the symbol $\oplus$is that of a heuristic union: configuration specifications take many forms (are objects of many types). They are most easily thought of as sets of more primitive objects, in which case the addition of sets implies their strict union.

A complete configuration instruction can be thought of geometrically as a point in a vector space, which is found by adding together instructions of linearly independent (orthogonal) types. One does this by introducing a set of primitive configuration instruction types $\lbrace T^i \rbrace$, and writing the complete configuration as a linear combination of these:

$\displaystyle C = \sum_i \; c_i T^i,$     (3)

with set-valued coefficients ci. The basis of primitive configuration operations will be described later. A complete configuration usually contains instructions for the operators of the system also. The system administrator can also be viewed as part of this system for the sake of abstraction.

The set of all configurations $\lbrace C
\rbrace$ contains much redundancy. Let us imagine that the mapping of complete configuration instructions to the final state of the system is many to one, and that this multiplicity can be represented by a group of permutations and transformations ${\cal G}$1. Thus equivalent configurations could be formed by permuting configuration instructions, if ordering is unimportant, or by exchanging (transforming) one component of the configuration for another. An example of this is the following: the configuration of a World Wide Web service might be possible with several equivalent software systems, each with equivalent configuration files: in this case, these would form an equivalence class. Conversely, if even a minor detail distinguishes them, then they are inequivalent configurations.

Let us define an implementable policy P(t) as being any representative member of the set of equivalent configurations $\lbrace
C(t)\rbrace$

$\displaystyle P(t) \equiv \lbrace C(t) \rbrace / {\cal G}.$     (4)

A policy could naturally mean more than a configuration (or computer and its operators), but as long as other aspects of the policy cannot be implemented by either machine or human, they are irrelevant to the system. Having related policy to configuration instructions, the path is clear to define the state S(t) of the system.

Let a state of the system describe a single configuration of users Cu, of system resources Cr and a set of measured average metrics $\overline M$ which summarizes the average usage of the system in relation to the users[2]. The metrics $\overline M$ represent a first order response (feedback interaction) between users and resources. The state is written, again, as a direct sum

$\displaystyle \overline S_p \equiv \left(\frac{C_r \oplus C_u}{\cal G}\right) \oplus \overline M(C_r,C_u)$     (5)

where the behavioural metrics $\overline M(C_r,C_u)$ are functions of resources and user activity. One may now state the following provable hypothesis.

Theorem 1   Any sufficiently complete system policy P(t) specifies, by implication, a representative average ideal state $\overline S_p(t)$, from an equivalence class of ideal states $\lbrace\overline S\rbrace$ under ${\cal G}$, for the computer system concerned, over user time-scales Tu, provided that the rate of change of policy dP/dt is much smaller than changes in user behaviour $d\overline M/dt$, i.e. the policy changes on the order of weeks or months rather than hours or days.

Corollary: The ideal state can only be identified on average, since interactions with unpredictable user activity are constantly causing fluctuations $\delta S(t)$ in the state of the system. These fluctuations also occur at a rate $d\overline M/dt \gg dP/dt$.

The existence of an ideal state has already been used in designing the author's site Configuration Engine (cfengine)[6], but it has not previously been explained at length. The proof of this theorem is straightforward, from the definitions. Every computer system has a finite set of resources and configuration objects which is completely prescribed by a total configuration Cp. Each resource object may be in a state described by a finite length bit string, describing a distinct configuration $s_i \in C_r$. There is therefore a mapping from the configuration Cp to the actual state

$\displaystyle C_p \rightarrow \overline S_p + \delta S.$     (6)

This mapping is one to many, since $\delta S$ is a stochastic variable. The averaging operation eliminates the non-uniqueness by extinguishing $\delta S$, provided the averaging process is defined, i.e., $d\delta S/dt \gg dP/dt$. Thus any complete set of average measurements contributes to the average state in a well-defined manner.

The meaning of `sufficiently complete system policy' is now clear. The covering of the policy domain must be as large as the domain of state one wishes to cover, since it follows from the above definitions that the association of policies to states is now one to one, after one factors out the equivalences ${\cal G}$. The uniqueness is secured by making the configuration instruction itself a part of the state. Without this, there would still be ambiguity, since there is no guarantee that a measurement $\overline M(C_r,C_u)$ is a unique function of its arguments. This completes the proof.

This sufficiency referred to above has the corollary that an incomplete system policy P1 cannot determine a unique state for the whole system, only a part of it. An incomplete policy divides the system into two or more parts, since the total policy is still in one to one correspondence with the states.

$\displaystyle P_1+P_2+\ldots \rightarrow S_{p1} + S_{p2}+ \ldots$     (7)

By the virtue of the fundamental theorem, we have the important conclusion that the necessary and sufficient condition for implementation a policy P(t) (i.e. the ability to map it onto a system configuration over a period of time) is that the total average state $\overline S(t) = \overline S_p(t)$.

Let us take a moment to understand the structure and meaning of average ideal state. It is tempting to think of the system as being in an ideal state at some time t0 and then deviating from it at later times. The precise state of the system at some reference time might seem to characterize an ideal to our subjective judgement, but the ideal state of configuration must change with time, since the computer system is, by nature, influenced by users whose activities are not completely secured by a policy. To freeze one's view of the ideal in time, is to place unreasonable restrictions on the use of the system (we shall see this later in examples connected to the use of fixed disk quotas). A specification of resource and user boundary conditions is not the same as a specification of the ideal dynamical behaviour of the system, if users are allowed to act on the resources.

Given that the policy and user configurations are stable over the prescribed time-scales, one may take the average value (or distribution of values) for each metric which characterizes the response of system over shorter time-scales $\overline M(C_r,C_u)$ as being representative of the state at time t. This summarizes the effect of feedback of users on resources, or the statistical interaction between the users and the system. Since we have prescribed every bit string affecting the dynamics of the system at the outset as policy, and we have measured the average result of those prescriptions at time t, we have a complete description of the system in terms of an implementable policy.

$\displaystyle \overline S_p = P \oplus \overline M(P).$     (8)

Not surprisingly, this expression is directly analogous to linear response theory in the physics of time-varying systems. The policy plays the role of a constraint of the motion, while the statistical metric $\overline M$, has the role of the integrated response of the evolved state at time t. The `equations of motion' which lead to the evolution of a system also have an analogue here: they are the operations carried out by the system software on the resources.


 
Figure: The existence of an ideal state. This picture shows the mapping of equivalent configurations of users Cu plus hardware Cr to a factored set $C/{\cal G}$ which can be interpreted as the set of implementable policies P. A unique configuration results in a measurable effect on the system $\overline M$, i.e. the feedback resulting from the policy P. The combination of the configuration with its average effect on the system defines a unique state $\overline S$.
\begin{figure}
\psfig{file=maps.eps,width=11cm}
\end{figure}

The permutation or invariance group ${\cal G}$ is of no concern to this paper except as a matter of principle for the most pedantic. It is a heuristic representation of all of the involved details which are irrelevant to a theoretical formulation, but which occur in practice2. Nevertheless, it has a theoretical implication: the ideal state $\overline S$ has a number of equivalent representations, e.g. those formed by permuting or swapping configuration details whose ordering or equivalence is unimportant. This multiplicity could be a benefit or a hazard to the task of implementation of policy. This remains to be determined.

Having established the existence of an ideal state, the second basic assumption is:

Basic assumption 2   The long term aim of system administration is to optimize the policy P(t) for maximum productivity, insofar as this is allowed by local constraints. The short term aim is to keep the system as close to the resulting ideal state $\overline S_p(t)$ as possible, i.e. to minimize fluctuations $\delta S = \overline S - S$.

One expects the average state $\overline S_p$ to exhibit persistent behaviour however, i.e. be invariant for periods approaching the duration of the system policy. Note that, what we are essentially doing, by making the assertion of an ideal state, is to separate slowly varying changes from quickly varying changes.

\begin{displaymath}S(t) = \overline S_p(t) + \delta S(t).
\end{displaymath} (9)

Errors and misconfigurations (fluctuations $\delta S$) can accumulate over short periods of time, shorter than the time scale over which the average or ideal state changes. In terms of the relative rates of change:

\begin{displaymath}\max\left\vert\frac{1}{\delta S}\frac{d }{dt} \delta S\right\...
...
\frac{1}{\overline S_p}\frac{d}{dt} \overline S_p\right\vert.
\end{displaymath} (10)

The business of system administration is therefore a problem in regulation, or in minimizing the effect of $\delta S$.

We arrive at the following: a theory of system administration would attempt to answer the questions:

Is there an optimal strategy for keeping the system as close as possible to its ideal state, and maximize its productivity?
To answer these questions, we must understand more deeply the meaning of the abstract formulation above. To begin, we backtrack and re-examine the underpinning concepts.


next up previous
Next: A generic computer system Up: On the theory of Previous: On scales
Mark Burgess
2000-03-24