The system building tools described in this section can be charitably called "fluid": there are many areas for improvement, upgrading, and even just bug fixing. In addition, the current set of tools is very much tied to Unix, and more specifically, to the X window system.
Within these limitations, the ModUtils package does provide a fairly comprehensive and useful suite of tools for building, debugging, and running multi-module, multi-processor, distributed robotic systems.
The repository manager program, or repoman, is simply a Python interpreter (http://www.python.org) which has been extended to read configuration files and create configuration sources. By itself, the executable looks like a normal Python interpreter, and you can run it and do whatever you normally do with Python. We chose Python because of its ease of use and fundamental object-oriented structure. Unlike languages such as Perl, a Python program is fairly readable to anyone with any familiarity with structured, procedural, object-oriented languages. Most ModUtils users will never have to look at a Python program, but will use the Python module RepoMan.py, which contains the class definitions and code to run the repository manager.
So, the repository manager is the central "repository" of configuration information. It reads in a configuration file (which may read in other configuration files) containing named structures. For every module in the system there is a corresponding named structure. For example, if you have a module named "Foo", there will be a struct Foo { ... } in the configuration file which lists the parameters to pass on to Foo. When Foo runs, it is given the configuration source specification which connects it to the repository manager, and it will have the contents of the Foo structure delivered to it to parameterize its operation.
The repository manager is not just a passive collection of configuration information. It acts as a central "blackboard" for the system. Modules can set information in their structures which other modules, or the repository manager itself, can monitor to affect system operation. Thus the repository becomes a fairly architecture-independent back channel for communicating information in a reliable, albeit high latency, manner.
More importantly, the repository manager actually starts and stops all of the modules in the system. Thus you do not have to worry about the intricacies of how to specify a configuration source that talks with the repository manager; the repository manager starts your modules with the appropriately parameterized configuration source specification string. To make this possible, there is a section of the repository configuration file (in a structure called Modules) which contains module "meta-information," i.e., not configuration parameters that go to the modules themselves, but rather configuration parameters that specify where and how to run the modules.
You can create subclasses of RepoMan that start up, shut down, and monitor the system in arbitrary ways. For example, the Collect.py script contains a subclass of RepoMan, CollectManager, which sets up unique, date and time based directory names for collection at run time and consolidates the data into those directories at the finish of the run. Subclasses of RepoMan can be set up to monitor the operation of your system through changes in the database, and not just to monitor, but to change modes, starting, stopping, and reparameterizing modules as necessary.
There is an immense amount of detail hidden in the Python implementations of these scripts, but in this documentation I will mainly try to hit just the points typically needed for building systems. To actually write arbitrary system scripts will take further study of the existing scripting code and experimentation.

As was said earlier, most of the time you will not be writing any Python code at all, just using the existing RepoMan.py and CollectManager.py Python modules. In fact, to hide some of the invocation complications, we provide two shell scripts for easing their use: RepoMan and Collect. These shell scripts simply take the list of modules to run as their arguments, and start the appropriate repository management code.
First, it is a good idea to have an NFS (network file system) partition that contains your code and is mounted on all of the relevant systems. Then you only have to compile on one machine in order for the code to be rebuilt for all of the machines that share that partition. Of course, this only works if the machines run compatible operating systems and hardware platforms, but if that is the case, then providing this ability will vastly reduce the headaches of distributed development. The repository manager system absolutely does not require this; it is just a good idea.
In addition, we find it useful to have a widely accessible data partition, available on all machines as /home/data. This is where data gets consolidated to. We also advise having a locally mounted partition /data on each machine for local caching of data. You can virtually eliminate the need for NFS data partitions by putting all of your data logging modules on the machine which hosts /home/data, but we find it useful to have the ability to easily log text data files that your programs may generate in a central place, or to locally cache high-bandwidth data such as video streams for later consolidation instead of trying to ship them over the network in real time. Again, these rules are not written in stone; they are just rules of thumb.
First, the configuration file must contain a structure called Modules. For example, for a system that has two possible modules, Foo and Bar, the shape of the Modules structure would be

struct Modules {
    struct Foo { ... }
    struct Bar { ... }
}
Second, there must be a structure called SharedMemory. This is usually empty:

struct SharedMemory {
}
Finally, for every module in the Modules section there must be a corresponding parameter structure, i.e., for the above example you may have

struct Foo {
    int my_parameter = 1;
}
struct Bar {
    int my_parameter = 5;
}
Now, you could create one big file for every project, but we have found it more useful to have a set of files that can easily be replaced via clever usage of the environment variables that affect where the system looks for files. To help with this, here is a "typical" base for a repository configuration file, repoman.cfg:

%include machines.cfg
%include modules.cfg
%include memory.cfg
%include module_params.cfg
%include module_param_changes.cfg

This file starts with machines.cfg, where you will assign variables to host names for later use in your system. For example, you may do

string sick_host = BUMPER;
string left_sick_host = $sick_host;
string right_sick_host = $sick_host;
and then use $left_sick_host and $right_sick_host to refer to where the Sick laser scanner modules should run.
Then we move on to modules.cfg, which contains your struct Modules declaration, which uses the host names defined in machines.cfg rather than any specific host names. We continue with the memory.cfg file, which contains the empty SharedMemory structure. Then, the meat of your system parameters is in module_params.cfg. Finally, we include the module_param_changes.cfg file, which by default is empty.
The idea is that you can create "nominal" system configuration files and put them in your $CONFIG_DIR, but by setting $LOCAL_CONFIG_DIR or by changing directories you can radically change the system behavior while still inheriting much of it. For example, if you want to run a "replay" version of your system which uses canned data for the sensors, then you change directories to someplace that has the right machines.cfg for running the modules off of your vehicle and which has module_param_changes.cfg set up to modify the appropriate module parameters to run from canned data instead of from live sensors.
One note: When you start being "clever" about configuration files like this, it can be hard to keep track of what is being read from where. This is where the CONFIG_VERBOSE environment variable can come in handy, telling you which configuration files you are reading (at level 1) or even what is being read from them (at level 2).
Within the Modules meta-information section there are some attributes that are basic and important no matter how the repository manager is being used. These include the machine on which the module needs to run, what other modules are required to support the module, and also what other modules this module can "override," i.e., what other required modules this module can substitute for.
hostname: This string attribute specifies the machine the module will run on. If this attribute is left unset, then it defaults to the same machine the repository manager is running on. Leaving the attribute blank can be useful for system overview graphical displays, which it is useful to run on the same machine as the repository manager, whatever that machine may be.
required_modules: This string attribute specifies the modules that are required to run to support this module. In essence, the repository manager will "run" all of the required modules specified by this attribute. This can be extremely useful in reducing the number of modules you have to list on the command line, but one word of warning: it is very difficult to predetermine the order in which the modules will start, so it is important that your modules properly handle starting in "incorrect" orders.
override_module: This is the most subtle of the basic attributes. It allows a module to be a substitute for another module's prerequisite module. Say module Foo is listed in the required_modules of module Bar. If we start Bar, then Foo will start. If module Bletch has the attribute string override_module = Foo; and you run RepoMan Bletch Bar, then the Bletch module will fulfill Bar's requirement for Foo, since Bletch "overrides" Foo. This is extremely useful for "substituting" one module for another: For example, say a perception module normally requires the state estimator module. We can have another module which is the simulated state estimator module that overrides the state estimator module. By placing the simulated state estimator module on the RepoMan command line, it will substitute seamlessly for the real state estimator module. Although it is not common, you can put multiple modules in the override_module attribute. For example, if your system normally has separate state estimation and vehicle controller modules and you have a simulator module that provides both functions, then the simulator module will have an override_module attribute that lists both the state estimation and the vehicle controller modules.

For example, examine the following contrived repository configuration file section:
struct Modules {
    struct Foo {
        string hostname = BUMPER.AHS.RI.CMU.EDU;
    }
    struct Bar {
        string hostname = ANTENNA.AHS.RI.CMU.EDU;
        string required_modules = Foo;
    }
    struct Bletch {
        string override_module = Foo;
    }
}
In this example we see that if you type RepoMan Foo, then Foo starts on the machine BUMPER.AHS.RI.CMU.EDU. If you type RepoMan Bar, the repository manager also starts the required module Foo, which will cause Bar to start on ANTENNA.AHS.RI.CMU.EDU. If you type RepoMan Bletch Bar, it starts Bar and Bletch (which will "override" Foo); Bletch will run on the same machine as the repository manager.

Consider RoadFollower
, a module that detects roads, and RoadClient, a module that uses those roads. Their configuration might look something like

struct Modules {
    struct RoadFollower {
        string host = $road_follower_host;
    }
    struct RoadClient {
        string host = $road_client_host;
        string required_modules = RoadFollower;
    }
}
struct RoadFollower {
    spec road_dest_spec = shmem;
}
struct RoadClient {
    spec road_source_spec {shmem:
        string machine = $road_follower_host;
    }
}
Here RoadFollower is run on the machine specified by the configuration variable $road_follower_host, and the road source used by RoadClient uses that same variable to specify where to look for the road shared memory. To move RoadFollower to another machine, we simply change the variable $road_follower_host. But all is not well, even in this example: Say you want to "override" RoadFollower with another module running on a different machine. Then you either have to manually change the RoadClient road source specification, or make a copy of the RoadClient module which has the "alternate" road follower source machine and use that new, modified road client instead of the original RoadClient module.

Things get much worse when the data flow is going from the client to the server. For example, take a vehicle controller which uses shared memory to take commands from a single "driver" client:
struct Modules {
    struct VehController {
        string host = $controller_host;
    }
    struct VehDriver1 {
        string host = $veh_driver1_host;
        string required_modules = VehController;
    }
    struct VehDriver2 {
        string host = $veh_driver2_host;
        string required_modules = VehController;
    }
}
struct VehController {
    spec command_source_spec {shmem:
        string machine = $veh_driver1_host;
    }
}
struct VehDriver1 {
    spec command_dest_spec = shmem;
}
struct VehDriver2 {
    spec command_dest_spec = shmem;
}
Here commands flow from VehDriver1 on $veh_driver1_host to the VehController on $controller_host. Note that there is no easy way to switch to having the vehicle controlled by VehDriver2 on a different machine, short of manually changing the VehController command_source_spec.machine attribute every time, or having a separate variation on VehController for every possible "driver" in the system. Either choice is a system maintenance nightmare.

The solution is for the repository manager to track, via each module's hostname
attribute, which named memory region is running on which host. As each module is set up for running, it is checked for the attribute owned_memory. For each of the memory region names in this list, a new structure with that memory region name is added to the SharedMemory structure (which, if you remember, is that mysterious empty structure you have to include in the repository configuration file). We then add a special instance of shared memory with the repository tag. When a module creates a shared memory region with a repository tag, it causes a query to the repository manager for the sub-structure of SharedMemory with the appropriate name. If the client gets a valid answer, i.e., the owning module, then it queries the repository for the sub-structure of Modules with the appropriate name of the owner, and then the client goes on and creates the appropriate managed shared memory region using the correct machine name and memory region name.

So, the road example configuration file from above becomes,
struct Modules {
    struct RoadFollower {
        string host = $road_follower_host;
        string owned_memory = RoadMem;
    }
    struct RoadClient {
        string host = $road_client_host;
        string required_modules = RoadFollower;
    }
}
struct RoadFollower {
    spec road_dest_spec {shmem:
        spec mem {repository:
            name = RoadMem;
        }
    }
}
struct RoadClient {
    spec road_source_spec {shmem:
        spec mem {repository:
            name = RoadMem;
        }
    }
}
Note the repository specification in the mem attribute of the shmem instance. This tells the shared memory interface instance to use the specific specification string repository: name = RoadMem; to create the shared memory region. The result will be a memory region originating on $road_follower_host named RoadMem that the RoadClient is attached to. If you base your code on the standard recipe for shared memory interfaces, then this will be the standard idiom for specifying shared memory in a repository configuration file. With this approach we can override RoadFollower with another module, which also must declare itself as the owner of RoadMem, and the system will automatically be configured to look to that module's machine for the RoadMem shared memory region.

Similarly, for the vehicle command example, the repository configuration file will look like,
struct Modules {
    struct VehController {
        string host = $controller_host;
    }
    struct VehDriver1 {
        string host = $veh_driver1_host;
        string required_modules = VehController;
        string owned_memory = VehCommandMem;
    }
    struct VehDriver2 {
        string host = $veh_driver2_host;
        string required_modules = VehController;
        string owned_memory = VehCommandMem;
    }
}
struct VehController {
    spec command_source_spec {shmem:
        spec mem {repository:
            name = VehCommandMem;
        }
    }
}
struct VehDriver1 {
    spec command_dest_spec {shmem:
        spec mem {repository:
            name = VehCommandMem;
        }
    }
}
struct VehDriver2 {
    spec command_dest_spec {shmem:
        spec mem {repository:
            name = VehCommandMem;
        }
    }
}
Now we can switch between VehDriver1 and VehDriver2 as the commander of vehicle motion without having to edit the configuration file.

One caveat: you must manually keep the memory region name in the owned_memory attribute of the Modules section consistent with the memory region name in the repository specification in the parameters section. It might be a good idea to use configuration variables to make sure these stay consistent, so that typos can be caught as syntax errors rather than as confusing run-time errors about non-existent memory regions. So far we have not found this to be sufficiently useful to actually do, but it might be a good idea.
In addition, you need to watch out when you have potentially variably sized memory regions, such as with the road example. At this point, the attribute that specifies the size of the memory region must match exactly in the source and the destination specifications. This is because it is unknown which will start first, the source or the destination, and whichever starts first defines the maximum size of the memory region. If you specify max_points = 10; on one and max_points = 20; on the other and get unlucky as to which starts first, you may (50% of the time) get a memory size error in one of the modules. This is an obvious place to make the SharedMemory structure actually specify something a priori rather than simply be used to store dynamic information. In the future, we may want to store attributes such as max_points in a SharedMemory memory region structure so that, instead of reading this from the locally defined memory specification, both sides would read it from this central, known location. This is not implemented yet, and some details remain to be worked out before it can be.
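The name resolution described above can be sketched in plain Python. This is an illustrative model, not the actual RepoMan.py code; the dictionary layout and the function name are invented for the example, and the module names come from the road example:

```python
# Illustrative sketch of how a "repository:"-tagged memory region name is
# resolved to a host machine. The dictionaries stand in for the repository
# manager's parsed configuration structures.

def resolve_memory_host(region_name, modules):
    """Map a shared memory region name to the host of its owning module.

    `modules` mirrors the Modules meta-information section: each entry may
    declare `owned_memory` (regions it owns) and `host` (where it runs).
    """
    # Step 1: build the SharedMemory structure the repository manager
    # maintains as modules are set up: region name -> owning module name.
    shared_memory = {}
    for mod_name, meta in modules.items():
        for region in meta.get('owned_memory', '').split():
            shared_memory[region] = mod_name

    # Step 2: the client queries SharedMemory for the owning module...
    owner = shared_memory.get(region_name)
    if owner is None:
        raise KeyError('no module owns memory region %r' % region_name)

    # Step 3: ...and then queries Modules for the owner's host.
    return modules[owner].get('host', 'localhost')

modules = {
    'RoadFollower': {'host': 'bumper', 'owned_memory': 'RoadMem'},
    'RoadClient':   {'host': 'antenna', 'required_modules': 'RoadFollower'},
}
print(resolve_memory_host('RoadMem', modules))   # -> bumper
```

Note how an overriding module that declares the same owned_memory would simply replace the owner entry, which is what lets clients find the region on the substitute module's machine.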
As an aside, we recommend using a simple window manager such as fvwm2 rather than a modern window manager such as Gnome or KDE, as fvwm2 interacts with windows in a very predictable way. In addition, you must prevent your shell from taking over the title bar. The repository manager associates windows with modules by looking through all the windows for the single window with the same name as the module. By default, shells such as bash and tcsh come with setups that helpfully change your window titles to the machine and/or current working directory. This will totally cripple the windowed repository manager system.
In addition, the repository manager will use SSH to run programs on other machines. For it to succeed, you must set up your SSH keys so that you can log into other machines without having to type a password.
For development, you typically run the repository manager with the RepoMan script. This script takes the list of modules to start, i.e.,

RepoMan Foo Bar Bletch

starts Foo, Bar, and Bletch and all required supporting modules.

Once all of the client xterms are started, the windowed repository manager starts a very simple GUI. This GUI consists entirely of a large button panel (implemented in Tcl/Tk through PyInter, for those who care) with a series of buttons on it:
Setup: Execute any setup commands that are necessary for the various modules, such as changing to the correct directory.
Run: Run the modules.
Finish: Gracefully request that the modules shut down.
Stop: Send a Ctrl-C to all windows to rudely order modules to shut down.
Destroy: Destroy all windows (note: this might leave the modules still running due to the vagaries of Unix and X windows).
Quit: Quit the repository manager.
Create: Try to recreate the module windows if they are destroyed. This may not be all that safe to use, and it is usually a better idea to quit and restart the repository manager to get the same effect.
A typical sequence is Setup, Run, Finish, Stop, Destroy, Quit. A debugging note: if you hit Setup or Run and nothing happens, you should check the window where you ran the repository manager for reported configuration file syntax errors. Also, if you simply change module parameters, you do not have to quit the repository manager and restart; you can simply alternate between Run and Finish, as the configuration file is re-read before every Run (and Setup). If you change the machines that modules are running on, you should Stop, Destroy, Quit, and restart in order to ensure that modules will be running in the expected places.
A critical attribute for a module's meta-information structure is class. The class attribute specifies what kind of thing to run. Its value is the name of the Python class that is in charge of running the module code, which may be user defined. Some standard examples are:
Module: This runs an official module in response to a Run. If you omit the class attribute, Module is assumed.
Basic: This allows you to run an arbitrary program in response to a Run.
InitModule: This runs an official module in response to a Setup. Note that InitModules ignore the Finish directive and only respond to Stop.
InitBasic: This runs an arbitrary program in response to a Setup.
In addition, there are a variety of module meta-information attributes that you can set to affect how the module is run.
dir: The directory to change to when setting up. If omitted, there will be no directory changing at setup time.
preamble: An additional command to execute when setting up, after the change of directory. This requires a forced carriage return at the end to work properly, i.e., string preamble = "stty -F /dev/ttyS0 baudrate 9600\n";
command: How to run the module. If you don't give a command, one is constructed out of the module name prepended with ./, i.e., it assumes the executable has the same name as the module in the current working directory. Unlike the preamble, you do not have to worry about terminating the command with a forced carriage return.
embed_command_in: Sometimes, rather than resetting the command attribute, you want to simply embed the command in another string and execute that. If you set embed_command_in to a string with a %s in it, then the command will get inserted into embed_command_in at the %s, and that, plus a trailing carriage return, will be executed instead of the command. A common embed_command_in runs the command in a while loop, so that if the module crashes it gets restarted after a short wait, i.e., string embed_command_in = 'while 1\n%s\nsleep 5\nend\n';
do_block_test: This boolean attribute affects whether or not the target module is configured to run a "block test." A block test is an extra thread associated with a module that wakes up every two seconds and checks to see if the module has successfully completed a run cycle. If not, it marks the module as blocked for the run-time status reporting mechanism. Some modules have a problem with extra threads, so you can set do_block_test to false to avoid running this purely informational thread.
user: This optional string attribute allows you to SSH to the module's host as a different user than the default. The SSH keys must still be set up so that you can log into the host as this different user with no password.

Then, there are a variety of attributes that affect the size, nature, and placement of the xterms that will run the modules.
geometry: The X geometry of the created window. For example, string geometry = "80x10+0+320";
font: The font of the created X window. For example, string font = "6x10";
xterm_options: This lets you pass arbitrary options to the xterm. The main use for this, at least when you are using the fvwm2 window manager, is to choose which panel to put the created window on. For example, string xterm_options = "-xrm '*Page:2 2'";
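The way these meta-attributes combine into the text typed into a module's xterm can be sketched as follows. This is an illustrative model, not the actual RepoMan.py logic; the function name and dictionary layout are invented, but the semantics follow the text: command defaults to the module name prepended with ./, embed_command_in wraps the command at its %s, and a trailing carriage return is appended.

```python
# Illustrative sketch of command construction from module meta-information.

def build_command(module_name, meta):
    # Default command: the executable named after the module, in the
    # current working directory.
    command = meta.get('command', './' + module_name)
    # Optionally embed the command in a wrapper string at its %s.
    wrapper = meta.get('embed_command_in')
    if wrapper is not None:
        command = wrapper % command
    # A forced carriage return is appended so the xterm executes it.
    return command + '\n'

meta = {'embed_command_in': 'while 1\n%s\nsleep 5\nend\n'}
print(repr(build_command('Foo', meta)))
# -> 'while 1\n./Foo\nsleep 5\nend\n\n'
```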
To attach GDB to a running module, do ps -auxwww | grep <program name> and find the process ID of your program, and then from the appropriate directory do gdb <program name> <process id>.
Sometimes you will want to GDB your program from the very beginning. The typical way to do this is to run the system, Ctrl-C your program, and in the same window (or another window on the appropriate machine in the appropriate directory, perhaps within my favorite IDE, emacs) GDB your program. Then, after you set up any break points, use the X selection to copy the appropriate arguments (the ones that connect to the central database, from the aborted run in the original X window), and run the GDB'd program with these arguments, i.e., type r <pasted arguments>. Note: You can keep running and killing the GDB'd program with the same arguments as long as the Run button on the main GUI has been hit and not re-hit. If you hit the Finish button, then the program will not be runnable (i.e., it will check with the repository, see that its running parameter has been set to false, and exit immediately and silently). As long as the repository manager runs on the same machine, you shouldn't have to re-copy and paste the arguments, but you will have to remember to kill the running un-GDB'd program every time you hit the Run button.
If you desire, you can also change the command attribute of the module's meta-information to something like gdb <program>\nbreak main\nrun. Note that this will cause the program to run with an initial breakpoint in the main routine. Despite the "hackiness" of the cut-and-paste method, I have found it to be more flexible and less unwieldy than the "official" method, but tastes may differ.
Each module maintains run-time status information, and ModUtils provides the program repomon to report this, and other information, in a graphical form which provides a central "debugging" report of the health of the system.

repomon is installed in $UTILS_DIR/bin, takes no arguments, and must be run on the same machine as the repository manager. It contacts the repository manager and gets a list of all of the running modules and where they are running. It then attaches to the shared memory region maintained by each module, which reports both the status that the user gives as well as internally maintained information such as cycle time, last run time, last update date, and module status.
The modules are grouped by host (identified by numeric IP address, not name). Each module is identified by name. To the left of the name is the "status square," whose color indicates the module's state. When the system is running successfully, you want to see all green boxes. Red boxes mean that a program has crashed, whereas black boxes mean that a shared memory manager has crashed.
Part of the reporting for each module is the time of the last successful run and the last "update." An update includes the "behind the scenes" checking for blocking that most modules will do, so if the module has not crashed, this should be getting updated even when the module's run method is blocked. In addition, repomon reports two numbers for each of these times: first, the delta time since the run or update has changed, and second, in parentheses, the difference between the run or update and the local clock. The differencing with the local clock can show drifts of time between machines, whereas the delta time since the last change shows the "real" passage of time since the last run or update. Then comes the cycle time, which is the average number of seconds per cycle. Then come the "status" number reported by a module using ConfigSource::setStatus and the "confidence" number reported by a module using ConfigSource::setConfidence. Finally comes the string message which modules set by using ConfigSource::setStatusMessage. If the module never sets this, then it is set to "Module Running" after a successful run method, and set to "Module Blocked" when the module is blocked.
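The two numbers reported for each timestamp can be modeled with a little arithmetic. This is a sketch to clarify the distinction; the function and variable names are invented, and the sign conventions repomon actually uses may differ:

```python
# Illustrative computation of repomon's two per-timestamp numbers: the
# elapsed time since the reported value last changed, and (shown in
# parentheses) the difference between the reported value and the local
# clock. The first measures real staleness; the second exposes clock
# drift between machines.

def report_times(reported_time, last_change_time, local_clock):
    delta_since_change = local_clock - last_change_time
    clock_difference = local_clock - reported_time
    return delta_since_change, clock_difference

# A module last ran at t=100 by its own clock; repomon saw the value
# change at t=101 by the local clock, which now reads 103.
print(report_times(100.0, 101.0, 103.0))  # -> (2.0, 3.0)
```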
One final feature of RepoMan.py is "ubiquitous modules." A ubiquitous module is one which runs some version of itself on every machine. They might be a good idea for something like vehicle state propagation daemons in a system where vehicle state is used "ubiquitously" by almost every module in the system. The idea is that there is a "server" version of the module which runs on a particular machine (for the vehicle state example, this is the module which is connected to the actual sensors), and on every other machine in the system we automatically run a "client" version of the module, in this case one which reads the vehicle state from the server module and keeps a history of the vehicle state for propagation to all modules running on that client machine.

The key attribute for a ubiquitous module is

bool ubiquitous = true;
The host attribute for the ubiquitous module selects the "server" machine, and the parameter specification for the server module should be in the parameter section, as with a normal module. The client specification is given by the attribute client_spec, which has a %s (as in the C printf string directive) embedded in it. This %s will be replaced by the server module's host name. For example, the client spec

struct client_spec {
    spec state_sense_spec {remote:
        string host = "%s";
    }
}

will be instantiated for each client machine in the system, with the host variable set to the server machine's name.
Normally, we expect the same command to be used for the client as for the server (from the command attribute), but you can set the string attribute client_command to set the client command to something else.
Finally, there is the matter of sizing and positioning the client windows. These client windows will use the same font as the server window (from the font attribute), but the geometry of the clients is specified by the client_geom attribute, which must have a %d embedded in it (for example, string client_geom = "40x5-300+%d";). In addition, you should provide a client_inc integer attribute which specifies the vertical "increment" for the client window. For each client window, the %d is replaced by a multiple of client_inc. The idea is that the server window is at 0, and the client windows are displaced, usually vertically, by client_inc pixels.
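Putting the two substitutions together, the per-client windows can be generated as sketched below. This is an invented illustration, not RepoMan.py code; the %s in the client spec becomes the server host, and the %d in client_geom becomes a multiple of client_inc:

```python
# Illustrative generation of (host, spec, geometry) for each client of a
# ubiquitous module.

def client_windows(server_host, client_hosts, client_spec, client_geom,
                   client_inc):
    windows = []
    for i, host in enumerate(client_hosts, start=1):
        spec = client_spec % server_host           # %s -> server host name
        geometry = client_geom % (i * client_inc)  # %d -> i * client_inc
        windows.append((host, spec, geometry))
    return windows

print(client_windows('bumper', ['antenna', 'fender'],
                     'remote: string host="%s";', '40x5-300+%d', 80))
# -> [('antenna', 'remote: string host="bumper";', '40x5-300+80'),
#     ('fender', 'remote: string host="bumper";', '40x5-300+160')]
```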
The collection system is implemented in CollectManager.py and is usually invoked by the script Collect, which is given a list of modules to run just as the RepoMan script was.
The process of collection is simple: When you run, if there is any collection to be done, a directory is created on the disk available via NFS to all the relevant machines. This directory is based on the global parameter data_directory, which has the unfortunate default of ".". The actual directory created for a run is the data directory with a month-day-year.hour-minute.second suffix. Thus for any run you will have a unique, date and time derived directory where the data will end up. Modules run and put data in a directory on their local machines. This directory is specified by the global parameter incoming_directory, and defaults to data. When you hit Finish, the modules cleanly exit and data is consolidated from the incoming directories on each machine to the data directory.
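The construction of the per-run directory name can be sketched in a few lines. The month-day-year.hour-minute.second layout follows the text, but the exact separators CollectManager.py uses are an assumption here:

```python
# Illustrative construction of the unique per-run collection directory.

import os
import time

def run_directory(data_directory, when=None):
    # month-day-year.hour-minute.second, as described in the text.
    stamp = time.strftime('%m-%d-%Y.%H-%M.%S', time.localtime(when))
    return os.path.join(data_directory, stamp)

# For example, a run at 2:05:09 PM on March 7, 2004:
t = time.mktime((2004, 3, 7, 14, 5, 9, 0, 0, -1))
print(run_directory('/home/data', t))  # -> /home/data/03-07-2004.14-05.09
```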
Any module that collects data must be a CollectModule, i.e., in its module meta-information it must have the class attribute set to CollectModule. In addition, collection modules must list the files that they are collecting in the module meta-information (this must match any logging specifications in the module's parameter structure; the match is not done automatically). There are two separate types of logging for a collection module: crunched files and copy files.
Remember that canned data files are created in two parts, index and data. You set a collection module's string crunch_files attribute to the list of canned data files that the collection module will produce (remember, each file name will have the incoming_directory global parameter prepended to it). On finishing, the collection script will send commands to each collection module's window which will crunch the appropriate index and data files from the incoming directory together into a single file, and then move the resulting file to the data directory.
In addition, all files listed in a collection module's copy_files attribute (prepended with the global parameter incoming_directory) will simply be copied from the incoming directory to the data directory. This takes care of consolidating simple text log files or other non-canned data files.
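The consolidation step can be sketched as follows. This is an illustration only: the real work is done by commands sent to each collection module's window, the <name>.index / <name>.data file naming is an assumption, and "crunching" is modeled here as simple concatenation:

```python
# Illustrative sketch of the consolidation step: "crunch" index/data file
# pairs into single files in the data directory, and copy plain files
# across unchanged.

import os
import shutil

def consolidate(incoming_dir, data_dir, crunch_files, copy_files):
    for name in crunch_files:
        # Crunch: combine the index part and the data part into one file
        # in the data directory (modeled here as concatenation).
        with open(os.path.join(data_dir, name), 'wb') as out:
            for part in ('.index', '.data'):
                with open(os.path.join(incoming_dir, name + part), 'rb') as f:
                    shutil.copyfileobj(f, out)
    for name in copy_files:
        # Copy files are moved across verbatim.
        shutil.copy(os.path.join(incoming_dir, name), data_dir)
```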
One note: You should be careful in using the collection script, as there is currently no feedback about the commands being sent to a module window being executed. For a long collection, it may take a significant amount of time to consolidate the data, and you must be careful not to start another collection before the consolidation is done. You may want to peruse the collection module windows as part of the collection process to ensure that the consolidation actually worked and terminated correctly.
Production systems are not run with the plain RepoMan or CollectManager classes; their use requires the building of a custom Python module to run the production system.
The overall structure is that we have created Launcher and Unlauncher scripts. Instead of creating X windows and sending commands, when we start up a production system we ssh an invocation of the Launcher script to each machine. The Launcher script contacts the running repository manager, finds out what modules need to be run on that machine, and then starts them. The Unlauncher script is how you cleanly shut down the system.
To build a production system you need to subclass from RepoMan and initialize the superclass with no_display set to 1. Then your run method should probably first send Unlauncher commands to all the machines to clean up any previous invocations, like this:
for h in self.hosts.keys():
    cmd = 'ssh %s Unlauncher %s >& /dev/null &' % (h, self.host)
    os.system(cmd)
time.sleep(1)
while self.server.processEvents(1.0):
    pass
Then the run method launches your modules with something like:

for h in self.hosts.keys():
    os.system('ssh %s Launcher %s >& /dev/null &' % (h, self.host))
Part of the philosophy of the production system is that modules are run via simple scripts that restart them if they crash. In addition, modules can also be marked as "vital," i.e., if they crash then that indicates a basic breach in the integrity of the run and the whole production is brought to a halt. An example is a production system whose primary purpose is data collection: just restarting a crashed data collection module would produce inconsistent data, abrogating the basic reason for the production system.
One caveat: The production system will only work with official modules, not with just any program as with the regular RepoMan or Collect scripts. If you attempt to use a program that is not a module (i.e., its class is not derived from Module), then that program will be skipped.
The Launcher script uses attributes in common with the RepoMan script, such as dir and command for where and how to run the program. Instead of the preamble directive, it looks at the prerun string attribute for what to run before the command. In addition, there is a postrun attribute which indicates what should run after the command.
Each program is run inside of a script to make it "persistent." If the module dies, then the script will wait for a certain number of seconds (defined by the integer attribute interval, which defaults to 5), and then restart the module. If you set the boolean attribute vital to true, then if the module dies, the global parameter string stop_all will be set to 1. This will result in the invocation of the method stop_all, which should bring everything to a halt.
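The persistent/vital behavior can be modeled in a few lines of Python. This is an invented sketch of the control flow, not the actual wrapper script (which is a shell script around the module executable); run_module and the repository dictionary are stand-ins, and max_runs exists only to keep the illustration finite:

```python
# Illustrative model of the "persistent" wrapper: rerun the module
# whenever it dies, waiting `interval` seconds between attempts; if the
# module is vital, a death instead sets the global stop_all flag.

import time

def run_persistent(run_module, repository, interval=5, vital=False,
                   max_runs=None):
    runs = 0
    while not repository.get('stop_all'):
        run_module()          # blocks until the module exits (i.e., dies)
        runs += 1
        if vital:
            # A vital module dying breaches the integrity of the run:
            # signal the whole production system to halt.
            repository['stop_all'] = '1'
            break
        if max_runs is not None and runs >= max_runs:
            break             # escape hatch for this illustration only
        time.sleep(interval)
```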
If you have to kill the remnants of a production system on a machine by hand, you should do something like

killall -9 run_persistent_module run_vital_module run_iptshmgr iptshmgr iserver_set sleep

followed by a killall to take out your modules by program name. Note: you should use the Unix command ps to verify what you have left to kill.