The system building tools described in this section can be charitably called "fluid": there are many areas for improvement, upgrading, and even just bug fixing. In addition, the current set of tools is very much tied to Unix, and more specifically, to the X window system.
Within these limitations, the ModUtils package does provide a fairly comprehensive and useful suite of tools for building, debugging, and running multi-module, multi-processor, distributed robotic systems.
The repository manager program, or repoman, is simply a Python interpreter (http://www.python.org) which has been extended to read configuration files and create configuration sources. By itself, the executable looks like a normal Python interpreter, and you can run it and do whatever you normally do with Python. We chose Python because of its ease of use and fundamental object-oriented structure. Unlike languages such as Perl, a Python program is fairly readable to anyone with any familiarity with structured, procedural, object-oriented languages. Most ModUtils users will never have to look at a Python program, but will use the Python module RepoMan.py, which contains the class definitions and code to run the repository manager.
So, the repository manager is the central "repository" of configuration information. It reads in a configuration file (which may read in other configuration files) containing named structures. For every module in the system there is a corresponding named structure. For example, if you have a module named "Foo", there will be a struct Foo { ... } in the configuration file which lists the parameters to pass on to Foo. When Foo runs, it is given the configuration source specification which connects it to the repository manager, and it will have the contents of the Foo structure delivered to it to parameterize its operation.
The repository manager is not just a passive collection of configuration information. It acts as a central "blackboard" for the system. Modules can set information in their structures which other modules, or the repository manager itself, can monitor to affect system operation. Thus the repository becomes a fairly architecture-independent back channel for communicating information in a reliable, albeit high latency, manner.
More importantly, the repository manager actually starts and stops all of the modules in the system. Thus you do not have to worry about the intricacies of how to specify a configuration source that talks with the repository manager; the repository manager starts your modules with the appropriately parameterized configuration source specification string. To make this possible, there is a section of the repository configuration file (in a structure called Modules) which contains module "meta-information," i.e., not configuration parameters that go to the modules themselves, but rather configuration parameters that specify where and how to run the modules.
You can create subclasses of RepoMan that start up, shut down, and monitor the system in arbitrary ways. For example, the Collect.py script contains a subclass of RepoMan, CollectManager, which sets up unique, date and time based directory names for collection at run time and consolidates the data into those directories at the finish of the run. Subclasses of RepoMan can be set up to monitor the operation of your system through changes in the database, and not just to monitor, but to change modes, starting, stopping, and reparameterizing modules as necessary.
There is an immense amount of detail hidden in the Python implementations of these scripts, but in this documentation I will mainly try to hit just the points typically needed for building systems. To actually write arbitrary system scripts will take further study of the existing scripting code and experimentation.

As was said earlier, most of the time you will not be writing any Python code at all, just using the existing RepoMan.py and CollectManager.py Python modules. In fact, to hide some of the invocation complications, we provide two shell scripts for easing their use: RepoMan and Collect. These shell scripts simply take the list of modules to run as their arguments, and start the appropriate repository management code.
First, it is a good idea to have an NFS (network file system) partition that contains your code and is mounted on all of the relevant systems. Then you only have to compile on one machine in order for the code to be rebuilt for all of the machines that share that partition. Of course, this only works if the machines run compatible operating systems and hardware platforms, but if that is the case, then providing this ability will vastly reduce the headaches of distributed development. The repository manager system absolutely does not require this; it is just a good idea.
In addition, we find it useful to have a widely accessible data partition, available on all machines as /home/data. This is where data gets consolidated to. We also advise having a locally mounted partition /data on each machine for local caching of data. You can virtually eliminate the need for NFS data partitions by putting all of your data logging modules on the machine which hosts /home/data, but we find it useful to have the ability to easily log text data files that your programs may generate in a central place, or to locally cache high-bandwidth data such as video streams for later consolidation instead of trying to ship them over the network in real time. Again, these rules are not written in stone; they are just rules of thumb.
First, the configuration file must contain a structure called Modules. For example, for a system that has two possible modules, Foo and Bar, the shape of the Modules structure would be

struct Modules {
    struct Foo { ... }
    struct Bar { ... }
}
Second, there must be a structure called SharedMemory. This is usually empty:

struct SharedMemory {
}
Finally, for every module in the Modules section there must be a corresponding parameter structure, i.e., for the above example you may have

struct Foo {
    int my_parameter = 1;
}
struct Bar {
    int my_parameter = 5;
}
Now, you could create one big file for every project, but we have found it more useful to have a set of files that can easily be replaced via clever usage of the environment variables that affect where the system looks for files. To help with this, here is a "typical" base for a repository configuration file, repoman.cfg:

%include machines.cfg
%include modules.cfg
%include memory.cfg
%include module_params.cfg
%include module_param_changes.cfg

This file starts with machines.cfg, where you will assign variables to host names for later use in your system. For example, you may do

string sick_host = BUMPER;
string left_sick_host = $sick_host;
string right_sick_host = $sick_host;
and then use $left_sick_host and $right_sick_host to refer to where the Sick laser scanner modules should run.
Then we move on to modules.cfg, which contains your struct Modules declaration, which uses the host names defined in machines.cfg rather than any specific host names. We continue with the memory.cfg file, which contains the empty SharedMemory structure. Then, the meat of your system parameters is in module_params.cfg. Finally, we include the module_param_changes.cfg file, which by default is empty.
The idea is that you can create "nominal" system configuration files and put them in your $CONFIG_DIR, but by setting $LOCAL_CONFIG_DIR or by changing directories you can radically change the system behavior while still inheriting much of it. For example, if you want to run a "replay" version of your system which uses canned data for the sensors, then you change directories to someplace that has the right machines.cfg for running the modules off of your vehicle and which has module_param_changes.cfg set up to modify the appropriate module parameters to run from canned data instead of from live sensors.
One note: When you start being "clever" about configuration files like this, it can be hard to keep track of what is being read from where. This is where the CONFIG_VERBOSE environment variable can come in handy, telling you which configuration files you are reading (at level 1) or even what is being read from them (at level 2).
Within the Modules meta-information section there are some attributes that are basic and important no matter how the repository manager is being used. These include the machine on which the module needs to run, what other modules are required to support the module, and also what other modules this module can "override," i.e., what other required modules this module can substitute for.
hostname: This string attribute specifies the machine the module will run on. If this attribute is left unset, then it defaults to the same machine the repository manager is running on. Leaving the attribute blank can be useful for system overview graphical displays, which it is useful to run on the same machine as the repository manager, whatever that machine may be.
required_modules: This string attribute specifies the modules that are required to run to support this module. In essence, the repository manager will "run" all of the required modules specified by this attribute. This can be extremely useful in reducing the number of modules you have to list on the command line, but one word of warning: it is very difficult to predetermine the order in which the modules will start, so it is important that your modules properly handle starting in "incorrect" orders.
override_module: This is the most subtle of the basic attributes. It allows a module to be a substitute for another module's prerequisite module. Say module Foo is listed in the required_modules of module Bar. If we start Bar, then Foo will start. If module Bletch has the attribute string override_module = Foo; and you run RepoMan Bletch Bar, then the Bletch module will fulfill Bar's requirement for Foo, since Bletch "overrides" Foo. This is extremely useful for "substituting" one module for another: For example, say a perception module normally requires the state estimator module. We can have another module which is the simulated state estimator module that overrides the state estimator module. By placing the simulated state estimator module on the RepoMan command line, it will substitute seamlessly for the real state estimator module. Although it is not common, you can put multiple modules in the override_module attribute. For example, if your system normally has separate state estimation and vehicle controller modules and you have a simulator module that provides both functions, then the simulator module will have an override_module attribute that lists both the state estimation and the vehicle controller modules.

For example, examine the following contrived repository configuration file section:
struct Modules {
    struct Foo {
        string hostname = BUMPER.AHS.RI.CMU.EDU;
    }
    struct Bar {
        string hostname = ANTENNA.AHS.RI.CMU.EDU;
        string required_modules = Foo;
    }
    struct Bletch {
        string override_module = Foo;
    }
}
In this example we see that if you type RepoMan Foo, then Foo starts on the machine BUMPER.AHS.RI.CMU.EDU. If you type RepoMan Bar, the repository manager also starts the required module Foo, which will cause Bar to start on ANTENNA.AHS.RI.CMU.EDU. If you type RepoMan Bletch Bar, it starts Bar and Bletch (which will "override" Foo); Bletch will run on the same machine as the repository manager.

Consider RoadFollower
, a module that detects roads, and RoadClient, a module that uses those roads. Their configuration might look something like

struct Modules {
    struct RoadFollower {
        string host = $road_follower_host;
    }
    struct RoadClient {
        string host = $road_client_host;
        string required_modules = RoadFollower;
    }
}
struct RoadFollower {
    spec road_dest_spec = shmem;
}
struct RoadClient {
    spec road_source_spec {shmem:
        string machine = $road_follower_host;
    }
}
Here RoadFollower is run on the machine specified by the configuration variable $road_follower_host, and the road source used by RoadClient uses that same variable to specify where to look for the road shared memory. To move RoadFollower to another machine, we simply change the variable $road_follower_host. But all is not well, even in this example: Say you want to "override" RoadFollower with another module running on a different machine. Then you either have to manually change the RoadClient road source specification, or make a copy of the RoadClient module which has the "alternate" road follower source machine and use that new, modified road client instead of the original RoadClient module.

Things get much worse when the data flow is going from the client to the server. For example, take a vehicle controller which uses shared memory to take commands from a single "driver" client:
struct Modules {
    struct VehController {
        string host = $controller_host;
    }
    struct VehDriver1 {
        string host = $veh_driver1_host;
        string required_modules = VehController;
    }
    struct VehDriver2 {
        string host = $veh_driver2_host;
        string required_modules = VehController;
    }
}
struct VehController {
    spec command_source_spec {shmem:
        string machine = $veh_driver1_host;
    }
}
struct VehDriver1 {
    spec command_dest_spec = shmem;
}
struct VehDriver2 {
    spec command_dest_spec = shmem;
}
Here commands flow from VehDriver1 on $veh_driver1_host to the VehController on $controller_host. Note that there is no easy way to switch to having the vehicle controlled by VehDriver2 on a different machine, short of manually changing the VehController command_source_spec.machine attribute every time, or having a separate variation on VehController for every possible "driver" in the system. Either choice is a system maintenance nightmare.

The solution is for the repository manager to track, via each module's hostname
attribute, which named memory region is running on which host. As each module is set up for running, it is checked for the attribute owned_memory. For each of the memory region names in this list, a new structure with that memory region name is added to the SharedMemory structure (which, if you remember, is that mysterious empty structure you have to include in the repository configuration file). We then add a special instance of shared memory with the repository tag. When a module creates a shared memory region with a repository tag, it causes a query to the repository manager for the sub-structure of SharedMemory with the appropriate name. If the client gets a valid answer, i.e., the owning module, then it queries the repository for the sub-structure of Modules with the appropriate name of the owner, and then the client goes on and creates the appropriate managed shared memory region using the correct machine name and memory region name.

So, the road example configuration file from above becomes,
struct Modules {
    struct RoadFollower {
        string host = $road_follower_host;
        string owned_memory = RoadMem;
    }
    struct RoadClient {
        string host = $road_client_host;
        string required_modules = RoadFollower;
    }
}
struct RoadFollower {
    spec road_dest_spec {shmem:
        spec mem {repository:
            name = RoadMem;
        }
    }
}
struct RoadClient {
    spec road_source_spec {shmem:
        spec mem {repository:
            name = RoadMem;
        }
    }
}
Note the repository specification in the mem attribute of the shmem instance. This tells the shared memory interface instance to use the specific specification string repository: name = RoadMem; to create the shared memory region. The result will be a memory region originating on $road_follower_host named RoadMem that the RoadClient is attached to. If you base your code on the standard recipe for shared memory interfaces, then this will be the standard idiom for specifying shared memory in a repository configuration file. With this approach we can override RoadFollower with another module, which also must declare itself as the owner of RoadMem, and the system will automatically be configured to look to that module's machine for the RoadMem shared memory region.

Similarly, for the vehicle command example, the repository configuration file will look like,
struct Modules {
    struct VehController {
        string host = $controller_host;
    }
    struct VehDriver1 {
        string host = $veh_driver1_host;
        string required_modules = VehController;
        string owned_memory = VehCommandMem;
    }
    struct VehDriver2 {
        string host = $veh_driver2_host;
        string required_modules = VehController;
        string owned_memory = VehCommandMem;
    }
}
struct VehController {
    spec command_source_spec {shmem:
        spec mem {repository:
            name = VehCommandMem;
        }
    }
}
struct VehDriver1 {
    spec command_dest_spec {shmem:
        spec mem {repository:
            name = VehCommandMem;
        }
    }
}
struct VehDriver2 {
    spec command_dest_spec {shmem:
        spec mem {repository:
            name = VehCommandMem;
        }
    }
}
Now we can switch between VehDriver1 and VehDriver2 as the commander of vehicle motion without having to edit the configuration file.

One caveat: you must manually keep the memory region name in the owned_memory attribute of the Modules section consistent with the memory region name in the repository specification in the parameters section. It might be a good idea to use configuration variables to make sure these stay consistent, so that typos can be caught as syntax errors rather than as confusing run-time errors about non-existent memory regions. So far we have not found this to be sufficiently useful to actually do, but it might be a good idea.
In addition, you need to watch out when you have potentially variably sized memory regions, such as with the road example. At this point, the attribute that specifies the size of the memory region must match exactly in the source and the destination specifications. This is because it is unknown which will start first, the source or the destination, and whichever starts first defines the maximum size of the memory region. If you specify max_points = 10; on one and max_points = 20; on the other and get unlucky as to which starts first, you may (50% of the time) get a memory size error in one of the modules. This is an obvious place to make the SharedMemory structure actually specify something a priori rather than simply be used to store dynamic information. In the future, we may want to store attributes such as max_points in a SharedMemory memory region structure so that, instead of reading this from the locally defined memory specification, both sides would read it from this central, known location. This is not implemented yet, and some details remain to be worked out before it can be.
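The name resolution described above can be sketched in plain Python. This is an illustrative model, not the actual RepoMan.py code; the dictionary layout and the function name are invented for the example, and the module names come from the road example:

```python
# Illustrative sketch of how a "repository:"-tagged memory region name is
# resolved to a host machine. The dictionaries stand in for the repository
# manager's parsed configuration structures.

def resolve_memory_host(region_name, modules):
    """Map a shared memory region name to the host of its owning module.

    `modules` mirrors the Modules meta-information section: each entry may
    declare `owned_memory` (regions it owns) and `host` (where it runs).
    """
    # Step 1: build the SharedMemory structure the repository manager
    # maintains as modules are set up: region name -> owning module name.
    shared_memory = {}
    for mod_name, meta in modules.items():
        for region in meta.get('owned_memory', '').split():
            shared_memory[region] = mod_name

    # Step 2: the client queries SharedMemory for the owning module...
    owner = shared_memory.get(region_name)
    if owner is None:
        raise KeyError('no module owns memory region %r' % region_name)

    # Step 3: ...and then queries Modules for the owner's host.
    return modules[owner].get('host', 'localhost')

modules = {
    'RoadFollower': {'host': 'bumper', 'owned_memory': 'RoadMem'},
    'RoadClient':   {'host': 'antenna', 'required_modules': 'RoadFollower'},
}
print(resolve_memory_host('RoadMem', modules))   # -> bumper
```

Note how an overriding module that declares the same owned_memory would simply replace the owner entry, which is what lets clients find the region on the substitute module's machine.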
As an aside, we recommend using a simple window manager such as fvwm2 rather than a modern window manager such as Gnome or KDE, as fvwm2 interacts with windows in a very predictable way. In addition, you must prevent your shell from taking over the title bar. The repository manager associates windows with modules by looking through all the windows for the single window with the same name as the module. By default, shells such as bash and tcsh come with setups that helpfully change your window titles to the machine and/or current working directory. This will totally cripple the windowed repository manager system.
In addition, the repository manager will use SSH to run programs on other machines. For it to succeed, you must set up your SSH keys so that you can log into other machines without having to type a password.
For development, you typically run the repository manager with the RepoMan script. This script takes the list of modules to start, i.e.,

RepoMan Foo Bar Bletch

starts Foo, Bar, and Bletch and all required supporting modules.

Once all of the client xterms are started, the windowed repository manager starts a very simple GUI. This GUI consists entirely of a large button panel (implemented in Tcl/Tk through PyInter, for those who care) with a series of buttons on it:
Setup: Execute any setup commands that are necessary for the various modules, such as changing to the correct directory.
Run: Run the modules.
Finish: Gracefully request that the modules shut down.
Stop: Send a Ctrl-C to all windows to rudely order modules to shut down.
Destroy: Destroy all windows (note: this might leave the modules still running due to the vagaries of Unix and X windows).
Quit: Quit the repository manager.
Create: Try to recreate the module windows if they are destroyed. This may not be all that safe to use, and it is usually a better idea to quit and restart the repository manager to get the same effect.
A typical sequence is Setup, Run, Finish, Stop, Destroy, Quit. A debugging note: if you hit Setup or Run and nothing happens, you should check the window where you ran the repository manager for reported configuration file syntax errors. Also, if you simply change module parameters, you do not have to quit the repository manager and restart; you can simply alternate between Run and Finish, as the configuration file is re-read before every Run (and Setup). If you change the machines that modules are running on, you should Stop, Destroy, Quit, and restart in order to ensure that modules will be running in the expected places.
A critical attribute for a module's meta-information structure is class. The class attribute specifies what kind of thing to run. Its value is the name of the Python class that is in charge of running the module code, which may be user defined. Some standard examples are:
Module: This runs an official module in response to a Run. If you omit the class attribute, Module is assumed.
Basic: This allows you to run an arbitrary program in response to a Run.
InitModule: This runs an official module in response to a Setup. Note that InitModules ignore the Finish directive and only respond to Stop.
InitBasic: This runs an arbitrary program in response to a Setup.
In addition, there are a variety of module meta-information attributes that you can set to affect how the module is run.
dir: The directory to change to when setting up. If omitted, there will be no directory changing at setup time.
preamble: An additional command to execute when setting up, after the change of directory. This requires a forced carriage return at the end to work properly, i.e., string preamble = "stty -F /dev/ttyS0 baudrate 9600\n";
command: How to run the module. If you don't give a command, one is constructed out of the module name prepended with ./, i.e., it assumes the executable has the same name as the module in the current working directory. Unlike the preamble, you do not have to worry about terminating the command with a forced carriage return.
embed_command_in: Sometimes, rather than resetting the command attribute, you want to simply embed the command in another string and execute that. If you set embed_command_in to a string with a %s in it, then the command will get inserted into embed_command_in at the %s, and that, plus a trailing carriage return, will be executed instead of the command. A common embed_command_in runs the command in a while loop, so that if the module crashes it gets restarted after a short wait, i.e., string embed_command_in = 'while 1\n%s\nsleep 5\nend\n';
do_block_test: This boolean attribute affects whether or not the target module is configured to run a "block test." A block test is an extra thread associated with a module that wakes up every two seconds and checks to see if the module has successfully completed a run cycle. If not, it marks the module as blocked for the run-time status reporting mechanism. Some modules have a problem with extra threads, so you can set do_block_test to false to avoid running this purely informational thread.
user: This optional string attribute allows you to SSH to the module's host as a different user than the default. The SSH keys must still be set up so that you can log into the host as this different user with no password.

Then, there are a variety of attributes that affect the size, nature, and placement of the xterms that will run the modules.
geometry: The X geometry of the created window. For example, string geometry = "80x10+0+320";
font: The font of the created X window. For example, string font = "6x10";
xterm_options: This lets you pass arbitrary options to the xterm. The main use for this, at least when you are using the fvwm2 window manager, is to choose which panel to put the created window on. For example, string xterm_options = "-xrm '*Page:2 2'";
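The way these meta-attributes combine into the text typed into a module's xterm can be sketched as follows. This is an illustrative model, not the actual RepoMan.py logic; the function name and dictionary layout are invented, but the semantics follow the text: command defaults to the module name prepended with ./, embed_command_in wraps the command at its %s, and a trailing carriage return is appended.

```python
# Illustrative sketch of command construction from module meta-information.

def build_command(module_name, meta):
    # Default command: the executable named after the module, in the
    # current working directory.
    command = meta.get('command', './' + module_name)
    # Optionally embed the command in a wrapper string at its %s.
    wrapper = meta.get('embed_command_in')
    if wrapper is not None:
        command = wrapper % command
    # A forced carriage return is appended so the xterm executes it.
    return command + '\n'

meta = {'embed_command_in': 'while 1\n%s\nsleep 5\nend\n'}
print(repr(build_command('Foo', meta)))
# -> 'while 1\n./Foo\nsleep 5\nend\n\n'
```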
To attach GDB to a running module, do ps -auxwww | grep <program name> and find the process ID of your program, and then from the appropriate directory do gdb <program name> <process id>.
Sometimes you will want to GDB your program from the very beginning. The typical way to do this is to run the system, Ctrl-C your program, and in the same window (or another window on the appropriate machine in the appropriate directory, perhaps within my favorite IDE, emacs) GDB your program. Then, after you set up any break points, use the X selection to copy the appropriate arguments (the ones that connect to the central database, from the aborted run in the original X window), and run the GDB'd program with these arguments, i.e., type r <pasted arguments>. Note: You can keep running and killing the GDB'd program with the same arguments as long as the Run button on the main GUI has been hit and not re-hit. If you hit the Finish button, then the program will not be runnable (i.e., it will check with the repository, see that its running parameter has been set to false, and exit immediately and silently). As long as the repository manager runs on the same machine, you shouldn't have to re-copy and paste the arguments, but you will have to remember to kill the running un-GDB'd program every time you hit the Run button.
If you desire, you can also change the command attribute of the module's meta-information to something like gdb <program>\nbreak main\nrun. Note that this will cause the program to run with an initial breakpoint in the main routine. Despite the "hackiness" of the cut-and-paste method, I have found it to be more flexible and less unwieldy than the "official" method, but tastes may differ.
Each module maintains run-time status information, and ModUtils provides the program repomon to report this, and other information, in a graphical form which provides a central "debugging" report of the health of the system.

repomon is installed in $UTILS_DIR/bin, takes no arguments, and must be run on the same machine as the repository manager. It contacts the repository manager and gets a list of all of the running modules and where they are running. It then attaches to the shared memory region maintained by each module, which reports both the status that the user gives as well as internally maintained information such as cycle time, last run time, last update date, and module status.
The modules are grouped by host (identified by numeric IP address, not name). Each module is identified by name. To the left of the name is the "status square," whose color indicates the module's state. When the system is running successfully, you want to see all green boxes. Red boxes mean that a program has crashed, whereas black boxes mean that a shared memory manager has crashed.
Part of the reporting for each module is the time of the last successful run and the last "update." An update includes the "behind the scenes" checking for blocking that most modules will do, so if the module has not crashed, this should be getting updated even when the module's run method is blocked. In addition, repomon reports two numbers for each of these times: first, the delta time since the run or update has changed, and second, in parentheses, the difference between the run or update and the local clock. The differencing with the local clock can show drifts of time between machines, whereas the delta time since the last change shows the "real" passage of time since the last run or update. Then comes the cycle time, which is the average number of seconds per cycle. Then come the "status" number reported by a module using ConfigSource::setStatus and the "confidence" number reported by a module using ConfigSource::setConfidence. Finally comes the string message which modules set by using ConfigSource::setStatusMessage. If the module never sets this, then it is set to "Module Running" after a successful run method, and set to "Module Blocked" when the module is blocked.
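The two numbers reported for each timestamp can be modeled with a little arithmetic. This is a sketch to clarify the distinction; the function and variable names are invented, and the sign conventions repomon actually uses may differ:

```python
# Illustrative computation of repomon's two per-timestamp numbers: the
# elapsed time since the reported value last changed, and (shown in
# parentheses) the difference between the reported value and the local
# clock. The first measures real staleness; the second exposes clock
# drift between machines.

def report_times(reported_time, last_change_time, local_clock):
    delta_since_change = local_clock - last_change_time
    clock_difference = local_clock - reported_time
    return delta_since_change, clock_difference

# A module last ran at t=100 by its own clock; repomon saw the value
# change at t=101 by the local clock, which now reads 103.
print(report_times(100.0, 101.0, 103.0))  # -> (2.0, 3.0)
```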
One final feature of RepoMan.py is "ubiquitous modules." A ubiquitous module is one which runs some version of itself on every machine. They might be a good idea for something like vehicle state propagation daemons in a system where vehicle state is used "ubiquitously" by almost every module in the system. The idea is that there is a "server" version of the module which runs on a particular machine (for the vehicle state example, this is the module which is connected to the actual sensors), and on every other machine in the system we automatically run a "client" version of the module, in this case one which reads the vehicle state from the server module and keeps a history of the vehicle state for propagation to all modules running on that client machine.

The key attribute for a ubiquitous module is

bool ubiquitous = true;
The host attribute for the ubiquitous module selects the "server" machine, and the parameter specification for the server module should be in the parameter section, as with a normal module. The client specification is given by the attribute client_spec, which has a %s (as in the C printf string directive) embedded in it. This %s will be replaced by the server module's host name. For example, the client spec

struct client_spec {
    spec state_sense_spec {remote:
        string host = "%s";
    }
}

will be instantiated for each client machine in the system, with the host variable set to the server machine's name.
Normally, we expect the same command to be used for the client as for the server (from the command attribute), but you can set the string attribute client_command to set the client command to something else.
Finally, there is the matter of sizing and positioning the client windows. These client windows will use the same font as the server window (from the font attribute), but the geometry of the clients is specified by the client_geom attribute, which must have a %d embedded in it (for example, string client_geom = "40x5-300+%d";). In addition, you should provide a client_inc integer attribute which specifies the vertical "increment" for the client window. For each client window, the %d is replaced by a multiple of client_inc. The idea is that the server window is at 0, and the client windows are displaced, usually vertically, by client_inc pixels.
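Putting the two substitutions together, the per-client windows can be generated as sketched below. This is an invented illustration, not RepoMan.py code; the %s in the client spec becomes the server host, and the %d in client_geom becomes a multiple of client_inc:

```python
# Illustrative generation of (host, spec, geometry) for each client of a
# ubiquitous module.

def client_windows(server_host, client_hosts, client_spec, client_geom,
                   client_inc):
    windows = []
    for i, host in enumerate(client_hosts, start=1):
        spec = client_spec % server_host           # %s -> server host name
        geometry = client_geom % (i * client_inc)  # %d -> i * client_inc
        windows.append((host, spec, geometry))
    return windows

print(client_windows('bumper', ['antenna', 'fender'],
                     'remote: string host="%s";', '40x5-300+%d', 80))
# -> [('antenna', 'remote: string host="bumper";', '40x5-300+80'),
#     ('fender', 'remote: string host="bumper";', '40x5-300+160')]
```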
The collection system is implemented in CollectManager.py and is usually invoked by the script Collect, which is given a list of modules to run just as the RepoMan script was.
The process of collection is simple: When you run, if there is any collection to be done, a directory is created on the disk available via NFS to all the relevant machines. This directory is based on the global parameter data_directory, which has the unfortunate default of ".". The actual directory created for a run is the data directory with a month-day-year.hour-minute.second suffix. Thus for any run you will have a unique, date and time derived directory where the data will end up. Modules run and put data in a directory on their local machines. This directory is specified by the global parameter incoming_directory, and defaults to data. When you hit Finish, the modules cleanly exit and data is consolidated from the incoming directories on each machine to the data directory.
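The construction of the per-run directory name can be sketched in a few lines. The month-day-year.hour-minute.second layout follows the text, but the exact separators CollectManager.py uses are an assumption here:

```python
# Illustrative construction of the unique per-run collection directory.

import os
import time

def run_directory(data_directory, when=None):
    # month-day-year.hour-minute.second, as described in the text.
    stamp = time.strftime('%m-%d-%Y.%H-%M.%S', time.localtime(when))
    return os.path.join(data_directory, stamp)

# For example, a run at 2:05:09 PM on March 7, 2004:
t = time.mktime((2004, 3, 7, 14, 5, 9, 0, 0, -1))
print(run_directory('/home/data', t))  # -> /home/data/03-07-2004.14-05.09
```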
Any module that collects data must be a CollectModule, i.e., in its module meta-information it must have the class attribute set to CollectModule. In addition, collection modules must list the files that they are collecting in the module meta-information (this must match any logging specifications in the module's parameter structure; the match is not done automatically). There are two separate types of logging for a collection module: crunched files and copy files.
Remember that canned data files are created in two parts, index and data. You set a collection module's string crunch_files attribute to the list of canned data files that the collection module will produce (remember, each file name will have the incoming_directory global parameter prepended to it). On finishing, the collection script will send commands to each collection module's window which will crunch the appropriate index and data files from the incoming directory together into a single file, and then move the resulting file to the data directory.
In addition, all files listed in a collection module's copy_files attribute (prepended with the global parameter incoming_directory) will simply be copied from the incoming directory to the data directory. This takes care of consolidating simple text log files or other non-canned data files.
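The consolidation step can be sketched as follows. This is an illustration only: the real work is done by commands sent to each collection module's window, the <name>.index / <name>.data file naming is an assumption, and "crunching" is modeled here as simple concatenation:

```python
# Illustrative sketch of the consolidation step: "crunch" index/data file
# pairs into single files in the data directory, and copy plain files
# across unchanged.

import os
import shutil

def consolidate(incoming_dir, data_dir, crunch_files, copy_files):
    for name in crunch_files:
        # Crunch: combine the index part and the data part into one file
        # in the data directory (modeled here as concatenation).
        with open(os.path.join(data_dir, name), 'wb') as out:
            for part in ('.index', '.data'):
                with open(os.path.join(incoming_dir, name + part), 'rb') as f:
                    shutil.copyfileobj(f, out)
    for name in copy_files:
        # Copy files are moved across verbatim.
        shutil.copy(os.path.join(incoming_dir, name), data_dir)
```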
One note: You should be careful in using the collection script, as there is currently no feedback about the commands being sent to a module window being executed. For a long collection, it may take a significant amount of time to consolidate the data, and you must be careful not to start another collection before the consolidation is done. You may want to peruse the collection module windows as part of the collection process to ensure that the consolidation actually worked and terminated correctly.
Production systems are not run with the plain RepoMan or CollectManager classes; their use requires the building of a custom Python module to run the production system.
The overall structure is that we have created Launcher and Unlauncher scripts. Instead of creating X windows and sending commands, when we start up a production system we ssh an invocation of the Launcher script to each machine. The Launcher script contacts the running repository manager, finds out what modules need to be run on that machine, and then starts them. The Unlauncher script is how you cleanly shut down the system.
To build a production system you need to subclass from RepoMan and initialize the superclass with no_display set to 1. Then your run method should probably first send Unlauncher commands to all the machines to clean up any previous invocations, like this:
for h in self.hosts.keys():
    cmd = 'ssh %s Unlauncher %s >& /dev/null &' % (h, self.host)
    os.system(cmd)
time.sleep(1)
while self.server.processEvents(1.0):
    pass
Then the run method launches your modules with something like:

for h in self.hosts.keys():
    os.system('ssh %s Launcher %s >& /dev/null &' % (h, self.host))
Part of the philosophy of the production system is that modules are run via simple scripts that restart them if they crash. In addition, modules can also be marked as "vital," i.e., if they crash then that indicates a basic breach in the integrity of the run and the whole production is brought to a halt. An example is a production system whose primary purpose is data collection: just restarting a crashed data collection module would produce inconsistent data, abrogating the basic reason for the production system.
One caveat: The production system will only work with official modules, not with just any program as with the regular RepoMan or Collect scripts. If you attempt to use a program that is not a module (i.e., its class is not derived from Module), then that program will be skipped.
The Launcher script uses attributes in common with the RepoMan script, such as dir and command for where and how to run the program. Instead of the preamble directive, it looks at the prerun string attribute for what to run before the command. In addition, there is a postrun attribute which indicates what should run after the command.
Each program is run inside of a script to make it "persistent." If the module dies, then the script will wait for a certain number of seconds (defined by the integer attribute interval, which defaults to 5), and then restart the module. If you set the boolean attribute vital to true, then if the module dies, the global parameter string stop_all will be set to 1. This will result in the invocation of the method stop_all, which should bring everything to a halt.
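The persistent/vital behavior can be modeled in a few lines of Python. This is an invented sketch of the control flow, not the actual wrapper script (which is a shell script around the module executable); run_module and the repository dictionary are stand-ins, and max_runs exists only to keep the illustration finite:

```python
# Illustrative model of the "persistent" wrapper: rerun the module
# whenever it dies, waiting `interval` seconds between attempts; if the
# module is vital, a death instead sets the global stop_all flag.

import time

def run_persistent(run_module, repository, interval=5, vital=False,
                   max_runs=None):
    runs = 0
    while not repository.get('stop_all'):
        run_module()          # blocks until the module exits (i.e., dies)
        runs += 1
        if vital:
            # A vital module dying breaches the integrity of the run:
            # signal the whole production system to halt.
            repository['stop_all'] = '1'
            break
        if max_runs is not None and runs >= max_runs:
            break             # escape hatch for this illustration only
        time.sleep(interval)
```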
If you have to kill the remnants of a production system on a machine by hand, you should do something like

killall -9 run_persistent_module run_vital_module run_iptshmgr iptshmgr iserver_set sleep

followed by a killall to take out your modules by program name. Note: you should use the Unix command ps to verify what you have left to kill.