Glass Room
part of the ArsDigita Community System
by Philip Greenspun
ArsDigita Glass Room is a module that lets the community system
implement the final component of the
ArsDigita Server Architecture: coordinating a bunch of human beings
to ensure the reliable operation of a Web service.
The first function that Glass Room must accomplish is the distribution
of information. The glassroom_info table contains:
- the name of the service
- a reference to the host that does Web service
- a reference to the host that does RDBMS service
- a reference to the host that does primary DNS service
- a reference to the host that does secondary DNS service
- a reference to the host that serves for disaster recovery
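One way to express the list above as a table definition is sketched here. The column names are assumptions made for illustration, as is the host_id key on glassroom_hosts; this is not necessarily the data model that ships with the module.

```sql
-- Sketch only: column names are illustrative assumptions.
-- Assumes glassroom_hosts has an integer primary key named host_id.
create table glassroom_info (
    service_name            varchar(50),
    webserver_host          integer references glassroom_hosts(host_id),
    rdbms_host              integer references glassroom_hosts(host_id),
    primary_dns_host        integer references glassroom_hosts(host_id),
    secondary_dns_host      integer references glassroom_hosts(host_id),
    disaster_recovery_host  integer references glassroom_hosts(host_id)
);
```

Storing references rather than hostnames means that a machine filling two roles (say, Web and RDBMS service) is described only once.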
For each of the physical computer systems involved, there is an entry in
glassroom_hosts:
- main hostname
- ip address
- operating system version
- description of physical configuration (e.g., "Sun Ultra 2 pizza box with two
CPUs, 1.25 GB of RAM (4 SIMM slots free), two fast-wide SCSI disk drives
(SCA connectors), one 68-pin mini-SCSI cable to disk enclosure
containing 13 additional disks, one 68-pin mini-SCSI cable to DDS3 tape
drive containing")
- model #, serial #
- street address
- how one gets to the console port
- service contract phone number
- service contract number and any other details
- phone number and main contact for the facility where it is hosted
(e.g., NOC at above.net or exodus.net)
- human-readable description of the file system backup strategy and
schedule for this host
- human-readable description of the RDBMS backup strategy and schedule
for this host (if applicable)
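A rough sketch of a table holding the host attributes listed above follows. Every column name and size here is an assumption; most of the fields are free text, so they are given the widest varchar that Oracle 8 allows.

```sql
-- Sketch only: column names and sizes are illustrative assumptions.
create table glassroom_hosts (
    host_id                      integer primary key,
    hostname                     varchar(100) not null,
    ip_address                   varchar(50),
    os_version                   varchar(50),
    physical_description         varchar(4000),
    model_and_serial             varchar(4000),
    street_address               varchar(4000),
    console_instructions         varchar(4000),  -- how one gets to the console port
    service_phone_number         varchar(100),
    service_contract             varchar(4000),  -- contract number and other details
    facility_phone               varchar(100),
    facility_contact             varchar(4000),
    file_system_backup_strategy  varchar(4000),
    rdbms_backup_strategy        varchar(4000)   -- null if this host runs no RDBMS
);
```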
An expired Verisign certificate can be nearly fatal to a service that
requires SSL to operate. Users get hammered with nasty warning messages
that they don't understand. So we need the glassroom_certificates
table with the following columns:
- hostname to which this cert applies
- who issued the cert (usually Verisign)
- email address encoded in the cert
- expiry date
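As a sketch, the certificate table and a query for certificates that need attention might look like this. The column names and the 30-day warning window are assumptions, not part of the description above.

```sql
-- Sketch only: column names are illustrative assumptions.
create table glassroom_certificates (
    cert_id        integer primary key,
    hostname       varchar(100),
    issuer         varchar(100),   -- usually 'Verisign'
    encoded_email  varchar(100),
    expires        date
);

-- Certificates that have expired or will expire within 30 days.
-- Adding a number to an Oracle date adds that many days.
select hostname, issuer, expires
from glassroom_certificates
where expires < sysdate + 30
order by expires;
```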
Important news, such as the fact that regular backups have been halted
and someone is restoring from tape, is recorded using the standard ACS
/news subsystem.
Modeling the software
Every site is going to depend on a set of software modules that can be
versioned. The ones that occasion the most discussion are presumably
the custom-written software, e.g., the scripts that drive the Web site.
However, we still need to keep track of packaged software. People might
need to know that we're currently running Oracle 8.0 but plan to upgrade
to 8.1 in April 1999.
We are also going to tie bug tickets and feature requests to software
modules so that only the relevant personnel need be alerted. Here's
what the glassroom_modules table keeps:
- name of module, e.g., "Solaris", "Oracle", "ArsDigita Reporter" for
packaged software or "foobar.com" for the custom Web scripts
- where we got it (URL, vendor phone number)
- current operating version (a date of download if the software itself
doesn't come with a version)
- who installed it (references users table)
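A possible table definition for the fields above is sketched below. The column names are assumptions; the only thing taken from the text is that the installer references the users table.

```sql
-- Sketch only: column names are illustrative assumptions.
create table glassroom_modules (
    module_id         integer primary key,
    module_name       varchar(100),   -- e.g., 'Oracle' or 'foobar.com'
    source            varchar(4000),  -- URL, vendor phone number
    current_version   varchar(50),    -- or a download date if unversioned
    who_installed_it  integer references users(user_id)
);
```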
So that bug tickets and feature requests can be closed out with a
structured "fixed in Release 4.1", Glass Room needs to know about
software releases. We have a table glassroom_releases containing:
- module_id (references glassroom_modules)
- release_date (null until done)
- anticipated_release_date
- release name (just text; Glass Room doesn't care if 3.7 comes after 4.0)
- manager (a person; references users(user_id))
We also use this table even when we're talking about software releases
that we're merely installing, not developing (e.g., for Oracle 8.1).
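The release table might be sketched as follows. The release_id key, the release_name column, and the module_name column used in the query are assumptions; the other column names come from the list above.

```sql
-- Sketch only: release_id and release_name are assumed names.
create table glassroom_releases (
    release_id                integer primary key,
    module_id                 integer not null
                              references glassroom_modules(module_id),
    release_date              date,         -- null until done
    anticipated_release_date  date,
    release_name              varchar(50),  -- free text; ordering is not implied
    manager                   integer references users(user_id)
);

-- Releases still pending, soonest first (e.g., a planned Oracle upgrade).
-- Assumes glassroom_modules has a module_name column.
select m.module_name, r.release_name, r.anticipated_release_date
from glassroom_releases r, glassroom_modules m
where r.module_id = m.module_id
and r.release_date is null
order by r.anticipated_release_date;
```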
Modeling and Logging Procedures
A procedure is something that must be regularly done, e.g.,
"verify backup tape". We want to log everything of this nature that has
been done, by whom, and when. Glass Room needs to know which of these
procedures need to be done and how frequently. That way it can check
the log and raise some alerts when procedures haven't been done
sufficiently recently.
We keep a single glassroom_logbook table in which all kinds
of events are intermingled. Some of these might even be ad-hoc events
for which we don't have a procedure on record as needing to be done.
So that the system can do automated checking of the logbook table, we
keep glassroom_procedures:
- procedure name (no spaces, e.g., "verify_backup_tape"; so we can use
this as a database key)
- responsible_user
- responsible_user_group
(one of the preceding must be non-null)
- maximum time interval (in days or fractions of days)
- importance (1 through 10; 10 is most important)
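The constraints described above can be written directly into the table definition. This is a sketch: max_time_interval is an assumed column name, and the reference to user_groups(group_id) assumes that the community system's group table is keyed that way.

```sql
-- Sketch only: max_time_interval and the user_groups reference are assumptions.
create table glassroom_procedures (
    procedure_name          varchar(50) primary key,  -- e.g., 'verify_backup_tape'
    responsible_user        integer references users(user_id),
    responsible_user_group  integer references user_groups(group_id),
    max_time_interval       number,   -- days; fractions allowed
    importance              integer check (importance between 1 and 10),
    -- someone must be responsible: a user, a group, or both
    check (responsible_user is not null
           or responsible_user_group is not null)
);
```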
Logbook entries can be made by human beings or robots. As the Glass
Room is generally running on a geographically separate machine from the
production servers, the robots will have to make their log entries via
HTTP GET or POST.
Here's the data model for glassroom_logbook:
- entry_time
- entry_author (user id; provision is made for robots by registering
them as users)
- procedure_name (generally references the procedures table but need
not for one-time events)
- notes
People can comment on logbook entries, but we just do this with the
general_comments table.
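A sketch of the logbook table, together with the kind of query the automated checking could run, is given below. The column sizes are assumptions, and the query assumes that glassroom_procedures has columns named procedure_name, importance, and max_time_interval (the last measured in days).

```sql
-- Sketch only: sizes are assumptions. procedure_name deliberately has no
-- foreign key so that one-time, ad-hoc events can be logged too.
create table glassroom_logbook (
    entry_time      date not null,
    entry_author    integer not null references users(user_id),
    procedure_name  varchar(50),
    notes           varchar(4000)
);

-- Procedures that have never been logged, or were last logged longer ago
-- than their maximum interval, most important first.
-- The (+) marks an Oracle outer join, so unlogged procedures still appear.
select p.procedure_name, p.importance, max(l.entry_time) as last_done
from glassroom_procedures p, glassroom_logbook l
where p.procedure_name = l.procedure_name(+)
group by p.procedure_name, p.importance, p.max_time_interval
having max(l.entry_time) is null
    or max(l.entry_time) < sysdate - p.max_time_interval
order by p.importance desc;
```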
Suggested Procedures
Check at least the following:
- Oracle exports completing successfully
- Oracle exports cover all production users
- Oracle exports can be successfully imported into another system
- tape backups occurring
- verification of tape made yesterday in same drive
- off-site transfers of tapes occurring
- verification of off-site transferred tapes read into another machine
Domains
We don't want an unpaid InterNIC invoice rendering our service
inaccessible to most users. So we keep track of all the domains on
which our service depends, when they expire, who has paid the bill, and
when the last bill was paid.
create table glassroom_domains (
    domain_name   varchar(50),   -- e.g., 'photo.net'
    last_paid     date,
    by_whom_paid  varchar(100),
    expires       date
);
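With that table in place, a query along these lines would list the domains that need attention. The 60-day window is an arbitrary assumption chosen for illustration.

```sql
-- Domains that have expired or will expire within the next 60 days.
select domain_name, expires, last_paid, by_whom_paid
from glassroom_domains
where expires < sysdate + 60
order by expires;
```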
Bug Tracking, Feature Requests, and Tickets
In the tech world, people seem to like organizing things by trouble
ticket:
- Joe Customer opens a ticket when he is unhappy about a bug on a page
- If it is a high priority bug, a variety of folks get notified via
email and maybe pager; if it is a low priority bug, it sits in the queue
until someone notices
- A coordinator assigns the bug to Jane Programmer, causing the system
to send Jane email
- Jane Programmer fixes the bug and records that fact, causing the
system to send email to Joe Customer
The same kind of interaction works well for feature requests, except
that Jane Programmer might need to record the version number of the
software that will incorporate the new feature.
So that the group can see whether everyone is working together
effectively, the system can produce reports such as "average time to
implement a requested feature", "response time for bugs arranged by the
person who reported them", etc.
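To make the reporting idea concrete, here is a sketch against a hypothetical ticket table. Nothing in this document defines such a table, so tickets_sketch and all of its columns are invented for illustration. The queries also interpret "response time" as the time from opening a ticket to closing it, which is an assumption.

```sql
-- Hypothetical table, invented for this example only.
create table tickets_sketch (
    ticket_id    integer primary key,
    ticket_type  varchar(20),   -- 'bug' or 'feature'
    reported_by  integer references users(user_id),
    assigned_to  integer references users(user_id),
    opened_date  date not null,
    closed_date  date           -- null while the ticket is open
);

-- Average time to implement a requested feature, in days.
-- Subtracting one Oracle date from another yields a number of days.
select avg(closed_date - opened_date) as avg_days_to_implement
from tickets_sketch
where ticket_type = 'feature'
and closed_date is not null;

-- Average time to fix bugs, arranged by the person who reported them.
select reported_by, avg(closed_date - opened_date) as avg_days_to_fix
from tickets_sketch
where ticket_type = 'bug'
and closed_date is not null
group by reported_by
order by avg_days_to_fix desc;
```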
philg@mit.edu