MetaServerII relay v1.0 README
By Andy McFadden (fadden@uts.amdahl.com)
Written: January 1993
Updated: 29-Jan-93


OVERVIEW

blah blah

usage: relayII [-v] [-t]
	-v	verbose messages and logging
	-t	truncate log file before starting


COMPILING/INSTALLATION

The first step is to unpack relayII.  Obviously you already did this.

The next step is to edit "relay.h".  There are some configurable parameters
near the top of the file:

/*#define DEBUG*/
	This turns on a lot of noisy stuff.

/* files - adjust to taste */
#define LOGFILE		"relaylog"
	Where to put the log entries.  Unless the "-t" option is given, they
	will be appended to the end of the file.  The format of log entries
	is discussed in the "DAILY ADMIN" section.

#define CONFIGFILE	"relayrc"
	Where to read the configuration from.  The configurable stuff is
	described in the "CONFIGURATION" section.

/* some time constants, in seconds */
#define TIME_RES	30		/* how often do we wake up */
	Normally relayII sits waiting for users.  However, every TIME_RES
	seconds it will wake up and check all of its connections to see if
	any of them have timed out.  (Thus, connections don't time out after
	*exactly* MAX_USER seconds, but rather some time between MAX_USER and
	MAX_USER + TIME_RES seconds: 10 to 40 with the values shown here.)

#define MAX_USER	10		/* after this long, bye bye user */
	How long to wait for a user to finish.  Most user connections will
	terminate after a fraction of a second, so this is pretty generous.
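
	To see how TIME_RES and MAX_USER interact, here is a small sketch.
	It is not code from relayII: drop_time() is a made-up helper, and it
	assumes wake-ups land on exact multiples of TIME_RES.  It works out
	at which wake-up a connection opened at a given second is dropped.

```c
#include <assert.h>

#define TIME_RES 30   /* seconds between wake-ups (value from relay.h) */
#define MAX_USER 10   /* seconds a user connection may stay open */

/* Hypothetical helper: return the time of the first wake-up at which a
 * connection opened at second 'opened' is at least MAX_USER old.
 * Assumes wake-ups happen at 0, TIME_RES, 2*TIME_RES, ... */
long drop_time(long opened)
{
    long t;

    assert(opened >= 0);
    t = (opened / TIME_RES) * TIME_RES;   /* last wake-up at or before 'opened' */
    while (t - opened < MAX_USER)
        t += TIME_RES;                    /* not stale yet, wait for the next one */
    return t;
}
```

	With these values a connection opened at second 20 is dropped at the
	wake-up at 30 (after 10 seconds), while one opened at second 21
	survives until the wake-up at 60 (after 39 seconds).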


The next step is to build relayII.  Check the Makefile for any system-
dependent stuff, and type "make".  It's pretty straightforward, so I don't
anticipate any problems.

You should now have a working copy of relayII.  Before you can start
it up, however, it must be correctly configured.


CONFIGURATION

This section contains the sample "metarc" file with explanations.  Blank lines
and lines starting with '#' will be ignored.

#
# Sample configuration for MetaServerII
#

# ports to listen on for user connections
#
# ----- num --- kind -- "description" ------------------------- extra-stuff ---
port:	3520	old	"Original METASERVER format"
port:	3521	new	"New and improved format"
port:	3522	verbose	"Verbose format, with player lists"
port:	3523	info	"News & information"			metaII_info
port:	3524	info	"Server FAQ posting (long)"		server_faq

	This should be fairly obvious.  Port assignments begin with "port:",
	and are followed by the number of the port to use.  Ports < 1024
	require you to run as root, and are probably best left alone.  The
	"description" field is currently only used on the verbose display.

	The "kind" field may be one of those shown (you should already be
	familiar with what each looks like).  The "info" kind deserves special
	attention.  Instead of relating to an internal output format, it
	points to a file which will be opened, read, and sent to the client.
	The name of the file is held in the "extra-stuff" column.
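
	For illustration, one way to pick such a line apart is a single
	sscanf() call.  This is only a sketch, not the parser in the source:
	struct port_line and parse_port_line() are made-up names, and it
	assumes the description is non-empty and enclosed in double quotes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one "port:" line; the real struct may differ. */
struct port_line {
    int  port;
    char kind[16];
    char desc[64];
    char extra[64];   /* file name for "info" ports, otherwise empty */
};

/* Parse one "port:" line.  Returns 0 on success, -1 if it is malformed. */
int parse_port_line(const char *line, struct port_line *out)
{
    int n;

    assert(line != NULL && out != NULL);
    out->extra[0] = '\0';
    n = sscanf(line, "port: %d %15s \"%63[^\"]\" %63s",
               &out->port, out->kind, out->desc, out->extra);
    if (n < 3 || out->port < 1 || out->port > 65535)
        return -1;
    /* an "info" port is useless without a file name in extra-stuff */
    if (strcmp(out->kind, "info") == 0 && n < 4)
        return -1;
    return 0;
}
```

	The sketch rejects an "info" line with nothing in the extra-stuff
	column, since there would be no file to send.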

# couldn't connect to server, wait xx minutes
wait_noconn:		30
# server is empty, wait xx minutes
wait_empty:		15
# server is open (1-15 players), wait xx minutes
wait_open:		10
# server has a wait queue, wait xx minutes
wait_queue:		15

	These are the variable delay settings.  If a server is checked and
	nobody is playing, MS-II will add the value of wait_empty to the
	current time and not check the server again until that time.  Same
	for the others.  The idea here is that servers which are down will
	probably stay down, but servers with a nonzero number of players will
	change in the near future and therefore must be checked soon.
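
	The rule amounts to a lookup on the server's status.  A minimal
	sketch follows; the enum, the struct, and next_check() are invented
	for this example and are not the names used in the source.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical status codes, one per wait_* setting. */
enum server_status { ST_NOCONN, ST_EMPTY, ST_OPEN, ST_QUEUE };

/* The four delay settings from the config file, in minutes. */
struct delays {
    int wait_noconn;
    int wait_empty;
    int wait_open;
    int wait_queue;
};

/* Return the earliest time at which the server should be checked again. */
time_t next_check(time_t now, enum server_status st, const struct delays *d)
{
    int minutes;

    assert(d != NULL);
    switch (st) {
    case ST_EMPTY: minutes = d->wait_empty;  break;
    case ST_OPEN:  minutes = d->wait_open;   break;
    case ST_QUEUE: minutes = d->wait_queue;  break;
    default:       minutes = d->wait_noconn; break;  /* ST_NOCONN or unknown */
    }
    return now + (time_t)minutes * 60;
}
```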

# check port-1 (player list port) first?
try_alt_port:		on

	If this is on, port-1 is checked BEFORE the Netrek port (so if the
	Netrek port is 2592, it will check port 2591 first).  It looks for
	the output of a "players" command, or the string "No one is playing"
	(defined in server.c).  If it is told that no one is playing it will
	immediately mark the server as empty and stop processing (which is
	good because it prevents MS-II from waking the daemon up once every
	15 minutes).

# shall we do logging?
do_logging:		on

	If this is off, NO output will be sent to the log file.

# where are we?
location:		Santa Clara, CA

	Geographic location.  This is for display purposes only.

# who maintains me?
admin:			fadden@uts.amdahl.com

	Your e-mail address, so that people with comments can find you.  If
	MS-II is unable to open the file associated with an "info" port,
	it will send your e-mail address instead.

# should we write to a checkpoint file before exiting?
save_ckp:               on

	Controls whether or not the checkpoint file is written when the
	process exits.  This should always be "on" unless you are debugging
	the code, in which case the checkpoint file tends to get in the way.

#
# List of servers; runs until end of file.  MetaServerII will use the IP
# address if one is given; if you use 0.0.0.0 it will resolve the domain
# name.
#
# Note that this format is identical to the FAQ server list; just cut & paste.
# If you plan to take MetaServerII up and down frequently, put the most
# popular servers near the top, so that they can be scanned right away when
# the program comes up.
#
servers:

# server name ----------------- IP address ---- Port -- Notes (optional) ------
#tde.uts.amdahl.com		0.0.0.0		6592	Amdahl-only
rwd4.mach.cs.cmu.edu		128.2.209.169	2592	CMU.
bronco.ece.cmu.edu		128.2.210.65	2592	Bronco is back!
calvin.usc.edu			128.125.62.143	2592	USC.

	The comments in the sample metarc explain it fairly well.  The
	numeric IP address is used because it makes starting up faster and
	it works even if DNS is out to lunch or is missing parts of the map
	for foreign countries.  The "notes" column runs until a newline is
	seen, so no quotes are necessary.
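
	As an illustration of the format, here is a sketch of a parser for
	one server line.  It is not the code in the source; struct
	server_line and parse_server_line() are made-up names, and the field
	widths are arbitrary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one entry in the "servers:" list. */
struct server_line {
    char name[64];
    char ip[16];
    int  port;
    char notes[64];
    int  resolve;     /* nonzero if the IP is 0.0.0.0 (look the name up) */
};

/* Returns 0 if a server was parsed, 1 for a comment or blank line,
 * and -1 if the line is malformed. */
int parse_server_line(const char *line, struct server_line *out)
{
    const char *p;
    int used = 0;

    assert(line != NULL && out != NULL);
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return 1;
    if (sscanf(line, "%63s %15s %d%n",
               out->name, out->ip, &out->port, &used) < 3)
        return -1;

    /* the notes column is whatever follows, up to the newline */
    p = line + used;
    p += strspn(p, " \t");
    snprintf(out->notes, sizeof out->notes, "%s", p);
    out->notes[strcspn(out->notes, "\n")] = '\0';

    out->resolve = (strcmp(out->ip, "0.0.0.0") == 0);
    return 0;
}
```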


Tailor the values to your liking, and start it up.  You may want to run it
in verbose mode (-v) the first time to see what it's doing.  This also stores
more information in the log file.

[ As a temporary measure, starting a hostname with "RSA" will cause the 'R'
flag to appear.  So you'd put "RSAbronco.ece.cmu.edu".  The extra characters
will be stripped out before any further processing is done. ]
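
Stripping the prefix is a three-character check.  A sketch, assuming the
hostname sits in a writable buffer (strip_rsa_prefix() is a made-up name, not
a function from the source):

```c
#include <assert.h>
#include <string.h>

/* If 'host' starts with "RSA", remove those three characters in place and
 * return 1 (the caller would then set the 'R' flag); otherwise return 0. */
int strip_rsa_prefix(char *host)
{
    assert(host != NULL);
    if (strncmp(host, "RSA", 3) != 0)
        return 0;
    memmove(host, host + 3, strlen(host + 3) + 1);   /* +1 copies the NUL */
    return 1;
}
```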


DAILY ADMINISTRATION

Okay, now you've got it up and running.  There are a few things you need to
deal with on a daily basis.

First, the log file.  As of the time I'm writing this, an average day in the
life of the MetaServerII process sees 750 connections from users (update:
it's three months later, and it now receives 1800 connections on average).
This creates about 60K of data in the log file (which has an entry for every
user connection plus an entry for every failure).  You can generate some
statistics with the "metastat" program, which scans "metalog" and reports how
many users connected to each service (it's a shell script with lots of
hardcoded stuff).

You can truncate the log file with "date > metalog".  This will work because
MS-II always seeks to the end of the file before sending anything to the
log, and it has stdio buffering turned off.  DO NOT remove the log file and
try to recreate it.  This will remove the directory entry, but the log
will continue to exist on the disk until the process is killed (it's a UNIX
thing).  If you remove the log file you will have to kill the process to
free the disk space, so you might as well kill it, turn logging off in the
metarc, and then restart it.

There is currently no way to change the configuration of an actively running
MetaServerII process.  Changes are made by updating the metarc, killing the
process, and then immediately restarting it.  This results in a loss of
service for about two seconds, which shouldn't affect anybody.

To prevent information from being lost, MS-II traps common signals (SIGINT,
SIGTERM), and saves its state in a checkpoint file (the name is somewhat
misleading; it doesn't actually checkpoint in the usual sense.  It would be
trivial to make it do so, however).  When the process is restarted, MS-II
will read the checkpoint file, remove it, and continue where it left off.

The metackp file is nothing more than two timestamps (time file created, time
first process was started) and a dump of the "servers" array.  If you change
the definition of the SERVER struct you will invalidate the checkpoint file,
so if you are making changes to the program it's probably a good idea to
remove metackp before restarting.

Since the contents of the "servers" array are probably going to be different
from what's in the metackp file (presumably the process was killed to add
more servers or correct information), MS-II can't just reload the entire
array.  Instead it goes through and checks to see if the server name, port,
IP address, and comment all match.  If they are all identical, then the
rest of the information (server status, time of next check, etc) is copied
into memory.  If they don't match, the information is ignored.
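
The matching rule can be sketched as follows.  The struct here is a cut-down
stand-in for the real SERVER struct, and merge_checkpoint() is a made-up name;
only the four-field comparison is taken from the description above.

```c
#include <assert.h>
#include <string.h>

/* Cut-down, hypothetical stand-in for the SERVER struct. */
struct server {
    char name[64];
    char ip[16];
    int  port;
    char notes[64];
    int  status;        /* state restored from the checkpoint */
    long next_update;   /* state restored from the checkpoint */
};

static int same_server(const struct server *a, const struct server *b)
{
    return strcmp(a->name, b->name) == 0 && strcmp(a->ip, b->ip) == 0 &&
           a->port == b->port && strcmp(a->notes, b->notes) == 0;
}

/* Copy saved state from 'old' into every entry of 'cur' that matches
 * exactly.  Returns the number of entries restored. */
int merge_checkpoint(struct server *cur, int ncur,
                     const struct server *old, int nold)
{
    int i, j, restored = 0;

    assert(cur != NULL && old != NULL);
    for (i = 0; i < ncur; i++) {
        for (j = 0; j < nold; j++) {
            if (same_server(&cur[i], &old[j])) {
                cur[i].status = old[j].status;
                cur[i].next_update = old[j].next_update;
                restored++;
                break;
            }
        }
    }
    return restored;
}
```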

This way, new servers will be checked immediately, deleted servers will be
gone, and changed servers will show the changes.  It's a simple mechanism
and has its flaws, but it works well enough.

To summarize:
- truncate the log file with "date > metalog".
- if you don't want a log file, turn logging off.
- to change information in the metarc, make the changes, kill the process,
  and restart it immediately.


That is all you need to know to run MetaServerII.  The next sections go into
more detail about how everything works.


OUTPUT FORMATS AND USER CONNECTIONS

There are currently four defined user formats:

    old      - original METASERVER format
    new      - same information as "old", but reformatted and with some goodies
    verbose  - information from "new" port, but with player lists as well
    info     - displays the contents of a file

As stated earlier, you can map a port to any given format by adding a line
to the metarc.

When a user connects to MS-II, main_loop() calls new_user() with the index
of the port he connected to, and the file descriptor for the listen socket.
new_user() handles the client connection, sets up some structures, and then
calls one of the display_*() routines to generate output.

There is one display_*() routine for each of the display formats.  Each one
will allocate a buffer, attach it to the struct for the particular client,
and then fill it with data.  The display routines don't actually send anything
over the socket; that is handled in a non-blocking fashion.

When the display routine returns, the socket is added to the list of
writeable file descriptors.  When the server sees that it is writeable, the
data will be sent to the client.  When all of the data is sent, the connection
is closed and the resources are freed.
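
The sending side might look roughly like this.  It is a sketch under
assumptions: struct user and flush_user() are invented for the example, and
the descriptor is assumed to already be in non-blocking mode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-client state: a filled buffer and how much was sent. */
struct user {
    int    fd;
    char  *buf;
    size_t len;   /* bytes in buf */
    size_t pos;   /* bytes already written */
};

/* Call when select() says u->fd is writeable.
 * Returns 1 when all data has been sent (caller closes and frees),
 * 0 if the socket filled up and there is more to send later,
 * and -1 on a real error. */
int flush_user(struct user *u)
{
    assert(u != NULL && u->pos <= u->len);
    while (u->pos < u->len) {
        ssize_t n = write(u->fd, u->buf + u->pos, u->len - u->pos);
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted, just retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;                 /* wait for the next select() */
            return -1;
        }
        u->pos += (size_t)n;
    }
    return 1;
}
```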

If the user sits on the port for too long, MS-II will time the connection
out and free everything.  An error message will be sent to the log file.
This usually only happens when connecting to hosts in foreign countries.

Note that, after receiving several hundred connections in the space of two
minutes, I have added a rudimentary "flood control" mechanism.  Look at
new_user() in scan.c for details.


ADDING A NEW OUTPUT FORMAT

(1) Take a look at disp_old.c, disp_new.c, and disp_info.c to get a
    "feel" for what is involved.

(2) Create a new file with a function in it that will be called (don't worry
    about the contents yet; just add the prototype).  Add the new file to
    the Makefile.

(3) Add a new "DI_*" definition to the enum line in meta.h.  This value will
    be used in switch statements and array indices.

(4) Add the definition to the PORT_DEF array in main.c.  The name in quotes
    corresponds to the name you put in the metarc file.

(5) Add the definition to the case statement in new_user() in scan.c.  Have
    it call the routine you added in step (2).

(6) Add a line to the metarc which invokes your new format.  The "description"
    field is used by the "verbose" format.

(7) Add your new function (this is the fun part).

That's it.  Whatever your new function outputs will now be sent on the
port you gave in the metarc.

To make it ridiculously easy, I provided a "Uprintf()" function which works
just like fprintf() or sprintf(), except that the first argument is the
number of the socket you want to print to (of course, it's actually printing
to a buffer).  You have complete access to all of the internal structures;
take a look at display_verbose() in disp_new.c for ideas.

Uprintf() will handle allocation of the user's buffer, and (depending on
what state MS-II is in if/when I release it) will automatically resize the
buffer if the data overflows it.
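
For the curious, a printf-into-a-growing-buffer routine can be sketched like
this.  It is not Uprintf() itself: buf_printf() and struct outbuf are made-up
names, it takes a buffer pointer rather than a socket number, and it assumes a
C library that provides vsnprintf().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical output buffer; start with all fields zero. */
struct outbuf {
    char  *data;
    size_t len;   /* bytes used, not counting the trailing NUL */
    size_t cap;   /* bytes allocated */
};

/* Append formatted text, growing the buffer as needed.
 * Returns the number of characters added, or -1 on failure. */
int buf_printf(struct outbuf *b, const char *fmt, ...)
{
    va_list ap;
    int need;

    assert(b != NULL && fmt != NULL);
    va_start(ap, fmt);
    need = vsnprintf(NULL, 0, fmt, ap);   /* measure only, write nothing */
    va_end(ap);
    if (need < 0)
        return -1;

    if (b->len + (size_t)need + 1 > b->cap) {
        size_t ncap = b->cap ? b->cap : 64;
        char *p;

        while (ncap < b->len + (size_t)need + 1)
            ncap *= 2;
        p = realloc(b->data, ncap);
        if (p == NULL)
            return -1;                    /* old buffer is left untouched */
        b->data = p;
        b->cap = ncap;
    }

    va_start(ap, fmt);
    vsnprintf(b->data + b->len, b->cap - b->len, fmt, ap);
    va_end(ap);
    b->len += (size_t)need;
    return need;
}
```

Formatting twice (once to measure, once to write) keeps the resize logic
simple at the cost of a little CPU time.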

If something doesn't seem to be working, check the log file for error
messages.  MS-II will trap most of the common errors.


HOW THE CONTROLLING LOOP WORKS

Take a look at main_loop().  It works like this:

- initialize file descriptor sets
- initialize all listen ports (the list of ports is derived from the list
  you specified in the metarc)
- set a timer to go off every TIME_RES seconds

- LOOP:
  - if the timer went off, go check the connections, killing ones that have
    been open too long and opening new server connections whose time has come
  - issue a select() call on all readable and writeable file descriptors
  - for every file descriptor:
    - if it's a user request, call new_user()
    - if it's a message from a server, call handle_server()
    - if it's a writeable server, connect it (it's subsequently removed)
    - if it's a writeable client, call handle_user()

When the timer goes off, the handler just sets a flag ("coffee", as in "wake
up and smell the").  It does *NOT* directly cause anything to happen.  This
way there's no need for synchronization stuff around structure updates.
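
The flag-only handler can be sketched as below.  Only the name "coffee" comes
from the source; start_timer() and take_coffee() are invented for this
example, and the use of sigaction() and setitimer() is an assumption about
how one might do it on a POSIX system.

```c
#define _XOPEN_SOURCE 700   /* POSIX declarations even in strict ISO C mode */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t coffee = 0;

/* The handler does nothing but set the flag. */
static void on_alarm(int sig)
{
    (void)sig;
    coffee = 1;
}

/* Arm a repeating timer.  SA_RESTART is deliberately not set, so a
 * pending select() returns with EINTR when the timer fires. */
int start_timer(int seconds)
{
    struct sigaction sa;
    struct itimerval it;

    assert(seconds > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    it.it_interval.tv_sec = seconds;
    it.it_interval.tv_usec = 0;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* Called from the main loop: returns 1 (and clears the flag) if the
 * timer has gone off since the last call. */
int take_coffee(void)
{
    if (!coffee)
        return 0;
    coffee = 0;
    return 1;
}
```

All of the real work then happens in the main loop, outside signal context.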

Since the timer causes the select() call to return, it will be handled even
if there isn't any user activity.  It calls check_status(), which works like
this:

- look for "stale" user connections, and drop them
- if we are querying a server, drop it if it's been connected for too long
- else, find another server to examine, and call prep_server()

prep_server() is explained later.  The global "busy" variable indicates
which server, if any, is currently being examined.  I chose to limit it to
one server at a time because (1) it's easier on the network, (2) it's easier
on the CPU, (3) it's easier on the programmer.

(This restriction does not apply to user connections; you can have as many
users connected as you have free file descriptors.  I don't think I've ever
seen more than two connected at a time though.)

MS-II picks a server based on the current time and the "next_update" field
for that server.  The next_update time is initially set in main.c (see the
comments in meta.h to see how to spread the times out), and is set to
now + xx minutes when the connection is closed, where "xx" is determined
according to the server's status (up, down, open, queue) and the values
listed in the metarc.  To prevent starvation, it always starts at last_busy+1.
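
A sketch of that selection, starting the scan just past the server handled
last time.  pick_server() and the one-field struct are made up for the
example; only next_update and last_busy are names from the text above.

```c
#include <assert.h>

/* Only the field needed for this example. */
struct server {
    long next_update;   /* earliest time this server may be checked */
};

/* Return the index of the first server that is due, scanning circularly
 * from last_busy+1, or -1 if none is due.  Pass last_busy = -1 when no
 * server has been examined yet. */
int pick_server(const struct server *s, int n, int last_busy, long now)
{
    int k;

    assert(n >= 0 && last_busy >= -1 && last_busy < (n > 0 ? n : 1));
    for (k = 1; k <= n; k++) {
        int i = (last_busy + k) % n;

        if (s[i].next_update <= now)
            return i;
    }
    return -1;
}
```

Because the scan never restarts at index 0, a server near the end of the list
cannot be starved by busy servers near the top.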


HOW THE SERVER CONNECTION WORKS

It is expected that servers will have various problems, including being
down, sending garbage data, and just plain being slow.  The server connection,
like the user connections, is handled with non-blocking I/O.

prep_server() prepares the server connection by initializing the server
structure, and then calling open_server(port-1).

open_server() handles all the nasty socket details.  It issues a non-blocking
connect() call, and sets some flags that are examined in server.c.  The
server's socket is added to the list of writeable fds for select() to examine.

Eventually, the connect() will fail or succeed, and select() will show that
the socket is writeable.  main_loop() will call server_connected(), which
just changes the state of the structure and clears the non-blocking status
from the file descriptor (socket writes are always non-blocking).  main_loop()
immediately calls handle_server().
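
The connect sequence can be sketched like this.  The function names are
invented (they are not open_server() or server_connected()), the code uses
present-day POSIX calls, and wait_writeable() merely stands in for the
select() in main_loop().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Begin a non-blocking connect to a numeric IP address.
 * Returns the socket, or -1 on immediate failure. */
int start_connect(const char *ip, int port)
{
    struct sockaddr_in sin;
    int fd, flags;

    assert(ip != NULL);
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    if (connect(fd, (struct sockaddr *)&sin, sizeof sin) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }
    return fd;   /* connected, or still in progress */
}

/* Stand-in for the main loop: 1 if fd became writeable in time. */
int wait_writeable(int fd, int seconds)
{
    fd_set w;
    struct timeval tv;

    FD_ZERO(&w);
    FD_SET(fd, &w);
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return select(fd + 1, NULL, &w, NULL, &tv) == 1;
}

/* Once fd is writeable: 0 means connected, otherwise an errno value. */
int connect_result(int fd)
{
    int err = 0;
    socklen_t len = sizeof err;

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}
```

A writeable socket only means the connect() has finished; SO_ERROR tells you
whether it finished well or badly.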

handle_server() calls read_minus(), which tries to read from the socket.  When
that fails or succeeds, the structure is updated appropriately, and the
connection is closed with close_server().  handle_server() then calls
open_server(port), and returns.

The next time through, handle_server() calls read_server(), which reads from
the socket, interpreting the data like ck_players does.  Note that the calls
to read_minus() and read_server() may return (-2) to indicate that they have
not finished reading all of the data.  If they do that, handle_server() returns
to main_loop() and waits until the socket becomes readable again.

After all the data is read, handle_server() calls close_server() to close the
connection and then nuke_server().  nuke_server() sets the last_update and
next_update fields, updates the flags for the "new" display, and then sets
"coffee" to TRUE to force main_loop() to immediately check for another server
to handle (otherwise it'd have to wait until the timer went off or some other
I/O activity occurred).

For details on how read_minus() and read_server() work, see server.c.



POSSIBLE ENHANCEMENTS

A simple idea is to record and maintain statistics on use for all of the
different servers.  This could be combined with other information to guide
players toward a specific server (perhaps a new one which hasn't seen much
use).

This could be taken a step further and MS-II could be used as a relay point
for clients.  They could request the "server du jour" and have it automatically
chosen by MS-II.

A different idea is to extend the server communication so that it actually
logs a player on.  This would allow MS-II to get the tournament mask, giving
it the ability to determine if a server is really open or up but closed.


MISCELLANEOUS

MetaServerII isn't a CPU hog, but it is active around the clock.  On a Sun 4
it used about 3 minutes of CPU time to handle 1600 user connections and 35
servers over a 24 hour period.  Profiling shows that most of its time is
spent in select() and read(); the only routine which accounts for more than 2
or 3% of the time is scan_packets().

It may be possible to increase efficiency by waiting for a fraction of a
second before issuing the read().  This might allow more data to arrive,
resulting in fewer system calls.  Maybe.

