FILE: CCF/CCFcctl/README.ft

CCF
Collaborative Computing Frameworks
Emory University
Atlanta, GA, USA
June 1999

CCF is a software system that supports collaborative, distributed,
computer-based problem solving in the natural sciences, business,
government, and educational environments. The goal is to evolve a
virtual environment for distributed computation that supports
integrated human AV communication, high performance heterogeneous
computing and distributed data management facilities.

CCF is a research project at Emory University involving the
Math/Computer Science and Chemistry departments. Recently, the
Computer Science Department of The University of Reading (UK) has
joined the project.

DISCLAIMER:

This is alpha release 2.00 of CCF -- Collaborative Computing
Frameworks. This software is provided as is, with no warranty expressed
or implied. We hope you find it useful, but we won't be held
responsible for any damage that may occur from reading, compiling,
installing, using, or even thinking about it.

LICENSE:

CCF is Copyright (C) 1996 by Emory University, except for the code in
the GSM, LPC and LPC10 directories under the CCFaudio directory, and is
distributed under the terms of the GNU General Public License (GPL) and
the GNU Library General Public License (LGPL). The files COPYING and
COPYING.LIB in each directory state the exact licensing restrictions.

This package is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the license, or (at
your option) any later version.

CREDITS:

CCF was created by Vaidy Sunderam, Injong Rhee, Alan Krantz, Shun Yan
Cheung, Julie Sult, Soeren Olesen, Paul Gray, Phil Hutto, Sarah
Chodrow, Michael Hirsch, Ted Goddard, Mic Grigni, N. Balaguru, Jim
Nettles, Luigi Marzilli, Sue Onuschak, Scott Childs and Kevin Williams,
and by Roger Loader and James Pascoe of the University of Reading (UK).

The CCF project is sponsored by the U.S.
National Science Foundation under the multidisciplinary challenges
initiative.

CCF currently supports the following platforms: Mandrake Linux 6.5,
Red Hat 6.0 and 6.2, SuSE Linux 6.4, TurboLinux 6.0, SunOS-5.7,
SunOS-5.6, SunOS-5.5.1 and IRIX 6.2. The SunOS-5.7 version is the most
thoroughly tested.

OVERVIEW:

The new release of CCF contains two major additions:

  - A substantial degree of fault tolerance
  - A monitor tool to visualise the fault tolerance process

The purpose of this file is to describe the mechanisms employed so
that independent development can be pursued.

FAULT TOLERANCE

Previous distributions of CCF have not been resilient to failures of
individual session members. This release incorporates a degree of
fault tolerance so that CCF can cope with the following situations:

  - A session can continue after a failure event (defined as a crash
    or loss of communication by one or more members) provided that a
    majority of members survive and the survivors include the session
    owner/error master and the WP (white pages) and SNS (session name
    server) servers.
  - Surviving members continue with no degradation of performance.
  - Failed members can rejoin as new members.
  - Session members can leave in any order (unconfirmed).

The fault tolerance mechanism operates in the following way. Each
participant in a session has an error monitor thread and an error
handler thread attached to it. Each error monitor thread logs errors
detected on the reliable/atomic channels associated with the session.
After 3 equivalent failure events, an election request is sent to the
error handler thread. The error handler threads of all participants
then hold an election to reach a consensus on removing the failed
members.

ELECTION HANDLING

The election is triggered when the error monitor has detected three
consecutive failure reports for a given message (or has accrued a log
of 10 failures). The error handler then invites all session members to
vote on the proposed removal of the failed members.
The votes are collated and, if a majority ruling is present, the failed
members are removed from the session. Note that for an election to be
successful, the failed members must be a minority, and the session
owner, white pages and session name server must be contactable.
Multiple failed hosts are dealt with in a single election, as are hosts
that fail during an election.

VISUALISATION

By default, fault tolerance debug message printing is compiled in.
However, it can be suppressed by toggling the JSP_DEBUG flag in
CCFcctl/src/Makefile.

Another addition to this distribution is a new monitor tool. It is
designed to visualise election data in a more intuitive fashion; the
eventual intention is to remove debug printing from the console
window. To start the monitor window, simply select it from the tools
menu. A watch glass icon will appear in the toolbar, signifying that
the window is ready. This icon can be toggled to show/hide the window.

MONITOR WINDOW

The pulldown menus in the monitor window perform three functions:

  - The red `exit' button is a `last resort' in the event of a system
    failure and will exit the entire session. If you only wish to
    close the window, click the watch glass icon on the main ccsm
    toolbar instead.
  - The `view' menu is an alternative way of selecting from the three
    main tabs at the top of the window. It also shows which tab is
    currently in focus.
  - The `participants' menu lists the session participants. Selecting
    a participant from this menu dumps further information about that
    participant to the console window.

Note also that the about button displays a pop-up window giving credit
and contact information.

The three tabs are designed to provide different views for each of the
three main development areas. Note that only the `Fault tolerance and
General' tab is currently active.
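As an aside, the election rules given under ELECTION HANDLING can be
summarised in a short sketch. This is illustrative Python, not CCF
code: the names ErrorMonitor, report_failure and election_succeeds are
hypothetical, and it assumes that a "majority" means more than half of
all session members voting for removal. Only the thresholds (3 and 10)
and the success conditions are taken from the text.

```python
from dataclasses import dataclass, field

# Thresholds quoted in the ELECTION HANDLING section.
CONSECUTIVE_LIMIT = 3
LOG_LIMIT = 10


@dataclass
class ErrorMonitor:
    """Hypothetical stand-in for one participant's error monitor thread."""
    log: list = field(default_factory=list)
    last_message: object = None
    run_length: int = 0

    def report_failure(self, message_id):
        """Log one failure report; return True if an election is due."""
        self.log.append(message_id)
        if message_id == self.last_message:
            self.run_length += 1
        else:
            self.last_message = message_id
            self.run_length = 1
        return (self.run_length >= CONSECUTIVE_LIMIT
                or len(self.log) >= LOG_LIMIT)


def election_succeeds(members, suspected, essential, votes):
    """Decide whether the suspected members may be removed.

    members   -- set of all current session members
    suspected -- set of members proposed for removal
    essential -- set that must stay contactable (owner, WP, SNS)
    votes     -- dict mapping each responding member to True or False
    """
    survivors = members - suspected
    # The failed members must be a strict minority of the session.
    if 2 * len(suspected) >= len(members):
        return False
    # The session owner, white pages and session name server must survive.
    if not essential <= survivors:
        return False
    # Assumption: a majority is more than half of all members voting yes.
    yes_votes = sum(1 for name, vote in votes.items()
                    if vote and name in survivors)
    return 2 * yes_votes > len(members)


if __name__ == "__main__":
    monitor = ErrorMonitor()
    for _ in range(3):
        due = monitor.report_failure("msg-1")
    members = {"owner", "wp", "sns", "a", "b"}
    votes = {name: True for name in members - {"b"}}
    print(due, election_succeeds(members, {"b"},
                                 {"owner", "wp", "sns"}, votes))
```

Treating WP and SNS as named entries in the member set is a
simplification made for the sketch; in CCF they are servers that must
be contactable.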
The main fault tolerance panel contains information concerning the
session health. Most of this information is self-explanatory, but
several of the sub-windows require explanation.

  - The `Update' slider can be used to modify the interval between
    information updates. The flashing green symbol indicates a
    successful diagnostics poll and also gives a visual indication of
    the update frequency.

  - The `Thread Status' box gives information on the status of the
    major fault tolerance threads. It is normal for all of the threads
    (besides the `election timer') to be `BUSY'.

  - The `CCTL status' sub-window shows data relevant to the session's
    health. It gives the states that the error monitor and error
    handler threads are currently in. Each state is one of the
    following:

      - E(M|H)_SESS_OK    - session is healthy.
      - E(M|H)_SESS_FAIL  - some failure messages have been detected.
      - E(M|H)_SESS_ELECT - an election is in progress.

    The error log and ER_IND sizes are also given. Note that an ER_IND
    message triggers the election process.

  - The `Timer Status' window offers both timer related and non-timer
    related data. Note that the timer is used during an election to
    provide a timeout. The `Iteration' field simply shows the current
    diagnostic iteration. The mask fields also supply election data.
    For more information on masks, see the contact information below.

  - The `Thread selection' slider is used to view data output by
    different CCF processes on the same machine. As tools are started,
    you will see a message similar to the following in the console
    window:

      JSP: diagnostics thread (1) waking up; diagFd is: 19 log filename is: /tmp/ccfdiag.1

    The number in brackets is the diagnostics thread id for the
    process concerned. By selecting this id with the `Thread
    selection' slider, diagnostic information specific to each tool
    can be viewed. This is useful for verifying consistency amongst
    session members.
  - The large `Election Matrices' panel at the bottom of the screen is
    used to visualise each session member's votes during the election
    process.

The buttons on the right of the window perform the following
functions:

  - Process - currently unused.
  - Append  - appends a concise textual copy of the information shown
              to a file in the user's home directory. By default, this
              file is called `~/ccfdiag'. Subsequent button presses do
              not overwrite this file (i.e. information is appended).
  - Pause   - pauses the display until `Unpause' is clicked.
  - Exit    - provides the same function as the red `exit' button in
              the main pulldown menu.

CONTACT / CREDIT

The fault tolerance protocols and monitor tool were created and
developed by:

  - Roger Loader (Roger.Loader@reading.ac.uk)
  - James Pascoe (J.S.Pascoe@reading.ac.uk)

who are both members of the Department of Computer Science at the
University of Reading in the UK. We would like to acknowledge that
this work could not have been done without the support of Emory
University (Vaidy Sunderam, vss@mathcs.emory.edu) and the sponsorship
of the U.S. National Science Foundation under the multidisciplinary
challenges initiative.

As always, please send bug reports, comments or suggestions to the
following address:

  J.S.Pascoe@reading.ac.uk
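
EXAMPLE: READING THE DIAGNOSTICS THREAD ID

When several tools run on one machine, it can help to pick the
diagnostics thread id out of the console output before moving the
`Thread selection' slider described under MONITOR WINDOW. The
following is a minimal sketch in Python, not part of CCF. The function
name parse_diag_message is hypothetical, and the pattern is inferred
only from the single sample message quoted in this file, so real
output may differ.

```python
import re

# Assumed pattern, derived only from the sample console message
# quoted in the MONITOR WINDOW section.
DIAG_PATTERN = re.compile(
    r"diagnostics thread \((\d+)\) waking up;\s*"
    r"diagFd is:\s*(\d+)\s+"
    r"log filename is:\s*(\S+)"
)


def parse_diag_message(text):
    """Return (thread_id, diag_fd, log_filename), or None if no match."""
    match = DIAG_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


if __name__ == "__main__":
    sample = ("JSP: diagnostics thread (1) waking up; diagFd is: 19 "
              "log filename is: /tmp/ccfdiag.1")
    print(parse_diag_message(sample))
```

The whitespace in the pattern is matched loosely so that the message
is still recognised if the console wraps it onto a second line.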