NASA Logo, National Aeronautics and Space Administration

What Is AutoClass

AutoClass is an unsupervised Bayesian classification system that seeks a maximum posterior probability classification.

Key features:

  • determines the number of classes automatically;
  • can use mixed discrete and real valued data;
  • can handle missing values;
  • processing time is roughly linear in the amount of the data;
  • cases have probabilistic class membership;
  • allows correlation between attributes within a class;
  • generates reports describing the classes found; and
  • predicts "test" case class memberships from a "training" classification.

AutoClass uses only vector valued data, in which each instance to be classified is represented by a vector of values, each value characterizing some attribute of the instance. Values can be either real numbers, normally representing a measurement of the attribute, or discrete, drawn from a countable, attribute-dependent set of values, normally representing some aspect of the attribute.

AutoClass models the data as a mixture of conditionally independent classes. Each class is defined in terms of a probability distribution over the meta-space defined by the attributes. AutoClass uses Gaussian distributions over the real valued attributes, and Bernoulli distributions over the discrete attributes. Default class models are provided.

AutoClass finds the set of classes that is maximally probable with respect to the data and model. The output is a set of class descriptions, and partial membership of the instances in the classes.
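The idea of partial membership can be illustrated with a minimal sketch. It is not AutoClass code: the function names, the dictionary layout, and the example parameters are all hypothetical, and it only evaluates memberships for classes whose parameters are already known (it does not search for classes). Each class multiplies a Gaussian density for the real valued attribute by a per-value probability for the discrete attribute, treating attributes as independent within a class, and the results are normalized across classes.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def class_memberships(case, classes):
    """Return P(class | case) for each class (hypothetical data layout).

    Attributes are treated as independent within a class, so the
    likelihood of a case is a product of per-attribute terms.
    """
    joint = []
    for cls in classes:
        p = cls["weight"]  # prior probability of the class
        for attr, value in case.items():
            model = cls["attrs"][attr]
            if model["type"] == "real":
                p *= normal_pdf(value, model["mean"], model["sigma"])
            else:  # discrete attribute: look up the value's probability
                p *= model["probs"][value]
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


# Two illustrative classes; every number here is made up for the example.
classes = [
    {"weight": 0.5, "attrs": {
        "length": {"type": "real", "mean": 1.0, "sigma": 0.5},
        "color": {"type": "discrete", "probs": {"red": 0.8, "blue": 0.2}}}},
    {"weight": 0.5, "attrs": {
        "length": {"type": "real", "mean": 3.0, "sigma": 0.5},
        "color": {"type": "discrete", "probs": {"red": 0.3, "blue": 0.7}}}},
]

memberships = class_memberships({"length": 1.2, "color": "red"}, classes)
print(memberships)  # one probability per class, summing to 1
```

A case that lies between the two classes receives a split membership rather than a hard assignment, which is the sense in which cases have probabilistic class membership.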

For more details, see Bayesian Classification (AutoClass): Theory and Results (1995, postscript, 220k), Bayesian Classification Theory (1991, postscript, 200k), and the list of references.

What Is AutoClass C

AutoClass C was programmed by Dr. Diane Cook (cook@centauri.uta.edu) and Joseph Potts (potts@cse.uta.edu) of the University of Texas at Arlington. William Taylor (william.m.taylor@nasa.gov) "productized" the software through extensive testing, addition of sample data bases, and re-working the user documentation.


Significant new features of the C implementation are:

  • it is about 10-20 times faster than the Lisp implementations: AutoClass III & AutoClass X;
  • it uses double precision floating point for its "inner loop" weight calculations, producing a higher "signal-to-noise" ratio than the Lisp versions, and thus more precise convergences for very large data sets (adding double precision to the Lisp versions would slow them down even more).
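The effect of precision on long accumulations can be shown with a small sketch. It is an illustration only, not AutoClass's inner loop: the helper names are hypothetical, and single precision is simulated by rounding each intermediate result to a 32-bit float with the standard `struct` module.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value, count, single_precision):
    """Add `value` to a running total `count` times.

    With single_precision=True the addend and every partial sum are
    rounded to 32 bits, mimicking a float accumulator.
    """
    if single_precision:
        value = to_float32(value)
    total = 0.0
    for _ in range(count):
        total += value
        if single_precision:
            total = to_float32(total)
    return total


N = 100_000
exact = 10000.0  # 100,000 * 0.1
err32 = abs(accumulate(0.1, N, True) - exact)
err64 = abs(accumulate(0.1, N, False) - exact)
print("single precision error:", err32)
print("double precision error:", err64)
```

The single precision error grows with the number of terms summed, which is why the difference matters most for very large data sets.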

It provides four models:

  • single_multinomial - discrete attribute multinomial model, including missing values.
  • single_normal - real valued attribute model with no missing values; sub-types: location and scalar.
  • single_normal_missing - real valued attribute model with missing values; sub-types: location and scalar.
  • multi_normal - real valued covariant normal model with no missing values.

Additional models were done in Lisp for AutoClass X, and may be implemented in C at some later time. These models are:

  • single_multinomial_ignore - discrete attribute multinomial model, ignoring missing values.
  • single_poisson - models low value count (integer) attributes as Poisson distributions.
  • multi_multinomial_dense - a dense covariant multinomial model.
  • multi_multinomial_sparse - a sparse covariant multinomial model.

The C implementation also does not provide single_multinomial model value translations or canonical model group/attribute ordering.

Update History

Version: 1.0 --- 15 Apr 95 --- initial version of AutoClass C

Version: 1.5 --- 08 May 95 --- ported to Sun Solaris 2.4; corrected string overwrite problems; compilation of file search-control.c is now optimized; & added binary data file input option.

Version: 2.0 --- 08 Jun 95 --- ported to SGI IRIX version 5.2; converted binary i/o from non-standard (open/close/read/write) to ANSI (fopen/fclose/fread/fwrite); converted from srand/rand to srand48/lrand48 for random number generation; added prediction capability which uses a "training" classification to predict probabilistic class membership for the cases of a "test" data file; added new ".s-params" parameter "screen_output_p"; added output of real and discrete attribute statistics when data base is initially read; corrected garbage output when ".r-params" parameter "xref_class_report_att_list" contains mixed real and discrete attributes; corrected the handling of unknown real values in reports output; and corrected an error in function "output_warning_msgs" which caused an abort condition.

Version: 2.5 --- 28 Jul 95 --- Influence values report has been significantly revised and reformatted; add SunOS/Solaris C compiler support; correct segmentation fault which occurs when more than 25 type = real, subtype = scalar attributes are defined; correct "LOG domain" errors in generation of influence values for model "single_multinomial"; and added mods for port to Linux operating system using gcc compiler.

Version: 2.6 --- 02 Aug 95 --- Correct segmentation fault which occurs when more than 50 type = real, subtype = scalar attributes are defined; add function safe_log to prevent "log: SING error" error messages; and require user to confirm search runs using test settings for .s-params file parameters: start_fn_type and randomize_random_p.

Version: 2.7 --- 16 Aug 95 --- Add search parameter to allow AutoClass to be run as a background task.

Version: 2.8 --- 03 Sep 96 --- Add search parameter "read_compact_p", which directs AutoClass to read the "results" and "checkpoint" files in either binary format or ascii format; redefine make files with -I and -L parameters for SunOS 4.1.3; change make file naming conventions; prevent corruption of discrete data translation tables when translations are longer than 40 characters; increase from 3000 to 20000 the value of VERY_LONG_STRING_LENGTH to handle very large datum lines; increase DATA_ALLOC_INCREMENT from 100 to 1000 for reading very large datasets; add DATA_ALLOC_INCREMENT logic of READ_DATA to XREF_GET_DATA -- this will prevent segmentation faults encountered when reading very large .db2 files into the reports processing function of AutoClass; in FORMAT_DISCRETE_ATTRIBUTE, do not process attributes with warning or error messages -- this prevents segmentation faults; in XREF_GET_DATA, free database allocated memory after it is transferred into report data structures -- this reduces the amount of memory required when generating reports for very large data bases, and prevents running out of memory; in all functions calling malloc/realloc for dynamic memory allocation, checks have been added to notify the user if memory is exhausted; and port the "make" file for HP-UX operating system using the bundled "cc" compiler.

Version: 2.9 --- 17 Oct 96 --- Correct bugs which occur when generating reports of discrete type data -- these were introduced in version 2.8. Added new parameter for both ".s-params" & ".r-params" files: break_on_warnings_p.

Version: 3.0 --- 15 Apr 97 --- New parameter for .r-params files: report_mode -- "text" (current report output) or "data" (parsable format for further processing); correct minor bugs; improve input checking for .hd2 file; correct segmentation fault which occurred in prediction runs when the size of the "test" database was larger than that of the "training" database; and new parameter for .s-params & .r-params files: free_storage_p.

Version: 3.1 --- 04 Jul 97 --- New parameters for .r-params files: comment_data_headers_p, max_num_xref_class_probs, start_sigma_contours_att, & stop_sigma_contours_att. Allow checkpoint files to be loaded for reconvergence. Allow reports to be generated for data sets of 100,000 cases and more, without causing a segmentation fault. For "-predict" runs, handle "test" cases which are not predicted to be in any of the "training" classes. When there is more than one covariant normal correlation matrix, print all of them. In the case cross-reference report (report_type = "xref_case") generated with the data option (report_mode = "data"), other class probabilities are now printed. In the case and class cross-reference reports, the print out of probabilities has increased by one significant digit (0.04 => 0.041), and the minimum value printed is now 0.001, rather than 0.01. Add capability to compute sigma class contour values for specified pairs of real valued attributes.

Version: 3.2 --- 13 Apr 98 --- Changed the behavior of search parameter force_new_search_p; amplified some documentation sections; corrected several segmentation faults in reports generation; corrected several errors in sigma contours output; correct problem with cross-reference reports class assignment when there are more than five marginal probabilities; change layout of influence values report to print matrices after all class attributes are listed; warn user when default start_j_list may not find the correct number of classes in data set; warn user of search trials which do not converge and print convergence summary at the end of each run; the multi-normal model was corrected to prevent oscillation in the expectation maximization calculations; and allow non-contiguous groups of attributes to be specified for sigma contours calculations.

Version: 3.2.1 --- 04 Jun 98 --- Minor documentation changes.

Version: 3.2.2 --- 02 Jul 98 --- Minor documentation changes.

Version: 3.3 --- 23 Sep 98 --- Integrated source port of version 3.2.2 to Windows NT/95/98. Update sample AutoClass C run files contained in autoclass-c/sample.

Version: 3.3.1 --- 30 Nov 98 --- Correct incompatibility with .results[-bin] files written by AutoClass C versions prior to version 3.3.

Version: 3.3.2 --- 13 Sep 99 --- In all situations warning and error messages are now written to the log file.

Version: 3.3.3 --- 01 May 00 --- Add Dec Alpha support; correct Dec Alpha crashes when attempting to free memory at the end of search runs; conditionalize two warning tests to fail in batch mode; and separate log files are now written for "-search" (.log) and "-reports" (.rlog).

Version: 3.3.4 --- 24 Jan 02 --- Correct bugs in -predict and -report modes; correct "safe_log" function for range near 0; and minor code cleanup. Update sample AutoClass C run files contained in autoclass-c/sample.

Version: 3.3.5 --- 24 Jul 08 --- Add FreeBSD and MacOSX support; correct minor bugs.

Version: 3.3.6 --- 01 Sep 09 --- Improvements to reports for 'report_mode = "data"' and 'comment_data_headers_p = true'.

Compatibility and Requirements

AutoClass C was written in ANSI C using the GNU gcc compiler version 2.6.3 running on a SunSparc under SunOS 4.1.3.

It has also been ported to and tested on:

  • SunSparc under Solaris 2.6 using GCC version 2.95.2;
  • SunSparc under Solaris 2.4 using SPARCompiler C version 3.00;
  • SunSparc under SunOS 4.1.3 using SPARCompiler C version 3.00;
  • SGI Indigo under IRIX 5.2 using the bundled cc compiler;
  • RedHat Linux version 6.1 using GCC version 2.95.2;
  • Dec Alpha under OSF 4.0 using the cc compiler;
  • HP9000/735 and HP9000/C110 under HPUX 10.10 using the bundled cc compiler;
  • Windows NT/95/98 using the Microsoft Visual C++ 6.0 compiler; and
  • Mac OSX 10.4 using gcc 4.0.

Considerations for porting to other platforms, operating systems, and compilers:

  • int and float types must be at least 32-bit words.
  • floating point arithmetic must be IEEE standard.
  • values.h constant #defines are not consistent with the IEEE standard, so Symbolics Genera 8.3 values are used in autoclass.h.
  • globals.c, io-results.c, and search-control-2.c: G_safe_file_writing_p = TRUE; only supported under Unix, since it does system calls to mv (rename file) and rm (delete file).
  • utils.c: char_input_test -- which implements the typing of 'q' and <return> to quit the search -- uses Unix system call fcntl, and file fcntlcom-ac.h; get_universal_time -- uses Unix system call time.
  • init.c: init -- uses Unix system call getcwd (get current working directory); sets "normalizer" value for random number generator library function "srand48".
  • search-control.c, search-basic.c, search-control-2.c, & utils.c: Use C library functions srand48/lrand48 for random number generation.
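When porting to a platform whose C library lacks srand48/lrand48, it helps to know what these functions compute. The sketch below follows the POSIX description of the drand48 family (a 48-bit linear congruential generator with multiplier 0x5DEECE66D and addend 0xB); the class name `Rand48` is hypothetical, and the code is a reference for the arithmetic rather than part of AutoClass.

```python
_MULTIPLIER = 0x5DEECE66D   # "a" in the POSIX drand48 family
_ADDEND = 0xB               # "c" in the POSIX drand48 family
_MASK_48 = (1 << 48) - 1    # state is kept modulo 2**48


class Rand48:
    """Minimal model of srand48/lrand48 (hypothetical helper class)."""

    def __init__(self, seed=0):
        self.srand48(seed)

    def srand48(self, seed):
        # The high 32 bits of the state come from the seed; the low
        # 16 bits are set to the fixed value 0x330E.
        self.state = ((seed & 0xFFFFFFFF) << 16) | 0x330E

    def lrand48(self):
        # Advance the generator, then return the high-order 31 bits,
        # giving a non-negative integer in [0, 2**31).
        self.state = (_MULTIPLIER * self.state + _ADDEND) & _MASK_48
        return self.state >> 17


rng = Rand48(seed=0)
print([rng.lrand48() for _ in range(3)])
```

Because the sequence depends only on the seed, a port that reproduces this arithmetic should start searches from the same random draws as the Unix builds, assuming the rest of the program consumes the values identically.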

Limitations

AutoClass C is limited by memory requirements that are roughly in proportion to the number of cases times the number of attributes (the data space), plus the number of classes times the number of modeled attributes (the model space), plus a fixed program space. Thus there should be no limit on the number of attributes beyond the program addressable memory, but there are definite tradeoffs with respect to the model space, and performance degradations as paging requirements increase.
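The proportionality described above can be written as a rough estimator. This is a sketch only: the function name and all three constants (bytes per stored value, bytes per class/attribute model term, and the fixed program space) are placeholder assumptions, not measured AutoClass figures, and should be replaced with values observed on your own system.

```python
def estimate_memory_bytes(n_cases, n_attributes, n_classes,
                          n_modeled_attributes,
                          bytes_per_datum=4,        # assumed, not measured
                          bytes_per_model_term=64,  # assumed, not measured
                          fixed_bytes=2_000_000):   # assumed, not measured
    """Rough memory estimate: data space + model space + fixed space."""
    data_space = n_cases * n_attributes * bytes_per_datum
    model_space = n_classes * n_modeled_attributes * bytes_per_model_term
    return data_space + model_space + fixed_bytes


# Example: 100,000 cases with 20 attributes and up to 50 classes.
print(estimate_memory_bytes(100_000, 20, 50, 20))
```

For large data sets the data space term dominates, so the estimate grows almost linearly with the number of cases.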

For very large data sets, you may well find that even if you can handle the data, the processing time is excessive. If that is the case, it may be worthwhile to try class generation on random subsets of the data set. This should pick out the major classes, although it will miss small ones that are only vaguely represented in the random subsets. You can then switch to prediction mode to classify the entire data set.
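One way to draw such a random subset is reservoir sampling, which reads the data once and never holds more than the requested number of cases in memory. The sketch below is a hypothetical helper, not a tool shipped with AutoClass; it assumes one case per line and does not treat comment or header lines specially.

```python
import random


def sample_cases(lines, k, seed=None):
    """Return k lines chosen uniformly at random from an iterable.

    Uses reservoir sampling, so `lines` can be an open file that is
    too large to load at once. If fewer than k lines exist, all of
    them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Replace a kept line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


# In-memory demonstration; with a real data file you would pass the
# open file object instead of this list.
all_cases = ["case %d" % i for i in range(1000)]
subset = sample_cases(all_cases, 50, seed=1)
print(len(subset), subset[:3])
```

The subset can then be written to its own data file for the search run, while the full file is kept for the later prediction run.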

Technical Questions

Contact John Stutz if you have questions concerning the applicability of AutoClass to your data analysis situation.

Implementation Questions

Contact William Taylor if you have questions concerning the implementation, installation, and running of AutoClass C, including "bugs" and features you may add to the existing code.

Obtaining AutoClass C

AutoClass C is available free here as a "gzipped" tar file. Note that the anonymous ftp site csr.uta.edu will no longer provide the latest version of AutoClass C.

The AutoClass C files include source code, user documentation, two theory papers (in postscript), a sample run, and five test data bases. The uncompressed files are about 5.7 megabytes. When built, this becomes about 7.5 megabytes.

Click on one of the following to download the AutoClass C files to your host --

Click on the following to download the new (06May02) version 3.3.4 Windows stand-alone executable to your host --
(Corrects the "required .dll file, MSVCRTD.dll, was not found" problem.)

Execute Autoclass.exe in an "MS-DOS" window (Win98) or in a "Command Prompt" window (Win2000), not in a "Run Command" window.

Then, read autoclass-c/read-me.text or autoclass-c-win\read-me.text, and you are off to the races!

Information on SNOB

Information on a related, but independently developed, classification program -- SNOB -- written in FORTRAN, is available here.

Last updated September 01, 2009
