Onboard Detection of Snow, Ice, Clouds, and Other Geophysical Processes Using Kernel Methods
A. N. Srivastava and J. Stroeve
The detection of clouds within a satellite image is essential for retrieving surface geophysical parameters from optical and thermal imagery. Even a small percentage of cloud cover within a radiometer pixel can adversely affect the determination of surface variables such as albedo and temperature. Thus, onboard processing of satellite data requires reliable automated cloud detection algorithms that are applicable to a wide range of surface types. Unfortunately, cloud detection, particularly over snow- and ice-covered surfaces, is a problem that plagues the field of remote sensing because of the lack of spectral contrast. This paper discusses preliminary results based on kernel methods for unsupervised discovery of snow, ice, clouds, and other geophysical processes based on data from the MODIS instrument and discusses implementation in computationally constrained environments such as those found on satellites.
Mixture Density Mercer Kernels: A Method to Learn Kernels Directly from Data
A. N. Srivastava
This paper presents a method of generating Mercer Kernels from an ensemble of probabilistic mixture models, where each mixture model is generated from a Bayesian mixture density estimate. We show how to convert the ensemble estimates into a Mercer Kernel, describe the properties of this new kernel function, and give examples of the performance of this kernel on unsupervised clustering of synthetic data and also in the domain of unsupervised multispectral image understanding.
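The construction described above can be sketched in a few lines. The example below is a minimal illustration, not the paper's implementation: it assumes the kernel is the average, over ensemble members, of the inner product between the component posterior vectors of two points, and it uses a small hand-specified ensemble of one-dimensional Gaussian mixtures (the hypothetical `ENSEMBLE`) instead of mixtures learned by Bayesian density estimation.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2.0 * pi))


def posteriors(x, model):
    """Posterior P(component | x) for a mixture given as (weight, mean, std) triples."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in model]
    total = sum(joint)
    return [j / total for j in joint]


def mixture_density_kernel(x, y, ensemble):
    """Average over mixtures of the inner product of posterior vectors.

    Assumption: normalizing by the ensemble size stands in for the
    normalization constant of the kernel.
    """
    total = 0.0
    for model in ensemble:
        px = posteriors(x, model)
        py = posteriors(y, model)
        total += sum(a * b for a, b in zip(px, py))
    return total / len(ensemble)


# Hypothetical ensemble: three mixtures with different numbers of components.
ENSEMBLE = [
    [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)],
    [(0.4, -0.5, 1.5), (0.6, 5.5, 1.2)],
    [(0.3, 0.0, 1.0), (0.3, 2.5, 1.0), (0.4, 5.0, 1.0)],
]

k_near = mixture_density_kernel(0.1, 0.2, ENSEMBLE)  # points in the same cluster
k_far = mixture_density_kernel(0.1, 5.0, ENSEMBLE)   # points in different clusters
```

Because each term is an inner product of feature vectors (the posteriors), the resulting function is symmetric and positive semidefinite, which is what makes it a valid Mercer kernel. Points that the ensemble consistently assigns to the same component receive a value near 1, and points it separates receive a value near 0.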
Data Mining for Counter Terrorism Workshop
SIAM 2003 Authors
The tragedy of September 11 had immeasurable and permanent effects on the United States and the rest of the world and brought issues of security and defense to the forefront. To help prevent such disasters, a successful security program would include provisions to secure borders, the transportation sector, and critical infrastructure. A critical enabler of such a program would be the ability to synthesize and analyze data from multiple sources. The purpose of this workshop is to discuss ways in which data mining and machine learning can be used to analyze data from numerous sources of high complexity for the purpose of preventing future terrorist activity. This is inherently a multidisciplinary activity, drawing from areas such as intelligence, international relations, and security methodology. From the data mining and machine learning world, this activity draws from scalable text mining, data fusion, data visualization, and data warehousing methods.
Discovering System Health Anomalies using Data Mining Techniques
A. N. Srivastava
We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
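To make the HMM-based idea concrete, the sketch below scores a discrete observation sequence by its log-likelihood under a fixed HMM using the scaled forward algorithm; sequences with unusually low likelihood can then be flagged for review. This is only an illustration under stated assumptions: the two-state model parameters (`START`, `TRANS`, `EMIT`) are invented for the example, the observations are discrete only, and the sequence is assumed to be non-empty.

```python
from math import log


def sequence_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: returns log P(obs | model) for a discrete HMM.

    obs   : non-empty list of symbol indices
    start : start[s] is the initial probability of state s
    trans : trans[r][s] is the probability of moving from state r to state s
    emit  : emit[s][k] is the probability that state s emits symbol k
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the previous (normalized) forward variables one step.
            alpha = [
                emit[s][obs[t]] * sum(alpha[r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_like += log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_like


# Hypothetical two-state, two-symbol model (values chosen for illustration only).
START = [0.9, 0.1]
TRANS = [[0.95, 0.05], [0.10, 0.90]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

typical = [0] * 10
atypical = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

ll_typical = sequence_log_likelihood(typical, START, TRANS, EMIT)
ll_atypical = sequence_log_likelihood(atypical, START, TRANS, EMIT)
```

Under this toy model the steady sequence scores higher than the rapidly alternating one. In practice the model would be estimated from nominal data, and a threshold on the per-symbol log-likelihood would be chosen from that data rather than fixed by hand.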
Data Mining for Features Using Scale-Sensitive Gated Experts
A. N. Srivastava, R. Su, and A. S. Weigend
This article introduces a new tool for exploratory data analysis and data mining called Scale-Sensitive Gated Experts (SSGE), which can partition a complex nonlinear regression surface into a set of simpler surfaces (which we call features). The set of simpler surfaces has the property that each element of the set can be efficiently modeled by a single feedforward neural network. The degree to which the regression surface is partitioned is controlled by an external scale parameter. The SSGE consists of a nonlinear gating network and several competing nonlinear experts. Although SSGE is similar to the mixture of experts model of Jacobs et al., the mixture of experts model gives only one partitioning of the input-output space, and thus a single set of features, whereas the SSGE gives the user the capability to discover families of features. One obtains a new member of the family of features for each setting of the scale parameter. In this paper, we derive the Scale-Sensitive Gated Experts and demonstrate its performance on a time series segmentation problem. The main results are: 1) the scale parameter controls the granularity of the features of the regression surface, 2) similar features are modeled by the same expert and different kinds of features are modeled by different experts, and 3) for the time series problem, the SSGE finds different regimes of behavior, each with a specific and interesting interpretation.
Virtual Sensors: Using Data Mining Techniques to Efficiently Estimate Remote Sensing Spectra
A. N. Srivastava, N. C. Oza, and J. Stroeve
Various instruments are used to create images of the earth and other objects in the universe in a diverse set of wavelength bands with the aim of understanding natural phenomena. Sometimes these instruments are built in a phased approach, with additional measurement capabilities added in later phases. In other cases, technology may mature to the point that the instrument offers new measurement capabilities that were not planned in the original design of the instrument. In still other cases, high-resolution spectral measurements may be too costly to perform on a large sample, and therefore, lower resolution spectral instruments are used to take the majority of measurements. Many applied science questions that are relevant to the earth science remote sensing community require analysis of enormous amounts of data that were generated by instruments with disparate measurement capabilities. This paper addresses this problem using virtual sensors: a method that uses models trained on spectrally rich (high spectral resolution) data to fill in unmeasured spectral channels in spectrally poor (low spectral resolution) data. The models we use in this paper are multilayer perceptrons, support vector machines (SVMs) with radial basis function kernels, and SVMs with mixture density Mercer kernels. We demonstrate this method by using models trained on the high spectral resolution Terra Moderate Resolution Imaging Spectrometer (MODIS) instrument to estimate what the equivalent of the MODIS 1.6-µm channel would be for the National Oceanic and Atmospheric Administration Advanced Very High Resolution Radiometer (AVHRR/2) instrument. The scientific motivation for the simulation of the 1.6-µm channel is to improve the ability of the AVHRR/2 sensor to detect clouds over snow and ice.
An Ensemble Approach to Building Mercer Kernels with Prior Information
A. N. Srivastava, J. Schumann, and B. Fischer
This paper presents a new methodology for automatic knowledge-driven data mining based on the theory of Mercer Kernels, which are highly nonlinear symmetric positive definite mappings from the original image space to a very high, possibly infinite dimensional feature space. We describe a new method called Mixture Density Mercer Kernels to learn kernel functions directly from data, rather than using pre-defined kernels. These data-adaptive kernels can encode prior knowledge in the kernel using a Bayesian formulation, thus allowing for physical information to be encoded in the model. Specifically, we demonstrate the use of the algorithm in situations with extremely small samples of data. We compare the results with existing algorithms on data from the Sloan Digital Sky Survey (SDSS) and demonstrate the method's superior performance against standard methods. The results show that the Mixture Density Mercer Kernel described here outperforms tree-based classification in distinguishing high-redshift galaxies from low-redshift galaxies by approximately 16% on test data, bagged trees by approximately 7%, and bagged trees built on a much larger sample of data by approximately 2%. The code for these experiments has been generated with the AUTOBAYES tool, which automatically generates efficient and documented C/C++ code from abstract statistical model specifications. The core of the system is a schema library which contains templates for learning and knowledge discovery algorithms like different versions of EM, or numeric optimization methods like conjugate gradient methods. The template instantiation is supported by symbolic-algebraic computations, which allow AUTOBAYES to find closed-form solutions and, where possible, to integrate them into the code.
Discovering Recurring Anomalies in Text Reports Regarding Complex Space Systems
A. N. Srivastava and B. Zane-Ulman
Many existing complex space systems have a significant number of historical maintenance and problem databases that are stored in unstructured text form. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We will illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free text reports are written by a number of different people, thus the emphasis and wording vary considerably. We test four automatic methods of anomaly detection in text that are popular in the current literature on text mining. The first method that we describe is k-means or Gaussian mixture model and its application to the term-document matrix. The second method is the Sammon nonlinear map, which projects high dimensional document vectors into two dimensions for visualization and clustering purposes. The third method is based on an analysis of the results of applying a new clustering method, Expectation Maximization on a mixture of von Mises Fisher distributions, that represents each document as a point on a high dimensional sphere. In this space, we perform clustering to obtain sets of similar documents. The results are derived from a new method known as spectral clustering, where vectors from the term-document matrix are embedded in a high dimensional space for clustering. The paper concludes with recommendations regarding the development of an operational text mining system for analysis of problem reports that arise from complex space systems. We also contrast such systems with general purpose text mining systems, illustrating the areas in which this system needs to be specified for the space domain.
Predicting Engine Parameters Using the Optical Spectrum of the Space Shuttle Main Engine Exhaust Plume
A. N. Srivastava and W. Buntine
The Optical Plume Anomaly Detection (OPAD) system is under development to predict engine anomalies and engine parameters of the Space Shuttle's Main Engine (SSME). The anomaly detection is based on abnormal metal concentrations in the optical spectrum of the rocket plume. Such abnormalities could be indicative of engine corrosion or other malfunctions. Here, we focus on the second task of the OPAD system, namely the prediction of engine parameters such as rated power level (RPL) and mixture ratio (MR). Because of the high dimensionality of the spectrum, we developed a linear algorithm to resolve the optical spectrum of the exhaust plume into a number of separate components, each with a different physical interpretation. These components are used to predict the metal concentrations and engine parameters for online support of ground-level testing of the SSME. Currently, these predictions are labor intensive and cannot be done online. We predict RPL using neural networks and give preliminary results.
Enabling the Discovery of Recurring Anomalies in Aerospace Problem Reports using High-Dimensional Clustering Techniques
A. N. Srivastava, R. Akella, et al.
This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining algorithms to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant import in the aviation industry. The first problem is that of automatic anomaly discovery concerning an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weakness or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems.
Discovering Atypical Flights in Sequences of Discrete Flight Parameters
S. Budalakoti, A. N. Srivastava, et al.
This paper describes the results of a novel research and development effort conducted at the NASA Ames Research Center for discovering anomalies in discrete parameter sequences recorded from flight data. Many of the discrete parameters that are recorded during the flight of a commercial airliner correspond to binary switches inside the cockpit. The inputs to our system are records from thousands of flights for a given class of aircraft and destination. The system delivers a list of potentially anomalous flights as well as reasons why the flight was tagged as anomalous. This output can be analyzed by safety experts to determine whether or not the anomalies are indicative of a problem that could be addressed with a human factors intervention. The final goal of the system is to help safety experts discover significant human factors issues such as pilot mode confusion, i.e., a flight in which a pilot has lost situational awareness as reflected in atypicality of the sequence of switches that he or she throws during descent compared to a population of similar flights. We view this work as an extension of Integrated System Health Management (ISHM) where the goal is to understand and evaluate the combined health of a class of aircraft at a given destination.
NOVEL METHODS FOR PREDICTING PHOTOMETRIC REDSHIFTS FROM BROADBAND
M. J. Way, A. N. Srivastava
We calculate photometric redshifts from the Sloan Digital Sky Survey Main Galaxy Sample, the Galaxy Evolution Explorer All Sky Survey, and the Two Micron All Sky Survey using two new training-set methods. We utilize the broadband photometry from the three surveys alongside Sloan Digital Sky Survey measures of photometric quality and galaxy morphology. Our first training-set method draws from the theory of ensemble learning while the second employs Gaussian process regression, both of which allow for the estimation of redshift along with a measure of uncertainty in the estimation. The Gaussian process models the data very effectively with small training samples of approximately 1000 points or fewer. These two methods are compared to a well-known artificial neural network training-set method and to simple linear and quadratic regression. We also demonstrate the need to provide confidence bands on the error estimation made by both classes of models. Our results indicate that variations due to the optimization procedure used for almost all neural networks, combined with the variations due to the data sample, can produce models with variations in accuracy that span an order of magnitude. A key contribution of this paper is to quantify the variability in the quality of results as a function of model and training sample. We show how simply choosing the "best" model given a data set and model class can produce misleading results.
Anomaly Detection in Large Sets of High-Dimensional Symbol
Suratna Budalakoti, Ashok N. Srivastava, Ph.D., Ram Akella, Ph.D., Eugene Turkov
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more 'normal' sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
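As a rough illustration of the similarity measure, the sketch below computes the LCS length with the textbook dynamic program and normalizes it by the geometric mean of the two sequence lengths. Both choices are assumptions made for the example: the normalization is one common convention and may differ from the paper's exact definition, and the quadratic-time dynamic program is the baseline method, not the Hunt-Szymanski algorithm or the hybrid algorithm the abstract describes.

```python
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Textbook O(len(a) * len(b)) dynamic program, keeping only two rows.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a, b):
    """LCS length divided by sqrt(len(a) * len(b)); assumed normalization.

    Returns a value in [0, 1]; 1 means the sequences are identical.
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / sqrt(len(a) * len(b))


# Sequences can be strings or lists of arbitrary symbols (for example, switch IDs).
similarity = normalized_lcs(["gear", "flaps", "autopilot"], ["gear", "autopilot"])
```

A similarity of this kind can be turned into a distance (for example, one minus the similarity) and passed to any clustering method that accepts pairwise distances, after which sequences far from every cluster become outlier candidates.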
Classification of Damage Signatures in Composite Plates using One-Class SVMs
S. Das, A.N. Srivastava, A. Chattopadhyay
Damage characterization through wave propagation and scattering is of considerable interest to many non-destructive evaluation techniques. For fiber-reinforced composites, complex waves can be generated during the tests due to the non-homogeneous and anisotropic nature of the material when compared to isotropic materials. The presence of damage introduces additional complexities, which make these defects difficult to characterize. The inability to detect damage in composite structures limits their use in practice. A major task of structural health monitoring is to identify and characterize the existing defects or defect evolution through the interactions between structural features and multidisciplinary physical phenomena. In a wave-based approach to addressing this problem, the presence of damage is characterized by the changes in the signature of the resultant wave that propagates through the structure. In order to measure and characterize the wave propagation, we use the response of the surface-mounted piezoelectric transducers as input to an advanced machine-learning based classifier known as a Support Vector Machine.
Characterizing Variability and Multi-Resolution Predictions of Virtual Sensors
A. N. Srivastava and R. Nemani
In previous papers, we introduced the idea of a Virtual Sensor, which is a mathematical model trained to learn the potentially nonlinear relationships between spectra for a given image scene for the purpose of predicting values of a subset of those spectra when only partial measurements have been taken. Such models can be created for a variety of disciplines including the Earth and Space Sciences as well as engineering domains. These nonlinear relationships are induced by the physical characteristics of the image scene. In building a Virtual Sensor a key question that arises is that of characterizing the stability of the model as the underlying scene changes. For example, the spectral relationships could change for a given physical location, due to seasonal weather conditions. This paper, based on a talk given at the American Geophysical Union (2005), discusses the stability of predictions through time and also demonstrates the use of a Virtual Sensor in making multi-resolution predictions. In this scenario, a model is trained to learn the nonlinear relationships between spectra at a low resolution in order to predict the spectra at a high resolution.
Intelligent Data Understanding Group Lead
Ames Research Center
Mail Stop 269-4
Moffett Field, CA 94035