Predictive health monitoring and fuel consumption analysis are playing an increasingly important role in commercial fleets. This technology is emerging as a critical capability in the machine-to-machine (M2M) communication and asset tracking industry. MineFleet offers a real-time decision support solution for monitoring the vehicle health and performance of a large number of vehicles in commercial fleets. This talk will present an overview of MineFleet and share my experience in building the system. It will discuss the large distributed architecture of MineFleet, comprising thousands of onboard embedded devices, the server, and the web service. It will also present some of the systems and algorithmic challenges in MineFleet and their solutions, discuss some of the data stream mining algorithms that run on board the vehicle in a highly resource-constrained environment, and describe how the results are integrated across multiple vehicles.

Speaker Bio
Hillol Kargupta is an Associate Professor in the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County. He received the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 1996. He is also a co-founder of Agnik LLC, a data analytics company for distributed, mobile, and embedded environments. His research interests include mobile and distributed data mining. Dr. Kargupta won a US National Science Foundation CAREER award in 2001 for his research on ubiquitous and distributed data mining. He and his coauthors received the best paper award at the 2003 IEEE International Conference on Data Mining for a paper on privacy-preserving data mining. His papers were also selected for Best of 2008 SIAM Data Mining Conference (SDM'08) and Most Interesting Paper of WebKDD'06. He won the 2000 TRW Foundation Award, the 1997 Los Alamos Award for Outstanding Technical Achievement, and the 1996 SIAM annual best student paper award. His research has been funded by the US National Science Foundation, US Air Force, Department of Homeland Security, NASA, Office of Naval Research, and various other organizations. He has published more than ninety peer-reviewed articles in journals, conferences, and books, and has co-edited several books. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering; the IEEE Transactions on Systems, Man, and Cybernetics, Part B; and the Statistical Analysis and Data Mining Journal. He is/was the Program Co-Chair of the 2009 IEEE International Data Mining Conference, General Chair of the 2007 NSF Next Generation Data Mining Symposium, Program Co-Chair of the 2005 SIAM Data Mining Conference, Program Vice-Chair of the 2005 PKDD Conference, Program Vice-Chair of the 2008 and 2005 IEEE International Data Mining Conferences, Program Vice-Chair of the 2008 and 2005 Euro-Par Conferences, and Associate General Chair of the 2003 ACM SIGKDD Conference, among others. More information about him can be found at http://www.cs.umbc.edu/~hillol.
It is becoming increasingly clear that the next generation of web search and advertising will rely on a deeper understanding of user intent and task modeling, and a correspondingly richer interpretation of content on the web. How we get there, in particular, how we understand web content in richer terms than bags of words and links, is a wide open and fascinating question. I will discuss some of the options here, and look closely at the role that information extraction can play.
Raghu Ramakrishnan is Chief Scientist for Audience and Cloud Computing at Yahoo!, and is a Research Fellow, heading the Community Systems area in Yahoo! Research. He was Professor of Computer Sciences at the University of Wisconsin-Madison, and was founder and CTO of QUIQ, a company that pioneered question-answering communities, powering Ask Jeeves' AnswerPoint as well as customer-support for companies such as Compaq. His research has influenced query optimization in commercial database systems, and the design of window functions in SQL:1999. His paper on the Birch clustering algorithm received the SIGMOD 10-Year Test-of-Time award, and he has written the widely-used text "Database Management Systems" (with Johannes Gehrke).
He is Chair of ACM SIGMOD, on the Board of Directors of ACM SIGKDD and the Board of Trustees of the VLDB Endowment, and has served as editor-in-chief of the Journal of Data Mining and Knowledge Discovery, associate editor of ACM Transactions on Database Systems, and the Database area editor of the Journal of Logic Programming. Ramakrishnan is a Fellow of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE), and has received several awards, including a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship in Science and Engineering, an NSF Presidential Young Investigator Award, and an ACM SIGMOD Contributions Award.
In this talk, we tackle a fundamental problem that arises when using sensors to monitor the ecological condition of rivers and lakes, the network of pipes that bring water to our taps, or the activities of an elderly individual when sitting on a chair: Where should we place the sensors in order to make effective and robust predictions?
Such sensing problems are typically NP-hard, and in the past, heuristics without theoretical guarantees about solution quality have often been used. In this talk, I will present algorithms which efficiently find provably near-optimal solutions to large, complex sensing problems. Our algorithms are based on the key insight that many important sensing problems exhibit submodularity, an intuitive diminishing-returns property: adding a sensor helps more if we have placed few sensors so far, and less if we have already placed many. In addition to identifying the most informative locations for placing sensors, our algorithms can handle settings where sensor nodes need to communicate reliably over lossy links, where mobile robots are used to collect data, or where solutions need to be robust against adversaries and sensor failures.
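The diminishing-returns property lends itself to a simple greedy strategy: for monotone submodular objectives, repeatedly adding the sensor with the largest marginal gain is provably within (1 - 1/e) of optimal. Below is a minimal sketch; the set-cover-style objective, the region sets, and the location names are illustrative stand-ins, not the sensing objectives from the talk.

```python
# A minimal sketch of greedy submodular sensor placement. The coverage
# objective below is a toy stand-in for a real sensing-quality function.

def greedy_placement(candidates, coverage, k):
    """Pick k sensor locations by greedily maximizing marginal coverage gain.

    For monotone submodular objectives, this greedy solution is provably
    within (1 - 1/e) of optimal.
    """
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1.0
        base = coverage(chosen)
        for c in candidates:
            if c in chosen:
                continue
            gain = coverage(chosen + [c]) - base  # marginal (diminishing) gain
            if gain > best_gain:
                best, best_gain = c, gain
        chosen.append(best)
    return chosen

# Toy submodular objective: number of distinct regions covered (set cover).
regions_covered = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {5},
    "D": {1, 2, 3, 4},
}

def coverage(sensors):
    covered = set()
    for s in sensors:
        covered |= regions_covered[s]
    return len(covered)

print(greedy_placement(list(regions_covered), coverage, 2))  # → ['D', 'C']
```

Note that after placing "D", the marginal gain of "A" and "B" drops to zero, which is exactly the diminishing-returns behavior the greedy algorithm exploits.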
I will present results from applying our algorithms to several real-world sensing tasks, including environmental monitoring using robotic sensors, activity recognition using a sensing chair we built, and a sensor placement competition. I will conclude by drawing an interesting connection between sensor placement for water monitoring and the problem of selecting blogs to read in order to learn about the biggest stories discussed on the web.
This talk is primarily based on joint work with Andreas Krause.
Carlos Guestrin's current research spans the areas of planning, reasoning, and learning in uncertain dynamic environments, focusing on applications in sensor networks. He is an assistant professor in the Machine Learning and Computer Science Departments at Carnegie Mellon University. Previously, he was a senior researcher at the Intel Research Lab in Berkeley. Carlos received his MSc and PhD in Computer Science from Stanford University in 2000 and 2003, respectively, and a Mechatronics Engineer degree from the Polytechnic School of the University of Sao Paulo, Brazil, in 1998. Carlos' work received awards at a number of conferences and a journal: KDD 2007, IPSN 2005 and 2006, VLDB 2004, NIPS 2003 and 2007, UAI 2005, ICML 2005, and JAIR in 2007. He is also a recipient of the ONR Young Investigator Award, the NSF CAREER Award, the Alfred P. Sloan Fellowship, the IBM Faculty Fellowship, the Siebel Scholarship, and the Stanford Centennial Teaching Assistant Award.
Carlos is currently a member of the Information Sciences and Technology (ISAT) advisory group for DARPA.
The land remote sensing community has a long history of using supervised and unsupervised methods to help interpret and analyze remote sensing data sets. Until relatively recently, most remote sensing studies have used fairly conventional image processing and pattern recognition methodologies. In the past decade, NASA has launched a series of remote sensing missions known as the Earth Observing System (EOS). The data sets acquired by EOS instruments provide an extremely rich source of information related to the properties and dynamics of the Earth's terrestrial ecosystems. However, these data are also characterized by large volumes and complex spectral, spatial, and temporal attributes. Because of the volume and complexity of EOS data sets, efficient and effective analysis of them presents significant challenges that are difficult to address using conventional remote sensing approaches. In this talk we discuss results from applying a variety of different data mining approaches to global remote sensing data sets. Specifically, we describe three main problem domains and sets of analyses: (1) supervised classification of global land cover using data from NASA's Moderate Resolution Imaging Spectroradiometer; (2) the use of linear and non-linear clustering and dimensionality reduction methods to examine coupled climate-vegetation dynamics using a twenty-year time series of data from the Advanced Very High Resolution Radiometer; and (3) the use of functional models, non-parametric clustering, and mixture models to help interpret and understand the feature space and class structure of high dimensional remote sensing data sets. The talk will not focus on specific details of algorithms. Instead we describe key results, successes, and lessons learned from ten years of research focusing on the use of data mining and machine learning methods for remote sensing and Earth science problems.
Increasing availability of hyperspectral remotely sensed data provides new opportunities for improved representation of important characteristics in atmospheric, terrestrial, and oceanographic processes. Higher classification accuracies may also be possible, although achieving high classification accuracy and good generalization when sample sizes are small relative to the dimension of the input space is a challenging problem, especially when the number of classes is also large. Supervised classification algorithms typically rely heavily on feature selection or linear feature extraction to mitigate the effect of high dimensionality. These traditional approaches fail to exploit the nonlinear characteristics that are inherent in hyperspectral data. Recent results from the machine learning community, which assume that the original high dimensional data lie on a low dimensional manifold, attempt to derive a coordinate system that resides on the nonlinear manifold. Isometric mapping (ISOMAP), one of the most popular approaches, parameterizes the coordinate system using geodesic distances between points, based on computation of all pairwise distances between points with a shortest-path algorithm. The resulting O(N^3) algorithm is not viable for typical remote sensing applications, which involve very large data sets.
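To see where the O(N^3) cost comes from, consider the all-pairs shortest-path step on the neighborhood graph. The classic Floyd-Warshall formulation below makes the three nested loops over the N points explicit; the toy chain graph is invented purely for illustration.

```python
# Minimal illustration of why classical ISOMAP is O(N^3): it needs all
# pairwise geodesic distances, here computed with Floyd-Warshall on a toy
# neighborhood graph (graph and weights are made up for illustration).
INF = float("inf")

def all_pairs_geodesic(n, edges):
    """Floyd-Warshall all-pairs shortest paths: three nested loops over
    the n points, hence O(n^3) time, intractable for large images."""
    d = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
    for i, j, w in edges:          # symmetric neighborhood graph
        d[i][j] = d[j][i] = min(d[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Four points on a chain: the geodesic distance between the endpoints is
# the sum of edge weights along the path, not the straight-line distance.
d = all_pairs_geodesic(4, [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)])
print(d[0][3])  # → 3.0
```

Landmark-based variants, such as those described in the presentation, avoid this cubic blowup by computing shortest paths only from a small set of landmark points.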
This presentation will focus on new approaches we have developed to determine global manifolds for large-scale remote sensing applications using intelligent landmarks, a backbone approach, and a spatial-spectral segmentation method, as well as a semi-supervised method for tuning global manifolds to better represent local phenomena across nonstationary images. We will also present results of our investigations of manifold learning applied to analysis of hyperspectral data, including supervised classification of land cover, extraction of shallow water bathymetry, and representation of complex phenomena exhibited in hyperspectral imagery of the coastal zone.
Dr. Crawford's research interests include statistical pattern recognition, data fusion, and multiresolution modeling related to analysis of remotely sensed data. Recently, her group has focused on new methods for classification of high dimensional data involving manifold learning, knowledge transfer, and active learning. In 2004-2005, Dr. Crawford was a Jefferson Senior Science Fellow at the U.S. Department of State. She has published more than 100 journal and conference papers and is a Fellow of the IEEE. Dr. Crawford served as a member of the NASA Earth System Science and Applications Advisory Committee and was a member of the NASA EO-1 Science Team. She currently serves on the advisory committee to the NASA Socioeconomic Applications and Data Center.
After reviewing key background concepts in fuzzy systems and evolutionary computing, we will focus on the use of local fuzzy models, which are related to both kernel regressions and locally weighted learning. Instead of using a manual approach to develop such models, we use evolutionary algorithms to search in the design space of these models.
With these models we will determine the remaining life of a unit in a fleet of vehicles. Instead of developing individual models (based on the track history of each unit) or developing a global model (based on the collective track history of the fleet), we propose local fuzzy models based on clusters of peers, similar units with comparable utilization and performance. For each cluster of peers we create a local fuzzy model. We combine the fuzzy peer-based approach for performance modeling with an evolutionary framework for model maintenance. Our process generates a collection of competing models, evaluates their performance in light of the currently available data, refines the best models using evolutionary search, and selects the best one after a finite number of iterations. This process is repeated periodically to automatically produce updated and improved versions of the model.
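One way to picture the peer-based idea is as a fuzzy-similarity-weighted prediction over a unit's peer cluster. The sketch below uses Gaussian membership functions and invented feature values; the actual membership functions, features, and data in the talk's models are not specified here, so treat every name and number as an assumption.

```python
import math

# A hedged sketch of peer-based prognostics: predict a unit's remaining
# life as a fuzzy-similarity-weighted average over its peers (similar
# units with comparable utilization and performance). All data invented.

def fuzzy_similarity(x, peer, widths):
    """Product of Gaussian memberships along each feature dimension."""
    return math.prod(
        math.exp(-((xi - pi) ** 2) / (2 * w ** 2))
        for xi, pi, w in zip(x, peer, widths)
    )

def predict_remaining_life(x, peers, lives, widths):
    """Similarity-weighted average of the peers' observed remaining lives."""
    weights = [fuzzy_similarity(x, p, widths) for p in peers]
    return sum(w * y for w, y in zip(weights, lives)) / sum(weights)

# Peers described by (utilization, average load); remaining life in months.
peers = [(0.8, 0.6), (0.7, 0.5), (0.2, 0.9)]
lives = [24.0, 30.0, 60.0]
print(predict_remaining_life((0.75, 0.55), peers, lives, widths=(0.3, 0.3)))
```

Because the query unit is far more similar to the first two peers, the prediction lands near their remaining lives rather than near the dissimilar third unit, which is the "cluster of peers" intuition in miniature.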
To illustrate this methodology we chose an asset selection problem: given a fleet of industrial vehicles (diesel electric locomotives), we want to select the best subset (of fixed or variable size) for mission-critical utilization. To this end, we predict the remaining life for each unit in the fleet. We then sort the fleet using this prediction and select the highest ranked units. The model chosen to perform this prediction/selection task is a fuzzy instance-based model. A series of experiments using data from locomotive operations was conducted, and the results from an initial validation exercise are presented.
The approach of constructing local predictive models using fuzzy similarity with neighboring points along appropriate dimensions is not specific to any asset type; it applies to many other Prognostics and Health Management (PHM) problems.
The exploration of space requires that we continue our dependence on remote scientific platforms. The continued success of the Mars Exploration Rovers highlights the great benefits of navigational autonomy. However, science operations continue to require a team of scientists to select the specific experiments to perform and to precisely guide sensor placement. While this works well on Mars, which has a communication delay ranging from 6.5 to 44 minutes, this model will be strained during future operations on Jupiter's icy moons, and will most certainly break on future Saturnian missions to Titan and Enceladus. For robotic missions to operate in the outer solar system at a production level comparable to that of the Mars rovers, they will require greater autonomy, not only in mapping and navigation, but also in experimental design and sensor placement.
I will introduce our initial efforts to develop a software-based inquiry engine that relies on a generalized form of information theory called the inquiry calculus. This computational technology enables one to compute the optimal experimental question to ask in a given situation. It depends on predicting the probable answers to questions and selecting the question based on the entropy of the probability distribution of potential answers. I will demonstrate these concepts on a robotic arm that has been programmed to identify and characterize shapes on a playing field using only a simple light sensor.

Synthesizing Information From Multiple Climate Models: a Bayesian Approach to Probabilistic Climate Change Projections
Future projections of climate change rely for the most part on the results of simulations by General Circulation Models, which simulate the main climate processes at work in the Earth's atmosphere and oceans, over land and sea ice, and their complex interactions.
Different climate models, however, produce different climate projections, even under the same future forcing scenario, and even on average over large-scale regions. How do we best estimate what future climate will be like, and the uncertainty associated with it, on the basis of an ensemble of these experiments? I will describe the main issues that underlie the analysis of ensembles of climate models, and briefly offer an overview of alternative lines of attack. I will then expand on our proposed Bayesian approach, estimating joint projections of temperature and precipitation change at regional scales.
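As a toy illustration of the Bayesian flavor of such a combination (not the authors' actual hierarchical model), suppose each climate model's projection is treated as an independent Gaussian estimate of the true regional change under a flat prior. The posterior mean is then a precision-weighted average of the individual projections; all numbers below are invented.

```python
# Toy sketch of Bayesian multi-model combination: with a flat prior and
# independent Gaussian likelihoods, the posterior over the true change is
# Gaussian with a precision-weighted mean. Projections/variances invented.

def combine_projections(means, variances):
    """Posterior mean and variance of the true change given model estimates."""
    precisions = [1.0 / v for v in variances]
    post_var = 1.0 / sum(precisions)
    post_mean = post_var * sum(p * m for p, m in zip(precisions, means))
    return post_mean, post_var

# Regional temperature-change projections (deg C) from three models; the
# variances encode how much weight each model receives.
means = [2.0, 3.0, 2.5]
variances = [0.5, 1.0, 0.25]
mean, var = combine_projections(means, variances)
print(round(mean, 3), round(var, 3))
```

The tightly constrained third model pulls the combined estimate toward its projection, and the posterior variance is smaller than any single model's, which is the basic mechanism a full Bayesian treatment elaborates with model bias and correlation terms.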
Remote sensing data from global observing satellites, combined with data from ecosystem models, offers an unprecedented opportunity for predicting and understanding the behavior of the Earth's ecosystem. This data consists of a sequence of global snapshots of the Earth, and includes various atmospheric, land, and ocean variables such as sea surface temperature (SST), pressure, precipitation, vegetation index (NDVI), and net primary production (NPP). Due to the large amount of data that is available, data mining techniques are needed to facilitate the automatic extraction and analysis of interesting patterns from the Earth science data. However, mining patterns from Earth science data is a difficult task due to the spatio-temporal nature of the data.
This talk will discuss various challenges involved in analyzing the data, and present some of our work on the design of efficient algorithms for finding spatio-temporal patterns from such data and their applications in discovering interesting relationships among ecological variables from various parts of the Earth.
Vipin Kumar is currently William Norris Professor and Head of Computer Science and Engineering at the University of Minnesota. His research interests include high performance computing and data mining. He has authored over 200 research articles, and co-edited or coauthored 9 books, including the widely used textbooks "Introduction to Parallel Computing" and "Introduction to Data Mining", both published by Addison-Wesley.
Kumar has served as chair/co-chair for over a dozen conferences and workshops in the area of data mining and parallel computing.
Kumar is a founding co-editor-in-chief of the Journal of Statistical Analysis and Data Mining, editor-in-chief of the IEEE Intelligent Informatics Bulletin, and series editor of the Data Mining and Knowledge Discovery Book Series published by CRC Press/Chapman Hall. Kumar is a Fellow of the AAAS, ACM, and IEEE. He received the 2005 IEEE Computer Society Technical Achievement Award for contributions to the design and analysis of parallel algorithms, graph partitioning, and data mining.
The last decade has seen the academic community introduce a host of techniques for anomaly detection, both from online data streams and from large archives of stream data. However, few if any of these techniques seem to have achieved adoption by practitioners on the "front line". An informal survey suggests two reasons for this. First, many of the methods require careful adjustment of many parameters, in some cases as many as nine. Second, the academic community is often satisfied to test on small datasets; the majority of works on anomaly detection in top journals and conferences test on datasets that are less than one megabyte in size.
In this talk I will summarize research efforts by the data mining group at UCR to address these problems. For the former problem, we show that anomaly detection methods with one or zero parameters can outperform more complex methods in many cases. For the scalability issue we show the results of experiments that perform anomaly detection in terabyte sized datasets containing up to one hundred million time series.
The talk will conclude with a (large!) list of open problems and research directions.
Dr. Keogh's research interests are in data mining, machine learning, and information retrieval. He has published more than 100 papers, including a dozen papers in each of ACM SIGKDD and IEEE ICDM. He has won best paper awards at ACM SIGMOD, ACM SIGKDD, and IEEE ICDM.
In addition he has won several teaching awards. He is the recipient of a 5-year NSF Career Award for "Efficient Discovery of Previously Unknown Patterns and Relationships in Massive Time Series Databases" and a grant from Aerospace Corp to develop a time series visualization tool for monitoring space launch telemetry. Dr Keogh has given well-received tutorials on time series, machine learning and data mining all over the world.
Anomaly Detection in GPS Networks Using Statistical
Joint work with: Marlon Pierce (2), Xiaoming Gao (2), Yehuda Bock (3). (1) Jet Propulsion Laboratory, California Institute of Technology; (2) Community Grids Laboratory, Indiana University Bloomington; (3) Scripps Institution of Oceanography, University of California San Diego.
The detection of anomalies in GPS measurements of surface displacement is important not only for maintaining data quality and for network health monitoring, but also for identifying rare but scientifically significant signals associated with solid earth processes. For example, recent evidence has shown that GPS measurements are capable of detecting some types of slow seismic events. Known examples of these events are rare, however, and identifying additional events has high scientific value.
To perform anomaly detection in GPS networks, we have developed a time series analysis method based on hidden Markov models (HMMs) that can identify such signals based on evidence either on a sensor-by-sensor basis or by considering the network as a whole. To facilitate the identification of anomalies, we have embedded this technology in a web services/web portal environment that enables users to assess current and historic activity within a GPS network.
Our method is a data-driven approach that does not depend on a physical model of the underlying system. Displacement measurement time series from individual GPS sensors are analyzed by using a robust, unconstrained algorithm to fit hidden Markov models to the data. This allows us to segment the time series into distinct modes according to the statistical properties of the observations; these modes can be associated with underlying patterns of activity. We perform analysis over an entire sensor network or sub-network by comparing segmented time series. When multiple sensors change modes at the same time, we infer that a significant regional signal is present.
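The network-level inference step can be sketched very simply: given each sensor's mode sequence (as would be produced by the per-sensor HMM segmentation), flag the time steps where many sensors change mode simultaneously. The station names and mode labels below are illustrative, not actual network output.

```python
# A hedged sketch of the network-level step: flag time steps where at
# least min_sensors change HMM mode at once, suggesting a regional signal
# rather than a local artifact. Station names and modes are illustrative.

def coincident_mode_changes(mode_sequences, min_sensors):
    """Return time indices where at least min_sensors change mode."""
    n_steps = len(next(iter(mode_sequences.values())))
    flagged = []
    for t in range(1, n_steps):
        changes = sum(
            seq[t] != seq[t - 1] for seq in mode_sequences.values()
        )
        if changes >= min_sensors:
            flagged.append(t)
    return flagged

# Three stations; each entry is the HMM state label per (daily) solution.
modes = {
    "STA1": [0, 0, 0, 1, 1, 1],
    "STA2": [0, 0, 0, 1, 1, 0],
    "STA3": [2, 2, 2, 2, 3, 3],
}
print(coincident_mode_changes(modes, min_sensors=2))  # → [3]
```

Only the coordinated change at step 3 is flagged; the isolated mode changes at later steps are treated as single-station events.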
Through our web portal, the status of GPS stations in the network can be quickly assessed through color-coded icons on a Google Maps interface. These icons inform users whether individual stations have changed modes, remained in the same mode, or are missing data, and distinctions are made between the most up-to-date and lagged mode changes. Easy drill-down capabilities are provided through this interface, allowing users to view entire segmented time series, view model parameters, and scroll backwards and forwards in time. These capabilities enable interactive research. Our initial implementation focuses on GPS data from the California portion of the Plate Boundary Observatory, but our approach is general and therefore readily extensible to other GPS networks and even other types of sensor networks.
Joint work with Roman Fresnedo (Boeing) and Tom Dietterich (Oregon State University)
The last decade has witnessed several successes of deployed systems that have learning and inference components. One of the hardest tasks lying ahead for learning systems is the prediction and detection of high-risk, low-probability events. Small counts (or zero counts) are notoriously difficult to address, especially because those rare observations or events have high costs associated with missing them. An even more challenging task is the testing and evaluation of predictors and detectors that claim rare event and anomaly detection capabilities, and no statistical, inference, or learning magic will solve the problem in the absence of any knowledge of how the observed values in the data were generated.
This presentation will describe a set of statistical tests for estimating confidence intervals for the risk of high-cost classification decisions, and a Bayesian extension to these tests that allows the incorporation of prior knowledge about characteristics of the task and of the detection model in order to estimate the distribution of the risk of decisions. We will present experimental results on the tests and on employing several structural models for the probability of observations, and will discuss the challenges lying ahead.
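As a minimal illustration of the small-count difficulty described above (not the authors' actual tests), consider estimating the rate of a rare event after observing k failures in n trials. The frequentist "rule of three" gives an approximate 95% upper bound when k = 0, and a Beta-Binomial posterior shows how prior knowledge can shift the risk estimate; all priors and counts below are invented.

```python
# Two textbook-style handles on zero-count risk estimation, used here
# only to illustrate the problem the talk addresses. Numbers are invented.

def rule_of_three_upper(n):
    """Approximate 95% upper bound on the event rate after 0 events in n trials."""
    return 3.0 / n

def beta_posterior_mean(k, n, alpha=1.0, beta=1.0):
    """Posterior mean of the event rate under a Beta(alpha, beta) prior."""
    return (k + alpha) / (n + alpha + beta)

n = 1000  # trials with zero observed failures
print(rule_of_three_upper(n))               # frequentist 95% upper bound
print(beta_posterior_mean(0, n))            # flat prior: near 1/(n+2)
# A prior encoding knowledge that failures do occur pulls the estimate up:
print(beta_posterior_mean(0, n, alpha=2.0, beta=8.0))
```

The point mirrored from the abstract: with zero observed events, the data alone cannot distinguish a truly safe system from an undersampled one, so the choice of prior (i.e., task knowledge) dominates the risk estimate.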
Our goal is to generate comprehensible and accurate models from multiple time series for anomaly detection. The models need to produce anomaly scores in an online manner for real-life monitoring tasks. We introduce three algorithms that work in a constructed feature space and evaluate them with a real data set from the NASA shuttle program. Our off-line and on-line evaluations indicate that our algorithms can be more accurate than two existing algorithms.