» Speaker Abstracts
Kirk Borne, Ph.D. - Abstract 1
Session: Astroinformatics: Learning from Data in the Astronomical Sciences 1
Citizen Science and Astroinformatics - Data Science at the Frontiers of Astronomy
Kirk Borne
George Mason University
I will describe the synergy between computational algorithms and human computation in addressing some of the challenges of learning from Big Data. I will focus on two topics, Astroinformatics (which is Data Science for Astronomy) and Citizen Science in Astronomy, within the broader context of collaborative annotation of massive data for search and discovery. The application of Data Science machine learning algorithms to the collections of labels and tags produced through human-machine interaction will enable novel and unexpected discoveries.
John Wallin, Ph.D.
Session: Astroinformatics: Learning from Data in the Astronomical Sciences 1
Crowd Sourcing vs. Computational Intelligence - Some Lessons from the Zooniverse
John Wallin
Center for Computational Science, Department of Physics and Astronomy
Middle Tennessee State University
The Zooniverse project was created to connect scientists with volunteers from around the world who help them analyze large data sets. In these projects, Citizen Scientist volunteers classify, characterize, and transcribe image and sound data from a wide variety of scientific disciplines. As computational intelligence algorithms improve, the tasks given to Citizen Scientists need to change and evolve. We present some recent results showing how data from Citizen Science projects can be used to validate and train computational intelligence algorithms, and how pairing crowd sourcing with computational intelligence can produce scalable solutions to the extreme data challenges facing us in the future.
Dan Burger, Ph.D.
Session: Astroinformatics: Learning from Data in the Astronomical Sciences 1
Filtergraph: An Innovative Online Portal for Rapid and Intuitive Visualization of Massive Multi-Dimensional Datasets
Dan Burger
Vanderbilt University
Filtergraph is a web application being developed by the Vanderbilt Initiative in Data-intensive Astrophysics (VIDA) to flexibly handle a large variety of astronomy datasets. While current datasets at Vanderbilt are being used to search for eclipsing binaries and extrasolar planets, this system can be easily reconfigured for a wide variety of data sources. The user loads a flat-file dataset into Filtergraph, which instantly generates an interactive data portal that can be easily shared with others. From this portal, the user can immediately generate scatter plots, histograms, and tables based on the dataset. Key features of the portal include the ability to filter the data in real time through user-specified criteria, the ability to select data by dragging on the screen, and the ability to perform arithmetic operations on the data in real time. The application is being optimized for speed in the context of very large datasets: for instance, plots generated from a stellar database of 3.1 million entries render in less than 2 seconds on a standard web server platform. This web application has been created using the Web2py web framework based on the Python programming language. Filtergraph is freely available at http://filtergraph.vanderbilt.edu/
Matthias Katzfuss, Ph.D.
Session: Data Science and Climate 1
Low-Rank Spatial Models for Large Remote-Sensing Datasets
Matthias Katzfuss
University of Heidelberg
With the proliferation of modern high-resolution measuring instruments mounted on satellites, planes, ground-based vehicles and monitoring stations, a need has arisen for statistical methods suitable for the analysis of large spatial datasets observed on large, heterogeneous spatial domains.
Many statistical approaches to this problem rely on low-rank models, for which the process of interest is modeled as a linear combination of spatial basis functions plus a fine-scale-variation term. For the full-scale approximation, a type of low-rank model that uses a so-called parent covariance and a set of knots to parameterize the model components, I will discuss two extensions: First, I will describe how to perform Bayesian inference on the set of knots. Second, I will argue that it is often advantageous to use a nonstationary parent covariance, and I propose a generalization of the Matérn covariance to the sphere that can be used for global satellite CO2 measurements.
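To make the low-rank construction concrete, here is a minimal numerical sketch (not the speaker's code; the 1-D domain, Gaussian basis, knot placement, and all parameter values are invented for illustration). The process is a linear combination of basis functions evaluated at a small set of knots, plus a fine-scale-variation term, which gives the covariance a rank-r structure that is cheap to work with:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D toy domain: 200 observation locations and 10 knots (basis centers).
s = np.linspace(0.0, 1.0, 200)
knots = np.linspace(0.05, 0.95, 10)

# Gaussian radial basis functions centered at the knots: an n x r matrix.
bandwidth = 0.1
S = np.exp(-0.5 * ((s[:, None] - knots[None, :]) / bandwidth) ** 2)

# Low-rank process: basis combination plus fine-scale variation.
eta = rng.normal(size=knots.size)            # basis-function coefficients
fine_scale = 0.05 * rng.normal(size=s.size)  # fine-scale-variation term
y = S @ eta + fine_scale

# The implied covariance S K S' + tau^2 I has rank-r structure, so
# inference scales like O(n r^2) rather than O(n^3) in the number of sites.
K = np.eye(knots.size)                       # coefficient covariance (toy)
tau2 = 0.05 ** 2
cov = S @ K @ S.T + tau2 * np.eye(s.size)
```

In the full-scale approximation the parent covariance and the knot set would parameterize K and the basis; here both are trivial placeholders.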
Amy Braverman, Ph.D.
Session: Data Science and Climate 1
Likelihood-based Climate Model Evaluation
Amy Braverman
Jet Propulsion Laboratory
Lukas Mandrake, Ph.D.
Session: Data Science and Climate 1
Informing Climate Retrieval Development Using Data Mining
Lukas Mandrake
Jet Propulsion Laboratory
From the humblest methods may sometimes come great insight. Tasked with assisting atmospheric scientists working on an advanced algorithm that deduces atmospheric carbon dioxide from satellite-based atmospheric spectra, we quickly found that the interface between data mining methods and the climate researcher's paradigm would be the greatest bottleneck, along with the strong correlation, high dimensionality, and large volume of the data itself. To alleviate this concern, we reformulated the data mining problem to more closely resemble an automated version of the typical threshold filters and linear fits employed by climate scientists in their data analysis, and used genetic algorithms to explore the trade-off space of performance versus percent of data accepted. This simple system can then be used to perform feature selection to isolate data features that cause algorithm failure and to analyze upstream data quality. Our results shed light on the dominant sources of error (partial or thin clouds), remove the justification for trying to fit away errors in the final output, and provide a performance curve for evaluating differing versions of the algorithm itself. Examples will be shown from the Greenhouse gases Observing SATellite (GOSAT).
Kirk Borne, Ph.D. - Abstract 2
Session: Learning from Data
Learning from Data, Big and Small
Kirk Borne
George Mason University
The volume of data has grown to the point that politicians, educators, business people, scientists, social media specialists, and countless others are paying attention to this exponential flood of information. It is indeed exponential since the data growth rate is proportional to the existing data volume, with a doubling time of less than one year in many contexts.
The explosion of interest in the topic is saturating the discussion in all data-intensive domains. The challenges associated with Big Data are technological, algorithmic, and sociological. I will address the fundamental challenges that Big Data pose to scientific research and education.
An informatics approach to scientific research includes a variety of data science disciplines, including statistics, visualization, machine learning, data mining, data modeling, data indexing, data structures, and more. Accordingly, science education must evolve to incorporate these emerging methods and algorithms within traditional programs. I will describe some approaches to this revolution in scientific research and education.
Michael Mahoney, Ph.D. - Abstract 1
Arun Vedachalam, Ph.D.
Pawan K Bhartia, Ph.D. & Joanna Joiner Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Experience in extracting scientific information from the data collected by back-scattered ultraviolet (BUV) instruments flown on satellites since 1970
Pawan K Bhartia
NASA Goddard Space Flight Center
Joanna Joiner
NASA Goddard Space Flight Center
We will discuss our experience in extracting scientific information from more than a dozen back-scattered ultraviolet (BUV) instruments flown on satellites since April 1970. The data from the BUV series of instruments are not only one of the longest Earth science records collected from satellites; the volume of data from these instruments and their demand for processing power have increased as fast as the cost of storage and processing has decreased. The first TOMS instrument, launched on NASA's Nimbus-7 satellite in October 1978, produced less than 1 kbit of data per second. Yet at the time of launch it was estimated that processing all the data it was collecting would require almost the entire capacity of the largest IBM 360 computer then operating at NASA Goddard Space Flight Center (GSFC). Since this computer served the needs of all the scientists working at GSFC, it was considered unfeasible to process all the data from TOMS. Though the situation was remedied by optimizing the processing code at substantial cost, the volume of data TOMS produced was considered so large that many potential users of the data chose not to invest their resources in analyzing it. Some information technologists have attributed the delay in finding the Antarctic ozone hole in the TOMS data to the lack of adequate visualization and data mining capability at NASA at the time. Though this is a misrepresentation of what actually happened, it is nevertheless true that our ability to extract meaningful information from large datasets has not expanded as fast as the volume of the data we are collecting. In many cases we are analyzing large datasets with tools and techniques developed for much smaller datasets. We will discuss our experience in applying modern methods of extracting information from large datasets, including neural networks and principal component analysis.
Hongbin Yu, Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Assessment of aerosol intercontinental transport with big data from EOS satellites
Hongbin Yu
University of Maryland and NASA Goddard Space Flight Center
Evidence of aerosol intercontinental transport (ICT) is both widespread and compelling. Model simulations suggest that ICT could significantly affect regional air quality and climate, but the broad inter-model spread of results underscores a need to constrain model simulations with measurements. Satellites have inherent advantages over in situ measurements for characterizing aerosol ICT because of their spatial and temporal coverage. Significant progress in satellite remote sensing of aerosol properties during the Earth Observing System (EOS) era offers the opportunity to improve quantitative characterization and estimates of aerosol ICT beyond the capability of pre-EOS-era satellites, which could only qualitatively track aerosol plumes. EOS satellites also observe emission strengths and injection heights of some aerosols, aerosol precursors, and aerosol-related gases, which can help characterize aerosol ICT. In this talk, we will show how a synergy of three-dimensional observations of aerosols from multiple EOS satellites, supplemented by model simulations, provides insight into the relative contributions of imported (via ICT) and domestically produced aerosols over North America. Implications of the findings for climate and air quality will also be discussed.
Peter R. Colarco, Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Implications of Satellite Swath Width on Aerosol Optical Thickness Statistics
Peter R. Colarco
NASA Goddard Space Flight Center
Lorraine Remer
University of Maryland
Ralph Kahn
NASA GSFC
Robert Levy
NASA GSFC
A primary consideration for future aerosol satellite missions is the spatial coverage provided by the measurements. For example, for a polar orbiting satellite, two important questions that arise are how much of the Earth is sampled per orbit, and how long does it take to achieve a global sample? In the current generation of sensors, for example, the Moderate-Resolution Imaging Spectroradiometer (MODIS) instrument has a wide swath (~2300 km) and provides nearly daily global coverage. On the other hand, the Multi-Angle Imaging Spectroradiometer (MISR) has a much narrower swath (~380 km) and takes about eight days to sample the globe. At the extreme, the nadir-only view provided by the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) has an orbital repeat cycle of 16 days, but samples very little of the Earth’s surface in achieving this “global coverage.” Each of these instruments has different capabilities for detecting aerosols, and it is generally understood that there is often a trade-off between coverage and capability.
The fundamental question in this study is whether spatial coverage matters to the statistics of aerosol optical thickness (AOT), a primary quantity desired from satellite measurements. We investigate this using AOT fields obtained from the MODIS data record. In our study the full-swath MODIS data set is sampled to extract MISR- and CALIOP-like spatial coverage versions of the data set. We investigate simple observability questions, such as where it is and is not possible to measure aerosols because of the spatial sampling choice. We additionally investigate the suitability of narrow-swath measurements for detecting trends in the AOT field.
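The swath-subsampling experiment can be sketched in a few lines (a toy on a synthetic AOT field, not the authors' analysis; the pixel widths are only illustrative proportions of the ~2300 km MODIS, ~380 km MISR, and nadir-only CALIOP swaths, not the real instrument geometries):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "full-swath" AOT field: (cross-track, along-track) pixels.
# The full 230-pixel cross-track dimension stands in for a MODIS-like
# ~2300 km swath; narrower swaths are taken as centered subsets.
aot = rng.lognormal(mean=-2.0, sigma=0.6, size=(230, 1000))

def narrow_swath(field, width):
    """Keep only a centered cross-track strip `width` pixels wide."""
    start = field.shape[0] // 2 - width // 2
    return field[start:start + width, :]

misr_like = narrow_swath(aot, 38)    # ~380 km out of ~2300 km
caliop_like = narrow_swath(aot, 1)   # nadir-only curtain

# How do the recovered AOT statistics depend on the sampling choice?
for name, f in [("full", aot), ("MISR-like", misr_like),
                ("CALIOP-like", caliop_like)]:
    print(f"{name}: mean AOT = {f.mean():.3f} from {f.size} pixels")
```

With a homogeneous synthetic field the means agree and only the sample counts differ; the study's point is that real AOT fields are spatially inhomogeneous, so narrower sampling can bias the recovered statistics.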
Jason Cohen, Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Improving our Understanding of Land-Use Change and Fire in Southeast Asia by Using Advanced Mathematical Techniques, Models, and Multiple Remotely Sensed Measurements in Tandem
Jason Cohen
National University of Singapore
Rapid and wide-scale changes are occurring to the land surface in Southeast Asia, due to both urban and agricultural expansion. These changes are rapid, not well constrained, and are starting to be observable at climatological scales. Some of these influences are natural in origin, while others are anthropogenic; they include three major drivers: (a) changes in response to various phases of the Monsoon; (b) human-induced fires; and (c) permanent alteration of the land for urbanization and other economic activities. These changes are responsible for a significant portion of the global Black Carbon (BC) and Organic Carbon (OC) aerosols emitted into the atmosphere, both of which are highly variable in space and time.
Therefore, to better quantify the properties of the land surface at large spatial scales, and the properties of the aerosols resulting from changes in these land surfaces, it is imperative to look at the problem over a sufficiently long time-scale to capture the processes important in this region of the world. In this work, data has been used reaching as far back as possible, in most cases at least 10 years. However, when working with so much data, new quantitative methods of analysis are required. The purpose of this presentation is to introduce two such proof-of-concept approaches, as well as some initial and interesting results.
The first proof-of-concept approach involved using a Kalman Filter based on a coupled climate/radiation/aerosol/urbanization model, with data consisting of BC concentrations and remotely sensed AAODs. This work has produced the first global average estimate and uncertainty of annual BC emissions, with an optimized range of 200% to 300% of the emissions currently used by the IPCC, AEROCOM, and Bond et al. An important additional point relates to the issues of fires and land-use change: the emissions, concentrations, and AAODs, in an annual-average sense, are significantly underestimated in Southeast Asia, which also happens to be impacted by large-scale fires.
The second proof-of-concept approach involved using PCA as a tool to extract the standing modes of multiple remotely sensed datasets, and analyzing those that contribute the most variance. This tool allows both the spatial and the associated temporal structure of the dataset to be elucidated. This approach has been used in connection with AOD data from MISR and MODIS; NDVI, LAI, and EVI from MODIS; precipitation from TRMM; and various aerosol products from CALIPSO.
Combining the first and second techniques has led to the determination of unique temporally and spatially varying properties that correspond one-to-one with all the known large-scale fire events in Southeast Asia over the past decade. Running these new results through the same modeling system allows for a comparison against known datasets, and these results will be presented. It will be demonstrated that the inter-seasonal and inter-annual variations can be better captured with this new technique than with other commonly used fire-based a priori emissions sets, such as those based on GFED or MODIS fire radiative power.
Finally, other applications of the second technique have allowed for successful detection of a few important connections to be made with respect to the properties of the land-surface and the climate system over Southeast Asia. Two of these results that will be discussed include both important natural and anthropogenic signals: interactions between fires and precipitation, and interactions between the monsoons and the larger-scale land-surface properties.
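The PCA extraction of standing modes described in the second proof-of-concept approach can be sketched as follows (a toy on synthetic data, not the author's analysis): the spatial patterns come from the right singular vectors of the time-centered (time, space) data matrix, and their temporal amplitudes from the left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic record: 120 monthly maps on a 20 x 30 grid, flattened to a
# (time, space) matrix -- a stand-in for, e.g., a gridded AOD dataset.
n_t, n_y, n_x = 120, 20, 30
t = np.arange(n_t)
pattern = np.outer(np.sin(np.linspace(0, np.pi, n_y)),
                   np.cos(np.linspace(0, np.pi, n_x)))  # one spatial mode
signal = np.sin(2.0 * np.pi * t / 12.0)[:, None] * pattern.ravel()[None, :]
data = signal + 0.1 * rng.normal(size=(n_t, n_y * n_x))

# PCA via SVD of the time-centered anomalies: rows of Vt are the
# standing spatial modes; U * s gives their temporal amplitudes.
anom = data - data.mean(axis=0)
U, sing, Vt = np.linalg.svd(anom, full_matrices=False)
var_explained = sing ** 2 / np.sum(sing ** 2)

leading_mode = Vt[0].reshape(n_y, n_x)    # spatial structure of mode 1
amplitude = U[:, 0] * sing[0]             # its temporal evolution
print(f"mode 1 carries {var_explained[0]:.1%} of the variance")
```

Here the planted annual cycle is recovered as the leading mode; in the real application the leading modes' temporal amplitudes are what get matched against fire events and monsoon phases.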
George Djorgovski, Ph.D.
Ashish Mahabal, Ph.D.
Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2
Finding Rare Astronomical Objects Using Efficient Bayesian Networks
Ashish Mahabal
California Institute of Technology
Current time-domain surveys find tens of transients per night with the detection threshold set high (well over a magnitude). Surveys in the near future are set to find several orders of magnitude more. The vast majority of these belong to well-understood types. The challenge is in identifying the rarer types and concentrating the scarce follow-up resources on those. Early characterization/classification often has to be done from scarce data. This includes (1) fluxes at one or two recent epochs, perhaps an archival, co-added flux, often in the form of an upper limit; (2) position, with an error depending on the wavelength of discovery; and (3) archival parameters such as the nearest radio source. The total number of such parameters can run to several tens, most of which are unavailable for any given transient. New discriminating follow-up is not only expensive to obtain, but for some of the rapid transients it can mean an unacceptable delay. Bayesian methods allow one to deal with missing parameters, but learning from data is an expensive if not impossible task given the large number of parameters. Naive Bayesian networks work up to an extent, but do not deal well with redundancy. The ideal solution is smaller networks for each class with well-crafted parameters based on domain knowledge. Here we detail the concept with a three-parameter network that uses only archival parameters to discriminate between supernovae and non-supernovae. Thus it can operate in real time without needing any follow-up observations. We describe such a deployment as part of the Catalina Real-time Transient Survey (CRTS). We further detail how the small binary network fits into a larger multi-class network.
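The property the abstract leans on, that a Bayesian classifier can simply skip parameters unavailable for a given transient, can be shown with a toy two-class, three-parameter naive-Bayes example (every distribution, parameter name, and number below is invented for illustration; these are not CRTS values):

```python
import math

# Toy class-conditional Gaussians (mean, sigma) for three archival
# parameters; all values are hypothetical.
params = {
    "SN":     {"galaxy_dist": (1.0, 0.5),
               "hist_mag":    (21.0, 1.0),
               "radio_dist":  (30.0, 10.0)},
    "non-SN": {"galaxy_dist": (5.0, 2.0),
               "hist_mag":    (18.0, 1.5),
               "radio_dist":  (10.0, 5.0)},
}
prior = {"SN": 0.1, "non-SN": 0.9}

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def posterior(obs):
    """Naive-Bayes posterior over classes; any parameter absent from
    `obs` simply contributes no likelihood factor, which is how
    missing archival data is accommodated."""
    logp = {}
    for cls, dists in params.items():
        lp = math.log(prior[cls])
        for name, (mu, sigma) in dists.items():
            if name in obs:                  # missing -> skipped
                lp += log_gauss(obs[name], mu, sigma)
        logp[cls] = lp
    z = max(logp.values())
    total = sum(math.exp(v - z) for v in logp.values())
    return {cls: math.exp(v - z) / total for cls, v in logp.items()}

# Only two of the three parameters are available for this transient.
print(posterior({"galaxy_dist": 0.8, "hist_mag": 21.2}))
```

A per-class network with well-crafted parameters, as the abstract proposes, replaces these independent Gaussians with structured conditional distributions, but the missing-data mechanics are the same.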
Umaa Rebbapragada, Ph.D.
Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2
Data Triage of Astronomical Transients and Variables: A Machine Learning Approach
Umaa Rebbapragada
Jet Propulsion Laboratory
This talk presents real-time machine learning systems for triage of big data streams generated by photometric and image-differencing pipelines. Our first system is a transient event detection system in development for the Palomar Transient Factory (PTF), a fully automated synoptic sky survey that has demonstrated real-time discovery of optical transient events. The system is tasked with discriminating between real astronomical objects and artifacts of the image-differencing pipeline. We performed a machine learning forensics investigation into the initial PTF classification system that led to training data improvements and the development of new features, both of which dramatically improved the false positive and false negative rates. The second machine learning system is a real-time classification engine for transients and variables in development for the Australian Square Kilometre Array Pathfinder (ASKAP), an upcoming wide-field radio survey with unprecedented ability to investigate the radio transient sky. The goal of our system is to classify light curves into known classes with as few observations as possible in order to trigger follow-up on costlier assets. We discuss the violation of standard machine learning assumptions incurred by this task, and propose the use of ensemble and hierarchical machine learning classifiers that make the most robust predictions.
Padma A. Yanamandra-Fisher, Ph.D.
Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2
Application of PCA to the Atmospheres of Jupiter and Saturn: Temporal and Seasonal Changes
Padma A. Yanamandra-Fisher
Space Science Institute (SSI), Rancho Cucamonga, CA
Despite the wealth of observations of Jupiter and Saturn since Galileo's first telescopic observations in 1610, there are still important unanswered questions about their atmospheres and about the dynamics of various processes that remain poorly understood. The high-resolution data returned from several spacecraft missions, placed in the context of the larger timeline of ground-based observations, in principle allow us to develop insight into these processes. Recently, Jupiter and Saturn have been exhibiting dramatic atmospheric changes nearly continuously since 2007. The underlying basis for these changes may be a common driver of atmospheric disturbances in Jovian planets. Yet access to these data sets alone is not sufficient to develop unique models of the various physical and chemical processes that govern the planets. Dramatic changes in their atmospheres, ranging from discrete localized features to global regions, and the availability of large telescopes (and therefore higher spatial/spectral resolution data than before) require a new paradigm of rapid exploratory data mining that can be corroborated and validated with standard physical and theoretical models. Statistical models, like PCA and empirical orthogonal analysis, provide unbiased, rapid examination of data and identification of key trends in the latent variables that influence the state of the atmosphere. Application of PCA to Jupiter's Great Red Spot (GRS)-Oval(s) periodic interactions, and to seasonal changes in the brightness temperatures on Saturn, showcases its versatility for identifying the drivers/triggers that influence the observed changes in their atmospheres. I highlight the importance of: (i) placing the observations in the context of various timescales (seasonal, periodic, or episodic); (ii) combining ground-based and spacecraft observations; and (iii) the synergy between professional and amateur astronomers (a new direction in Citizen Science).
I gratefully acknowledge the assistance of various colleagues and student interns in our project and support from NASA/Planetary Astronomy Program.
Satellite observations and computational analysis synergies: aerosol smoke plume heights and impacts of fires on air quality
Maria Val Martin1 (mval@atmos.colostate.edu), Ralph Kahn2, Jennifer Logan3, Colette Heald4 and Charles Ichoku2
1 Atmospheric Science Department, Colorado State University, Fort Collins, CO.
2 NASA Goddard Space Flight Center, Greenbelt, MD.
3 School of Engineering and Applied Science, Harvard University, Cambridge, MA.
4 Department of Civil and Environmental Engineering & Department of Earth, Atmospheric, and Planetary Sciences, Massachusetts Institute of Technology, Cambridge, MA.
In recent years, the availability of large satellite datasets has provided an extraordinary opportunity to improve our understanding of the mechanisms controlling atmospheric composition. In particular, these datasets have contributed toward improving the representation of fire emissions in climate and air quality models and assessments. In this talk, I will present several examples that applied computational analysis to the multi-year record of satellite aerosol observations, which have enabled characterization of smoke fire processes and their impacts. I will discuss (a) the determination of smoke plume heights from fires over North America; (b) the investigation of the main physical factors that determine smoke plume rise; and (c) the assessment of fire impacts on aerosol loading and air quality over Colorado.
Mark Nakamura, Ph.D.
Statistical Downscaling of Two-Dimensional Wind Fields
Mark Nakamura
University of California, Los Angeles
Global Climate Models (GCMs) are dynamical computer programs that model the physical interactions that govern the Earth's climate. The power of these GCMs lies in that they allow us to make future climate predictions. GCMs produce vast amounts of data by predicting a myriad of climatic variables throughout the three dimensions of our oceans and atmosphere. One drawback of modeling a system this complex is the computational expense. The end result is global predictions at a low resolution that speak more to the general overall climate than to climatological impacts at a local level.
To produce local high-resolution predictions there are two options. The first is dynamical downscaling, which involves nesting another, higher-resolution dynamical model and using the initial GCM data as starting inputs. Alternatively, one can create statistical models that examine the relationship between local high-resolution (prediction-level) variables and the corresponding low-resolution GCM data, an approach known as statistical downscaling. Dynamical downscaling produces accurate estimates but requires substantial computational knowledge, resources, and time. Statistical downscaling techniques can be employed much faster and require less knowledge of the computational coding procedures of the dynamical model.
The focus of this talk will be on the statistical downscaling of two-dimensional wind fields. Wind fields pose a unique problem in that they include the prediction of directional data (wind direction). Directional data cannot be treated with typical off-the-shelf statistical techniques because of the data's cyclical nature (e.g., 359 degrees is very close to 1 degree). To account for this in my methods, the prediction of wind magnitude and direction are done separately, using a circular tree-based regression for wind direction and a generalized linear model for wind magnitude. To account for the non-stationarity of a prediction point (i.e., each prediction point has a unique surrounding topography), I create unique models for each prediction point. Within the prediction for one location, days are first clustered into similar wind regimes. This decreases variance within the model and ensures areas of statistical influence are captured.
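The cyclical-data pitfall is easy to demonstrate (a generic sketch of circular statistics, not the speaker's code): the arithmetic mean of 359 and 1 degrees is 180, the opposite of the true mean direction, whereas vector averaging gets it right.

```python
import math

def angular_diff(a, b):
    """Smallest signed difference between two angles in degrees:
    359 and 1 come out 2 degrees apart, not 358."""
    return (a - b + 180.0) % 360.0 - 180.0

def circular_mean(angles_deg):
    """Mean direction via vector averaging; a plain arithmetic mean
    of [359, 1] would wrongly give 180 instead of ~0 (north)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

print(angular_diff(359.0, 1.0))     # -> -2.0
print(circular_mean([359.0, 1.0]))  # close to 0, i.e. due north
```

A circular regression tree uses a distance like `angular_diff` for its split criterion and `circular_mean` for its leaf predictions.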
Vipin Kumar, Ph.D.
Understanding Climate Change: Opportunities and Challenges for Data Driven Research
Vipin Kumar
William Norris Professor and Head, Department of Computer Science and Engineering, University of Minnesota
This talk will present an overview of research being done in a large interdisciplinary project on the development of novel data driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. These information-rich datasets offer huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. This talk will discuss some of the challenges in analyzing such data sets and our early research results.
Dan Crichton, Ph.D.
Arnold Goodman, Ph.D. - Abstract 1
Susan Paddock, Ph.D.
Arnold Goodman, Ph.D. - Abstract 2
Eric Chi, Ph.D.
Benjamin Shaby, Ph.D.
Heike Hofmann, Ph.D.
Kiri L. Wagstaff, Ph.D. & Michael J. Garay, Ph.D.
Scientific Discovery and Anomaly Detection in Large Aerosol Data Sets
Kiri L. Wagstaff
Jet Propulsion Laboratory, California Institute of Technology, California
Michael J. Garay
Jet Propulsion Laboratory, California Institute of Technology, California
In the era of large scientific data sets, when it is impossible for an individual to examine every observation in detail, there is an urgent need for methods to automatically prioritize data for review. However, any such automated method must make decisions in a trustworthy, comprehensible manner. In this talk, I will describe the Discovery through Eigenbasis Modeling of Uninteresting Data (DEMUD) method, which uses principal component modeling and reconstruction error to prioritize data by its novelty. Uniquely, DEMUD also provides individual reasons for each priority decision. I will share results obtained when using DEMUD to analyze aerosol retrievals from MISR satellite data, enabling us to quickly identify interesting and unusual observations in both space and time. I will also discuss how DEMUD handles situations when data is missing, which commonly occurs in satellite aerosol retrievals for scenes with extensive cloud cover.
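The core idea, modeling already-seen ("uninteresting") data with a principal-component basis and ranking new data by reconstruction error, with the residual vector serving as the per-decision explanation, can be sketched as follows. This is a simplified static illustration on synthetic data, not the actual DEMUD implementation, which updates its eigenbasis incrementally as data are selected:

```python
import numpy as np

rng = np.random.default_rng(3)

# "Background" data the model already treats as uninteresting, plus a
# batch of new candidates, one of which is unusual in feature 5.
background = rng.normal(size=(500, 10))
candidates = rng.normal(size=(50, 10))
candidates[7, 5] += 8.0

def novelty_scores(background, candidates, k=3):
    """Score candidates by reconstruction error under a rank-k PCA
    (eigenbasis) model of the background; the residual vector shows
    which features drove each score."""
    mu = background.mean(axis=0)
    _, _, Vt = np.linalg.svd(background - mu, full_matrices=False)
    basis = Vt[:k]                          # top-k eigenbasis
    Xc = candidates - mu
    residual = Xc - Xc @ basis.T @ basis    # part the model cannot explain
    return np.linalg.norm(residual, axis=1), residual

errors, residual = novelty_scores(background, candidates)
top = int(errors.argmax())
reason = int(np.abs(residual[top]).argmax())
print(f"most novel candidate: {top} (driven by feature {reason})")
```

The residual is what makes the decision comprehensible: it says not just that an observation is novel, but which measurements made it so.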
Christopher Lynnes, Ph.D.
Giovanni-4: the next generation of an online tool for satellite data visualization, exploration and intercomparison
Christopher Lynnes
NASA/GSFC
One of the most time-consuming phases of scientific research is the identification of useful data for tackling the problem at hand. This is not simply locating datasets via spatial, temporal and semantic information. Typically it involves an intensive (often lengthy) phase of examining the data for key signatures of the phenomena under study. Earth science remote sensing makes this problem challenging due to the large data volumes, complex data formats and data structures employed. The Geospatial Interactive Online Analysis and Visualization Interface (Giovanni) was designed to provide a server-side tool for this exploratory data analysis phase. Deployed at the Goddard Earth Sciences Data and Information Services Center (GES DISC) for over a decade, Giovanni offers a variety of services to visualize the content of Earth science data archived at the GES DISC and select other data centers. Services range from basic time-averaged maps to correlation maps of variable pairs and an interactive scatterplot+map. Currently, Giovanni is undergoing a re-architecture to enhance support for data exploration by adding more interactive visualizations and improving speed, moving closer to a true interactive user experience (UX). A longer-term goal is to support user-contributed content in Giovanni.
Hai Nguyen Ph.D., Matthias Katzfuss Ph.D., Noel Cressie Ph.D., & Amy Braverman Ph.D.
Spatio-Temporal Data Fusion for Remote-Sensing Applications
Hai Nguyen
Jet Propulsion Laboratory, California Institute of Technology, California
Matthias Katzfuss
University of Heidelberg, Heidelberg, Germany
Noel Cressie
University of Wollongong, Australia
Amy Braverman
Jet Propulsion Laboratory, California Institute of Technology, California
Developing global maps of carbon dioxide (CO2) concentration near the surface can help identify locations where major amounts of CO2 are entering and exiting the atmosphere. No single instrument currently provides this information, but inferences can be made by considering a weighted difference between total column CO2 concentration observed by the Greenhouse gases Observing Satellite (GOSAT) and mid-tropospheric CO2 concentration observed by the Atmospheric InfraRed Sounder (AIRS) on the Aqua satellite. In the past, attempts to combine satellite information have been hindered by the instruments' different spatial supports and by the typically massive size of the remote sensing datasets. We describe a spatio-temporal data-fusion methodology, based on the Kalman filter and smoother, that can combine complementary datasets from multiple sources and properly account for spatial and temporal dependencies in order to produce more complete and accurate inferences. The resulting optimal predictors have computational complexity that is linear with respect to the number of observations at each time point.
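The Kalman-filter machinery underlying the fusion method can be illustrated in miniature. The toy below (an illustrative sketch, not the authors' spatio-temporal methodology, which handles massive spatial fields with differing footprints) fuses two noisy instrument records of a common slowly varying state via sequential measurement updates:

```python
import numpy as np

def fuse_kalman(y1, y2, r1, r2, q=0.01):
    """Toy 1-D Kalman filter fusing two instrument records y1, y2
    (measurement-noise variances r1, r2) of a common random-walk state.
    Each time step: one predict step, then two sequential updates."""
    x, p = y1[0], r1              # initialize from the first observation
    out = []
    for a, b in zip(y1, y2):
        p = p + q                 # predict: random-walk process noise
        for y, r in ((a, r1), (b, r2)):
            k = p / (p + r)       # Kalman gain
            x = x + k * (y - x)   # measurement update
            p = (1 - k) * p
        out.append(x)
    return np.array(out)
```

The fused estimate has lower error than either raw record, the essential benefit the abstract describes, here with computational cost linear in the number of time steps.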
William Cleveland, Ph.D. - Abstract 1
Rob Gould, Ph.D.
Jeff Hammerbacher, Ph.D.
Daniel R. Jeske, Ph.D.
Co-clustering Spatial Data Using a Generalized Linear Mixed Model With Application to Integrated Pest Management
Daniel R. Jeske
Department of Statistics, University of California - Riverside
Co-clustering has been broadly applied to many domains such as bioinformatics and text mining. However, model-based spatial co-clustering has not been studied. In this paper, we develop a co-clustering method using a generalized linear mixed model for spatial data. To avoid the high computational demands associated with global optimization, we propose a heuristic optimization algorithm to search for a near optimal co-clustering. For an application pertinent to Integrated Pest Management, we combine the spatial co-clustering technique with a statistical inference method to make assessment of pest densities more accurate. We demonstrate the utility and power of our proposed pest assessment procedure through simulation studies and apply the procedure to studies of the persea mite (Oligonychus perseae), a pest of avocado trees, and the citricola scale (Coccus pseudomagnoliarum), a pest of citrus trees.
[Joint work with Zhanpan Zhang, GE Global Research, Xinping Cui and Mark Hoddle, University of California - Riverside.]
Yaming Yu, Ph.D
Using state-space models for variance matrices to study climate patterns
Yaming Yu
University of California, Irvine
The global climate system is dominated by large-scale spatial patterns of atmospheric and oceanic variability which are often defined in terms of Empirical Orthogonal Function (EOF) analysis. EOF analysis has two main limitations: the assumption of stationarity over a long period of time and the absence of associated measures of uncertainty. We build a model for the spatial variance-covariance matrix with parametric basis functions and adopt a Bayesian approach to estimate the parameters. Posterior simulation using Markov chain Monte Carlo yields both the estimates for the parameters and the associated measures of uncertainty. A state-space model is applied to smaller time windows with less information, capturing the smoothly changing nature of the pattern over time by linking some of the parameters at successive time windows through system equations. We explore these methods and illustrate with both simulations and real data.
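For reference, the classical EOF analysis that the proposed Bayesian model extends reduces to an SVD of the centered space-time data matrix. A minimal sketch (standard textbook construction, not the speaker's code):

```python
import numpy as np

def eof_analysis(field, n_modes=2):
    """Classical EOF analysis via SVD. Rows of `field` are time steps,
    columns are spatial locations. Returns the leading spatial patterns
    (EOFs), their principal-component time series, and the fraction of
    variance explained by each mode."""
    anomalies = field - field.mean(axis=0)          # remove the time mean
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    eofs = Vt[:n_modes]                  # spatial patterns
    pcs = U[:, :n_modes] * s[:n_modes]   # amplitude time series per mode
    return eofs, pcs, var_frac[:n_modes]
```

Note that this construction yields a single stationary pattern with no uncertainty estimate, exactly the two limitations the abstract's state-space approach addresses.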
Barbara A. Bailey, Ph.D.
Nonlinear Models for Predicting Plankton Ecosystem Dynamics
Barbara A. Bailey
San Diego State University
Time series of physical and biological properties of the ocean are a valuable resource for developing models for ecological forecasting and ecosystem-based management. Both the physics of the oceans and the organisms living in them can exhibit nonlinear dynamics. We describe the development of a nonlinear model that predicts the abundance of the important zooplankton species Calanus finmarchicus from hydrographic data from the Gulf of Maine. The results of a neural network model, including model diagnostics, forecasts, and dynamical quantities, are presented. The best neural network model based on generalized cross validation includes variables of C. finmarchicus abundance, herring abundance, and the state of the Gulf of Maine waters, with meaningful time lags. Forecasts are constructed for the model fit to 1978-2003 bimonthly data, and corresponding forecast intervals are obtained by the stationary bootstrap.
James Harner, Ph.D.
Michael Limcaco, Ph.D.
Gayn B. Winters, Ph.D.
Robert Allen, Ph.D.
Session: Climate Data Analysis: From Satellites to Climate Models
Heterogeneous Warming Agents and Widening of the Tropical Belt
Robert Allen
University of California, Riverside
Observational analyses have shown the width of the tropical belt increasing in recent decades as the world has warmed. This expansion is important because it is associated with shifts in large-scale atmospheric circulation and major climate zones. Although recent studies have attributed tropical expansion in the Southern Hemisphere to ozone depletion, the drivers of Northern Hemisphere expansion are not well known and the expansion has not so far been reproduced by climate models. Here we use a climate model with detailed aerosol physics to show that increases in heterogeneous warming agents, including black carbon aerosols and tropospheric ozone, are noticeably better than greenhouse gases at driving expansion, and can account for the observed summertime maximum in tropical expansion. Mechanistically, atmospheric heating from black carbon and tropospheric ozone has occurred at the mid-latitudes, generating a poleward shift of the tropospheric jet, thereby relocating the main division between tropical and temperate air masses. Although we still underestimate tropical expansion, the true aerosol forcing is poorly known and could also be underestimated. Thus, although the insensitivity of models needs further investigation, black carbon and tropospheric ozone, both of which are strongly influenced by human activities, are the most likely causes of observed Northern Hemisphere tropical expansion.
Charlie Zender, Ph.D., Pedro Vicente, WenShan Wang
Session: Climate Data Analysis: From Satellites to Climate Models
The Future of Model Evaluation
Charlie Zender
University of California, Irvine
Pedro Vicente
WenShan Wang
Geoscientific model evaluation often means comparing model simulations in netCDF format to satellite observations in HDF format. Analysis techniques that exploit the hierarchical structure of these self-describing data formats can be more intuitive, simple, and efficient than traditional analysis techniques. To unleash the full power of HDF/netCDF4 storage capabilities, one must have tools to manipulate and aggregate disparate datasets into larger structures that facilitate parallel processing. We describe our recent progress extending the netCDF Operators (NCO) to process netCDF and HDF-EOS datasets that use hierarchical groups. We illustrate our approach by showing how much easier it is to characterize, evaluate, and intercompare Earth System Model-simulated (CMIP5 archive) and satellite-retrieved trends with the group and HDF-EOS "aware" NCO compared to previous methods.
Joel Norris, Ph.D.
Session: Climate Data Analysis: From Satellites to Climate Models
Statistical correction of satellite cloud data for climate change studies
Joel Norris
Scripps Institution of Oceanography, University of California, San Diego
Clouds play a key role in the climate system and are one of the biggest uncertainties in our understanding of climate change. Investigation of cloud changes in recent decades is severely impeded by inhomogeneities in the observational record. Recent work by the presenter, however, demonstrates that statistical techniques are able to remove spurious variability from the satellite cloud record. This enables identification of regional patterns of cloud change that resemble those projected by climate models to occur for global warming.
Toshihisa Matsui, Ph.D.
Ian Misner, Ph.D.
Atanas Radenski Ph.D. & Louis Ehwerhemuepha Ph.D
Gennady Verkhivker, Ph.D.
Michael Mahoney, Ph.D. - Abstract 2
Session: Random Solutions to Big Problems
Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments
Michael Mahoney
Stanford
Motivated by problems in large-scale data analysis, randomized algorithms for matrix problems such as regression and low-rank matrix approximation have been the focus of a great deal of attention in recent years. These algorithms exploit novel random sampling and random projection methods; and implementations of these algorithms have already proven superior to traditional state-of-the-art algorithms, as implemented in LAPACK and other high-quality scientific computing software, for moderately-large problems stored in RAM on a single machine. Here, we describe the extension of these methods to computing high-precision solutions in parallel and distributed environments that are more common in very large-scale data analysis applications.
In particular, we consider both the Least Squares Approximation problem and the Least Absolute Deviation problem, and we develop and implement randomized algorithms that take advantage of modern computer architectures in order to achieve improved communication profiles. Our iterative least-squares solver, LSRN, is competitive with state-of-the-art implementations on moderately-large problems; and, when coupled with the Chebyshev semi-iterative method, scales well for solving large problems on clusters that have high communication costs such as on an Amazon Elastic Compute Cloud cluster. Our iterative least-absolute-deviations solver is based on fast ellipsoidal rounding, random sampling, and interior-point cutting-plane methods; and we demonstrate significant improvements over traditional algorithms on MapReduce. In addition, this algorithm can also be extended to solve more general convex problems on MapReduce.
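The simplest member of this family of methods, sketch-and-solve least squares, conveys the core idea: compress a tall n x d problem with a random projection and solve the small compressed problem exactly. This is a crude cousin of LSRN, which instead uses the random projection to build a preconditioner for a high-precision iterative solver; the sketch below is illustrative only:

```python
import numpy as np

def sketched_lstsq(A, b, sketch_rows, rng):
    """Sketch-and-solve least squares: project the n x d problem down to
    sketch_rows x d with a Gaussian random matrix, then solve the small
    problem exactly. Accuracy improves as sketch_rows grows."""
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
b = rng.standard_normal(5000)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)       # full-data solution
x_sketch = sketched_lstsq(A, b, sketch_rows=400, rng=rng)
```

With a few hundred sketch rows the sketched residual is within a small factor of the optimum, at a fraction of the cost of touching all 5000 rows; the high-precision solvers described in the talk close that remaining gap iteratively.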
Miles Lopes, Ph.D.
Session: Random Solutions to Big Problems
A More Powerful Two-Sample Test in High-dimensions using Random Projection
Miles Lopes
University of California, Berkeley
We study the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing the data dimension $p$ to exceed the sample size $n$. Specifically, we propose a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T^2 statistic. Working under a high-dimensional framework with (p,n) tending to infinity, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Lastly, using ROC curves generated from simulated data, we demonstrate superior performance over competing tests in the parameter regimes anticipated by our theoretical results.
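The construction can be sketched in a few lines: project both samples from p dimensions down to k << p with a random Gaussian matrix, then compute the ordinary Hotelling T^2 on the projected data, which is well-defined because k is below the sample size even when p is not. This is a minimal illustration in the spirit of the proposal, not the paper's exact statistic or calibration:

```python
import numpy as np

def rp_hotelling(X, Y, k, rng):
    """Two-sample mean test via random projection: map p-dimensional
    samples X (n x p) and Y (m x p) to k dimensions, then compute the
    classical Hotelling T^2 statistic on the projected data."""
    n, p = X.shape
    m = Y.shape[0]
    P = rng.standard_normal((p, k)) / np.sqrt(k)   # random projection
    Xp, Yp = X @ P, Y @ P
    d = Xp.mean(axis=0) - Yp.mean(axis=0)
    S = (np.cov(Xp, rowvar=False) * (n - 1)
         + np.cov(Yp, rowvar=False) * (m - 1)) / (n + m - 2)  # pooled cov
    return (n * m / (n + m)) * float(d @ np.linalg.solve(S, d))
```

In a p > n regime the pooled p x p covariance is singular and the classical T^2 is unusable, whereas the k x k projected covariance is invertible and the statistic clearly separates shifted from unshifted samples.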
George Papandreou, Ph.D.
Session: Random Solutions to Big Problems
Being Friends with Noise: Probabilistic Machine Learning in Computer Vision and Multimodal Perception
George Papandreou
University of California, Los Angeles
Machine learning allows us to automatically reason about data. It plays an increasingly important role in building computer vision and multimodal perception systems which are able to interpret the ever growing volume of images and videos available in digital form. In these domains we typically deal with complex sensory signals that feature strongly stochastic aspects such as missing data and noisy measurements.
Probabilistic Bayesian machine learning methods are particularly well-suited for describing such ambiguous data, providing a natural conceptual framework for quantifying the uncertainty in interpreting them. I will illustrate with examples from my work in image modeling, computer vision, and audiovisual speech recognition, that the Bayesian approach can be very fruitful in practical applications, allowing us to learn model parameters and adaptively fuse heterogeneous information sources in a principled fashion.
Despite these advantages, applying probabilistic techniques to large-scale data such as those arising in computer vision can pose significant computational challenges and alternative optimization-based deterministic methods are often preferred. I will describe my recent research on Perturb-and-MAP random sampling which brings powerful techniques from optimization into probabilistic modeling, making Bayesian inference computationally tractable for challenging computer vision problems such as image inpainting and deblurring, image segmentation, and scene labeling.
Daven Henze, Ph.D.
Tracey Holloway
Session: From Large Earth Science Datasets to Compelling Scientific Results
Satellite and Model Data to Support Air Quality Management
Tracey Holloway
University of Wisconsin
Jaechoul Lee, Ph.D.
Simon Urbanek, Ph.D.
William Cleveland, Ph.D. - 2
Nicholas Lewin-Koh, Ph.D.
Maria Val Martin, Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Satellite observations and computational analysis synergies: aerosol smoke plume heights and impacts of fires on air quality
Maria Val Martin
Colorado State University
In recent years, the availability of large satellite datasets has provided an extraordinary opportunity to improve our understanding of the mechanisms controlling atmospheric composition. In particular, these datasets have contributed toward improving the representation of fire emissions in climate and air quality models and assessments. In this talk, I will present several examples that applied computational analysis to the multi-year record of satellite aerosol observations, which have enabled characterization of smoke fire processes and their impacts. I will discuss (a) the determination of smoke plume heights from fires over North America; (b) the investigation of the main physical factors that determine smoke plume rise; and (c) the assessment of fire impacts on aerosol loading and air quality over Colorado.
Charles Ichoku, Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Detailed global evaluation of aerosol measurements from multiple satellite sensors
Charles Ichoku
NASA Goddard Space Flight Center
Atmospheric aerosols are routinely retrieved from measurements acquired by such spaceborne sensors as MODIS on the Terra and Aqua satellites, MISR on Terra, OMI on Aura, POLDER on the French PARASOL, CALIOP on CALIPSO, and SeaWiFS on SeaStar. The aerosol measurements collected by these instruments over the last decade constitute an unprecedented availability of the most complete set of complementary aerosol measurements ever acquired. Overall, there are 11 different products from these 7 spaceborne sensors, because aerosols are retrieved from MODIS over land using different algorithms. To derive the full scientific benefit of this diversity of measurements by using them synergistically, they are being carefully and uniformly analyzed in a comparative manner, in order to understand their uncertainties and limitations using coincident ground-based aerosol measurements from the Aerosol Robotic Network (AERONET). In this presentation, we will show results of detailed statistical analysis of these products, which revealed their relative strengths and limitations over different locations around the world, thereby illustrating which measurements are most reliable in different regions and over different land-cover types.
Mian Chin, Ph.D.
Session: From Large Earth Science Datasets to Compelling Scientific Results
Multi-decadal variations of aerosols from multi-platform data and model from 1980 to 2009
Mian Chin
NASA Goddard Space Flight Center
We present a global model analysis of aerosol trends from 1980 to 2009 in different land and oceanic regions of the world, and assess the impact of anthropogenic and natural emissions on those trends. Aerosol optical depths simulated by the global model GOCART are compared with long-term data from satellite retrievals (AVHRR, TOMS, SeaWiFS, MODIS, and MISR) and ground-based sunphotometer (AERONET) measurements, and simulated surface concentrations are compared with measurements from the IMPROVE network in the U.S., the EMEP network in Europe, and the University of Miami-managed sites on islands in the oceans. We examine the relationship between emissions, surface concentrations, and column AOD in pollution-, dust-, and biomass-burning-dominated source regions and downwind areas, and assess the anthropogenic impact on global and regional aerosol trends.
Kyle Caudle
Session: Data Science Methods
Multivariate Wavelet Density Estimation for Streaming Data – A parallel programming approach
Kyle Caudle
South Dakota School of Mines and Technology
Data streams present unique challenges that are not normally encountered in standard statistical analysis. Foremost is the fact that data arrive at such a high rate that storing the data and analyzing them later is no longer feasible. Understanding the underlying distribution of the data can improve our understanding of how systems operate under stable, status quo conditions. We propose a method for performing multivariate wavelet density estimation in large dimensions (i.e., 5 or more) by processing each of the wavelet functions and scaling functions in parallel and then piecing the results back together. Current work utilizes code-generating software that automatically produces multidimensional code based on the selected number of dimensions. Once the underlying density is constructed, tests are performed to check for changes in the underlying density function as new data arrive.
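The streaming aspect of wavelet density estimation can be illustrated in one dimension (a toy sketch of the general idea, not the talk's multivariate parallel implementation): with Haar scaling functions, each coefficient is just a running mean of basis-function evaluations, so it can be updated one observation at a time without storing raw data:

```python
import numpy as np

class StreamingHaarDensity:
    """Streaming density estimate on [0, 1) using Haar scaling functions
    at resolution level j. Each coefficient c_{j,k} is the running mean
    of phi_{j,k}(x) over the stream, so no raw observations are kept."""
    def __init__(self, j=4):
        self.j = j
        self.coef = np.zeros(2**j)   # running means of phi_{j,k}(x)
        self.n = 0

    def update(self, x):
        phi = np.zeros_like(self.coef)
        k = min(int(x * 2**self.j), 2**self.j - 1)
        phi[k] = 2**(self.j / 2)                  # phi_{j,k}(x)
        self.n += 1
        self.coef += (phi - self.coef) / self.n   # streaming mean update

    def pdf(self, x):
        k = min(int(x * 2**self.j), 2**self.j - 1)
        return self.coef[k] * 2**(self.j / 2)
```

In higher dimensions the coefficients multiply as tensor products of such one-dimensional functions, which is where updating each basis function in parallel, as the abstract proposes, becomes attractive.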
Mengta Dan Yang, Ph.D.
Jinwon Kim, Ph.D.
Session: Massive Data Challenges in Numerical Weather Modeling
Evaluation of surface climate fields in the NARCCAP hindcast experiment using JPL Regional Climate Model Evaluation System
J. Kim
Joint Institute for Regional Earth System Science and Engineering, University of California, Los Angeles, CA
D.E. Waliser
Joint Institute for Regional Earth System Science and Engineering, University of California, Los Angeles, CA,
Jet Propulsion Laboratory,California Institute of Technology, Pasadena, CA
C.A. Mattmann
C.E. Goodale
A.F. Hart
D.J. Crichton
H. Lee
P.C. Loikith
S. McGinnis
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA
L.O. Mearns
Institute for Mathematical Applications to the Geosciences, National Center for Atmospheric Research, Boulder, CO
The surface air temperature, precipitation, and surface insolation, the three key fields in shaping the surface hydrology and atmosphere-land interaction, simulated by multiple RCMs that have participated in the NARCCAP hindcast experiment are evaluated for the conterminous U.S. region and the period 1980-2003 using the Regional Climate Model Evaluation System (RCMES). Findings in this study illustrate that all models reasonably simulate the spatial pattern and variability of the annual-mean climatology of these three fields in the conterminous U.S. and that the model performance in simulating the spatial variability varies according to RCMs more widely than the spatial pattern. A number of systematic model biases in simulating these variables have also been found. For the annual mean climatology, all five RCMs generate warm bias over the Great Plains and the California Central Valley and cold bias over the coastal regions along the Atlantic Ocean and the Gulf of Mexico. The warm bias over the Great Plains occurs in both summer and winter; however, the model bias in other regions varies considerably according to models and seasons. The most notable errors common to the majority of these RCMs in simulating the annual-mean precipitation include wet bias in the Pacific Northwest region and dry bias in the Gulf of Mexico and southern Great Plains regions, especially for the inland Pacific Northwest region in all seasons and for the Arizona-New Mexico region in summer. In terms of the normalized RMSE, all RCMs perform better for the eastern half of the U.S. than for the western half. Most RCMs overestimate the annual-mean surface insolation over the conterminous U.S. region. All RCMs show either larger positive bias or smaller negative bias in the eastern half of the conterminous U.S. than in the western half. 
For all RCMs and their ensemble, the spatial pattern of the insolation bias is negatively correlated with that of the precipitation bias, suggesting that the biases in precipitation and surface insolation are related, most likely via the cloud fields. For the three model fields evaluated in this study, the multi-model ensemble appears to be among the best performers for all metrics, regions, and seasons. The systematic variations in model errors according to regions, seasons, variables, and metrics found in this study suggest that the bias correction, a key step in applying climate model data to assess the climate impact on various sectors, may have to be performed differently for regions, for seasons, for variables, and for assessment models.
Robert Walko, Ph.D.
Session: Massive Data Challenges in Numerical Weather Modeling
Use of variable-resolution gridding in the Ocean-Land-Atmosphere Model (OLAM) for optimal utilization of resources on large and small computers
R.L. Walko
Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL
D.M. Medvigy
Department of Geosciences and Program in Atmospheric and Oceanic Sciences, Princeton University, Princeton, NJ
M. Otte
Atmospheric Modeling and Analysis Division, US Environmental Protection Agency, Research Triangle Park, NC
R. Avissar
Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL
Atmospheric, oceanic, hydrological, ecosystem, and other environmental modeling systems are capable of consuming an enormous number of computing cycles and generating an enormous quantity of digital output available for subsequent analysis and post-processing. Management of model simulations and related data communication, storage, and processing rank among the major ‘Big Data’ challenges in the Environmental Sciences, alongside storage and processing of the ever-increasing stream of observational data. It has been argued that along with Big Data comes a growing need and obligation for ‘Big Judgment’, which for numerical modeling includes planning ahead and exercising insight and intuition in order to optimally design model simulations for maximum yield of useful information for a given allocation of computing resources. Variable-resolution computational grids provide one means of increasing the benefit-to-cost ratio in many modeling applications. A common example is regional climate modeling, which concentrates high resolution over geographic regions that are of key interest or importance, while covering the remainder of the planet with lower grid resolution that is much less costly in both computational cycles and data storage. The Ocean-Land-Atmosphere Model (OLAM), a novel environmental simulation system that incorporates seamless variable-resolution grid methods, is used to describe and demonstrate applications of this technique. We present examples from both regional and global modeling applications where spatially-selective higher resolution provides substantial overall benefits compared to uniform resolution. Advantages of the seamless grid over the more traditional nested grid technique are also discussed.
Craig Tremback, Ph.D.
Seon K. Park, Ph.D.
Session: Massive Data Challenges in Numerical Weather Modeling
Development of an Integrated Prediction System for Climate-Environment-Ecosystem Interactions and Corresponding GIS-based Database and Web Display System
Seon K. Park
Department of Environmental Science and Engineering, Ewha Womans University, Seoul, Korea
Center for Climate/Environment Change Prediction Research (CCCPR), Ewha Womans University, Seoul, Korea
Kyehyun Kim
Department of Geoinformatic Engineering, Inha University, Incheon, Korea
Hyo Hyun Sung
Department of Social Studies, Ewha Womans University, Seoul, Korea
Yong-Sang Choi
Department of Environmental Science and Engineering, Ewha Womans University, Seoul, Korea
Center for Climate/Environment Change Prediction Research (CCCPR), Ewha Womans University, Seoul, Korea
Climate change affects the various components of the global/regional environmental system, including the atmosphere, hydrosphere, biosphere, and land surfaces, which interact nonlinearly with each other. These components in turn exert an impact on climate change itself via positive/negative feedback processes. There has been comparatively little effort to investigate interactions between the environmental system and climate change; in particular, the feedback processes associated with environmental components, through macro- and micro-scale changes, remain poorly understood.
The Center for Climate/Environment Change Prediction Research (CCCPR) aims to develop an integrated prediction system for climate-environment-ecosystem interactions. We conduct core research to identify nonlinear interactions and related feedback processes in the climate-environment system. To achieve this goal, the research efforts of the center are divided into three strongly connected themes: 1) climate/atmospheric environment prediction; 2) ecology/water environment prediction; and 3) development of an interaction diagnosis/prediction system.
For climate/atmospheric environment prediction, we conduct research on climate change analysis and scenario production as well as atmospheric chemistry/aerosol analysis and prediction. In particular, nonlinear feedback processes between climate and the atmospheric environment are studied in depth. In studying ecology/water environment prediction, we focus on 1) analysis and prediction of vegetation and ecosystem responses to climate change; 2) analysis of changes in water and soil chemistry; and 3) development of ecosystem/water environment prediction models. The feedback of changes in water and soil chemistry on the ecosystem and water environment, which in turn affects climate change, is studied in detail. To develop the interaction diagnosis/prediction system, we are developing 1) coupled atmosphere-land surface process modeling; 2) remote sensing observation and monitoring techniques; and 3) an interface and integrated database (DB). In this task, the data interface is developed, and the GIS-based integrated database and web display system is operated to consolidate all product data for sharing and distribution within CCCPR and among other research communities. Further details will be discussed.
Registration Fees
General Admission, Full Conference: $300 (early bird), $375 (at door)
General Admission, One Day Only: $175 (early bird and at door)
Student with ID: $100
Chapman Faculty and Staff: $50