Professor, Department of Computer Science
Founding Director, Worldwide Computing Laboratory
cvarela AT cs DOT rpi DOT edu
Ph: +1 (518) 276-6912,
Fax: +1 (518) 276-4033,
Office: Lally 308.
Our current main research interest is the investigation of fundamental distributed computing and software engineering principles to enable intelligent data-driven aero-space systems.
We are at a crossroads of significant new advances in distributed computing, big data analytics, flight sensor technologies, and machine learning. This crossroads presents an unparalleled opportunity for smarter flight systems. In particular, we are interested in performing fundamental research leading to an Internet of Planes platform, providing pilots and autonomous vehicles with unprecedented levels of real-time collaborative situational awareness using edge computing on distributed sensor data. We are also interested in fundamental research on software development and verification techniques for data-driven (and thus, stochastic) flight systems. Specifically, we want to investigate the notion of safety envelopes, to formally capture the conditions under which data-driven flight systems (such as the distributed control system of a smart wing) are guaranteed to behave correctly.
While the potential applications and increased level of capability and safety of next-generation flight are enormous, the research challenges are commensurate. We are interested in forming an interdisciplinary Center for Intelligent Flight Systems tackling key open problems in the following thrusts: cyber physical systems, concurrent programming, and distributed computing development. We expect fundamental research results in these directions to also be applicable to other vehicle networks (e.g., of self-driving cars), and to distributed data stream analytics in other domains (e.g., health informatics). In the following sections, we will illustrate research directions we intend to pursue, along with prior research experience and key results in each of these thrusts.
We intend to further develop the field of aero-space informatics, i.e., the use of computing technology in air and space systems. As part of collaborative research, we will create new programming languages, new distributed computing algorithms, new statistical models, new machine learning frameworks, and their synergistic interaction, in order to further facilitate the development of efficient, secure, reliable, adaptive, and provably correct data-driven aero-space computer systems.
We have worked on dynamic data-driven avionics to improve decision support for pilots in emergency conditions (Imai et al., 2017; Imai et al., 2017; Paul et al., 2018; Paul et al., 2018). We invented error signatures, mathematical function patterns that can be used to detect errors in data streams with high probability by explicitly using available redundancy (Imai and Varela, 2012). Furthermore, we developed a declarative programming language, called PILOTS, to automatically compute error signatures from spatio-temporal data streams (Imai and Varela, 2012). Applying machine learning to the discovery of error signatures, we developed a variation of Bayesian classification that is able not only to adjust the probability distributions in the failure models upon observing new evidence from flights, but also to detect statistically significant new modes of operation that were not seen during the training phase (Imai et al., 2017; Chen et al., 2018).
To validate these new models, languages, algorithms, and associated computing technology, we have used data from real-world commercial accidents. For example, Air France flight 447 in June 2009 crashed into the Atlantic Ocean due to erroneous air speed data received from iced pitot tube sensors. Even though there is a trigonometric relationship between air speed, ground speed (sensed by independent satellite-based GPS sensors), and wind speed (e.g., from weather forecast models) that could have been used to detect the erroneous data, and temporarily infer it from the redundantly available information, the autopilot disengaged itself instead, leaving the human pilots in a difficult situation and starting the chain of errors that ultimately caused the accident. Using PILOTS, we were able to detect and correct for the erroneous airspeed data in less than five seconds (Imai et al., 2017). We have also illustrated how PILOTS’s self-healing spatio-temporal data streams could have been applied to the Tuninter 1153 flight that ditched in the Mediterranean Sea in August 2005, after a wrong fuel quantity indicator was installed in the aircraft, leading the pilots to believe they had 2,000 kg more fuel than they actually had and eventually starving the engines (Imai et al., 2015).
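The redundancy between air speed, ground speed, and wind speed mentioned above can be illustrated with a small Python sketch. It is not PILOTS code and does not reproduce the published error signatures; it only assumes that the ground velocity is the vector sum of the air velocity and the wind velocity. The function names, the units, and the convention that the wind direction is the direction the wind blows toward are assumptions made for this illustration.

```python
import math


def expected_ground_speed(airspeed, heading_deg, wind_speed, wind_dir_deg):
    """Magnitude of (air velocity + wind velocity), via the law of cosines.

    Assumption: wind_dir_deg is the direction the wind blows TOWARD,
    measured on the same compass as heading_deg.
    """
    delta = math.radians(heading_deg - wind_dir_deg)
    return math.sqrt(
        airspeed ** 2
        + 2.0 * airspeed * wind_speed * math.cos(delta)
        + wind_speed ** 2
    )


def speed_error(ground_speed, airspeed, heading_deg, wind_speed, wind_dir_deg):
    """Residual between the measured and the expected ground speed.

    Near zero when the three sources agree; a large value indicates
    that at least one of them is faulty.
    """
    return ground_speed - expected_ground_speed(
        airspeed, heading_deg, wind_speed, wind_dir_deg
    )


def infer_airspeed(ground_speed, track_deg, wind_speed, wind_dir_deg):
    """Estimate airspeed as the magnitude of (ground velocity - wind velocity)."""
    g = math.radians(track_deg)
    w = math.radians(wind_dir_deg)
    ax = ground_speed * math.cos(g) - wind_speed * math.cos(w)
    ay = ground_speed * math.sin(g) - wind_speed * math.sin(w)
    return math.hypot(ax, ay)
```

With a 50-knot tailwind and a true airspeed of 400 knots, the measured ground speed is 450 knots and the residual is zero. If the pitot tubes ice over and the airspeed reading drops to 200 knots, the residual jumps to 200 knots, and the airspeed can be re-estimated from the ground speed and wind alone.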
We have also investigated dynamic data-driven trajectory generation upon loss of thrust emergencies to accurately and promptly detect plausible landing sites and generate flyable trajectories to help pilots after engine failures. An example of this is the Hudson River ditching of US Airways 1549 in January 2009. Up to 28 seconds after engines were damaged due to striking birds, we were able to generate valid flight trajectories to La Guardia airport runways assuming full loss of thrust conditions (Paul et al., 2017). The left engine still had 30% power, but it is unclear how much partial thrust was being generated. Using aircraft sensors’ data in real time, it is possible to create more accurate damaged airplane models with appropriately quantified uncertainty (Darema, 2004; Allaire and Willcox, 2014; Oden et al., 2013) and present better flight plan choices to pilots. Significant research is needed to be able to continue to generate trajectories in sub-second time (currently, our software takes under 50 ms to generate a trajectory (Paul et al., 2018)) while considering terrain, traffic, wind, and other dynamic conditions.
In the future, we will explore hybrid offline/online computation, multi-fidelity models, incremental algorithms, and decentralized cloud (fog/edge) computing techniques, among others, in further evolution of intelligent flight systems research. We also plan to study decentralized streaming data analytics algorithms on the Internet of Planes, taking advantage of more and more digital information being available in modern cockpits thanks to the FAA’s Next Generation Air Transportation System, which requires all aircraft in controlled airspace to use Automatic Dependent Surveillance-Broadcast (ADS-B) by 2020. We also intend to study how to specify and verify the correctness of the behavior of non-deterministic distributed avionics systems. With the advent of data-driven machine learning techniques, intelligent flight systems of the future will need to adapt and learn, but software verification will become more challenging as a result. We intend to investigate efficiently computable modal logics and constraint programming models to support real-time semi-automated reasoning about spatio-temporal data streams.
We will aim to continue to use aerospace as a motivating application domain rich with complex requirements, while producing fundamental cyber physical systems research results that can apply to other domains.
An Internet of Planes (IoP) distributed computing platform is inherently decentralized and heterogeneous, with rapidly changing connectivity. Computer systems enabling applications over the IoP must thus be dynamically reconfigurable. To tackle the complexity of developing dynamically reconfigurable, scalable, reliable, open distributed computing systems, collaborative research is needed in the following thrusts: reliable concurrent software, adaptive middleware, and decentralized distributed algorithms.
We have worked on foundations of distributed computing software (Varela, 2013) and its application to astroinformatics: data-driven discovery of Milky Way structure and evolution (Cole et al., 2010; Cole et al., 2008).
Concurrent systems are harder to program than sequential ones because of their inherent nondeterminism, creating (i) potential race conditions, (ii) the need for synchronization to safely access shared resources, which itself can lead to deadlock, and in general, (iii) the need to reason about exponentially many plausible execution schedules. Distribution of concurrent systems introduces potential partial failures, heterogeneity of communication, and the potential for operation with disconnected system sub-components. Mobility of users, hardware, and software adds another dimension to concurrent systems complexity. However, mobile distributed systems are necessary to support the elasticity and adaptability requirements of new computing paradigms, including next generation cloud computing on mobile devices as required by edge/fog computing and the Internet of Things (Buyya et al., 2018).
We have advocated for the use of the actor model of concurrent computation (Hewitt, 1977; Agha, 1986) to tackle the complexity of developing mobile distributed systems. Actors go beyond the state encapsulation afforded by objects, i.e., actors are also a unit of concurrency. In response to a message, an actor may create new actors, change its internal state, or send messages to other known actors. Because of the asynchronous nature of communication in the actor model, actors are also a natural unit of distribution and mobility. With Gul Agha (UIUC), we developed SALSA, a programming language with first-class support for actors, distribution, and mobility (Varela and Agha, 2001). In this actor language, we created new high-level linguistic abstractions to facilitate common coordination patterns, including named tokens, join blocks, and first-class continuations (Varela, 2001). We formalized these abstractions in terms of FeatherWeight SALSA, a fully-expressive kernel subset of the SALSA programming language first introduced in (Varela, 2013).
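The actor behavior described above (state encapsulation, one message processed at a time, asynchronous sends) can be sketched in plain Python. This is not SALSA syntax and not how SALSA is implemented; the Actor base class, the thread-per-actor design, and the Counter and Collector examples are hypothetical names chosen only to illustrate the model.

```python
import queue
import threading

_STOP = object()  # private sentinel used to shut an actor down


class Actor:
    """An actor owns its state and a mailbox, and handles one message at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish after the messages already in its mailbox."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Changes its internal state or sends a message to another known actor."""

    def __init__(self):
        self.count = 0  # set before the actor thread starts
        super().__init__()

    def receive(self, message):
        kind, payload = message
        if kind == "add":
            self.count += payload
        elif kind == "report":
            payload.send(("total", self.count))


class Collector(Actor):
    """Stores every message it receives so a caller can read it later."""

    def __init__(self):
        self.results = queue.Queue()
        super().__init__()

    def receive(self, message):
        self.results.put(message)


def demo():
    collector = Collector()
    counter = Counter()
    for n in (1, 2, 3):
        counter.send(("add", n))
    counter.send(("report", collector))
    result = collector.results.get(timeout=5)
    counter.stop()
    collector.stop()
    return result
```

Calling demo() returns ("total", 6). No locks are needed around count because only the Counter's own thread ever touches it, which is the property that makes actors a convenient unit of concurrency, distribution, and mobility.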
With John Field (formerly at IBM Research, now at Google), we introduced a new programming model, called transactors, to extend actors with state dependence information to be able to checkpoint and recover from individual temporary node failures. The key aspect of transactors is to ensure that a local checkpoint corresponds to a globally consistent state. When a transactor fails, all transactors whose current transient state depends on the failed transactor roll back to a global state previously known to be consistent. The model builds on the observation that an actor can only depend on another actor if it was created by that actor, or if it updates its internal state in response to a message whose payload is dependent on that actor. We formalized transactors by creating the tau-calculus, specifying its operational semantics as a labeled transition system in the spirit of (Agha et al., 1997), and we proved its safety and liveness properties (Field and Varela, 2005). With David Musser (RPI), we developed generic human-readable machine-checkable methods for formal proofs about actor systems in Athena, a dual deduction and computation language. We investigated proving application-level progress for a simple ticker-clock system and a more complex dining philosophers system, developing in the process reusable generic system-level fairness and deadlock freedom lemmas and associated proofs (Musser and Varela, 2013).
Distributed systems developed using the SALSA programming language can be dynamically reconfigured by moving actors to different run-time environments during execution. With Boleslaw Szymanski (RPI) and former Ph.D. student Kaoutar El Maghraoui (now at IBM Research), we developed a middleware, called the Internet Operating System (IOS), that profiles distributed resources, and opportunistically migrates actors to nodes where resources are available (El Maghraoui et al., 2006). IOS consists of a peer-to-peer network of middleware agents that use a randomized work-stealing protocol (inspired by (Blumofe and Leiserson, 1999)) to migrate application-level actors and balance the load of the underlying computational resources in a decentralized way. We further investigated the notion of malleability of actors, the ability to dynamically change their granularity (Desell et al., 2007). We witnessed significant synthetic workload performance improvements by either splitting actors, to better use hierarchical memory, in particular, improving cache hit rates; or by merging actors, decreasing contention and context switching overhead. We also applied malleability to MPI processes illustrating its applicability beyond the actor model (El Maghraoui et al., 2007). The advantage of actors is that migration is transparent to programmers. Malleability, on the other hand, requires application developers to define split and merge behavior. The IOS middleware decides when reconfiguration in the form of migration or malleability happens. With former Ph.D. student Travis Desell (now Associate Professor at Rochester Institute of Technology), we improved the scalability and performance of locally concurrent SALSA programs (e.g., for multi-core and GPUs) by defining a new language called SALSA Lite (Desell and Varela, 2014). With former Ph.D. student Wei-Jen Wang (now Associate Professor at National Central U., Taiwan), we developed the pseudo-root approach as a new technique for garbage collection of distributed mobile actors (Wang and Varela, 2006).
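The idea of decentralized load balancing through randomized work stealing can be illustrated with a toy simulation. The sketch below is a deliberate simplification and not the IOS protocol: it assumes each node knows only its own actor count, probes one uniformly random peer per round, and migrates a single actor when the peer holds at least two more actors than it does. The function names and the load representation are hypothetical.

```python
import random


def steal_round(loads, rng):
    """One round: every node probes one random peer and may steal one actor."""
    n = len(loads)
    for thief in range(n):
        victim = rng.randrange(n)
        # Steal only if it strictly reduces the imbalance between the two nodes.
        if victim != thief and loads[victim] - loads[thief] >= 2:
            loads[victim] -= 1  # the actor migrates away from the victim...
            loads[thief] += 1   # ...and resumes on the thief


def balance(initial_loads, rounds, seed=0):
    """Run several stealing rounds and return the resulting actor counts."""
    rng = random.Random(seed)
    loads = list(initial_loads)
    for _ in range(rounds):
        steal_round(loads, rng)
    return loads
```

For example, balance([12, 0, 0, 0], rounds=1000) starts with all twelve actors on one node. Stealing never creates or destroys actors, and because a steal happens only when the gap is at least two, the heaviest node never gets heavier. No node needs global knowledge of the system load, which is what makes the approach attractive for peer-to-peer middleware.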
Distributed computing over volunteered resources (e.g., as advocated by BOINC (Anderson, 2004)) requires redundant computation to ensure that erroneous results from malicious users are not considered. We investigated adaptive redundancy in conjunction with asynchronous evolutionary algorithms (e.g., asynchronous genetic search (Desell et al., 2010)) to significantly reduce the need for replication. Essentially, only when a result is positive (i.e., it can improve the current population) does it get into a queue for validation. This relatively simple strategy, though it may miss a few good (uncomputed) results, greatly reduces the need for redundancy and thereby improves the effective utilization of the volunteer computing network (Desell et al., 2010).
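The validation policy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: the population is a fixed-capacity list of (fitness, candidate) pairs with higher fitness being better, and validate is a hypothetical callback that recomputes a candidate's fitness on a trusted or independent host.

```python
import math


def process_result(population, candidate, reported_fitness, validate, capacity):
    """Handle one volunteer result; return 'discarded', 'rejected', or 'accepted'.

    population: list of (fitness, candidate) pairs, modified in place.
    validate:   hypothetical callback that recomputes the candidate's fitness.
    """
    if len(population) >= capacity:
        worst = min(fitness for fitness, _ in population)
    else:
        worst = float("-inf")

    # A result that cannot improve the population is dropped without
    # spending any redundant computation on it.
    if reported_fitness <= worst:
        return "discarded"

    # Only potentially useful results pay the cost of validation.
    if not math.isclose(validate(candidate), reported_fitness):
        return "rejected"

    population.append((reported_fitness, candidate))
    population.sort(reverse=True)
    del population[capacity:]  # keep only the best `capacity` members
    return "accepted"
```

Suppose the fitness of x is -(x - 3)**2 and the population holds the candidates 2 and 1. An honest but poor result (candidate 10, fitness -49) is discarded without validation, an honest improvement (candidate 3, fitness 0) is validated and accepted, and a falsified result (candidate 7 reported with fitness 5) is validated and rejected. Only two of the three results trigger redundant computation.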
We studied the impact of different cloud computing virtualization strategies on the performance of heterogeneous workloads (Wang et al., 2010). We concluded that for processing-intensive workloads, the ratio of virtual to physical machines has a significant impact on performance, as do the ratio of virtual to physical memory and the ratio of virtual to physical CPUs (Wang and Varela, 2011). We also demonstrated how to scale down and in computations by dynamically migrating applications into fewer virtual machines, thereby enabling consolidation and derived cost/energy savings, and how to scale up and out computations by creating new virtual machines and dynamically reconfiguring applications to use more resources (Imai et al., 2012). We developed the Cloud Operating System (COS), a middleware for policy-driven autonomous cloud computing software reconfiguration. COS has been used for mobile computing, in particular, opportunistically migrating subcomponents of SALSA programs from mobile devices to hybrid clouds (Imai and Varela, 2011). We also defined the notion of workload-tailored elastic compute units to accurately predict the performance of hybrid IaaS clouds on computation-intensive workloads (Imai et al., 2013). With Stacy Patterson (RPI) and former Ph.D. student Shigeru Imai (now post-doctoral researcher at RPI), we have been studying performance models for data streaming applications in the public cloud. In particular, we have defined the notion of maximum sustainable throughput as a function of the number of VMs allocated to a data streaming analytics task (Imai et al., 2017). We also studied the impact of uncertainty in performance modeling and workload forecasts in virtual machine scheduling, as well as the impact of processing data closer to edge devices (a.k.a. fog computing) (Imai et al., 2018; Imai et al., 2018).
In future work, we plan to investigate middleware for adaptive intelligent systems. In particular, we plan to work on elastic software for hybrid (i.e., public and private), energy-efficient, and heterogeneous (including edge and fog) clouds, and on first-class programming language support for data-intensive computing (Imai et al., 2016). We will also explore morphable computing as a new computing paradigm for dynamically changing the distributed algorithm in use by a group of software agents running on airplanes in the IoP context. The intention is to dynamically optimize between exchange of sensor data and declarative query computation on edge (airplane) and cloud (ground data centers) devices based on latency, colocation, and spatio-temporal constraints.