Damage Quarantine and Recovery in Data Processing
Peng Liu (1), Sushil Jajodia (2), Meng Yu (3)
(1) College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802
(2) Center for Secure Information Systems, George Mason University, Fairfax, VA 22030-4444
(3) Department of Computer Science, West Long Branch, New Jersey 07764-1898
Abstract
In this article, we address transparent Damage Quarantine and Recovery (DQR), an important problem faced today by a large number of mission/life/business-critical applications and information systems that must manage risk, business continuity, and assurance in the presence of severe cyber attacks. Today, these critical applications still have a “good” chance of suffering a big “hit” from attacks. Due to data sharing, interdependencies, and interoperability, the hit could greatly “amplify” its damage by causing catastrophic cascading effects, which may “force” an application to halt for hours or even days before it is recovered. In this paper, we first discuss in depth the limitations of traditional fault tolerance and failure recovery techniques in solving the DQR problem. Then we present a systematic review of how the DQR problem is being solved. Finally, we point out some remaining research issues in fully solving the DQR problem.
Key words: Damage Quarantine and Recovery, Transaction Processing, Data Integrity, Security
1 Introduction
In this article, we address transparent Damage Quarantine and Recovery (DQR), an important problem faced today by a large number of mission/life/business-critical applications. These applications are the cornerstones of a variety of crucial information systems that must manage risk, business continuity, and data assurance in the presence of severe cyber attacks. Today, many of the nation’s critical infrastructures (e.g., financial services, telecommunication infrastructure, transportation control) rely on these information systems to function.
There are at least two main reasons why mission/life/business-critical applications urgently need transparent damage quarantine and recovery. First, although significant progress has been made in protecting applications and systems, mission/life/business-critical applications still have a “good” chance of suffering a big “hit” from attacks. Due to data sharing, interdependencies, and interoperability between business processes and applications, the hit could greatly “amplify” its damage by causing catastrophic cascading effects, which may “force” an application to shut itself down for hours or even days before it recovers from the hit. (Note that the high-speed Internet, e-commerce, and the global economy have greatly increased the speed and scale of damage spreading.) The cascading damage and the loss of business continuity (i.e., DoS) may pose too much risk. Because not all intrusions can be prevented, DQR is an indispensable part of the corresponding security solution, and a quality DQR scheme may have a significant impact on risk management, business continuity, and assurance.
Second, due to several fundamental differences between failure recovery and attack recovery, the DQR problem cannot be solved by failure recovery technologies, mature as they are in handling random failures. (a) Failure recovery in general assumes fail-stop semantics, while attack recovery in general cannot assume attack-stop semantics: to achieve the adversary’s goal, most attacks (except for DoS) avoid simply crashing the system and instead prefer hidden damage and live zombies, spyware, bots, etc. Under the fail-stop assumption, quarantine is not really a problem for failure recovery; in attack recovery, however, intrusion/damage quarantine is a challenging research topic that can make a big difference.
(b) Failure recovery assumes that all operations (e.g., transactions) have equal rights to be recovered, while attack recovery can never assume “equal rights” because neither malicious operations nor corrupted operations should be recovered.
Towards understanding and solving the DQR problem, the rest of the article is organized as follows. In Section 2, we present a comprehensive yet tangible description of the DQR problem. In Section 3, we discuss in depth the limitations of traditional fault tolerance and failure recovery techniques in solving the DQR problem. In Section 4, we present a systematic review of how the DQR problem is being solved. In Section 5, we propose a set of remaining research issues in fully solving the DQR problem and conclude the paper.
2 Overview of the DQR Problem
We are concerned with the DQR needs of mission/life/business-critical information systems.
Since those information systems have been designed, implemented, deployed, and upgraded over several decades, they run both conventional applications, which typically use proprietary user interfaces and application-level client-server protocols [Bir05], and modern applications, which are typically web-based and run standard Web Services protocols.
Nevertheless, both conventional and modern mission/life/business-critical applications share some common characteristics: they are typically part of a large-scale, semantically rich, networked, interoperable information system; they are typically stateful and data-intensive; they are typically 24*7 applications requiring superb business continuity (i.e., availability); and they typically require guaranteed recoverability (and data integrity).
such as J2EE and .NET, programmers can make applications transactional in a rather automatic, effort-free fashion. And the benefits of making applications transactional are significant: “failure atomicity simplifies the maintenance of invariants on data” [SDD85]; a guaranteed level of data consistency can be achieved without worrying about, say, race conditions; durability makes it much easier for programmers to get the luxury of recoverability.
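As an illustration of the failure-atomicity benefit quoted above, the following is a minimal sketch using Python's standard `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and not taken from any system discussed in this article; the point is only that an aborted transaction leaves the invariant (the total balance) intact without any hand-written undo logic.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def total(conn):
    """Invariant under test: the sum of all balances never changes."""
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits on normal exit, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            row = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                               (src,)).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown account or insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False  # the debit above was rolled back with the transaction


if __name__ == "__main__":
    db = open_demo_db()
    print(transfer(db, "alice", "bob", 30), total(db))
    print(transfer(db, "alice", "bob", 500), total(db))
```

In this sketch the second transfer debits the source account before discovering the overdraft, yet the rollback restores the earlier balance, so the total is the same after both calls.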
As a result, the transaction mechanism is embraced by not only database systems [BHG87], but also a large variety of computer systems and applications [Gra93], including operating systems (e.g., VINO provides kernel transaction support [SESS96]), file systems (e.g., Camelot provides transactional access to user-developed data structures stored in files [SDD85]; and [LR04] argues that transactional file systems can be fast), distributed systems (e.g., QuickSilver uses transactions as a unified failure recovery mechanism [HMSC88]), persistent object stores (e.g., Argus supports transactions on objects [LCJS87]), CORBA, and Web Services.
To leverage the strength, recovery facilities, and popularity of the transaction mechanism, and more importantly to make the proposed DQR solutions transparent to existing applications, it is a good idea to develop DQR theories and mechanisms at the transaction level. Since real-world mission/life/business-critical applications typically deploy the transaction mechanism, transaction-level DQR solutions will have wide applicability.
2.2 The Threat Model and Intrusion Detection Assumption
Working at the transaction level does not mean that malicious transactions are the only threat we can handle. Instead, as shown in Figure 1, we allow threats to come from either inside or outside of the transaction-level scope of applications. Nevertheless, to exploit the application’s transaction mechanism to achieve a malicious goal, both inside and outside threats need to either directly corrupt certain data objects or get certain malicious transactions launched. Outside non-transactional attack actions (e.g., the Witty worm) may bypass the transaction interface and corrupt some data objects via low-level (e.g., file or disk) operations. In addition, non-transactional buffer overflow attacks may break into a running program of the application; the attacker can then manipulate the program to launch certain malicious transactions.
Inside the transaction scope, insider attack [Sch05] is probably the most serious threat. Since insiders (e.g., disgruntled employees of a bank) are typically not savvy in hacking, issuing malicious transactions (using a different user account) is typically the way they attack. According to the study in [CK96], most (application-level) attacks are from insiders. Besides insider attacks, (a) identity theft may literally “transform” an outsider into an insider. (b) SQL injection attacks, though currently used mostly to steal sensitive information, are fully capable of maliciously updating data objects. (c) Five of the top six web application vulnerabilities identified by OWASP [OWA04] may enable the attacker to launch a malicious transaction: unvalidated input, broken access control, broken authentication and session management, cross-site scripting (which helps the attacker steal user names and passwords), and injection flaws. (d) Finally, erroneous transactions caused by user/operator mistakes instead of attacks are yet another major threat to data integrity.
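To make point (b) concrete, here is a small, purely illustrative sketch of how an injection flaw can turn a legitimate update into a malicious transaction. It uses Python's standard `sqlite3` module; the `accounts` table and the `credit_unsafe` and `credit_safe` helpers are hypothetical names introduced only for this example.

```python
import sqlite3


def make_db():
    """In-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def credit_unsafe(conn, owner, amount):
    """Vulnerable: `owner` is spliced into the SQL text, so it can alter the WHERE clause."""
    with conn:
        conn.execute(
            f"UPDATE accounts SET balance = balance + {int(amount)} "
            f"WHERE owner = '{owner}'"
        )


def credit_safe(conn, owner, amount):
    """Parameterized: `owner` is always treated as a plain value."""
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
            (int(amount), owner),
        )


def balances(conn):
    """Return {owner: balance} for all accounts."""
    return dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))


if __name__ == "__main__":
    payload = "nobody' OR '1'='1"  # attacker-controlled "owner" string
    a, b = make_db(), make_db()
    credit_unsafe(a, payload, 10)  # the injected condition matches every row
    credit_safe(b, payload, 10)    # matches no row, so nothing changes
    print(balances(a), balances(b))
```

With the unsafe helper, the injected condition makes the update apply to every account, which is exactly the kind of committed but malicious transaction that a transaction-level DQR scheme has to assess and repair afterwards.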
The intrusion detection assumption: We assume that a set of external intrusion detection sensors will do their job and tell us which operations (or transactions) were malicious or which data objects were originally corrupted by the attack. These sensors may be network-level (e.g., [Pax99]), host-level (e.g., [FHSL96]), database-level (e.g., [CGL00]), or transaction-level (e.g., [SFL97, BKTV05]) intrusion detection sensors. They may employ misuse detection (e.g., [Ilg93]), anomaly detection (e.g., [JV91, LX01]), or specification-based (e.g., [KRL97, SGF+02]) detection mechanisms. We assume these sensors are usually subject to false positives, false negatives, and detection latency. Finally, sensors that detect data corruption (e.g., [MG96, BGJ00, MVS00]) may also be used.
Remark. Although some intrusion detection sensors could raise a good number of false positives or false negatives, the alarms raised by many intrusion/error detection sensors can actually be verified before any DQR operation is performed. (In this way, the negative impact of false positives and false negatives on the correctness/quality of DQR may be avoided.) For example, (a) most user/operator mistakes can be easily verified against the operation audit trails. (b) Many data corruption detectors have 100% accuracy. (c) When a strong correlation is found between one alert X and some other alerts, alert X may be verified as a true intrusion.
2.3 The DQR Problem/Solution Space
In our view, the DQR problem is a 6-dimensional problem:
• (1) The damage propagation dimension explains why cascading effects can be caused and why quarantine is needed. Although some specific types of damage (e.g., when an individual credit card account is corrupted) could be self-contained, many types of damage are actually very infectious due to data sharing, interdependencies, and interoperability between business processes and applications. For example, in a travel assistant Web Service, if a set of air tickets is reserved by malicious transactions, some other travelers may have to change their travel plans in terms of which airlines to fly, which nights to stay in a hotel, etc. Furthermore, the changed travel plans can cause cascading effects on yet another group of travelers, and the propagation may go on and on.
• (2) The recovery dimension covers three semantics for recovery. Coldstart semantics mean that the system is “halted” while damage is being assessed and repaired. (Damage assessment identifies the set of corrupted data objects. Damage repair restores the value of each corrupted data object to its latest before-infection version.) To address the DoS threat, recovery mechanisms with warmstart or hotstart semantics are needed. Warmstart semantics allow continuous, but degraded, running of the application while damage is being recovered. Hotstart semantics make recovery transparent to the users.
• (3) The quarantine dimension covers a spectrum of quarantine strategies: (a) coldstart recovery without quarantine, (b) warmstart recovery with conservative, reactive quarantine, (c) warmstart recovery with proactive or predictive quarantine, (d) hotstart recovery with optimistic quarantine, to name a few.
• (4) The application dimension covers the various transaction models deployed by conventional and modern applications. The uniqueness of each model may introduce new challenges for solving the DQR problem.
• (5) The correctness dimension tells whether a DQR scheme is correct in terms of consistency, recoverability, and quarantinability.
• (6) The quality dimension allows people to measure and compare the quality levels achieved by a set of correct yet different DQR schemes.
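The damage propagation and recovery dimensions can be illustrated with a small sketch. It assumes, purely for illustration, that every committed transaction is logged in commit order with its read set and its writes, and that the intrusion detection sensors have already named the malicious transactions. The `Txn` record and the `assess_and_repair` function are hypothetical names, not an algorithm taken from the surveyed work. The sketch performs coldstart-style damage assessment (which objects are corrupted) and damage repair (restore each to its latest before-infection version); it does not re-execute the affected transactions.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    """One committed transaction as recorded in a (hypothetical) log."""
    tid: str
    reads: set = field(default_factory=set)      # objects read
    writes: dict = field(default_factory=dict)   # object -> value written


def assess_and_repair(initial, log, malicious):
    """Scan the log in commit order; return (bad txns, corrupted objects, repaired state)."""
    state = dict(initial)        # current, possibly corrupted, values
    last_clean = dict(initial)   # latest before-infection version of each object
    dirty = set()                # objects currently holding corrupted values
    bad = set(malicious)         # malicious plus affected transactions

    for t in log:
        # Damage propagation: reading a corrupted object infects the reader.
        if t.tid not in bad and t.reads & dirty:
            bad.add(t.tid)
        for obj, value in t.writes.items():
            state[obj] = value
            if t.tid in bad:
                dirty.add(obj)
            else:
                dirty.discard(obj)       # a clean write refreshes the object
                last_clean[obj] = value

    # Damage repair: roll each corrupted object back to its last clean version.
    repaired = dict(state)
    for obj in dirty:
        if obj in last_clean:
            repaired[obj] = last_clean[obj]
        else:
            repaired.pop(obj)            # object was created by a bad transaction
    return bad, dirty, repaired


if __name__ == "__main__":
    initial = {"x": 1, "y": 1, "z": 1}
    log = [
        Txn("T1", reads={"x"}, writes={"x": 100}),  # reported as malicious
        Txn("T2", reads={"x"}, writes={"y": 101}),  # reads corrupted x
        Txn("T3", reads={"z"}, writes={"z": 2}),    # untouched by the attack
        Txn("T4", reads={"y"}, writes={"w": 7}),    # reads corrupted y
    ]
    print(assess_and_repair(initial, log, {"T1"}))
```

In the demo log, a single malicious transaction T1 drags T2 and T4 into the damage set through the data they read, while T3 and its update to `z` are preserved. This is the "no equal rights" property of attack recovery in miniature: clean work is kept, malicious and affected work is undone.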
2.4 What Transaction-Level DQR Solutions Cannot Do
First, although transaction-level DQR solutions will help minimize the damage caused by cyberspace attacks in the physical world, they cannot repair physical damage, which is a different field of study. Second, transaction-level DQR solutions are not designed to patch software, which is another critical intrusion recovery problem. Nevertheless, transaction-level DQR solutions and software patching are complementary to each other. Transaction-level DQR solutions can help quarantine and repair the damage done through unpatched software that the adversary has broken into.