FREE ELECTRONIC LIBRARY - Thesis, dissertations, books

Pages:   || 2 | 3 | 4 |

«Asia Slowinska, Istvan Haller, Andrei Bacs, Silviu Baranga, and Herbert Bos Vrije Universiteit Amsterdam Abstract Many software vendors use data ...»

-- [ Page 1 ] --

Data structure archaeology: scrape away the dirt and

glue back the pieces!

(Or: automated techniques to recover split and merged variables)

Asia Slowinska, Istvan Haller, Andrei Bacs, Silviu Baranga, and Herbert Bos

Vrije Universiteit Amsterdam


Many software vendors use data obfuscation to make it hard for reverse engineers to

recover the layout, value and meaning of the variables in a program. The research question in this paper is whether the state-of-the-art data obfuscations techniques are good enough. For this purpose, we evaluate two of the most popular data obfuscation methods: (1) splitting a single variable over multiple memory location, (2) splitting and merging two variables over multiple memory locations. While completely automated and flawless recovery of obfuscated variables is not yet possible, the outcome of our research is that the obfuscations are very vulnerable to reversing by means of automated analysis. We were able to deobfuscate the obfuscated variables in real world programs with false positive rates below 5%, and false negative rates typically below 10%.

1 Introduction Both malware authors and commercial software vendors employ software obfuscation to protect their binaries from the prying eyes of reverse engineers and crackers. The assumption is that sensitive information is safe behind one or more layers of transformation that scramble the data and code in such a way that they become hard to analyze.

In this paper, we focus on legitimate applications written in C that vendors obfuscate for purposes like IP protection and DRM.

By now, code and data obfuscation have evolved into mature fields, with an active research community and commercial products like Irdeto’s Cloakware [19], Morpher [26], and CodeMorph [31]. The commercial interest in obfuscation is high, especially in DRM or security-sensitive environments. Cloakware’s clients include companies like Logitech/Google TV, ComCast, Netflix, Elgato, Harmonix, and Xceedium, while Morpher has clients like Spotify and Discretix (used in Android DRM, Microsoft PlayReady, and many other products).

The obfuscation techniques in today’s obfuscators range from limited control flow hiding to highly advanced methods that include the data structure layouts (and values) as well. Indeed, it is common to distinguish between control obfuscation (e.g., opaque predicates and control flow hiding), and data and layout obfuscation (e.g., splitting a single variable over multiple memory locations) [10, 9].

Over the years, both the blackhat and whitehat communities have shown an active interest in probing the strength of current control obfuscation techniques. Typically, they show that most control obfuscation techniques are limited (or even weak) in the face of determined attackers [21, 33, 24, 13].

We do not know of any project that addresses the recovery of obfuscated memory layouts and data. This is remarkable, because for reverse engineers there is great value in the recovery of the data and its layout. Real programs tend to revolve around their data structures, and ignorance of these structures makes the already complex task of reverse engineering even more painful [30]. In addition, deobfuscated data is a crucial step in reversing sensitive information.

A possible reason for the lack of prior art is that extracting data structures from binary programs is exceedingly difficult even without obfuscation. If data is additionally hidden behind sophisticated obfuscations, the hope of recovering the original data structures is close to zero. For instance, how could you detect that two bytes are really part of the same number in a single structure, if they are not even stored together? The core assumption that many software vendors rely on is that the obfuscation is irreversible in practice. The research question in this paper is whether this assumption is reasonable.

Specifically, we show that the assumption is false for state-of-the-art data obfuscators and that it is feasible to recover the data structures in an automated way. Perfect deobfuscation is not needed and, as we shall see, impossible in the general case. Instead, we probe the binary’s obfuscation and consider it weak when a reverse engineer has a high probability of finding the original data/layout.

Contributions. The research question in this paper is: are state-of-the-art data obfuscation techniques good enough? Specifically, we probe two of the most common and advanced data obfuscation techniques, and show that they are vulnerable to automated analysis. Our approach is based solely on a dynamic analysis of the program’s memory accesses and information flow. As a concrete implementation, we present Carter, a data deobfuscator that reverses the obfuscations by tracking and analyzing the program’s memory access patterns. We also show the usefulness of the tool in an actual reverse engineering scenario.

Assumptions. We assume that the obfuscator may apply different data and control obfuscations at the same time—much like state-of-the-art obfuscators such as Cloakware [19]. For instance, we allow split variables in a program that also runs on a virtualization obfuscator with opaque predicates. Rather than criticize a specific product, our aim is to evaluate the advanced obfuscation techniques in general, regardless of who sells them. Therefore, whenever we discuss data obfuscation techniques in this paper, we refer to publications describing the techniques and not to products. The actual obfuscator used for this paper is representative of the advanced data obfuscations found in (one or more of) the commercial systems.

We also assume that the obfuscated binary is available on a machine controlled by the attackers. They can apply any kind of static or dynamic analysis to the binary and run it many times.

Our approach is almost exclusively dynamic. In particular, it runs the obfuscated binaries in the PIN [18] dynamic instrumentation framework. Dynamic analysis has the advantage that it easily handles popular control obfuscations (like opaque predicates and return address patching, see Section 5). The drawback is that, like all dynamic approaches, we can only analyze what we execute. While code coverage for binaries is a hot research topic [6], it is beyond the scope of this paper.

The paper will also show-case examples of single-threaded obfuscation, since in practice, we never observed instances cross-thread data obfuscation. This does not limit the ability of Carter to handle multi-threaded programs, as PIN is fully capable of performing per-thread instrumentation.

Outline We describe data obfuscation techniques in Section 2, and our approach to deobfuscation in Sections 3-4. Next, we discuss the impact of control obfuscations in Section 5 and evaluate our work in Section 6. We then use our tool in an actual reverse engineering example in Section 7 to demonstrate its usefulness. We discuss both limitations and recommendations in Section 8 and related work in Section 9. Section 10 contains our conclusions.

2 Data obfuscation In the next three sections, we focus on data obfuscation. In Section 5, we also show what happens if data and control obfuscation are combined. Our focus is solely on obfuscation rather than, say, encryption1. Specifically, we evaluate two of the most prevalent

and advanced data obfuscations described in the literature:

• Variable splitting: the program scatters variables (like integers) over multiple locations.

• Splitting and merging: besides splitting variables, the program merges them by using a single location for multiple variables. While we are not aware of any current obfuscator that provides a flexible manner to add such data obfuscation, and we were able to perform only a partial evaluation, we think it represents an interesting extreme case.

We now discuss these techniques in detail and focus on transformations that obscure the built-in data types [11].

Splitting variables Integers and boolean variables are common data types that often carry sensitive information. A popular and complex transformation is known as variable splitting. Splitting a variable breaks it up into several smaller components. Wherever possible, the program will access the constituent components rather than the actual parameter. For reverse engineers it will be very difficult to guess the meaning of the components.

To introduce the concept formally, assume that the obfuscator splits a variable x into variables x1 and x2, with the transformation defined by functions E(x1,x2 ) and D(x).

We say that E encodes x as a function of x1 and x2, while D decodes x (maps x on the corresponding values of x1 and x2 ). Figure 1 shows an example.

Given E and D, we still have to devise operations to perform on the new representation of x. A simple solution would be to compute the value of x, perform the original Variable encryption is not normally seen as obfuscation as it involves secret keys rather than key-less obfuscation algorithms. Thus, it is the subject of cryptanalysis rather than deobfuscation.


Fig. 1. An example variable split transformation.

operation on x, and encode it as x1 and x2 again. However, doing so reveals the variable x to the attacker. Split transformations try to perform the operations solely on the new representation of the variable and avoid computing x even as an intermediate value.

Figure 1 shows an example that maps additions on operations on x1 and x2.

Of course, we cannot always hide x completely. When the program passes x as an argument to a library function or a system call, it needs to compute its value. In general, all interactions with non-obfuscated code require x’s original value. Also, while the potency of the split obfuscation grows with the number of new variables introduced, so does the cost of the transformation. In practice, a variable is split into just 2 or 3 other variables [11].

Splitting and merging variables To add to the confusion, the obfuscator may combine splitting and merging on the same variables: given unrelated variables x and y, the transformation first splits x into {x1, x2 }, and y into {y1, y2 }. Next, it merges x2 with y1 into z, so the obfuscated program uses only variables x1, z, and y2.

2.1 Goal: tractable deobfuscation Carter aims to make data deobfuscation tractable—to give reverse engineers a high probability of finding the original data. Without knowing the original intention of the programmer, it is not always possible to decide whether a variable is obfuscated, or encoded in a certain way for other reasons. For example, to access a two-dimensional array, arr, a programmer may use either one or two subscripts, i.e., arr[x][y] or arr[i] where i = x∗N +y. Thus, when we observe such array accesses in a binary, we cannot tell whether the programmer chose the encoding for convenience, or to obfuscate the variable i by splitting it into variables x and y.

In our analysis, we just aim to discover that x and y are used interchangeably with i. After the analysis, we will then explicitly compare our results with the ground truth and report the variables that were not really obfuscated as false positives (Section 6).

We stress that perfect deobfuscation is not needed. Specifically, we can tolerate false positives (where we say that data was obfuscated, when in reality it was not) and even false negatives (where we miss the obfuscation), as long as these cases do not occur too often. The reason is that the number of variables in a program may be large, but it is only a fraction of the total SLOC count. For instance, the lighttpd webserver used by YouTube counts about 2k variable/field definitions on 40K lines of code.

Even if we incur a false positive rate of 5%, the number of false positives for programs like lighttpd is probably tractable for a motivated attacker. Phrased differently, the base-rate fallacy [3] is less of an issue than for, say, most intrusion detection systems.

Similarly, a false negative rate of 10% means that we miss obfuscated variables, but the remaining 90% are important results. Thus, the real question is whether a reverse engineer can use automated techniques to get a handle on the data and layout.

3 Variable split detection To detect a split variable, we build on two observations. First, when an obfuscator splits a variable z into x and y, it needs to perform a semantically equivalent operation on x and y for all operations on z (whether they be reads, writes, or ALU operations).

Second, although the obfuscator works on x and y independently as much as possible, their values are combined occasionally. For instance, during an interaction with nonobfuscated components, such as the operating system.

Carter therefore analyzes the program’s memory access trace, and looks for variables that are used together and exchange their data locally—in a short logical time interval. The question is how to determine the right level of affinity. For this we developed a new approach that hails from a technique in cache optimization.

Reference affinity grouping [37] restructures arrays and data structures to place elements that are always used together on the same cache line. It measures how ‘close in logical time’ the program accesses groups of data, and proposes a partition based on the outcome. Likewise, Carter looks for candidate data items that together may make up a split variable by tracking items that are used close together in logical time. Whenever Carter finds such items, it classifies them for a grouping.

Although we were inspired by the original work on reference affinity grouping [37], we devised our own method for approximating the solution. Picking an appropriate method is important, because Ding et al. [15] proved that finding the optimal partition is NP-hard. The concept of temporal reuse intervals, which we propose in Section 3.3, provides a practical way to identify memory locations that are accessed together.

Once the grouping algorithm has proposed candidates for split variables, we refine the results by data flow analysis (Section 3.4). Intuitively, data items in a split variable share data on reads.

Running example We illustrate the whole procedure with a simple obfuscated function, which serves as a running example. For the sake of clarity, all examples are in C.

However, we perform the real analysis on binaries.

Pages:   || 2 | 3 | 4 |

Similar works:

«LARSA Section Composer Manual A manual for LARSA 4D Finite Element Analysis and Design Software Last Revised July 2009 Copyright (C) 2001-2012 LARSA, Inc. All rights reserved. Information in this document is subject to change without notice and does not represent a commitment on the part of LARSA, Inc. The software described in this document is furnished under a license or nondisclosure agreement. No part of the documentation may be reproduced or transmitted in any form or by any means,...»

«Whole Detox A 21Day Personalized Program to Break Through Barriers in Every Area of Your Life Deanna Minich Learn More or Pre-order at Whole-Detox.com Copyright This book contains advice and information relating to health and is not meant to diagnose, treat, or prescribe. It should be used to supplement rather than replace the advice of your physician or other trained health-care practitioner. If you know or suspect you have a medical condition, have physical symptoms, or feel unwell, it is...»

«ABC of Knowledge Management Freely extracted from the NHS National Library for Health at http://www.library.nhs.uk/knowledgemanagement/ by Géraud Servin Creator: NHS National Library for Health: Knowledge Management Specialist Library Contributor: Caroline De Brún Publication Date: July 2005 Table of Contents 1 WHAT IS KNOWLEDGE MANAGEMENT? 1.1 What is knowledge management? 1.2 What is knowledge? 1.3 Why do we need knowledge management? 1.4 What does knowledge management involve? 1.5 Some...»

«Contains Nonbinding Recommendations This guidance represents the Food and Drug Administration’s (FDA’s) current thinking on this topic. It does not create or confer any rights for or on any person and does not operate to bind FDA or the public. You can use an alternative approach if the approach satisfies the requirements of the applicable statutes and regulations. If you want to discuss an alternative approach, contact the FDA staff responsible for implementing this guidance. If you cannot...»

«FOOD AND DRUG ADMINISTRATION GUIDE TO INSPECTIONS OF QUALITY SYSTEMS August 1999 Guide to Inspections of Quality Systems This document was developed by the Quality System Inspections Reengineering Team Members Office of Regulatory Affairs Rob Ruff Georgia Layloff Denise Dion Norm Wong Center for Devices and Radiological Health Tim Wells – Team Leader Chris Nelson Cory Tylka Advisors Chet Reynolds Kim Trautman Allen Wynn Designed and Produced by Malaka C. Desroches Foreword This document...»

«Equine Piroplasmosis and the 2010 World Equestrian Games Veterinary Services Animal and Plant Health Inspection Service U.S. Department of Agriculture May 2008 Table of Contents Acknowledgements Introduction What is equine piroplasmosis (EP)? What ticks spread EP? What countries are affected? Has the United States been affected before? What are the risks of the United States becoming affected? What are we doing to prevent the introduction of EP? What are EP-related issues for hosting equine...»

«COBOL – continuing to drive value in the 21st Century The landscape for COBOL asset ownership Reference Code: CYBT0006 Publication Date: November 2008 COBOL – continuing to drive value in the c21st CYBT0006 Published 11/2008 © Datamonitor. This document is a licensed product and is not to be photocopied COBOL – continuing to drive value in the c21st COBOL – continuing to drive value in the c21st CYBT0006 Published 11/2008 © Datamonitor. This document is a licensed product and is not...»

«1 ANNOUNCING EMPIRICALLY BASED PLAY INTERVENTIONS FOR CHILDREN LINDA A. REDDY, TARA M. FILES-HALL, AND CHARLES E. SCHAEFER Play is a universal behavior of children that has been documented since ancient times (Janssen & Janssen, 1996; Lowenfeld, 1939). It is estimated that by 6 years of age, children are likely to have engaged in more than 15,000 hours of play (Schaefer, 1993). The benefits of play for healthy cognitive development (Bornstein & O’Reilly, 1993; Piaget, 1962), language...»

«SUPPLEMENTARY MATERIAL S2: CONCISE GUIDE TO COMPOSITIONAL DATA ANALYSIS FOR PHYSICAL ACTIVITY, SEDENTARY BEHAVIOR AND SLEEP RESEARCH Authors: S Chastin, J. Palarea-Albaladejo Date: 28/01/2015 Version:2.2 DISCLAIMER: This document is a brief guide to the basis of Compositional Data Analysis (CoDa) intended to help researcher in physical activity, sedentary behavior and sleep epidemiology familiarize themselves with basic concepts and develop basic analysis. It is not a formal introduction to the...»

«Clinical Infectious Diseases Advance Access published August 12, 2014 Using Clinician’s Search Query Data to Monitor Influenza Epidemics Mauricio Santillana1, Elaine O. Nsoesie2,4, Sumiko R. Mekaru2, David Scales2,3, and John S. Brownstein2,4,5 School of Engineering and Applied Sciences, Harvard University, Cambridge, MA Children’s Hospital Informatics Program, Boston Children’s Hospital, Boston, MA Department of Internal Medicine, Cambridge Health Alliance, Cambridge, MA Downloaded from...»

«ORIENTATION GUIDE FOR SACC-USA TRAINEE PROGRAM Contents INTRODUCTION SACC-USA Important contact information DESCRIPTION OF THE PROGRAM The Exchange Visitor Program and the J-1 visa The Trainee Program RULES AND REGULATIONS Requirements for Participating in the J-1 Program Rules for the Program as Outlined in the Training Agreement Regulations Regarding Travel INSURANCE AND HEALTH Social Security Number Insurance Non-Immigrants Legal Rights Healthcare LIFE AND CUSTOMS IN THE UNITED STATES...»

«Material and Methods : Patients The study comprised 337 consecutive gastric cancer patients treated at the Department of Surgery, Helsinki University Central Hospital, in 1983-1999. The patient material has been characterized before (1). In brief, 174 men and 163 women were included with a median age of 66 (range 30-87). Median follow-up of patients at study end was 12.7 years (range 4.7Five-year overall survival was 29.4% (95% confidence interval (CI) 24.5-34.3). Approval of the study came...»

<<  HOME   |    CONTACTS
2016 www.dis.xlibx.info - Thesis, dissertations, books

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.