
# Computational Bounds on Hierarchical Data Processing with Applications to Information Security

Roberto Tamassia and Nikos Triandopoulos

Department of Computer Science, Brown University. Email: {rt, nikos}@cs.brown.edu

Abstract. We study the complexity of a class of computations based on a directed acyclic graph (DAG) that describes the computation of a collection of output values from an input set of n elements. We present an Ω(log n) lower bound on various computational cost measures for our class of computations. Also, we develop a new randomized DAG scheme that provides improved computational performance over previously known schemes. We apply our results to two information security problems, data authentication through cryptographic hashing and multicast key distribution using key-graphs. We show that both problems involve hierarchical data processing and prove logarithmic lower bounds on their computational and communication costs. Also, using our new DAG scheme, we present a new authenticated dictionary data structure with reduced authentication overhead. Finally, we present a new skip-list version such that the expected number of comparisons in a search is 1.25 log2 n + O(1).

## 1 Introduction

In this paper, we present a unified analysis and design of algorithms and data structures for two important, and seemingly unrelated, information security problems: the authentication of membership queries in the presence of data replication at untrusted directories, and the distribution of cryptographic keys by the controller of a dynamic multicast group. We provide logarithmic lower bounds on various time and space cost measures for these problems. We also develop new efficient data structures and give an accurate analysis of their performance, taking into account constant factors in the leading asymptotic term.

Our unified approach is based on the definition of the class of hierarchical data processing problems, where a directed acyclic graph (DAG) describes the computation of a collection of output values from an input set of n elements. We define structural metrics for subgraphs of a DAG that express cost measures related to computations in a hierarchical data processing problem. We prove Ω(log n) lower bounds on these cost measures using a reduction to searching by comparisons in an ordered set. We also design a new randomized DAG scheme for hierarchical data processing problems, based on a variation of the skip-list. Our results for the two information security problems are obtained by showing that they can be modeled by a hierarchical data processing problem and by appropriately applying to their domain the general lower bounds and our new data structure. Our contributions are summarized as follows.

Hierarchical Data Processing We introduce the class of hierarchical data processing (HDP) problems and study their complexity. This class models computations on a dynamic set of elements that share the following characteristics. Associated with the elements is a structured collection of values, organized according to a DAG. As elements change over time, the values are updated accordingly, and the costs of the computations involved depend on certain structural properties of the underlying DAG. Finally, queries on elements are issued, where typically the answer to a query is a subset of the associated values. Again, the cost of answering a query depends on the hierarchy induced by the DAG.
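As a concrete (if simplistic) instance of an HDP computation, the following Python sketch organizes the values over a complete binary-tree DAG; all names are illustrative, and the paper's framework covers arbitrary DAGs and more general cost measures.

```python
import hashlib

def combine(left, right):
    # Hypothetical value-combination rule at an internal DAG node;
    # here, a cryptographic hash of the two child values.
    return hashlib.sha256(left + right).digest()

def build(elements):
    """Return the DAG's values as a list of levels; levels[0] holds the
    values of the input elements (assumes len(elements) is a power of 2)."""
    level = [hashlib.sha256(e.encode()).digest() for e in elements]
    levels = [level]
    while len(level) > 1:
        level = [combine(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def update(levels, index, new_element):
    """Change one input element and recompute the values along its
    leaf-to-root path; the work is proportional to the DAG's depth,
    i.e., O(log n) for this tree DAG."""
    levels[0][index] = hashlib.sha256(new_element.encode()).digest()
    for d in range(1, len(levels)):
        index //= 2
        levels[d][index] = combine(levels[d - 1][2 * index],
                                   levels[d - 1][2 * index + 1])

levels = build(["a", "b", "c", "d"])
update(levels, 2, "z")  # the output value at the root now reflects "z"
```

The update touches only one node per level, which is exactly the kind of structural cost the DAG metrics in Section 2 capture.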

Previous Work. We are not aware of any previous systematic study of the class of HDP problems.

Our Results. We define several cost measures for subgraphs of a DAG that characterize the space and time complexity of queries and update operations in an HDP problem. For a problem of size n, we relate each of these cost measures to the number of comparisons performed in the search for an element in an ordered set of size n. Through this reduction, we prove an Ω(log n) lower bound on the space and time complexity of query and update operations in a hierarchical data processing problem. We also show that trees are optimal DAG structures compared with general DAGs. Finally, we design a new randomized DAG for HDP problems, based on a variation of the skip-list, and give a detailed analysis of its cost measures, taking into account the constant factor of the leading asymptotic term.

Skip-Lists The skip-list, introduced in [25, 26], is an efficient randomized data structure for dictionaries.

Previous Work. A search in a skip-list with n elements takes 1.5 log2 n + O(1) expected comparisons [25, 26].

Our Results. As a first application of our improved DAG, we design a new type of skip-list where the expected number of comparisons in a search is 1.25 log2 n + O(1), which is closer to the optimal log2 n + O(1).
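For reference, the comparison count being measured is the one incurred by a standard skip-list search [25, 26]. The sketch below is the textbook structure with promotion probability 1/2, not the paper's improved variant; names are illustrative.

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # one forward pointer per level

class SkipList:
    """Textbook skip-list: each element appears at the next level up
    with probability 1/2; search() counts key comparisons."""
    MAX_HEIGHT = 32

    def __init__(self):
        self.head = SkipNode(None, self.MAX_HEIGHT)
        self.comparisons = 0

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node  # last node before key at this level
        new = SkipNode(key, self._random_height())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def search(self, key):
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] is not None:
                self.comparisons += 1
                if node.next[lvl].key < key:
                    node = node.next[lvl]
                else:
                    break  # drop down one level
        candidate = node.next[0]
        self.comparisons += 1
        return candidate is not None and candidate.key == key
```

Averaged over the coin flips, searches in this structure cost about 1.5 log2 n comparisons; the variant in Section 3 reduces the constant to 1.25.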

Authenticated Data Structures With the growing use of Web services and data replication applications, queries on a data structure are often answered by a third party different from the source of the data. Thus, the problem arises of authenticating the answers given by this third party, which may not be trusted by the user. An authenticated data structure (ADS) is a distributed model of computation where a directory answers queries on a data structure on behalf of a trusted source and provides to the user a cryptographic proof of the validity of the answer. The source signs a digest (i.e., a cryptographic summary) of the content of the data structure and sends it to the directory. This signed digest is forwarded by the directory to the user together with the proof of the answer to a query. To verify the validity of the answer, the user computes the digest of the data from the answer and the proof, and compares this computed digest against the original digest signed by the source.
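This verification flow can be sketched with a binary Merkle hash tree (the hash-tree scheme of [18, 19]): the proof consists of the sibling hash at each level of the queried element's path. Function names are illustrative, and the source's signature on the digest is omitted.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """The source's digest: hierarchical hashing over a binary tree
    (assumes len(leaves) is a power of two)."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def prove(leaves, index):
    """The directory's proof: the sibling hash at each level of the
    path from leaf `index` to the root."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        proof.append(level[index ^ 1])  # sibling at this level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(answer, index, proof, signed_digest):
    """The user recomputes the digest from answer + proof and compares
    it against the digest signed by the source."""
    digest = h(answer)
    for sibling in proof:
        if index % 2 == 0:
            digest = h(digest + sibling)
        else:
            digest = h(sibling + digest)
        index //= 2
    return digest == signed_digest

leaves = [b"apple", b"banana", b"cherry", b"date"]
root = merkle_root(leaves)
proof = prove(leaves, 2)  # O(log n) hashes travel with the answer
```

Both the proof size and the user's verification time are proportional to the tree depth, which is the quantity bounded from below in Section 4.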

Cost parameters for an ADS include the space used (for the source and directory), the update time (for the source and directory), the query time (for the directory), the digest size, the proof size and the verification time (for the user). In the important class of hash-based authenticated data structures, the digest of the data set is computed by hierarchical hashing, i.e., by hierarchically applying a cryptographic hash function over the data set. Section 4 provides more details about the model and authenticated dictionaries.

Previous Work. Early work on ADSs has focused on hash-based authenticated dictionaries. The hash tree scheme by Merkle [18, 19] implements a static authenticated dictionary by hierarchical hashing over a balanced binary tree. For a set of size n, the scheme uses O(n) space and has O(log n) proof size, query time and verification time. Dynamic authenticated dictionaries that achieve O(log n) proof size, query time, update time and verification time are presented in [23] (based on hash trees) and in [1, 11] (based on skip-lists). Other authentication schemes for dictionaries, based on variations of hash trees, have been proposed in [2, 8, 16].

General techniques for hash-based query authentication are presented in [17, 13]. Beyond dictionaries, hash-based ADSs have been developed for relational database operations and multidimensional orthogonal range queries [7], pattern matching in tries and multidimensional orthogonal range searching (static case) [17], path and connectivity queries in dynamically evolving graphs and search queries in 2-dimensional static geometric objects (e.g., point location queries and segment intersection queries) [13]. An alternative approach to the design of an authenticated dictionary, based on the RSA accumulator (and not on hierarchical hashing), is presented in [12]. Related work on accumulators appears in [3]. Work related to ADSs includes [6, 9, 22, 1].

Related to ADSs is also recent work on consistency proofs [20, 24], which models data authentication in a more adversarial environment, where the data source is not considered trusted per se; however, the proposed schemes use significantly more computational resources than ADSs.

The computational overhead incurred by an authenticated data structure over a non-authenticated one consists of: (1) the additional space used to store authentication information (e.g., signatures and hash values) in the data structure and in the proof of the answer, and (2) the additional time spent performing authentication computations (e.g., signatures and cryptographic hashes) in query, update and verification operations. Since cryptographic operations such as signatures and hashes are orders of magnitude slower than comparisons and a single hash value is relatively long (e.g., 16B or 20B), the authentication overhead dominates the performance of an ADS.

All the existing hash-based authenticated dictionaries have logarithmic query, update and verification cost and logarithmic proof cost. Naor and Nissim [23] posed as an open problem the question of whether one can achieve sublogarithmic authentication overhead for dictionaries. We answer this question negatively for hash-based ADSs.

Our Results. We present the first study on the cost of authenticated data structures, focusing on dictionaries.

In Section 4, we model a hash-based dictionary ADS as a hierarchical data processing problem. We consider a very general authentication technique where hashing is performed over the data set in any possible way and where more than one digest of the data structure may be digitally signed by the source. Applying our results from Section 2 in this domain, we prove the first nontrivial lower bound on the authentication cost for dictionaries. In particular, we show that in any hash-based authenticated dictionary of size n where the source signs k digests of the data set, each of the authentication costs (update/query time, proof size or verification time) is Ω(log(n/k)) in the worst case. Thus, the optimal trade-off between signature cost and hashing cost is achieved with O(1) signature cost and Ω(log n) hashing cost. In this case, we show that hash-based authenticated dictionaries of size n incur Θ(log n) complexity. We also present a new hash-based dictionary ADS based on our skip-list structure from Section 3 and show that it has better authentication cost parameters than previous hash-based ADS constructions.

Multicast Key Distribution Multicast key distribution (or multicast encryption) is a model for realizing secrecy in multicast communications among a dynamic group of n users. To achieve secrecy, one needs to extend conventional point-to-point encryption schemes to the multicast transmission setting. Namely, the users share a common secret key, called the group-key, and encrypt multicast messages with this key, using a secret-key (symmetric) encryption scheme. When changes in the multicast group occur (through additions/deletions of users), the group-key needs to be securely updated in order to preserve (forward and backward) security. In general, a group controller (a physical or logical entity) is responsible for distributing an initial set of keys to the users. Each user possesses his own secret key (known only to the controller), the group-key and a subset of the other keys. Upon the insertion or removal of a user into/from the group, a subset of the users' keys is updated. Namely, new keys are encrypted with some of the existing keys so that only legitimate users in the updated group can decrypt them. The main cost associated with this problem is the number of messages that need to be transmitted after an update. Additional costs are related to the computational time spent on key encryptions and decryptions.
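The message count can be illustrated with a toy sketch of the binary key-tree scheme of [32, 33]: users are leaves, every tree node holds a key, each user knows the keys on its leaf-to-root path, and the root key is the group-key. The rekeying strategy shown (each fresh path key encrypted once under the sibling subtree's key and once under the already-refreshed child key) is one common variant; names and details are illustrative, and real encryptions are replaced by a counter.

```python
import os

def remove_user(height, leaf):
    """Count the rekey messages when the user at position `leaf` leaves
    a full binary key-tree of the given height. Every key on the leaving
    user's path must be replaced so that the departed user is excluded."""
    messages = 0
    for level in range(height):       # walk from the leaf to the root
        new_key = os.urandom(16)      # fresh key for this path node
        # Encrypt new_key under the sibling subtree's key, so the users
        # in that subtree can learn it: 1 message.
        messages += 1
        if level > 0:
            # Also encrypt it under the already-refreshed child key on
            # the path, for the users below: 1 more message.
            messages += 1
        leaf //= 2
    return messages

# For n = 2^h users, rekeying after a removal costs 2h - 1 = O(log n)
# messages under this strategy.
cost = remove_user(10, 123)
```

A flat scheme (every user sharing only the group-key with the controller) would instead need n - 1 messages per removal, which is why the tree hierarchy, and hence HDP structure, matters here.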

Previous Work. Many schemes have been developed for multicast key distribution. We focus on the widely studied key-graph scheme, introduced in [32, 33], where constructions are presented for key-graphs realized by balanced binary trees such that O(log n) messages are transmitted after an update, where n is the current number of users. Further work has been done on key-graphs based on specific classes of binary trees, such as AVL trees, 2-3 trees and dynamic trees; see, e.g., [14, 10, 27]. In [5], the first lower bounds are given for a restricted class of key distribution protocols, where group members have limited memory or the key distribution scheme has a certain structure-preserving property. In [29], an amortized logarithmic lower bound is presented on the number of messages needed after an update: the authors prove the existence of a series of 2n update operations that cause the transmission of Ω(n log n) messages. Recently, a similar amortized logarithmic lower bound has been shown in [21] for a more general class of key distribution protocols, where one can employ a pseudorandom generator to extract (in a one-way fashion) two new keys from one key and one can perform multiple nested key encryptions. Pseudorandom generators for this problem were first described in [4], where the number of messages is decreased from 2 log n to log n.

Our Results. We show that the multicast key distribution problem using key-graphs is a hierarchical data processing problem. Applying our results from Sections 2 and 3 to this domain: (i) we perform the first study of general key-graphs (other than trees) and show that trees are optimal structures; (ii) we prove an exact worst-case logarithmic lower bound, the first of its type, on both the communication cost (number of messages) and the computational cost (due to encryptions and decryptions) of any update operation; and (iii) we present a new scheme (tree DAG) that achieves costs closer to the theoretical optimum. Note that we give the first lower bounds on the encryption/decryption costs and that our lower bound proof is more general, since it does not rely on any particular sequence of update operations. In essence, we present the first exact, worst-case logarithmic lower bound on the communication cost of the multicast key distribution problem. All previously known lower bounds are amortized, i.e., they prove the existence of a sequence of updates that includes at least one expensive update (of at least logarithmic cost). In contrast, we prove the existence of a single update of communication cost at least log n + 1 for any instance of the problem. Our lower bound also holds for protocols that use pseudorandom generators or multiple encryption, as in the model studied in [21].

These results are described in Section 5.