«Diego Calvanese Faculty of Computer Science Free University of Bolzano/Bozen Piazza Domenicani 3 I-39100 Bolzano-Bozen BZ, Italy Luigi Dragone, ...»
Enterprise Modeling and Data Warehousing
in Telecom Italia
Faculty of Computer Science
Free University of Bolzano/Bozen
Piazza Domenicani 3
I-39100 Bolzano-Bozen BZ, Italy
Luigi Dragone, Daniele Nardi, Riccardo Rosati
Dipartimento di Informatica e Sistemistica,
Universit` di Roma “La Sapienza”,
Via Salaria 113, 00198 Roma, Italy
Stefano M. Trisolini
Telecom Italia and
Data Warehouse and DMDWM Consulting S.A.S. (present aﬃliation)
We present a methodology for Data Warehouse design and its application within the Telecom Italia information system. The methodology is based on a conceptual representation of the Enterprise, which is exploited both in the integration phase of the Warehouse information sources and during the knowledge discovery activity on the information stored in the Warehouse. The application of the methodology in the Telecom Italia framework has been supported by prototype software tools both for conceptual modeling and for data integration and reconciliation.
Key words: Data Warehousing, Data Integration, Conceptual Modeling, Automated Reasoning Email addresses: email@example.com (Diego Calvanese), firstname.lastname@example.org (Luigi Dragone), email@example.com (Daniele Nardi), firstname.lastname@example.org (Riccardo Rosati), email@example.com (Stefano M. Trisolini).
URLs: http://www.inf.unibz.it/~calvanese/ (Diego Calvanese), http://www.dis.uniroma1.it/~dragone/ (Luigi Dragone), Accepted for publication in Information Systems July 2004 1 Introduction Information integration (1) is one of the main problems to be addressed when designing a Data Warehouse (2). Possible inconsistencies and redundancies between data residing at the operational data sources and migrating to the Data Warehouse need to be resolved, so that the Warehouse is able to provide an integrated and reconciled view of data within the organization. The basic components of a data integration system are wrappers and mediators (3; 4).
A wrapper is a software module that accesses a data source, extracts the relevant data, and presents such data in a speciﬁed format, typically as a set of relational tables. A mediator collects, cleans, and combines data produced by wrappers and/or other mediators, according to a speciﬁc information need of the integration system. The speciﬁcation and the realization of mediators is the core problem in the design of an integration system.
In Data Warehouse applications, the data sources are mostly internal to the organization. Moreover, large organizations typically provide informational needs in terms of an integrated conceptual representation of the corporate data that abstracts from the physical and logical structure of data in the sources.
The data stored in the Data Warehouse should reﬂect such an informational need, and hence should be deﬁned in terms of the corporate data.
Telecom Italia is the main Italian provider of national and international telecommunication services, and is among the largest companies worldwide.
In large companies the need to access company data for business intelligence is both an organizational and a technical challenge, requiring a considerable amount of ﬁnancial and human resources. Given the development of information technology in the nineties, in Telecom Italia data warehousing (5) has been a natural evolution of enterprise-wide data management and data integration. A Data Warehouse can be deﬁned as a set of materialized views over the operational information sources of an organization, designed to provide support for data analysis and management’s decisions.
In the last years, Telecom Italia has carried out a large integration initiative of enterprise information systems, called IBDA 1, resulting in the construction of an enterprise-wide database integrated at the conceptual level.
Due to the limitations of the available technologies and the costs of replacing and re-engineering legacy applications, such an activity has lead to a solution based on federated databases and legacy systems wrapping, according to the main guidelines of virtual enterprise system integration.
http://www.dis.uniroma1.it/~nardi/ (Daniele Nardi), http://www.dis.uniroma1.it/~rosati/ (Riccardo Rosati).
1 IBDA stands for “Integrazione di Basi di Dati Aziendali”, i.e., integration of company databases.
Meanwhile, the information systems of Telecom Italia have quickly evolved, in particular new applications have been developed, and existing ones have been upgraded or replaced. Such a rapid evolution has been due to both internal and external factors: the birth and growth of new markets (such as mobile telephone and Internet services) and new competitors, the privatization of the company and the subsequent buyouts. Given also the various and disparate information requirements of business intelligence and decision making activities at diﬀerent levels (e.g., tactical and strategical marketing), the integrated information system started showing inadequate to suit the company’s new informational needs. Consequently, in order to provide an adequate timely deployment, the development of Online Analytical Processing (OLAP) and Decision Support Systems (DSS) applications has been carried out in an unstructured way, resulting in a low modularization and non-eﬀective usage of the Data Warehouse infrastructure.
These issues have pointed out the necessity of adopting an incremental Data Warehousing methodology, in particular, the local-as-view (LAV) approach to Data Warehousing proposed in the context of the European project DWQ 2 (6; 5). In such an approach, each table both in a source and in the Data Warehouse is deﬁned in terms of a view over the global model of the corporate data. This extends the traditional LAV approach to integration, where the information content of each data source is deﬁned in terms of a query over (possibly materialized) global relations constituting the corporate view of data (7; 1; 8; 9; 10). The LAV approach is in contrast to the global-as-view (GAV) approach for data integration (11; 12; 13; 14; 15; 16; 17), typically proposed in Data Warehousing (2; 18). Such an approach requires, for each information need, to specify the corresponding query in terms of the data at the sources. Notably, the LAV approach enables decoupling between information availability and information requirements. Therefore, the introduction of a new information source or the replacement of an existing one does not have any impact on the deﬁnition of the Data Warehouse expressed over the global model of corporate data.
Nonetheless, there are several important questions that are not addressed by the work on integration. More speciﬁcally, integration anticipates semantic problems with data, but does not address eﬃciency issues, which are critical for data warehousing. Thus, to guarantee a proper performance of the Data Warehouse, a major re-organization of the data store may be required, with additional costs. This motivates a layered architecture of the Data Warehouse (5), where a primary Data Warehouse feeds the data to several layers of aggregation (called secondary Data Warehouses or Data Marts) before they become available to the ﬁnal user. Moreover, typically there is the need to eﬃESPRIT Basic Research Action Project EP 22469 “Foundations of Data Warehouse Quality (DWQ)”, http://www.dbnet.ece.ntua.gr/~dwq/.
ciently take into account legacy systems that are not integrated, and external or extemporaneous data sources that can provide relevant information to the Data Warehouse, possibly for a limited time window.
In this paper we report on the experience of Telecom Italia in the development of its Enterprise Data Warehouse. Such a Data Warehouse adopts a layered architecture, including various Primary Data Warehouses concerning phone traﬃc of diﬀerent types and customer information, and several Secondary Data Warehouses, which are at the basis of the Knowledge Discovery and Data Mining activity carried out in Telecom Italia. For the development of its Data Warehouse Telecom Italia has followed the methodology proposed in the DWQ project (6; 5), which is based on the LAV approach.
Indeed, one of the distinguishing features of the approach is a rich modeling language for the conceptual level that extends the Entity-Relationship data model, and thus is fully compatible with the conceptual modeling tools adopted by Telecom Italia. Moreover, the modeling formalism is equipped with automated reasoning tools, which can support the designer during Data Warehouse construction, maintenance and evolution. At the logical level, the methodology allows the designer to declaratively specify several types of Reconciliation Correspondences between data in diﬀerent sources and in the Data Warehouse, that allow her to take care of diﬀerences in the representation at the logical level of the same entities at the conceptual level. Such correspondences are then used to automatically derive the speciﬁcation of the correct mediators for the loading of the materialized views of the Data Warehouse.
This is done by relying on a query rewriting algorithm, whose role is to reformulate the query that deﬁnes the view to materialize in terms of both the source relations and the Reconciliation Correspondences. The characteristic feature of the algorithm is that it takes into account the constraints imposed by the Conceptual Model, and uses the Reconciliation Correspondences for cleaning, integrating, and reconciling data coming from diﬀerent sources.
The paper is organized as follows. In Section 2 we describe the enterprise information system in Telecom Italia. In Section 3 we introduce the enterprise modeling framework at the basis of the Data Warehouse design methodology that is discussed in Section 4. In Section 5 we present the development process of a portion of the Data Warehouse of Telecom Italia, concentrating on the Primary Data Warehouse design activity. In Section 7 we brieﬂy discuss the use of the Secondary Data Warehouse for Data Mining and Decision Support applications. Finally, in Section 8 we draw some conclusions.
The Enterprise Information System in Telecom Italia
In this section we sketch the main methodological and technological issues that arose in the last years in the development of the enterprise integrated database of Telecom Italia and, subsequently, in the design and implementation of a Data Warehouse for Telecom Italia. The two eﬀorts, although driven by diﬀerent needs and requirements can be regarded as a continuous development of an integrated view of the enterprise data. Although we analyze the design and the development of the Data Warehouses in Telecom Italia, we deal with many issues that are common in a large enterprise; so our conclusions are easily generalizable to other scenarios.
2.1 Data Base Integration In 1993, Telecom Italia has launched a strategic project, called IBDA, with
the following main goals:
• the deﬁnition of an Enterprise Data Model and the migration/evolution of existing data;
• the design and implementation of databases covering the main sections of the Enterprise Data Model (customers, suppliers, network, administration, etc.);
• the design and implementation of a client/server architecture and of the communication middleware;
• the design and implementation of data access services.
The driving motivations of the IBDA strategic project are typical of a common scenario in many world-wide large enterprises, where there is a proliferation of legacy databases with a large overhead in the design and maintenance of software for interfacing applications and providing access to the data. IBDA was based on a staged implementation of services that form a layer separating data from application processes. More speciﬁcally, the IBDA service for a database is the exclusive agent that provides access to the data. A database integrated in the IBDA architecture is denoted as BDA and is identiﬁed by a unique number. The access is actually accomplished through contracts that enforce the semantic and referential policies for the database.
In this case, we must cope with the numerous and extremely diﬀerentiated data sources of the Enterprise Information System of Telecom Italia. In particular, one can ﬁnd diﬀerent applications, based on diﬀerent data management technologies (from hierarchical to object-relational DBMSs), that share information, or, in other words, that manage common concepts.
In 1996, the project was formally completed with the integration of 48 operational databases, while in the subsequent years new databases have been continuously added to IBDA. At present, several other databases are included in IBDA as BDAs and there are ongoing projects for adding more. In the following years it has been realized that the process of database inclusion in IBDA is basically incremental.
2.2 Data Warehousing
In Telecom Italia, data warehousing has been a natural evolution of data integration. Starting from a large integrated enterprise database, and given the size of the data to be stored in the Data Warehouse, the architecture of the Telecom Italia Enterprise Data Warehouse includes a group of Primary Data Warehouses, which are devolved to collect, integrate and consolidate the data extracted from the Operational Data Stores. The Primary Data Warehouses feed the data to several systems on which the user applications (e.g., Decision Support System) rely. These systems are also included in the Enterprise Data Warehouse as Secondary Data Warehouses, also known as Data Marts.
The main diﬀerence between the two kinds of Data Warehouses is that the Primary Data Warehouses contain only “atomic” level data, while the Secondary Data Warehouses typically contain information that has been aggregated at diﬀerent levels of detail. The Enterprise Data Warehouse architecture is basically stratiﬁed, therefore the Secondary Data Warehouses are loaded only with data extracted from the Primary Data Warehouses.
Presently, the Telecom Italia Enterprise Data Warehouse includes the following Primary Data Warehouses:
• IPDW–Interconnection Traﬃc Primary Data Warehouse, containing call detail records (CDRs), whose main purpose is to analyze network usage patterns between Telecom Italia Network Nodes and other service providers.