Abstract
-
The performance of data processing in distributed information systems strongly depends on the
efficient scheduling of the applications that access data at the remote sites. This work assumes a
typical model of a distributed information system in which a central site is connected to a number of
highly autonomous remote sites. An application started by a user at the central site is
decomposed into several data processing tasks to be independently processed at the remote sites.
The objective of this work is to find a method for optimizing task processing schedules at the
central site. We define an abstract model of data and a system of operations that implements the
data processing tasks. Our abstract data model is general enough to represent many specific data
models. We show how an entirely parallel schedule can be transformed into a more efficient hybrid
schedule in which certain tasks are processed simultaneously while the others are processed
sequentially. The transformations proposed in this work are guided by a cost-based optimization
model whose objective is to reduce the total data transmission time between the remote sites and the
central site. We show how the properties of data integration expressions can be used to find more
efficient schedules of data processing tasks in distributed information systems.