
The Concept of a Divisible Task

In this section we present a new scheduling model applicable to many parallel architectures. We analyze the scheduling of divisible tasks, i.e. tasks that can be divided into parts of arbitrary size. These parts can be processed in parallel and independently of each other. In other words, divisible tasks are parallel applications which include no data dependencies (precedence constraints) and whose granularity of parallelism is fine.

Before going into details we give some motivation for the model and introduce basic concepts. Consider, as an example, searching for a record in a huge database with thousands of records (or more). This can be done by cooperating processors. The database file can be divided into parts with a granularity of one record. Note that the granularity of the division is fine compared to the total size of the database. The search can be conducted in each part independently of the other parts. Thus, the two assumptions of the divisible task model are fulfilled. Finally, the results are reported to a master processor. The same approach can be applied when searching for a pattern in a text, graphics, audio, or other file. A similar situation arises when sorting a database file in a distributed way. Further examples of divisible tasks are related to data parallelism: processing big measurement data files [CR88], image and signal processing [BGMR96], simulations of molecular dynamics [ADKMY93], some problems of linear algebra on big matrices [BT90], solving partial differential equations by the finite element method [Wi91], and many other engineering and scientific problems. Observe that similar assumptions on the divisibility of the work are made in the loop scheduling [LSL94,ML94] and load balancing [G91,GKR92,LM92,XH93] models of parallel applications.

Now we outline the process of data dissemination and processing. A parallel computer consists of $m$ processing elements (PEs), each of which comprises a processor and local memory, and is capable of communicating over the interconnection network (either via an independent network processor, or via software run on the processor itself). The notions of a processor from scheduling theory and of a processing element are equivalent here. Only a PE with a network processor is capable of computing and communicating simultaneously. Initially, the whole volume $V$ of data (or work) to be processed resides in one processor called the originator. The originator processes $\alpha_1$ data units locally and sends the rest (i.e. $V-\alpha_1$) to its idle neighbors. Each processing element retains some of the received data for local computation and sends the rest to its idle neighbors. Thus, PE number $i$ (denoted by $P_i$) retains and processes locally $\alpha_i$ data units, which takes $\alpha_i A_i$ units of time. $A_i$ represents the processing rate, i.e. the reciprocal of the speed, of $P_i$, $i=1,\dots,m$. The transmission time of $x$ data units over communication link $i$ joining two processors is $S_i+xC_i$, where $S_i$ is the startup time spent initiating the communication and $C_i$ is the transmission rate. Our goal is to find a distribution of task parts (i.e. the $\alpha_i$'s) such that the communications and computations are finished in the shortest time. For simplicity of presentation we assume that the processing time depends linearly on the volume of processed data; it will be demonstrated that this assumption can be relaxed.
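To make the linear model concrete, the following sketch computes the load shares $\alpha_i$ for one plausible instance of it: a star (single-level tree) in which the originator sends chunks sequentially to the other PEs, computes while communicating, no results are returned, and all PEs stop at the same moment $T$. The closing conditions then form a linear system in $\alpha_1,\dots,\alpha_m$ and $T$. The function name star_distribution and the sample rates are illustrative assumptions, not taken from the text.

import numpy as np

def star_distribution(A, C, S, V):
    """Load shares for a divisible task on a star (single-level tree).

    A[i] -- processing rate (reciprocal of speed) of P_{i+1}
    C[i], S[i] -- transmission rate and startup time of the link to
                  P_{i+1} (the entries for the originator, i = 0, are unused)
    V -- total volume of data held by the originator P_1

    Assumed scenario: the originator computes while it sends, chunks go
    sequentially to P_2, ..., P_m, no results travel back, and all PEs
    stop at the same moment T.  Returns (alpha, T).
    """
    m = len(A)
    M = np.zeros((m + 1, m + 1))        # unknowns: alpha_1..alpha_m, T
    b = np.zeros(m + 1)
    M[0, 0], M[0, m] = A[0], -1.0       # originator: alpha_1 * A_1 = T
    for i in range(1, m):
        # P_{i+1} waits for all earlier sends, receives its own chunk,
        # then computes: sum_{j<=i} (S_j + alpha_j*C_j) + alpha_i*A_i = T
        M[i, 1:i + 1] = C[1:i + 1]
        M[i, i] += A[i]
        M[i, m] = -1.0
        b[i] = -sum(S[1:i + 1])         # startup times are constants
    M[m, :m] = 1.0                      # conservation: sum(alpha) = V
    b[m] = V
    x = np.linalg.solve(M, b)
    return x[:m], x[m]

alpha, T = star_distribution(A=[2.0, 1.0, 1.0], C=[0.0, 0.5, 0.5],
                             S=[0.0, 0.1, 0.1], V=100.0)
print(alpha, T)                         # load shares and common finish time

With zero startup times and negligible transmission rates the system degenerates to $\alpha_i A_i = T$, i.e. each PE receives load proportional to its speed, which is a useful sanity check.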

The above description still leaves room for further details, e.g. a communication algorithm tailored to the interconnection network. In the following sections we present solutions for different network topologies. Observe that when no results are returned to the originator, all the processors must stop working at the same moment of time. Intuitively, if some $P_i$ finished earlier, it would be possible to off-load the other PEs by moving part of their load to $P_i$, and the whole schedule would be shortened. This observation has been proved both for particular interconnections [CR88,SR93] and for a general type of interconnection [BD97]. Assuming that the results are not returned simplifies the presentation; however, we show in the following sections that the case where results are returned to the originator can also be included in our model. Unless stated otherwise, we assume that the number of divisible tasks is equal to one. The process of distributing the workload, i.e. a communication pattern in which each PE receives different data, is called scattering.
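The simultaneous-completion property can be illustrated numerically (a sketch built on the hypothetical star_distribution above, not a proof): perturbing the balanced solution by shifting load between two PEs can only lengthen the schedule.

def makespan(alpha, A, C, S):
    """Finish time of the last PE for a given split (same star model)."""
    finish = [alpha[0] * A[0]]              # originator computes throughout
    t = 0.0
    for i in range(1, len(alpha)):
        t += S[i] + alpha[i] * C[i]         # sequential sends from P_1
        finish.append(t + alpha[i] * A[i])  # compute after full receipt
    return max(finish)

A, C, S, V = [2.0, 1.0, 1.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.1], 100.0
alpha, T = star_distribution(A, C, S, V)
perturbed = alpha.copy()
perturbed[1] += 1.0                         # move one data unit from P_3...
perturbed[2] -= 1.0                         # ...to P_2
print(makespan(alpha, A, C, S))             # equals the balanced T
print(makespan(perturbed, A, C, S))         # strictly larger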

This part of the book is organized as follows. In Section "Scheduling Divisible Task in Processor Networks" we present the results of applying the idea of a divisible task in various distributed systems. We concentrate on basic models, and then show how to extend them to cover more complex cases. Performance evaluation is one application of the divisible task model; examples of such an evaluation are presented in Section "Performance Prediction Based on the Divisible Task Model - example". Section "Validation of the Model" describes some results of verifying the divisible task model experimentally.

