In principle, the direct N-body simulation of stellar systems is very simple and straightforward. The only thing one has to do is to numerically integrate the following system of equations of motion:
where and
are the position and mass of the particle with
index i and G is the gravitational constant. The summation is
taken over all stars in the system. Unless the total number of stars
in the system is very small, the calculation and summation of the
gravitational forces dominates the total calculation cost.
We have developed a series of special-purpose computers for N-body simulations, which we call GRAPE (GRAvity PipE). Figure 1 gives the basic idea. The host computer, which is usually a general-purpose workstation running UNIX, sends the positions and masses of particles to the GRAPE hardware. Then GRAPE hardware calculates the interaction between particles. What GRAPE hardware calculates is the right hand side of equation (1). When we use the Hermite scheme (Makino [Mak91]), the first time derivative of the force must also be calculated. Figure 2 shows the block diagram of the pipeline unit to calculate the force and its time derivative.
Figure 2: Pipelined processor for the calculation of force and its time
derivative. Reproduced from Makino et al. 1997 [MTES97]).
In order to combine the individual timestep scheme with GRAPE
hardware, one modification of the basic architecture is
necessary. Figure 3 shows the change. In the
individual timestep scheme, particles has their own times and
timesteps. Thus, to calculate the force on a particles i at its own
time , we need the positions and velocities of all other
particles at time
. However, other particles have their own
times, which are generally different from
. In the individual
timestep scheme, we use the predictor based on the polynomial
extrapolation to obtain the positions and velocities of other
particles. Each particle has, in addition to position and velocity at
its own time, acceleration and its time derivative, so that we can
construct a third-order predictor. This third-order predictor allows us
to use the fourth-order time integration scheme, either two-step
Hermite scheme or four-step variable-stepsize Adams-Moulton scheme for
second-order equations.
This prediction must also be done on the GRAPE hardware, since otherwise the amount of the calculation the host computer has to do becomes comparable to that of GRAPE, and therefore the speedup will be too limited.
In the original individual timestep algorithm, each particle choses its
timestep freely, based on some timestep criterion. Thus, timesteps and
times of different particles are all different, and time integration
proceeds in an ``event-driven'' manner, where one particle with
minimum is integrated. This scheme has one obvious
problem that the degree of the parallelism in force calculation is
limited to N, since we can calculate the force on only one
particle. We can force several particles to share exactly the same
time, by forcing their timesteps to be powers of
two[McM86]. This modified individual timestep scheme is
often called the blockstep scheme. This scheme is useful both on GRAPE
and general-purpose computers.
In the modified architecture shown in figure 3, the particle memory keeps all data necessary for prediction, for all particles in the system. At each timestep, the host computer sends the position and velocity of particles to be updated to GRAPE, and GRAPE calculates the forces on them and sends the results back to the host. If the number of pipelines is smaller than the number of particles in the block, this step is repeated until the forces on all particles in the block are obtained. Then the host performs the orbit integration using these calculated forces, and updates the data of the integrated particles in the particle memory of the GRAPE hardware.
Figure 3: GRAPE for individual timestep
In 1989, we started the development of GRAPE systems. The first hardware, GRAPE-1, [IMES90] was a single-pipeline system with low-accuracy force calculation (r.m.s relative accuracy of 2%), without the predictor hardware. The force calculation pipeline was implemented using integer adder/subtracter IC chips and ROM chips (for lookup tables). It was the proof-of-the-concept machine to demonstrate that theoretical astrophysists (we) could construct a real hardware. It had the peak speed of 240 Mflops, and turned out to be useful for a wide range of astrophysical problems where high accuracy is not very important.
GRAPE-2 [IEMS91] is again a single-pipeline system, but with high accuracy. The pipeline was implemented using commercial floating-point ALU chips such as TI 8847. It had the peak speed of 40 Mflops. For this machine, we did not include the predictor hardware, since with the blockstep scheme it was possible to perform prediction operations on the host computer without significant loss of the efficiency. Of course, this is simply because GRAPE-2 was not so fast.
GRAPE-3[OME93] is our first multiple-pipeline system, where
low-accuracy force calculation pipeline similar to that of GRAPE-1 was
integrated into a custom chip. The GRAPE-3 chip was fabricated in
rule and had about 100K transistors. The die size was
9mm by 9mm. It has single pipeline unit with operating clock frequency
of 10 MHz, resulting in the speed of 380 Mflops/chip. The full GRAPE-3
system consisted of two boards with 24 chips each, for the aggregated
peak speed of 18 Gflops. A smaller version of this machine, GRAPE-3A,
which has 8 GRAPE-3 chips with the clock speed of 20 MHz, has been
commercially available since 1993 and more than 100 GRAPE-3A boards
have been manufactured and are used by a number of researchers
worldwide.
GRAPE-4 (Makino et al. [MTES97]) is the first GRAPE hardware to implement the full predictor hardware with high accuracy. In GRAPE-4, one processor board houses one predictor units and 96 (virtual) force calculation pipelines. The total system in the maximum configuration consisted of 36 boards organized into four clusters, and different boards calculated the forces on the same set of 96 particles. In this way, we met the requirement that the number of forces calculated in parallel is small, even though the number of pipelines is large.
Summation of 9 forces from processor boards in the same cluster is taken care by the communication hardware, and final summation of the forces from four clusters is handled by the host.
A board houses 48 force calculation pipeline chips (GRAPE-4 chips)and one predictor pipeline chip (PROMETHEUS chip). GRAPE-4 chip can calculates one interaction in every three clock cycles, since it processes x, y, and z components in three consecutive clock cycles. It actually calculates the forces on two different particles, using six clock cycles. As a result, a GRAPE-4 chip logically looks like containing two pipelines operating at the clock frequency half of the actual one. We named this feature Virtual Multiple Pipeline (VMP). With VMP, we can reduce the memory bandwidth without sacrificing the performance.
GRAPE-4 was completed in 1995, and has been used by many researchers for the study of many astrophysical problems[HM99,MT98].
Figure 4 shows the peak speed of our GRAPE systems and that of representative high-performance general-purpose computers.
Figure 4: The evolution of GRAPE and general-purpose parallel
computers. The peak speed is plotted against the year of
delivery. Filled circles, open triangles and open squares denote
GRAPEs, vector processors, and parallel processors, respectively.