In this paper, we briefly overviewed the computational issues of direct N-body simulations and the use of special-purpose computers, and we reviewed the plan and development status of GRAPE-6. We believe GRAPE-6 will give us many new important scientific results, much in the same way as GRAPE-4 have done so in the five years since its completion.
Since GRAPE-6 is now close to its completion, it might be worthwhile to speculate on what will come next. The silicon device technology has been advancing as usual, and will do so at least for one more decade or two. If we will start the development of the new system just after the completion of GRAPE-6, we will use the technology safely available by 2002, which is either 0.13 or technology. This will give us around a factor of six increase in the transistor count. For the clock speed, we expect that the improvement by a factor of three will not be too hard, mainly because the current clock of GRAPE-6 chip is not very high. Thus, we can achieve about a factor of 20 improvement in the speed over that of GRAPE-6 chip, reaching around 0.7 Tflops per chip. With three thousand chips, the peak speed would reach two Petaflops.
The target number of particles for such system would be around 1 million, for star cluster simulations. For that number, however, one might ask whether or not the direct force calculation is really a good approach, even though it guarantees very good cost-performance ratio.
There are two widely used algorithms to reduce the calculation cost. One is the neighbor scheme (Ahmad and Cohen [1973]), and the other is the treecode (Barnes and Hut [1986]). For general-purpose computers, both algorithms give pretty large improvement in the overall calculation speed. The theoretical gain of the neighbor scheme is (Makino and Hut [1988]), and the break-even point is as small as N=25. For the treecode, the gain is . The break-even point depends on the required accuracy, but is certainly large enough to guarantee a pretty large gain.
However, we (or anybody else) have not yet conceived a feasible way to combine the neighbor scheme and GRAPE, or treecode with individual timestep and GRAPE. In both cases, the problem is that forces on different particles are calculated differently. As a result, n pipelines which calculate the forces on n different particles need n different data from the particle memory, while in the case of the simple direct summation all pipelines can use the same data. In principle, if each pipeline has its own memory, we can reconcile massively parallel GRAPE and these algorithms. With the next-generation system, it will be necessary to integrate the memory and the pipeline processor anyway, since otherwize we cannot achieve required memory bandwidth even for the case of the simple GRAPE architecture. This integrated memory and pipeline might give us the opportunity to implement more efficient force calculation algorithms.