The idea of designing special-purpose computers for scientific simulation is not new. Until recently, however, it had not become widely accepted (and perhaps it still has not). This is primarily because the advantage over general-purpose computers had been limited. First of all, one had to develop the hardware, which was a time-consuming and costly venture. It was also necessary to develop software (both system programs and application programs).
To make matters worse, within a few years the hardware would become obsolete and the investment in both hardware and software would be lost. Of course, general-purpose hardware also becomes obsolete. The software for general-purpose computers, however, was likely to be reused unless there was a drastic change in the architecture.
However, this situation has changed considerably in the last decade. As I described in the previous section, the development of software for general-purpose computers has become much more difficult and costly. Moreover, we cannot expect that programs developed for the current generation of machines can be used on future machines without extensive rewriting. Vector processors required a very different programming style from the one used on scalar processors, and parallel computers require yet another. The current transition from vector processors to microprocessors and MPPs means that most programs tuned for vector processors are now becoming useless.
On the other hand, the cost advantage of special-purpose computers, which determines the useful lifetime of such a machine, has greatly improved. This is not due to efforts on the side of special-purpose computing, but to the decline in the hardware efficiency of general-purpose computers. Although the peak speed of computers has been increasing exponentially, the fraction of the hardware used for arithmetic operations has been falling exponentially. In present microprocessors, only a few percent of the transistors on a chip are used to implement arithmetic units; the remaining 90+% are used for cache memory and control logic. Moreover, it is quite difficult to achieve high efficiency with current microprocessors because of their limited memory bandwidth. Thus, taking both factors into account and averaging over time, the typical efficiency of present microprocessors is around 0.1%.
Therefore, if we can construct a machine with an efficiency higher than, say, 10%, we can achieve a cost advantage of a factor of 100 or more. Of course, the production cost of the machine would be higher because of the small production volume, and there would be losses due to inefficiencies in the design. However, these two penalties combined are unlikely to be as large as a factor of 100. Thus, special-purpose architecture is becoming a very attractive solution. Note that 10 years ago the efficiency of general-purpose computers was higher, and it was therefore more difficult to develop competitive special-purpose computers.
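The cost argument above can be written as a back-of-the-envelope formula. This is a sketch under stated assumptions: it takes the hardware cost per unit of peak speed to be the same for both kinds of machine before penalties, and the symbols $\eta_{\mathrm{sp}}$, $\eta_{\mathrm{gp}}$ (achieved efficiencies) and $P$ (combined penalty from small-volume production and design inefficiency) are our notation, not the source's.

```latex
A = \frac{\eta_{\mathrm{sp}}}{\eta_{\mathrm{gp}}} \cdot \frac{1}{P}
  = \frac{0.10}{0.001} \cdot \frac{1}{P}
  = \frac{100}{P},
\qquad
A > 1 \iff P < 100
```

In words: the special-purpose machine keeps a net cost advantage $A$ as long as the combined penalty $P$ stays below the factor of 100 gained from the efficiency ratio.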
There have been two very different approaches to building special-purpose computers for scientific simulations. One is to design a programmable parallel computer with optimizations tailored to the target application. For example, if the application does not need much memory, we could replace the DRAM main memory with a few fast SRAM chips, thus reducing the total cost by a large factor. If it does not need fast communication, we can use a rather simple network architecture.
In the 1980s, CMOS VLSI floating-point chipsets such as the Weitek 1064/65 [FW85] offered a theoretical speed of around 1/10 that of a vector processor at a cost of around 1,000 USD. Thus, by using several of them in parallel, one could construct a vastly more cost-effective computer. This led to numerous projects, some of which were highly successful. In particular, PAX [Hos92] and the Caltech Hypercubes [FWM94] were so successful that many companies started to sell similar machines as general-purpose parallel computers. As a result of this success, developing a parallel computer as a special-purpose system has become impractical: one can buy a better one from computer companies.
The other approach is to develop ``algorithm-oriented'' processors. Our GRAPE (GRAvity piPE) machines are an extreme example of this approach. As the name suggests, GRAPE is a device that evaluates the gravitational force between particles.
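To make concrete what "evaluating the gravitational force between particles" involves, here is a minimal Python sketch of direct pairwise summation. It illustrates the arithmetic only and is not GRAPE's actual pipeline design or host interface. The function name `direct_forces`, the Plummer softening length `eps`, and the choice G = 1 (code units) are assumptions introduced for illustration. The inner loop body (difference vector, squared distance, inverse cube, accumulate) is the kind of fixed computation that a hardwired force pipeline can repeat for every particle pair.

```python
import math


def direct_forces(pos, mass, eps=0.01, G=1.0):
    """Direct O(N^2) summation of softened gravitational accelerations.

    pos:  list of (x, y, z) tuples
    mass: list of particle masses
    eps:  softening length (hypothetical default, Plummer softening assumed)
    G:    gravitational constant, set to 1 in assumed code units
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    eps2 = eps * eps
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):
            if i == j:
                continue  # skip self-interaction
            # Separation vector pointing from particle i to particle j.
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            s = G * mass[j] * inv_r3
            acc[i][0] += s * dx
            acc[i][1] += s * dy
            acc[i][2] += s * dz
    return acc


# Usage: two unit masses one unit apart, no softening.
acc = direct_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0], eps=0.0)
print(acc[0], acc[1])  # prints [1.0, 0.0, 0.0] [-1.0, 0.0, 0.0]
```

The cost is N(N-1) pair evaluations per step, which is why this loop dominates the run time of a direct N-body code and is the natural target for dedicated hardware.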
In direct N-body simulations, more than 99% of the CPU time is spent calculating the gravitational forces between particles. Even with more sophisticated algorithms such as the treecode, FMM, and PM, a large fraction of the time is spent on pairwise force calculation. In the usual implementations, the direct force calculation typically consumes about half of the total CPU time, so accelerating that part alone would at first sight give a speedup of at most a factor of two. However, it is possible to modify these algorithms to reduce the cost of the calculations other than the pairwise force calculation, at the expense of increasing the cost of the pairwise force calculation. Thus, the actual gain one can achieve for these algorithms is much larger than a factor of two [ABLM98].
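The factor-of-two ceiling for the usual treecode, FMM, and PM implementations follows from Amdahl's law, a standard argument that the source does not spell out. Write $f$ for the fraction of the original run time spent on pairwise force calculation and $s$ for the speedup of that part alone (both symbols are our notation):

```latex
S(f, s) = \frac{1}{(1 - f) + f/s},
\qquad
\lim_{s \to \infty} S = \frac{1}{1 - f}
```

With $f = 1/2$ the overall speedup can never exceed 2, however fast the force hardware is; with $f = 0.99$, as in direct N-body simulation, the ceiling rises to 100. Restructuring an algorithm so that less time goes to the non-force parts and more to pairwise forces raises $f$, and with it the attainable speedup, even if the total operation count grows.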