As described in the previous section, from the point of view of the host computer, the GRAPE-6 hardware consists of three components: memory, pipeline processors, and control logic. The memory stores the data of the particles which exert the force. The pipeline processors calculate the force on the particles they receive from the host, and send the calculated force back to the host computer. In the following, I describe how an application programmer can actually use the GRAPE-6 hardware.
Though GRAPE-6 is designed for the individual (block) timestep scheme, many applications use it with a shared timestep, which is also simpler. So we start with the shared-timestep algorithm. Also, for simplicity, we discuss only the direct summation method.
A typical sequence for using the GRAPE-6 hardware with the shared-timestep algorithm is the following.
To acquire the GRAPE-6 hardware, one calls g6_open(clusterid). The argument clusterid indicates which GRAPE-6 cluster is to be used, when multiple GRAPE-6 boards are connected to a single host. At most sites only one board is connected, and in this case you just supply 0 as the argument. It is probably a good idea to make this argument a global variable in your program, which you can set through a command-line option or something like that.
If someone else is using the requested hardware, this routine puts the process to sleep. It is automatically woken up when the GRAPE-6 hardware becomes available.
After you get GRAPE-6, the first thing you need to do is to set the scaling. GRAPE-6 stores both position and time in fixed-point format, so you need to specify the scaling so that the range of time and distance used in the application program is correctly representable within GRAPE-6. These scaling factors are used by the other library routines to convert data to and from the GRAPE-6 internal format.
Time is not used with the shared timestep, so we start with the position scaling.
The position scaling is set by g6_set_xunit(xunit). The integer argument xunit gives the number of bits used to express the fractional part. The default is 51, which should be fine if the application program is written in nondimensional units.
Let me give an example. If you run a cosmological simulation which covers several Gpc and you use the parsec as the unit of length, you need around 33 bits to express the integral part. The bit length for position is 64. So give 64-33=31, or something a bit smaller like 28 or so, as xunit.
Beware that no error check is performed for overflow or underflow. It is the application programmer's responsibility to provide a proper scaling.
Now that we have specified the scaling, we set the particle data in GRAPE-6 memory. This is achieved by calling
int g6_set_j_particle_mxonly(int clusterid, int address, int index, double mass, double x[3])
The arguments address and index probably need further explanation. Address specifies the (logical) location within GRAPE-6 memory in which this particle is stored. If you send n particles to the hardware, the addresses must be in the range 0 to n-1 and must not overlap. If you send data to the same address twice, the earlier data is simply overwritten. If you do not write to some address j, the data in that location is of course garbage. GRAPE-6 always calculates the force from particles at consecutive addresses, so the particle data in the specified range must all be valid.
Index is a new feature of GRAPE-6, which was not present in any of the previous GRAPE machines. It is an integer value assigned to each particle, which is used for two purposes: to identify the particle (for example, in the neighbour list returned by the hardware), and to avoid calculating the self-interaction when the index of an i-particle matches that of a j-particle.
With simple direct summation, this is not particularly useful, since you almost always set index equal to address. However, if you use a treecode, the address is not the same as the identity of the particle in your code.
The capability to avoid self-interaction is only important in the case of the individual timestep, so we do not discuss it further here.
After you have stored all the necessary particle data in the memory, you can let the GRAPE-6 hardware actually start the calculation. From the point of view of an application, GRAPE-6 always looks like a collection of 48 pipelines. So you send one particle to each pipeline and let all of them calculate the force from the particles in the memory.
To achieve this, you call
void g6calc_firsthalf(int clusterid, int nj, int ni, int index[], double xi[][3], double vi[][3], double fold[][3], double j6old[][3], double phiold[], double eps2, double h2[])
Here, nj is the number of particles stored in GRAPE-6 memory, and ni is the number of particles you want to send, which must not exceed 48. The other arguments are the actual data of the particles to be sent. Index is used here to avoid self-interaction. Xi and vi are the positions and velocities; eps2 and h2 are the softening length and the radius of the neighbour sphere (both squared). For convenience, eps2 is a scalar value, though the hardware can set a different value for each pipeline.
One important point here is that you need to send fold, j6old and phiold, which are used to determine the internal scaling of the force, jerk and potential for each pipeline.
GRAPE-6, unlike GRAPE-4, accumulates the result in fixed-point format, which greatly reduces the complexity of the overall hardware design. Unfortunately, this simplification does not come for free. In order to avoid overflow or underflow, the GRAPE-6 hardware (actually the library functions) must know beforehand roughly how large the result will be, and it determines the scaling accordingly for each pipeline. Fold and the other two arguments are used only to determine this scaling.
Except for the first timestep, at which you do not yet know the force from the previous timestep, you can simply supply the force from the previous timestep, since it will not change by a large factor in one step.
For the first timestep, you need to give some ``educated guess'' for the magnitude of the force. For example, if you are working with the standard units (G = M = 1, etc.), supplying order-of-unity values is fine.
This firsthalf routine sends the data to GRAPE-6 and lets it start the calculation.
The next step is to wait for GRAPE-6 to finish the calculation and to receive the results. This is achieved by calling the lasthalf routine, which takes almost the same arguments as firsthalf. When called, it first checks whether the calculation has completed, and if not, it waits until the calculation finishes. It then receives the data sent from GRAPE-6, performs the necessary scaling, and returns the results.
If an overflow occurs, the lasthalf routine changes the scaling and lets GRAPE-6 redo the calculation. This is the reason why the positions and velocities are passed to this routine as well.
Note that, at least at the time of this writing, the GRAPE-6 hardware is not 100% reliable. The library routines detect various kinds of hardware errors. If lasthalf returns a non-zero value, it has detected a hardware error which could not be corrected. At this point, the application program should reinitialize the hardware and start over. The routine to reinitialize GRAPE-6 is g6_reinitialize(clusterid). It resets the GRAPE-6 hardware, releases it, and reattaches it. In the meantime, somebody else might have used the GRAPE-6 hardware. Therefore, the contents of the GRAPE-6 memory may have changed after your call to g6_reinitialize(clusterid), and you have to send the particles to the memory again.
If lasthalf returns zero, you can proceed to calculate the forces on the next set of 48 particles, or you can retrieve the neighbour list if you need it.
You read the neighbor list in two steps. First, you call g6_read_neighbour_list, which just transfer the content of the hardware neighbour memory to internal storage of the library. Then, you call g6_get_neighbour_list to obtain the list for a specified particle. In case of the overflow of the list, the read function returns . The recovery from overflow will be discussed in more detail in section 8
A typical sequence for using the GRAPE-6 hardware with the block-timestep algorithm is the following.
One can see that there is not really much change. The first two steps are the same. To write data to GRAPE-6 memory, you now need to specify the full predictor data, so you use g6_set_j_particle, which accepts all the data necessary for the predictor.
The force calculation, neighbour list read, and error recovery procedures are the same as those in the case of shared timestep.
The final small difference from the shared-timestep version is that at the end of each blockstep, you send the particles that have just been integrated back to GRAPE-6 memory. You need not send all particles at the beginning of each timestep; you send only the updated ones at the end.
The AMD Athlon processor offers performance significantly better than that offered by the Intel Pentium X, in particular with GCC compilers. However, almost all existing motherboards for AMD Athlon processors share one common problem: the bandwidth achieved with PCI PIO access is very low, even though the DEC Alpha box, which uses exactly the same chipset (AMD 750/760), offers very good performance.
In order to get reasonable performance on AMD machines, we need to activate DMA transfer for all three major data transfers: sending i-particles, sending j-particles, and getting the results. DMA is used by default for getting data on x86 boxes, since Intel processors are also slow at PIO read. However, since PIO write is very fast on Intel processors, DMA is not used by default for sending particles.
To use DMA in sending particles, one needs to do the following:
JPSPACE 10
IJPDMA 1
The number after keyword JPSPACE may be different from 10 after installation. If so, DO NOT CHANGE THAT VALUE.
int g6_initialize_jp_buffer(int clusterid, int size)
The second argument is the size of the buffer, which should be something reasonably large (10000 or so should be okay). With this function called, g6_set_j_particle just sets the data in an internal buffer in host memory. When the buffer becomes full, it is automatically flushed. However, before you calculate the force using the firsthalf/lasthalf pair, you need to explicitly call the function that flushes the buffer:
int g6_flush_jp_buffer(int clusterid)
so that all j-particle data are actually sent to GRAPE-6.