CSI 761: Final Project

Astrophysical Tree Codes

Implementation and Choice of Simulations

by Sandy Antunes


Tree Codes

Results and Performance

Simulations and Applicability


Abstract

Tree codes are a very powerful tool for handling large-scale N-body simulations. Although time-consuming for small particle numbers, for large N simulations, they are essential, and they work especially well with non-isotropic distributions. There are several different design methodologies for tree codes, depending on the resources (and time) available. In this paper, we present the implementation of such a code, as well as three galaxy interaction simulations generated using it.

Tree Codes

Theory

Balanced binary trees are the simplest tree structures, and worth mention if only to illustrate tree theory. For a binary tree, each "parent" node has two "children", and each child contains half of the particles in the parent. You continue to recursively bisect your list until you have your end leaves, which each contain only one member. Because you subdivide evenly each step, the tree is "balanced" in having the same number of children for each parent, and each child has half the parent's population.

One quick enhancement to this method would be to split each sector (square in 2-dimensional space, cube in 3-dimensional space) into even quarters/eights. This produces quad-trees in 2-D, and oct-trees in 3D. The Barnes-Hut (BH) algorithm uses this quad-tree/oct-tree subdivision of space. There are two methods of constructing this: one can either place particles in the tree one at a time, or used an adaptive quadtree (recursive subdivision.)

For particle-by-particle, you start at the top node. Whenever a node already has a particle, you simply make both the new and present occupant become children of that node, and progress recursively. Since the area of each parent is divided uniformly into quadrants/octants, you end up with a large number of empty nodes once you are done, and simply set these to Null. The second method, adaptive quadtrees, is the approach we used herein.

For adaptive quadtrees, you have a node and a list of particles in that node. First, you find the node center of mass and its extent and limits. Then, you divide the node space into quadrants/octants, around the center of mass. You subdivide your particle list into a separate list for each sector, and then descend into each sector to reiterate this process until each bottom leaf, with one particle per node, is made. The chief difference in this method (besides using different code to construct it), is that the subdivision of a node into quadrants/octants is based on the physical nature of that space (its actual extrema and using its center of mass) rather than a purely geometric division into equal quarters. Therefore, you are only defining regions that have something interesting in them, and each child may be of much smaller extent than a purely geometric slicing.

Practice

Recursive code is almost self-writing (aptly enough); once you have a routine that traverses the tree, most of the other routines will simply be variants on it. With the tree code, the most important part is simply stopping yourself once a viable node matches your Theta criteria. However, the down side of recursive code is that it can be very time-intensive, if you have to traverse all levels of a multi-dimensional tree.

We attempted a more dynamic scheme, were each parent was not forced to have 4 (or 8) children, but instead had one primary Child. This child then served as the head for a linked list of later siblings (using a pointer to ->younger). Traversing the tree was simple: one visited a node, descending to its child if it had one, then went along the list of younger siblings until there were no more. Each sibling contained a link back to its parent, so travel up the tree was rapid. Indeed, this scheme was both elegant and consumed little resources.

However, implementation took much longer than the fixed number of children used in this setup. And, it was not clear that this would save in memory, since it required extra calls to the system memory allocator; its performance times with earlier versions of this tree code did not show a significant speed-up. Therefore, we filed it away as an interesting experiment.

The final code is, essentially, our particle-particle code from Project 1 with a quick call to "build_tree", a quick replacement of the PP force step with a Tree force step that has nearly identical calling parameters, and a similar change for the potential energy calculations. As a bonus, we kept in the PP potential energy calculator as a check on our initial tree's accuracy (running the PP energy calculation against the tree at the first step, to compare results.) This code reuse greatly speeded development. Some other small customizations had to be made; the equations for force, distance, and potential were made more generic so they could handle nodes or particles, and the structures were slightly redefined.

One large change was to change our initial naive point structure, which had "x, y, z, vx, vy, vz" to simply have an array for coordinates and an array for velocities. By making this change, and using loops for dimensioned qualities instead of directly referencing X/Y/Z elements, our code is functional for both 2-dimensional and 3-dimensional simulations by simply changing one defined constant. Our results presented here are for the 2-dimensional case, simply because the graphics are much more clear than for a 3-dimensional model.

Our code is available as a tar file, including perl routines for making the initial points files. The "readme" file contains full details on all files in this distribution.

Debugging

For initial debugging, one should always check that 1 lone particle simply remains unchanged, and that two points interact properly (i.e. compared with an analytic calculation). In all cases, particle interaction using a tree should conserve mass/particle interactions-- a given particle will interact with N-1 particles regardless of whether they are clustered together in nodes or not.

The energy calculation and simulation results using Theta=0 and using a particle-particle code should return identical results. And, for higher N runs, energy conservation should be maintained as with all our N-body codes. While it isn't the definitive assessment of quality, it does provide an instant indication of major faults when energy is not conserved.

Results and Performance

Parameter Choices

Variable time steps were tested. For the most part, this provided more of a check on our choice of stepsize than anything. Using our "canonical" run of 100 particles, the variable time step resulted in (for 1000 iterations) reaching approximately 2.4 time units, which compares favorably with our 2.5 time units for the fixed time step run (step=0.005). Note that, as opposed to earlier PP code, we set the energy change tolerances higher (factor of 10) to allow for the slightly increased error inherent in tree codes. The adjustable time step method, however, has computation overhead, since you potentially evaluate the forces several time for each time step. To assist in cross-comparison with our other methods, and to keep the computations at a fixed, predictable time step, we used the fixed step method for the results presented herein.

We can do an initial parameter "back of the envelope" calculation to determine a proper time step. Given F = m a ~= 1/r^2 and v = v_o + a t, we don't want a velocity increase in one time step to cover more than, say, 1/100th of the area for the average case. So, setting: a t < 0.01 -> F/m * t < 0.01 -> t/(m r^2) < 0.01. The average distance is related by density, and we can assume at uniform density that particles are R/n^2 apart (for the 2-dimensional case). So, for a system between 0 and 1 with unit masses, for 100 particles, they are typically 0.1 apart, and we want t < 0.005 or so.



Figure 1: Speed and Accuracy versus Theta for tree codes

Figure 1 shows the results of choosing different values for Theta. We used our canonical test (500 steps, 100 particles, rebuild tree every 5 steps) with 2 populations (one uniform, the other consisting of two uniform distributions separated by 0.5 units). We plotted the: average number of force calculations per time step (N(calc), in units of 1000 calculations/step), the clock time to run the 500-step simulation (in seconds) and the relative error compared against using a straight point-point (PP) calculation. To complete the comparison, note that PP codes require 4950 force calculations per time step (for 100 particles), and that our N^2 would be 10,000. Clock time is clearly derived from the number of iterations required, and is the parameter we wish to minimize. At the same time, we wish to keep our relative error low.

Tree rebuilding is another issue. Rebuilding the tree each time step seems wasteful; if the time steps are defined so particles do not significantly stray, the tree should be viable for at least a few time steps. Table 1 shows the results for different rebuild intervals. When the tree was not rebuilt, a recalculation of each node's center of mass (CM) and extent is required. Our nodes are defined loosely as circles (or spheres), with a CM position and a radius (the extent). We do not have the individual particle positions saved within each node, so we must make an estimate of the new extent when recalibrating. After the new CM is calculated, the new extent can be assumed as "distance from the new CM to the old CM, plus the extent." This errs on the side of caution, and also means that nodes tend to spread, or grow in size, when we do not rebuild. Spreading nodes implies our theta criteria gets met less often, and that the tree eventually must be rebuilt. We ran into one interesting trade-off: our time to recalculated the CM and extent was on part with our time to build a tree from scratch. This disadvantageous situation suggests we should work on a less intensive rebuilding routine, and for now just rebuild the tree each step.


Table 1: Tree Rebuilding
Rebuild
timesteps
time NCalc time on
treebuild
time on
treetrim
time in
force tree
1 26s 3452 0.5 1.8 19.7
5 25s 3452 0.5 2.2 21.3
10 26s 3453 0.1 2.3 20.5
20 23s 3453 0.1 2.2 20.6
50 24s 3453 0.05 2.5 19.8

Benchmarks and Bottlenecks

Table 2 shows the results for identical runs using our PP code, mesh code, and tree code, for both a single random distribution and a two-aggregates distribution (where 2 random distributions are at nearly interacting distances.) For the non-asterik cases, a uniform initial distribution was used. For the asterik-marked cases, a split distribution of two random clusters separated by 0.5 unit spaces was used. Our choice of run-time parameters for code comparisons used the following compromise between good statistical results and glacial time scales: Delta T = 0.005, 1000 timesteps, epsilon = 0.025, distances from 0 to 1, sum of masses equals 1.

The first obvious conclusion is that, though Theta=0 approximates a PP code method, it is less optimized and thus a poor idea. PP codes typically take advantage of the symmetry of forces and potentials, and thus reduce the number of calculations down from N^2 to approximately half that. However, the tree method requires (at Theta=0) doing each calculation, and thus has a higher number of force calculations than the PP code. However, by the time you have set Theta to the range of 0.20, the number of calculations for the tree has finally proven less than for the PP code. We chose the value of Theta=0.25 for our simulations; this provided good accuracy while also providing a speed enhancement over PP codes.

The other clear factor is that, by large N, the advantages of the tree code begin to show. The increase in time for the Theta=0.25 case basically doubles with the doubling of the system size, while the PP code goes up by a factor of four. Time lost in the tree code due to tree building and overhead is suddenly dwarfed by the reduced number of force calculations the tree provides, once N gets large enough.


Table 2: Run Times

# of particles 20 100 100* 256 512 512*
PP code 0:01 0:22 0:20 2:05 9:41 9:21
Tree (theta=0) 0:05 1:24 -- -- -- --
Tree (theta=0.25) 0:04 0:57 0:20 3:30 7:40 6:03
(Mesh, different scale) -- -- -- 1:40 1:45 1:43
(Tree, mesh scale) -- -- -- 3:14 -- 10:00


Our PP code spends 90% of its time on the force step. As the number of particles increases, then, even the lossy tree code begins to gain substantially as it simply reduces the number of force calculations required. As this is part of the fundamental nature of PP codes, it would be difficult to reduce the computation time.

The mesh routine spends 80% of its time on the FFT routine, and that's fairly independent of particle numbers or particle distribution, only on grid size. Grid size, though, is determined by the granularity of interaction that you are interested in. This places meshes in a separate category: for fine resolution, the single mesh grid square essentially has to shrink down to the same size as an individual tree node, and you're effectively using an inefficient tree.

The Tree is losing time in memory allocation, which is definitely an area that code tuning can improve. It spends 30% of its time simply doing system memory management. The force equation, on the other hand, takes 25% of the time, and another 21% is taken up by hunting through the tree to see whether to apply the force. Related to this, the calculation of theta takes another 23%. Clearly, a clean-up of the code can drastically improve performance; issues such as memory and tree lookups are an area that can easily be tuned. This further accentuates the edge in speed that the tree has over PP codes; trees have more room to optimize.

Future Enhancements

The primary enhancement is to improve the memory handling. Right now, the tree structure is unwieldy on the platform chosen (Sun Sparc 20), and we lose 1/3 of our time with memory routines. The extra allocation of a recursive clone of the particle list in building the tree is one area that can be made more elegant, perhaps by a quickly formed linked list of smaller extent. The tree itself could be made using "calloc" to try and have its memory be contiguous.

One potential area for future development would be to actively monitor the average depth that the force calculation requires, and to only rebuild the tree when too many PP encounters occur, indicating that the tree has "spread" too far and needs to be recomputed. Since the overhead for tree building and the overhead for simply recalculating nodes centers and extents are nearly equal, this may not provide a significant savings.

By far the best improvement, based on current literature, is to use a multipole expansion for our force (and potential) calculation. By using a more accurate expansion, one can use a more liberal value for Theta (descending less deep into the tree) yet still gain in accuracy. The "Fast Multipole Method" is indeed the most significant improvement for tree codes today.

Simulations and Applicability

We modeled two primary types of interactions: Static and Dynamic/Collision. We used small systems of 512 particles, and also generated mpeg movies of the events.

Static Collapse

The static case is when a random distribution of mass coalesces and falls inward, reaches a central collapse, and expands out again. This cycle takes approximately 2.5 units of scale time, and repeats. The most interesting behavior of such collapses is the formation of temporary structure.

As Figure 2 (and movie 1) shows, the collapse of random particles begins to form small groups, which further unit until there is a central mass containing nearly all the particles. A few particles are ejected from this interaction at sufficient velocity to escape the system. The majority, however, expand out and slow, and begin to collapse in again. However, this expansion and re-contraction shows even stronger examples of structure. For the case shown here, for example, two relatively cohesive groups have formed.

Essentially, we have a system where small amounts of structure magnify during the interaction, gain a stronger gravitational influence over the group, and thus increase their structure. The initial inhomogeneity of the original "random" mix thus produces structure. This is not visible with small-N PP calculations (see our earlier paper, for example a 20-particle simulation, for comparison.)

Dynamically interacting galaxies

Our second class of simulation is active dynamics. Specifically, colliding galaxies. Using the same set of particles, we distribute them into two groups separated along the X axis by 0.5 units (each group therefore has half the density of the previous simulation.) The second group is given an initial velocity of 0.002 units/timestep plus a small random fluctuation of 5%. This should prove sufficient for the two systems to interact in about 1/4 of our simulation time, after 250 timesteps. Figure 3 (and movie 2) shows the result. The drift velocity ends up being less significant than the direct gravitational interaction of the two systems, and we essentially see the galaxies merge. A bridge-like structure is the clearest feature to appear during the merge.

The third simulation imparts a hefty initial velocity to our second cluster, in this case a scale unit of 0.5 per full time interval. Figure 4 (and movie 3) shows four time steps of this, covering initial motion+collapse, interaction, and scatter. This high-velocity interaction shows strong forward and back scattering, as well as core interactions. Working with an increased N provides finer detail of the scatter and the core microstructures. The timing is very important, since the compactness of the galaxies at the time of interaction determines to a large degree whether they preferentially scatter, or merge. This simulation uses essentially parallel evolution; the two galaxies start in similar states and are collapsing and expanding on parallel time scales. By starting one at a larger extent then the other, we can explore a variety of morphological evolutions. Again, trees are useful for any work that looks beyond simple bulk motion.

Conclusion

For simulations where the system can be approximated by a small number of particles, the PP codes have an advantage of simplicity and ease of interpretation. However, as our cross-comparison shows, PP codes become too slow for large N and tree codes are necessary. By simulating with more particles, finer structural evolution becomes visible beyond the simple bulk motion. Given that galaxies and galactic collisions show strong evidence of structure of many scales, such simulations therefore have a higher degree of authenticity for what they model.

By the same token, systems with disparate groups of particle that interact are best handled with tree codes. Colliding galaxies come into this category; for the pre-collision and post-collision stages, the majority of interactions can be done in the higher levels of the tree code, and the earlier steps computer much faster. For the actual close-in collision stage, the tree code is sufficiently robust to provide PP-like accuracy at not too much a cost in speed. And, relative to mesh codes, tree codes have a much finer granularity. Thus, in comparison with other N-body schemes, the use of tree codes for large N simulations is highly recommended.

Appendix: NBody Web References

  • 3-D NBody and Coronal work
  • Nbody theory
  • Nbody methods
  • Numerical Analysis FAQ
  • Parallelizing Compilers (and the language V)