Member of the International Virtual Observatory Alliance

Spatial indexing in a relational database using HEALPix

For performing spatial queries such as a cone search, using standard SQL queries on a relational table of the size of the RASS photon catalogue (about 100 million records), one needs to take special care. As an example consider the following naive implementation in SQL of a simple cone search (SCS) on a catalogue:

 SELECT *
   FROM RASS_PHOTONS
  WHERE 2 * ASIN( SQRT(SIN(($DEC_C-DEC)/2) 
       * SIN(($DEC_C-DEC)/2) + COS($DEC_C) * COS(DEC) 
       * SIN(($RA_C - RA)/2) * SIN(($RA_C - RA)/2))) <= $SR_C  

This is an example SQL for a Simple Cone Search on the RASS_PHOTON table containing all the photons observed with the ROSAT satellite in its all-sky survey mode. It is assumed that the position is stored in two columns, RA and DEC. The parameters of the cone search are given by $RA_C, $DEC_C and $SR_C. Note, the expression in this query is obtained from the cone search SQL implementation extracted from AstroGrid's PAL library, in particular from the visit(Circle) method in their Java class org.astrogrid.datacenter.queriers.sql.AdqlQueryTranslator.

In order to avoid using CPU-expensive trigonometric functions we have followed the example of the SLOAN Digital Sky Surevy (SDSS, HTM) and stored positions of sources on the sky also in Cartesian coordinates on the unit sphere. The advantage of this is that now the statement of the cone search is a simple linear expression that can be rapidly computed:

 SELECT *
   FROM RASS_PHOTONS
  WHERE X*$X_C+Y*$Y_C+Z*$Z_C >= $COS_SR  

Example of an SQL query implementing an SCS using the cartesian coordinates on the unit sphere. It is assumed that the columns X, Y and Z store these coordinates. The parameters of the cone search have to be translated (once) into the parameters $X_C, $Y_C and $Z_C. The parameter $COS_SR is the cosine of the search radius.

For tuning the performance of the most common, spatial queries such as the SCS one needs to configure the database such that table records for those photons that are nearby in space are also nearby on disk. The common way for tuning a relational database for fast query performance is to define so-called indices on one or more of the table columns. Standard data structures for implementing such indices, such as the B-tree or hash table (see any book on database design, for example Garcia-Molina et al, are not quite applicable in our case, the reason being that they are inherently one-dimensional: they cannot capture the two-dimensional patterns inherent in a spatial distribution of objects on the sky.

Some databases offer enhanced, multi-dimensional indexing mechanism (for example using R-Trees, see Garcia-Molina et al), but we have at first chosen to implement an explicit type of spatial indexing using the freely available HEALPix pixelization package (Górski, Hivon,and Wandelt 1999). The HEALPix pixelization divides the sky up in pixels of equal area, organized in bands of constant latitude. These were actually the requirements on the algorithm and are especially useful for analysis of diffuse background radiation such as the Cosmic Microwave Background. The HEALPix team has written a software package supporting the algorithm, in Fortran 90, C and IDL. We have used a prototype Java-port of this code provided by William O'Mullane for our implementations, and have added some of our own ports of various routines not yet included in there.

The aspect of the algorithm that is of use for our purposes is the recursive-indexing algorithm. HEALPix comes in two pixel indexations, assigning integer numbers to the pixels according to two different prescriptions. In one the pixels are counted in rings of equal latitude, starting at the North Pole. This indexation is especially useful for all sky analysis routines like spherical FFTs. For our purposes the recursive scheme is of more interest. Here pixels are indexed in such a way that pixels at a certain resolution level that are contained in a given pixel at a lower resolution have index values that are close together.

Objects on the sky are assigned an index value corresponding to the pixel they lie in. In the database we now order the objects by the index number. This now implies that objects that are close together on the sky are close together on the disk, which speeds up retrieval of nearby objects significantly.

Note that HEALPix is not the only pixelization scheme that has this property. For example the Hierarchical Triangular Mesh HTM) scheme, used by the SDSS consortium has these same properties. We have chosen HEALPix for reasons that one of GAVO's associate members (Anthony J. Banday) is closely involved in the HEALPix effort, and for reasons that the algorithm is more simply implemented and ported. Another reason is that it is good within an IVOA context to have experience with different methodologies.

The indexation was implemented as follows. We chose the equatorial J2000 coordinate system for the orientation. We furthermore chose a fixed resolution (or NSIDE in official HEALPix terminology) for dividing up the sky. We chose the maximum value allowed by the standard software, NSIDE=8192. This corresponds to about 800 million pixels, which is probably not optimal for the size of the table. On average there are 0.1 photons per pixel. For indexing purposes a value of 100 is probably more optimal. For each source (photon or other RASS sources) we determine the pixel it belongs to and the corresponding index. The table in the database contains a column for this index and a database index is created for that column. After that the table is clustered on that index.

This now implies that sources that have (almost) the same value for the index, and are therefore close on the sky, will be close on the disk as well. We do still have to do something about the fact that in general one is interested in regions that are of a more general form than HEALPix pixels. The actual cone search implementation is therefore somewhat more complex. For a given cone - defined by a position (RA, DEC) and a search radius (SR) - we determine the HEALPix resolution whose pixel area size most closely corresponds to the area of the cone. For that resolution, we determine all pixels overlapping with the cone. In general the resolution corresponding to the cone size is lower than the maximum resolution, which we used to index the objects in the database. However, the recursive nature of the indexing scheme leads to the nice property that all pixels at a resolution NSIDE1 > NSIDE2 have indices that are consecutive between two values prescribed by the lower resolution. A request for points in a pixel of low resolution leads to a query for points whose high-resolution indices are between these two values. The nature of the ordering on the disk means that all of these points can be found in consecutive blocks/pages on disk, which leads to very fast access times, requiring only a few disk reads.

Since the cone will overlap with a number of distinct pixels, we may still have to pose multiple queries to the database. However, by choosing an NSIDE whose pixels are close in size to the cone, we hope to minimize this number, while at the same time not getting too many points that fall outside of the cone. That we will find some points that fall inside a pixel, but are outside of the cone cannot be avoided as a consequence of the incompatible geometry of the cone and the pixels.

To implement the cone search exactly we need to add some extra filtering WHERE clauses to the SQL, equivalent to the ones in Figure 34. The final query will be something like the following:

 SELECT *
   FROM RASS_PHOTONS
  WHERE HEALPIX_IX BETWEEN $HP_LOW AND $HP_HIGH;
    AND X*$X_C+Y*$Y_C+Z*$Z_C >= $COS_SR

Example of an SQL query implementing an SCS making use of the HEALPix spatial indexation. It is assumed that a column HEALPIX_IX stores the index of the pixel the photon lies in. The parameters $HP_LOW and $HP_HIGH are the lower and upper values of the index at the resolution level of the actual indexation. These values are derived from the parent pixel at the lower resolution (= larger area) that corresponds roughly to one sixth of the cone.

Acknowledgments

We thank the HEALPix team (Górski, Hivon and Wandelt 1999) for providing us with their software package. We thank Tony Banday, Will O'Mullane and Clive Page for useful discussions. We thank Will O'Mullane for providing us with a partial port of the HEALPix software to Java and Martin Reynecke for his port of the package to C++.

References