Cross-matching tools

When working with simulation data, it is frequently necessary to cross-match entries between different catalogues. For example, particles in one snapshot can be matched to a different snapshot to trace the history of individual particles, or to a Subfind catalogue to find the particles belonging to a particular subhalo.

Although these tasks are not specific to the Hydrangea/C-EAGLE simulations, the library includes some tools for this purpose in the hydrangea.crossref module, both for convenience and to integrate this functionality into the SplitFile and ReadRegion classes.

General overview

The hydrangea.crossref module provides three high-level objects for cross-matching purposes:

Object                                  Purpose
hydrangea.crossref.ReverseList          A class to compute and store a “reverse key” list, which can be indexed with the key of the object to be matched (e.g. a particle’s ID) to find its index.
hydrangea.crossref.Gate                 A class to translate indices between two catalogues according to a matched key (e.g. particle IDs).
hydrangea.crossref.find_id_indices()    A function to find the indices corresponding to a specified set of keys (e.g. particle IDs) in a reference array. This is slightly more direct than hydrangea.crossref.Gate in many situations.

Both Gate and find_id_indices() internally use a ReverseList when appropriate. One important limitation of ReverseList is that it can only be used when the keys (typically particle IDs) are limited to reasonably small values, because it internally sets up an array with max(keys) + 1 elements.

This is generally fine for all Hydrangea/C-EAGLE simulations on systems with at least 50 GB RAM, but will not work for e.g. the original EAGLE simulations (due to a different scheme for assigning particle IDs). In such cases, Gate and find_id_indices() switch to a different, generally slower, matching scheme based on sorting and comparing the key lists (the threshold is user-adjustable through the max_direct parameter in both cases).
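The trade-off described above can be illustrated with a small NumPy-only sketch. The helper names reverse_list_nbytes and choose_strategy are hypothetical and not part of hydrangea.crossref; the comparison against max_direct (inclusive) and the 8-byte element size are assumptions made for illustration.

```python
import numpy as np


def reverse_list_nbytes(max_key, itemsize=8):
    """Bytes needed for a reverse list that can be indexed by keys up to max_key."""
    return (int(max_key) + 1) * itemsize


def choose_strategy(keys_a, keys_b, max_direct=int(1e10)):
    """Return 'reverse' if all keys are at most max_direct, otherwise 'sort'.

    Hypothetical helper: the inclusive comparison is an assumption.
    """
    largest = max(int(np.max(keys_a)), int(np.max(keys_b)))
    return "reverse" if largest <= max_direct else "sort"


small_ids = np.array([4, 0, 5, 1], dtype=np.int64)
huge_ids = np.array([3, 10**15], dtype=np.int64)

print(choose_strategy(small_ids, small_ids))  # reverse
print(choose_strategy(small_ids, huge_ids))   # sort

# A reverse list for keys up to 1e15 would need about 8e6 GB, which is why
# a sort-based scheme is the only practical option for such key values.
print(reverse_list_nbytes(10**15) / 1e9)
```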

Examples

Invert a list of particle IDs (part_ids) to find the particle with ID 1001:

rev_list = hydrangea.crossref.ReverseList(part_ids)
index_1001 = rev_list(1001)  # Note parentheses: calling the object, not indexing an array

Match three sets of DM particle indices in snapshot 10 (set_1, set_2, set_3) to their corresponding indices in snapshot 29:

# Set up particle catalogue readers at the two snapshots
snap_10 = hydrangea.SplitFile(snapshot_file_10, 1)
snap_29 = hydrangea.SplitFile(snapshot_file_29, 1)

# Set up the gate
gate = hydrangea.crossref.Gate(snap_10.ParticleIDs, snap_29.ParticleIDs)

# Find the indices corresponding to the three sets in snapshot 29:
ind_1, matched_1 = gate.in_int(set_1)
ind_2, matched_2 = gate.in_int(set_2)
ind_3, matched_3 = gate.in_int(set_3)

Here, ind_1, ind_2, and ind_3 contain the indices in snapshot 29 that correspond to set_1, set_2, and set_3 in snapshot 10 (-1 for particles without a match). matched_1, matched_2, and matched_3 contain the indices into each input set of the particles that could be matched; for DM this should be all of them, since every DM particle is present in each snapshot.
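For particle types that can disappear between snapshots, the second return value matters, because indexing a NumPy array with -1 silently returns its last element. The following sketch uses made-up stand-in arrays (ind, matched, and mass_ref are illustrative values, not output of the library) to show one safe way of combining the two return values.

```python
import numpy as np

# Stand-ins for the two arrays returned by a match (values are made up):
# index in the reference catalogue per queried particle, -1 if unmatched
ind = np.array([7, -1, 2, 5])
# positions within the queried set that could be matched
matched = np.array([0, 2, 3])

# Hypothetical per-particle quantity in the reference catalogue
mass_ref = np.arange(10) * 0.5

# Look up only the matched entries; unmatched ones keep a NaN placeholder
mass_for_set = np.full(len(ind), np.nan)
mass_for_set[matched] = mass_ref[ind[matched]]

print(mass_for_set)
```

Using mass_ref[ind] directly would instead assign the last reference value to the unmatched particle.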

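Finally, to match only a single set of particles (same setup as in the previous example), find_id_indices() can be called directly. Like Gate.in_int(), it returns both the indices and the sub-indices of the matched elements:

ind_1, matched_1 = hydrangea.crossref.find_id_indices(snap_10.ParticleIDs[set_1], snap_29.ParticleIDs)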

Reference

class hydrangea.crossref.Gate(ids_ext, ids_int=None, rev_IDs=None, max_direct=10000000000, sort_below=100, force_sort=False, min_c=100000, save_revIDs=False, verbose=False)

Class for linking two catalogues through a common set of keys.

It provides functionality to translate indices in one (‘external’, subject) catalogue into indices in a second (‘internal’, reference) catalogue, for key sets of arbitrary length or values.

Parameters:
  • ids_ext (ndarray(int)) – First, ‘external’ set of IDs (keys).
  • ids_int (ndarray(int) or None, optional) – Second, ‘internal’ set of IDs (keys). If it is not provided (None, default), the corresponding reverse-ID list must be provided instead (rev_IDs).
  • rev_IDs (ndarray(int), ReverseList, or None, optional) – The reverse-ID list for the second, internal data set. It can either be supplied as a plain (int) array or as a ReverseList object. If it is not provided (None, default), the internal IDs must be supplied as ids_int.
  • max_direct (int, optional) – Maximum value in either of the key sets for which a reverse-index based match can be performed (default: 1e10). For larger values, keys are always matched via sorting. This parameter has no effect if a reverse ID list is provided as rev_IDs.
  • sort_below (int, optional) – Maximum length of either key set for which the sort-based search is always done (default: 100). Ignored if rev_IDs is provided.
  • force_sort (bool, optional) – Force a sort-based search irrespective of input (default: False). This is equivalent to, but slightly faster than, setting max_direct = 0. Ignored if rev_IDs is provided.
  • min_c (int, optional) – Minimum length of (longer) key set for which a (possible) sort-based search is outsourced to the C-library (default: 1e5), if available. This parameter has no effect if a reverse ID list is provided as rev_IDs.
  • save_revIDs (bool, optional) – Internally store the reverse ID list, if it was provided or created internally (default: False). This can be useful if one data set must be matched to multiple others, in which case the reverse ID list can be re-used.
  • verbose (bool, optional) – Provide diagnostic messages (default: False)
rev_IDs

The reversed internal ID (key) list (None if not created or not stored).

Type: ReverseList instance or None
in_ext(index=None)

Return the indices of the internal keys in the external catalogue.

Parameters: index (ndarray(int), optional) – Subset of the internal catalogue for which the indices should be returned. If None (default), return indices for the entire internal catalogue.
Returns:
  • ii_ext (ndarray(int)) – The indices into the external catalogue corresponding to the queried internal elements (-1 for those that are not matched).
  • ii_matched (ndarray(int)) – The indices into the queried internal elements that could be matched to an external key.
in_int(index=None)

Return the indices of the external keys in the internal catalogue.

Parameters: index (ndarray(int), optional) – Subset of the external catalogue for which the indices should be returned. If None (default), return indices for the entire external catalogue.
Returns:
  • ie_int (ndarray(int)) – The indices into the internal catalogue corresponding to the queried external elements (-1 for those that are not matched).
  • ie_matched (ndarray(int)) – The indices into the queried external elements that could be matched to an internal key.
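The direction of in_int() and in_ext() is easy to confuse, so the following toy sketch spells out the intended semantics for two small key sets. It uses a plain dictionary lookup in a hypothetical helper (indices_in) rather than the Gate class itself, and only mirrors the return convention documented above.

```python
import numpy as np


def indices_in(keys, reference):
    """For each key, return its index in reference (-1 if absent) and the matched positions."""
    position = {int(key): i for i, key in enumerate(reference)}
    idx = np.array([position.get(int(key), -1) for key in keys], dtype=np.int64)
    return idx, np.nonzero(idx >= 0)[0]


ids_ext = np.array([10, 11, 12, 13])   # 'external' (subject) keys
ids_int = np.array([12, 99, 10])       # 'internal' (reference) keys

# Analogue of in_int(): where does each external key sit in the internal catalogue?
ie_int, ie_matched = indices_in(ids_ext, ids_int)
print(ie_int, ie_matched)   # [ 2 -1  0 -1] [0 2]

# Analogue of in_ext(): where does each internal key sit in the external catalogue?
ii_ext, ii_matched = indices_in(ids_int, ids_ext)
print(ii_ext, ii_matched)   # [ 2 -1  0] [0 2]
```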
class hydrangea.crossref.ReverseList(ids, delete_ids=False, assume_positive=False, compact=False)

Class for creating and querying a reverse-index list.

This is essentially a thin wrapper around the create_reverse_list() function, but avoids the need to artificially expand the list to deal with possible out-of-bounds queries. It creates and queries an array containing the index for each ID value, i.e. the inverse of the input list (which gives the ID for each index).

Warning

It is a bad idea to instantiate this class with a key set that contains large values in relation to the available memory: as a rough guide, consider the limit as 1/8 * [RAM/byte]. Ignoring this will result in undefined, slow, and likely annoying behaviour.

Parameters:
  • ids (ndarray (int)) – Input keys (IDs) to invert. These are assumed to be unique; death and destruction may happen if this is not the case. The array must be one-dimensional. Any negative values are assumed to be dummy elements and are ignored.
  • delete_ids (bool, optional) – Delete input keys after building reverse list (default: False).
  • assume_positive (bool, optional) – Assume that all the input values are non-negative, which speeds up the inversion. If False (default), the code checks explicitly which input values are non-negative and only transcribes these to the reverse list.
  • compact (bool, optional) – Make the reverse-index list shorter by (transparently) subtracting the minimum ID from inputs (default: False).
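The effect of the compact option can be illustrated as follows. This is a minimal NumPy sketch of the idea of subtracting the minimum ID, not the class's actual implementation; the names offset, reverse, and lookup are chosen for illustration only.

```python
import numpy as np

ids = np.array([1_000_004, 1_000_000, 1_000_005, 1_000_001])

# Without an offset, the reverse list needs max(ids) + 1 entries
n_plain = int(ids.max()) + 1

# With the minimum ID subtracted, only max - min + 1 entries are needed
offset = int(ids.min())
reverse = np.full(int(ids.max()) - offset + 1, -1, dtype=np.int64)
reverse[ids - offset] = np.arange(len(ids))


def lookup(query_id):
    """Return the index of query_id in ids, or -1 if it is absent."""
    shifted = query_id - offset
    if shifted < 0 or shifted >= len(reverse):
        return -1
    return int(reverse[shifted])


print(n_plain, len(reverse))   # 1000006 6
print(lookup(1_000_005))       # 2
print(lookup(3))               # -1
```

The saving is largest when the IDs occupy a narrow range far from zero; for IDs that start near zero it makes little difference.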
reverse_IDs

The reverse-index array, created upon instantiation.

Type: ndarray(int)
num_int

The number of keys in the input array.

Type: int
query(ids, assume_valid=False)

Find the indices of the input (external) IDs.

Parameters:
  • ids (ndarray(int)) – The IDs whose indices in the internal list (used to set up the reverse list) should be determined. Out-of-bound situations are dealt with internally.
  • assume_valid (bool, optional) – Assume that all input IDs are within the range of the internal reverse-index list, so that out-of-bound check can be skipped (default: False).
Returns:

indices (ndarray(int)) – The indices corresponding to each input ID (-1 if not found).

Note

This is also the object’s call method, so it can be used directly without specifying query.

Example

>>> import numpy as np
>>> ids = np.array([4, 0, 5, 1])
>>> reverse_ids = ReverseList(ids)
>>> reverse_ids(np.array([3, 4]))
array([-1, 0])
query_matched(ids, assume_valid=False)

Find indices of the input (external) IDs, also listing matches.

Parameters:
  • ids (ndarray(int)) – The IDs whose indices in the internal list (used to set up the reverse list) should be determined. Out-of-bound situations are dealt with internally.
  • assume_valid (bool, optional) – Assume that all input IDs are within the range of the internal reverse-index list, so that out-of-bound check can be skipped (default: False).
Returns:

  • indices (ndarray(int)) – The indices corresponding to each input ID (-1 if not found).
  • matches (ndarray(int)) – The indices into the input ids array for keys that could be matched (i.e. have a non-negative value of indices).

Example

>>> import numpy as np
>>> ids = np.array([4, 0, 5, 1])
>>> reverse_ids = ReverseList(ids)
>>> reverse_ids.query_matched(np.array([3, 4]))
(array([-1, 0]), array([1]))

hydrangea.crossref.cKatamaran_search(a, b)

Sped-up “Katamaran-search”, which uses an external C library.

This assumes that the elements in a and b are unique, and that a and b are sorted.

Parameters:
  • a (ndarray) – Elements to locate. Must be unique and sorted.
  • b (ndarray) – Reference values. Must also be unique and sorted.
Returns:

ind_a (ndarray(int)) – Indices of a in b (-1 for elements that could not be matched).

Note

a and b should be of the same data type. If they contain floats, the matching values must be exactly the same to machine precision.

For small arrays, the overhead of interfacing to the C code may negate the speed benefit; consider using katamaran_search() in that case.

hydrangea.crossref.create_reverse_list(ids, delete_ids=False, assume_positive=False, max_val=None)

Create a reverse-index list from a (unique) list of IDs.

This gives the index for each ID value, and is thus the inverse of the input list (which gives the ID for an index).

Warning

It is a bad idea to call this function if the input ID list contains very large values in relation to the available memory: as a rough guide, consider the limit as 1/8 * [RAM/byte]. Ignoring this will result in undefined, slow, and likely annoying behaviour.

Parameters:
  • ids (ndarray (int)) – Input keys (IDs) to invert. These are assumed to be unique; death and destruction may happen if this is not the case. The array must be one-dimensional. Any negative values are assumed to be dummy elements and are ignored.
  • delete_ids (bool, optional) – Delete input keys after building reverse list (default: False).
  • assume_positive (bool, optional) – Assume that all the input values are non-negative, which speeds up the inversion. If False (default), the code checks explicitly which input values are non-negative and only transcribes these to the reverse list.
  • max_val (int, optional) – Build a reverse list with at least max_val+1 entries (i.e. one that can be indexed by values up to max_val), even if this exceeds the maximum input ID value. If None (default), the reverse list is built up to self-indexing capacity (i.e. with max(ids)+1 elements).
Returns:

rev_IDs (ndarray(int)) – The reverse index list. If the input list contains fewer than 2 billion elements, it is of type np.int32, otherwise np.int64.

Note

For most practical purposes, it may be more convenient to directly use the Gate or ReverseList classes, or the find_id_indices() function, to correlate IDs, all of which call this function internally.

Example

>>> # Standard use to invert an array:
>>> import numpy as np
>>> ids = np.array([4, 0, 5, 1])
>>> create_reverse_list(ids)
array([1, 3, -1, -1, 0, 2])
>>> # Including negative array values:
>>> ids = np.array([4, 0, -1, 5, 1])
>>> create_reverse_list(ids)
array([1, 4, -1, -1, 0, 3])
>>> # Including use of max_val:
>>> create_reverse_list(ids, max_val=10)
array([1, 4, -1, -1, 0, 3, -1, -1, -1, -1, -1])
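The behaviour shown in this example can be reproduced with a few lines of NumPy. The function below (reverse_list_sketch, a hypothetical name) is an illustrative sketch under the documented conventions, not the library's implementation; for simplicity it always returns an np.int64 array.

```python
import numpy as np


def reverse_list_sketch(ids, max_val=None):
    """Build an array rev such that rev[ids[i]] == i, with -1 for absent keys."""
    ids = np.asarray(ids)
    valid = np.nonzero(ids >= 0)[0]   # negative keys are treated as dummies
    size = int(ids[valid].max()) + 1 if len(valid) else 0
    if max_val is not None:
        size = max(size, int(max_val) + 1)
    rev = np.full(size, -1, dtype=np.int64)
    rev[ids[valid]] = valid           # assumes the valid keys are unique
    return rev


print(reverse_list_sketch([4, 0, 5, 1]))
print(reverse_list_sketch([4, 0, -1, 5, 1]))
print(reverse_list_sketch([4, 0, -1, 5, 1], max_val=10))
```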
hydrangea.crossref.find_id_indices(ids_ext, ids_int, max_direct=10000000000, min_c=100000, sort_below=None, force_sort=False, sort_matches=True, verbose=False)

Find and return the locations of IDs in a reference list.

This function can be used to translate indices in one (‘external’, subject) catalogue into indices in a second (‘internal’, reference) catalogue, for key sets of arbitrary length or values. If the maximum value is below an (adjustable) threshold, the lookup is performed via an explicit reverse-lookup list. Otherwise, both input and reference lists are sorted and compared via a “Katamaran-search”.

Parameters:
  • ids_ext (ndarray (int)) – Array of keys (IDs) whose indices in ids_int should be returned. The keys should be unique, unless the search is guaranteed to be executed via a reverse list.
  • ids_int (ndarray (int)) – Internal (‘reference’) list of keys (IDs), assumed to be unique. The function will search for each input ID in this array.
  • max_direct (int, optional) – Maximum value in either key set for which a reverse-index based match is performed, rather than a sort-based search (default: 1e10).
  • min_c (int, optional) – Minimum length of longer ID list for which a (possible) Katamaran-search is outsourced to the C-library, if compiled (default: 1e5). Ignored if a direct search is performed.
  • sort_below (int or None, optional) – Maximum length of either ID list for which sort-based search is preferred. If None (default), always use the reverse-index method if possible.
  • force_sort (bool, optional) – Force a sorted search irrespective of maximum input values. This is equivalent to, but slightly faster than, setting max_direct = 0.
  • sort_matches (bool, optional) – Explicitly sort matching indices from Katamaran-search in ascending order, so that its result is identical to reverse-list based method (default: True).
  • verbose (bool, optional) – Print timing information (default: False)
Returns:

  • ie_int (ndarray(int)) – The index in ids_int for each input ID (-1 if it could not be located at all).
  • ie_matched (ndarray(int)) – The input (external) ID indices that could be matched.

Note

For large arrays, using a reverse-lookup list is typically much faster, but may use substantial amounts of memory. For small input lists, the sort-and-search approach may be faster because it avoids the overheads of setting up the reverse list.
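For key values that are too large for a reverse list, a sort-based match can be sketched with standard NumPy calls. Note that this sketch swaps in np.searchsorted (a binary search) for the Katamaran-search, and that match_by_sorting is a hypothetical name; it only mirrors the documented return convention and assumes the reference keys are unique.

```python
import numpy as np


def match_by_sorting(ids_ext, ids_int):
    """Return (ie_int, ie_matched) for ids_ext located in ids_int, without a reverse list."""
    ids_ext = np.asarray(ids_ext)
    ids_int = np.asarray(ids_int)
    ie_int = np.full(len(ids_ext), -1, dtype=np.int64)
    if len(ids_int) == 0:
        return ie_int, np.zeros(0, dtype=np.int64)

    order = np.argsort(ids_int)                  # permutation that sorts the reference keys
    sorted_int = ids_int[order]
    pos = np.searchsorted(sorted_int, ids_ext)   # candidate slot for each queried key
    pos = np.minimum(pos, len(sorted_int) - 1)   # keep candidates within bounds
    found = sorted_int[pos] == ids_ext           # a key is matched only if the slot holds it
    ie_int[found] = order[pos[found]]            # translate back to unsorted positions
    return ie_int, np.nonzero(found)[0]


ids_int = np.array([4, 0, 5, 1], dtype=np.int64)
ids_ext = np.array([3, 4, 10**15, 1], dtype=np.int64)
ie_int, ie_matched = match_by_sorting(ids_ext, ids_int)
print(ie_int, ie_matched)   # [-1  0 -1  3] [1 3]
```

The memory use here scales with the number of keys rather than with their maximum value, at the cost of an O(N log N) sort.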

hydrangea.crossref.katamaran_search(a, b)

Perform a “Katamaran” search to locate elements of a in b.

This assumes that the elements in a and b are unique, and that a and b are sorted.

Parameters:
  • a (ndarray) – Elements to locate. Must be unique and sorted.
  • b (ndarray) – Reference values. Must also be unique and sorted.
Returns:

ind_a (ndarray(int)) – Indices of a in b (-1 for elements that could not be matched).

Note

a and b should be of the same data type. If they contain floats, the matching values must be exactly the same to machine precision.

This function is implemented in pure Python and may be very slow for large arrays; consider using cKatamaran_search() for these instead.
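One plausible reading of such a search over two sorted, unique arrays is a two-pointer walk that advances through both in parallel. The sketch below (two_pointer_search, a hypothetical name) illustrates that technique; it is an assumption about the general approach, not a copy of the library's code.

```python
import numpy as np


def two_pointer_search(a, b):
    """Locate elements of sorted, unique a in sorted, unique b (-1 where absent)."""
    ind_a = np.full(len(a), -1, dtype=np.int64)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ind_a[i] = j   # match: record it and advance both pointers
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1         # a[i] cannot occur later in b, so it stays unmatched
        else:
            j += 1         # b[j] is smaller than every remaining element of a
    return ind_a


a = np.array([1, 3, 4, 9])
b = np.array([0, 1, 4, 5, 9])
print(two_pointer_search(a, b))   # [ 1 -1  2  4]
```

Each element of a and b is visited at most once, so the cost is linear in len(a) + len(b), but the explicit Python loop is what makes a compiled version attractive for large inputs.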

hydrangea.crossref.query_array(array, indices, default=-1)

Retrieve values from an array that may be too short.

Parameters:
  • array (ndarray) – The array to retrieve values from. It should be of numerical type, and one-dimensional.
  • indices (ndarray(int)) – Indices of array to query.
  • default (value, optional) – The value to assign to out-of-bound indices (default: -1).
Returns:

values (ndarray) – The array values of the specified indices, with out-of-bounds indices (negative or >= len(array)) set to default.

Example

>>> import numpy as np
>>> arr = np.array([1, 4, 0, 8])
>>> ind = np.array([0, 2, 8])
>>> query_array(arr, ind, default=-100)
array([1, 0, -100])
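The out-of-bounds handling can be expressed with a boolean mask in NumPy. The following function (query_array_sketch, a hypothetical name) is a minimal sketch of the documented behaviour, assuming that default can be represented in the array's data type.

```python
import numpy as np


def query_array_sketch(array, indices, default=-1):
    """Return array[indices], using default where an index is negative or too large."""
    array = np.asarray(array)
    indices = np.asarray(indices)
    values = np.full(len(indices), default, dtype=array.dtype)
    in_range = (indices >= 0) & (indices < len(array))
    values[in_range] = array[indices[in_range]]
    return values


arr = np.array([1, 4, 0, 8])
print(query_array_sketch(arr, np.array([0, 2, 8]), default=-100))   # [   1    0 -100]
print(query_array_sketch(arr, np.array([-1, 3])))                   # [-1  8]
```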