seqgra.evaluator.sis.sis module
Finds sufficient input subsets for an input and black-box function.
This module implements the sufficient input subsets (SIS) procedure published in [1]. The goal of this procedure is to interpret black-box functions by identifying minimal sets of input features whose observed values alone suffice for the same decision to be reached, even with all other input values missing.
More precisely, presuming the function’s value at an input x exceeds a pre-specified threshold (f(x) >= threshold), this procedure identifies a collection of sparse subsets of features in x, SIS-collection = [sis_1, sis_2, …] where each sis_i satisfies f(x_sis_i) >= threshold, and x_sis_i is a variant of x where all positions except for those in the SIS are masked.
The authors of the SIS paper [1] recommend that the threshold be selected based on the application, e.g. by precision/recall considerations in the case f is a classifier. Note that as the threshold is increased, the SIS become larger. The mask is likewise pre-specified and also highly application-dependent. In the SIS paper, the authors mask values by using a mean feature value (e.g. a mean word embedding in natural language applications, or a mean pixel value in image classification). Other possible masking values could include <UNK> tokens or zero values. Regardless of choice, one should check that the function’s prediction on the fully-masked input is uninformative.
Note: this procedure can be used to interpret any arbitrary function, not just those stemming from machine learning applications!
Typical usage example:
In this example, suppose f returns the L_2 norm of its inputs. With a threshold of 1, the two SIS identified are [1] and [2] (where the 1 and 2 are indices into the original input), such that if we select just these values (and mask all others, with the supplied all-zero mask), we have f([0, 10, 0]) >= 1 and f([0, 0, 5]) >= 1.
    f_l2 = lambda batch_coords: np.linalg.norm(batch_coords, ord=2, axis=-1)
    threshold = 1.0
    initial_input = np.array([0.1, 10, 5])
    fully_masked_input = np.array([0, 0, 0])
    collection = sis_collection(f_l2, threshold, initial_input, fully_masked_input)
See the docstring of sis_collection for more detailed usage information. Additional usage examples can be found in the tests for sis_collection.
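The claims in the usage example can be checked with NumPy alone, without calling sis_collection. The following is a minimal sketch; the helper apply_sis is hypothetical and not part of this module. It also checks that the prediction on the fully-masked input falls below the threshold, as recommended above.

```python
import numpy as np

# Same black-box function as in the usage example: L2 norm over the last axis.
f_l2 = lambda batch_coords: np.linalg.norm(batch_coords, ord=2, axis=-1)

threshold = 1.0
initial_input = np.array([0.1, 10, 5])
fully_masked_input = np.array([0, 0, 0])


def apply_sis(sis_indices):
    """Hypothetical helper: keep only the SIS positions, mask everything else."""
    mask = np.zeros(initial_input.shape, dtype=bool)
    mask[sis_indices] = True  # True = keep the original value
    return np.where(mask, initial_input, fully_masked_input)


x_sis_1 = apply_sis([1])  # [0, 10, 0]
x_sis_2 = apply_sis([2])  # [0, 0, 5]

# f expects a batch of shape (B x D), so stack the inputs into one batch.
values = f_l2(np.stack([x_sis_1, x_sis_2, fully_masked_input]))
```

Here values holds f for the two masked variants and for the fully-masked input; the first two meet the threshold and the last does not.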
References:
- [1] Carter, B., Mueller, J., Jain, S., & Gifford, D. (2018). What made you do this? Understanding black-box decisions with sufficient input subsets. arXiv preprint arXiv:1810.03805. https://arxiv.org/abs/1810.03805
- class SISResult(sis, ordering_over_entire_backselect, values_over_entire_backselect, mask)
Bases: seqgra.evaluator.sis.sis.SISResult
Specifies a single SIS identified by the find_sis procedure.
- Fields:
- sis: Array of idxs into the mask which define the sufficient input subset. These idxs describe the unmasked positions in the input. This array has shape (k x idx.shape), where k is the length of the SIS and idx is an idx into the mask. Note that in case of any ties between elements during backward selection, lower indices appear later in this array (see docstring for find_sis).
- ordering_over_entire_backselect: Array of shape (m x idx.shape), containing the order of idxs masked during backward selection while identifying this SIS, where 1 <= m <= d (and d is the max number of maskable positions). Later elements in this list were masked later during backward selection. If this is the first SIS extracted for this input, then m = d. Otherwise, m < d (as elements in earlier SIS are not considered again when extracting additional SIS in the sis_collection procedure). In particular, m + the total number of elements in all previous SIS = d.
- values_over_entire_backselect: Array of floats of shape (m,) containing the values found during backward selection, corresponding to the idxs in ordering_over_entire_backselect. At each position, the value is the value of f after that corresponding position is masked. The length m is defined in the same way as in ordering_over_entire_backselect.
- mask: Boolean array of shape M that corresponds to this SIS. Applying this mask to the original input produces a version of the input where all values are masked except for those in the SIS. The mask and input may have different shapes, as long as the mask is broadcastable over the input (see docstring of sis_collection for details/example).
- approx_equal(other, rtol=1e-05, atol=1e-08)
Checks that this SISResult and another SISResult are approximately equal.
SISResult.{sis, mask, ordering_over_entire_backselect} are compared exactly, while SISResult.values_over_entire_backselect are compared with slight tolerance (using np.allclose with provided rtol and atol). This is intended to check equality allowing for small differences due to floating point representations.
- Parameters
other – A SISResult instance.
rtol – Float, the relative tolerance parameter used when comparing values_over_entire_backselect (see documentation for np.allclose).
atol – Float, the absolute tolerance parameter used when comparing values_over_entire_backselect.
- Returns
True if self and other are approximately equal, and False otherwise.
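The comparison rule described above (exact for indices and mask, tolerant for values) can be sketched as follows. This is an illustration under assumptions, not the module's code: Result is a hypothetical stand-in namedtuple with the same four fields as SISResult, approx_equal_sketch is a hypothetical free function, and the field values are illustrative.

```python
import collections
import numpy as np

# Hypothetical stand-in with the same four fields as SISResult.
Result = collections.namedtuple(
    'Result',
    ['sis', 'ordering_over_entire_backselect',
     'values_over_entire_backselect', 'mask'])


def approx_equal_sketch(a, b, rtol=1e-05, atol=1e-08):
    """Exact comparison for index/mask fields, tolerant comparison for values."""
    return (np.array_equal(a.sis, b.sis)
            and np.array_equal(a.mask, b.mask)
            and np.array_equal(a.ordering_over_entire_backselect,
                               b.ordering_over_entire_backselect)
            and np.allclose(a.values_over_entire_backselect,
                            b.values_over_entire_backselect,
                            rtol=rtol, atol=atol))


r1 = Result(sis=np.array([[1]]),
            ordering_over_entire_backselect=np.array([[0], [2], [1]]),
            values_over_entire_backselect=np.array([11.18, 10.0, 0.0]),
            mask=np.array([False, True, False]))
# Same result, with a tiny floating-point perturbation in the values.
r2 = r1._replace(
    values_over_entire_backselect=r1.values_over_entire_backselect + 1e-10)
# Different SIS indices: must not compare equal.
r3 = r1._replace(sis=np.array([[2]]))
```

With these inputs, r1 and r2 compare as approximately equal, while r1 and r3 do not because the sis field is compared exactly.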
- count(value, /)
Return number of occurrences of value.
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- mask
Alias for field number 3
- ordering_over_entire_backselect
Alias for field number 1
- sis
Alias for field number 0
- values_over_entire_backselect
Alias for field number 2
- find_sis(f, threshold, current_input, current_mask, fully_masked_input)
Returns a single SIS from one (possibly partially-masked) input.
This method combines both the BackSelect and FindSIS procedures as defined in the SIS paper [1].
- Parameters
f – A function mapping an array of shape (B x D), containing a batch of B D-dimensional inputs to an array of scalar values with shape (B,).
threshold – A scalar, used as threshold in SIS procedure. Corresponds to tau in the SIS paper [1].
current_input – Array (or type convertible to array) of shape D on which to apply the SIS procedure. D may be multi-dimensional. If any positions are already masked, these must be specified in current_mask.
current_mask – Boolean array (or type convertible to array) of shape M corresponding to already-masked positions in current_input. If no values are masked, this is an empty mask (i.e. all values in the mask == True).
fully_masked_input – Array (or type convertible to array) of shape D (same as current_input), in which all positions hold their masked value. If the mask and input are not the same shape (M != D), the mask must be broadcastable over the input. This enables masking entire rows or columns at a time. For example, for an input of shape (2, 3), using a mask of shape (1, 3) will mask entire columns at the same time during backward selection, and a mask of shape (2, 1) will mask entire rows at a time.
- Returns
- A SISResult corresponding to the identified SIS (see docstring for SISResult), or None if no SIS is identified, which occurs only when the prediction on the initially provided input is below the threshold, i.e. f(current_input) < threshold, or if all positions are given as masked in current_mask.
- The SIS values are sorted so that the earlier elements in the SIS were masked later during backward selection (see docstring of SISResult).
- Note that in the case of value ties during backward selection, the first of the positions is masked first (see docstring for _backselect). This means that if both elements end up in the SIS, the one with the larger index appears first in the SIS (since the SIS is built by adding elements from the backselect_stack in reverse order).
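The two stages described above (backward selection, then rebuilding the SIS in reverse masking order) can be sketched for the simplest case: a 1-D input with a mask of the same shape. This is a simplified illustration, not this module's implementation. The names backselect_sketch and find_sis_sketch are hypothetical, the sketch returns plain index lists rather than a SISResult, and it does not handle broadcastable masks.

```python
import numpy as np


def backselect_sketch(f, current_input, current_mask, fully_masked_input):
    """Greedily masks one position at a time, keeping f as high as possible.

    Returns (ordering, values): positions in the order they were masked, and
    the value of f right after each of them was masked.
    """
    mask = np.array(current_mask, dtype=bool)  # True = not masked
    ordering, values = [], []
    while mask.any():
        candidates = np.flatnonzero(mask)
        # One trial mask per remaining position, with that position masked.
        batch_of_masks = np.tile(mask, (len(candidates), 1))
        batch_of_masks[np.arange(len(candidates)), candidates] = False
        batch = np.where(batch_of_masks, current_input, fully_masked_input)
        scores = f(batch)
        best = int(np.argmax(scores))  # on ties, the first position wins
        mask[candidates[best]] = False
        ordering.append(int(candidates[best]))
        values.append(float(scores[best]))
    return ordering, values


def find_sis_sketch(f, threshold, current_input, current_mask,
                    fully_masked_input):
    """Returns SIS indices (most recently masked first), or None."""
    current_input = np.asarray(current_input, dtype=float)
    fully_masked_input = np.asarray(fully_masked_input, dtype=float)
    current_mask = np.asarray(current_mask, dtype=bool)
    start = np.where(current_mask, current_input, fully_masked_input)
    if not current_mask.any() or f(start[np.newaxis])[0] < threshold:
        return None
    ordering, _ = backselect_sketch(f, current_input, current_mask,
                                    fully_masked_input)
    sis_mask = np.zeros(current_mask.shape, dtype=bool)
    sis = []
    # Add positions back in reverse masking order until f meets the threshold.
    for idx in reversed(ordering):
        sis.append(idx)
        sis_mask[idx] = True
        x_sis = np.where(sis_mask, current_input, fully_masked_input)
        if f(x_sis[np.newaxis])[0] >= threshold:
            return sis
    return None


f_l2 = lambda batch: np.linalg.norm(batch, ord=2, axis=-1)
x = np.array([0.1, 10, 5])
no_mask = np.ones(3, dtype=bool)
ordering, values = backselect_sketch(f_l2, x, no_mask, np.zeros(3))
sis = find_sis_sketch(f_l2, 1.0, x, no_mask, np.zeros(3))
```

For the L_2 norm example from the module overview, backward selection masks index 0 first, then 2, then 1, so the SIS is rebuilt starting from index 1, which alone already meets the threshold of 1.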
- make_empty_boolean_mask(shape)
Creates empty boolean mask (no values are masked) given shape.
- Parameters
shape – A tuple of array dimensions (as in numpy.ndarray.shape).
- Returns
ndarray of given shape and boolean type, all values are True (not masked).
- make_empty_boolean_mask_broadcast_over_axis(shape, axis)
Creates empty boolean mask that is broadcastable over specified axes.
Usage example:
Given an input of shape (2, 3):
A broadcastable mask over columns (to mask entire columns at a time during the SIS procedure) has shape (1, 3) and is created using make_empty_boolean_mask_broadcast_over_axis((2, 3), 0).
A broadcastable mask over rows (to mask entire rows at a time during SIS) has shape (2, 1) and is created using make_empty_boolean_mask_broadcast_over_axis((2, 3), 1).
- Parameters
shape – Shape (a tuple of array dimensions, as in numpy.ndarray.shape) of the underlying input to be masked.
axis – An integer, or tuple of integers, specifying the axis (or axes) to broadcast over.
- Returns
- ndarray of boolean type (all values are True) and shape S, where S is the same as the provided shape, but with value 1 along each of the provided axes (see usage example above).
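The shapes in the usage example above can be reproduced with plain NumPy. The following is a sketch under assumptions: empty_mask_sketch and empty_mask_broadcast_sketch are hypothetical stand-ins for the two mask constructors, written only from their documented behavior, and the sketch handles non-negative axis values only.

```python
import numpy as np


def empty_mask_sketch(shape):
    """All-True boolean mask: no position is masked."""
    return np.ones(shape, dtype=bool)


def empty_mask_broadcast_sketch(shape, axis):
    """All-True mask with size 1 along each axis listed in `axis`."""
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    broadcast_shape = tuple(
        1 if i in axes else dim for i, dim in enumerate(shape))
    return np.ones(broadcast_shape, dtype=bool)


column_mask = empty_mask_broadcast_sketch((2, 3), 0)  # shape (1, 3)
row_mask = empty_mask_broadcast_sketch((2, 3), 1)     # shape (2, 1)

# Masking one entry of the (1, 3) mask hides a whole column of the input.
x = np.arange(6).reshape(2, 3) + 1  # [[1, 2, 3], [4, 5, 6]]
column_mask[0, 1] = False
masked = np.where(column_mask, x, np.zeros_like(x))  # [[1, 0, 3], [4, 0, 6]]
```

Because the (1, 3) mask broadcasts over the first axis of the (2, 3) input, setting a single mask entry to False masks both rows of that column at once.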
- produce_masked_inputs(input_to_mask, fully_masked_input, batch_of_masks)
Applies masks to an input to produce the corresponding masked inputs.
- Parameters
input_to_mask – Array of shape D to be masked. Note that D may be multi-dimensional.
fully_masked_input – The fully masked version of input_to_mask, also an array of shape D.
batch_of_masks – Array of shape (B x D), a batch of masks to apply to input_to_mask, where B is at least 1.
- Returns
- An array of masked inputs of shape (B x D), where each mask in batch_of_masks is applied to input_to_mask, and the masked values are taken from fully_masked_input.
- The order of masked inputs in the output corresponds to the order of masks in batch_of_masks.
- Raises
TypeError – if the shape of batch_of_masks does not have 1 more dimension than the shape of input_to_mask.
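The documented behavior can be approximated with np.where. This is a minimal sketch that assumes True in a mask means "keep the input value" (consistent with the empty-mask convention above); produce_masked_inputs_sketch is a hypothetical name and may differ from the module's own code.

```python
import numpy as np


def produce_masked_inputs_sketch(input_to_mask, fully_masked_input,
                                 batch_of_masks):
    """Keeps input values where a mask is True, masked values elsewhere."""
    input_to_mask = np.asarray(input_to_mask)
    batch_of_masks = np.asarray(batch_of_masks, dtype=bool)
    if batch_of_masks.ndim != input_to_mask.ndim + 1:
        raise TypeError('batch_of_masks must have one more dimension than '
                        'input_to_mask.')
    # Broadcasting applies each of the B masks to the single input.
    return np.where(batch_of_masks, input_to_mask, fully_masked_input)


x = np.array([0.1, 10.0, 5.0])
masked_value = np.zeros(3)
masks = np.array([[True, False, True],
                  [False, False, True]])
out = produce_masked_inputs_sketch(x, masked_value, masks)
# out has shape (2, 3), with rows [0.1, 0, 5] and [0, 0, 5].
```

The rows of out follow the order of the rows in masks, matching the ordering guarantee stated above.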
- sis_collection(f, threshold, initial_input, fully_masked_input, initial_mask=None)
Identifies the entire collection of SIS for an input.
Implements the SIScollection procedure in the SIS paper [1].
- Parameters
f – A function mapping an array of shape (B x D), containing a batch of B D-dimensional inputs to an array of scalar values with shape (B,).
threshold – A scalar, used as threshold in SIS procedure. Corresponds to tau in the SIS paper [1].
initial_input – Array of shape D (or type convertible to array) on which to apply the SIS procedure. D may be multi-dimensional.
fully_masked_input – Array (or type convertible to array) of shape D (same shape as initial_input), in which all positions hold their masked value.
initial_mask – Optional. Boolean array (or type convertible to array) of shape M to define how input is masked. Default value is None, in which case a mask is created with the same shape as initial_input. If the mask and input are not the same shape (M != D), the mask must be broadcastable over the input. This enables masking entire rows or columns at a time. For example, for an input of shape (2, 3), using a mask of shape (1, 3) will mask entire columns at the same time during backward selection, and a mask of shape (2, 1) will mask entire rows at a time. (See make_empty_boolean_mask_broadcast_over_axis, which can construct broadcastable masks.)
- Returns
- A list of SISResult objects, containing the entire SIS-collection for the initial_input. If no SIS exists (i.e. f(initial_input) < threshold), returns an empty list.
- Note that we follow the convention in the SIS paper [1], where a SIS only exists if f(initial_input) >= threshold. If f(initial_input) < threshold, but there exists a subset of features on which f(subset) >= threshold, we do not consider this a valid SIS.
- The order of SISResults in this list corresponds to the order of the SIS as they are found: the first element is the first SIS found, and so on. Earlier SIS are masked while finding later SIS, so all the SIS in the SIS-collection are disjoint (as in the SIS paper [1]).
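The outer loop described above (find a SIS, mask it, repeat until the prediction drops below the threshold) can be sketched for 1-D inputs as follows. This is a simplified illustration rather than this module's implementation: find_sis_sketch and sis_collection_sketch are hypothetical names, results are plain index lists instead of SISResult objects, and the initial_mask parameter is not supported.

```python
import numpy as np


def find_sis_sketch(f, threshold, x, mask, masked_x):
    """Simplified 1-D BackSelect + FindSIS; returns SIS indices or None."""
    start = np.where(mask, x, masked_x)
    if not mask.any() or f(start[np.newaxis])[0] < threshold:
        return None
    remaining, ordering = mask.copy(), []
    while remaining.any():
        candidates = np.flatnonzero(remaining)
        # One trial mask per remaining position, with that position masked.
        trial_masks = np.tile(remaining, (len(candidates), 1))
        trial_masks[np.arange(len(candidates)), candidates] = False
        scores = f(np.where(trial_masks, x, masked_x))
        chosen = candidates[int(np.argmax(scores))]
        remaining[chosen] = False
        ordering.append(int(chosen))
    sis_mask, sis = np.zeros(mask.shape, dtype=bool), []
    # Rebuild the SIS in reverse masking order until f meets the threshold.
    for idx in reversed(ordering):
        sis.append(idx)
        sis_mask[idx] = True
        if f(np.where(sis_mask, x, masked_x)[np.newaxis])[0] >= threshold:
            return sis
    return None


def sis_collection_sketch(f, threshold, initial_input, fully_masked_input):
    """Repeatedly extracts a SIS, then masks it before looking for the next."""
    x = np.asarray(initial_input, dtype=float)
    masked_x = np.asarray(fully_masked_input, dtype=float)
    mask = np.ones(x.shape, dtype=bool)  # True = not masked
    collection = []
    while True:
        sis = find_sis_sketch(f, threshold, x, mask, masked_x)
        if sis is None:
            return collection
        collection.append(sis)
        mask[sis] = False  # earlier SIS stay masked, so all SIS are disjoint


f_l2 = lambda batch: np.linalg.norm(batch, ord=2, axis=-1)
collection = sis_collection_sketch(
    f_l2, 1.0, np.array([0.1, 10, 5]), np.array([0, 0, 0]))
```

On the L_2 norm example from the module overview, this sketch finds the index sets [1] and then [2], and stops once only index 0 remains unmasked, since f([0.1, 0, 0]) is below the threshold.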