Algorithm Interfaces¶

class smqtk.algorithms.SmqtkAlgorithm[source]¶

Parent class for all algorithm interfaces.

property name¶

Returns: The name of this class type.
Return type: str

Here we list and briefly describe the high level algorithm interfaces which SMQTK provides. There is at least one implementation available for each interface. Some implementations will require additional dependencies that cannot be packaged with SMQTK.

Classifier¶

This interface represents algorithms that classify DescriptorElement instances into discrete labels or label confidences.

class smqtk.algorithms.classifier.Classifier[source]¶

Interface for algorithms that classify input descriptors into discrete labels and/or label confidences.

static _assert_array_dim_consistency(array_iter)[source]¶

Assert that arrays are consistent in dimensionality across iterated arrays.

Currently we only support iterating single dimension vectors. Arrays of more than one dimension (e.g. 2D matrices) will trigger a ValueError.

Includes a short-cut where, if the input is a non-object 2D ndarray, dimensionality must already be consistent, so the ndarray (which is an Iterable) is just returned. Otherwise, we return a generator that checks the dimensionality of the input iterable during iteration.

Parameters

array_iter (collections.abc.Iterable[numpy.ndarray]) – Iterable of numpy arrays.

Raises

AttributeError – Individual arrays are not numpy.ndarray-like.
ValueError – Not all input arrays were of consistent dimensionality.

Returns

Iterable of the same arrays in the same order, but validated to be of common dimensionality.
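The lazy validation described above can be sketched roughly as follows. This is not SMQTK's implementation: the helper name check_consistent_length is hypothetical, and plain Python lists stand in for one-dimensional numpy arrays so that the example has no dependencies.

```python
def check_consistent_length(vectors):
    """Yield each vector unchanged, raising ValueError on a length mismatch.

    Hypothetical helper: plain sequences stand in for 1D numpy arrays.
    """
    expected = None
    for vec in vectors:
        if expected is None:
            expected = len(vec)  # the first vector fixes the expected length
        elif len(vec) != expected:
            raise ValueError(
                "Inconsistent dimensionality: expected %d, got %d"
                % (expected, len(vec))
            )
        yield vec


good = list(check_consistent_length([[0.1, 0.2], [0.3, 0.4]]))

# Validation is lazy: the error only surfaces once the bad vector is reached.
lazy = check_consistent_length([[1.0, 2.0], [3.0]])
first = next(lazy)
try:
    next(lazy)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Because the check happens during iteration, a caller that never consumes the iterable never sees the ValueError.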

abstract _classify_arrays(array_iter)[source]¶

Overridable method for classifying an iterable of descriptor vectors (numpy arrays).

At this level, all input arrays are guaranteed to be of consistent dimensionality.

Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.

Parameters: array_iter (collections.abc.Iterable[numpy.ndarray]) – Iterable of arrays to be classified.
Returns: Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
Return type: collections.abc.Iterable[dict[collections.abc.Hashable, float]]

classify_arrays(array_iter)[source]¶

Classify an input iterable of numpy arrays into a parallel iterable of label-to-confidence mappings (dictionaries).

Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.

Parameters: array_iter (collections.abc.Iterable[numpy.ndarray]) – Iterable of descriptor vectors, as numpy arrays, to be classified.
Raises: ValueError – Input arrays were not all of consistent dimensionality.
Returns: Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
Return type: collections.abc.Iterable[dict[collections.abc.Hashable, float]]
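To make the two output styles concrete, here is a small sketch of what a discrete and a continuous label-to-confidence mapping might look like. The function names are hypothetical, and normalising the raw scores so that they sum to one is only one possible choice: the interface asks only that each value lie in the [0, 1] range.

```python
def discrete_mapping(labels, winner):
    """Mark exactly one label with 1.0 and every other label with 0.0."""
    return {label: (1.0 if label == winner else 0.0) for label in labels}


def continuous_mapping(raw_scores):
    """Scale non-negative raw scores so that each value lies in [0, 1].

    Assumption: dividing by the total is just one way to get such values.
    """
    total = sum(raw_scores.values())
    if total == 0:
        return {label: 0.0 for label in raw_scores}
    return {label: value / total for label, value in raw_scores.items()}


labels = ["cat", "dog", "background"]
discrete = discrete_mapping(labels, "dog")
continuous = continuous_mapping({"cat": 1.0, "dog": 3.0, "background": 0.0})
```

In both cases every label the model knows about appears as a key, which is what allows downstream code to compare mappings label by label.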

classify_elements(descr_iter, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False, d_elem_batch=100)[source]¶

Classify an input iterable of descriptor elements into a parallel iterable of classification elements.

Classification element UIDs are inherited from the descriptor element it was generated from.

We invoke classify_arrays for the actual generation of classification results; see the documentation for that method for further details. It is invoked for factory-generated classification elements that do not yet have classifications stored, or for all input descriptor elements if the overwrite flag is True.

Selective Iteration. For situations when it is desired to access specific generator returns, like when only one descriptor element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(c.classify_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(c.classify_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.

Non-redundant Processing. Certain classification element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some classification elements may already “have” classification results on construction. This method, by default, only computes new classification results for descriptor elements whose associated classification element does not report as already containing results. If the overwrite flag is True then classifications are computed for all input descriptor elements and results are set to their respective classification elements regardless of existing result storage.

Parameters

descr_iter (collections.abc.Iterable[DescriptorElement]) – Iterable of DescriptorElement instances to be classified.
factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
d_elem_batch (int) – The number of descriptor elements to collect before requesting the whole batch’s vectors at once via DescriptorElement.get_many_vectors method.

Raises

ValueError – Either: (A) one or more input descriptor elements did not have a stored vector, or (B) input descriptor element arrays were not all of consistent dimensionality.
IndexError – Implementation of _classify_arrays either under or over produced classifications relative to the number of input descriptor vectors.

Returns

Iterator of result ClassificationElement instances. UUIDs of generated ClassificationElement instances will reflect the UUID of the DescriptorElement it was computed from.

Return type

collections.abc.Iterator[smqtk.representation.ClassificationElement]
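The non-redundant processing behaviour and the effect of the overwrite flag can be illustrated with a dependency-free sketch. Everything here is a stand-in: a dict plays the role of persistent classification storage, UIDs play the role of descriptor elements, and classify_missing and toy_classify are hypothetical names rather than SMQTK functions.

```python
def classify_missing(uids, store, classify, overwrite=False):
    """Yield (uid, result) pairs in input order, computing only when needed.

    ``store`` is a dict standing in for persistent classification storage and
    ``classify`` is any callable mapping a UID to a label-confidence dict.
    Both are hypothetical stand-ins, not SMQTK types.
    """
    for uid in uids:
        if overwrite or uid not in store:
            store[uid] = classify(uid)
        yield uid, store[uid]


calls = []


def toy_classify(uid):
    calls.append(uid)  # record which UIDs were actually (re)computed
    return {"positive": 1.0, "negative": 0.0}


store = {"a": {"positive": 0.0, "negative": 1.0}}  # "a" already has a result

default_run = list(classify_missing(["a", "b"], store, toy_classify))
calls_after_default = list(calls)

overwrite_run = list(
    classify_missing(["a", "b"], store, toy_classify, overwrite=True)
)
```

With the default flag only "b" is computed; with overwrite=True both inputs are recomputed and the stored result for "a" is replaced.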

classify_one_element(descr_elem, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False)[source]¶

Convenience method around classify_elements for the single-input case.

See documentation for the Classifier.classify_elements() method for more information.

Parameters

descr_elem (DescriptorElement) – DescriptorElement instance to be classified.
factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.

Raises

ValueError – The input descriptor element did not have a stored vector.
IndexError – Implementation of _classify_arrays either under or over produced classifications relative to the number of input descriptor vectors.

Returns

Result ClassificationElement instance. The UUID of the generated ClassificationElement instance will reflect the UUID of the DescriptorElement it was computed from.

Return type

smqtk.representation.ClassificationElement

abstract get_labels()[source]¶

Get the sequence of class labels that this classifier can classify descriptors into. This includes the negative or background label if the classifier embodies such a concept.

Returns: Sequence of possible classifier labels.
Return type: collections.abc.Sequence[collections.abc.Hashable]
Raises: RuntimeError – No model loaded.

DescriptorGenerator¶

This interface represents algorithms that generate whole-content descriptor vectors for one or more given input DataElement instances. The input DataElement instances must be of a content type that the DescriptorGenerator supports, referenced against the valid_content_types() method (required by the ContentTypeValidator mixin class).

The DescriptorGenerator.generate_elements() method also requires a DescriptorElementFactory instance to tell the algorithm how to generate the DescriptorElement instances it should return. The returned DescriptorElement instances will have a type equal to the name of the DescriptorGenerator class that generated it, and a UUID that is the same as the input DataElement instance.

If a DescriptorElement implementation that supports persistent storage is generated, and there is already a descriptor associated with the given type name and UUID values, the descriptor is returned without re-computation.

If the overwrite parameter is True, the DescriptorGenerator instance will re-compute a descriptor for the input DataElement, setting it to the generated DescriptorElement. This will overwrite descriptor data in persistent storage if the DescriptorElement type used supports it.

class smqtk.algorithms.descriptor_generator.DescriptorGenerator[source]¶

Base abstract Feature Descriptor interface.

generate_arrays(data_iter)[source]¶

Generate descriptor vectors (numpy arrays) for all input data elements.

Descriptor arrays yielded out will be parallel in association with the data elements input.

Selective Iteration. For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single array out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_arrays([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_arrays([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators.

Parameters

data_iter (collections.abc.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.

Raises

RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.

Returns

Iterator of result numpy.ndarray instances.

Return type

collections.abc.Iterator[numpy.ndarray]
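The reason behind the Selective Iteration advice is ordinary Python generator behaviour, which the following sketch demonstrates without any SMQTK code. The generate function is a hypothetical stand-in for a generator method that performs clean-up after its final yield.

```python
def generate(items, log):
    """Yield doubled items, then perform clean-up after the final yield."""
    for item in items:
        yield item * 2
    log.append("cleanup")  # only reached when the generator is exhausted


# Expanding into a list runs the generator to completion.
full_log = []
first_via_list = list(generate([21], full_log))[0]

# Taking only the "next" value leaves the generator suspended at its yield.
partial_log = []
gen = generate([21], partial_log)
first_via_next = next(gen)
```

Both approaches return the same value, but only the list expansion lets the code after the final yield run.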

generate_elements(data_iter, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)[source]¶

Generate DescriptorElement instances for the input data elements, generating new descriptors for those elements that need them, or optionally all input data elements.

Descriptor elements yielded out will be parallel in association with the data elements input. Descriptor element UUIDs are inherited from the data element it was generated from.

Selective Iteration. For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.

Non-redundant Processing. Certain descriptor element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some descriptor elements may already “have” a vector on construction. This method, by default, only computes new descriptor vectors for data elements whose associated descriptor element does not report as already containing a vector. If the overwrite flag is True then descriptors are computed for all input data elements and are set to their respective descriptor elements regardless of existing vector storage.

Parameters

data_iter (collections.abc.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.
descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate descriptors for all input data elements, overwriting the vectors previously stored in the factory-produced descriptor elements.

Raises

RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
IndexError – Underlying vector-producing generator either under or over produced vectors.

Returns

Iterator of result DescriptorElement instances. UUIDs of generated DescriptorElement instances will reflect the UUID of the DataElement it was generated from.

Return type

collections.abc.Iterator[smqtk.representation.DescriptorElement]
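The IndexError condition above concerns keeping outputs parallel to inputs. A minimal sketch of such a check, assuming a hypothetical helper named pair_strictly and using strings and lists in place of data elements and vectors, could look like this:

```python
def pair_strictly(inputs, produced):
    """Pair inputs with produced results, raising IndexError on a count mismatch."""
    produced_iter = iter(produced)
    pairs = []
    for item in inputs:
        try:
            pairs.append((item, next(produced_iter)))
        except StopIteration:
            raise IndexError("Under-production: fewer results than inputs") from None
    # Any leftover result means the producer yielded too many items.
    sentinel = object()
    if next(produced_iter, sentinel) is not sentinel:
        raise IndexError("Over-production: more results than inputs")
    return pairs


def raises_index_error(inputs, produced):
    """Small helper for the demonstration below."""
    try:
        pair_strictly(inputs, produced)
    except IndexError:
        return True
    return False


ok = pair_strictly(["d0", "d1"], [[0.1], [0.2]])
under = raises_index_error(["d0", "d1"], [[0.1]])
over = raises_index_error(["d0"], [[0.1], [0.2]])
```

A plain zip would silently truncate to the shorter side, which is why an explicit check of this kind is useful when results must stay aligned with their inputs.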

generate_one_array(data_elem)[source]¶

Convenience wrapper around generate_arrays for the single-input case.

See the documentation for the DescriptorGenerator.generate_arrays() method for more information.

Parameters

data_elem (smqtk.representation.DataElement) – DataElement instance to be described.

Raises

RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.

Returns

Descriptor vector for the given data as a numpy.ndarray instance.

Return type

numpy.ndarray

generate_one_element(data_elem, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)[source]¶

Convenience wrapper around generate_elements for the single-input case.

See the documentation for the DescriptorGenerator.generate_elements() method for more information.

Parameters

data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate a descriptor for the input data element, overwriting the vector previously stored in the factory-produced descriptor element.

Raises

IndexError – Underlying vector-producing generator either under or over produced vectors.
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.

Returns

Result DescriptorElement instance. UUID of the generated DescriptorElement instance will reflect the UUID of the DataElement it was generated from.

Return type

smqtk.representation.DescriptorElement

ImageReader¶

class smqtk.algorithms.image_io.ImageReader[source]¶

Interface for algorithms that load a raster image matrix from a data element.

is_valid_element(data_element)[source]¶

Check if the given DataElement instance reports a content type that matches one of the MIME types reported by valid_content_types.

This override checks if the DataElement has the matrix property, as the MatrixDataElement would provide, and that its value is of an expected type.

Parameters: data_element (smqtk.representation.DataElement) – Data element instance to check.
Returns: True if the given element has a valid content type as reported by valid_content_types, and False if not.
Return type: bool

load_as_matrix(data_element, pixel_crop=None)[source]¶

Load an image matrix from the given data element.

Matrix Property Shortcut. If the given DataElement instance defines a matrix property this method simply returns that. This is intended to interface with instances of smqtk.representation.data_element.matrix.MatrixDataElement.

Loading From Bytes. When not loading from a short-cut matrix, the matrix return format is ImageReader implementation dependent. Implementations of this interface should specify and describe their return type.

Aside from the exceptions documented below, other exceptions may be raised when an image fails to load that are implementation dependent.

Parameters

data_element (smqtk.representation.DataElement) – DataElement to load image data from.
pixel_crop (None|smqtk.representation.AxisAlignedBoundingBox) – Optional bounding box specifying a pixel sub-region to load from the given data. If this is provided it must represent a valid sub-region within the loaded image, otherwise a RuntimeError is raised. Handling of non-integer aligned boxes is implementation dependent.

Raises

RuntimeError – A crop region was specified but did not specify a valid sub-region of the image.
AssertionError – The data_element provided defined a matrix attribute/property, but its access did not result in an expected value.
ValueError –
This error is raised when:
- The given data_element was not of a valid content type.
- A pixel_crop bounding box was provided but was zero volume.
- pixel_crop bounding box vertices are not fully represented by integers.

Returns

Numpy ndarray of the image data. Specific return format is implementation dependent.

Return type

numpy.ndarray
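The pixel_crop conditions documented above can be sketched as a stand-alone validation function. This is only an illustration under stated assumptions: validate_crop is a hypothetical helper, the box is passed as two (x, y) corner pairs rather than an AxisAlignedBoundingBox, and the maximum corner is treated as exclusive. Real implementations may order or interpret these checks differently.

```python
def validate_crop(min_xy, max_xy, width, height):
    """Check a crop box, given as (x, y) corner pairs, against an image size.

    Hypothetical helper mirroring the documented error conditions.
    """
    coords = tuple(min_xy) + tuple(max_xy)
    if any(c != int(c) for c in coords):
        raise ValueError("pixel_crop vertices are not fully represented by integers")
    x0, y0, x1, y1 = (int(c) for c in coords)
    if x1 == x0 or y1 == y0:
        raise ValueError("pixel_crop bounding box has zero volume")
    # Assumption: the max corner is exclusive, so it may equal the image size.
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise RuntimeError("pixel_crop is not a valid sub-region of the image")
    return x0, y0, x1, y1


def raises(exc_type, *args):
    """Return True if validate_crop(*args) raises ``exc_type``."""
    try:
        validate_crop(*args)
    except exc_type:
        return True
    return False


valid = validate_crop((0, 0), (4, 4), 10, 10)
```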

class smqtk.algorithms.image_io.pil_io.PilImageReader(explicit_mode=None)[source]¶

Image reader that uses PIL to load the image.

This implementation may additionally raise an IOError when failing to load an image.

get_config()[source]¶

Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.

In most cases, this involves naming the keys of the dictionary based on the initialization argument names, as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.

Returns: JSON type compliant configuration dictionary.
Return type: dict
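The round trip between get_config and from_config can be sketched with a toy class. ToyReader is a hypothetical stand-in, not PilImageReader, and its from_config is deliberately simplified; the real class-method may accept further arguments.

```python
import json


class ToyReader:
    """Hypothetical stand-in for a configurable algorithm."""

    def __init__(self, explicit_mode=None):
        self.explicit_mode = explicit_mode

    def get_config(self):
        # Keys mirror the constructor argument names.
        return {"explicit_mode": self.explicit_mode}

    @classmethod
    def from_config(cls, config):
        # Dictionary expansion into the constructor.
        return cls(**config)


original = ToyReader(explicit_mode="RGB")
config_text = json.dumps(original.get_config())  # survives JSON serialization
clone = ToyReader.from_config(json.loads(config_text))
```

Keeping the configuration JSON-compliant is what allows it to be written to a file and used later to rebuild an equivalent instance.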

classmethod is_usable()[source]¶

Check whether this class is available for use.

Since certain plugin implementations may require additional dependencies that may not yet be available on the system, or other runtime conditions, this method may be overridden to check for those and return a boolean saying if the implementation is available for use. When this method returns True, the class is declaring that it should be constructable and usable in the current environment.

By default, this method will return True unless a sub-class overrides this class-method with their specific logic.

NOTES:

- This should be a class method.
- When an implementation is deemed not usable, this should emit a (user) warning, or some other kind of logging, detailing why the implementation is not available for use.

Returns: Boolean determination of whether this implementation is usable in the current environment.
Return type: bool
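One way an implementation might follow the notes above is to probe for an optional module and warn when it is missing. The classes and the REQUIRED_MODULE attribute below are hypothetical and exist only for this sketch.

```python
import importlib.util
import warnings


class OptionalDependencyPlugin:
    """Hypothetical plugin that needs an optional module to function."""

    REQUIRED_MODULE = "json"  # a stdlib module, so this class is usable

    @classmethod
    def is_usable(cls):
        if importlib.util.find_spec(cls.REQUIRED_MODULE) is None:
            # Explain why the implementation is unavailable.
            warnings.warn(
                "%s is not usable: module '%s' not found"
                % (cls.__name__, cls.REQUIRED_MODULE)
            )
            return False
        return True


class MissingDependencyPlugin(OptionalDependencyPlugin):
    REQUIRED_MODULE = "surely_not_an_installed_module_xyz"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usable = OptionalDependencyPlugin.is_usable()
    unusable = MissingDependencyPlugin.is_usable()
```

Checking with find_spec avoids importing the dependency just to test for its presence.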

valid_content_types()[source]¶

Returns: A set of MIME types that are “valid” within the implementing class’ context.
Return type: set[str]

HashIndex¶

This interface describes specialized NearestNeighborsIndex implementations designed to index hash codes (bit vectors) via the hamming distance function. Implementations of this interface are primarily used with the LSHNearestNeighborIndex implementation.

Unlike the NearestNeighborsIndex interface from which this interface descends, HashIndex instances are built with an iterable of numpy.ndarray and nn returns a numpy.ndarray.

class smqtk.algorithms.nn_index.hash_index.HashIndex[source]¶

Specialized NearestNeighborsIndex for indexing unique hash codes (bit-vectors) in memory (numpy arrays) using the hamming distance metric.

Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.

Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.

build_index(hashes)[source]¶

Build the index with the given hash codes (bit-vectors).

Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception so as to protect the current index.

Raises: ValueError – No data available in the given iterable.
Parameters: hashes (collections.abc.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to build the index over.

abstract count()[source]¶

Returns: Number of elements in this index.
Return type: int

nn(h, n=1)[source]¶

Return the nearest N neighbor hash codes as bit-vectors to the given hash code bit-vector.

Distances are in the range [0, 1] and give the fraction of bits by which each neighbor hash differs from the query, relative to the number of bits in the query (normalized hamming distance).

Raises

ValueError – Current index is empty.

Parameters

h (numpy.ndarray[bool]) – Hash code to compute the neighbors of. Should be the same bit length as indexed hash codes.
n (int) – Number of nearest neighbors to find.

Returns

Tuple of nearest N hash codes and a tuple of the distance values to those neighbors.

Return type

(tuple[numpy.ndarray[bool]], tuple[float])
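As a rough illustration of the normalized hamming distance and the shape of the nn return value, here is a brute-force sketch. It is not one of SMQTK's HashIndex implementations: the function names are hypothetical, and tuples of booleans stand in for numpy boolean arrays.

```python
def normalized_hamming(a, b):
    """Fraction of differing bits between two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("Bit vectors must be the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_hashes(index, query, n=1):
    """Return (neighbors, distances) for the ``n`` closest indexed hashes."""
    if not index:
        raise ValueError("Current index is empty")
    ranked = sorted(index, key=lambda h: normalized_hamming(h, query))[:n]
    return tuple(ranked), tuple(normalized_hamming(h, query) for h in ranked)


# A set keeps the indexed hash codes unique.
index = {
    (True, False, True, True),
    (False, False, False, False),
    (False, True, False, True),
}
neighbors, distances = nearest_hashes(index, (True, False, True, False), n=2)
```

The query differs from the first neighbor in one of four bits (0.25) and from the second in two of four bits (0.5).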

remove_from_index(hashes)[source]¶

Partially remove hashes from this index.

Parameters

hashes (collections.abc.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to remove from this index.

Raises

ValueError – No data available in the given iterable.
KeyError – One or more hashes provided do not match any stored hashes.

update_index(hashes)[source]¶

Additively update the current index with the one or more hash vectors given.

If no index exists yet, a new one should be created using the given hash vectors.

Raises: ValueError – No data available in the given iterable.
Parameters: hashes (collections.abc.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to add to this index.

LshFunctor¶

Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement. These are used in LSHNearestNeighborIndex instances.

class smqtk.algorithms.nn_index.lsh.functors.LshFunctor[source]¶

Locality-sensitive hashing functor interface.

The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.

Building Models

Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.

abstract get_hash(descriptor)[source]¶

Get the locality-sensitive hash code for the input descriptor.

Parameters: descriptor (numpy.ndarray[float]) – Descriptor vector we should generate the hash of.
Returns: Generated bit-vector as a numpy array of booleans.
Return type: numpy.ndarray[bool]
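To give a feel for what a get_hash implementation does, the sketch below uses random-hyperplane (sign) hashing, a widely known LSH family. It is chosen purely as a familiar example and is not claimed to be one of SMQTK's functors. The factory function name is hypothetical, and plain lists and tuples stand in for numpy arrays.

```python
import random


def make_hyperplane_hasher(dim, n_bits, seed=0):
    """Build a hash function from ``n_bits`` random hyperplanes.

    Illustrative only: not claimed to match any SMQTK functor.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def get_hash(descriptor):
        if len(descriptor) != dim:
            raise ValueError("Expected a descriptor of length %d" % dim)
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * x for p, x in zip(plane, descriptor)) >= 0.0
            for plane in planes
        )

    return get_hash


get_hash = make_hyperplane_hasher(dim=3, n_bits=8, seed=42)
code = get_hash([0.2, -1.0, 0.5])
scaled_code = get_hash([0.4, -2.0, 1.0])  # same direction, twice the length
```

Vectors pointing in the same direction land on the same side of every hyperplane, so they receive identical codes; vectors at a small angle to each other differ in only a few bits with high probability.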

NearestNeighborsIndex¶

This interface defines a method to build an index from a set of DescriptorElement instances (NearestNeighborsIndex.build_index) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement (NearestNeighborsIndex.nn).

Building an index requires that some non-zero number of DescriptorElement instances be passed into the build_index method. Subsequent calls to this method should rebuild the index model, not add to it. If an implementation supports persistent storage of the index, it should overwrite the configured index.

The nn method uses a single DescriptorElement to query the current index for a specified number of nearest neighbors. Thus, the NearestNeighborsIndex instance must have a non-empty index loaded for this method to function. If the provided query DescriptorElement does not have a set vector, this method will also fail with an exception.

This interface additionally requires that implementations define a count method, which returns the number of distinct DescriptorElement instances that are in the index.

class smqtk.algorithms.nn_index.NearestNeighborsIndex[source]¶

Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.

Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written when build_index is called.

Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.

build_index(descriptors)[source]¶

Build the index with the given descriptor data elements.

Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception so as to protect the current index.

Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.

abstract count()[source]¶

Returns: Number of elements in this index.
Return type: int

nn(d, n=1)[source]¶

Return the nearest N neighbors to the given descriptor element.

Raises

ValueError – Input query descriptor d has no vector set.
ValueError – Current index is empty.

Parameters

d (smqtk.representation.DescriptorElement) – Descriptor element to compute the neighbors of.
n (int) – Number of nearest neighbors to find.

Returns

Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors.

Return type

(tuple[smqtk.representation.DescriptorElement], tuple[float])
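The difference between rebuilding and updating, and the paired return value of nn, can be sketched with a brute-force toy index. BruteForceIndex is hypothetical: it maps UIDs to plain tuples and returns UIDs instead of DescriptorElement instances, and Euclidean distance is an arbitrary choice for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class BruteForceIndex:
    """Toy in-memory index mapping UIDs to vectors (hypothetical class)."""

    def __init__(self):
        self._vectors = {}

    def build_index(self, items):
        items = dict(items)
        if not items:
            raise ValueError("No data available in the given iterable")
        self._vectors = items  # rebuild: replaces any previous contents

    def update_index(self, items):
        items = dict(items)
        if not items:
            raise ValueError("No data available in the given iterable")
        self._vectors.update(items)  # additive

    def count(self):
        return len(self._vectors)

    def nn(self, query, n=1):
        if not self._vectors:
            raise ValueError("Current index is empty")
        ranked = sorted(
            (euclidean(vec, query), uid) for uid, vec in self._vectors.items()
        )[:n]
        return tuple(uid for _, uid in ranked), tuple(d for d, _ in ranked)


index = BruteForceIndex()
index.build_index({"a": (0.0, 0.0), "b": (3.0, 4.0)})
index.update_index({"c": (1.0, 0.0)})
count_after_update = index.count()
uids, dists = index.nn((0.0, 0.0), n=2)

index.build_index({"z": (9.0, 9.0)})  # discards "a", "b" and "c"
count_after_rebuild = index.count()
```

This sketch omits the thread-safety the interface asks of real implementations.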

remove_from_index(uids)[source]¶

Partially remove descriptors from this index associated with the given UIDs.

Parameters

uids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of UIDs of descriptors to remove from this index.

Raises

ValueError – No data available in the given iterable.
KeyError – One or more UIDs provided do not match any stored descriptors. The index should not be modified.

update_index(descriptors)[source]¶

Additively update the current index with the one or more descriptor elements given.

If no index exists yet, a new one should be created using the given descriptors.

Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to add to this index.

ObjectDetector¶

This interface defines a method to generate object detections (DetectionElement) over a given DataElement.

class smqtk.algorithms.object_detection.ObjectDetector[source]¶

Abstract interface to an object detection algorithm.

An object detection algorithm is one that can take in data and output zero or more detection elements, where each detection represents a spatial region in the data.

This high level interface only requires detection element returns (spatial bounding-boxes with associated classification elements).

detect_objects(data_element, de_factory=<smqtk.representation.detection_element_factory.DetectionElementFactory object>, ce_factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>)[source]¶

Detect objects in the given data.

UUIDs of detections are based on the hash produced from the combination of:

- Detection bounding-box coordinates.
- Classification label set predicted for the bounding box.

Parameters

data_element (smqtk.representation.DataElement) – Source data from which to detect objects within.
de_factory (smqtk.representation.DetectionElementFactory) – Factory for generating DetectionElement instances. The default factory yields MemoryDetectionElement instances.
ce_factory (smqtk.representation.ClassificationElementFactory) – Factory for generating ClassificationElement instances for detections. The default factory yields MemoryClassificationElement instances.

Raises

ValueError – Given data element content was not of a valid content type that this class reports as valid for object detection.

Returns

Iterator over result DetectionElement instances as generated by the given DetectionElementFactory, containing classification elements as generated by the given ClassificationElementFactory.

Return type

collections.abc.Iterable[smqtk.representation.DetectionElement]
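The documentation above says that detection UUIDs derive from a hash of the bounding-box coordinates and the predicted label set, but does not state the hashing scheme. The sketch below shows one possible way to obtain a deterministic identifier from those two inputs; the detection_uuid function and the use of SHA-1 over a repr string are assumptions made for illustration only.

```python
import hashlib


def detection_uuid(bbox_coords, labels):
    """Derive a deterministic identifier from box coordinates and labels.

    Assumed scheme for illustration; the actual one is not documented here.
    """
    # Sort labels so that their order does not affect the result.
    payload = repr((tuple(bbox_coords), tuple(sorted(labels))))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


uid_a = detection_uuid((10, 20, 110, 220), {"person", "dog"})
uid_b = detection_uuid((10, 20, 110, 220), {"dog", "person"})
uid_c = detection_uuid((10, 20, 111, 220), {"dog", "person"})
```

The same box and label set always yield the same identifier, while changing a single coordinate yields a different one.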

RankRelevancy¶

This interface defines one method: rank. The rank method takes examples of relevant and not-relevant example descriptor vectors as numpy.ndarray sequences and uses them to compute relevancy scores (on a [0, 1] scale) on a provided pool of other descriptor vectors.

class smqtk.algorithms.rank_relevancy.RankRelevancy[source]¶

Algorithm that can rank a given pool of descriptors based on positively and negatively adjudicated descriptors.

abstract rank(pos: Sequence[numpy.ndarray], neg: Sequence[numpy.ndarray], pool: Sequence[numpy.ndarray]) → Sequence[float][source]¶

Assign a relevancy score to each input descriptor in pool based on the positively and negatively adjudicated descriptors in pos and neg respectively.

Parameters

pos – Sequence of positively adjudicated descriptor vectors.
neg – Sequence of negatively adjudicated descriptor vectors.
pool – A sequence of descriptor vectors that we want to rank by topical relevancy relative to the given positive and negative examples.

Returns

An ordered sequence of float values denoting the relevancy of pool elements, parallel in association with pool.
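A toy rank function helps show the expected inputs and outputs. The scoring rule here (relative distance to the centroids of the positive and negative examples) is an assumption chosen for brevity and is not how any particular SMQTK implementation works; plain tuples stand in for numpy arrays.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty sequence of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank(pos, neg, pool):
    """Score each pool vector in [0, 1]; higher means closer to the positives."""
    pos_c, neg_c = centroid(pos), centroid(neg)
    scores = []
    for vec in pool:
        d_pos, d_neg = euclidean(vec, pos_c), euclidean(vec, neg_c)
        total = d_pos + d_neg
        # Equidistant vectors (including both distances zero) map to 0.5.
        scores.append(0.5 if total == 0 else d_neg / total)
    return scores


scores = rank(
    pos=[(1.0, 1.0), (1.0, 3.0)],
    neg=[(-1.0, -2.0)],
    pool=[(1.0, 2.0), (-1.0, -2.0), (0.0, 0.0)],
)
```

The returned scores are in the same order as pool: the first vector coincides with the positive centroid, the second with the negative example, and the third is equidistant from both.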

RankRelevancyWithFeedback¶

This interface defines one method: rank_with_feedback. Like RankRelevancy.rank(), rank_with_feedback takes examples of relevant and not-relevant example descriptor vectors as numpy.ndarray sequences and uses them to compute relevancy scores (on a [0, 1] scale) on a provided pool of other descriptor vectors. However, it also expects a sequence of corresponding UIDs for the pool vectors and additionally returns a sequence of UIDs, possibly not all from the pool, on which feedback would be most useful.

class smqtk.algorithms.rank_relevancy.RankRelevancyWithFeedback[source]¶

Similar to the RankRelevancy algorithm but with the added feature of also returning a sequence of elements from which feedback would be “most useful”.

What “most useful” means may be flexible but generally refers to the goal of reducing the number of adjudications required in order to separate true-positive examples from true-negative examples in provided pools via the assigned relevancy scores. E.g. other elements may be adjudicated in some quantity to achieve some level of relevant sample separation, but if the feedback requests are instead adjudicated, fewer elements may need to be adjudicated to achieve an equivalent level of separation.

Feedback requests ought to be returned in a form that lets the user convey the right information to the adjudicating agent so that adjudications can actually be performed. Additionally, we want to be able to request feedback from elements that may not be present in the given pool of descriptors.

Towards that end, this algorithm should be given a sequence of UIDs for the given pool of descriptors. This allows the implementation to potentially coordinate with an outside source of descriptor references such that the returned feedback requests may be interpreted uniformly.
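The interface leaves the meaning of “most useful” open. One common reading, assumed here purely for illustration, is uncertainty sampling: request feedback on the elements whose scores sit closest to 0.5. The helper name most_uncertain_uids and its parameter n are hypothetical and do not come from SMQTK.

```python
from collections.abc import Hashable
from typing import List, Sequence


def most_uncertain_uids(
    scores: Sequence[float],
    pool_uids: Sequence[Hashable],
    n: int = 3,
) -> List[Hashable]:
    """Return up to n UIDs whose scores are nearest 0.5, most ambiguous first."""
    if len(scores) != len(pool_uids):
        raise ValueError("scores and pool_uids must be of equal length")
    # Order pool positions by how far each score is from the 0.5 midpoint.
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return [pool_uids[i] for i in order[:n]]


requested = most_uncertain_uids([0.9, 0.55, 0.1, 0.4], ["a", "b", "c", "d"], n=2)
```

This variant only ever returns in-pool UIDs. An implementation that coordinates with an outside source of descriptors, as described above, could also return UIDs that are not in pool_uids.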

abstract _rank_with_feedback(pos: Sequence[numpy.ndarray], neg: Sequence[numpy.ndarray], pool: Sequence[numpy.ndarray], pool_uids: Sequence[collections.abc.Hashable]) → Tuple[Sequence[float], Sequence[collections.abc.Hashable]][source]¶

Implement rank_with_feedback(). pool and pool_uids have already been checked to be of equal length.

See also

The doc-string of rank_with_feedback() for the meanings of the parameters and return values.

rank_with_feedback(pos: Sequence[numpy.ndarray], neg: Sequence[numpy.ndarray], pool: Sequence[numpy.ndarray], pool_uids: Sequence[collections.abc.Hashable]) → Tuple[Sequence[float], Sequence[collections.abc.Hashable]][source]¶

Assign a relevancy score to each input descriptor in pool based on the positively and negatively adjudicated descriptors in pos and neg respectively, additionally returning a sequence of UIDs of those descriptors for which adjudication feedback would be “most useful”.

Parameters

pos – Sequence of positively adjudicated descriptor vectors.
neg – Sequence of negatively adjudicated descriptor vectors.
pool – A sequence of descriptor vectors that we want to rank by topical relevancy relative to the given positive and negative examples.
pool_uids – A sequence of hashable UID values, parallel in association with descriptors in pool.

Returns

Ordered sequence of float values denoting relevancy of pool elements, as well as a sequence of Hashable values referencing in-pool or out-of-pool descriptors we recommend for adjudication feedback. In the latter sequence, descriptors are ordered by usefulness, most to least.

Raises

ValueError – pool and pool_uids are of different length
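The split between rank_with_feedback and _rank_with_feedback described above follows a template-method pattern: the public method validates that pool and pool_uids have equal length, then delegates to the abstract method. The sketch below mirrors that pattern with stand-in classes. FeedbackRankerSketch and ConstantRanker are hypothetical names, and the body is an assumption about the structure rather than the SMQTK source.

```python
import abc
from collections.abc import Hashable
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Result = Tuple[List[float], List[Hashable]]


class FeedbackRankerSketch(abc.ABC):
    """Hypothetical stand-in for the interface; not the SMQTK class."""

    def rank_with_feedback(
        self,
        pos: Sequence[Vector],
        neg: Sequence[Vector],
        pool: Sequence[Vector],
        pool_uids: Sequence[Hashable],
    ) -> Result:
        # Validate once here so subclasses can rely on parallel sequences.
        if len(pool) != len(pool_uids):
            raise ValueError("pool and pool_uids are of different length")
        return self._rank_with_feedback(pos, neg, pool, pool_uids)

    @abc.abstractmethod
    def _rank_with_feedback(
        self,
        pos: Sequence[Vector],
        neg: Sequence[Vector],
        pool: Sequence[Vector],
        pool_uids: Sequence[Hashable],
    ) -> Result:
        """Subclass hook; inputs are already length-checked."""


class ConstantRanker(FeedbackRankerSketch):
    """Toy subclass: scores everything 0.5 and requests feedback in pool order."""

    def _rank_with_feedback(self, pos, neg, pool, pool_uids):
        return [0.5] * len(pool), list(pool_uids)


ranker = ConstantRanker()
scores, feedback_uids = ranker.rank_with_feedback(
    pos=[], neg=[], pool=[[0.0], [1.0]], pool_uids=["u0", "u1"]
)
```

Subclasses written this way only implement the underscore-prefixed method and never repeat the length check.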

RelevancyIndex¶

This interface defines two methods: build_index and rank. The build_index method is, like that of a NearestNeighborsIndex, used to build an index of DescriptorElement instances. The rank method takes relevant and not-relevant DescriptorElement examples, which the algorithm uses to rank (think sort) the indexed DescriptorElement instances by relevancy (on a [0, 1] scale).

class smqtk.algorithms.relevancy_index.RelevancyIndex[source]¶

Abstract class for IQR index implementations.

Similar to a traditional nearest-neighbors algorithm, an IQR index provides a specialized nearest-neighbors interface that can take multiple examples of positively and negatively relevant exemplars in order to produce a [0, 1] ranking of the indexed elements by determined relevancy.

abstract build_index(descriptors)[source]¶

Build the index based on the given iterable of descriptor elements.

Subsequent calls to this method should rebuild the index, not add to it.

Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.

abstract count()[source]¶

Returns: Number of elements in this index.
Return type: int

abstract rank(pos, neg)[source]¶

Rank the currently indexed elements given pos positive and neg negative exemplar descriptor elements.

Parameters

pos (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of positive exemplar DescriptorElement instances. This may be optional for some implementations.
neg (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of negative exemplar DescriptorElement instances. This may be optional for some implementations.

Raises

NoIndexError – If index ranking is requested without an index to rank.

Returns

Map of indexed descriptor elements to a rank value in the [0, 1] range (inclusive), where 1.0 means most relevant and 0.0 means least relevant.

Return type

dict[smqtk.representation.DescriptorElement, float]
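To make the build_index, count and rank contract above concrete, here is a minimal in-memory sketch. It does not use SMQTK: tuples of floats stand in for DescriptorElement instances, NoIndexError is redefined locally as a stand-in for the exception named above, and the class name InMemoryRelevancyIndex and its nearest-exemplar scoring rule are assumptions made for this example.

```python
import math
from typing import Dict, Iterable, List, Tuple

Element = Tuple[float, ...]  # hashable stand-in for a DescriptorElement


class NoIndexError(Exception):
    """Local stand-in for the exception named in the documentation above."""


class InMemoryRelevancyIndex:
    """Hypothetical index illustrating the documented behavior."""

    def __init__(self) -> None:
        self._elements: List[Element] = []

    def build_index(self, descriptors: Iterable[Element]) -> None:
        elements = list(descriptors)
        if not elements:
            raise ValueError("No data available in the given iterable.")
        # Replace, rather than extend, any previously built index.
        self._elements = elements

    def count(self) -> int:
        return len(self._elements)

    def rank(
        self, pos: Iterable[Element], neg: Iterable[Element]
    ) -> Dict[Element, float]:
        if not self._elements:
            raise NoIndexError("No index built to rank.")
        pos_list, neg_list = list(pos), list(neg)
        if not pos_list or not neg_list:
            # Limitation of this sketch; some implementations may allow this.
            raise ValueError("sketch needs positive and negative exemplars")
        ranking = {}
        for elem in self._elements:
            d_pos = min(math.dist(elem, p) for p in pos_list)
            d_neg = min(math.dist(elem, n) for n in neg_list)
            total = d_pos + d_neg
            # Nearer to a positive exemplar gives a value nearer 1.0.
            ranking[elem] = 0.5 if total == 0 else d_neg / total
        return ranking


index = InMemoryRelevancyIndex()
index.build_index([(1.0, 1.0), (2.0, 2.0)])
index.build_index([(0.0, 0.0), (10.0, 0.0), (5.0, 0.0)])  # rebuilds the index
ranking = index.rank(pos=[(0.0, 0.0)], neg=[(10.0, 0.0)])
```

Note that the second build_index call discards the first two elements, matching the statement that subsequent calls rebuild the index rather than add to it.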