Algorithm Interfaces
Here we list and briefly describe the high-level algorithm interfaces that SMQTK provides. There is at least one implementation available for each interface. Some implementations require additional dependencies that cannot be packaged with SMQTK.
Classifier
This interface represents algorithms that classify DescriptorElement instances into discrete labels or label confidences.
-
class smqtk.algorithms.classifier.Classifier
Interface for algorithms that classify input descriptors into discrete labels and/or label confidences.
-
static _assert_array_dim_consistency(array_iter)
Assert that arrays are consistent in dimensionality across iterated arrays.
Currently we only support iterating single-dimension vectors. Arrays of more than one dimension (e.g. 2D matrices) will trigger a ValueError.
Parameters: array_iter (collections.Iterable[numpy.ndarray]) – Iterable of numpy arrays.
Raises: ValueError – Not all input arrays were of consistent dimensionality.
Returns: Iterable of the same arrays in the same order, but validated to be of common dimensionality.
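The check described above can be sketched as a lazy generator. This is not SMQTK's implementation: plain Python lists stand in for numpy.ndarray instances, and the name assert_consistent_dims is hypothetical.

```python
def assert_consistent_dims(vec_iter):
    """Yield each vector unchanged, validating a common 1-D length.

    Hypothetical stand-in: plain lists/tuples play the role of 1-D arrays.
    """
    expected_len = None
    for vec in vec_iter:
        # Reject nested sequences, i.e. anything with more than one dimension.
        if any(isinstance(x, (list, tuple)) for x in vec):
            raise ValueError("Only single-dimension vectors are supported.")
        if expected_len is None:
            expected_len = len(vec)
        elif len(vec) != expected_len:
            raise ValueError(
                "Inconsistent dimensionality: expected %d, got %d"
                % (expected_len, len(vec))
            )
        yield vec


# Validation is lazy: an error surfaces only once the bad vector is reached.
checked = list(assert_consistent_dims([[0.1, 0.2], [0.3, 0.4]]))
```

Because the function is a generator, a caller that never iterates the result never triggers the validation.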
-
_classify_arrays(array_iter)
Overridable method for classifying an iterable of descriptor vectors.
At this level, all input arrays are guaranteed to be of consistent dimensionality.
Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner, whereby only one label is marked with a 1 value (others being 0), or in a continuous manner, whereby each label is given a confidence-like value in the [0, 1] range.
Parameters: array_iter (collections.Iterable[numpy.ndarray]) – Iterable of arrays to be classified.
Returns: Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
Return type: collections.Iterable[dict[collections.Hashable, float]]
-
classify_arrays(array_iter)
Classify an input iterable of numpy arrays into a parallel iterable of label-to-confidence mappings (dictionaries).
Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner, whereby only one label is marked with a 1 value (others being 0), or in a continuous manner, whereby each label is given a confidence-like value in the [0, 1] range.
Parameters: array_iter (collections.Iterable[numpy.ndarray]) – Iterable of arrays to be classified.
Raises: ValueError – Input arrays were not all of consistent dimensionality.
Returns: Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
Return type: collections.Iterable[dict[collections.Hashable, float]]
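To illustrate the two output styles, here is a toy sketch that does not use SMQTK. The labels, the centroid model, and the function names are all hypothetical, and normalizing the confidences to sum to 1 is a choice of this sketch rather than a requirement of the interface.

```python
import math

# Hypothetical model: one centroid per label.
CENTROIDS = {"cat": [0.0, 0.0], "dog": [1.0, 1.0]}


def classify_continuous(vec_iter):
    """Yield one {label: confidence} dict per input vector, values in [0, 1]."""
    for vec in vec_iter:
        # Closer centroids get larger weights; weights are normalized to sum to 1.
        weights = {
            label: math.exp(-math.dist(vec, centroid))
            for label, centroid in CENTROIDS.items()
        }
        total = sum(weights.values())
        yield {label: w / total for label, w in weights.items()}


def classify_discrete(vec_iter):
    """Yield dicts where only the best label is 1.0 and all others are 0.0."""
    for conf in classify_continuous(vec_iter):
        best = max(conf, key=conf.get)
        yield {label: float(label == best) for label in conf}


results = list(classify_discrete([[0.1, 0.0], [0.9, 1.0]]))
```

In both styles every label known to the model appears in every output dictionary, which is what keeps the results parallel and comparable across inputs.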
-
classify_elements(descr_iter, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False, d_elem_batch=100)
Classify an input iterable of descriptor elements into a parallel iterable of classification elements.
Classification element UIDs are inherited from the descriptor element they were generated from.
We invoke classify_arrays for the actual generation of classification results, either for factory-generated classification elements that do not yet have classifications stored, or for all input descriptor elements if the overwrite flag is True. See the documentation for that method for further details.
Selective Iteration
For situations when it is desired to access specific generator returns, like when only one descriptor element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(c.classify_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(c.classify_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.
Non-redundant Processing
Certain classification element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some classification elements may already “have” classification results on construction. This method, by default, only computes new classification results for descriptor elements whose associated classification element does not report as already containing results. If the overwrite flag is True then classifications are computed for all input descriptor elements and results are set to their respective classification elements regardless of existing result storage.
Parameters:
- descr_iter (collections.Iterable[DescriptorElement]) – Iterable of DescriptorElement instances to be classified.
- factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
- overwrite (bool) – Recompute the classification of each input descriptor and set the results to the ClassificationElement produced by the factory.
- d_elem_batch (int) – The number of descriptor elements to collect before requesting the whole batch’s vectors at once via the DescriptorElement.get_many_vectors method.
Raises:
- ValueError – Either: (A) one or more input descriptor elements did not have a stored vector, or (B) input descriptor element arrays were not all of consistent dimensionality.
- IndexError – Implementation of _classify_arrays either under- or over-produced classifications relative to the number of input descriptor vectors.
Returns: Iterator of result ClassificationElement instances. The UUID of each generated ClassificationElement instance will reflect the UUID of the DescriptorElement it was computed from.
Return type: collections.Iterator[smqtk.representation.ClassificationElement]
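The d_elem_batch parameter describes collecting a fixed number of descriptor elements before fetching their vectors together. A generic sketch of that batching pattern follows; the helper name iter_batches is hypothetical and integers stand in for descriptor elements. How SMQTK batches internally is not shown here.

```python
import itertools


def iter_batches(iterable, batch_size):
    """Yield lists of up to ``batch_size`` items, preserving input order."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


# Seven stand-in "elements" grouped three at a time; the last batch is short.
batches = list(iter_batches(range(7), 3))
```

Each batch would then be handed to a bulk vector fetch in one call, which is typically cheaper than one request per element when elements are backed by remote storage.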
-
classify_one_element(descr_elem, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False)
Convenience method around classify_elements for the single-input case.
See the documentation for the Classifier.classify_elements() method for more information.
Parameters:
- descr_elem (DescriptorElement) – DescriptorElement instance to be classified.
- factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
- overwrite (bool) – Recompute the classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
Raises:
- ValueError – The input descriptor element did not have a stored vector.
- IndexError – Implementation of _classify_arrays either under- or over-produced classifications relative to the number of input descriptor vectors.
Returns: ClassificationElement instance. The UUID of the generated ClassificationElement instance will reflect the UUID of the DescriptorElement it was computed from.
Return type: smqtk.representation.ClassificationElement
-
get_labels()
Get the sequence of class labels that this classifier can classify descriptors into. This includes the negative or background label if the classifier embodies such a concept.
Returns: Sequence of possible classifier labels.
Return type: collections.Sequence[collections.Hashable]
Raises: RuntimeError – No model loaded.
-
DescriptorGenerator
This interface represents algorithms that generate whole-content descriptor vectors for one or more given input DataElement instances.
The input DataElement instances must be of a content type that the DescriptorGenerator supports, referenced against the valid_content_types() method (required by the ContentTypeValidator mixin class).
The DescriptorGenerator.generate_elements() method also requires a DescriptorElementFactory instance to tell the algorithm how to generate the DescriptorElement instances it should return.
The returned DescriptorElement instances will have a type equal to the name of the DescriptorGenerator class that generated it, and a UUID that is the same as the input DataElement instance.
If a DescriptorElement implementation that supports persistent storage is generated, and there is already a descriptor associated with the given type name and UUID values, the descriptor is returned without re-computation.
If the overwrite parameter is True, the DescriptorGenerator instance will re-compute a descriptor for the input DataElement, setting it to the generated DescriptorElement. This will overwrite descriptor data in persistent storage if the DescriptorElement type used supports it.
-
class smqtk.algorithms.descriptor_generator.DescriptorGenerator
Base abstract Feature Descriptor interface.
-
generate_arrays(data_iter)
Generate descriptor vectors for all input data elements.
Descriptor arrays yielded out will be parallel in association with the data elements input.
Selective Iteration
For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single array out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_arrays([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_arrays([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators.
Parameters: data_iter (collections.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.
Raises:
- RuntimeError – Descriptor extraction failure of some kind.
- ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
Returns: Iterator of result numpy.ndarray instances.
Return type: collections.Iterator[numpy.ndarray]
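The Selective Iteration advice follows from how Python generators work in general. The sketch below uses a stand-in generator (not an SMQTK class) whose clean-up code sits after the final yield, and shows that it runs with list(...)[0] but not with next(...).

```python
cleanup_log = []


def generate(items):
    """Stand-in generator with clean-up code after its final yield."""
    for item in items:
        yield item * 2
    # Runs only when the generator is iterated to exhaustion.
    cleanup_log.append("cleaned up")


first = list(generate([21]))[0]  # generator runs to completion
ran_after_list = list(cleanup_log)

cleanup_log.clear()
g = generate([21])
also_first = next(g)  # generator is left suspended at the yield
ran_after_next = list(cleanup_log)
```

Both calls return the same value, but only the list expansion reaches the code after the loop.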
-
generate_elements(data_iter, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)
Generate DescriptorElement instances for the input data elements, generating new descriptors for those elements that need them, or optionally all input data elements.
Descriptor elements yielded out will be parallel in association with the data elements input. Descriptor element UUIDs are inherited from the data element they were generated from.
Selective Iteration
For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.
Non-redundant Processing
Certain descriptor element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some descriptor elements may already “have” a vector on construction. This method, by default, only computes new descriptor vectors for data elements whose associated descriptor element does not report as already containing a vector. If the overwrite flag is True then descriptors are computed for all input data elements and are set to their respective descriptor elements regardless of existing vector storage.
Parameters:
- data_iter (collections.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.
- descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
- overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate descriptors for all input data elements, overwriting the vectors previously stored in the factory-produced descriptor elements.
Raises:
- RuntimeError – Descriptor extraction failure of some kind.
- ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- IndexError – Underlying vector-producing generator either under- or over-produced vectors.
Returns: Iterator of result DescriptorElement instances. The UUID of each generated DescriptorElement instance will reflect the UUID of the DataElement it was generated from.
Return type: collections.Iterator[smqtk.representation.DescriptorElement]
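The Non-redundant Processing behavior can be sketched with stand-ins. Everything below is hypothetical (the MemoryElement class, the dictionary used as "persistent storage", and the compute function); it only illustrates how the overwrite flag gates recomputation, one element at a time.

```python
class MemoryElement:
    """Hypothetical stand-in for a DescriptorElement keyed by UUID."""

    def __init__(self, uuid, store):
        self.uuid, self._store = uuid, store

    def has_vector(self):
        return self.uuid in self._store

    def vector(self):
        return self._store[self.uuid]

    def set_vector(self, vec):
        self._store[self.uuid] = vec


def generate_elements(data, store, compute, overwrite=False):
    """Yield one element per (uuid, payload) pair, computing only when needed."""
    for uuid, payload in data:
        elem = MemoryElement(uuid, store)
        if overwrite or not elem.has_vector():
            elem.set_vector(compute(payload))
        yield elem


calls = []


def compute(payload):
    calls.append(payload)  # record which payloads were actually described
    return [float(len(payload))]


store = {"a": [99.0]}  # "a" already has a stored vector
data = [("a", "xx"), ("b", "yyy")]
kept = [e.vector() for e in generate_elements(data, store, compute)]
calls_default = list(calls)
redone = [e.vector() for e in generate_elements(data, store, compute, overwrite=True)]
```

With the default flag only "b" is computed and the stored vector for "a" is returned untouched; with overwrite=True both are recomputed and the stored value for "a" is replaced.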
-
generate_one_array(data_elem)
Convenience wrapper around generate_arrays for the single-input case.
See the documentation for the DescriptorGenerator.generate_arrays() method for more information.
Parameters: data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
Raises:
- RuntimeError – Descriptor extraction failure of some kind.
- ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
Returns: Descriptor vector for the given data as a numpy.ndarray instance.
Return type: numpy.ndarray
-
generate_one_element(data_elem, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)
Convenience wrapper around generate_elements for the single-input case.
See the documentation for the DescriptorGenerator.generate_elements() method for more information.
Parameters:
- data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
- descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
- overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate a descriptor for the input data element, overwriting the vector previously stored in the factory-produced descriptor element.
Raises:
- IndexError – Underlying vector-producing generator either under- or over-produced vectors.
- RuntimeError – Descriptor extraction failure of some kind.
- ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
Returns: Result DescriptorElement instance. The UUID of the generated DescriptorElement instance will reflect the UUID of the DataElement it was generated from.
Return type:
-
ImageReader
-
class smqtk.algorithms.image_io.ImageReader
Interface for algorithms that load a raster image matrix from a data element.
-
is_valid_element(data_element)
Check if the given DataElement instance reports a content type that matches one of the MIME types reported by valid_content_types.
This override checks if the DataElement has the matrix property as the MatrixDataElement would provide, and that its value is of an expected type.
Parameters: data_element (smqtk.representation.DataElement) – Data element instance to check.
Returns: True if the given element has a valid content type as reported by valid_content_types, and False if not.
Return type: bool
-
load_as_matrix(data_element, pixel_crop=None)
Load an image matrix from the given data element.
Matrix Property Shortcut. If the given DataElement instance defines a matrix property, this method simply returns that. This is intended to interface with instances of smqtk.representation.data_element.matrix.MatrixDataElement.
Loading From Bytes. When not loading from a short-cut matrix, the matrix return format is ImageReader implementation dependent. Implementations of this interface should specify and describe their return type.
Aside from the exceptions documented below, other implementation-dependent exceptions may be raised when an image fails to load.
Parameters:
- data_element (smqtk.representation.DataElement) – DataElement to load image data from.
- pixel_crop (None|smqtk.representation.AxisAlignedBoundingBox) – Optional bounding box specifying a pixel sub-region to load from the given data. If this is provided it must represent a valid sub-region within the loaded image, otherwise a RuntimeError is raised. Handling of non-integer aligned boxes is implementation dependent.
Raises:
- RuntimeError – A crop region was specified but did not specify a valid sub-region of the image.
- AssertionError – The data_element provided defined a matrix attribute/property, but its access did not result in an expected value.
- ValueError – This error is raised when:
  - The given data_element was not of a valid content type.
  - A pixel_crop bounding box was provided but was zero volume.
  - pixel_crop bounding box vertices are not fully represented by integers.
Returns: Numpy ndarray of the image data. Specific return format is implementation dependent.
Return type: numpy.ndarray
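The pixel_crop error conditions listed above can be sketched as follows. This is an illustration only: the function crop_matrix is hypothetical, a list of rows stands in for the image matrix, and a (min_x, min_y, max_x, max_y) tuple with exclusive maxima stands in for an AxisAlignedBoundingBox. Rejecting non-integer boxes is one possible choice, since the interface leaves that to implementations.

```python
def crop_matrix(matrix, crop=None):
    """Return ``matrix`` (a list of rows) cropped to ``crop``.

    ``crop`` is a hypothetical (min_x, min_y, max_x, max_y) tuple; the max
    coordinates are exclusive in this sketch.
    """
    if crop is None:
        return matrix
    if any(float(c) != int(c) for c in crop):
        raise ValueError("Crop box vertices must be integer-valued.")
    min_x, min_y, max_x, max_y = (int(c) for c in crop)
    # Zero-volume (or inverted) boxes are rejected.
    if max_x <= min_x or max_y <= min_y:
        raise ValueError("Crop box has zero volume.")
    height, width = len(matrix), len(matrix[0])
    if min_x < 0 or min_y < 0 or max_x > width or max_y > height:
        raise RuntimeError("Crop box is not a valid sub-region of the image.")
    return [row[min_x:max_x] for row in matrix[min_y:max_y]]


image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
patch = crop_matrix(image, (1, 0, 3, 2))
```

The checks are ordered so that malformed boxes raise ValueError before the box is compared against the image bounds, which is where RuntimeError applies.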
-
-
class smqtk.algorithms.image_io.pil_io.PilImageReader(explicit_mode=None)
Image reader that uses PIL to load the image.
This implementation may additionally raise an IOError when failing to load an image.
-
get_config()
Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.
In most cases, this involves naming the keys of the dictionary based on the initialization argument names, as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.
Returns: JSON type compliant configuration dictionary.
Return type: dict
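A minimal sketch of the get_config / from_config round trip described above. ExampleReader is a hypothetical class, not PilImageReader, and the "RGB" value is only an illustrative setting.

```python
import json


class ExampleReader:
    """Hypothetical configurable class following the get_config pattern."""

    def __init__(self, explicit_mode=None):
        self.explicit_mode = explicit_mode

    def get_config(self):
        # Keys mirror the constructor argument names.
        return {"explicit_mode": self.explicit_mode}

    @classmethod
    def from_config(cls, config):
        # Dictionary expansion into the constructor.
        return cls(**config)


original = ExampleReader(explicit_mode="RGB")
# Round-trip through JSON to confirm the configuration is JSON-compliant.
config = json.loads(json.dumps(original.get_config()))
clone = ExampleReader.from_config(config)
```

Keeping the dictionary keys identical to the constructor argument names is what lets from_config stay a one-line dictionary expansion.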
-
classmethod is_usable()
Check whether this class is available for use.
Since certain plugin implementations may require additional dependencies that may not yet be available on the system, this method should check for those dependencies and return a boolean saying if the implementation is usable.
NOTES:
- This should be a class method.
- When an implementation is deemed not usable, this should emit a warning detailing why the implementation is not available for use.
Returns: Boolean determination of whether this implementation is usable.
Return type: bool
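One way to satisfy the notes above is to attempt the optional import and warn on failure. The classes and the REQUIRED_MODULE attribute below are hypothetical; this is a sketch of the pattern, not the check PilImageReader performs.

```python
import importlib
import warnings


class OptionalPlugin:
    """Hypothetical plugin whose dependency may be missing at runtime."""

    REQUIRED_MODULE = "json"  # assumption: any importable module name

    @classmethod
    def is_usable(cls):
        try:
            importlib.import_module(cls.REQUIRED_MODULE)
        except ImportError:
            warnings.warn(
                "%s is not usable: cannot import %r"
                % (cls.__name__, cls.REQUIRED_MODULE)
            )
            return False
        return True


class BrokenPlugin(OptionalPlugin):
    REQUIRED_MODULE = "module_that_does_not_exist_12345"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usable = OptionalPlugin.is_usable()
    broken = BrokenPlugin.is_usable()
```

Returning False with a warning, rather than raising, lets plugin discovery skip unavailable implementations while still telling the user why.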
-
HashIndex
This interface describes specialized NearestNeighborsIndex implementations designed to index hash codes (bit vectors) via the Hamming distance function.
Implementations of this interface are primarily used with the LSHNearestNeighborIndex implementation.
Unlike the NearestNeighborsIndex interface from which this interface descends, HashIndex instances are built with an iterable of numpy.ndarray and nn returns a numpy.ndarray.
-
class smqtk.algorithms.nn_index.hash_index.HashIndex
Specialized NearestNeighborsIndex for indexing unique hash codes (bit-vectors) in memory (numpy arrays) using the Hamming distance metric.
Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.
Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.
-
build_index(hashes)
Build the index with the given hash codes (bit-vectors).
Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception so as to protect the current index.
Raises: ValueError – No data available in the given iterable.
Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to build the index over.
-
nn(h, n=1)
Return the nearest N neighbor hash codes as bit-vectors to the given hash code bit-vector.
Distances are in the range [0, 1] and are the fraction of bits by which each neighbor hash differs from the query, based on the number of bits contained in the query (normalized Hamming distance).
Raises: ValueError – Current index is empty.
Parameters:
Returns: Tuple of nearest N hash codes and a tuple of the distance values to those neighbors.
Return type:
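The normalized Hamming distance used by nn can be sketched with a brute-force search. This is not an SMQTK implementation: tuples of bool stand in for numpy boolean arrays, and both function names are hypothetical.

```python
def normalized_hamming(a, b):
    """Fraction of bit positions at which ``a`` and ``b`` differ, in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("Bit vectors must have the same length.")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_hashes(index, query, n=1):
    """Return (hashes, distances) for the ``n`` closest indexed bit vectors."""
    if not index:
        raise ValueError("Current index is empty.")
    ranked = sorted(index, key=lambda h: normalized_hamming(h, query))[:n]
    return ranked, [normalized_hamming(h, query) for h in ranked]


index = [
    (True, True, False, False),
    (False, False, False, False),
    (True, True, True, True),
]
neighbors, distances = nearest_hashes(index, (True, True, False, True), n=2)
```

Here the first and third indexed codes each differ from the query in one of four bits, so both are returned with a distance of 0.25.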
-
remove_from_index(hashes)
Partially remove hashes from this index.
Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to remove from this index.
Raises:
- ValueError – No data available in the given iterable.
- KeyError – One or more hashes provided do not match any stored hashes.
-
update_index(hashes)
Additively update the current index with the one or more hash vectors given.
If no index exists yet, a new one should be created using the given hash vectors.
Raises: ValueError – No data available in the given iterable.
Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to add to this index.
-
LshFunctor
Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement.
These are used in LSHNearestNeighborIndex instances.
-
class smqtk.algorithms.nn_index.lsh.functors.LshFunctor
Locality-sensitive hashing functor interface.
The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.
Building Models
Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.
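As an illustration of the locality-sensitive property, the sketch below uses random-hyperplane (sign of random projection) hashing. This technique is chosen here for brevity and needs no training; it is not claimed to be one of SMQTK's functor implementations, and make_hyperplane_hasher is a hypothetical name.

```python
import random


def make_hyperplane_hasher(dim, n_bits, seed=0):
    """Return a function mapping a ``dim``-length vector to ``n_bits`` bools."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def get_hash(vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
        )

    return get_hash


hasher = make_hyperplane_hasher(dim=3, n_bits=8)
h_a = hasher([1.0, 2.0, 3.0])
h_scaled = hasher([2.0, 4.0, 6.0])  # same direction as the first vector
```

Vectors pointing in the same direction fall on the same side of every hyperplane and so collide exactly, while vectors separated by a small angle differ in only a few bits with high probability.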
NearestNeighborsIndex
This interface defines a method to build an index from a set of DescriptorElement instances (NearestNeighborsIndex.build_index) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement (NearestNeighborsIndex.nn).
Building an index requires that some non-zero number of DescriptorElement instances be passed into the build_index method. Subsequent calls to this method should rebuild the index model, not add to it. If an implementation supports persistent storage of the index, it should overwrite the configured index.
The nn method uses a single DescriptorElement to query the current index for a specified number of nearest neighbors. Thus, the NearestNeighborsIndex instance must have a non-empty index loaded for this method to function. If the provided query DescriptorElement does not have a set vector, this method will also fail with an exception.
This interface additionally requires that implementations define a count method, which returns the number of distinct DescriptorElement instances that are in the index.
-
class smqtk.algorithms.nn_index.NearestNeighborsIndex
Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.
Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written when build_index is called.
Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.
-
build_index(descriptors)
Build the index with the given descriptor data elements.
Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception so as to protect the current index.
Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
-
nn(d, n=1)
Return the nearest N neighbors to the given descriptor element.
Raises:
- ValueError – Input query descriptor d has no vector set.
- ValueError – Current index is empty.
Parameters:
- d (smqtk.representation.DescriptorElement) – Descriptor element to compute the neighbors of.
- n (int) – Number of nearest neighbors to find.
Returns: Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors.
Return type: (tuple[smqtk.representation.DescriptorElement], tuple[float])
-
remove_from_index(uids)
Partially remove descriptors from this index associated with the given UIDs.
Parameters: uids (collections.Iterable[collections.Hashable]) – Iterable of UIDs of descriptors to remove from this index.
Raises:
- ValueError – No data available in the given iterable.
- KeyError – One or more UIDs provided do not match any stored descriptors. The index should not be modified.
-
update_index(descriptors)
Additively update the current index with the one or more descriptor elements given.
If no index exists yet, a new one should be created using the given descriptors.
Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to add to this index.
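The build / update / remove / query semantics described for this interface can be sketched with a brute-force index. BruteForceIndex is hypothetical and deliberately simplified: it stores (UID, vector) pairs instead of DescriptorElement instances, returns UIDs from nn, uses Euclidean distance, and omits the thread-safety the interface asks for.

```python
import math


class BruteForceIndex:
    """Hypothetical in-memory index mapping UID -> vector (list of floats)."""

    def __init__(self):
        self._vectors = {}

    def count(self):
        return len(self._vectors)

    def build_index(self, descriptors):
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors = new  # rebuild: replaces any previous content

    def update_index(self, descriptors):
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors.update(new)  # additive

    def remove_from_index(self, uids):
        uids = set(uids)
        if not uids:
            raise ValueError("No data available in the given iterable.")
        missing = [u for u in uids if u not in self._vectors]
        if missing:
            # Check everything first so a failure leaves the index unmodified.
            raise KeyError(missing)
        for u in uids:
            del self._vectors[u]

    def nn(self, vec, n=1):
        if not self._vectors:
            raise ValueError("Current index is empty.")
        ranked = sorted(
            self._vectors.items(), key=lambda kv: math.dist(kv[1], vec)
        )[:n]
        uids = tuple(uid for uid, _ in ranked)
        dists = tuple(math.dist(v, vec) for _, v in ranked)
        return uids, dists


index = BruteForceIndex()
index.build_index([("a", [0.0, 0.0]), ("b", [3.0, 4.0])])
index.update_index([("c", [1.0, 0.0])])
uids, dists = index.nn([0.0, 0.0], n=2)
```

Note the difference between build_index, which replaces the stored content, and update_index, which adds to it; remove_from_index validates all UIDs before deleting any of them.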
-
ObjectDetector
This interface defines a method to generate object detections (DetectionElement) over a given DataElement.
-
class smqtk.algorithms.object_detection.ObjectDetector
Abstract interface to an object detection algorithm.
An object detection algorithm is one that can take in data and output zero or more detection elements, where each detection represents a spatial region in the data.
This high level interface only requires detection element returns (spatial bounding-boxes with associated classification elements).
-
detect_objects(data_element, de_factory=<smqtk.representation.detection_element_factory.DetectionElementFactory object>, ce_factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>)
Detect objects in the given data.
UUIDs of detections are based on the hash produced from the combination of:
- Detection bounding-box coordinates.
- Classification label set predicted for a bounding box.
Parameters:
- data_element (smqtk.representation.DataElement) – Source data in which to detect objects.
- de_factory (smqtk.representation.DetectionElementFactory) – Factory for generating DetectionElement instances. The default factory yields MemoryDetectionElement instances.
- ce_factory (smqtk.representation.ClassificationElementFactory) – Factory for generating ClassificationElement instances for detections. The default factory yields MemoryClassificationElement instances.
Raises: ValueError – Given data element content was not of a valid content type that this class reports as valid for object detection.
Returns: Iterator over result DetectionElement instances as generated by the given DetectionElementFactory, containing classification elements as generated by the given ClassificationElementFactory.
Return type: collections.Iterable[smqtk.representation.DetectionElement]
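A content-derived identifier of the kind described above could be sketched like this. The exact hashing scheme SMQTK uses is not given here, so the function detection_uuid, the SHA-1 choice, and the canonical text form are all assumptions made for illustration.

```python
import hashlib


def detection_uuid(bbox_coords, labels):
    """Derive a deterministic identifier from box coordinates and label set.

    Hypothetical scheme: SHA-1 over a canonical text form of the inputs.
    """
    canonical = "%s|%s" % (
        ",".join(repr(float(c)) for c in bbox_coords),
        # Sorting makes the result independent of label iteration order.
        ",".join(sorted(str(label) for label in labels)),
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


uid_1 = detection_uuid((10, 20, 110, 220), {"person", "dog"})
uid_2 = detection_uuid((10, 20, 110, 220), {"dog", "person"})
uid_3 = detection_uuid((10, 20, 111, 220), {"dog", "person"})
```

Deriving the identifier from content means the same box with the same label set always maps to the same detection, while any change to either input produces a different one.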
-
RelevancyIndex
This interface defines two methods: build_index and rank.
The build_index method is, like that of a NearestNeighborsIndex, used to build an index of DescriptorElement instances.
The rank method takes relevant and not-relevant DescriptorElement examples, which the algorithm uses to rank (think sort) the indexed DescriptorElement instances by relevancy (on a [0, 1] scale).
-
class smqtk.algorithms.relevancy_index.RelevancyIndex
Abstract class for IQR index implementations.
Similar to a traditional nearest-neighbors algorithm, an IQR index provides a specialized nearest-neighbors interface that can take multiple examples of positively and negatively relevant exemplars in order to produce a [0, 1] ranking of the indexed elements by determined relevancy.
-
build_index(descriptors)
Build the index based on the given iterable of descriptor elements.
Subsequent calls to this method should rebuild the index, not add to it.
Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
-
rank(pos, neg)
Rank the currently indexed elements given pos positive and neg negative exemplar descriptor elements.
Parameters:
- pos (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of positive exemplar DescriptorElement instances. This may be optional for some implementations.
- neg (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of negative exemplar DescriptorElement instances. This may be optional for some implementations.
Raises: NoIndexError – If index ranking is requested without an index to rank.
Returns: Map of indexed descriptor elements to a rank value in the [0, 1] range (inclusive), where 1.0 means most relevant and 0.0 means least relevant.
Return type:
-
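To make the shape of a rank result concrete, here is a toy scoring sketch. The scoring rule is an assumption of this example and is not how any particular SMQTK implementation ranks; UIDs mapped to plain lists stand in for DescriptorElement instances, and this version requires both pos and neg to be non-empty.

```python
import math


def rank(indexed, pos, neg):
    """Map each indexed UID to a relevancy score in [0, 1].

    Hypothetical scoring rule: compare the distance to the nearest positive
    exemplar with the distance to the nearest negative exemplar.
    """
    scores = {}
    for uid, vec in indexed.items():
        d_pos = min(math.dist(vec, p) for p in pos)
        d_neg = min(math.dist(vec, q) for q in neg)
        total = d_pos + d_neg
        # 1.0 when the element coincides with a positive, 0.0 with a negative.
        scores[uid] = 0.5 if total == 0 else d_neg / total
    return scores


indexed = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [4.0, 0.0]}
scores = rank(indexed, pos=[[0.0, 0.0]], neg=[[4.0, 0.0]])
ordered = sorted(scores, key=scores.get, reverse=True)
```

Sorting the returned map by value, as in the last line, gives the "think sort" view of the ranking described for this interface.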