Algorithm Interfaces

class smqtk.algorithms.SmqtkAlgorithm[source]

Parent class for all algorithm interfaces.

name
Returns:The name of this class type.
Return type:str

This section lists and briefly describes the high-level algorithm interfaces that SMQTK provides. At least one implementation is available for each interface. Some implementations require additional dependencies that cannot be packaged with SMQTK.

Classifier

This interface represents algorithms that classify DescriptorElement instances into discrete labels or label confidences.

class smqtk.algorithms.classifier.Classifier[source]

Interface for algorithms that classify input descriptors into discrete labels and/or label confidences.

static _assert_array_dim_consistency(array_iter)[source]

Assert that arrays are consistent in dimensionality across iterated arrays.

Currently we only support iterating single-dimension vectors. Arrays of more than one dimension (i.e. 2D matrices, etc.) will trigger a ValueError.

Parameters:array_iter (collections.Iterable[numpy.ndarray]) – Iterable numpy arrays.
Raises:ValueError – Not all input arrays were of consistent dimensionality.
Returns:Iterable of the same arrays in the same order, but validated to be of common dimensionality.
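
The validation described above can be sketched as a pass-through generator. This is a minimal illustration, not SMQTK's implementation: plain Python sequences stand in for numpy arrays, and the function name assert_vector_consistency is hypothetical.

```python
from typing import Iterable, Iterator, Sequence


def assert_vector_consistency(
    vectors: Iterable[Sequence[float]],
) -> Iterator[Sequence[float]]:
    """Yield each input vector unchanged after validating its shape.

    Assumption: plain sequences stand in for numpy arrays here.
    """
    expected_len = None
    for vec in vectors:
        # Reject nested sequences, i.e. anything with more than one dimension.
        if any(isinstance(v, (list, tuple)) for v in vec):
            raise ValueError("Only single-dimension vectors are supported.")
        if expected_len is None:
            expected_len = len(vec)
        elif len(vec) != expected_len:
            raise ValueError(
                "Inconsistent vector length: expected %d, got %d"
                % (expected_len, len(vec))
            )
        yield vec


checked = list(assert_vector_consistency([[0.0, 1.0], [2.0, 3.0]]))
```

Because the check is lazy, an inconsistency is only reported once iteration reaches the offending vector.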
_classify_arrays(array_iter)[source]

Overridable method for classifying an iterable of descriptor vectors (numpy arrays).

At this level, all input arrays are guaranteed to be of consistent dimensionality.

Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.

Parameters:array_iter (collections.Iterable[numpy.ndarray]) – Iterable of arrays to be classified.
Returns:Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
Return type:collections.Iterable[dict[collections.Hashable, float]]
classify_arrays(array_iter)[source]

Classify an input iterable of numpy arrays into a parallel iterable of label-to-confidence mappings (dictionaries).

Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.

Parameters:array_iter (collections.Iterable[numpy.ndarray]) – Iterable of numpy arrays to be classified.
Raises:ValueError – Input arrays were not all of consistent dimensionality.
Returns:Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
Return type:collections.Iterable[dict[collections.Hashable, float]]
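
The two output styles described above (continuous confidences versus a discrete one-hot mapping) can be illustrated with a toy classifier. This is a sketch under stated assumptions: the label set, the softmax scoring, and both function names are hypothetical, and plain lists stand in for numpy arrays.

```python
import math
from typing import Dict, Iterable, Iterator, Sequence

LABELS = ("cat", "dog", "background")  # hypothetical label set


def classify_continuous(
    vectors: Iterable[Sequence[float]],
) -> Iterator[Dict[str, float]]:
    """Toy softmax: treat each vector entry as the score of one label."""
    for vec in vectors:
        exps = [math.exp(v) for v in vec]
        total = sum(exps)
        # Every label in the model gets a confidence in the [0, 1] range.
        yield {label: e / total for label, e in zip(LABELS, exps)}


def classify_discrete(
    vectors: Iterable[Sequence[float]],
) -> Iterator[Dict[str, float]]:
    """Same model, but only the best label is marked 1 (others 0)."""
    for conf in classify_continuous(vectors):
        best = max(conf, key=conf.get)
        yield {label: (1.0 if label == best else 0.0) for label in conf}


continuous = list(classify_continuous([[2.0, 1.0, 0.0]]))
discrete = list(classify_discrete([[2.0, 1.0, 0.0]]))
```

In both styles each yielded dictionary contains every label, and the outputs are parallel to the inputs.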
classify_elements(descr_iter, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False, d_elem_batch=100)[source]

Classify an input iterable of descriptor elements into a parallel iterable of classification elements.

Classification element UIDs are inherited from the descriptor element it was generated from.

We invoke classify_arrays for actual generation of classification results; see the documentation for that method for further details. It is invoked for factory-generated classification elements that do not yet have classifications stored, or on all input descriptor elements if the overwrite flag is True.

Selective Iteration For situations when it is desired to access specific generator returns, like when only one descriptor element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(c.classify_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(c.classify_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.
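
The difference between the two access patterns is standard Python generator behavior and can be shown without SMQTK. In this sketch the generator and the cleanup_log list are hypothetical stand-ins for work performed after the final yield.

```python
from typing import Iterable, Iterator, List

cleanup_log: List[str] = []


def generate(items: Iterable[int]) -> Iterator[int]:
    for item in items:
        yield item * 2
    # Stand-in for post-yield work, e.g. flushing results to storage.
    cleanup_log.append("cleanup ran")


# Taking only the "next" value leaves the generator suspended at the yield,
# so the code after the loop never executes.
first_via_next = next(generate([21]))
log_after_next = list(cleanup_log)

# Expanding into a list drives the generator to completion.
first_via_list = list(generate([21]))[0]
log_after_list = list(cleanup_log)
```

Both patterns return the same value; only the list expansion runs the trailing clean-up step.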

Non-redundant Processing Certain classification element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some classification elements may already “have” classification results on construction. This method, by default, only computes new classification results for descriptor elements whose associated classification element does not report as already containing results. If the overwrite flag is True then classifications are computed for all input descriptor elements and results are set to their respective classification elements regardless of existing result storage.

Parameters:
  • descr_iter (collections.Iterable[DescriptorElement]) – Iterable of DescriptorElement instances to be classified.
  • factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
  • overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
  • d_elem_batch (int) – The number of descriptor elements to collect before requesting the whole batch’s vectors at once via DescriptorElement.get_many_vectors method.
Raises:
  • ValueError – Either: (A) one or more input descriptor elements did not have a stored vector, or (B) input descriptor element arrays were not all of consistent dimensionality.
  • IndexError – Implementation of _classify_arrays either under or over produced classifications relative to the number of input descriptor vectors.
Returns:

Iterator of result ClassificationElement instances. UUIDs of generated ClassificationElement instances will reflect the UUID of the DescriptorElement it was computed from.

Return type:

collections.Iterator[smqtk.representation.ClassificationElement]
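
The batching controlled by d_elem_batch and the IndexError condition above can be sketched generically. This is an illustration only: the helper names batched and process_in_batches are hypothetical, and the per-batch count check is an assumed simplification of however the library detects under or over production.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(
    items: Iterable[T],
    process_batch: Callable[[List[T]], Iterable[R]],
    batch_size: int = 100,
) -> Iterator[R]:
    """Yield one result per input item, in input order."""
    for batch in batched(items, batch_size):
        results = list(process_batch(batch))
        # Results must stay parallel to the inputs.
        if len(results) != len(batch):
            raise IndexError(
                "Expected %d results but got %d" % (len(batch), len(results))
            )
        yield from results


doubled = list(
    process_in_batches(range(5), lambda b: [x * 2 for x in b], batch_size=2)
)
```

A larger batch size means fewer bulk requests at the cost of holding more items in memory at once.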

classify_one_element(descr_elem, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False)[source]

Convenience method around classify_elements for the single-input case.

See documentation for the Classifier.classify_elements() method for more information.

Parameters:
  • descr_elem (DescriptorElement) – DescriptorElement instance to be classified.
  • factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
  • overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
Raises:
  • ValueError – The input descriptor element did not have a stored vector.
  • IndexError – Implementation of _classify_arrays either under or over produced classifications relative to the number of input descriptor vectors.
Returns:

ClassificationElement instance. The UUID of the generated ClassificationElement instance will reflect the UUID of the DescriptorElement it was computed from.

Return type:

smqtk.representation.ClassificationElement

get_labels()[source]

Get the sequence of class labels that this classifier can classify descriptors into. This includes the negative or background label if the classifier embodies such a concept.

Returns:Sequence of possible classifier labels.
Return type:collections.Sequence[collections.Hashable]
Raises:RuntimeError – No model loaded.

DescriptorGenerator

This interface represents algorithms that generate whole-content descriptor vectors for one or more given input DataElement instances. The input DataElement instances must be of a content type that the DescriptorGenerator supports, referenced against the valid_content_types() method (required by the ContentTypeValidator mixin class).

The DescriptorGenerator.generate_elements() method also requires a DescriptorElementFactory instance to tell the algorithm how to generate the DescriptorElement instances it should return. The returned DescriptorElement instances will have a type equal to the name of the DescriptorGenerator class that generated it, and a UUID that is the same as the input DataElement instance.

If a DescriptorElement implementation that supports persistent storage is generated, and there is already a descriptor associated with the given type name and UUID values, the descriptor is returned without re-computation.

If the overwrite parameter is True, the DescriptorGenerator instance will re-compute a descriptor for the input DataElement, setting it to the generated DescriptorElement. This will overwrite descriptor data in persistent storage if the DescriptorElement type used supports it.

class smqtk.algorithms.descriptor_generator.DescriptorGenerator[source]

Base abstract Feature Descriptor interface.

generate_arrays(data_iter)[source]

Generate descriptor vector elements for all input data elements.

Descriptor arrays yielded out will be parallel in association with the data elements input.

Selective Iteration For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single array out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_arrays([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_arrays([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators.

Parameters:

data_iter (collections.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.

Raises:
  • RuntimeError – Descriptor extraction failure of some kind.
  • ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
Returns:

Iterator of result numpy.ndarray instances.

Return type:

collections.Iterator[numpy.ndarray]

generate_elements(data_iter, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)[source]

Generate DescriptorElement instances for the input data elements, generating new descriptors for those elements that need them, or optionally all input data elements.

Descriptor elements yielded out will be parallel in association with the data elements input. Descriptor element UUIDs are inherited from the data element it was generated from.

Selective Iteration For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.

Non-redundant Processing Certain descriptor element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some descriptor elements may already “have” a vector on construction. This method, by default, only computes new descriptor vectors for data elements whose associated descriptor element does not report as already containing a vector. If the overwrite flag is True then descriptors are computed for all input data elements and are set to their respective descriptor elements regardless of existing vector storage.
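
The non-redundant behavior and the effect of the overwrite flag can be sketched with a dictionary standing in for persistent storage. Everything here is hypothetical (the function names, the (uuid, content) tuples, the toy descriptor), and the sketch processes items one at a time rather than in batches.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Vector = Tuple[float, ...]


def generate_vectors(
    data: Iterable[Tuple[str, str]],
    store: Dict[str, Vector],
    compute: Callable[[str], Vector],
    overwrite: bool = False,
) -> Iterator[Tuple[str, Vector]]:
    """Yield (uuid, vector) pairs parallel to the input (uuid, content) pairs."""
    for uuid, content in data:
        # Compute only when nothing is stored yet, unless overwrite is set.
        if overwrite or uuid not in store:
            store[uuid] = compute(content)
        yield uuid, store[uuid]


calls: List[str] = []


def length_descriptor(content: str) -> Vector:
    """Toy descriptor: the content length as a 1-D vector."""
    calls.append(content)
    return (float(len(content)),)


store: Dict[str, Vector] = {"a": (99.0,)}  # "a" already has a stored vector
data = [("a", "xx"), ("b", "yyy")]

default_run = list(generate_vectors(data, store, length_descriptor))
calls_after_default = list(calls)
overwrite_run = list(
    generate_vectors(data, store, length_descriptor, overwrite=True)
)
```

In the default run only "b" is computed and the stored vector for "a" is returned as is; with overwrite=True both are recomputed and the stored value for "a" is replaced.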

Parameters:
  • data_iter (collections.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.
  • descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
  • overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate descriptors for all input data elements, overwriting the vectors previously stored in the factory-produced descriptor elements.
Raises:
  • RuntimeError – Descriptor extraction failure of some kind.
  • ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
  • IndexError – Underlying vector-producing generator either under or over produced vectors.
Returns:

Iterator of result DescriptorElement instances. UUIDs of generated DescriptorElement instances will reflect the UUID of the DataElement it was generated from.

Return type:

collections.Iterator[smqtk.representation.DescriptorElement]

generate_one_array(data_elem)[source]

Convenience wrapper around generate_arrays for the single-input case.

See the documentation for the DescriptorGenerator.generate_arrays() method for more information.

Parameters:

data_elem (smqtk.representation.DataElement) – DataElement instance to be described.

Raises:
  • RuntimeError – Descriptor extraction failure of some kind.
  • ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
Returns:

Descriptor vector for the given data as a numpy.ndarray instance.

Return type:

numpy.ndarray

generate_one_element(data_elem, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)[source]

Convenience wrapper around generate_elements for the single-input case.

See the documentation for the DescriptorGenerator.generate_elements() method for more information.

Parameters:
  • data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
  • descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
  • overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate a descriptor for the input data element, overwriting the vector previously stored in the factory-produced descriptor element.
Raises:
  • IndexError – Underlying vector-producing generator either under or over produced vectors.
  • RuntimeError – Descriptor extraction failure of some kind.
  • ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
Returns:

Result DescriptorElement instance. UUID of the generated DescriptorElement instance will reflect the UUID of the DataElement it was generated from.

Return type:

smqtk.representation.DescriptorElement

ImageReader

class smqtk.algorithms.image_io.ImageReader[source]

Interface for algorithms that load a raster image matrix from a data element.

is_valid_element(data_element)[source]

Check if the given DataElement instance reports a content type that matches one of the MIME types reported by valid_content_types.

This override checks if the DataElement has the matrix property as the MatrixDataElement would provide, and that its value is of an expected type.

Parameters:data_element (smqtk.representation.DataElement) – Data element instance to check.
Returns:True if the given element has a valid content type as reported by valid_content_types, and False if not.
Return type:bool
load_as_matrix(data_element, pixel_crop=None)[source]

Load an image matrix from the given data element.

Matrix Property Shortcut. If the given DataElement instance defines a matrix property this method simply returns that. This is intended to interface with instances of smqtk.representation.data_element.matrix.MatrixDataElement.

Loading From Bytes. When not loading from a shortcut matrix, the matrix return format is ImageReader implementation dependent. Implementations of this interface should specify and describe their return type.

Aside from the exceptions documented below, other exceptions may be raised when an image fails to load that are implementation dependent.

Parameters:
  • data_element (smqtk.representation.DataElement) – DataElement to load image data from.
  • pixel_crop (None|smqtk.representation.AxisAlignedBoundingBox) – Optional bounding box specifying a pixel sub-region to load from the given data. If this is provided it must represent a valid sub-region within the loaded image, otherwise a RuntimeError is raised. Handling of non-integer aligned boxes is implementation dependent.
Raises:
  • RuntimeError – A crop region was specified but did not specify a valid sub-region of the image.
  • AssertionError – The data_element provided defined a matrix attribute/property, but its access did not result in an expected value.
  • ValueError
    This error is raised when:
    • The given data_element was not of a valid content type.
    • A pixel_crop bounding box was provided but was zero volume.
    • pixel_crop bounding box vertices are not fully represented by integers.
Returns:

Numpy ndarray of the image data. The specific return format is implementation dependent.

Return type:

numpy.ndarray
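
The pixel_crop error conditions listed above can be sketched as a standalone validation function. This is not the ImageReader code: the (x_min, y_min, x_max, y_max) tuple layout, the function name validate_crop, and the order of the checks are all assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def validate_crop(crop: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Return the crop as integers, or raise per the documented conditions."""
    # ValueError: vertices are not fully represented by integers.
    if any(float(v) != int(v) for v in crop):
        raise ValueError("Crop vertices must be integer-valued.")
    x0, y0, x1, y1 = (int(v) for v in crop)
    # ValueError: the box is zero volume (zero area in 2D).
    if (x1 - x0) * (y1 - y0) == 0:
        raise ValueError("Crop box has zero area.")
    # RuntimeError: the box is not a valid sub-region of the image.
    if x0 < 0 or y0 < 0 or x1 > width or y1 > height or x1 < x0 or y1 < y0:
        raise RuntimeError("Crop box is not a valid sub-region of the image.")
    return x0, y0, x1, y1


valid = validate_crop((0, 0, 10, 10), width=100, height=50)
```

Note that the bounds check can only run once the image dimensions are known, which is consistent with the RuntimeError being tied to the loaded image.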

class smqtk.algorithms.image_io.pil_io.PilImageReader(explicit_mode=None)[source]

Image reader that uses PIL to load the image.

This implementation may additionally raise an IOError when failing to load an image.

get_config()[source]

Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.

In most cases, this involves naming the keys of the dictionary after the initialization argument names, as if the dictionary were to be passed to the constructor via dictionary expansion. In some cases it does not make sense to store certain constructor parameters as configuration values (i.e. they must be supplied at runtime), and this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.

Returns:JSON type compliant configuration dictionary.
Return type:dict
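
The round trip between get_config and from_config can be sketched with a small stand-in class. ExampleReader is hypothetical and is not an SMQTK class; only the explicit_mode argument name is borrowed from the PilImageReader signature above, and the "RGB" value is an arbitrary example.

```python
import json
from typing import Any, Dict, Optional


class ExampleReader:
    """Hypothetical configurable class; not an SMQTK implementation."""

    def __init__(self, explicit_mode: Optional[str] = None) -> None:
        self.explicit_mode = explicit_mode

    def get_config(self) -> Dict[str, Any]:
        # Keys mirror the constructor argument names.
        return {"explicit_mode": self.explicit_mode}

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ExampleReader":
        # Dictionary expansion into the constructor.
        return cls(**config)


original = ExampleReader(explicit_mode="RGB")
# JSON compliance means the configuration survives serialization.
serialized = json.dumps(original.get_config())
restored = ExampleReader.from_config(json.loads(serialized))
```

Keeping the dictionary keys identical to the constructor argument names is what makes the dictionary-expansion step possible.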
classmethod is_usable()[source]

Check whether this class is available for use.

Since certain plugin implementations may require additional dependencies that may not yet be available on the system, this method should check for those dependencies and return a boolean saying if the implementation is usable.

NOTES:
  • This should be a class method
  • When an implementation is deemed not usable, this should emit a
    warning detailing why the implementation is not available for use.
Returns:Boolean determination of whether this implementation is usable.
Return type:bool
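
One way to satisfy the notes above is to probe for the optional dependency and warn when it is missing. This is a sketch, not the PilImageReader code: both classes and the REQUIRED_MODULE attribute are hypothetical, and the standard-library json module stands in for a real optional dependency.

```python
import importlib.util
import warnings


class OptionalDependencyPlugin:
    """Hypothetical plugin that depends on an optional module."""

    # Assumption: replace with the name of the real optional dependency.
    REQUIRED_MODULE = "json"

    @classmethod
    def is_usable(cls) -> bool:
        if importlib.util.find_spec(cls.REQUIRED_MODULE) is None:
            # Explain why the implementation is unavailable.
            warnings.warn(
                "%s is not usable: module '%s' is not importable"
                % (cls.__name__, cls.REQUIRED_MODULE),
                RuntimeWarning,
            )
            return False
        return True


class MissingDependencyPlugin(OptionalDependencyPlugin):
    REQUIRED_MODULE = "surely_not_an_installed_module_xyz"


usable = OptionalDependencyPlugin.is_usable()
```

Using find_spec avoids importing the dependency just to test for its presence.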
valid_content_types()[source]
Returns:A set of MIME types that are “valid” within the implementing class’ context.
Return type:set[str]

HashIndex

This interface describes specialized NearestNeighborsIndex implementations designed to index hash codes (bit vectors) via the hamming distance function. Implementations of this interface are primarily used with the LSHNearestNeighborIndex implementation.

Unlike the NearestNeighborsIndex interface from which this interface descends, HashIndex instances are built with an iterable of numpy.ndarray and nn returns a numpy.ndarray.

class smqtk.algorithms.nn_index.hash_index.HashIndex[source]

Specialized NearestNeighborsIndex for indexing unique hash codes (bit-vectors) in memory (numpy arrays) using the hamming distance metric.

Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.

Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.

build_index(hashes)[source]

Build the index with the given hash codes (bit-vectors).

Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception, so as to protect the current index.

Raises:ValueError – No data available in the given iterable.
Parameters:hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to build index over.
count()[source]
Returns:Number of elements in this index.
Return type:int
nn(h, n=1)[source]

Return the nearest N neighbor hash codes as bit-vectors to the given hash code bit-vector.

Distances are in the range [0,1] and express how different each neighbor hash is from the query, as the fraction of differing bits relative to the number of bits contained in the query (normalized hamming distance).

Raises:

ValueError – Current index is empty.

Parameters:
  • h (numpy.ndarray[bool]) – Hash code to compute the neighbors of. Should be the same bit length as indexed hash codes.
  • n (int) – Number of nearest neighbors to find.
Returns:

Tuple of nearest N hash codes and a tuple of the distance values to those neighbors.

Return type:

(tuple[numpy.ndarray[bool]], tuple[float])
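
A linear-scan version of nn with the normalized hamming distance can be sketched as follows. This is an illustration rather than any SMQTK HashIndex implementation: tuples of booleans stand in for numpy boolean arrays, and both function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

BitVector = Tuple[bool, ...]


def normalized_hamming(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of bit positions at which the two codes differ, in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("Hash codes must have the same bit length.")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_hashes(
    indexed: Iterable[BitVector], h: BitVector, n: int = 1
) -> Tuple[Tuple[BitVector, ...], Tuple[float, ...]]:
    """Return the n nearest unique codes to h and their distances."""
    unique = set(indexed)  # only unique bit vectors are indexed
    if not unique:
        raise ValueError("Current index is empty.")
    ranked = sorted((normalized_hamming(h, code), code) for code in unique)[:n]
    return tuple(c for _, c in ranked), tuple(d for d, _ in ranked)


codes = [
    (True, False, True, True),
    (False, False, False, False),
    (True, True, True, True),
]
neighbors, distances = nearest_hashes(codes, (True, False, True, True), n=2)
```

Here the query matches the first code exactly (distance 0.0) and differs from the all-True code in one of four bits (distance 0.25).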

remove_from_index(hashes)[source]

Partially remove hashes from this index.

Parameters:

hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to remove from this index.

Raises:
  • ValueError – No data available in the given iterable.
  • KeyError – One or more hashes provided do not match any stored hashes.
update_index(hashes)[source]

Additively update the current index with the one or more hash vectors given.

If no index exists yet, a new one should be created using the given hash vectors.

Raises:ValueError – No data available in the given iterable.
Parameters:hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to add to this index.

LshFunctor

Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement. These are used in LSHNearestNeighborIndex instances.

class smqtk.algorithms.nn_index.lsh.functors.LshFunctor[source]

Locality-sensitive hashing functor interface.

The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.

Building Models

Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.

get_hash(descriptor)[source]

Get the locality-sensitive hash code for the input descriptor.

Parameters:descriptor (numpy.ndarray[float]) – Descriptor vector we should generate the hash of.
Returns:Generated bit-vector as a numpy array of booleans.
Return type:numpy.ndarray[bool]
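
As one concrete illustration of a locality-sensitive hash, the sketch below uses random hyperplane (sign random projection) hashing. The source does not say which techniques SMQTK's functors use, so this is a swapped-in example: the class name, constructor parameters, and seed handling are hypothetical, and tuples of booleans stand in for numpy boolean arrays.

```python
import random
from typing import List, Sequence, Tuple


class RandomHyperplaneFunctor:
    """Hypothetical functor: one bit per random hyperplane."""

    def __init__(self, dim: int, bits: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Each hyperplane is defined by a random Gaussian normal vector.
        self._planes: List[List[float]] = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)
        ]

    def get_hash(self, descriptor: Sequence[float]) -> Tuple[bool, ...]:
        # A bit records which side of the hyperplane the descriptor lies on.
        return tuple(
            sum(w * x for w, x in zip(plane, descriptor)) >= 0.0
            for plane in self._planes
        )


functor = RandomHyperplaneFunctor(dim=3, bits=8, seed=42)
code = functor.get_hash([0.5, -1.0, 2.0])
scaled_code = functor.get_hash([1.0, -2.0, 4.0])
```

Vectors pointing in similar directions fall on the same side of most hyperplanes, so their codes agree in most bits; a vector and a positively scaled copy of it produce the same code. This particular scheme needs no trained model, unlike the model-building functors mentioned above.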

NearestNeighborsIndex

This interface defines a method to build an index from a set of DescriptorElement instances (NearestNeighborsIndex.build_index) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement (NearestNeighborsIndex.nn).

Building an index requires that some non-zero number of DescriptorElement instances be passed into the build_index method. Subsequent calls to this method should rebuild the index model, not add to it. If an implementation supports persistent storage of the index, it should overwrite the configured index.

The nn method uses a single DescriptorElement to query the current index for a specified number of nearest neighbors. Thus, the NearestNeighborsIndex instance must have a non-empty index loaded for this method to function. If the provided query DescriptorElement does not have a set vector, this method will also fail with an exception.

This interface additionally requires that implementations define a count method, which returns the number of distinct DescriptorElement instances that are in the index.

class smqtk.algorithms.nn_index.NearestNeighborsIndex[source]

Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.

Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written when build_index is called.

Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.

build_index(descriptors)[source]

Build the index with the given descriptor data elements.

Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception, so as to protect the current index.

Raises:ValueError – No data available in the given iterable.
Parameters:descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
count()[source]
Returns:Number of elements in this index.
Return type:int
nn(d, n=1)[source]

Return the nearest N neighbors to the given descriptor element.

Raises:
Parameters:
Returns:

Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors.

Return type:

(tuple[smqtk.representation.DescriptorElement], tuple[float])

remove_from_index(uids)[source]

Partially remove descriptors from this index associated with the given UIDs.

Parameters:

uids (collections.Iterable[collections.Hashable]) – Iterable of UIDs of descriptors to remove from this index.

Raises:
  • ValueError – No data available in the given iterable.
  • KeyError – One or more UIDs provided do not match any stored descriptors. The index should not be modified.
update_index(descriptors)[source]

Additively update the current index with the one or more descriptor elements given.

If no index exists yet, a new one should be created using the given descriptors.

Raises:ValueError – No data available in the given iterable.
Parameters:descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to add to this index.
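
The build, update, remove, and query semantics described for this interface can be sketched with a brute-force in-memory index. This is not an SMQTK implementation: BruteForceIndex is hypothetical, (uid, vector) tuples stand in for DescriptorElement instances, nn returns UIDs rather than elements, Euclidean distance is an arbitrary choice, and raising ValueError on an empty index is an assumption. The thread-safety requirement is also left out for brevity.

```python
import math
from typing import Dict, Hashable, Iterable, Sequence, Tuple

Vector = Sequence[float]
Pairs = Iterable[Tuple[Hashable, Vector]]


class BruteForceIndex:
    """Hypothetical in-memory index over (uid, vector) pairs."""

    def __init__(self) -> None:
        self._vectors: Dict[Hashable, Vector] = {}

    def build_index(self, descriptors: Pairs) -> None:
        new = dict(descriptors)
        if not new:
            # Raise before touching the current index so it stays intact.
            raise ValueError("No data available in the given iterable.")
        self._vectors = new  # rebuild: replaces, never appends

    def update_index(self, descriptors: Pairs) -> None:
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors.update(new)  # additive

    def remove_from_index(self, uids: Iterable[Hashable]) -> None:
        targets = set(uids)
        if not targets:
            raise ValueError("No data available in the given iterable.")
        missing = [u for u in targets if u not in self._vectors]
        if missing:
            # Check everything first so a failure leaves the index unmodified.
            raise KeyError(missing)
        for uid in targets:
            del self._vectors[uid]

    def count(self) -> int:
        return len(self._vectors)

    def nn(self, d: Vector, n: int = 1):
        if not self._vectors:
            raise ValueError("Current index is empty.")  # assumed error type
        ranked = sorted(
            ((math.dist(d, vec), uid) for uid, vec in self._vectors.items()),
            key=lambda pair: pair[0],
        )[:n]
        return tuple(u for _, u in ranked), tuple(dist for dist, _ in ranked)


index = BruteForceIndex()
index.build_index([("a", (0.0, 0.0)), ("b", (3.0, 4.0))])
uids, dists = index.nn((0.0, 1.0), n=2)
```

Validating the input before mutating internal state is what lets build_index and remove_from_index fail without damaging the current index.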

ObjectDetector

This interface defines a method to generate object detections (DetectionElement) over a given DataElement.

class smqtk.algorithms.object_detection.ObjectDetector[source]

Abstract interface to an object detection algorithm.

An object detection algorithm is one that can take in data and output zero or more detection elements, where each detection represents a spatial region in the data.

This high level interface only requires detection element returns (spatial bounding-boxes with associated classification elements).

detect_objects(data_element, de_factory=<smqtk.representation.detection_element_factory.DetectionElementFactory object>, ce_factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>)[source]

Detect objects in the given data.

UUIDs of detections are based on the hash produced from the combination of:

  • Detection bounding-box coordinates
  • Classification label set predicted for a bounding box.
Parameters:
Raises:

ValueError – Given data element content was not of a valid content type that this class reports as valid for object detection.

Returns:

Iterator over result DetectionElement instances as generated by the given DetectionElementFactory, containing classification elements as generated by the given ClassificationElementFactory.

Return type:

collections.Iterable[smqtk.representation.DetectionElement]
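
Deriving a detection UUID from the bounding-box coordinates and the predicted label set can be sketched as below. The source does not specify the hash function or the serialization, so SHA-1 over a repr string, the sorted label order, and the function name detection_uuid are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Sequence


def detection_uuid(bbox: Sequence[float], labels: Iterable[str]) -> str:
    """Hash the box coordinates together with the label set."""
    # Sorting makes the result independent of label iteration order.
    payload = repr((tuple(float(c) for c in bbox), tuple(sorted(labels))))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


uuid_a = detection_uuid((0, 0, 10, 20), ["cat", "dog"])
uuid_b = detection_uuid((0, 0, 10, 20), ["dog", "cat"])
uuid_c = detection_uuid((0, 0, 10, 21), ["cat", "dog"])
```

With a content-derived identifier like this, the same box with the same label set maps to the same UUID, while any change to the coordinates or labels yields a different one.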

RelevancyIndex

This interface defines two methods: build_index and rank. The build_index method is, like that of a NearestNeighborsIndex, used to build an index of DescriptorElement instances. The rank method takes examples of relevant and not-relevant DescriptorElement instances, which the algorithm uses to rank (think sort) the indexed DescriptorElement instances by relevancy (on a [0, 1] scale).

class smqtk.algorithms.relevancy_index.RelevancyIndex[source]

Abstract class for IQR index implementations.

Similar to a traditional nearest-neighbors algorithm, an IQR index provides a specialized nearest-neighbors interface that can take multiple examples of positively and negatively relevant exemplars in order to produce a [0, 1] ranking of the indexed elements by determined relevancy.

build_index(descriptors)[source]

Build the index based on the given iterable of descriptor elements.

Subsequent calls to this method should rebuild the index, not add to it.

Raises:ValueError – No data available in the given iterable.
Parameters:descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
count()[source]
Returns:Number of elements in this index.
Return type:int
rank(pos, neg)[source]

Rank the currently indexed elements given pos positive and neg negative exemplar descriptor elements.

Parameters:
Raises:

NoIndexError – If index ranking is requested without an index to rank.

Returns:

Map of indexed descriptor elements to a rank value in the [0, 1] range (inclusive), where 1.0 means most relevant and 0.0 means least relevant.

Return type:

dict[smqtk.representation.DescriptorElement, float]
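
The shape of a rank result can be illustrated with a toy distance-based scoring. The source does not describe how implementations compute relevancy, so the scoring rule here is purely an assumption, as are the function name, the use of UIDs as dictionary keys in place of DescriptorElement instances, and LookupError standing in for NoIndexError.

```python
import math
from typing import Dict, Hashable, Sequence

Vector = Sequence[float]


def rank(
    index: Dict[Hashable, Vector],
    pos: Sequence[Vector],
    neg: Sequence[Vector],
) -> Dict[Hashable, float]:
    """Score each indexed vector in [0, 1]; 1.0 is most relevant."""
    if not index:
        raise LookupError("No index to rank.")  # stand-in for NoIndexError
    scores: Dict[Hashable, float] = {}
    for uid, vec in index.items():
        d_pos = min(math.dist(vec, p) for p in pos)
        d_neg = min((math.dist(vec, q) for q in neg), default=math.inf)
        if d_pos == 0.0:
            score = 1.0  # identical to a positive exemplar
        elif math.isinf(d_neg):
            score = 1.0 / (1.0 + d_pos)  # no negative exemplars given
        else:
            # Closer to positives than negatives pushes the score toward 1.
            score = d_neg / (d_pos + d_neg)
        scores[uid] = score
    return scores


indexed = {"a": (0.0, 0.0), "b": (10.0, 0.0), "c": (5.0, 0.0)}
ranking = rank(indexed, pos=[(0.0, 0.0)], neg=[(10.0, 0.0)])
```

In this example "a" coincides with the positive exemplar, "b" with the negative one, and "c" sits halfway between them.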