Algorithm Interfaces
Here we list and briefly describe the high-level algorithm interfaces that SMQTK provides. There is at least one implementation available for each interface. Some implementations require additional dependencies that cannot be packaged with SMQTK.
Classifier
This interface represents algorithms that classify DescriptorElement
instances into discrete labels or label confidences.
-
class smqtk.algorithms.classifier.Classifier
Interface for algorithms that classify input descriptors into discrete labels and/or label confidences.
-
classify(d, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False)
Classify the input descriptor against one or more discrete labels, outputting a ClassificationElement containing the classification result.
We return confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.
The returned ClassificationElement will have the same UUID as the input DescriptorElement.
Parameters: - d (smqtk.representation.DescriptorElement) – Input descriptor to classify
- factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
- overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
Raises: - ValueError – The given descriptor element did not have a vector to operate on.
- RuntimeError – Could not perform classification for some reason (see message in raised exception).
Returns: Classification result element
Return type: smqtk.representation.ClassificationElement
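The difference between the discrete and continuous behaviours described for classify can be illustrated with plain dictionaries. This is only a sketch under stated assumptions: discrete_result and continuous_result are hypothetical helper names, and a label-to-confidence dict stands in for whatever a ClassificationElement actually stores.

```python
def discrete_result(labels, winner):
    """Mark exactly one label with 1.0 and every other label with 0.0."""
    if winner not in labels:
        raise ValueError("winner must be one of the classifier's labels")
    return {label: 1.0 if label == winner else 0.0 for label in labels}


def continuous_result(scores):
    """Clamp raw per-label scores into the [0, 1] range."""
    return {label: min(1.0, max(0.0, float(s))) for label, s in scores.items()}


# Hypothetical label set; the real labels come from the loaded model.
labels = ["cat", "dog", "negative"]
print(discrete_result(labels, "dog"))
print(continuous_result({"cat": 0.2, "dog": 0.75, "negative": 1.3}))
```

Either form gives one confidence value per label, which is the property the method description relies on.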
-
classify_async(d_iter, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False, procs=None, use_multiprocessing=False, ri=None)
Asynchronously classify the DescriptorElements in the given iterable.
Parameters: - d_iter (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of DescriptorElements
- factory (smqtk.representation.ClassificationElementFactory) – Classifier element factory to use for element generation. The default factory yields MemoryClassificationElement instances.
- overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
- procs (None | int) – Explicit number of cores/threads/processes to use.
- use_multiprocessing (bool) – Use multiprocessing instead of threading.
- ri (float | None) – Progress reporting interval in seconds. Set to a value > 0 to enable. Disabled by default.
Returns: Mapping of input DescriptorElement instances to the computed ClassificationElement. Each ClassificationElement has the same UUID as its DescriptorElement.
Return type: dict[smqtk.representation.DescriptorElement, smqtk.representation.ClassificationElement]
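The shape of the mapping returned by classify_async can be pictured with a small thread-pool sketch. This is not SMQTK's implementation: classify_many and toy_classify are hypothetical names, plain tuples stand in for DescriptorElement instances, and procs is assumed to map directly onto the pool size.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_many(classify_one, descriptors, procs=None):
    """Apply ``classify_one`` to every descriptor using a thread pool.

    Returns a dict mapping each input descriptor to its result.
    ``procs=None`` lets the executor choose its default worker count.
    """
    descriptors = list(descriptors)
    with ThreadPoolExecutor(max_workers=procs) as pool:
        results = list(pool.map(classify_one, descriptors))
    return dict(zip(descriptors, results))


# Hypothetical stand-in classifier: label a vector by the sign of its sum.
def toy_classify(vec):
    if sum(vec) >= 0:
        return {"positive": 1.0, "negative": 0.0}
    return {"positive": 0.0, "negative": 1.0}


out = classify_many(toy_classify, [(1.0, 2.0), (-3.0, 1.0)], procs=2)
```

Switching to processes instead of threads, as use_multiprocessing suggests, would mean swapping in a process pool; that requires the classify function and its inputs to be picklable.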
-
get_labels()
Get the sequence of class labels that this classifier can classify descriptors into. This includes the negative label.
Returns: Sequence of possible classifier labels.
Return type: collections.Sequence[collections.Hashable]
Raises: RuntimeError – No model loaded.
-
-
smqtk.algorithms.classifier.get_classifier_impls(reload_modules=False, sub_interface=None)
Discover and return Classifier classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable CLASSIFIER_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name CLASSIFIER_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: - reload_modules (bool) – Explicitly reload discovered modules from source.
- sub_interface – Only return implementations that also descend from the given sub-interface. The given interface must also descend from Classifier.
Returns: Map of discovered class object of type Classifier whose keys are the string names of the classes.
Return type:
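The discovery rules described for get_classifier_impls (split the environment variable on the PATH separator, honour the helper variable, otherwise scan module attributes) can be sketched roughly as follows. This is an illustration, not the library's code: Base, split_module_list and classes_from_module are hypothetical names, and filtering helper-variable entries by subclass is an assumption of this sketch.

```python
import inspect
import os
import types


class Base:
    """Hypothetical stand-in for an interface such as Classifier."""


def split_module_list(value):
    """Split an environment-variable value on the platform PATH separator."""
    return [m for m in value.split(os.pathsep) if m]


def classes_from_module(module, base, helper_var):
    """Return {class name: class} exported by ``module`` for ``base``."""
    if hasattr(module, helper_var):
        exported = getattr(module, helper_var)
        if exported is None:
            return {}  # the module explicitly opted out
        if inspect.isclass(exported):
            exported = [exported]
        candidates = list(exported)
    else:
        # No helper variable: scan the module's attributes instead.
        candidates = [v for v in vars(module).values() if inspect.isclass(v)]
    return {
        c.__name__: c
        for c in candidates
        if issubclass(c, base) and c is not base
    }


# Build a throwaway module object to exercise the helper.
demo = types.ModuleType("demo_plugin")


class Impl(Base):
    pass


demo.Impl = Impl
demo.Base = Base
found = classes_from_module(demo, Base, "CLASSIFIER_CLASS")
```

The same pattern applies to the other interfaces on this page, with only the environment variable and helper variable names changing.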
DescriptorGenerator
This interface represents algorithms that generate whole-content descriptor vectors for a single given input DataElement instance.
The input DataElement must be of a content type that the DescriptorGenerator supports, referenced against the DescriptorGenerator.valid_content_types method.
The compute_descriptor method also requires a DescriptorElementFactory instance to tell the algorithm how to generate the DescriptorElement it should return.
The returned DescriptorElement instance will have a type equal to the name of the DescriptorGenerator class that generated it, and a UUID that is the same as the input DataElement instance.
If a DescriptorElement implementation that supports persistent storage is generated, and there is already a descriptor associated with the given type name and UUID values, the descriptor is returned without re-computation.
If the overwrite parameter is True, the DescriptorGenerator instance will re-compute a descriptor for the input DataElement, setting it to the generated DescriptorElement. This will overwrite descriptor data in persistent storage if the DescriptorElement type used supports it.
This interface supports a high-level, implementation agnostic asynchronous descriptor computation method.
This is given an iterable of DataElement
instances, a single DescriptorElementFactory
that is used to produce all descriptor
-
class smqtk.algorithms.descriptor_generator.DescriptorGenerator
Base abstract Feature Descriptor interface
-
compute_descriptor(data, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)
Given some data, return a descriptor element containing a descriptor vector.
Raises: - RuntimeError – Descriptor extraction failure of some kind.
- ValueError – Given data element content was not of a valid type with respect to this descriptor.
Parameters: - data (smqtk.representation.DataElement) – Some kind of input data for the feature descriptor.
- descr_factory (smqtk.representation.DescriptorElementFactory) – Factory instance to produce the wrapping descriptor element instance. The default factory produces DescriptorMemoryElement instances.
- overwrite (bool) – Whether or not to force re-computation of a descriptor vector for the given data even when there exists a precomputed vector in the generated DescriptorElement as generated from the provided factory. This will overwrite the persistently stored vector if the provided factory produces a DescriptorElement implementation with such storage.
Returns: Result descriptor element. UUID of this output descriptor is the same as the UUID of the input data element.
Return type: smqtk.representation.DescriptorElement
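The interaction between stored descriptors and the overwrite flag can be sketched with a dictionary-backed store. Everything here is a stand-in chosen for illustration: MemoryStore, toy_compute and this free-function compute_descriptor are hypothetical, and the (type name, UUID) key simply mirrors the association described above.

```python
class MemoryStore:
    """Hypothetical stand-in for persistent descriptor storage."""

    def __init__(self):
        self._vectors = {}

    def has(self, type_name, uuid):
        return (type_name, uuid) in self._vectors

    def get(self, type_name, uuid):
        return self._vectors[(type_name, uuid)]

    def set(self, type_name, uuid, vector):
        self._vectors[(type_name, uuid)] = vector


def compute_descriptor(store, type_name, uuid, data, compute, overwrite=False):
    """Return the stored vector unless ``overwrite`` forces re-computation."""
    if not overwrite and store.has(type_name, uuid):
        return store.get(type_name, uuid)
    vector = compute(data)
    store.set(type_name, uuid, vector)
    return vector


calls = []


def toy_compute(data):
    """Toy 'descriptor': a one-element vector holding the data length."""
    calls.append(data)
    return [float(len(data))]


store = MemoryStore()
first = compute_descriptor(store, "ToyGenerator", "abc", b"1234", toy_compute)
second = compute_descriptor(store, "ToyGenerator", "abc", b"1234", toy_compute)
third = compute_descriptor(store, "ToyGenerator", "abc", b"12", toy_compute,
                           overwrite=True)
```

The second call is served from the store without invoking toy_compute; the third recomputes and replaces the stored vector.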
-
compute_descriptor_async(data_iter, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False, procs=None, **kwds)
Asynchronously compute feature data for multiple data items.
- Base implementation additional keyword arguments:
- use_mp [= False]
- If multi-processing should be used vs. multi-threading.
Parameters: - data_iter (collections.Iterable[smqtk.representation.DataElement]) – Iterable of data elements to compute features for. These must have UIDs assigned for feature association in return value.
- descr_factory (smqtk.representation.DescriptorElementFactory) – Factory instance to produce the wrapping descriptor element instance. The default factory produces DescriptorMemoryElement instances.
- overwrite (bool) – Whether or not to force re-computation of descriptor vectors for the given data even when there exist precomputed vectors in the generated DescriptorElements as generated from the provided factory. This will overwrite the persistently stored vectors if the provided factory produces a DescriptorElement implementation with such storage.
- procs (int | None) – Optional specification of how many processors to use when pooling sub-tasks. If None, we attempt to use all available cores.
Raises: ValueError – An input DataElement was of a content type that we cannot handle.
Returns: Mapping of input DataElement UUIDs to the computed descriptor element for that data. DescriptorElement UUID’s are congruent with the UUID of the data element it is the descriptor of.
Return type: dict[collections.Hashable, smqtk.representation.DescriptorElement]
-
-
smqtk.algorithms.descriptor_generator.get_descriptor_generator_impls(reload_modules=False)
Discover and return DescriptorGenerator classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable DESCRIPTOR_GENERATOR_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name DESCRIPTOR_GENERATOR_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class object of type DescriptorGenerator whose keys are the string names of the classes.
Return type: dict[str, type]
HashIndex
This interface describes specialized NearestNeighborsIndex implementations designed to index hash codes (bit vectors) via the hamming distance function.
Implementations of this interface are primarily used with the LSHNearestNeighborIndex implementation.
Unlike the NearestNeighborsIndex interface from which this interface descends, HashIndex instances are built with an iterable of numpy.ndarray and nn returns a numpy.ndarray.
-
class smqtk.algorithms.nn_index.hash_index.HashIndex
Specialized NearestNeighborsIndex for indexing unique hash codes (bit-vectors) in memory (numpy arrays) using the hamming distance metric.
Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.
Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.
-
build_index(hashes)
Build the index with the given hash codes (bit-vectors).
Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception, so as to protect the current index.
Raises: ValueError – No data available in the given iterable.
Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of hash vectors to build index over.
-
nn(h, n=1)
Return the nearest N neighbor hash codes as bit-vectors to the given hash code bit-vector.
Distances are in the range [0, 1] and give the fraction of bits by which each neighbor hash differs from the query, based on the number of bits contained in the query (normalized hamming distance).
Raises: ValueError – Current index is empty.
Parameters: Returns: Tuple of nearest N hash codes and a tuple of the distance values to those neighbors.
Return type:
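The normalized hamming distance used by nn is the number of differing bit positions divided by the number of bits. A brute-force sketch of that computation and of the (neighbors, distances) return shape is shown below. The function names are hypothetical and plain tuples of booleans stand in for numpy boolean arrays; a real HashIndex implementation may use a very different search structure.

```python
def normalized_hamming(a, b):
    """Fraction of bit positions at which two equal-length vectors differ."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("bit vectors must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_hashes(index, h, n=1):
    """Brute-force search over ``index``, a collection of bit vectors."""
    if not index:
        raise ValueError("Current index is empty.")
    ranked = sorted(index, key=lambda c: normalized_hamming(c, h))[:n]
    return tuple(ranked), tuple(normalized_hamming(c, h) for c in ranked)


index = [
    (True, False, True, True),
    (False, False, False, False),
    (True, True, True, True),
]
neighbors, dists = nearest_hashes(index, (True, False, True, False), n=2)
```

Here the first indexed vector differs from the query in one of four bits, so it is returned first with distance 0.25.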
-
remove_from_index(hashes)
Partially remove hashes from this index.
Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to remove from this index.
Raises: - ValueError – No data available in the given iterable.
- KeyError – One or more UIDs provided do not match any stored descriptors.
-
update_index(hashes)
Additively update the current index with the one or more hash vectors given.
If no index exists yet, a new one should be created using the given hash vectors.
Raises: ValueError – No data available in the given iterable.
Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to add to this index.
-
-
smqtk.algorithms.nn_index.hash_index.get_hash_index_impls(reload_modules=False)
Discover and return HashIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable HASH_INDEX_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name HASH_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class object of type HashIndex whose keys are the string names of the classes.
Return type: dict[str, type]
LshFunctor
Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement.
These are used in LSHNearestNeighborIndex instances.
-
class smqtk.algorithms.nn_index.lsh.functors.LshFunctor
Locality-sensitive hashing functor interface.
The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.
Building Models
Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.
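To make the idea of similar items mapping to similar hashes concrete, here is a sketch of one common locality-sensitive scheme, random hyperplane (sign of projection) hashing. It is not taken from any SMQTK implementation: the class name, the get_hash method name and the seed parameter are all assumptions of this example, and this particular scheme needs no trained model.

```python
import random


class RandomHyperplaneHash:
    """Sign-of-projection hashing onto fixed random hyperplanes."""

    def __init__(self, dim, bits, seed=0):
        rng = random.Random(seed)
        # One random normal vector per output bit.
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)
        ]

    def get_hash(self, vector):
        """Return a tuple of booleans, one bit per hyperplane."""
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0.0
            for plane in self.planes
        )


functor = RandomHyperplaneHash(dim=3, bits=16, seed=42)
a = functor.get_hash([1.0, 0.9, 0.1])
b = functor.get_hash([2.0, 1.8, 0.2])  # same direction as the first vector
```

Because each bit depends only on which side of a hyperplane the vector falls, vectors pointing in the same direction receive identical codes, and vectors separated by a small angle differ in only a few bits on average.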
-
smqtk.algorithms.nn_index.lsh.functors.get_lsh_functor_impls(reload_modules=False)
Discover and return LshFunctor classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable LSH_FUNCTOR_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name LSH_FUNCTOR_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class object of type LshFunctor whose keys are the string names of the classes.
Return type: dict[str, type]
NearestNeighborsIndex
This interface defines a method to build an index from a set of DescriptorElement instances (NearestNeighborsIndex.build_index) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement (NearestNeighborsIndex.nn).
Building an index requires that some non-zero number of DescriptorElement instances be passed into the build_index method.
Subsequent calls to this method should rebuild the index model, not add to it.
If an implementation supports persistent storage of the index, it should overwrite the configured index.
The nn method uses a single DescriptorElement to query the current index for a specified number of nearest neighbors.
Thus, the NearestNeighborsIndex instance must have a non-empty index loaded for this method to function.
If the provided query DescriptorElement does not have a set vector, this method will also fail with an exception.
This interface additionally requires that implementations define a count method, which returns the number of distinct DescriptorElement instances in the index.
-
class smqtk.algorithms.nn_index.NearestNeighborsIndex
Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.
Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written when build_index is called.
Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.
-
build_index(descriptors)
Build the index with the given descriptor data elements.
Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception, so as to protect the current index.
Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
-
nn(d, n=1)
Return the nearest N neighbors to the given descriptor element.
Raises: - ValueError – Input query descriptor d has no vector set.
- ValueError – Current index is empty.
Parameters: - d (smqtk.representation.DescriptorElement) – Descriptor element to compute the neighbors of.
- n (int) – Number of nearest neighbors to find.
Returns: Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors.
Return type: (tuple[smqtk.representation.DescriptorElement], tuple[float])
-
remove_from_index(uids)
Partially remove descriptors from this index associated with the given UIDs.
Parameters: uids (collections.Iterable[collections.Hashable]) – Iterable of UIDs of descriptors to remove from this index.
Raises: - ValueError – No data available in the given iterable.
- KeyError – One or more UIDs provided do not match any stored descriptors. The index should not be modified.
-
update_index(descriptors)
Additively update the current index with the one or more descriptor elements given.
If no index exists yet, a new one should be created using the given descriptors.
Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to add to this index.
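The contract described by build_index, update_index, remove_from_index, nn and count can be sketched with a brute-force, in-memory index. This is a minimal illustration under several assumptions: BruteForceIndex is a hypothetical class, descriptors are represented as (uid, vector) pairs rather than DescriptorElement instances, nn returns UIDs instead of elements, Euclidean distance is an arbitrary choice, and thread safety is omitted.

```python
import math


class BruteForceIndex:
    """Hypothetical in-memory index keyed by descriptor UID."""

    def __init__(self):
        self._vectors = {}

    def count(self):
        return len(self._vectors)

    def build_index(self, descriptors):
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors = new  # replace, never extend, the previous index

    def update_index(self, descriptors):
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors.update(new)

    def remove_from_index(self, uids):
        uids = set(uids)
        if not uids:
            raise ValueError("No data available in the given iterable.")
        missing = [u for u in uids if u not in self._vectors]
        if missing:
            # Check before deleting so a failure leaves the index untouched.
            raise KeyError(missing)
        for u in uids:
            del self._vectors[u]

    def nn(self, d, n=1):
        if not self._vectors:
            raise ValueError("Current index is empty.")
        ranked = sorted(
            (math.dist(v, d), uid) for uid, v in self._vectors.items()
        )[:n]
        return tuple(u for _, u in ranked), tuple(dist for dist, _ in ranked)


index = BruteForceIndex()
index.build_index([("a", (0.0, 0.0)), ("b", (3.0, 4.0))])
index.update_index([("c", (1.0, 0.0))])
uids, dists = index.nn((0.0, 0.0), n=2)
```

Validating every UID before deleting anything is what keeps the index unmodified when remove_from_index raises KeyError.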
-
-
smqtk.algorithms.nn_index.get_nn_index_impls(reload_modules=False)
Discover and return NearestNeighborsIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable NN_INDEX_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name NN_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class object of type NearestNeighborsIndex whose keys are the string names of the classes.
Return type: dict[str, type]
RelevancyIndex
This interface defines two methods: build_index and rank.
The build_index method is, like a NearestNeighborsIndex, used to build an index of DescriptorElement instances.
The rank method takes relevant and not-relevant DescriptorElement examples, which the algorithm uses to rank (think sort) the indexed DescriptorElement instances by relevancy (on a [0, 1] scale).
-
class smqtk.algorithms.relevancy_index.RelevancyIndex
Abstract class for IQR index implementations.
Similar to a traditional nearest-neighbors algorithm, an IQR index provides a specialized nearest-neighbors interface that can take multiple examples of positively and negatively relevant exemplars in order to produce a [0, 1] ranking of the indexed elements by determined relevancy.
-
build_index(descriptors)
Build the index based on the given iterable of descriptor elements.
Subsequent calls to this method should rebuild the index, not add to it.
Raises: ValueError – No data available in the given iterable.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
-
rank(pos, neg)
Rank the currently indexed elements given pos positive and neg negative exemplar descriptor elements.
Parameters: - pos (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of positive exemplar DescriptorElement instances. This may be optional for some implementations.
- neg (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of negative exemplar DescriptorElement instances. This may be optional for some implementations.
Returns: Map of indexed descriptor elements to a rank value in the [0, 1] range (inclusive), where 1.0 means most relevant and 0.0 means least relevant.
Return type:
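One simple way to turn positive and negative exemplars into a [0, 1] relevancy score is to compare each indexed vector's distance to its closest positive and closest negative exemplar. The sketch below uses that heuristic purely for illustration. It is not how any particular RelevancyIndex implementation works: the free function rank, the uid-to-vector dict, and the requirement that both exemplar sets be non-empty are assumptions of this example.

```python
import math


def rank(indexed, pos, neg):
    """Score each indexed vector in [0, 1]; 1.0 is most relevant.

    The score is d_neg / (d_pos + d_neg), where d_pos and d_neg are the
    distances to the closest positive and closest negative exemplar.
    """
    pos, neg = list(pos), list(neg)
    if not pos or not neg:
        raise ValueError("this sketch needs at least one exemplar of each kind")
    scores = {}
    for uid, vec in indexed.items():
        d_pos = min(math.dist(vec, p) for p in pos)
        d_neg = min(math.dist(vec, q) for q in neg)
        total = d_pos + d_neg
        # A vector coinciding with both exemplar kinds is treated as undecided.
        scores[uid] = 0.5 if total == 0 else d_neg / total
    return scores


indexed = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (1.0, 0.0)}
scores = rank(indexed, pos=[(0.0, 0.0)], neg=[(4.0, 0.0)])
ordered = sorted(scores, key=scores.get, reverse=True)
```

Sorting the returned map by score, as in the last line, gives the "think sort" view of the ranking mentioned in the interface description.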
-
-
smqtk.algorithms.relevancy_index.get_relevancy_index_impls(reload_modules=False)
Discover and return RelevancyIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable RELEVANCY_INDEX_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name RELEVANCY_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class object of type RelevancyIndex whose keys are the string names of the classes.
Return type: dict[str, type]