Algorithm Interfaces¶
-
class
smqtk.algorithms.
SmqtkAlgorithm
[source]¶ Parent class for all algorithm interfaces.
-
property
name¶
- Returns
The name of this class type.
- Return type
str
Here we list and briefly describe the high level algorithm interfaces which SMQTK provides. There is at least one implementation available for each interface. Some implementations will require additional dependencies that cannot be packaged with SMQTK.
Classifier¶
This interface represents algorithms that classify DescriptorElement
instances into discrete labels or label confidences.
-
class
smqtk.algorithms.classifier.
Classifier
[source]¶ Interface for algorithms that classify input descriptors into discrete labels and/or label confidences.
-
static
_assert_array_dim_consistency
(array_iter)[source]¶ Assert that arrays are consistent in dimensionality across iterated arrays.
Currently we only support iterating single dimension vectors. Arrays of more than one dimension (i.e. 2D matrices, etc.) will trigger a ValueError.
Includes a short-cut where if the input is a non-object 2D ndarray, dimensionality must already be consistent, so the ndarray (which is an Iterable) is just returned. Otherwise, we return a generator that checks the dimensionality of the input iterable during iteration.
- Parameters
array_iter (collections.abc.Iterable[numpy.ndarray]) – Iterable of numpy arrays.
- Raises
AttributeError – Individual arrays are not numpy.ndarray-like.
ValueError – Not all input arrays were of consistent dimensionality.
- Returns
Iterable of the same arrays in the same order, but validated to be of common dimensionality.
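The lazy validation described above can be sketched without numpy. This is only an illustration, not the SMQTK implementation: the function name is hypothetical, plain Python sequences stand in for one-dimensional numpy arrays, and nested sequences stand in for arrays of more than one dimension.

```python
from typing import Iterable, Iterator, Sequence


def assert_vector_dim_consistency(
    vectors: Iterable[Sequence[float]],
) -> Iterator[Sequence[float]]:
    """Yield each vector unchanged, raising ValueError on a mismatch.

    Hypothetical stand-in: plain sequences of numbers play the role of
    one-dimensional numpy arrays.
    """
    expected_len = None
    for vec in vectors:
        # Nested sequences stand in for arrays of more than one dimension.
        if any(isinstance(x, (list, tuple)) for x in vec):
            raise ValueError("Only one-dimensional vectors are supported.")
        if expected_len is None:
            expected_len = len(vec)
        elif len(vec) != expected_len:
            raise ValueError(
                f"Inconsistent dimensionality: expected {expected_len}, "
                f"got {len(vec)}."
            )
        yield vec


# Validation happens during iteration, so the first two vectors pass
# before the third one is ever inspected.
checked = assert_vector_dim_consistency([[0.0, 1.0], [2.0, 3.0], [4.0]])
first_two = [next(checked), next(checked)]
```

Because the check is a generator, an inconsistent vector late in the input only raises once iteration reaches it.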
-
abstract
_classify_arrays
(array_iter)[source]¶ Overridable method for classifying an iterable of descriptor vectors (numpy arrays).
At this level, all input arrays are guaranteed to be of consistent dimensionality.
Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.
- Parameters
array_iter (collections.abc.Iterable[numpy.ndarray]) – Iterable of arrays to be classified.
- Returns
Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
- Return type
collections.abc.Iterable[dict[collections.abc.Hashable, float]]
-
classify_arrays
(array_iter)[source]¶ Classify an input iterable of numpy arrays into a parallel iterable of label-to-confidence mappings (dictionaries).
Each classification mapping should contain confidence values for each label the configured model contains. Implementations may act in a discrete manner whereby only one label is marked with a 1 value (others being 0), or in a continuous manner whereby each label is given a confidence-like value in the [0, 1] range.
- Parameters
array_iter (collections.abc.Iterable[numpy.ndarray]) – Iterable of descriptor vectors, as numpy arrays, to be classified.
- Raises
ValueError – Input arrays were not all of consistent dimensionality.
- Returns
Iterable of dictionaries, parallel in association to the input descriptor vectors. Each dictionary should map labels to associated confidence values.
- Return type
collections.abc.Iterable[dict[collections.abc.Hashable, float]]
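To make the discrete and continuous output styles concrete, here is a minimal sketch. The label set, the function names, and the toy decision rule (sign of the vector sum, squashed by a logistic function) are all assumptions for illustration and are not part of SMQTK.

```python
import math
from typing import Dict, Hashable, Iterable, Iterator, Sequence

LABELS = ("negative", "positive")  # hypothetical label set


def classify_discrete(
    vectors: Iterable[Sequence[float]],
) -> Iterator[Dict[Hashable, float]]:
    """Mark exactly one label with 1.0 and every other label with 0.0."""
    for vec in vectors:
        winner = "positive" if sum(vec) > 0 else "negative"
        yield {label: (1.0 if label == winner else 0.0) for label in LABELS}


def classify_continuous(
    vectors: Iterable[Sequence[float]],
) -> Iterator[Dict[Hashable, float]]:
    """Give every label a confidence-like value in the [0, 1] range."""
    for vec in vectors:
        # Logistic squashing of the vector sum (toy rule, an assumption).
        p_positive = 1.0 / (1.0 + math.exp(-sum(vec)))
        yield {"negative": 1.0 - p_positive, "positive": p_positive}


discrete = list(classify_discrete([[1.0, 2.0], [-3.0, 1.0]]))
continuous = list(classify_continuous([[0.0, 0.0]]))
```

Both functions yield one mapping per input vector, in input order, and each mapping has an entry for every label.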
-
classify_elements
(descr_iter, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False, d_elem_batch=100)[source]¶ Classify an input iterable of descriptor elements into a parallel iterable of classification elements.
Classification element UIDs are inherited from the descriptor element it was generated from.
We invoke classify_arrays for actual generation of classification results. See the documentation for that method for further details.
Selective Iteration. For situations when it is desired to access specific generator returns, like when only one descriptor element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(c.classify_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(c.classify_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.
Non-redundant Processing. Certain classification element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some classification elements may already “have” classification results on construction. This method, by default, only computes new classification results for descriptor elements whose associated classification element does not report as already containing results. If the overwrite flag is True then classifications are computed for all input descriptor elements and results are set to their respective classification elements regardless of existing result storage.
- Parameters
descr_iter (collections.abc.Iterable[DescriptorElement]) – Iterable of DescriptorElement instances to be classified.
factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
d_elem_batch (int) – The number of descriptor elements to collect before requesting the whole batch’s vectors at once via
DescriptorElement.get_many_vectors
method.
- Raises
ValueError – Either: (A) one or more input descriptor elements did not have a stored vector, or (B) input descriptor element arrays were not all of consistent dimensionality.
IndexError – Implementation of
_classify_arrays
either under or over produced classifications relative to the number of input descriptor vectors.
- Returns
Iterator of result ClassificationElement instances. UUIDs of generated ClassificationElement instances will reflect the UUID of the DescriptorElement it was computed from.
- Return type
collections.abc.Iterator[smqtk.representation.ClassificationElement]
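The Selective Iteration advice follows from how Python generators behave in general. The sketch below uses a hypothetical generator with a clean-up step after its final yield (not SMQTK code) to show that taking only the "next" value skips that step, while expanding into a list runs it.

```python
from typing import Iterable, Iterator, List

cleanup_log: List[str] = []


def generate_with_cleanup(items: Iterable[str], tag: str) -> Iterator[str]:
    """Yield processed items, then run clean-up after the final yield."""
    for item in items:
        yield item.upper()
    # This line only runs when the generator is driven to exhaustion.
    cleanup_log.append(tag)


# Taking just the "next" value leaves the generator suspended at its
# yield, so the clean-up step has not run.
pending = generate_with_cleanup(["e"], "next")
via_next = next(pending)
ran_after_next = "next" in cleanup_log

# Expanding into a list exhausts the generator, so the clean-up runs.
via_list = list(generate_with_cleanup(["e"], "list"))[0]
ran_after_list = "list" in cleanup_log
```

Both approaches return the same value; only the list expansion guarantees that trailing clean-up code has executed.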
-
classify_one_element
(descr_elem, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False)[source]¶ Convenience method around
classify_elements
for the single-input case.
See documentation for the Classifier.classify_elements() method for more information.
- Parameters
descr_elem (DescriptorElement) – DescriptorElement instance to be classified.
factory (smqtk.representation.ClassificationElementFactory) – Classification element factory. The default factory yields MemoryClassificationElement instances.
overwrite (bool) – Recompute classification of the input descriptor and set the results to the ClassificationElement produced by the factory.
- Raises
ValueError – The input descriptor element did not have a stored vector.
IndexError – Implementation of
_classify_arrays
either under or over produced classifications relative to the number of input descriptor vectors.
- Returns
ClassificationElement instance. The UUID of the generated ClassificationElement instance will reflect the UUID of the DescriptorElement it was computed from.
- Return type
smqtk.representation.ClassificationElement
-
abstract
get_labels
()[source]¶ Get the sequence of class labels that this classifier can classify descriptors into. This includes the negative or background label if the classifier embodies such a concept.
- Returns
Sequence of possible classifier labels.
- Return type
collections.abc.Sequence[collections.abc.Hashable]
- Raises
RuntimeError – No model loaded.
DescriptorGenerator¶
This interface represents algorithms that generate whole-content descriptor
vectors for one or more given input DataElement
instances.
The input DataElement
instances must be of a
content type that the DescriptorGenerator
supports, referenced
against the valid_content_types()
method (required by the ContentTypeValidator
mixin
class).
The DescriptorGenerator.generate_elements()
method also requires a
DescriptorElementFactory
instance to tell the algorithm how to
generate the DescriptorElement
instances it should return.
The returned DescriptorElement
instances will have a type equal to
the name of the DescriptorGenerator
class that generated it, and a
UUID that is the same as the input DataElement
instance.
If a DescriptorElement
implementation that supports persistent
storage is generated, and there is already a descriptor associated with the
given type name and UUID values, the descriptor is returned without
re-computation.
If the overwrite
parameter is True
, the DescriptorGenerator
instance will re-compute a descriptor for the input DataElement
,
setting it to the generated DescriptorElement
.
This will overwrite descriptor data in persistent storage if the
DescriptorElement
type used supports it.
-
class
smqtk.algorithms.descriptor_generator.
DescriptorGenerator
[source]¶ Base abstract Feature Descriptor interface.
-
generate_arrays
(data_iter)[source]¶ Generate descriptor vector elements for all input data elements.
Descriptor arrays yielded out will be parallel in association with the data elements input.
Selective Iteration. For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single array out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_arrays([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_arrays([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators.
- Parameters
data_iter (collections.abc.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.
- Raises
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- Returns
Iterator of result numpy.ndarray instances.
- Return type
collections.abc.Iterator[numpy.ndarray]
-
generate_elements
(data_iter, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)[source]¶ Generate DescriptorElement instances for the input data elements, generating new descriptors for those elements that need them, or optionally all input data elements.
Descriptor elements yielded out will be parallel in association with the data elements input. Descriptor element UUIDs are inherited from the data element it was generated from.
Selective Iteration. For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.
Non-redundant Processing. Certain descriptor element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some descriptor elements may already “have” a vector on construction. This method, by default, only computes new descriptor vectors for data elements whose associated descriptor element does not report as already containing a vector. If the overwrite flag is True then descriptors are computed for all input data elements and are set to their respective descriptor elements regardless of existing vector storage.
- Parameters
data_iter (collections.abc.Iterable[smqtk.representation.DataElement]) – Iterable of DataElement instances to be described.
descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate descriptors for all input data elements, overwriting the vectors previously stored in the factory-produced descriptor elements.
- Raises
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
IndexError – Underlying vector-producing generator either under or over produced vectors.
- Returns
Iterator of result DescriptorElement instances. UUIDs of generated DescriptorElement instances will reflect the UUID of the DataElement it was generated from.
- Return type
collections.abc.Iterator[smqtk.representation.DescriptorElement]
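The non-redundant processing rule and the overwrite flag can be sketched with a dictionary standing in for persistent descriptor storage. Everything here is hypothetical scaffolding for illustration: the function names, the (uid, content) tuples standing in for DataElement instances, and the toy descriptor are not SMQTK API.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Vector = List[float]


def generate_vectors(
    data: Iterable[Tuple[Hashable, str]],
    storage: Dict[Hashable, Vector],
    compute: Callable[[str], Vector],
    overwrite: bool = False,
) -> Iterator[Vector]:
    """Yield one vector per (uid, content) pair, parallel to the input."""
    for uid, content in data:
        # Only compute when nothing is stored yet, unless overwriting.
        if overwrite or uid not in storage:
            storage[uid] = compute(content)
        yield storage[uid]


calls: List[str] = []


def toy_descriptor(content: str) -> Vector:
    """Toy descriptor (an assumption): the content length as a 1-vector."""
    calls.append(content)
    return [float(len(content))]


store: Dict[Hashable, Vector] = {"a": [99.0]}  # "a" already has a vector
data = [("a", "xy"), ("b", "xyz")]

default_result = list(generate_vectors(data, store, toy_descriptor))
calls_default = list(calls)
overwrite_result = list(
    generate_vectors(data, store, toy_descriptor, overwrite=True)
)
```

With the default flag only "b" is computed and the stored vector for "a" is returned as-is; with overwrite=True both are recomputed and the stored value for "a" is replaced.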
-
generate_one_array
(data_elem)[source]¶ Convenience wrapper around
generate_arrays
for the single-input case.
See the documentation for the DescriptorGenerator.generate_arrays() method for more information.
- Parameters
data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
- Raises
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- Returns
Descriptor vector of the given data as a numpy.ndarray instance.
- Return type
numpy.ndarray
-
generate_one_element
(data_elem, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)[source]¶ Convenience wrapper around
generate_elements
for the single-input case.
See documentation for the DescriptorGenerator.generate_elements() method for more information.
- Parameters
data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
descr_factory (smqtk.representation.DescriptorElementFactory) – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
overwrite (bool) – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate a descriptor for the input data element regardless, overwriting the vector previously stored in the factory-produced descriptor element.
- Raises
IndexError – Underlying vector-producing generator either under or over produced vectors.
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- Returns
Result DescriptorElement instance. UUID of the generated DescriptorElement instance will reflect the UUID of the DataElement it was generated from.
- Return type
smqtk.representation.DescriptorElement
-
ImageReader¶
-
class
smqtk.algorithms.image_io.
ImageReader
[source]¶ Interface for algorithms that load a raster image matrix from a data element.
-
is_valid_element
(data_element)[source]¶ Check if the given DataElement instance reports a content type that matches one of the MIME types reported by valid_content_types.
This override checks if the DataElement has the matrix property as the MatrixDataElement would provide, and that its value is of an expected type.
- Parameters
data_element (smqtk.representation.DataElement) – Data element instance to check.
- Returns
True if the given element has a valid content type as reported by valid_content_types, and False if not.
- Return type
bool
-
load_as_matrix
(data_element, pixel_crop=None)[source]¶ Load an image matrix from the given data element.
Matrix Property Shortcut. If the given DataElement instance defines a matrix property this method simply returns that. This is intended to interface with instances of smqtk.representation.data_element.matrix.MatrixDataElement.
Loading From Bytes. When not loading from a short-cut matrix, matrix return format is ImageReader implementation dependent. Implementations of this interface should specify and describe their return type.
Aside from the exceptions documented below, other implementation-dependent exceptions may be raised when an image fails to load.
- Parameters
data_element (smqtk.representation.DataElement) – DataElement to load image data from.
pixel_crop (None|smqtk.representation.AxisAlignedBoundingBox) – Optional bounding box specifying a pixel sub-region to load from the given data. If this is provided it must represent a valid sub-region within the loaded image, otherwise a RuntimeError is raised. Handling of non-integer aligned boxes is implementation dependent.
- Raises
RuntimeError – A crop region was specified but did not specify a valid sub-region of the image.
AssertionError – The data_element provided defined a matrix attribute/property, but its access did not result in an expected value.
ValueError – This error is raised when: the given data_element was not of a valid content type; a pixel_crop bounding box was provided but was zero volume; or the pixel_crop bounding box vertices are not fully represented by integers.
- Returns
Numpy ndarray of the image data. Specific return format is implementation dependent.
- Return type
numpy.ndarray
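The pixel_crop error conditions listed above can be sketched as a small validation routine. This is an assumption-laden illustration, not the SMQTK implementation: a list of rows stands in for the image ndarray, and the crop box is a hypothetical (x_min, y_min, x_max, y_max) tuple rather than an AxisAlignedBoundingBox.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of pixel values, stand-in for an ndarray
Box = Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1)


def crop_image(image: Image, pixel_crop: Optional[Box] = None) -> Image:
    """Return the image, or the requested sub-region of it."""
    if pixel_crop is None:
        return image
    # Vertices must be fully represented by integers.
    if any(float(v) != int(v) for v in pixel_crop):
        raise ValueError("Crop box vertices must be integer-valued.")
    x0, y0, x1, y1 = (int(v) for v in pixel_crop)
    # A box with no extent along either axis has zero volume.
    if (x1 - x0) * (y1 - y0) == 0:
        raise ValueError("Crop box has zero volume.")
    height, width = len(image), len(image[0])
    # The box must lie entirely inside the loaded image.
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise RuntimeError("Crop box is not a valid sub-region of the image.")
    return [row[x0:x1] for row in image[y0:y1]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
cropped = crop_image(image, (1, 0, 3, 2))
```

The ValueError checks run before the bounds check, so a malformed box is reported as such even when it would also fall outside the image.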
-
-
class
smqtk.algorithms.image_io.pil_io.
PilImageReader
(explicit_mode=None)[source]¶ Image reader that uses PIL to load the image.
This implementation may additionally raise an IOError when failing to load an image.
-
get_config
()[source]¶ Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.
In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it does not make sense to store certain constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.
- Returns
JSON type compliant configuration dictionary.
- Return type
dict
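The get_config and from_config round trip can be sketched with a stand-in class. ToyReader below is hypothetical and only mimics the pattern described here; its explicit_mode argument mirrors the PilImageReader constructor signature shown above, but the real from_config signature may differ.

```python
import json
from typing import Any, Dict, Optional


class ToyReader:
    """Hypothetical configurable class, not part of SMQTK."""

    def __init__(self, explicit_mode: Optional[str] = None):
        self.explicit_mode = explicit_mode

    def get_config(self) -> Dict[str, Any]:
        # Keys mirror the constructor argument names.
        return {"explicit_mode": self.explicit_mode}

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyReader":
        # Dictionary expansion into the constructor.
        return cls(**config)


original = ToyReader(explicit_mode="RGB")
# The configuration must survive a trip through JSON.
serialized = json.dumps(original.get_config())
clone = ToyReader.from_config(json.loads(serialized))
```

Keeping the dictionary keys identical to the constructor argument names is what makes the dictionary-expansion step work without any mapping code.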
-
classmethod
is_usable
()[source]¶ Check whether this class is available for use.
Since certain plugin implementations may require additional dependencies that may not yet be available on the system, or other runtime conditions, this method may be overridden to check for those and return a boolean saying if the implementation is available for use. When this method returns True, the class is declaring that it should be constructable and usable in the current environment.
By default, this method will return True unless a sub-class overrides this class-method with their specific logic.
- NOTES:
This should be a class method.
When an implementation is deemed not usable, this should emit a (user) warning, or some other kind of logging, detailing why the implementation is not available for use.
- Returns
Boolean determination of whether this implementation is usable in the current environment.
- Return type
bool
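One way an is_usable override might check for an optional dependency is sketched below. The class names and the REQUIRED_MODULE attribute are hypothetical; only the overall pattern (a class method that returns a boolean and warns when unusable) comes from the description above.

```python
import importlib.util
import warnings


class OptionalDependencyPlugin:
    """Hypothetical plugin whose usability depends on one module."""

    REQUIRED_MODULE = "json"  # stdlib module, so this plugin is usable

    @classmethod
    def is_usable(cls) -> bool:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(cls.REQUIRED_MODULE) is None:
            warnings.warn(
                f"{cls.__name__} is not usable: module "
                f"'{cls.REQUIRED_MODULE}' could not be found.",
                UserWarning,
            )
            return False
        return True


class MissingDependencyPlugin(OptionalDependencyPlugin):
    REQUIRED_MODULE = "surely_not_an_installed_module_xyz"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usable_flags = (
        OptionalDependencyPlugin.is_usable(),
        MissingDependencyPlugin.is_usable(),
    )
```

Checking with find_spec avoids importing the dependency just to learn whether it exists.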
-
HashIndex¶
This interface describes specialized NearestNeighborsIndex
implementations designed to index hash codes (bit vectors) via the hamming distance function.
Implementations of this interface are primarily used with the LSHNearestNeighborIndex
implementation.
Unlike the NearestNeighborsIndex interface from which this interface descends, HashIndex instances are built with an iterable of numpy.ndarray and nn returns a numpy.ndarray.
-
class
smqtk.algorithms.nn_index.hash_index.
HashIndex
[source]¶ Specialized NearestNeighborsIndex for indexing unique hash codes (bit-vectors) in memory (numpy arrays) using the hamming distance metric.
Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.
Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.
-
build_index
(hashes)[source]¶ Build the index with the given hash codes (bit-vectors).
Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception so as to protect the current index.
- Raises
ValueError – No data available in the given iterable.
- Parameters
hashes (collections.abc.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to build the index over.
-
nn
(h, n=1)[source]¶ Return the nearest N neighbor hash codes as bit-vectors to the given hash code bit-vector.
Distances are in the range [0,1] and are the fraction of bits by which each neighbor hash differs from the query, based on the number of bits contained in the query (normalized hamming distance).
- Raises
ValueError – Current index is empty.
- Parameters
h (numpy.ndarray[bool]) – Hash code to compute the neighbors of. Should be the same bit length as indexed hash codes.
n (int) – Number of nearest neighbors to find.
- Returns
Tuple of nearest N hash codes and a tuple of the distance values to those neighbors.
- Return type
(tuple[numpy.ndarray[bool]], tuple[float])
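The normalized hamming distance and the shape of the nn return value can be sketched with a brute-force search. This is an illustration under stated assumptions, not an SMQTK implementation: tuples of bool stand in for boolean numpy arrays, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

BitVector = Tuple[bool, ...]  # stand-in for numpy.ndarray[bool]


def normalized_hamming(a: BitVector, b: BitVector) -> float:
    """Fraction of bit positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("Hash codes must have the same bit length.")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_hashes(
    index: Sequence[BitVector], h: BitVector, n: int = 1
) -> Tuple[Tuple[BitVector, ...], Tuple[float, ...]]:
    """Brute-force search for the n indexed codes closest to h."""
    if not index:
        raise ValueError("Current index is empty.")
    ranked = sorted(index, key=lambda code: normalized_hamming(h, code))[:n]
    return tuple(ranked), tuple(normalized_hamming(h, c) for c in ranked)


index = [
    (True, True, False, False),
    (True, False, False, False),
    (False, False, True, True),
]
query = (True, True, False, False)
neighbors, distances = nearest_hashes(index, query, n=2)
```

A distance of 0.0 means an identical code and 1.0 means every bit differs.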
-
remove_from_index
(hashes)[source]¶ Partially remove hashes from this index.
- Parameters
hashes (collections.abc.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to remove from this index.
- Raises
ValueError – No data available in the given iterable.
KeyError – One or more hashes provided do not match any stored hashes.
-
update_index
(hashes)[source]¶ Additively update the current index with the one or more hash vectors given.
If no index exists yet, a new one should be created using the given hash vectors.
- Raises
ValueError – No data available in the given iterable.
- Parameters
hashes (collections.abc.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to add to this index.
-
LshFunctor¶
Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement
.
These are used in LSHNearestNeighborIndex
instances.
-
class
smqtk.algorithms.nn_index.lsh.functors.
LshFunctor
[source]¶ Locality-sensitive hashing functor interface.
The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.
Building Models
Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.
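As one concrete example of a locality-sensitive hash, the sketch below uses random-hyperplane (sign) hashing, a common LSH scheme chosen here purely for illustration; it is not claimed to match any particular SMQTK functor. The function names and the seed parameter are hypothetical, and plain lists stand in for descriptor vectors.

```python
import random
from typing import List, Sequence, Tuple


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw one random Gaussian hyperplane normal per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_vector(
    vec: Sequence[float], hyperplanes: Sequence[Sequence[float]]
) -> Tuple[bool, ...]:
    """Each bit records which side of a hyperplane the vector falls on."""
    return tuple(
        sum(w * x for w, x in zip(plane, vec)) >= 0.0 for plane in hyperplanes
    )


planes = make_hyperplanes(dim=3, n_bits=8)
code_a = hash_vector([1.0, 2.0, 3.0], planes)
# A vector pointing in the same direction lands on the same side of
# every hyperplane, so it receives the same hash code.
code_b = hash_vector([2.0, 4.0, 6.0], planes)
```

Vectors separated by a small angle disagree on few hyperplanes, which is what gives similar items similar codes under this scheme.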
NearestNeighborsIndex¶
This interface defines a method to build an index from a set of DescriptorElement
instances (NearestNeighborsIndex.build_index
) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement
(NearestNeighborsIndex.nn
).
Building an index requires that some non-zero number of DescriptorElement
instances be passed into the build_index
method.
Subsequent calls to this method should rebuild the index model, not add to it.
If an implementation supports persistent storage of the index, it should overwrite the configured index.
The nn
method uses a single DescriptorElement
to query the current index for a specified number of nearest neighbors.
Thus, the NearestNeighborsIndex
instance must have a non-empty index loaded for this method to function.
If the provided query DescriptorElement
does not have a set vector, this method will also fail with an exception.
This interface additionally requires that implementations define a count method, which returns the number of distinct DescriptorElement instances that are in the index.
-
class
smqtk.algorithms.nn_index.
NearestNeighborsIndex
[source]¶ Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.
Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written when build_index is called.
Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.
-
build_index
(descriptors)[source]¶ Build the index with the given descriptor data elements.
Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception so as to protect the current index.
- Raises
ValueError – No data available in the given iterable.
- Parameters
descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
-
nn
(d, n=1)[source]¶ Return the nearest N neighbors to the given descriptor element.
- Raises
ValueError – Input query descriptor d has no vector set.
ValueError – Current index is empty.
- Parameters
d (smqtk.representation.DescriptorElement) – Descriptor element to compute the neighbors of.
n (int) – Number of nearest neighbors to find.
- Returns
Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors.
- Return type
(tuple[smqtk.representation.DescriptorElement], tuple[float])
-
remove_from_index
(uids)[source]¶ Partially remove descriptors from this index associated with the given UIDs.
- Parameters
uids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of UIDs of descriptors to remove from this index.
- Raises
ValueError – No data available in the given iterable.
KeyError – One or more UIDs provided do not match any stored descriptors. The index should not be modified.
-
update_index
(descriptors)[source]¶ Additively update the current index with the one or more descriptor elements given.
If no index exists yet, a new one should be created using the given descriptors.
- Raises
ValueError – No data available in the given iterable.
- Parameters
descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to add to this index.
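The build, update, remove, and query semantics of this interface can be sketched with a brute-force, in-memory stand-in. BruteForceIndex is hypothetical and does not subclass anything in SMQTK: (uid, vector) pairs stand in for DescriptorElement instances, Euclidean distance is an assumed metric, and the locking a thread-safe implementation would need is omitted for brevity.

```python
import math
from typing import Dict, Hashable, Iterable, Sequence, Tuple

Descriptor = Tuple[Hashable, Sequence[float]]  # (uid, vector) stand-in


class BruteForceIndex:
    """Hypothetical in-memory index; not thread safe."""

    def __init__(self) -> None:
        self._vectors: Dict[Hashable, Sequence[float]] = {}

    def count(self) -> int:
        return len(self._vectors)

    def build_index(self, descriptors: Iterable[Descriptor]) -> None:
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors = new  # rebuild: replaces any previous content

    def update_index(self, descriptors: Iterable[Descriptor]) -> None:
        new = dict(descriptors)
        if not new:
            raise ValueError("No data available in the given iterable.")
        self._vectors.update(new)  # additive

    def remove_from_index(self, uids: Iterable[Hashable]) -> None:
        targets = set(uids)
        if not targets:
            raise ValueError("No data available in the given iterable.")
        missing = targets - set(self._vectors)
        if missing:
            raise KeyError(missing)  # checked first: index left unmodified
        for uid in targets:
            del self._vectors[uid]

    def nn(self, d: Sequence[float], n: int = 1):
        if not self._vectors:
            raise ValueError("Current index is empty.")
        ranked = sorted(
            ((math.dist(d, vec), uid) for uid, vec in self._vectors.items()),
            key=lambda pair: pair[0],
        )[:n]
        return tuple(u for _, u in ranked), tuple(dist for dist, _ in ranked)


index = BruteForceIndex()
index.build_index([("a", [0.0, 0.0]), ("b", [3.0, 4.0])])
index.update_index([("c", [1.0, 0.0])])
neighbors, distances = index.nn([0.0, 0.0], n=2)
```

Validating the input before touching the stored data is what lets build_index and remove_from_index fail without disturbing the current index.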
-
ObjectDetector¶
This interface defines a method to generate object detections
(DetectionElement
) over a given
DataElement
.
-
class
smqtk.algorithms.object_detection.
ObjectDetector
[source]¶ Abstract interface to an object detection algorithm.
An object detection algorithm is one that can take in data and output zero or more detection elements, where each detection represents a spatial region in the data.
This high level interface only requires detection element returns (spatial bounding-boxes with associated classification elements).
-
detect_objects
(data_element, de_factory=<smqtk.representation.detection_element_factory.DetectionElementFactory object>, ce_factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>)[source]¶ Detect objects in the given data.
UUIDs of detections are based on the hash produced from the combination of:
Detection bounding-box bounding coordinates
Classification label set predicted for a bounding box.
- Parameters
data_element (smqtk.representation.DataElement) – Source data from which to detect objects within.
de_factory (smqtk.representation.DetectionElementFactory) – Factory for generating DetectionElement instances. The default factory yields MemoryDetectionElement instances.
ce_factory (smqtk.representation.ClassificationElementFactory) – Factory for generating ClassificationElement instances for detections. The default factory yields MemoryClassificationElement instances.
- Raises
ValueError – Given data element content was not of a valid content type that this class reports as valid for object detection.
- Returns
Iterator over result DetectionElement instances as generated by the given DetectionElementFactory, containing classification elements as generated by the given ClassificationElementFactory.
- Return type
collections.abc.Iterable[smqtk.representation.DetectionElement]
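A detection UID derived from the bounding-box coordinates and the predicted label set might be computed along the following lines. The hash function (SHA-1), the payload encoding, and the function name are all assumptions; the text above only states that the UID is based on a hash of that combination.

```python
import hashlib
from typing import Iterable, Sequence


def detection_uid(bbox_coords: Sequence[float], labels: Iterable[str]) -> str:
    """Hash bounding-box coordinates together with a label set."""
    # Sort the labels so the same label set always hashes identically.
    payload = repr(
        (tuple(float(c) for c in bbox_coords), tuple(sorted(labels)))
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


uid_1 = detection_uid([0, 0, 10, 20], ["cat", "dog"])
uid_2 = detection_uid([0, 0, 10, 20], ["dog", "cat"])  # same set, reordered
uid_3 = detection_uid([0, 0, 10, 21], ["cat", "dog"])  # different box
```

Under this scheme two detections share a UID only when both the box and the label set match.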
-
RankRelevancy¶
This interface defines one method: rank
. The rank
method
takes examples of relevant and not-relevant example descriptor vectors
as numpy.ndarray
sequences and uses them to compute relevancy
scores (on a [0, 1]
scale) on a provided pool of other descriptor
vectors.
-
class
smqtk.algorithms.rank_relevancy.
RankRelevancy
[source]¶ Algorithm that can rank a given pool of descriptors based on positively and negatively adjudicated descriptors.
-
abstract
rank
(pos: Sequence[numpy.ndarray], neg: Sequence[numpy.ndarray], pool: Sequence[numpy.ndarray]) → Sequence[float][source]¶ Assign a relevancy score to each input descriptor in pool based on the positively and negatively adjudicated descriptors in pos and neg respectively.
- Parameters
pos – Sequence of positively adjudicated descriptor vectors.
neg – Sequence of negatively adjudicated descriptor vectors.
pool – A sequence of descriptor vectors that we want to rank by topical relevancy relative to the given positive and negative examples.
- Returns
An ordered sequence of float values denoting the relevancy of pool elements
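To show the shape of a rank implementation, here is a toy scorer based on distances to the centroids of the positive and negative examples. The scoring rule is an assumption picked for simplicity, not an algorithm SMQTK is stated to use, and plain lists stand in for numpy arrays. It requires at least one positive and one negative example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty sequence of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank(
    pos: Sequence[Vector], neg: Sequence[Vector], pool: Sequence[Vector]
) -> List[float]:
    """Score pool vectors in [0, 1]; closer to pos means a higher score."""
    pos_c, neg_c = centroid(pos), centroid(neg)
    scores = []
    for vec in pool:
        d_pos, d_neg = math.dist(vec, pos_c), math.dist(vec, neg_c)
        total = d_pos + d_neg
        # When both centroids coincide with vec there is no evidence.
        scores.append(0.5 if total == 0 else d_neg / total)
    return scores


scores = rank(
    pos=[[1.0, 0.0]],
    neg=[[-1.0, 0.0]],
    pool=[[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]],
)
```

The returned list is parallel to pool, so scores[i] is the relevancy of pool[i].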
RankRelevancyWithFeedback¶
This interface defines one method: rank_with_feedback
. Like
RankRelevancy.rank()
, rank_with_feedback
takes examples of
relevant and not-relevant example descriptor vectors as
numpy.ndarray
sequences and uses them to compute relevancy
scores (on a [0, 1]
scale) on a provided pool of other descriptor
vectors. However, it also expects a sequence of corresponding UIDs
for the pool vectors and additionally returns a sequence of UIDs,
possibly not all from the pool, on which feedback would be most
useful.
-
class
smqtk.algorithms.rank_relevancy.
RankRelevancyWithFeedback
[source]¶ Similar to the
RankRelevancy
algorithm but with the added feature of also returning a sequence of elements from which feedback would be “most useful”.
What “most useful” means may be flexible but generally refers to the goal of reducing the number of adjudications required in order to separate true-positive examples from true-negative examples in provided pools via the assigned relevancy scores. E.g. other elements may be adjudicated in some quantity to achieve some level of relevant sample separation, but if the feedback requests are instead adjudicated, fewer elements may need to be adjudicated to achieve an equivalent level of separation.
Feedback requests ought to be returned in a form that is meaningful for the user to be able to properly convey the proper information to the adjudicating agent to actually perform adjudications. Additionally, we want to be able to request feedback from elements that may not be present in the given pool of descriptors.
Towards that end, this algorithm should be given a sequence of UIDs for the given pool of descriptors. This allows the implementation to potentially coordinate with an outside source of descriptor references such that the returned feedback requests may be interpreted uniformly.
-
abstract
_rank_with_feedback
(pos: Sequence[numpy.ndarray], neg: Sequence[numpy.ndarray], pool: Sequence[numpy.ndarray], pool_uids: Sequence[collections.abc.Hashable]) → Tuple[Sequence[float], Sequence[collections.abc.Hashable]][source]¶ Implement
rank_with_feedback(). pool and pool_uids have already been checked to be of equal length.
See also the rank_with_feedback() doc-string for the meanings of the parameters and return values.
-
rank_with_feedback
(pos: Sequence[numpy.ndarray], neg: Sequence[numpy.ndarray], pool: Sequence[numpy.ndarray], pool_uids: Sequence[collections.abc.Hashable]) → Tuple[Sequence[float], Sequence[collections.abc.Hashable]][source]¶ Assign a relevancy score to each input descriptor in pool based on the positively and negatively adjudicated descriptors in pos and neg respectively, additionally returning a sequence of UIDs of those descriptors for which adjudication feedback would be “most useful”.
- Parameters
pos – Sequence of positively adjudicated descriptor vectors.
neg – Sequence of negatively adjudicated descriptor vectors.
pool – A sequence of descriptor vectors that we want to rank by topical relevancy relative to the given positive and negative examples.
pool_uids – A sequence of hashable UID values, parallel in association with descriptors in pool.
- Returns
Ordered sequence of float values denoting relevancy of pool elements, as well as a sequence of Hashable values referencing in-pool or out-of-pool descriptors we recommend for adjudication feedback. In the latter sequence, descriptors are ordered by usefulness, most to least.
- Raises
ValueError – pool and pool_uids are of different length
See also
RankRelevancyWithFeedback
class doc-string for discussion on “most useful” meaning.
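The split between the public rank_with_feedback method, which validates its inputs, and the private _rank_with_feedback method, which does the work, can be sketched as follows. ToyFeedbackRanker is hypothetical: its nearest-example scoring and its choice of feedback (the UIDs whose scores sit closest to 0.5, a simple uncertainty heuristic) are assumptions, since "most useful" is left flexible above.

```python
import math
from typing import Hashable, List, Sequence, Tuple

Vector = Sequence[float]


class ToyFeedbackRanker:
    """Hypothetical ranker; not an SMQTK class."""

    def rank_with_feedback(
        self,
        pos: Sequence[Vector],
        neg: Sequence[Vector],
        pool: Sequence[Vector],
        pool_uids: Sequence[Hashable],
    ) -> Tuple[List[float], List[Hashable]]:
        # The public method validates, then delegates.
        if len(pool) != len(pool_uids):
            raise ValueError("pool and pool_uids are of different length")
        return self._rank_with_feedback(pos, neg, pool, pool_uids)

    def _rank_with_feedback(self, pos, neg, pool, pool_uids):
        scores = []
        for vec in pool:
            d_pos = min(math.dist(vec, p) for p in pos)
            d_neg = min(math.dist(vec, q) for q in neg)
            total = d_pos + d_neg
            scores.append(0.5 if total == 0 else d_neg / total)
        # Most uncertain first: smallest distance from a 0.5 score.
        by_uncertainty = sorted(
            zip(scores, pool_uids), key=lambda pair: abs(pair[0] - 0.5)
        )
        return scores, [uid for _, uid in by_uncertainty]


ranker = ToyFeedbackRanker()
scores, feedback = ranker.rank_with_feedback(
    pos=[[1.0, 0.0]],
    neg=[[-1.0, 0.0]],
    pool=[[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]],
    pool_uids=["a", "b", "c"],
)
```

This toy version only ever recommends in-pool UIDs; the interface also allows out-of-pool UIDs to be returned.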
RelevancyIndex¶
This interface defines two methods: build_index
and rank
.
The build_index
method is, like a NearestNeighborsIndex
, used to build an index of DescriptorElement
instances.
The rank method takes examples of relevant and not-relevant DescriptorElement instances, which the algorithm uses to rank (think sort) the indexed DescriptorElement instances by relevancy (on a [0, 1] scale).
-
class
smqtk.algorithms.relevancy_index.
RelevancyIndex
[source]¶ Abstract class for IQR index implementations.
Similar to a traditional nearest-neighbors algorithm, an IQR index provides a specialized nearest-neighbors interface that can take multiple examples of positively and negatively relevant exemplars in order to produce a [0, 1] ranking of the indexed elements by determined relevancy.
-
abstract
build_index
(descriptors)[source]¶ Build the index based on the given iterable of descriptor elements.
Subsequent calls to this method should rebuild the index, not add to it.
- Raises
ValueError – No data available in the given iterable.
- Parameters
descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.
-
abstract
rank
(pos, neg)[source]¶ Rank the currently indexed elements given pos positive and neg negative exemplar descriptor elements.
- Parameters
pos (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of positive exemplar DescriptorElement instances. This may be optional for some implementations.
neg (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of negative exemplar DescriptorElement instances. This may be optional for some implementations.
- Raises
NoIndexError – If index ranking is requested without an index to rank.
- Returns
Map of indexed descriptor elements to a rank value in the [0, 1] range (inclusive), where 1.0 means most relevant and 0.0 means least relevant.
- Return type
dict[smqtk.representation.DescriptorElement, float]