Data Abstraction
An important part of any algorithm is the data it works over and the data it produces.
An important part of working with data at large scale is where the data is stored and how it is accessed.
The smqtk.representation module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed.
This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.
-
class
smqtk.representation.
SmqtkRepresentation
[source]¶ Interface for data representation interfaces and implementations.
Data should be serializable, so this interface adds abstract methods for serializing and de-serializing SMQTK data representation instances.
Data Representation Structures
The following are the core data representation interfaces.
- Note:
It is required that implementations have a common serialization format so that they may be stored or transported by other structures in a general way without caring what the specific implementation is. For this we require that all implementations be serializable via the pickle (and thus cPickle) module functions.
DataElement
-
class
smqtk.representation.
DataElement
[source]¶ Abstract interface for a byte data container.
The primary “value” of a DataElement is the byte content wrapped. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.
UUIDs should be cast-able to a string and maintain uniqueness after conversion.
-
clean_temp
()[source]¶ Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.
-
abstract
content_type
()[source]¶ - Returns
Standard type/subtype string for this data element, or None if the content type is unknown.
- Return type
str or None
-
classmethod
from_uri
(uri)[source]¶ Construct a new instance based on the given URI.
This function may not be implemented for all DataElement types.
- Parameters
uri (str) – URI string to resolve into an element instance
- Raises
NoUriResolutionError – This element type does not implement URI resolution.
smqtk.exceptions.InvalidUriError – This element type could not resolve the provided URI string.
- Returns
New element instance of our type.
- Return type
-
abstract
is_empty
()[source]¶ Check if this element contains no bytes.
The intent of this method is to quickly check if there is any data behind this element, ideally without having to read all/any of the underlying data.
- Returns
If this element contains 0 bytes.
- Return type
bool
-
md5
()[source]¶ Get the MD5 checksum of this element’s binary content.
- Returns
MD5 hex checksum of the data content.
- Return type
str
-
abstract
set_bytes
(b)[source]¶ Set bytes to this data element.
Not all implementations may support setting bytes (check the writable() method return).
This base abstract method should be called by sub-class implementations first. We check for mutability based on the writable() method return.
- Parameters
b (bytes) – bytes to set.
- Raises
ReadOnlyError – This data element can only be read from / does not support writing.
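The writability check described for set_bytes can be sketched as follows. This is a minimal illustration under stated assumptions: `ToyElement` and the local `ReadOnlyError` class are hypothetical stand-ins written for this example, not the SMQTK classes themselves.

```python
class ReadOnlyError(Exception):
    """Hypothetical stand-in for the exception named in the text."""


class ToyElement:
    """Hypothetical in-memory byte container (not an SMQTK class)."""

    def __init__(self, content, readonly=False):
        self._bytes = content
        self._readonly = readonly

    def writable(self):
        return not self._readonly

    def get_bytes(self):
        return self._bytes

    def set_bytes(self, b):
        # Check mutability first, before touching any stored content.
        if not self.writable():
            raise ReadOnlyError("This element does not support writing.")
        self._bytes = b


elem = ToyElement(b"old")
elem.set_bytes(b"new")

frozen = ToyElement(b"fixed", readonly=True)
try:
    frozen.set_bytes(b"changed")
    raised = False
except ReadOnlyError:
    raised = True
```

Callers that cannot tolerate the exception can test `writable()` before attempting to set bytes.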
-
sha1
()[source]¶ Get the SHA1 checksum of this element’s binary content.
- Returns
SHA1 hex checksum of the data content.
- Return type
str
-
sha512
()[source]¶ Get the SHA512 checksum of this element’s binary content.
- Returns
SHA512 hex checksum of the data content.
- Return type
str
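For reference, the three hex checksums described above can be reproduced with the standard library's hashlib module. The byte string is an arbitrary stand-in for an element's content; how a given implementation computes or caches these values is not shown here.

```python
import hashlib

content = b"example bytes"  # stand-in for an element's byte content

# Each call returns the checksum as a lowercase hexadecimal string.
md5_hex = hashlib.md5(content).hexdigest()
sha1_hex = hashlib.sha1(content).hexdigest()
sha512_hex = hashlib.sha512(content).hexdigest()
```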
-
to_buffered_reader
()[source]¶ Wrap this element’s bytes in an io.BufferedReader instance for use as a file-like object for reading.
As we use the get_bytes function, this element’s bytes must safely fit in memory for this method to be usable.
- Returns
New BufferedReader instance
- Return type
io.BufferedReader
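A minimal sketch of the wrapping idea, using only the standard library. The `content` variable stands in for whatever get_bytes would return; the whole payload is held in memory, which is why the size caveat above applies.

```python
import io

content = b"line one\nline two\n"  # stand-in for get_bytes() output

# BytesIO holds the entire payload in memory; BufferedReader layers a
# buffered, file-like reading interface on top of it.
reader = io.BufferedReader(io.BytesIO(content))
first_line = reader.readline()
rest = reader.read()
```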
-
uuid
()[source]¶ UUID for this data element.
This may take different forms from integers to strings to a uuid.UUID instance. This must return a hashable data type.
By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.
- Returns
UUID value for this data element. This return value should be hashable.
- Return type
collections.abc.Hashable
-
write_temp
(temp_dir=None)[source]¶ Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.
It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.
- NOTE:
The file path returned should not be explicitly removed by the user. Instead, the
clean_temp()
method should be called on this object.
- Parameters
temp_dir (None or str) – Optional directory to write temporary file in, otherwise we use the platform default temporary files directory. If this is an empty string, we count it the same as having provided None.
- Returns
Path to the temporary file
- Return type
str
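The write_temp / clean_temp pairing might look like the following sketch. `TempWritingElement` is a hypothetical in-memory class written for this illustration, and the use of mimetypes and tempfile.mkstemp is an assumption about one reasonable way to do it, not a description of the library's internals.

```python
import mimetypes
import os
import tempfile


class TempWritingElement:
    """Hypothetical in-memory element (not an SMQTK implementation)."""

    def __init__(self, content, content_type=None):
        self._bytes = content
        self._content_type = content_type
        self._temp_paths = []

    def write_temp(self, temp_dir=None):
        # An empty string is treated the same as None (platform default).
        temp_dir = temp_dir or None
        # Guess a file extension from the content type, if one is known.
        ext = None
        if self._content_type:
            ext = mimetypes.guess_extension(self._content_type)
        fd, path = tempfile.mkstemp(suffix=ext or "", dir=temp_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(self._bytes)
        self._temp_paths.append(path)
        return path

    def clean_temp(self):
        # Remove every file this element wrote; a no-op if there are none.
        for path in self._temp_paths:
            if os.path.isfile(path):
                os.remove(path)
        self._temp_paths = []


elem = TempWritingElement(b"hello", content_type="text/plain")
path = elem.write_temp()
with open(path, "rb") as f:
    written = f.read()
elem.clean_temp()
exists_after_clean = os.path.exists(path)
```

Because the element tracks the paths it created, the caller never deletes the file directly, which matches the note above.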
DataSet
-
class
smqtk.representation.
DataSet
[source]¶ Abstract interface for data sets that contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.
This should only be used with DataElements whose byte content is expected not to change. If they do, then UUID keys may no longer represent the elements associated with them.
-
abstract
add_data
(*elems)[source]¶ Add the given data element instance(s) to this data set.
NOTE: Implementing methods should check that input elements are in fact DataElement instances.
- Parameters
elems (smqtk.representation.DataElement) – Data element(s) to add
-
abstract
get_data
(uuid)[source]¶ Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.
- Raises
KeyError – If the given uuid does not refer to an element in this data set.
- Parameters
uuid (collections.abc.Hashable) – The uuid of the element to retrieve.
- Returns
The data element instance for the given uuid.
- Return type
-
abstract
has_uuid
(uuid)[source]¶ Test if the given uuid refers to an element in this data set.
- Parameters
uuid (collections.abc.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about.
- Returns
True if the given uuid matches an element in this set, or False if it does not.
- Return type
bool
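As a rough illustration of the interface, the sketch below backs a data set with a plain dictionary keyed on element UUIDs. Both `ToyElement` and `MemoryDataSet` are hypothetical classes written for this example; the SHA1-based UUID follows the default described for DataElement.uuid.

```python
import hashlib


class ToyElement:
    """Hypothetical byte container (not an SMQTK DataElement)."""

    def __init__(self, content):
        self._bytes = content

    def uuid(self):
        # Mirrors the documented default: hex SHA1 of the byte content.
        return hashlib.sha1(self._bytes).hexdigest()


class MemoryDataSet:
    """Hypothetical dict-backed data set keyed on element UUIDs."""

    def __init__(self):
        self._elements = {}

    def add_data(self, *elems):
        for e in elems:
            # Check the input type before storing, as the note on
            # add_data asks implementations to do.
            if not isinstance(e, ToyElement):
                raise TypeError("Expected a ToyElement instance.")
            self._elements[e.uuid()] = e

    def get_data(self, uuid):
        # A plain dict lookup raises KeyError for unknown UUIDs.
        return self._elements[uuid]

    def has_uuid(self, uuid):
        return uuid in self._elements


ds = MemoryDataSet()
a, b = ToyElement(b"first"), ToyElement(b"second")
ds.add_data(a, b)

try:
    ds.get_data("missing")
    missing_raised = False
except KeyError:
    missing_raised = True
```

Since keys are derived from content, changing an element's bytes after insertion would leave it filed under a stale key, which is the reason for the immutability caveat above.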
DescriptorElement
-
class
smqtk.representation.
DescriptorElement
(type_str, uuid)[source]¶ Abstract descriptor vector container.
This structure supports implementations that cache descriptor vectors on a per-UUID basis.
UUIDs must maintain unique-ness when transformed into a string.
Descriptor element equality is based on shared descriptor type and vector equality. Two descriptor vectors that are generated by different types of descriptor generator should not be considered the same (though this may be up for discussion).
Stored vectors should be effectively immutable.
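The equality rule above can be sketched with a small stand-in class. `ToyDescriptor` is hypothetical and stores its vector as a tuple of floats rather than a numpy.ndarray, purely to keep the example dependency-free; note that the UUID plays no part in the comparison.

```python
class ToyDescriptor:
    """Hypothetical descriptor container (not an SMQTK class)."""

    def __init__(self, type_str, uuid, vector=None):
        self._type_str = type_str
        self._uuid = uuid
        # A tuple keeps the stored vector effectively immutable.
        self._vector = None if vector is None else tuple(vector)

    def has_vector(self):
        return self._vector is not None

    def __eq__(self, other):
        if not isinstance(other, ToyDescriptor):
            return NotImplemented
        # Equal only when both the descriptor type and the vector match.
        return (self._type_str == other._type_str
                and self._vector == other._vector)


d1 = ToyDescriptor("generator_a", "uuid-0", [1.0, 2.0])
d2 = ToyDescriptor("generator_a", "uuid-1", [1.0, 2.0])
d3 = ToyDescriptor("generator_b", "uuid-0", [1.0, 2.0])
```

Here `d1 == d2` holds despite different UUIDs, while `d1` and `d3` differ because their type strings differ.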
-
classmethod
from_config
(config_dict, type_str, uuid, merge_default=True)[source]¶ Instantiate a new instance of this class given the desired type, uuid, and JSON-compliant configuration dictionary.
- Parameters
type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.
uuid (collections.abc.Hashable) – Unique ID reference of the descriptor.
config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
merge_default (bool) – Merge the given configuration on top of the default provided by
get_default_config
.
- Returns
Constructed instance from the provided config.
- Return type
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
By default, we observe what this class’s constructor takes as arguments, aside from the first two assumed positional arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
- Returns
Default configuration dictionary for the class.
- Return type
dict
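The constructor introspection described above could be approximated as in the sketch below. The `default_config` helper and `ToyDescriptorElement` class are hypothetical, and using None as the placeholder for arguments without defaults is an assumption made for this illustration.

```python
import inspect


class ToyDescriptorElement:
    """Hypothetical class; only its constructor signature matters here."""

    def __init__(self, type_str, uuid, cache_dir="/tmp/cache",
                 pickle_protocol=-1):
        self.type_str = type_str
        self.uuid = uuid
        self.cache_dir = cache_dir
        self.pickle_protocol = pickle_protocol


def default_config(cls, skip=2):
    """Build a config dict from constructor arguments, skipping the first
    ``skip`` positional arguments after ``self``."""
    params = list(inspect.signature(cls.__init__).parameters.values())
    params = params[1 + skip:]  # drop self plus the runtime arguments
    config = {}
    for p in params:
        # Assumption: arguments without a default get a None placeholder.
        has_default = p.default is not inspect.Parameter.empty
        config[p.name] = p.default if has_default else None
    return config


config = default_config(ToyDescriptorElement)
```

The two skipped arguments correspond to `type_str` and `uuid`, which are supplied at runtime rather than through configuration.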
-
classmethod
get_many_vectors
(descriptors)[source]¶ Get an iterator over vectors associated with given descriptors.
- Note
Most subclasses should override internal method _get_many_vectors rather than this external wrapper function. If a subclass does override this classmethod, it is responsible for appropriately handling any valid DescriptorElement, regardless of subclass.
- Parameters
descriptors (collections.abc.Iterable[ smqtk.representation.descriptor_element.DescriptorElement]) – Iterable of descriptors to query for.
- Returns
Iterable of vectors associated with the given descriptors or None if the descriptor has no associated vector. Results are returned in the order that descriptors were given.
- Return type
list[numpy.ndarray | None]
-
abstract
has_vector
()[source]¶ - Returns
Whether or not this container currently has a descriptor vector stored.
- Return type
bool
-
abstract
set_vector
(new_vec)[source]¶ Set the contained vector.
If this container already stores a descriptor vector, this will overwrite it.
- Parameters
new_vec (numpy.ndarray) – New vector to contain.
- Returns
Self.
- Return type
DescriptorMemoryElement
DescriptorSet
-
class
smqtk.representation.
DescriptorSet
[source]¶ Index of descriptors, keyed and query-able by descriptor UUID.
Note that these indexes do not use the descriptor type strings. Thus, if a set of descriptors has multiple elements with the same UUID, but different type strings, they will overwrite each other in these indexes. In such a case, when dealing with descriptors for different generators, it is advisable to use multiple indices.
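The collision described above, and the multiple-index workaround, can be illustrated with plain dictionaries. The tuples and index layouts here are hypothetical stand-ins, not SMQTK structures.

```python
# Hypothetical descriptors as (type_str, uuid, vector) tuples.
d_a = ("generator_a", "uuid-0", [0.1, 0.2])
d_b = ("generator_b", "uuid-0", [0.9, 0.8])

# A single index keyed only on UUID: the second add replaces the first.
single_index = {}
for type_str, uuid, vec in (d_a, d_b):
    single_index[uuid] = (type_str, vec)

# One index per descriptor type keeps both entries.
per_type_index = {}
for type_str, uuid, vec in (d_a, d_b):
    per_type_index.setdefault(type_str, {})[uuid] = vec
```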
-
abstract
add_descriptor
(descriptor)[source]¶ Add a descriptor to this index.
Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.
- Parameters
descriptor (smqtk.representation.DescriptorElement) – Descriptor to index.
-
abstract
add_many_descriptors
(descriptors)[source]¶ Add multiple descriptors at one time.
Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.
- Parameters
descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor instances to add to this index.
-
abstract
count
()[source]¶ - Returns
Number of descriptor elements stored in this index.
- Return type
int
-
abstract
get_descriptor
(uuid)[source]¶ Get the descriptor in this index that is associated with the given UUID.
- Parameters
uuid (collections.abc.Hashable) – UUID of the DescriptorElement to get.
- Raises
KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.
- Returns
DescriptorElement associated with the queried UUID.
- Return type
-
abstract
get_many_descriptors
(uuids)[source]¶ Get an iterator over descriptors associated to given descriptor UUIDs.
- Parameters
uuids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of descriptor UUIDs to query for.
- Raises
KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.
- Returns
Iterator of descriptors associated to given uuid values.
- Return type
collections.abc.Iterable[smqtk.representation.DescriptorElement]
-
get_many_vectors
(uuids)[source]¶ Get underlying vectors of descriptors associated with given uuids.
- Parameters
uuids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of descriptor UUIDs to query for.
- Raises
KeyError – There is no descriptor in this set for one or more of the input UUIDs.
- Returns
List of vectors for descriptors associated with given uuid values.
- Return type
list[numpy.ndarray | None]
-
abstract
has_descriptor
(uuid)[source]¶ Check if a DescriptorElement with the given UUID exists in this index.
- Parameters
uuid (collections.abc.Hashable) – UUID to query for
- Returns
True if a DescriptorElement with the given UUID exists in this index, or False if not.
- Return type
bool
-
abstract
iterdescriptors
()[source]¶ Return an iterator over indexed descriptor element instances.
- Return type
collections.abc.Iterator[smqtk.representation.DescriptorElement]
-
abstract
iteritems
()[source]¶ Return an iterator over indexed descriptor key and instance pairs.
- Return type
collections.abc.Iterator[(collections.abc.Hashable, smqtk.representation.DescriptorElement)]
-
abstract
iterkeys
()[source]¶ Return an iterator over indexed descriptor keys, which are their UUIDs.
- Return type
collections.abc.Iterator[collections.abc.Hashable]
-
abstract
remove_descriptor
(uuid)[source]¶ Remove a descriptor from this index by the given UUID.
- Parameters
uuid (collections.abc.Hashable) – UUID of the DescriptorElement to remove.
- Raises
KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.
-
abstract
remove_many_descriptors
(uuids)[source]¶ Remove descriptors associated to given descriptor UUIDs from this index.
- Parameters
uuids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of descriptor UUIDs to remove.
- Raises
KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.
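A compact sketch of the add, get, and remove semantics described for this interface, using a dictionary as the backing store. `Descriptor` and `MemoryDescriptorSet` are hypothetical names for this illustration; validating all UUIDs before removing any of them is a design choice of the sketch, not something the text above mandates.

```python
from collections import namedtuple

# Hypothetical minimal descriptor: just a UUID and a vector.
Descriptor = namedtuple("Descriptor", ["uuid", "vector"])


class MemoryDescriptorSet:
    """Hypothetical dict-backed index keyed on descriptor UUID."""

    def __init__(self):
        self._table = {}

    def count(self):
        return len(self._table)

    def add_descriptor(self, descriptor):
        # Same UUID overwrites the stored entry instead of duplicating it.
        self._table[descriptor.uuid] = descriptor

    def add_many_descriptors(self, descriptors):
        for d in descriptors:
            self.add_descriptor(d)

    def has_descriptor(self, uuid):
        return uuid in self._table

    def get_descriptor(self, uuid):
        return self._table[uuid]  # KeyError if the UUID is unknown

    def remove_descriptor(self, uuid):
        del self._table[uuid]  # KeyError if the UUID is unknown

    def remove_many_descriptors(self, uuids):
        uuids = list(uuids)
        # Validate first so a bad UUID does not leave a partial removal.
        for u in uuids:
            if u not in self._table:
                raise KeyError(u)
        for u in uuids:
            del self._table[u]


index = MemoryDescriptorSet()
index.add_many_descriptors([
    Descriptor("u0", (0.0, 1.0)),
    Descriptor("u1", (1.0, 0.0)),
    Descriptor("u0", (0.5, 0.5)),  # overwrites the first "u0" entry
])
count_after_add = index.count()
latest_u0 = index.get_descriptor("u0").vector

try:
    index.remove_many_descriptors(["u1", "unknown"])
    remove_raised = False
except KeyError:
    remove_raised = True
```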
DetectionElement
-
class
smqtk.representation.
DetectionElement
(uuid)[source]¶ Representation of a spatial detection.
-
classmethod
from_config
(config_dict, uuid, merge_default=True)[source]¶ Override of smqtk.utils.configuration.Configurable.from_config() with the added runtime argument uuid. See parent method documentation for details.
- Parameters
config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
uuid (collections.abc.Hashable) – UUID to assign to the produced DetectionElement.
merge_default (bool) – Merge the given configuration on top of the default provided by
get_default_config
.
- Returns
Constructed instance from the provided config.
- Return type
-
abstract
get_bbox
()[source]¶ - Returns
The spatial bounding box of this detection.
- Return type
smqtk.representation.AxisAlignedBoundingBox
- Raises
NoDetectionError – No detection AxisAlignedBoundingBox set yet.
-
abstract
get_classification
()[source]¶ - Returns
The classification element of this detection.
- Return type
smqtk.representation.ClassificationElement
- Raises
NoDetectionError – No detection ClassificationElement set yet or the element is empty.
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
By default, we observe what this class’s constructor takes as arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
- Returns
Default configuration dictionary for the class.
- Return type
dict
>>> # noinspection PyUnresolvedReferences
>>> class SimpleConfig(Configurable):
...     def __init__(self, a=1, b='foo'):
...         self.a = a
...         self.b = b
...     def get_config(self):
...         return {'a': self.a, 'b': self.b}
>>> self = SimpleConfig()
>>> config = self.get_default_config()
>>> assert config == {'a': 1, 'b': 'foo'}
-
abstract
get_detection
()[source]¶ - Returns
The paired spatial bounding box and classification element of this detection.
- Return type
(smqtk.representation.AxisAlignedBoundingBox, smqtk.representation.ClassificationElement)
- Raises
NoDetectionError – No detection AxisAlignedBoundingBox and ClassificationElement set yet.
-
abstract
has_detection
()[source]¶ - Returns
Whether or not this container currently contains a valid detection bounding box and classification element (must be non-zero).
- Return type
bool
-
abstract
set_detection
(bbox, classification_element)[source]¶ Set a bounding box and classification element to this detection element.
- Parameters
bbox (smqtk.representation.AxisAlignedBoundingBox) – Spatial bounding box instance.
classification_element (smqtk.representation.ClassificationElement) – The classification of this detection.
- Raises
ValueError – No, or invalid, AxisAlignedBoundingBox or ClassificationElement was provided.
- Returns
Self
- Return type
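The set/has/get contract above might be realized as in the following sketch. Everything here is hypothetical: `MemoryDetection` and the local `NoDetectionError` are written for this example, the bounding box is a plain tuple, and the classification is a label-to-confidence dictionary rather than the SMQTK element types.

```python
class NoDetectionError(Exception):
    """Hypothetical stand-in for the exception named above."""


class MemoryDetection:
    """Hypothetical in-memory detection container (not an SMQTK class)."""

    def __init__(self, uuid):
        self.uuid = uuid
        self._bbox = None
        self._classification = None

    def has_detection(self):
        # Valid only when a box is set and the classification is non-empty.
        return self._bbox is not None and bool(self._classification)

    def set_detection(self, bbox, classification):
        if bbox is None or not classification:
            raise ValueError("A bounding box and a non-empty "
                             "classification are both required.")
        self._bbox = bbox
        self._classification = classification
        return self  # returning self allows call chaining

    def get_detection(self):
        if not self.has_detection():
            raise NoDetectionError("No detection has been set yet.")
        return self._bbox, self._classification


det = MemoryDetection(uuid=0)
empty_before = det.has_detection()
returned = det.set_detection((0, 0, 10, 20), {"cat": 0.9, "dog": 0.1})
bbox, classification = det.get_detection()

try:
    MemoryDetection(uuid=1).get_detection()
    get_raised = False
except NoDetectionError:
    get_raised = True
```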
Data Support Structures
Other data structures are provided in the [smqtk.representation](/python/smqtk/representation) module to assist with the use of the above described structures:
ClassificationElementFactory
-
class
smqtk.representation.
ClassificationElementFactory
(type, type_config)[source]¶ Factory class for producing ClassificationElement instances of a specified type and configuration.
-
classmethod
from_config
(config_dict, merge_default=True)[source]¶ Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.
This method should not be called via super unless an instance of the class is desired.
- Parameters
config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
merge_default (bool) – Merge the given configuration on top of the default provided by
get_default_config
.
- Returns
Constructed instance from the provided config.
- Return type
-
get_config
()[source]¶ Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.
In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.
- Returns
JSON type compliant configuration dictionary.
- Return type
dict
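The round trip between get_config and from_config can be sketched with a small stand-in class. `ToyConfigurable` is hypothetical, and its shallow `dict.update` merge is an assumption made for brevity; it is not a claim about how the library merges nested configurations.

```python
class ToyConfigurable:
    """Hypothetical configurable class (not the SMQTK Configurable)."""

    def __init__(self, a=1, b="foo"):
        self.a = a
        self.b = b

    @classmethod
    def get_default_config(cls):
        return {"a": 1, "b": "foo"}

    @classmethod
    def from_config(cls, config_dict, merge_default=True):
        if merge_default:
            # Assumption: a shallow merge of the given values on top of
            # the defaults is sufficient for this flat configuration.
            merged = cls.get_default_config()
            merged.update(config_dict)
            config_dict = merged
        # Keys match constructor argument names, so expand directly.
        return cls(**config_dict)

    def get_config(self):
        return {"a": self.a, "b": self.b}


inst = ToyConfigurable.from_config({"b": "bar"})
clone = ToyConfigurable.from_config(inst.get_config())
```

A partial configuration is enough when `merge_default` is true, and feeding an instance's own `get_config()` output back into `from_config` yields an identically configured instance.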
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
- Returns
Default configuration dictionary for the class.
- Return type
dict
-
new_classification
(type, uuid)[source]¶ Create a new ClassificationElement instance of the configured implementation.
- Parameters
type (str) – Type of classifier. This is usually the name of the classifier that generated this result.
uuid (collections.abc.Hashable) – UUID to associate with the classification.
- Returns
New ClassificationElement instance.
- Return type
smqtk.representation.ClassificationElement
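The factory idea, pairing an element type with a stored configuration and supplying the runtime arguments at creation time, can be sketched as follows. `ToyClassification` and `ToyElementFactory` are hypothetical names, and the `cache_size` parameter exists only to give the configuration something to carry.

```python
class ToyClassification:
    """Hypothetical element type (not an SMQTK ClassificationElement)."""

    def __init__(self, type_name, uuid, cache_size=10):
        self.type_name = type_name
        self.uuid = uuid
        self.cache_size = cache_size


class ToyElementFactory:
    """Hypothetical factory holding an element type and its configuration."""

    def __init__(self, elem_type, type_config):
        self._elem_type = elem_type
        # Copy so later changes to the caller's dict do not leak in.
        self._type_config = dict(type_config)

    def new_classification(self, type_name, uuid):
        # Runtime arguments come from the caller; the remaining
        # constructor arguments come from the stored configuration.
        return self._elem_type(type_name, uuid, **self._type_config)


factory = ToyElementFactory(ToyClassification, {"cache_size": 32})
c1 = factory.new_classification("svm", "uuid-1")
c2 = factory.new_classification("svm", "uuid-2")
```

Code that produces many elements only needs the factory, so the choice of storage backend stays a configuration concern.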
-
type
- Type
type | smqtk.representation.ClassificationElement
DescriptorElementFactory
-
class
smqtk.representation.
DescriptorElementFactory
(d_type, type_config)[source]¶ Factory class for producing DescriptorElement instances of a specified type and configuration.
-
classmethod
from_config
(config_dict, merge_default=True)[source]¶ Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.
This method should not be called via super unless an instance of the class is desired.
- Parameters
config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
merge_default (bool) – Merge the given configuration on top of the default provided by
get_default_config
.
- Returns
Constructed instance from the provided config.
- Return type
-
get_config
()[source]¶ Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.
In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.
- Returns
JSON type compliant configuration dictionary.
- Return type
dict
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
- Returns
Default configuration dictionary for the class.
- Return type
dict
-
new_descriptor
(type_str, uuid)[source]¶ Create a new DescriptorElement instance of the configured implementation
- Parameters
type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.
uuid (collections.abc.Hashable) – UUID to associate with the descriptor
- Returns
New DescriptorElement instance
- Return type
DetectionElementFactory
-
class
smqtk.representation.
DetectionElementFactory
(elem_type, elem_config)[source]¶ Factory class for producing DetectionElement instances of a specified type and configuration.
-
classmethod
from_config
(config_dict, merge_default=True)[source]¶ Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.
This base method is adequate without modification when a class’s constructor argument types are JSON-compliant. If one or more are not, however, this method then needs to be overridden in order to convert from a JSON-compliant stand-in into the more complex object the constructor requires. It is recommended that when complex types are used they also inherit from Configurable in order to ease the conversion to and from JSON-compliant stand-ins.
When this method does need to be overridden, this usually looks like the following pattern:
    class MyClass(Configurable):

        @classmethod
        def from_config(cls, config_dict, merge_default=True):
            # Optionally guarantee default values are present in the
            # configuration dictionary. This statement pairs with the
            # ``merge_default=False`` parameter in the super call.
            # This also in effect shallow copies the given non-dictionary
            # entries of ``config_dict`` due to the merger with the
            # default config.
            if merge_default:
                config_dict = merge_dict(cls.get_default_config(),
                                         config_dict)

            #
            # Perform any overriding here.
            #

            # Create and return an instance using the super method.
            return super(MyClass, cls).from_config(config_dict,
                                                   merge_default=False)
This method should not be called via super unless an instance of the class is desired.
- Parameters
config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
merge_default (bool) – Merge the given configuration on top of the default provided by
get_default_config
.
- Returns
Constructed instance from the provided config.
-
get_config
()[source]¶ Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.
In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.
- Returns
JSON type compliant configuration dictionary.
- Return type
dict
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
By default, we observe what this class’s constructor takes as arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
- Returns
Default configuration dictionary for the class.
- Return type
dict
>>> # noinspection PyUnresolvedReferences
>>> class SimpleConfig(Configurable):
...     def __init__(self, a=1, b='foo'):
...         self.a = a
...         self.b = b
...     def get_config(self):
...         return {'a': self.a, 'b': self.b}
>>> self = SimpleConfig()
>>> config = self.get_default_config()
>>> assert config == {'a': 1, 'b': 'foo'}