Data Abstraction

An important part of any algorithm is the data it works over and the data it produces. An important part of working with data at large scale is where the data is stored and how it is accessed. The smqtk.representation module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed. This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.

class smqtk.representation.SmqtkRepresentation[source]

Interface for data representation interfaces and implementations.

Data should be serializable, so this interface adds abstract methods for serializing and de-serializing SMQTK data representation instances.

Data Representation Structures

The following are the core data representation interfaces.

Note:

It is required that implementations have a common serialization format so that they may be stored or transported by other structures in a general way without caring what the specific implementation is. For this we require that all implementations be serializable via the pickle (and thus cPickle) module functions.
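As a minimal sketch of what this requirement means in practice, the following round-trips an object through pickle using only the standard library. ToyRepresentation is a hypothetical stand-in and not an SMQTK class; its attribute names are assumptions made for illustration.

```python
import pickle


class ToyRepresentation:
    """Hypothetical stand-in for a representation implementation."""

    def __init__(self, uuid, payload):
        self.uuid = uuid
        self.payload = payload

    def __getstate__(self):
        # Only plain, picklable values are included in the state.
        return {"uuid": self.uuid, "payload": self.payload}

    def __setstate__(self, state):
        self.uuid = state["uuid"]
        self.payload = state["payload"]


original = ToyRepresentation("abc123", b"raw bytes")
# A generic container can store or transport the serialized form
# without knowing which implementation produced it.
blob = pickle.dumps(original)
restored = pickle.loads(blob)
```

An implementation that holds non-picklable resources (open files, database connections) would typically leave them out of the state and re-create them on load.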

DataElement

class smqtk.representation.DataElement[source]

Abstract interface for a byte data container.

The primary “value” of a DataElement is the byte content wrapped. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.

UUIDs should be castable to a string and maintain uniqueness after conversion.

clean_temp()[source]

Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.

abstract content_type()[source]
Returns

Standard type/subtype string for this data element, or None if the content type is unknown.

Return type

str or None

classmethod from_uri(uri)[source]

Construct a new instance based on the given URI.

This function may not be implemented for all DataElement types.

Parameters

uri (str) – URI string to resolve into an element instance

Raises
  • NoUriResolutionError – This element type does not implement URI resolution.

  • smqtk.exceptions.InvalidUriError – This element type could not resolve the provided URI string.

Returns

New element instance of our type.

Return type

DataElement

abstract get_bytes()[source]
Returns

The bytes of this data element.

Return type

bytes

abstract is_empty()[source]

Check if this element contains no bytes.

The intent of this method is to quickly check if there is any data behind this element, ideally without having to read all/any of the underlying data.

Returns

If this element contains 0 bytes.

Return type

bool

is_read_only()[source]
Returns

If this element can only be read from.

Return type

bool

md5()[source]

Get the MD5 checksum of this element’s binary content.

Returns

MD5 hex checksum of the data content.

Return type

str

abstract set_bytes(b)[source]

Set bytes to this data element.

Not all implementations may support setting bytes (check the return value of the writable() method).

This base abstract method should be called by sub-class implementations first. It checks for mutability based on the return value of writable().

Parameters

b (bytes) – bytes to set.

Raises

ReadOnlyError – This data element can only be read from / does not support writing.
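The call-the-base-method-first pattern described for set_bytes can be sketched as follows. This is an illustration only: ByteContainer, MemoryContainer and the locally defined ReadOnlyError are hypothetical stand-ins, not the SMQTK classes.

```python
import abc


class ReadOnlyError(Exception):
    """Local stand-in for the ReadOnlyError named above."""


class ByteContainer(abc.ABC):
    """Hypothetical base class mirroring the set_bytes contract."""

    @abc.abstractmethod
    def writable(self):
        """Return True if this instance supports setting bytes."""

    @abc.abstractmethod
    def set_bytes(self, b):
        # Base behavior: refuse to write when the element is read-only.
        if not self.writable():
            raise ReadOnlyError("This element can only be read from.")


class MemoryContainer(ByteContainer):
    def __init__(self, data=b"", readonly=False):
        self._data = data
        self._readonly = readonly

    def writable(self):
        return not self._readonly

    def set_bytes(self, b):
        # Call the base method first so the mutability check runs.
        super().set_bytes(b)
        self._data = b


rw = MemoryContainer()
rw.set_bytes(b"hello")

ro = MemoryContainer(b"fixed", readonly=True)
try:
    ro.set_bytes(b"changed")
    write_rejected = False
except ReadOnlyError:
    write_rejected = True
```

Keeping the check in the abstract base method means each subclass gets the same read-only behavior by making a single super call.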

sha1()[source]

Get the SHA1 checksum of this element’s binary content.

Returns

SHA1 hex checksum of the data content.

Return type

str

sha512()[source]

Get the SHA512 checksum of this element’s binary content.

Returns

SHA512 hex checksum of the data content.

Return type

str

to_buffered_reader()[source]

Wrap this element’s bytes in an io.BufferedReader instance for use as a file-like object for reading.

As we use the get_bytes function, this element’s bytes must safely fit in memory for this method to be usable.

Returns

New BufferedReader instance

Return type

io.BufferedReader
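A minimal sketch of this kind of wrapping, using only the standard library, is shown below. The free function here is a hypothetical helper that takes bytes directly; it is not the method itself.

```python
import io


def to_buffered_reader(data):
    """Wrap in-memory bytes in a file-like reader (illustrative helper)."""
    # BytesIO exposes the bytes as a stream; BufferedReader adds buffering.
    return io.BufferedReader(io.BytesIO(data))


reader = to_buffered_reader(b"line one\nline two\n")
first_line = reader.readline()
rest = reader.read()
```

Because the whole payload is held in a BytesIO object, this approach only suits content that fits comfortably in memory.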

uuid()[source]

UUID for this data element.

This may take different forms, from integers to strings to a uuid.UUID instance. This must return a hashable data type.

By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.

Returns

UUID value for this data element. This return value should be hashable.

Return type

collections.abc.Hashable
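The default described above (the hex string of the SHA1 hash of the bytes) can be approximated with hashlib. The function name default_uuid is an assumption made for this sketch.

```python
import hashlib


def default_uuid(data):
    """Hex SHA1 digest of the given bytes (sketch of the default above)."""
    return hashlib.sha1(data).hexdigest()


key = default_uuid(b"hello")
# The result is a str, so it is hashable and can key a dict or set.
index = {key: "element for b'hello'"}
same_key = default_uuid(b"hello")
```

Identical byte content always yields the same key, which is why content-derived UUIDs only make sense for elements whose bytes are not expected to change.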

abstract writable()[source]
Returns

If this instance supports setting bytes.

Return type

bool

write_temp(temp_dir=None)[source]

Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.

It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.

NOTE:

The file path returned should not be explicitly removed by the user. Instead, the clean_temp() method should be called on this object.

Parameters

temp_dir (None or str) – Optional directory to write temporary file in, otherwise we use the platform default temporary files directory. If this is an empty string, we count it the same as having provided None.

Returns

Path to the temporary file

Return type

str
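The interplay between write_temp and clean_temp might look roughly like the sketch below. TempWriter is a hypothetical class written for illustration with the standard library; how the real implementation tracks its temporary files is not stated here.

```python
import mimetypes
import os
import tempfile


class TempWriter:
    """Hypothetical element that tracks the temp files it writes."""

    def __init__(self, data, content_type=None):
        self._data = data
        self._content_type = content_type
        self._temp_paths = []

    def write_temp(self, temp_dir=None):
        # Guess a file extension from the content type, if one is known.
        ext = None
        if self._content_type:
            ext = mimetypes.guess_extension(self._content_type)
        # An empty temp_dir string is treated the same as None.
        fd, path = tempfile.mkstemp(suffix=ext or "", dir=temp_dir or None)
        with os.fdopen(fd, "wb") as f:
            f.write(self._data)
        self._temp_paths.append(path)
        return path

    def clean_temp(self):
        # Does nothing if no temporary files were written.
        for path in self._temp_paths:
            if os.path.isfile(path):
                os.remove(path)
        self._temp_paths = []


elem = TempWriter(b"hello", content_type="text/plain")
path = elem.write_temp()
with open(path, "rb") as f:
    round_trip = f.read()
elem.clean_temp()
```

Letting the element remove its own files keeps callers from deleting a path that, for some implementations, may point at the original data.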

DataSet

class smqtk.representation.DataSet[source]

Abstract interface for data sets that contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.

This should only be used with DataElements whose byte content is expected not to change. If they do, then UUID keys may no longer represent the elements associated with them.

abstract add_data(*elems)[source]

Add the given data element instance(s) to this data set.

NOTE: Implementing methods should check that input elements are in fact DataElement instances.

Parameters

elems (smqtk.representation.DataElement) – Data element(s) to add

abstract count()[source]
Returns

The number of data elements in this set.

Return type

int

abstract get_data(uuid)[source]

Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.

Raises

KeyError – If the given uuid does not refer to an element in this data set.

Parameters

uuid (collections.abc.Hashable) – The uuid of the element to retrieve.

Returns

The data element instance for the given uuid.

Return type

smqtk.representation.DataElement

abstract has_uuid(uuid)[source]

Test if the given uuid refers to an element in this data set.

Parameters

uuid (collections.abc.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about.

Returns

True if the given uuid matches an element in this set, or False if it does not.

Return type

bool

abstract uuids()[source]
Returns

A new set of uuids represented in this data set.

Return type

set
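To make the DataSet contract concrete, here is a small in-memory sketch. ToyElement and ToyDataSet are hypothetical stand-ins that only mirror the method names documented above; the SHA1-based UUID is an assumption borrowed from the DataElement default.

```python
import hashlib


class ToyElement:
    """Hypothetical element whose UUID is the SHA1 hex digest of its bytes."""

    def __init__(self, data):
        self._data = data

    def get_bytes(self):
        return self._data

    def uuid(self):
        return hashlib.sha1(self._data).hexdigest()


class ToyDataSet:
    """Hypothetical in-memory data set keyed on element UUID."""

    def __init__(self):
        self._elements = {}

    def add_data(self, *elems):
        for e in elems:
            # Check that inputs are the expected element type.
            if not isinstance(e, ToyElement):
                raise TypeError("Expected ToyElement instances.")
            self._elements[e.uuid()] = e

    def count(self):
        return len(self._elements)

    def get_data(self, uuid):
        # A dict lookup raises KeyError for unknown UUIDs.
        return self._elements[uuid]

    def has_uuid(self, uuid):
        return uuid in self._elements

    def uuids(self):
        # Return a new set so callers cannot mutate internal state.
        return set(self._elements)


ds = ToyDataSet()
a, b = ToyElement(b"first"), ToyElement(b"second")
ds.add_data(a, b, ToyElement(b"first"))  # duplicate content, same UUID
```

Because keys derive from content in this sketch, adding the same bytes twice leaves a single entry, and mutating an element's bytes after insertion would leave a stale key.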

DescriptorElement

class smqtk.representation.DescriptorElement(type_str, uuid)[source]

Abstract descriptor vector container.

This structure supports implementations that cache descriptor vectors on a per-UUID basis.

UUIDs must maintain uniqueness when transformed into a string.

Descriptor element equality is based on shared descriptor type and vector equality. Two descriptor vectors that are generated by different types of descriptor generator should not be considered the same (though this may be up for discussion).

Stored vectors should be effectively immutable.
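The equality rule above (same descriptor type and equal vectors) can be sketched as follows. ToyDescriptor is hypothetical, and it stores vectors as tuples of floats rather than numpy arrays purely to keep the example dependency-free.

```python
class ToyDescriptor:
    """Hypothetical descriptor container; vectors are tuples of floats."""

    def __init__(self, type_str, uuid, vector=None):
        self._type = type_str
        self._uuid = uuid
        # Store an immutable copy so the vector is effectively read-only.
        self._vector = None if vector is None else tuple(vector)

    def type(self):
        return self._type

    def uuid(self):
        return self._uuid

    def vector(self):
        return self._vector

    def __eq__(self, other):
        if not isinstance(other, ToyDescriptor):
            return NotImplemented
        # Equality ignores the UUID: only type and vector are compared.
        return self._type == other._type and self._vector == other._vector


a = ToyDescriptor("genA", 0, [1.0, 2.0])
b = ToyDescriptor("genA", 1, [1.0, 2.0])
c = ToyDescriptor("genB", 0, [1.0, 2.0])
```

Here a and b compare equal despite different UUIDs, while a and c do not because they come from different generator types.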

classmethod from_config(config_dict, type_str, uuid, merge_default=True)[source]

Instantiate a new instance of this class given the desired type, uuid, and JSON-compliant configuration dictionary.

Parameters
  • type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.

  • uuid (collections.abc.Hashable) – Unique ID reference of the descriptor.

  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.

  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.

Returns

Constructed instance from the provided config.

Return type

DescriptorElement

classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

By default, we observe what this class’s constructor takes as arguments, aside from the first two assumed positional arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns

Default configuration dictionary for the class.

Return type

dict
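One way to picture the constructor introspection described above is the following sketch built on inspect.signature. The helper default_config_from_init and the class ToyDescriptorElement are hypothetical, and mapping arguments without defaults to None is an assumption of this sketch.

```python
import inspect


def default_config_from_init(cls, skip=0):
    """Build a config dict from constructor arguments (illustrative)."""
    params = list(inspect.signature(cls.__init__).parameters.values())
    # Drop ``self`` plus any leading positional arguments that are
    # supplied at runtime rather than through configuration.
    params = params[1 + skip:]
    config = {}
    for p in params:
        has_default = p.default is not inspect.Parameter.empty
        # Assumption: arguments without a default map to None.
        config[p.name] = p.default if has_default else None
    return config


class ToyDescriptorElement:
    def __init__(self, type_str, uuid, cache_dir="/tmp", compress=False):
        self.type_str = type_str
        self.uuid = uuid
        self.cache_dir = cache_dir
        self.compress = compress


# Skip the two runtime arguments: type_str and uuid.
config = default_config_from_init(ToyDescriptorElement, skip=2)
```

The resulting dictionary contains only the cache_dir and compress keys, which is why type_str and uuid have to be passed to from_config separately.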

classmethod get_many_vectors(descriptors)[source]

Get an iterator over vectors associated with given descriptors.

Note

Most subclasses should override internal method _get_many_vectors rather than this external wrapper function. If a subclass does override this classmethod, it is responsible for appropriately handling any valid DescriptorElement, regardless of subclass.

Parameters

descriptors (collections.abc.Iterable[ smqtk.representation.descriptor_element.DescriptorElement]) – Iterable of descriptors to query for.

Returns

Iterable of vectors associated with the given descriptors or None if the descriptor has no associated vector. Results are returned in the order that descriptors were given.

Return type

list[numpy.ndarray | None]

abstract has_vector()[source]
Returns

Whether or not this container currently has a descriptor vector stored.

Return type

bool

abstract set_vector(new_vec)[source]

Set the contained vector.

If this container already stores a descriptor vector, this will overwrite it.

Parameters

new_vec (numpy.ndarray) – New vector to contain.

Returns

Self.

Return type

DescriptorMemoryElement

type()[source]
Returns

Type label of the DescriptorGenerator that generated this vector.

Return type

str

uuid()[source]
Returns

Unique ID for this vector.

Return type

collections.abc.Hashable

abstract vector()[source]
Returns

Get the stored descriptor vector as a numpy array. This returns None if there is no vector stored in this container.

Return type

numpy.ndarray or None

DescriptorSet

class smqtk.representation.DescriptorSet[source]

Index of descriptors, keyed and query-able by descriptor UUID.

Note that these indexes do not use the descriptor type strings. Thus, if a set of descriptors has multiple elements with the same UUID but different type strings, they will overwrite each other in these indexes. When dealing with descriptors from different generators, it is therefore advisable to use multiple indexes.

abstract add_descriptor(descriptor)[source]

Add a descriptor to this index.

Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.

Parameters

descriptor (smqtk.representation.DescriptorElement) – Descriptor to index.
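Because these indexes key only on UUID, descriptors from different generators that share a UUID collide. The sketch below illustrates this with plain dictionaries; the tuple layout and the type names are made up for the example.

```python
# Descriptors are modelled here as (type_str, uuid, vector) tuples.
d_color = ("color_hist", "img-001", [0.1, 0.9])
d_shape = ("shape_ctx", "img-001", [0.4, 0.6])

# A single index keyed only on UUID keeps just the last one added.
single_index = {}
for type_str, uuid, vec in (d_color, d_shape):
    single_index[uuid] = (type_str, vec)

# One index per descriptor type keeps both.
per_type_index = {}
for type_str, uuid, vec in (d_color, d_shape):
    per_type_index.setdefault(type_str, {})[uuid] = vec
```

After running this, single_index holds only the shape_ctx entry for img-001, while per_type_index retains one entry under each type.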

abstract add_many_descriptors(descriptors)[source]

Add multiple descriptors at one time.

Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.

Parameters

descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor instances to add to this index.

abstract clear()[source]

Clear this descriptor index’s entries.

abstract count()[source]
Returns

Number of descriptor elements stored in this index.

Return type

int

abstract get_descriptor(uuid)[source]

Get the descriptor in this index that is associated with the given UUID.

Parameters

uuid (collections.abc.Hashable) – UUID of the DescriptorElement to get.

Raises

KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.

Returns

DescriptorElement associated with the queried UUID.

Return type

smqtk.representation.DescriptorElement

abstract get_many_descriptors(uuids)[source]

Get an iterator over descriptors associated to given descriptor UUIDs.

Parameters

uuids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of descriptor UUIDs to query for.

Raises

KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.

Returns

Iterator of descriptors associated to given uuid values.

Return type

collections.abc.Iterable[smqtk.representation.DescriptorElement]

get_many_vectors(uuids)[source]

Get underlying vectors of descriptors associated with given uuids.

Parameters

uuids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of descriptor UUIDs to query for.

Raises

KeyError – There is no descriptor in this set for one or more of the input UUIDs.

Returns

List of vectors for descriptors associated with given uuid values.

Return type

list[numpy.ndarray | None]

abstract has_descriptor(uuid)[source]

Check if a DescriptorElement with the given UUID exists in this index.

Parameters

uuid (collections.abc.Hashable) – UUID to query for

Returns

True if a DescriptorElement with the given UUID exists in this index, or False if not.

Return type

bool

items()[source]

alias for iteritems

abstract iterdescriptors()[source]

Return an iterator over indexed descriptor element instances.

Return type

collections.abc.Iterator[smqtk.representation.DescriptorElement]

abstract iteritems()[source]

Return an iterator over indexed descriptor key and instance pairs.

Return type

collections.abc.Iterator[(collections.abc.Hashable, smqtk.representation.DescriptorElement)]

abstract iterkeys()[source]

Return an iterator over indexed descriptor keys, which are their UUIDs.

Return type

collections.abc.Iterator[collections.abc.Hashable]

keys()[source]

alias for iterkeys

abstract remove_descriptor(uuid)[source]

Remove a descriptor from this index by the given UUID.

Parameters

uuid (collections.abc.Hashable) – UUID of the DescriptorElement to remove.

Raises

KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.

abstract remove_many_descriptors(uuids)[source]

Remove descriptors associated to given descriptor UUIDs from this index.

Parameters

uuids (collections.abc.Iterable[collections.abc.Hashable]) – Iterable of descriptor UUIDs to remove.

Raises

KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.

DetectionElement

class smqtk.representation.DetectionElement(uuid)[source]

Representation of a spatial detection.

classmethod from_config(config_dict, uuid, merge_default=True)[source]

Override of smqtk.utils.configuration.Configurable.from_config() with the added runtime argument uuid. See parent method documentation for details.

Parameters
  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.

  • uuid (collections.abc.Hashable) – UUID to assign to the produced DetectionElement.

  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.

Returns

Constructed instance from the provided config.

Return type

DetectionElement

abstract get_bbox()[source]
Returns

The spatial bounding box of this detection.

Return type

smqtk.representation.AxisAlignedBoundingBox

Raises

NoDetectionError – No detection AxisAlignedBoundingBox set yet.

abstract get_classification()[source]
Returns

The classification element of this detection.

Return type

smqtk.representation.ClassificationElement

Raises

NoDetectionError – No detection ClassificationElement set yet or the element is empty.

classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

By default, we observe what this class’s constructor takes as arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns

Default configuration dictionary for the class.

Return type

dict

>>> # noinspection PyUnresolvedReferences
>>> class SimpleConfig(Configurable):
...     def __init__(self, a=1, b='foo'):
...         self.a = a
...         self.b = b
...     def get_config(self):
...         return {'a': self.a, 'b': self.b}
>>> self = SimpleConfig()
>>> config = self.get_default_config()
>>> assert config == {'a': 1, 'b': 'foo'}

abstract get_detection()[source]
Returns

The paired spatial bounding box and classification element of this detection.

Return type

(smqtk.representation.AxisAlignedBoundingBox, smqtk.representation.ClassificationElement)

Raises

NoDetectionError – No detection AxisAlignedBoundingBox and ClassificationElement set yet.

abstract has_detection()[source]
Returns

Whether or not this container currently contains a valid detection bounding box and classification element (must be non-zero).

Return type

bool

abstract set_detection(bbox, classification_element)[source]

Set a bounding box and classification element to this detection element.

Parameters
  • bbox (smqtk.representation.AxisAlignedBoundingBox) – Spatial bounding box instance.

  • classification_element (smqtk.representation.ClassificationElement) – The classification of this detection.

Raises

ValueError – No, or invalid, AxisAlignedBoundingBox or ClassificationElement was provided.

Returns

Self

Return type

DetectionElement
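The set/has/get contract documented for detections can be sketched with a small in-memory class. ToyDetection and the local NoDetectionError are hypothetical; the bounding box is modelled as a pair of corner tuples and the classification as a label-to-confidence dict, both assumptions for illustration.

```python
class NoDetectionError(Exception):
    """Local stand-in for the NoDetectionError named above."""


class ToyDetection:
    """Hypothetical in-memory detection container."""

    def __init__(self, uuid):
        self.uuid = uuid
        self._bbox = None
        self._classification = None

    def has_detection(self):
        # Valid only when a bbox and a non-empty classification are set.
        return self._bbox is not None and bool(self._classification)

    def set_detection(self, bbox, classification):
        if bbox is None or classification is None:
            raise ValueError("A bounding box and classification are required.")
        self._bbox = bbox
        self._classification = classification
        return self  # returning self allows call chaining

    def get_detection(self):
        if not self.has_detection():
            raise NoDetectionError("No detection set yet.")
        return self._bbox, self._classification


det = ToyDetection("det-0")
had_detection_before = det.has_detection()
det.set_detection(((0, 0), (10, 20)), {"cat": 0.9})
bbox, cls_map = det.get_detection()
```

Returning self from set_detection lets callers construct and populate an element in one expression.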

Data Support Structures

Other data structures are provided in the smqtk.representation module to assist with the use of the above described structures:

ClassificationElementFactory

class smqtk.representation.ClassificationElementFactory(type, type_config)[source]

Factory class for producing ClassificationElement instances of a specified type and configuration.

classmethod from_config(config_dict, merge_default=True)[source]

Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.

This method should not be called via super unless an instance of the class is desired.

Parameters
  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.

  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.

Returns

Constructed instance from the provided config.

Return type

ClassificationElementFactory

get_config()[source]

Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.

In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.

Returns

JSON type compliant configuration dictionary.

Return type

dict
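The round trip between get_config and from_config can be sketched as below. ToyConfigurable is a hypothetical class, not the SMQTK Configurable mixin, and it omits the merge_default handling for brevity.

```python
import json


class ToyConfigurable:
    """Hypothetical class with a JSON-compliant configuration."""

    def __init__(self, threshold=0.5, label="default"):
        self.threshold = threshold
        self.label = label

    def get_config(self):
        # Keys match the constructor argument names.
        return {"threshold": self.threshold, "label": self.label}

    @classmethod
    def from_config(cls, config_dict):
        # Dictionary expansion passes config keys as constructor arguments.
        return cls(**config_dict)


first = ToyConfigurable(threshold=0.8, label="strict")
# Passing through JSON confirms the config is JSON-compliant.
config = json.loads(json.dumps(first.get_config()))
second = ToyConfigurable.from_config(config)
```

The two instances end up with identical configuration, which is the property the description above asks of get_config.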

classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns

Default configuration dictionary for the class.

Return type

dict

new_classification(type, uuid)[source]

Create a new ClassificationElement instance of the configured implementation.

Parameters
  • type (str) – Type of classifier. This is usually the name of the classifier that generated this result.

  • uuid (collections.abc.Hashable) – UUID to associate with the classification.

Returns

New ClassificationElement instance.

Return type

smqtk.representation.ClassificationElement
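A factory of this kind pairs a stored element type and configuration with values that are only known at runtime. The sketch below shows the general idea; ToyClassification and ToyElementFactory are hypothetical, and the store_path option is invented for the example.

```python
class ToyClassification:
    """Hypothetical element: runtime args first, then configured args."""

    def __init__(self, type_name, uuid, store_path=None):
        self.type_name = type_name
        self.uuid = uuid
        self.store_path = store_path


class ToyElementFactory:
    """Hypothetical factory holding an element type and its config."""

    def __init__(self, elem_type, type_config):
        self._elem_type = elem_type
        self._type_config = type_config

    def new_classification(self, type_name, uuid):
        # Runtime values are positional; stored config fills in the rest.
        return self._elem_type(type_name, uuid, **self._type_config)


factory = ToyElementFactory(ToyClassification, {"store_path": "/data/cls"})
elem = factory.new_classification("svm", "img-001")
other = factory.new_classification("svm", "img-002")
```

Code that needs new elements can hold one factory and stay unaware of which implementation or storage settings were configured.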

type
Type

type | smqtk.representation.ClassificationElement

DescriptorElementFactory

class smqtk.representation.DescriptorElementFactory(d_type, type_config)[source]

Factory class for producing DescriptorElement instances of a specified type and configuration.

classmethod from_config(config_dict, merge_default=True)[source]

Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.

This method should not be called via super unless an instance of the class is desired.

Parameters
  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.

  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.

Returns

Constructed instance from the provided config.

Return type

DescriptorElementFactory

get_config()[source]

Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.

In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.

Returns

JSON type compliant configuration dictionary.

Return type

dict

classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns

Default configuration dictionary for the class.

Return type

dict

new_descriptor(type_str, uuid)[source]

Create a new DescriptorElement instance of the configured implementation.

Parameters
  • type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.

  • uuid (collections.abc.Hashable) – UUID to associate with the descriptor

Returns

New DescriptorElement instance

Return type

smqtk.representation.DescriptorElement

DetectionElementFactory

class smqtk.representation.DetectionElementFactory(elem_type, elem_config)[source]

Factory class for producing DetectionElement instances of a specified type and configuration.

classmethod from_config(config_dict, merge_default=True)[source]

Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.

This base method is adequate without modification when a class’s constructor argument types are JSON-compliant. If one or more are not, however, this method needs to be overridden in order to convert from a JSON-compliant stand-in into the more complex object the constructor requires. It is recommended that complex types, when used, also inherit from Configurable, which should ease conversion to and from JSON-compliant stand-ins.

When this method does need to be overridden, this usually looks like the following pattern:

class MyClass (Configurable):

    @classmethod
    def from_config(cls, config_dict, merge_default=True):
        # Optionally guarantee default values are present in the
        # configuration dictionary.  This statement pairs with the
        # ``merge_default=False`` parameter in the super call.
        # This also in effect shallow copies the given non-dictionary
        # entries of ``config_dict`` due to the merger with the
        # default config.
        if merge_default:
            config_dict = merge_dict(cls.get_default_config(),
                                     config_dict)

        #
        # Perform any overriding here.
        #

        # Create and return an instance using the super method.
        return super(MyClass, cls).from_config(config_dict,
                                               merge_default=False)

This method should not be called via super unless an instance of the class is desired.

Parameters
  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.

  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.

Returns

Constructed instance from the provided config.

get_config()[source]

Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.

In most cases, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion. In some cases, where it doesn’t make sense to store some object constructor parameters as configuration values (i.e. they must be supplied at runtime), this method’s returned dictionary may leave those parameters out. In such cases, the object’s from_config class-method would also take additional positional arguments to fill in for the parameters that this returned configuration lacks.

Returns

JSON type compliant configuration dictionary.

Return type

dict

classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

By default, we observe what this class’s constructor takes as arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns

Default configuration dictionary for the class.

Return type

dict

>>> # noinspection PyUnresolvedReferences
>>> class SimpleConfig(Configurable):
...     def __init__(self, a=1, b='foo'):
...         self.a = a
...         self.b = b
...     def get_config(self):
...         return {'a': self.a, 'b': self.b}
>>> self = SimpleConfig()
>>> config = self.get_default_config()
>>> assert config == {'a': 1, 'b': 'foo'}

new_detection(uuid)[source]

Create a new DetectionElement instance of the configured implementation.

Parameters

uuid (collections.abc.Hashable) – UUID to assign the element.

Returns

New DetectionElement instance.

Return type

DetectionElement