Data Abstraction

An important part of any algorithm is the data it works over and the data it produces. An important part of working with data at large scale is where the data is stored and how it is accessed. The smqtk.representation module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed. This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.

class smqtk.representation.SmqtkRepresentation[source]

Interface for data representation interfaces and implementations.

Data should be serializable, so this interface adds abstract methods for serializing and de-serializing SMQTK data representation instances.

Data Representation Structures

The following are the core data representation interfaces.

Note:
It is required that implementations have a common serialization format so that they may be stored or transported by other structures in a general way without caring what the specific implementation is. For this we require that all implementations be serializable via the pickle (and thus cPickle) module functions.

DataElement

class smqtk.representation.data_element.DataElement[source]

Abstract interface for a byte data container.

The primary “value” of a DataElement is the byte content wrapped. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.

UUIDs should be castable to a string and remain unique after conversion.

clean_temp()[source]

Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.

content_type()[source]
Returns:Standard type/subtype string for this data element, or None if the content type is unknown.
Return type:str or None
classmethod from_uri(uri)[source]

Construct a new instance based on the given URI.

This function may not be implemented for all DataElement types.

Parameters:

uri (str) – URI string to resolve into an element instance

Raises:
  • NoUriResolutionError – This element type does not implement URI resolution.
  • smqtk.exceptions.InvalidUriError – This element type could not resolve the provided URI string.
Returns:

New element instance of our type.

Return type:

DataElement

get_bytes()[source]
Returns:Get the bytes for this data element.
Return type:bytes
is_empty()[source]

Check if this element contains no bytes.

The intent of this method is to quickly check if there is any data behind this element, ideally without having to read all/any of the underlying data.

Returns:If this element contains 0 bytes.
Return type:bool
is_read_only()[source]
Returns:If this element can only be read from.
Return type:bool
md5()[source]

Get the MD5 checksum of this element’s binary content.

Returns:MD5 hex checksum of the data content.
Return type:str
set_bytes(b)[source]

Set bytes to this data element.

Not all implementations support setting bytes (check the return value of the writable() method).

This base abstract method should be called by sub-class implementations first. We check for mutability based on writable() method return and invalidate checksum caches.

Parameters:b (str) – bytes to set.
Raises:ReadOnlyError – This data element can only be read from / does not support writing.
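
The calling convention described for set_bytes can be illustrated with a minimal sketch. Everything below is hypothetical stand-in code written for this example (BaseElement, MemoryElement, the _checksum_cache attribute and the local ReadOnlyError class are assumptions, not SMQTK source); it only shows the pattern of a sub-class calling the base method first so that the mutability check and cache invalidation happen before any bytes are stored.

```python
class ReadOnlyError(Exception):
    """Local stand-in for the ReadOnlyError named above."""


class BaseElement(object):
    """Hypothetical base class mirroring the described set_bytes contract."""

    def __init__(self):
        # Assumed cache layout: checksum name -> hex digest.
        self._checksum_cache = {}

    def writable(self):
        return False

    def set_bytes(self, b):
        # Refuse writes for read-only elements, then drop cached checksums.
        if not self.writable():
            raise ReadOnlyError("This element can only be read from.")
        self._checksum_cache.clear()


class MemoryElement(BaseElement):
    """Hypothetical writable implementation that keeps bytes in memory."""

    def __init__(self, b=b""):
        super(MemoryElement, self).__init__()
        self._bytes = b

    def writable(self):
        return True

    def get_bytes(self):
        return self._bytes

    def set_bytes(self, b):
        # Call the base method first, as the interface asks.
        super(MemoryElement, self).set_bytes(b)
        self._bytes = b
```

With this ordering, a read-only element raises before any state changes, and a writable one never serves a checksum computed from stale bytes.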
sha1()[source]

Get the SHA1 checksum of this element’s binary content.

Returns:SHA1 hex checksum of the data content.
Return type:str
sha512()[source]

Get the SHA512 checksum of this element’s binary content.

Returns:SHA512 hex checksum of the data content.
Return type:str
to_buffered_reader()[source]

Wrap this element’s bytes in a io.BufferedReader instance for use as file-like object for reading.

As we use the get_bytes function, this element’s bytes must safely fit in memory for this method to be usable.

Returns:New BufferedReader instance
Return type:io.BufferedReader
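
As a rough illustration of what wrapping bytes in an io.BufferedReader looks like, the sketch below uses only the standard library. It is an assumption about the general approach, not a copy of the SMQTK implementation, and the payload is made up for the example.

```python
import io

data = b"hello world"

# BytesIO holds the whole payload in memory, which is why the element's
# bytes must fit in memory for this approach to be usable.
reader = io.BufferedReader(io.BytesIO(data))

first = reader.read(5)   # read a fixed number of bytes
rest = reader.read()     # read whatever remains
```

The resulting object can be handed to any code that expects a readable binary file-like object.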
uuid()[source]

UUID for this data element.

This may take different forms, from integers to strings to a uuid.UUID instance. This must return a hashable data type.

By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.

Returns:UUID value for this data element. This return value should be hashable.
Return type:collections.Hashable
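
The default described above, the hex string of the SHA1 hash of the element's bytes, can be reproduced with the standard hashlib module. The helper name default_uuid is made up for this sketch and is not an SMQTK function.

```python
import hashlib


def default_uuid(data):
    # Hex string of the SHA1 digest; strings are hashable, as required.
    return hashlib.sha1(data).hexdigest()


a = default_uuid(b"same content")
b = default_uuid(b"same content")
c = default_uuid(b"other content")

# Equal bytes give equal UUIDs, so the value can be used as a dictionary key.
lookup = {a: "first element"}
```

Because the value is derived from content, two elements with identical bytes share a UUID under this default, and changing an element's bytes changes its UUID.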
writable()[source]
Returns:if this instance supports setting bytes.
Return type:bool
write_temp(temp_dir=None)[source]

Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.

It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.

NOTE:
The file path returned should not be explicitly removed by the user. Instead, the clean_temp() method should be called on this object.
Parameters:temp_dir (None or str) – Optional directory to write temporary file in, otherwise we use the platform default temporary files directory. If this is an empty string, we count it the same as having provided None.
Returns:Path to the temporary file
Return type:str
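
The interplay between write_temp and clean_temp can be sketched with the standard library. TempWritingElement and its attributes are hypothetical and exist only for this example; the real implementations may track and name temporary files differently.

```python
import mimetypes
import os
import tempfile


class TempWritingElement(object):
    """Hypothetical element that tracks the temp files it creates."""

    def __init__(self, data, content_type=None):
        self._data = data
        self._content_type = content_type
        self._temp_paths = []

    def write_temp(self, temp_dir=None):
        # Guess a file extension from the content type, if one is known.
        ext = None
        if self._content_type:
            ext = mimetypes.guess_extension(self._content_type)
        # An empty string is treated the same as None (platform default).
        fd, path = tempfile.mkstemp(suffix=ext or "", dir=temp_dir or None)
        with os.fdopen(fd, "wb") as f:
            f.write(self._data)
        self._temp_paths.append(path)
        return path

    def clean_temp(self):
        # Does nothing when no temporary files have been written yet.
        for path in self._temp_paths:
            if os.path.isfile(path):
                os.remove(path)
        self._temp_paths = []
```

Keeping the list of written paths on the element is what lets the caller leave removal to clean_temp() instead of deleting the returned path directly.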
smqtk.representation.data_element.from_uri(uri, impl_generator=<function get_data_element_impls>)[source]

Create a data element instance from available plugin implementations.

The first implementation that can resolve the URI is what is returned. If no implementations can resolve the URI, an InvalidUriError is raised.

Parameters:
  • uri (str) – URI to try to resolve into a DataElement instance.
  • impl_generator (() -> dict[str, type]) – Function that returns a dictionary mapping implementation type names to the class type. By default this refers to the standard get_data_element_impls function, however this can be changed to refer to a custom set of classes if desired.
Raises:

smqtk.exceptions.InvalidUriError – No data element implementations could resolve the given URI.

Returns:

New data element instance providing access to the data pointed to by the input URI.

Return type:

DataElement
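
The "first implementation that resolves wins" behavior can be sketched as follows. All classes here are hypothetical stand-ins defined for the example (including the local exception classes and the get_example_impls generator), and the loop is an assumption about how such a resolver could work rather than the SMQTK source.

```python
class InvalidUriError(Exception):
    """Local stand-in for smqtk.exceptions.InvalidUriError."""


class NoUriResolutionError(Exception):
    """Local stand-in for the NoUriResolutionError named above."""


class FileLikeElement(object):
    """Hypothetical implementation that resolves 'file://' URIs."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_uri(cls, uri):
        if not uri.startswith("file://"):
            raise InvalidUriError(uri)
        return cls(uri[len("file://"):])


class OpaqueElement(object):
    """Hypothetical implementation without URI support."""

    @classmethod
    def from_uri(cls, uri):
        raise NoUriResolutionError()


def get_example_impls():
    # Plays the role of impl_generator: implementation name -> class.
    return {"OpaqueElement": OpaqueElement, "FileLikeElement": FileLikeElement}


def from_uri(uri, impl_generator=get_example_impls):
    for impl_type in impl_generator().values():
        try:
            return impl_type.from_uri(uri)
        except (NoUriResolutionError, InvalidUriError):
            continue  # this implementation cannot handle the URI
    raise InvalidUriError(uri)


element = from_uri("file:///tmp/example.txt")
```

Passing a different impl_generator restricts resolution to a custom set of classes, which is the purpose of that parameter as described above.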

smqtk.representation.data_element.get_data_element_impls(reload_modules=False)[source]

Discover and return discovered DataElement classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.

We search for implementation classes in:
  • modules next to the file this function is defined in (ones that begin with an alphanumeric character),

  • python modules listed in the environment variable DATA_ELEMENT_PATH

    • This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)

Within a module we first look for a helper variable by the name DATA_ELEMENT_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.

Parameters:reload_modules (bool) – Explicitly reload discovered modules from source.
Returns:Map of discovered class objects of type DataElement whose keys are the string names of the classes.
Return type:dict[str, type]
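
The discovery rules above can be approximated in a few lines. This is a sketch under stated assumptions, not the SMQTK plugin loader: modules_from_env, classes_from_module and Base are hypothetical names, module importing and error handling are omitted, and excluding the base class itself from the attribute scan is a choice made for this example.

```python
import inspect
import os
import types


class Base(object):
    """Hypothetical base class standing in for DataElement."""


def modules_from_env(var_name, environ=None):
    # Split on the platform PATH separator (';' on Windows, ':' on unix).
    environ = os.environ if environ is None else environ
    return [m for m in environ.get(var_name, "").split(os.pathsep) if m]


def classes_from_module(module, helper_var, base_class):
    if hasattr(module, helper_var):
        exported = getattr(module, helper_var)
        if exported is None:
            return {}  # the module explicitly opts out
        if inspect.isclass(exported):
            exported = [exported]  # a single class is also allowed
        return {c.__name__: c for c in exported}
    # No helper variable: scan module attributes for subclasses.
    # Skipping the base class itself is an assumption of this sketch.
    return {
        name: obj for name, obj in vars(module).items()
        if inspect.isclass(obj)
        and issubclass(obj, base_class)
        and obj is not base_class
    }


# Build a throwaway module object to exercise the lookup.
plugin = types.ModuleType("fake_plugin")
plugin.Impl = type("Impl", (Base,), {})
found_by_scan = classes_from_module(plugin, "DATA_ELEMENT_CLASS", Base)

# Setting the helper variable to None makes the module opt out.
plugin.DATA_ELEMENT_CLASS = None
found_when_disabled = classes_from_module(plugin, "DATA_ELEMENT_CLASS", Base)
```

The same pattern applies to the other get_*_impls functions in this module, with the environment variable and helper variable names changed accordingly.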

DataSet

class smqtk.representation.data_set.DataSet[source]

Abstract interface for data sets that contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.

This should only be used with DataElements whose byte content is expected not to change. If they do, then UUID keys may no longer represent the elements associated with them.

add_data(*elems)[source]

Add the given data element instance(s) to this data set.

NOTE: Implementing methods should check that input elements are in fact DataElement instances.

Parameters:elems (smqtk.representation.DataElement) – Data element(s) to add
count()[source]
Returns:The number of data elements in this set.
Return type:int
get_data(uuid)[source]

Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.

Raises:KeyError – If the given uuid does not refer to an element in this data set.
Parameters:uuid (collections.Hashable) – The uuid of the element to retrieve.
Returns:The data element instance for the given uuid.
Return type:smqtk.representation.DataElement
has_uuid(uuid)[source]

Test if the given uuid refers to an element in this data set.

Parameters:uuid (collections.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about.
Returns:True if the given uuid matches an element in this set, or False if it does not.
Return type:bool
uuids()[source]
Returns:A new set of uuids represented in this data set.
Return type:set
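
To make the DataSet contract concrete, here is a minimal dictionary-backed sketch. BytesElement and MemoryDataSet are hypothetical classes written for this example only; they do not subclass the real interfaces, and the type check uses the stand-in element class rather than DataElement.

```python
import hashlib


class BytesElement(object):
    """Hypothetical minimal element; uuid is the SHA1 hex of its bytes."""

    def __init__(self, data):
        self._data = data

    def get_bytes(self):
        return self._data

    def uuid(self):
        return hashlib.sha1(self._data).hexdigest()


class MemoryDataSet(object):
    """Hypothetical dict-backed data set keyed on element UUID."""

    def __init__(self):
        self._elements = {}

    def add_data(self, *elems):
        for e in elems:
            # Check inputs, as the interface note asks of implementations.
            if not isinstance(e, BytesElement):
                raise TypeError("Expected a BytesElement instance")
            self._elements[e.uuid()] = e

    def count(self):
        return len(self._elements)

    def get_data(self, uuid):
        return self._elements[uuid]  # raises KeyError if missing

    def has_uuid(self, uuid):
        return uuid in self._elements

    def uuids(self):
        return set(self._elements)  # a new set on every call


ds = MemoryDataSet()
first = BytesElement(b"first")
ds.add_data(first, BytesElement(b"second"), BytesElement(b"first"))
```

Since keys are content-derived UUIDs in this sketch, adding the same bytes twice leaves a single entry, which also shows why elements whose bytes change after insertion would leave stale keys behind.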
smqtk.representation.data_set.get_data_set_impls(reload_modules=False)[source]

Discover and return discovered DataSet classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.

We search for implementation classes in:
  • modules next to the file this function is defined in (ones that begin with an alphanumeric character),
  • python modules listed in the environment variable DATA_SET_PATH
    • This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)

Within a module we first look for a helper variable by the name DATA_SET_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.

Parameters:reload_modules (bool) – Explicitly reload discovered modules from source.
Returns:Map of discovered class objects of type DataSet whose keys are the string names of the classes.
Return type:dict[str, type]

DescriptorElement

class smqtk.representation.descriptor_element.DescriptorElement(type_str, uuid)[source]

Abstract descriptor vector container.

This structure supports implementations that cache descriptor vectors on a per-UUID basis.

UUIDs must remain unique when transformed into a string.

Descriptor element equality is based on shared descriptor type and vector equality. Two descriptor vectors that are generated by different types of descriptor generator should not be considered the same (though this may be up for discussion).

Stored vectors should be effectively immutable.

classmethod from_config(config_dict, type_str, uuid, merge_default=True)[source]

Instantiate a new instance of this class given the desired type, uuid, and JSON-compliant configuration dictionary.

Parameters:
  • type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.
  • uuid (collections.Hashable) – Unique ID reference of the descriptor.
  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.
Returns:

Constructed instance from the provided config.

Return type:

DescriptorElement

classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

By default, we observe what this class’s constructor takes as arguments, aside from the first two assumed positional arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns:Default configuration dictionary for the class.
Return type:dict
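
The constructor-introspection idea behind get_default_config can be sketched with the standard inspect module. This is an illustration only: default_config_for and FileBackedDescriptor are hypothetical names, inspect.signature requires Python 3, and mapping arguments without a default to None is an assumption of this sketch rather than documented behavior.

```python
import inspect


def default_config_for(cls, skip=2):
    # Parameters of __init__, excluding 'self'.
    params = list(inspect.signature(cls.__init__).parameters.values())[1:]
    config = {}
    # Skip the first positional arguments (type_str and uuid here).
    for p in params[skip:]:
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            continue
        # Assumption: arguments without a default map to None.
        config[p.name] = None if p.default is p.empty else p.default
    return config


class FileBackedDescriptor(object):
    """Hypothetical element type with two extra constructor arguments."""

    def __init__(self, type_str, uuid, save_dir, subdir_split=4):
        self.type_str = type_str
        self.uuid_value = uuid
        self.save_dir = save_dir
        self.subdir_split = subdir_split


config = default_config_for(FileBackedDescriptor)
```

The resulting dictionary has one key per configurable argument; a key left at None is one reason a default configuration may not be directly usable for construction.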
has_vector()[source]
Returns:Whether or not this container currently has a descriptor vector stored.
Return type:bool
set_vector(new_vec)[source]

Set the contained vector.

If this container already stores a descriptor vector, this will overwrite it.

Parameters:new_vec (numpy.ndarray) – New vector to contain.
Returns:Self.
Return type:DescriptorMemoryElement
type()[source]
Returns:Type label of the DescriptorGenerator that generated this vector.
Return type:str
uuid()[source]
Returns:Unique ID for this vector.
Return type:collections.Hashable
vector()[source]
Returns:Get the stored descriptor vector as a numpy array. This returns None if there is no vector stored in this container.
Return type:numpy.ndarray or None
smqtk.representation.descriptor_element.get_descriptor_element_impls(reload_modules=False)[source]

Discover and return discovered DescriptorElement classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.

We search for implementation classes in:
  • modules next to the file this function is defined in (ones that begin with an alphanumeric character),

  • python modules listed in the environment variable DESCRIPTOR_ELEMENT_PATH

    • This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)

Within a module we first look for a helper variable by the name DESCRIPTOR_ELEMENT_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.

Parameters:reload_modules (bool) – Explicitly reload discovered modules from source.
Returns:Map of discovered class objects of type DescriptorElement whose keys are the string names of the classes.
Return type:dict[str, type]

DescriptorIndex

class smqtk.representation.descriptor_index.DescriptorIndex[source]

Index of descriptors, keyed and query-able by descriptor UUID.

Note that these indexes do not use the descriptor type strings. Thus, if a set of descriptors has multiple elements with the same UUID but different type strings, they will overwrite each other in these indexes. In such a case, when dealing with descriptors for different generators, it is advisable to use multiple indices.

add_descriptor(descriptor)[source]

Add a descriptor to this index.

Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.

Parameters:descriptor (smqtk.representation.DescriptorElement) – Descriptor to index.
add_many_descriptors(descriptors)[source]

Add multiple descriptors at one time.

Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.

Parameters:descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor instances to add to this index.
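
The overwrite-by-UUID behavior, and the type-string collision mentioned in the class description, can be seen in a small dictionary-backed sketch. SimpleDescriptor and MemoryDescriptorIndex are hypothetical classes for this example, and plain lists stand in for numpy vectors to keep it dependency-free.

```python
class SimpleDescriptor(object):
    """Hypothetical descriptor: a type label, a UUID and a vector."""

    def __init__(self, type_str, uuid, vector):
        self._type_str = type_str
        self._uuid = uuid
        self._vector = vector

    def type(self):
        return self._type_str

    def uuid(self):
        return self._uuid

    def vector(self):
        return self._vector


class MemoryDescriptorIndex(object):
    """Hypothetical index keyed only on descriptor UUID."""

    def __init__(self):
        self._table = {}

    def add_descriptor(self, descriptor):
        # Same UUID: the new descriptor replaces the stored one.
        self._table[descriptor.uuid()] = descriptor

    def add_many_descriptors(self, descriptors):
        for d in descriptors:
            self.add_descriptor(d)

    def count(self):
        return len(self._table)

    def get_descriptor(self, uuid):
        return self._table[uuid]  # raises KeyError if missing


a = SimpleDescriptor("generatorA", "abc", [0.1, 0.2])
b = SimpleDescriptor("generatorB", "abc", [0.9, 0.8])

# One shared index: the two descriptors collide on UUID "abc".
shared = MemoryDescriptorIndex()
shared.add_many_descriptors([a, a, b])

# One index per type string keeps both descriptors.
per_type = {}
for d in (a, b):
    per_type.setdefault(d.type(), MemoryDescriptorIndex()).add_descriptor(d)
```

In the shared index only the last-added descriptor survives, which is the situation the advice about using multiple indices is meant to avoid.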
clear()[source]

Clear this descriptor index’s entries.

count()[source]
Returns:Number of descriptor elements stored in this index.
Return type:int
get_descriptor(uuid)[source]

Get the descriptor in this index that is associated with the given UUID.

Parameters:uuid (collections.Hashable) – UUID of the DescriptorElement to get.
Raises:KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.
Returns:DescriptorElement associated with the queried UUID.
Return type:smqtk.representation.DescriptorElement
get_many_descriptors(uuids)[source]

Get an iterator over descriptors associated to given descriptor UUIDs.

Parameters:uuids (collections.Iterable[collections.Hashable]) – Iterable of descriptor UUIDs to query for.
Raises:KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.
Returns:Iterator of descriptors associated to given uuid values.
Return type:collections.Iterable[smqtk.representation.DescriptorElement]
has_descriptor(uuid)[source]

Check if a DescriptorElement with the given UUID exists in this index.

Parameters:uuid (collections.Hashable) – UUID to query for
Returns:True if a DescriptorElement with the given UUID exists in this index, or False if not.
Return type:bool
items()[source]

Alias for iteritems.

iterdescriptors()[source]

Return an iterator over indexed descriptor element instances.

Return type:collections.Iterator[smqtk.representation.DescriptorElement]

iteritems()[source]

Return an iterator over indexed descriptor key and instance pairs.

Return type:collections.Iterator[(collections.Hashable, smqtk.representation.DescriptorElement)]
iterkeys()[source]

Return an iterator over indexed descriptor keys, which are their UUIDs.

Return type:collections.Iterator[collections.Hashable]

keys()[source]

Alias for iterkeys.

remove_descriptor(uuid)[source]

Remove a descriptor from this index by the given UUID.

Parameters:uuid (collections.Hashable) – UUID of the DescriptorElement to remove.
Raises:KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.
remove_many_descriptors(uuids)[source]

Remove descriptors associated to given descriptor UUIDs from this index.

Parameters:uuids (collections.Iterable[collections.Hashable]) – Iterable of descriptor UUIDs to remove.
Raises:KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.
smqtk.representation.descriptor_index.get_descriptor_index_impls(reload_modules=False)[source]

Discover and return discovered DescriptorIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.

We search for implementation classes in:
  • modules next to the file this function is defined in (ones that begin with an alphanumeric character),

  • python modules listed in the environment variable DESCRIPTOR_INDEX_PATH

    • This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)

Within a module we first look for a helper variable by the name DESCRIPTOR_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.

Parameters:reload_modules (bool) – Explicitly reload discovered modules from source.
Returns:Map of discovered class objects of type DescriptorIndex whose keys are the string names of the classes.
Return type:dict[str, type]

Data Support Structures

Other data structures are provided in the [smqtk.representation](/python/smqtk/representation) module to assist with the use of the structures described above:

DescriptorElementFactory

class smqtk.representation.descriptor_element_factory.DescriptorElementFactory(d_type, type_config)[source]

Factory class for producing DescriptorElement instances of a specified type and configuration.

classmethod from_config(config_dict, merge_default=True)[source]

Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.

This method should not be called via super unless an instance of the class is desired.

Parameters:
  • config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
  • merge_default (bool) – Merge the given configuration on top of the default provided by get_default_config.
Returns:

Constructed instance from the provided config.

Return type:

DescriptorElementFactory

get_config()[source]

Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.

In the common case, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion.

Returns:JSON type compliant configuration dictionary.
Return type:dict
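
The relationship between get_default_config, from_config and get_config can be shown with a small round trip. ConfigurableThing and its two parameters are hypothetical, and the shallow dictionary merge used for merge_default is an assumption of this sketch; the library's own merge may behave differently for nested dictionaries.

```python
import json


class ConfigurableThing(object):
    """Hypothetical configurable class following the described pattern."""

    def __init__(self, cache_dir="/tmp/cache", max_items=100):
        self.cache_dir = cache_dir
        self.max_items = max_items

    @classmethod
    def get_default_config(cls):
        return {"cache_dir": "/tmp/cache", "max_items": 100}

    @classmethod
    def from_config(cls, config_dict, merge_default=True):
        if merge_default:
            # Assumption: a shallow merge on top of the defaults.
            merged = cls.get_default_config()
            merged.update(config_dict)
            config_dict = merged
        # Keys match constructor argument names, so expansion works.
        return cls(**config_dict)

    def get_config(self):
        return {"cache_dir": self.cache_dir, "max_items": self.max_items}


original = ConfigurableThing.from_config({"max_items": 5})

# The configuration survives a JSON round trip and rebuilds an equal setup.
as_json = json.dumps(original.get_config())
clone = ConfigurableThing.from_config(json.loads(as_json))
```

Keeping the dictionary JSON-compliant is what allows configurations to be written to and loaded from plain configuration files.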
classmethod get_default_config()[source]

Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

Returns:Default configuration dictionary for the class.
Return type:dict
new_descriptor(type_str, uuid)[source]

Create a new DescriptorElement instance of the configured implementation.

Parameters:
  • type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.
  • uuid (collections.Hashable) – UUID to associate with the descriptor
Returns:

New DescriptorElement instance

Return type:

smqtk.representation.DescriptorElement
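
The factory idea, binding an element type and its configuration once and then stamping out instances that differ only in type string and UUID, can be sketched as below. ExampleDescriptor, ExampleFactory and the scale parameter are hypothetical; the assumption that the factory delegates to the element class's from_config is made for illustration and is not confirmed by the text above.

```python
class ExampleDescriptor(object):
    """Hypothetical element type with one configurable argument."""

    def __init__(self, type_str, uuid, scale=1.0):
        self.type_str = type_str
        self.uuid_value = uuid
        self.scale = scale

    @classmethod
    def from_config(cls, config_dict, type_str, uuid):
        # Configuration keys are expanded into constructor arguments.
        return cls(type_str, uuid, **config_dict)


class ExampleFactory(object):
    """Hypothetical factory mirroring the (d_type, type_config) signature."""

    def __init__(self, d_type, type_config):
        self._d_type = d_type
        self._type_config = type_config

    def new_descriptor(self, type_str, uuid):
        # Assumption: delegate construction to the element class.
        return self._d_type.from_config(self._type_config, type_str, uuid)


factory = ExampleFactory(ExampleDescriptor, {"scale": 2.0})
d1 = factory.new_descriptor("exampleGenerator", "uuid-1")
d2 = factory.new_descriptor("exampleGenerator", "uuid-2")
```

Code that produces descriptors then only needs the factory, and the storage backend can be swapped by changing the factory's configuration.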