Data Abstraction¶
An important part of any algorithm is the data it works over and the data it produces.
An important part of working with large scales of data is where the data is stored and how it is accessed.
The smqtk.representation
module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed.
This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.
-
class
smqtk.representation.
SmqtkRepresentation
[source]¶ Interface for data representation interfaces and implementations.
Data should be serializable, so this interface adds abstract methods for serializing and de-serializing SMQTK data representation instances.
Data Representation Structures¶
The following are the core data representation interfaces.
- Note:
- It is required that implementations have a common serialization format so that they may be stored or transported by other structures in a general way without caring what the specific implementation is.
For this we require that all implementations be serializable via the
pickle
(and thus cPickle
) module functions.
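As a minimal sketch of what this requirement means in practice, the class below is a hypothetical stand-in (not an SMQTK class) that exposes plain, picklable state and survives a round trip through the standard pickle module. The class name and its attributes are assumptions made for illustration.

```python
import pickle


class InMemoryElementSketch(object):
    """Hypothetical stand-in for a representation implementation."""

    def __init__(self, data):
        self._data = data

    def __getstate__(self):
        # Export only plain, picklable state.
        return {"data": self._data}

    def __setstate__(self, state):
        # Rebuild the instance from the exported state.
        self._data = state["data"]

    def get_bytes(self):
        return self._data


original = InMemoryElementSketch(b"example bytes")
# Any structure that stores or transports the instance can do so
# without knowing which implementation it is holding.
restored = pickle.loads(pickle.dumps(original))
```

An implementation that wraps an unpicklable resource (an open socket or file handle, for example) would need to drop it in `__getstate__` and re-open it lazily after `__setstate__`.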
DataElement¶
-
class
smqtk.representation.data_element.
DataElement
[source]¶ Abstract interface for a byte data container.
The primary “value” of a
DataElement
is the wrapped byte content. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement
instances are not considered generally hashable. Specific implementations may define a __hash__
method if that implementation reflects a data source that guarantees immutability.
UUIDs should be castable to a string and remain unique after conversion.
-
clean_temp
()[source]¶ Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.
-
content_type
()[source]¶ Returns: Standard type/subtype string for this data element, or None if the content type is unknown. Return type: str or None
-
classmethod
from_uri
(uri)[source]¶ Construct a new instance based on the given URI.
This function may not be implemented for all DataElement types.
Parameters: uri (str) – URI string to resolve into an element instance
Raises: - NoUriResolutionError – This element type does not implement URI resolution.
- smqtk.exceptions.InvalidUriError – This element type could not resolve the provided URI string.
Returns: New element instance of our type.
Return type:
-
is_empty
()[source]¶ Check if this element contains no bytes.
The intent of this method is to quickly check if there is any data behind this element, ideally without having to read all/any of the underlying data.
Returns: If this element contains 0 bytes. Return type: bool
-
md5
()[source]¶ Get the MD5 checksum of this element’s binary content.
Returns: MD5 hex checksum of the data content. Return type: str
-
set_bytes
(b)[source]¶ Set bytes to this data element.
Not all implementations may support setting bytes (check
writable
method return).
This base abstract method should be called by sub-class implementations first. We check for mutability based on
writable()
method return and invalidate checksum caches.
Parameters: b (str) – bytes to set.
Raises: ReadOnlyError – This data element can only be read from / does not support writing.
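The behaviour described for set_bytes (check writability first, then drop cached checksums) can be sketched as follows. This is an illustration only: `ReadOnlyError` is redefined locally as a stand-in, and `SketchElement` with its single SHA1 cache is a hypothetical class, not the SMQTK implementation.

```python
import hashlib


class ReadOnlyError(Exception):
    """Local stand-in for the exception named in the text."""


class SketchElement(object):
    """Hypothetical element that caches its SHA1 checksum."""

    def __init__(self, data=b"", writable=True):
        self._data = data
        self._writable = writable
        self._sha1_cache = None

    def writable(self):
        return self._writable

    def sha1(self):
        # Compute once, then reuse until the bytes change.
        if self._sha1_cache is None:
            self._sha1_cache = hashlib.sha1(self._data).hexdigest()
        return self._sha1_cache

    def set_bytes(self, b):
        # Refuse the write before touching any state.
        if not self.writable():
            raise ReadOnlyError("This element does not support writing.")
        # The cached checksum no longer describes the content.
        self._sha1_cache = None
        self._data = b
```

Invalidating the cache before storing the new bytes keeps a stale checksum from being reported after a write.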
-
sha1
()[source]¶ Get the SHA1 checksum of this element’s binary content.
Returns: SHA1 hex checksum of the data content. Return type: str
-
sha512
()[source]¶ Get the SHA512 checksum of this element’s binary content.
Returns: SHA512 hex checksum of the data content. Return type: str
-
to_buffered_reader
()[source]¶ Wrap this element’s bytes in a
io.BufferedReader
instance for use as a file-like object for reading.
As we use the
get_bytes
function, this element’s bytes must safely fit in memory for this method to be usable.
Returns: New BufferedReader instance Return type: io.BufferedReader
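A minimal sketch of the wrapping described here, using only the standard library. The free function below is a hypothetical helper that takes the bytes directly; in the documented interface the bytes would come from the element's own get_bytes call.

```python
import io


def to_buffered_reader_sketch(data):
    # The whole payload is held in memory by BytesIO, which is why
    # the element's bytes must fit in memory for this to be usable.
    return io.BufferedReader(io.BytesIO(data))


reader = to_buffered_reader_sketch(b"line one\nline two\n")
first_line = reader.readline()
rest = reader.read()
```

The returned reader can be handed to any API that expects a readable binary file object.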
-
uuid
()[source]¶ UUID for this data element.
This may take different forms, from integers to strings to a uuid.UUID instance. This must return a hashable data type.
By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.
Returns: UUID value for this data element. This return value should be hashable. Return type: collections.Hashable
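The default described above (the hex string of the SHA1 hash of the element's bytes) can be reproduced with the standard hashlib module. The helper name below is hypothetical and exists only for illustration.

```python
import hashlib


def default_uuid_sketch(data):
    # Hex digest of the SHA1 hash: a 40-character str, which is
    # hashable and stays unique when cast to a string.
    return hashlib.sha1(data).hexdigest()


uid = default_uuid_sketch(b"hello")
# Hashable values can be used directly as dictionary keys.
lookup = {uid: "some element"}
```

Because the value is derived from content, two elements with identical bytes receive the same UUID, and an element whose bytes change receives a different one.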
-
write_temp
(temp_dir=None)[source]¶ Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.
It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.
- NOTE:
- The file path returned should not be explicitly removed by the user.
Instead, the
clean_temp()
method should be called on this object.
Parameters: temp_dir (None or str) – Optional directory to write temporary file in, otherwise we use the platform default temporary files directory. If this is an empty string, we count it the same as having provided None. Returns: Path to the temporary file Return type: str
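The write_temp / clean_temp pairing can be sketched with the standard tempfile and mimetypes modules. `TempWritingElementSketch` is a hypothetical class; how the real implementations track their temporary files is not stated in the text, so the list of paths kept here is an assumption.

```python
import mimetypes
import os
import tempfile


class TempWritingElementSketch(object):
    """Hypothetical element that tracks the temp files it creates."""

    def __init__(self, data, content_type=None):
        self._data = data
        self._content_type = content_type
        self._temp_paths = []

    def write_temp(self, temp_dir=None):
        # An empty string counts the same as None.
        temp_dir = temp_dir or None
        ext = ""
        if self._content_type:
            # Guess the extension from the content type, if possible.
            ext = mimetypes.guess_extension(self._content_type) or ""
        fd, path = tempfile.mkstemp(suffix=ext, dir=temp_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(self._data)
        self._temp_paths.append(path)
        return path

    def clean_temp(self):
        # Does nothing if no temporary files were written.
        for path in self._temp_paths:
            if os.path.isfile(path):
                os.remove(path)
        self._temp_paths = []
```

Callers use the returned path for reading and then call `clean_temp()` on the element rather than deleting the file themselves.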
-
-
smqtk.representation.data_element.
from_uri
(uri, impl_generator=<function get_data_element_impls>)[source]¶ Create a data element instance from available plugin implementations.
The first implementation that can resolve the URI is what is returned. If no implementations can resolve the URI, an
InvalidUriError
is raised.
Parameters: - uri (str) – URI to try to resolve into a DataElement instance.
- impl_generator (() -> dict[str, type]) – Function that returns a dictionary mapping
implementation type names to the class type. By default this refers to
the standard
get_data_element_impls
function, however this can be changed to refer to a custom set of classes if desired.
Raises: smqtk.exceptions.InvalidUriError – No data element implementations could resolve the given URI.
Returns: New data element instance providing access to the data pointed to by the input URI.
Return type:
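The dispatch described for the module-level from_uri can be sketched as a loop over candidate classes. Everything below is hypothetical: the two exception classes are local stand-ins for the ones named in the text, the two element classes are invented for illustration, and iterating in sorted name order is an assumption, since the text does not say in which order implementations are tried.

```python
class InvalidUriError(Exception):
    """Local stand-in: the URI could not be resolved."""


class NoUriResolutionError(Exception):
    """Local stand-in: the type does not implement URI resolution."""


class FileSketch(object):
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_uri(cls, uri):
        if not uri.startswith("file://"):
            raise InvalidUriError(uri)
        return cls(uri[len("file://"):])


class MemorySketch(object):
    @classmethod
    def from_uri(cls, uri):
        # This type opts out of URI resolution entirely.
        raise NoUriResolutionError()


def get_sketch_impls():
    return {"FileSketch": FileSketch, "MemorySketch": MemorySketch}


def from_uri_sketch(uri, impl_generator=get_sketch_impls):
    # Try each implementation; the first one that resolves wins.
    for _name, impl in sorted(impl_generator().items()):
        try:
            return impl.from_uri(uri)
        except (NoUriResolutionError, InvalidUriError):
            continue
    raise InvalidUriError(uri)


elem = from_uri_sketch("file:///tmp/example.txt")
```

Passing a different `impl_generator` restricts resolution to a custom set of classes, which is the purpose of that parameter in the documented signature.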
-
smqtk.representation.data_element.
get_data_element_impls
(reload_modules=False)[source]¶ Discover and return discovered
DataElement
classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable DATA_ELEMENT_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name DATA_ELEMENT_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class objects of type DataElement whose keys are the string names of the classes.
Return type: dict[str, type]
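The per-module export rule and the environment variable parsing described above can be sketched as two small helpers. This is not the SMQTK plugin loader: the function names, `BaseSketch`, and the fake modules built with `types.ModuleType` are all hypothetical, and excluding the base class itself from the attribute scan is an assumption.

```python
import os
import types


class BaseSketch(object):
    """Hypothetical base class standing in for DataElement."""


def classes_from_module(module, helper_var, base):
    """Apply the export rule to one already-imported module."""
    if hasattr(module, helper_var):
        exported = getattr(module, helper_var)
        if exported is None:
            # Explicit opt-out: skip this module.
            return {}
        if isinstance(exported, type):
            exported = [exported]
        return {cls.__name__: cls for cls in exported}
    # No helper variable: scan attributes for subclasses of base.
    found = {}
    for value in vars(module).values():
        if (isinstance(value, type) and issubclass(value, base)
                and value is not base):
            found[value.__name__] = value
    return found


def module_names_from_env(var_name, environ=None):
    """Split a PATH-style variable into module specifications."""
    environ = os.environ if environ is None else environ
    raw = environ.get(var_name, "")
    return [name for name in raw.split(os.pathsep) if name]


class ImplA(BaseSketch):
    pass


class ImplB(BaseSketch):
    pass


# Module that names its export explicitly.
explicit = types.ModuleType("explicit")
explicit.ImplA, explicit.ImplB = ImplA, ImplB
explicit.DATA_ELEMENT_CLASS = ImplA

# Module without the helper variable.
implicit = types.ModuleType("implicit")
implicit.ImplA, implicit.ImplB = ImplA, ImplB

# Module that opts out by setting the helper to None.
disabled = types.ModuleType("disabled")
disabled.ImplA = ImplA
disabled.DATA_ELEMENT_CLASS = None
```

Using `os.pathsep` gives `;` on Windows and `:` on unix, matching the separator rule stated for the environment variable.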
DataSet¶
-
class
smqtk.representation.data_set.
DataSet
[source]¶ Abstract interface for data sets, that contain an arbitrary number of
DataElement
instances of arbitrary implementation type, keyed on DataElement
UUID values.
This should only be used with DataElements whose byte content is expected not to change. If it does, then UUID keys may no longer represent the elements associated with them.
-
add_data
(*elems)[source]¶ Add the given data element instance(s) to this data set.
NOTE: Implementing methods should check that input elements are in fact DataElement instances.
Parameters: elems (smqtk.representation.DataElement) – Data element(s) to add
-
get_data
(uuid)[source]¶ Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.
Raises: KeyError – If the given uuid does not refer to an element in this data set. Parameters: uuid (collections.Hashable) – The uuid of the element to retrieve. Returns: The data element instance for the given uuid. Return type: smqtk.representation.DataElement
-
has_uuid
(uuid)[source]¶ Test if the given uuid refers to an element in this data set.
Parameters: uuid (collections.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about. Returns: True if the given uuid matches an element in this set, or False if it does not. Return type: bool
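A minimal in-memory sketch of the three methods documented above, assuming a dictionary keyed on element UUIDs. Both classes are hypothetical stand-ins written for illustration; they are not the SMQTK implementations.

```python
import hashlib


class ElementSketch(object):
    """Hypothetical element whose UUID is the SHA1 hex of its bytes."""

    def __init__(self, data):
        self._data = data

    def uuid(self):
        return hashlib.sha1(self._data).hexdigest()


class MemoryDataSetSketch(object):
    """Hypothetical data set backed by a plain dictionary."""

    def __init__(self):
        self._elements = {}

    def add_data(self, *elems):
        for e in elems:
            # Check that inputs really are element instances.
            if not isinstance(e, ElementSketch):
                raise TypeError("Expected an ElementSketch instance.")
            self._elements[e.uuid()] = e

    def get_data(self, uuid):
        # A missing key raises KeyError, as documented.
        return self._elements[uuid]

    def has_uuid(self, uuid):
        return uuid in self._elements


data_set = MemoryDataSetSketch()
first, second = ElementSketch(b"first"), ElementSketch(b"second")
data_set.add_data(first, second)
```

Since the key is computed once at insertion time, an element whose bytes later change would remain filed under its old UUID, which is the hazard the class description warns about.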
-
-
smqtk.representation.data_set.
get_data_set_impls
(reload_modules=False)[source]¶ Discover and return discovered
DataSet
classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable DATA_SET_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name DATA_SET_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class objects of type DataSet whose keys are the string names of the classes.
Return type: dict[str, type]
DescriptorElement¶
-
class
smqtk.representation.descriptor_element.
DescriptorElement
(type_str, uuid)[source]¶ Abstract descriptor vector container.
This structure supports implementations that cache descriptor vectors on a per-UUID basis.
UUIDs must maintain unique-ness when transformed into a string.
Descriptor element equality is based on shared descriptor type and vector equality. Two descriptor vectors that are generated by different types of descriptor generator should not be considered the same (though this may be up for discussion).
Stored vectors should be effectively immutable.
-
classmethod
from_config
(config_dict, type_str, uuid, merge_default=True)[source]¶ Instantiate a new instance of this class given the desired type, uuid, and JSON-compliant configuration dictionary.
Parameters: - type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.
- uuid (collections.Hashable) – Unique ID reference of the descriptor.
- config_dict (dict) – JSON compliant dictionary encapsulating a configuration.
- merge_default (bool) – Merge the given configuration on top of the
default provided by
get_default_config
.
Returns: Constructed instance from the provided config.
Return type:
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
By default, we observe what this class’s constructor takes as arguments, aside from the first two assumed positional arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON compliant value types.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
Returns: Default configuration dictionary for the class. Return type: dict
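The constructor introspection described for get_default_config, and the merge step described for from_config, can be sketched with the standard inspect module. This is a Python 3 illustration under stated assumptions: the helper names and `FileBackedSketch` are hypothetical, and mapping arguments without defaults to None is an assumption, since the text does not say what value they receive.

```python
import inspect


def default_config_sketch(cls):
    params = list(inspect.signature(cls.__init__).parameters.values())
    config = {}
    # Skip `self` plus the two leading positional arguments
    # (type_str and uuid in the documented constructor).
    for p in params[3:]:
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            continue
        # Assumption: arguments without a default map to None.
        config[p.name] = None if p.default is p.empty else p.default
    return config


def from_config_sketch(cls, config_dict, type_str, uuid,
                       merge_default=True):
    if merge_default:
        # Lay the given configuration on top of the defaults.
        merged = default_config_sketch(cls)
        merged.update(config_dict)
        config_dict = merged
    return cls(type_str, uuid, **config_dict)


class FileBackedSketch(object):
    """Hypothetical descriptor element with extra constructor args."""

    def __init__(self, type_str, uuid, save_dir, overwrite=False):
        self.type_str = type_str
        self.uuid_value = uuid
        self.save_dir = save_dir
        self.overwrite = overwrite


defaults = default_config_sketch(FileBackedSketch)
elem = from_config_sketch(
    FileBackedSketch, {"save_dir": "/tmp/descriptors"}, "gen", "id-0")
```

The `save_dir` entry shows why a default configuration is not guaranteed to be valid for construction: a placeholder value for a required argument may not be acceptable to the constructor.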
-
has_vector
()[source]¶ Returns: Whether or not this container currently has a descriptor vector stored. Return type: bool
-
set_vector
(new_vec)[source]¶ Set the contained vector.
If this container already stores a descriptor vector, this will overwrite it.
Parameters: new_vec (numpy.ndarray) – New vector to contain. Returns: Self. Return type: DescriptorMemoryElement
-
-
smqtk.representation.descriptor_element.
get_descriptor_element_impls
(reload_modules=False)[source]¶ Discover and return discovered
DescriptorElement
classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable DESCRIPTOR_ELEMENT_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name DESCRIPTOR_ELEMENT_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class objects of type DescriptorElement whose keys are the string names of the classes.
Return type: dict[str, type]
DescriptorIndex¶
-
class
smqtk.representation.descriptor_index.
DescriptorIndex
[source]¶ Index of descriptors, keyed and query-able by descriptor UUID.
Note that these indexes do not use the descriptor type strings. Thus, if a set of descriptors has multiple elements with the same UUID but different type strings, they will overwrite each other in these indexes. In such a case, when dealing with descriptors from different generators, it is advisable to use multiple indexes.
-
add_descriptor
(descriptor)[source]¶ Add a descriptor to this index.
Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.
Parameters: descriptor (smqtk.representation.DescriptorElement) – Descriptor to index.
-
add_many_descriptors
(descriptors)[source]¶ Add multiple descriptors at one time.
Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.
Parameters: descriptors (collections.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor instances to add to this index.
-
get_descriptor
(uuid)[source]¶ Get the descriptor in this index that is associated with the given UUID.
Parameters: uuid (collections.Hashable) – UUID of the DescriptorElement to get. Raises: KeyError – The given UUID doesn’t associate to a DescriptorElement in this index. Returns: DescriptorElement associated with the queried UUID. Return type: smqtk.representation.DescriptorElement
-
get_many_descriptors
(uuids)[source]¶ Get an iterator over descriptors associated to given descriptor UUIDs.
Parameters: uuids (collections.Iterable[collections.Hashable]) – Iterable of descriptor UUIDs to query for. Raises: KeyError – A given UUID doesn’t associate with a DescriptorElement in this index. Returns: Iterator of descriptors associated to given uuid values. Return type: collections.Iterable[smqtk.representation.DescriptorElement]
-
has_descriptor
(uuid)[source]¶ Check if a DescriptorElement with the given UUID exists in this index.
Parameters: uuid (collections.Hashable) – UUID to query for Returns: True if a DescriptorElement with the given UUID exists in this index, or False if not. Return type: bool
-
iterdescriptors
()[source]¶ Return an iterator over indexed descriptor element instances. Return type: collections.Iterator[smqtk.representation.DescriptorElement]
-
iteritems
()[source]¶ Return an iterator over indexed descriptor key and instance pairs. Return type: collections.Iterator[(collections.Hashable, smqtk.representation.DescriptorElement)]
-
iterkeys
()[source]¶ Return an iterator over indexed descriptor keys, which are their UUIDs. Return type: collections.Iterator[collections.Hashable]
-
remove_descriptor
(uuid)[source]¶ Remove a descriptor from this index by the given UUID.
Parameters: uuid (collections.Hashable) – UUID of the DescriptorElement to remove. Raises: KeyError – The given UUID doesn’t associate to a DescriptorElement in this index.
-
remove_many_descriptors
(uuids)[source]¶ Remove descriptors associated to given descriptor UUIDs from this index.
Parameters: uuids (collections.Iterable[collections.Hashable]) – Iterable of descriptor UUIDs to remove. Raises: KeyError – A given UUID doesn’t associate with a DescriptorElement in this index.
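The index semantics documented above (keyed on UUID only, later additions overwrite earlier ones, KeyError on unknown UUIDs) can be sketched with a dictionary. Both classes are hypothetical stand-ins, not the SMQTK implementations, and the descriptor vector is a plain list here to keep the example free of third-party imports.

```python
class DescriptorSketch(object):
    """Hypothetical descriptor with a type string, UUID and vector."""

    def __init__(self, type_str, uuid, vector=None):
        self.type_str = type_str
        self._uuid = uuid
        self.vector = vector

    def uuid(self):
        return self._uuid


class MemoryIndexSketch(object):
    """Hypothetical index backed by a plain dictionary."""

    def __init__(self):
        self._table = {}

    def add_descriptor(self, descriptor):
        # Keyed on UUID only: the type string plays no part.
        self._table[descriptor.uuid()] = descriptor

    def add_many_descriptors(self, descriptors):
        for d in descriptors:
            self.add_descriptor(d)

    def has_descriptor(self, uuid):
        return uuid in self._table

    def get_descriptor(self, uuid):
        return self._table[uuid]

    def get_many_descriptors(self, uuids):
        # Generator: a KeyError surfaces when the bad UUID is reached.
        for u in uuids:
            yield self._table[u]

    def remove_descriptor(self, uuid):
        del self._table[uuid]

    def remove_many_descriptors(self, uuids):
        for u in uuids:
            self.remove_descriptor(u)

    def iterkeys(self):
        return iter(self._table.keys())

    def iterdescriptors(self):
        return iter(self._table.values())

    def iteritems(self):
        return iter(self._table.items())


index = MemoryIndexSketch()
index.add_descriptor(DescriptorSketch("generator_a", "id-1", [0.0, 1.0]))
# Same UUID, different type string: the earlier entry is replaced.
index.add_descriptor(DescriptorSketch("generator_b", "id-1", [2.0, 3.0]))
```

The last two lines demonstrate the collision the class description warns about, which is why separate indexes per generator are advised.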
-
-
smqtk.representation.descriptor_index.
get_descriptor_index_impls
(reload_modules=False)[source]¶ Discover and return discovered
DescriptorIndex
classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
We search for implementation classes in:
- modules next to the file this function is defined in (ones that begin with an alphanumeric character),
- python modules listed in the environment variable DESCRIPTOR_INDEX_PATH
  - This variable should contain a sequence of python module specifications, separated by the platform specific PATH separator character (; for Windows, : for unix)
Within a module we first look for a helper variable by the name DESCRIPTOR_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
Parameters: reload_modules (bool) – Explicitly reload discovered modules from source.
Returns: Map of discovered class objects of type DescriptorIndex whose keys are the string names of the classes.
Return type: dict[str, type]
Data Support Structures¶
Other data structures are provided in the smqtk.representation module to assist with the use of the above described structures:
DescriptorElementFactory¶
-
class
smqtk.representation.descriptor_element_factory.
DescriptorElementFactory
(d_type, type_config)[source]¶ Factory class for producing DescriptorElement instances of a specified type and configuration.
-
classmethod
from_config
(config_dict, merge_default=True)[source]¶ Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments.
This method should not be called via super unless an instance of the class is desired.
Parameters: Returns: Constructed instance from the provided config.
Return type:
-
get_config
()[source]¶ Return a JSON-compliant dictionary that could be passed to this class’s
from_config
method to produce an instance with identical configuration.
In the common case, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion.
Returns: JSON type compliant configuration dictionary. Return type: dict
-
classmethod
get_default_config
()[source]¶ Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
Returns: Default configuration dictionary for the class. Return type: dict
-
new_descriptor
(type_str, uuid)[source]¶ Create a new DescriptorElement instance of the configured implementation.
Parameters: - type_str (str) – Type of descriptor. This is usually the name of the content descriptor that generated this vector.
- uuid (collections.Hashable) – UUID to associate with the descriptor
Returns: New DescriptorElement instance
Return type: smqtk.representation.DescriptorElement
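The factory pattern described here (fix an implementation type and its configuration once, then produce elements from a type string and UUID) can be sketched as below. Both classes are hypothetical, and passing the configuration to the constructor as keyword arguments is an assumption about how the configured type is instantiated.

```python
class MemoryDescriptorSketch(object):
    """Hypothetical descriptor element with one configurable option."""

    def __init__(self, type_str, uuid, scale=1.0):
        self.type_str = type_str
        self._uuid = uuid
        self.scale = scale

    def uuid(self):
        return self._uuid


class FactorySketch(object):
    """Hypothetical factory bound to one type and configuration."""

    def __init__(self, d_type, type_config):
        self._d_type = d_type
        # Copy so later changes to the caller's dict have no effect.
        self._d_type_config = dict(type_config)

    def get_config(self):
        return {
            "type": self._d_type.__name__,
            "config": dict(self._d_type_config),
        }

    def new_descriptor(self, type_str, uuid):
        # Assumption: the configuration maps onto constructor kwargs.
        return self._d_type(type_str, uuid, **self._d_type_config)


factory = FactorySketch(MemoryDescriptorSketch, {"scale": 2.0})
first = factory.new_descriptor("generator_a", "id-1")
second = factory.new_descriptor("generator_a", "id-2")
```

Code that generates descriptors then depends only on the factory, so the storage backend can be swapped by constructing the factory with a different type and configuration.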
-
classmethod