Interactive Query Refinement or "IQR" is a process whereby a user provides an exemplar image or images and a system attempts to locate additional images from an archive that are similar to the exemplar(s). The user then adjudicates the results by identifying those results that match their search and those that do not. The system then uses those adjudications to attempt to provide better, more closely matching results refined by the user's input.
The IQR application is an excellent example application for SMQTK as it makes use of a broad spectrum of SMQTK's capabilities. The DescriptorGenerator algorithm is used to characterize each image in the archive so that it can be indexed. The NearestNeighborsIndex is used to understand the relationship between the images in the archive, and the RelevancyIndex is used to rank results based on the user's positive and negative adjudications. SMQTK comes with a web-based application that implements an IQR system using SMQTK's services, as shown in the SMQTK IQR Workflow figure.
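Conceptually, one round of IQR composes these three interfaces as in the following sketch. This is not the application's actual code: the generator, nn_index and rel_index variables stand for configured plugin implementations, and only interface methods documented later in this guide are used.

    def iqr_round(generator, nn_index, rel_index, exemplar_element, pos, neg, n=100):
        # Describe the exemplar image (DescriptorGenerator).
        query = generator.compute_descriptor(exemplar_element)
        # Pull candidate neighbors from the archive index (NearestNeighborsIndex).
        neighbors, _distances = nn_index.nn(query, n=n)
        # Rank candidates against the user's positive/negative adjudications
        # (RelevancyIndex).
        rel_index.build_index(neighbors)
        relevancy = rel_index.rank(pos, neg)
        # Most relevant results first.
        return sorted(relevancy.items(), key=lambda kv: kv[1], reverse=True)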
Running the IQR Application
In order to run the IQR demonstration application, you will need an archive of imagery. SMQTK has facilities for creating indexes that scale from tens of thousands to millions of images, but we'll be using simpler implementations for this example. As a result, we'll use a modest archive of images. The Leeds Butterfly Dataset will serve quite nicely. Download and unzip the archive (which contains over 800 images of different species of butterflies).
SMQTK comes with a script that computes the descriptors for all of the images in your archive and builds up the models needed by the Nearest Neighbors and Relevancy indices:
Train and generate models for the SMQTK IQR Application.

usage: iqr_app_model_generation [-h] -c CONFIG [-t TAB] [-v] [GLOB [GLOB ...]]

Positional Arguments

GLOB
    Shell glob to files to add to the configured data set.

Named Arguments

-c, --config
    IQR application configuration file.

-t, --tab
    The configuration tab to generate the model for. Default: 0

-v, --verbose
    Show debug logging. Default: False
The CONFIG argument specifies a JSON file that provides a configuration block for each of the SMQTK algorithms (DescriptorGenerator, NearestNeighborsIndex, etc.) required to generate the models and indices that will be required by the application. For convenience, the same CONFIG file will be provided to the web application when it is run.
The SMQTK source repository contains a sample configuration file. It can be found in source/python/smqtk/web/search_app/config.IqrSearchApp.json. The configuration is designed to run from an empty directory and will create the sub-directories and files that it requires when run.
Since this configuration file drives both the generation of the models for the application and the application itself, a closer examination of it is in order.
As a JSON file, the configuration consists of a collection of JSON objects that are used to configure various aspects of the application. Lines 2, 73 and 77 introduce blocks that configure the way
the application itself works: setting the username and password, the location of the MongoDB server that the application uses for storing session information and finally the IP address and port that
the application listens on.
The array of "tabs" that starts at line 7 is the core of the configuration file. We'll talk in a moment about why this is an array of tabs, but for now we'll examine the single element in the array.
The blocks introduced at lines 26, 39, and 77 configure the three main algorithms used by the application: the descriptor used, the nearest neighbors index, and the relevancy index.
Each of these blocks is passed to the SMQTK plugin system to create the appropriate instance of the algorithm in question.
For example, the nn_index block that starts at line 39 defines the parameters for two different implementations: an LSHNearestNeighborIndex, configured to use Iterative Quantization (paper), to generate an index, and a FlannNearestNeighborsIndex, which uses the Flann library to do so. The type element on line 75 selects FlannNearestNeighborsIndex as the active implementation for this configuration.
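Schematically, the nn_index block has the following shape. This is a sketch only: the per-implementation parameter blocks are elided, and only the type key and the two implementation names are taken from the discussion above.

    "nn_index": {
        "type": "FlannNearestNeighborsIndex",
        "FlannNearestNeighborsIndex": { ... },
        "LSHNearestNeighborIndex": { ... }
    }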
Once you have the configuration file set up the way that you like it, you can generate all of the models and indexes required by the application by running the following command:
iqr_app_model_generation -c config.IqrSearchApp.json /path/to/leeds/data/images/*.jpg
This will generate descriptors for all of the images in the data set and use them to compute the models and indices it requires.
Once it completes, you can run the IqrSearchApp
itself. You’ll need an instance of MongoDB running on the port and IP address specified by the mongo
element on line 73. You can start a Mongo
instance (presuming you have it installed) with:
mongod --dbpath /path/to/mongo/work/dir
Once Mongo has been started you can start the IqrSearchApp
with the following command:
runApplication.py -a IqrSearchApp -c config.IqrSearchApp.json
When the application starts you should click on the login
element and then enter the credentials you specified in the flask_app
element of the config file.
Once you've logged in, you will be able to select the CSIFT tab of the UI. This tab was named by line 9 in the configuration file and is configured by the first block in the tabs array. The tabs array allows you to configure different combinations of the required algorithms within the same application instance – useful, for example, if you want to compare the performance of different descriptors.
To begin the IQR process, drag an exemplar image to the grey load area (marked 1 in the next figure). In this case we've uploaded a picture of a Monarch butterfly (Item 2). Once you've done so, click the Refine element (Item 3) and the system will return a set of images that it believes are similar to the exemplar image, based on the descriptor computed.
The next figure shows the set of images returned by the system (on the left) and a random selection of images from the archive (shown by clicking the Toggle Random Results element). As you can see, even with just one exemplar the system is beginning to learn to return Monarch butterflies (or butterflies that look like Monarchs).
At this point you can begin to refine the query. You do this by marking correct returns with their checkbox and incorrect returns with the "X". Once you've marked a number of returns, you can select the "Refine" element, which will use your adjudications to retrain and rerank the results, with the goal that you will increasingly see correct results in your result set.
You can continue this process for as long as you like, until you are satisfied with the results that the query is returning. Once you are happy with the results, you can select the Save IQR State element. This will save a file that contains all of the information required to use the results of the IQR query as an image classifier. The process for doing this is described in the next section.
Using an IQR Trained Classifier
Before you can use your IQR session as a classifier, you must first train the classifier. You can do this with the iqrTrainClassifier
command:
Train a supervised classifier based on an IQR session state dump.
Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session
state dump, must exist external to the IQR web-app (uses a non-memory backend).
This is needed so that this script might access them for classifier training.
Click the “Save IQR State” button to download the IqrState file encapsulating
the descriptors of positively and negatively marked items. These descriptors
will be used to train the configured SupervisedClassifier.
usage: iqrTrainClassifier [-h] [-v] [-c PATH] [-g PATH] [-i IQR_STATE]
Named Arguments

-v, --verbose
    Output additional debug logging. Default: False

-i, --iqr-state
    Path to the ZIP file saved from an IQR session.

Configuration

-c, --config
    Path to the JSON configuration file.

-g, --generate-config
    Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
As with other commands from SMQTK, the config file is a set of configuration blocks stored in a JSON file. An example ships in the SMQTK repository:
Here, the only block required specifies the classifier that will be used, in this case the LibSvmClassifier. We'll assume that you downloaded your IQR session as 1d62a3bb-0b74-479f-be1b-acf03cabf944.IqrState. In that case, the following command will train your classifier, leveraging the descriptors associated with the IQR session that you saved:
iqrTrainClassifier.py -c config.iqrTrainClassifier.json -i 1d62a3bb-0b74-479f-be1b-acf03cabf944.IqrState
Once you have trained the classifier, you can use the classifyFiles
command to actually classify a set of files.
Based on an input, trained classifier configuration, classify a number of media
files, whose descriptor is computed by the configured descriptor generator.
Input files that classify as the given label are then output to standard out.
Thus, this script acts like a filter.
usage: classifyFiles [-h] [-v] [-c PATH] [-g PATH] [--overwrite] [-l LABEL]
[GLOB [GLOB ...]]
Positional Arguments

GLOB
    Series of shell globs specifying the files to classify.

Named Arguments

-v, --verbose
    Output additional debug logging. Default: False

Configuration

-c, --config
    Path to the JSON configuration file.

-g, --generate-config
    Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Classification

--overwrite
    When generating a configuration file, overwrite an existing file. Default: False

-l, --label
    The class to filter by. This is based on the classifier configuration/model used. If this is not provided, we will list the available labels in the provided classifier configuration.
Again, we need to provide a config-block-based configuration file for the command. As with iqrTrainClassifier, there is a sample configuration file in the repository:
Note that the classifier block on line 10 is the same as the classifier block in the iqrTrainClassifier configuration file. Further, the descriptor_generator block on line 42 matches the descriptor generator used for the IQR application itself (thus matching the type of descriptor used to train the classifier).
Once you've set up the configuration file to your liking, you can classify a set of files with the following command:
classifyFiles.py -c config.classifyFiles.json -l positive /path/to/leedsbutterfly/images/*.jpg
If you leave off the -l argument, the command will list the labels available with the classifier (in this case, positive and negative).
SMQTK's classifyFiles command can use the saved IQR state to classify a set of files (not necessarily the files in your IQR Application ingest), using the form shown above.
Social Media Query ToolKit

SMQTK is a Python toolkit for pluggable algorithms and data structures for multimedia-based machine learning. The source code is hosted on GitHub.
Installation

There are two ways to get ahold of SMQTK. The simplest is to install via the pip command. Alternatively, the source tree can be acquired and SMQTK built/installed via CMake or setuptools.

From pip

In order to get the latest version of SMQTK from PyPI:
pip install --upgrade smqtk
This method will install all of the same functionality as when installing from source, but not as many plugins will be functional right out of the box. This is due to some plugin dependencies not being installable through pip. We will see more on this in the section below.
Extras

A few extras are defined for the smqtk package:

docs
caffe
flann
postgres
solr
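Extras are requested with pip's bracket syntax; for example, to install SMQTK together with the postgres and solr plugin dependencies:

pip install smqtk[postgres,solr]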
From Source

Acquiring and building from source is different from installing via pip because:

The inclusion of FLANN and libSVM in the source is generally helpful due to their lack of [up-to-date] availability in PyPI and system package repositories. When available via a system package manager, they are often not easy to use within a virtual environment (e.g. virtualenv or Anaconda).
The sections below will cover the quick-start steps in more detail:
Quick Start

System dependencies

In order to retrieve and build from source, your system will need, at a minimum:

In order to run the provided IQR-search web-application, introduced later when describing the provided web services and applications, the following system dependencies are additionally required:
Getting the Source

The SMQTK source code is currently hosted on GitHub.
To clone the repository locally:
git clone https://github.com/Kitware/SMQTK.git /path/to/local/source
Installing Python dependencies

After deciding and activating what environment to install python packages into (system or virtual), the python dependencies should be installed based on the requirements.*.txt files found in the root of the source tree. These files detail the different dependencies, and their exact tested versions, for different components of SMQTK:

requirements.txt
    The core required python packages.

requirements.docs.txt
    Additional packages needed to build the Sphinx-based documentation for the project. These are separated because not everyone wishes or needs to build the documentation.

requirements.optional.txt
    Other optional dependencies and the plugins they correspond to.
Note that if conda [4] is being used, not all packages listed in our requirements files may be found in conda’s repository.
Installation of python dependencies via pip will look like the following:
pip install -r requirements.txt [-r requirements.docs.txt]
Where the requirements.docs.txt argument is only needed if you intend to build the SMQTK documentation.

Building NumPy and SciPy
If NumPy and SciPy are being built from source when installing from pip, either due to a wheel not existing for your platform or some other reason, it may be useful or required to install BLAS or LAPACK libraries for certain functionality and efficiency.
Additionally, when installing these packages using pip, if the LDFLAGS or CFLAGS/CXXFLAGS/CPPFLAGS environment variables are set, their build may fail, as they assume specific setups [5].

Additional Plugin Dependencies
Some plugins in SMQTK may require additional dependencies in order to run, usually python but sometimes not. In general, each plugin should document and describe its specific dependencies.
For example, the ColorDescriptor implementation requires a 3rd-party tool to be downloaded and set up. Its requirements and restrictions are documented in python/smqtk/algorithms/descriptor_generator/colordescriptor/INSTALL.md.

CMake Build
See below for a simple example of how to build SMQTK.
Navigate to where the build products should be located. It is recommended that this not be the source tree. Build products include some C/C++ libraries, python modules and generated scripts.
In the desired build directory, run the following, filling in <...> slots with appropriate values:

cmake <source_dir_path>

Optionally, the ccmake command line utility, or the GUI version, may be run in order to modify options for building additional modules. Currently, the selection is very minimal, but may be expanded over time.
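For instance, a minimal out-of-source build on a Unix system using Make might look like the following (paths are illustrative):

mkdir -p /path/to/build
cd /path/to/build
cmake /path/to/local/source
make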
Building the Documentation

All of the documentation for SMQTK is maintained as a collection of reStructuredText documents in the docs folder of the project. This documentation can be processed by the Sphinx documentation tool into a variety of documentation formats, the most common of which is HTML.

Within the docs directory is a Unix Makefile (for Windows systems, a make.bat file with similar capabilities exists). This Makefile takes care of the work required to run Sphinx to convert the raw documentation to an attractive output format. For example, the command:

make html

will generate HTML format documentation rooted at docs/_build/html/index.html.

The command:

make help

will show the other documentation formats that may be available (although be aware that some of them require additional dependencies such as TeX or LaTeX).
Live Preview

While writing documentation in a markup format such as reStructuredText, it is very helpful to be able to preview the formatted version of the text. While it is possible to simply run the make html command periodically, a more seamless option is available. Within the docs directory is a small Python script called sphinx_server.py. If you execute that file with the following command:

python sphinx_server.py

it will run a small process that watches the docs folder for changes in the raw documentation *.rst files and re-runs make html when changes are detected. It will serve the resulting HTML files at http://localhost:5500. Thus, having that URL open in a browser will provide you with a relatively up-to-date preview of the rendered documentation.

Footnotes
SMQTK Architecture Overview

SMQTK is mainly comprised of 4 high-level components, with additional sub-modules for tests, utilities and other control structures.
Data Abstraction

An important part of any algorithm is the data it works over and the data it produces. An important part of working with large scales of data is where the data is stored and how it is accessed. The smqtk.representation module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed. This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.

smqtk.representation.SmqtkRepresentation
    Interface for data representation interfaces and implementations.
    Data should be serializable, so this interface adds abstract methods for serializing and de-serializing SMQTK data representation instances.
Data Representation Structures

The following are the core data representation interfaces. As noted above, instances should be serializable via the pickle (and thus cPickle) module functions.

DataElement
smqtk.representation.data_element.DataElement
    Abstract interface for a byte data container.
    The primary "value" of a DataElement is the byte content it wraps. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.
    UUIDs should be cast-able to a string and maintain unique-ness after conversion.
    clean_temp()
        Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.

    content_type()

    from_uri(uri)
        Construct a new instance based on the given URI. This function may not be implemented for all DataElement types.
        Parameters: uri (str) – URI string to resolve into an element instance.
        Returns: New element instance of our type (DataElement).

    get_bytes()

    is_empty()
        Check if this element contains no bytes. The intent of this method is to quickly check whether there is any data behind this element, ideally without having to read all/any of the underlying data.

    is_read_only()

    md5()
        Get the MD5 checksum of this element's binary content.

    set_bytes(b)
        Set bytes to this data element. Not all implementations may support setting bytes (check the writable method return). This base abstract method should be called by sub-class implementations first; we check for mutability based on the writable() method return and invalidate checksum caches.

    sha1()
        Get the SHA1 checksum of this element's binary content.

    sha512()
        Get the SHA512 checksum of this element's binary content.

    to_buffered_reader()
        Wrap this element's bytes in an io.BufferedReader instance for use as a file-like object for reading. As we use the get_bytes function, this element's bytes must safely fit in memory for this method to be usable.

    uuid()
        UUID for this data element. This may take different forms, from integers to strings to a uuid.UUID instance. This must return a hashable data type. By default, this ends up being the hex stringification of the SHA1 hash of this data's bytes. Specific implementations may provide other UUIDs, however.

    writable()

    write_temp(temp_dir=None)
        Write this data's bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data's content type. It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data. When the temporary file is no longer needed, the clean_temp() method should be called on this object.

smqtk.representation.data_element.from_uri(uri, impl_generator=<function get_data_element_impls>)
    Create a data element instance from available plugin implementations.
    The first implementation that can resolve the URI is what is returned. If no implementations can resolve the URI, an InvalidUriError is raised. By default, candidate implementations come from the get_data_element_impls function; however, this can be changed to refer to a custom set of classes if desired.
    Raises: smqtk.exceptions.InvalidUriError – No data element implementations could resolve the given URI.
    Returns: New data element instance providing access to the data pointed to by the input URI (DataElement).
smqtk.representation.data_element.get_data_element_impls(reload_modules=False)
    Discover and return discovered DataElement classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable DATA_ELEMENT_PATH.
    Within a module we first look for a helper variable by the name DATA_ELEMENT_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type DataElement whose keys are the string names of the classes.
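As a short illustration of the above, the module-level from_uri function can resolve a plain file path into a concrete element (the path here is hypothetical):

    from smqtk.representation.data_element import from_uri

    # Resolve the URI with whichever discovered DataElement implementation
    # can handle it; a local file path is the simplest case.
    elem = from_uri("/path/to/image.jpg")

    print(elem.content_type())  # e.g. "image/jpeg"
    print(elem.md5())           # MD5 checksum of the element's bytes
    print(elem.uuid())          # by default, the SHA1 hex of the bytes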
DataSet

smqtk.representation.data_set.DataSet
    Abstract interface for data sets, which contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.
    This should only be used with DataElements whose byte content is expected not to change. If it does, then UUID keys may no longer represent the elements associated with them.

    add_data(*elems)
        Add the given data element(s) instance to this data set.
        NOTE: Implementing methods should check that input elements are in fact DataElement instances.

    count()

    get_data(uuid)
        Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.

    has_uuid(uuid)
        Test if the given uuid refers to an element in this data set.

    uuids()

smqtk.representation.data_set.get_data_set_impls(reload_modules=False)
    Discover and return discovered DataSet classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in, and in the python modules listed in the environment variable DATA_SET_PATH (multiple paths separated by ';' on Windows and ':' on unix).
    Within a module we first look for a helper variable by the name DATA_SET_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type DataSet whose keys are the string names of the classes.
DescriptorElement

smqtk.representation.descriptor_element.DescriptorElement(type_str, uuid)
    Abstract descriptor vector container.
    This structure supports implementations that cache descriptor vectors on a per-UUID basis.
    UUIDs must maintain unique-ness when transformed into a string.
    Descriptor element equality is based on shared descriptor type and vector equality. Two descriptor vectors that are generated by different types of descriptor generator should not be considered the same (though this may be up for discussion).
    Stored vectors should be effectively immutable.

    from_config(config_dict, type_str, uuid, merge_default=True)
        Instantiate a new instance of this class given the desired type, uuid, and a JSON-compliant configuration dictionary. If merge_default is true, the given configuration is merged on top of the default provided by get_default_config.
        Returns: Constructed instance from the provided config (DescriptorElement).

    get_default_config()
        Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
        By default, we observe what this class's constructor takes as arguments, aside from the first two assumed positional arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON-compliant value types.
        It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

    has_vector()

    set_vector(new_vec)
        Set the contained vector. If this container already stores a descriptor vector, this will overwrite it.

    type()

    uuid()

    vector()

smqtk.representation.descriptor_element.get_descriptor_element_impls(reload_modules=False)
    Discover and return discovered DescriptorElement classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable DESCRIPTOR_ELEMENT_PATH.
    Within a module we first look for a helper variable by the name DESCRIPTOR_ELEMENT_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type DescriptorElement whose keys are the string names of the classes.
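As an illustration of this interface with an in-memory implementation (the DescriptorMemoryElement class and its import path are assumptions; substitute whichever implementation get_descriptor_element_impls discovers in your installation):

    import numpy

    # Assumed in-memory implementation and import path.
    from smqtk.representation.descriptor_element.local_elements import \
        DescriptorMemoryElement

    # The constructor signature is (type_str, uuid), per the interface above.
    d = DescriptorMemoryElement('example-type', 'example-uuid')
    d.set_vector(numpy.random.rand(128))  # store a 128-dimensional vector

    assert d.has_vector()
    print(d.type(), d.uuid(), d.vector().shape)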
DescriptorIndex

smqtk.representation.descriptor_index.DescriptorIndex
    Index of descriptors, keyed and query-able by descriptor UUID.
    Note that these indexes do not use the descriptor type strings. Thus, if a set of descriptors has multiple elements with the same UUID but different type strings, they will clobber each other in these indexes. In such a case, when dealing with descriptors from different generators, it is advisable to use multiple indices.

    add_descriptor(descriptor)
        Add a descriptor to this index. Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.

    add_many_descriptors(descriptors)
        Add multiple descriptors at one time. Adding the same descriptor multiple times should not add multiple copies of the descriptor in the index (based on UUID). Added descriptors overwrite indexed descriptors based on UUID.

    clear()
        Clear this descriptor index's entries.

    count()

    get_descriptor(uuid)
        Get the descriptor in this index that is associated with the given UUID.

    get_many_descriptors(uuids)
        Get an iterator over descriptors associated with the given descriptor UUIDs.

    has_descriptor(uuid)
        Check if a DescriptorElement with the given UUID exists in this index.

    items()
        Alias for iteritems.

    iterdescriptors()
        Return an iterator over indexed descriptor element instances.

    iteritems()
        Return an iterator over indexed descriptor key and instance pairs.

    iterkeys()
        Return an iterator over indexed descriptor keys, which are their UUIDs.

    keys()
        Alias for iterkeys.

    remove_descriptor(uuid)
        Remove a descriptor from this index by the given UUID.

    remove_many_descriptors(uuids)
        Remove descriptors associated with the given descriptor UUIDs from this index.

smqtk.representation.descriptor_index.get_descriptor_index_impls(reload_modules=False)
    Discover and return discovered DescriptorIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable DESCRIPTOR_INDEX_PATH.
    Within a module we first look for a helper variable by the name DESCRIPTOR_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type DescriptorIndex whose keys are the string names of the classes.
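A brief usage sketch of this interface, again with assumed in-memory implementations (MemoryDescriptorIndex and DescriptorMemoryElement; adjust the imports to the implementations available to you):

    import numpy

    # Assumed in-memory implementations and import paths.
    from smqtk.representation.descriptor_element.local_elements import \
        DescriptorMemoryElement
    from smqtk.representation.descriptor_index.memory import \
        MemoryDescriptorIndex

    d = DescriptorMemoryElement('example-type', 'abc123')
    d.set_vector(numpy.arange(4, dtype=float))

    index = MemoryDescriptorIndex()
    index.add_descriptor(d)                  # keyed on the element's UUID
    assert index.has_descriptor('abc123')
    print(index.count())                     # -> 1
    print(index.get_descriptor('abc123').vector())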
Data Support Structures

Other data structures are provided in the smqtk.representation module to assist with the use of the above described structures.

DescriptorElementFactory
smqtk.representation.descriptor_element_factory.DescriptorElementFactory(d_type, type_config)
    Factory class for producing DescriptorElement instances of a specified type and configuration.

    from_config(config_dict, merge_default=True)
        Instantiate a new instance of this class given a JSON-compliant configuration dictionary encapsulating initialization arguments. This method should not be called via super unless an instance of the class is desired. If merge_default is true, the given configuration is merged on top of the default provided by get_default_config.
        Returns: Constructed instance from the provided config (DescriptorElementFactory).

    get_config()
        Return a JSON-compliant dictionary that could be passed to this class's from_config method to produce an instance with identical configuration. In the common case, this involves naming the keys of the dictionary based on the initialization argument names, as if it were to be passed to the constructor via dictionary expansion.

    get_default_config()
        Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.
        It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.

    new_descriptor(type_str, uuid)
        Create a new DescriptorElement instance of the configured implementation.
        Returns: New DescriptorElement instance (smqtk.representation.DescriptorElement).
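For example, a factory configured to produce in-memory descriptor elements (the DescriptorMemoryElement import is an assumption; any DescriptorElement implementation and matching type_config dictionary may be used):

    from smqtk.representation.descriptor_element_factory import \
        DescriptorElementFactory
    # Assumed in-memory DescriptorElement implementation.
    from smqtk.representation.descriptor_element.local_elements import \
        DescriptorMemoryElement

    # (d_type, type_config): the implementation class plus any extra
    # constructor parameters (none are needed for the in-memory element).
    factory = DescriptorElementFactory(DescriptorMemoryElement, {})
    d = factory.new_descriptor('example-type', 'example-uuid')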
Algorithms

Algorithm Interfaces

smqtk.algorithms.SmqtkAlgorithm
    Parent class for all algorithm interfaces.

    name

Here we list and briefly describe the high-level algorithm interfaces which SMQTK provides. There is at least one implementation available for each interface. Some implementations will require additional dependencies that cannot be packaged with SMQTK.
Classifier

This interface represents algorithms that classify DescriptorElement instances into discrete labels or label confidences.

smqtk.algorithms.classifier.Classifier
    Interface for algorithms that classify input descriptors into discrete labels and/or label confidences.

    classify(d, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False)
        Classify the input descriptor against one or more discrete labels, outputting a ClassificationElement containing the classification result.
        We return confidence values for each label the configured model contains. Implementations may act in a discrete manner, whereby only one label is marked with a 1 value (others being 0), or in a continuous manner, whereby each label is given a confidence-like value in the [0, 1] range.
        The returned ClassificationElement will have the same UUID as the input DescriptorElement.
        Returns: Classification result element (smqtk.representation.ClassificationElement).

    classify_async(d_iter, factory=<smqtk.representation.classification_element_factory.ClassificationElementFactory object>, overwrite=False, procs=None, use_multiprocessing=False, ri=None)
        Asynchronously classify the DescriptorElements in the given iterable.
        Returns: Mapping of input DescriptorElement instances to the computed ClassificationElement. ClassificationElement UUIDs are congruent with the UUID of the corresponding DescriptorElement (dict[smqtk.representation.DescriptorElement, smqtk.representation.ClassificationElement]).

    get_labels()
        Get the sequence of class labels that this classifier can classify descriptors into. This includes the negative label.

smqtk.algorithms.classifier.get_classifier_impls(reload_modules=False, sub_interface=None)
    Discover and return discovered Classifier classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable CLASSIFIER_PATH.
    Within a module we first look for a helper variable by the name CLASSIFIER_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped. If sub_interface is provided, only classes that also descend from that sub-interface of Classifier are returned.
    Returns: Map of discovered class objects of type Classifier whose keys are the string names of the classes (dict[str, type]).
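In sketch form, using a trained classifier looks like the following, where classifier stands for any trained implementation (for example the LibSvmClassifier trained from an IQR session, described earlier):

    def classify_one(classifier, descriptor_element):
        # Labels the configured model knows, including the negative label.
        print("Known labels:", classifier.get_labels())
        # Returns a ClassificationElement with the same UUID as the input
        # DescriptorElement, containing per-label confidence values.
        return classifier.classify(descriptor_element)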
DescriptorGenerator

This interface represents algorithms that generate whole-content descriptor vectors for a single given input DataElement instance. The input DataElement must be of a content type that the DescriptorGenerator supports, referenced against the DescriptorGenerator.valid_content_types method.

The compute_descriptor method also requires a DescriptorElementFactory instance to tell the algorithm how to generate the DescriptorElement it should return. The returned DescriptorElement instance will have a type equal to the name of the DescriptorGenerator class that generated it, and a UUID that is the same as the input DataElement instance's.

If a DescriptorElement implementation that supports persistent storage is generated, and there is already a descriptor associated with the given type name and UUID values, the descriptor is returned without re-computation.

If the overwrite parameter is True, the DescriptorGenerator instance will re-compute a descriptor for the input DataElement, setting it to the generated DescriptorElement. This will overwrite descriptor data in persistent storage if the DescriptorElement type used supports it.

This interface supports a high-level, implementation-agnostic asynchronous descriptor computation method. This is given an iterable of DataElement instances and a single DescriptorElementFactory that is used to produce all descriptor elements.

smqtk.algorithms.descriptor_generator.DescriptorGenerator
    Base abstract Feature Descriptor interface.

    compute_descriptor(data, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False)
        Given some data, return a descriptor element containing a descriptor vector. The default factory produces DescriptorMemoryElement instances.
        Returns: Result descriptor element. The UUID of this output descriptor is the same as the UUID of the input data element (smqtk.representation.DescriptorElement).

    compute_descriptor_async(data_iter, descr_factory=<smqtk.representation.descriptor_element_factory.DescriptorElementFactory object>, overwrite=False, procs=None, **kwds)
        Asynchronously compute feature data for multiple data items. The default factory produces DescriptorMemoryElement instances.
        Raises: ValueError – An input DataElement was of a content type that we cannot handle.
        Returns: Mapping of input DataElement UUIDs to the computed descriptor element for that data. DescriptorElement UUIDs are congruent with the UUIDs of the data elements they describe (dict[collections.Hashable, smqtk.representation.DescriptorElement]).

    valid_content_types()

smqtk.algorithms.descriptor_generator.get_descriptor_generator_impls(reload_modules=False)
    Discover and return discovered DescriptorGenerator classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable DESCRIPTOR_GENERATOR_PATH.
    Within a module we first look for a helper variable by the name DESCRIPTOR_GENERATOR_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type DescriptorGenerator whose keys are the string names of the classes.
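A minimal usage sketch of this interface, where generator stands for any configured DescriptorGenerator implementation:

    def describe(generator, data_element):
        # The input element's content type must be one the generator supports.
        assert data_element.content_type() in generator.valid_content_types()
        # The returned DescriptorElement shares the input DataElement's UUID
        # and has a type equal to the generator class's name.
        return generator.compute_descriptor(data_element)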
HashIndex

This interface describes specialized NearestNeighborsIndex implementations designed to index hash codes (bit vectors) via the hamming distance function. Implementations of this interface are primarily used with the LSHNearestNeighborIndex implementation.

Unlike the NearestNeighborsIndex interface from which this interface descends, HashIndex instances are built with an iterable of numpy.ndarray and nn returns a numpy.ndarray.

smqtk.algorithms.nn_index.hash_index.HashIndex
    Specialized NearestNeighborsIndex for indexing unique hash codes (bit-vectors) in memory (numpy arrays) using the hamming distance metric.
    Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.
    Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.

    build_index(hashes)
        Build the index with the given hash codes (bit-vectors). Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index, nor raise an exception, so as to protect the current index.

    count()

    nn(h, n=1)
        Return the nearest N neighbor hash codes, as bit-vectors, to the given hash code bit-vector.
        Distances are in the [0, 1] range and are the percent different each neighbor hash is from the query, based on the number of bits contained in the query (normalized hamming distance).
        Raises: ValueError – Current index is empty.
        Returns: Tuple of nearest N hash codes and a tuple of the distance values to those neighbors ((tuple[numpy.ndarray[bool]], tuple[float])).

    remove_from_index(hashes)
        Partially remove hashes from this index.
        Parameters: hashes (collections.Iterable[numpy.ndarray[bool]]) – Iterable of numpy boolean hash vectors to remove from this index.

    update_index(hashes)
        Additively update the current index with the one or more hash vectors given. If no index exists yet, a new one should be created using the given hash vectors.

smqtk.algorithms.nn_index.hash_index.get_hash_index_impls(reload_modules=False)
    Discover and return discovered HashIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    Within a module we first look for a helper variable by the name HASH_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type HashIndex whose keys are the string names of the classes.
LshFunctor

Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement. These are used in LSHNearestNeighborIndex instances.

smqtk.algorithms.nn_index.lsh.functors.LshFunctor
    Locality-sensitive hashing functor interface.
    The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.
    Building Models: Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.

    get_hash(descriptor)
        Get the locality-sensitive hash code for the input descriptor.

smqtk.algorithms.nn_index.lsh.functors.get_lsh_functor_impls(reload_modules=False)
    Discover and return discovered LshFunctor classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable LSH_FUNCTOR_PATH.
    Within a module we first look for a helper variable by the name LSH_FUNCTOR_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type LshFunctor whose keys are the string names of the classes.
NearestNeighborsIndex

This interface defines a method to build an index from a set of DescriptorElement instances (NearestNeighborsIndex.build_index) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement (NearestNeighborsIndex.nn).

Building an index requires that some non-zero number of DescriptorElement instances be passed into the build_index method. Subsequent calls to this method should rebuild the index model, not add to it. If an implementation supports persistent storage of the index, it should overwrite the configured index.

The nn method uses a single DescriptorElement to query the current index for a specified number of nearest neighbors. Thus, the NearestNeighborsIndex instance must have a non-empty index loaded for this method to function. If the provided query DescriptorElement does not have a set vector, this method will also fail with an exception.

This interface additionally requires that implementations define a count method, which returns the number of distinct DescriptorElement instances in the index.

smqtk.algorithms.nn_index.NearestNeighborsIndex
    Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.
    Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written when build_index is called.
    Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.

    build_index(descriptors)
        Build the index with the given descriptor data elements. Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index, nor raise an exception, so as to protect the current index.

    count()

    nn(d, n=1)
        Return the nearest N neighbors to the given descriptor element. Raises an exception if the query descriptor d has no vector set.
        Returns: Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors ((tuple[smqtk.representation.DescriptorElement], tuple[float])).

    remove_from_index(uids)
        Partially remove descriptors from this index associated with the given UIDs.
        Parameters: uids (collections.Iterable[collections.Hashable]) – Iterable of UIDs of descriptors to remove from this index.

    update_index(descriptors)
        Additively update the current index with the one or more descriptor elements given. If no index exists yet, a new one should be created using the given descriptors.

smqtk.algorithms.nn_index.get_nn_index_impls(reload_modules=False)
    Discover and return discovered NearestNeighborsIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in, and in the python modules listed in the environment variable NN_INDEX_PATH (multiple paths separated by ';' on Windows and ':' on unix).
    Within a module we first look for a helper variable by the name NN_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type NearestNeighborsIndex whose keys are the string names of the classes.
RelevancyIndex

This interface defines two methods: build_index and rank. The build_index method is, like a NearestNeighborsIndex, used to build an index of DescriptorElement instances. The rank method takes examples of relevant and not-relevant DescriptorElement examples, which the algorithm uses to rank (think sort) the indexed DescriptorElement instances by relevancy (on a [0, 1] scale).

smqtk.algorithms.relevancy_index.RelevancyIndex
    Abstract class for IQR index implementations.
    Similar to a traditional nearest-neighbors algorithm, an IQR index provides a specialized nearest-neighbors interface that can take multiple examples of positively and negatively relevant exemplars in order to produce a [0, 1] ranking of the indexed elements by determined relevancy.

    build_index(descriptors)
        Build the index based on the given iterable of descriptor elements. Subsequent calls to this method should rebuild the index, not add to it.

    count()

    rank(pos, neg)
        Rank the currently indexed elements given pos positive and neg negative exemplar descriptor elements.
        Returns: Map of indexed descriptor elements to a rank value in the [0, 1] (inclusive) range, where 1.0 means most relevant and 0.0 means least relevant (dict[smqtk.representation.DescriptorElement, float]).

smqtk.algorithms.relevancy_index.get_relevancy_index_impls(reload_modules=False)
    Discover and return discovered RelevancyIndex classes. Keys in the returned map are the names of the discovered classes, and the paired values are the actual class type objects.
    We search for implementation classes in the modules next to the file this function is defined in (ones that begin with an alphanumeric character), and in the python modules listed in the environment variable RELEVANCY_INDEX_PATH.
    Within a module we first look for a helper variable by the name RELEVANCY_INDEX_CLASS, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.
    Returns: Map of discovered class objects of type RelevancyIndex whose keys are the string names of the classes.
Algorithm Models and Generation

Some algorithms require a model: some pre-existing computed state. Not all algorithm interfaces require a model-generation method, as it is at times not appropriate or applicable to the definition of the algorithm the interface represents. However, some implementations of algorithms desire a model for some or all of their functionality. Algorithm implementations that require extra modeling are responsible for providing a method or utility for generating algorithm-specific models. Some algorithm implementations may also be pre-packaged with one or more specific models to optionally choose from, due to some performance, tuning or feasibility constraint. Whether an extra model is required and how it is constructed should be detailed in the documentation for that specific implementation.

For example, part of the definition of a NearestNeighborsIndex algorithm is that there is an index to search over, which is arguably a model for that algorithm. Thus, the build_index() method, which should build the index model, is part of that algorithm's interface. Other algorithms, like the DescriptorGenerator class of algorithms, do not have a high-level model building method, and model generation or choice is left to specific implementations to explain or provide.

DescriptorGenerator Models
The DescriptorGenerator interface does not define a model building method, but some implementations require internal models. Below are explanations of how to build or get models for DescriptorGenerator implementations that require one.

ColorDescriptor

ColorDescriptor implementations need to build a visual bag-of-words codebook model for reducing the dimensionality of the many low-level descriptors detected in an input data element. Model parameters as well as storage location parameters are specified at instance construction time, or via a configuration dictionary given to the from_config class method.

The storage location parameters include a data model directory path and an intermediate data working directory path: model_directory and work_directory respectively. The model_directory should be the path to a directory for storage of generated model elements. The work_directory should be the path to a directory for storing cached intermediate data. If model elements already exist in the provided model_directory, they are loaded at construction time. Otherwise, the provided directory is used to store model components when the generate_model method is called. Please reference the constructor's doc-string for descriptions of the other constructor parameters.

The method generate_model(data_set) is provided on instances, and should be given an iterable of DataElement instances representing the media to use for training the visual bag-of-words codebook. The media content types supported by a DescriptorGenerator instance are listed via its valid_content_types() method.

Below is an example code snippet of how to train a ColorDescriptor model for some instance of a ColorDescriptor implementation class and configuration:
CaffeDefaultImageNet

This implementation does not come with a method of training its own models, but requires model files provided by Caffe: the network model file and the image mean binary protobuf file.

The Caffe source tree provides two scripts to download the specific files (relative to the Caffe source tree). These scripts effectively just download files from a specific source.

If the Caffe source tree is not available, the model files can be downloaded from the following URLs:
NearestNeighborsIndex Models (k nearest-neighbors)

NearestNeighborsIndex interfaced classes include a build_index method on instances that should build the index model for an implementation. Implementations, if they allow for persistent storage, should take relevant parameters at construction time. Currently, we do not package an implementation that requires additional model creation.

The general pattern for NearestNeighborsIndex instance index model generation:
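A sketch of that pattern, using only the interface methods documented above (nn_index stands for any configured implementation):

    def build_and_query(nn_index, descriptor_elements, query_descriptor, n=10):
        # Rebuilds the index model; does not add to an existing one.
        nn_index.build_index(descriptor_elements)
        # Returns a tuple of neighbor DescriptorElements and their distances.
        return nn_index.nn(query_descriptor, n=n)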
RelevancyIndex Models

RelevancyIndex interfaced classes include a build_index method on instances that should build the index model for a particular implementation. Implementations, if they allow for persistent storage, should take relevant parameters at construction time. Currently, we do not package an implementation that requires additional model creation.

The general pattern for RelevancyIndex instance index model generation:
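And the corresponding sketch for a RelevancyIndex (rel_index stands for any configured implementation; pos and neg are iterables of adjudicated DescriptorElements):

    def rank_by_relevancy(rel_index, descriptor_elements, pos, neg):
        rel_index.build_index(descriptor_elements)
        # Maps each indexed DescriptorElement to a relevancy score in [0, 1].
        return rel_index.rank(pos, neg)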
Web Service and Demonstration Applications

Included in SMQTK are a few web-based service and demonstration applications, providing a view into the functionality provided by SMQTK algorithms and utilities.

runApplication

This script can be used to run any conforming (derived from SmqtkWebApp) SMQTK web-based application. Web services should be runnable via the bin/runApplication.py script.

Runs conforming SMQTK Web Applications.

Named Arguments
    Output additional debug logging. Default: False

Configuration

Application Selection
    List currently available applications for running. More description is included if SMQTK verbosity is increased (-v | --debug-smqtk). Default: False

Server options
    Turn on server reloading. Default: False
    Turn on server multi-threading. Default: False
    Use global basic authentication as configured. Default: False

Other options
    Turn on server debugging messages ONLY. Default: False
    Turn on SMQTK debugging messages ONLY. Default: False
SmqtkWebApp

This is the base class for all web applications and services in SMQTK.

smqtk.web.SmqtkWebApp(json_config)
    Base class for SMQTK web applications.

    from_config(config_dict, merge_default=True)
        Override to just pass the configuration dictionary to the constructor.

    get_config()
        Return a JSON-compliant dictionary that could be passed to this class's from_config method to produce an instance with identical configuration. In the common case, this involves naming the keys of the dictionary based on the initialization argument names, as if it were to be passed to the constructor via dictionary expansion.

    get_default_config()
        Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it. This should be overridden in each implemented application class to add appropriate configuration.

    impl_directory()

    run(host=None, port=None, debug=False, **options)
        Override of the run method, drawing running host and port from configuration by default. 'host' and 'port' values specified as argument or keyword will override the app configuration.
Sample Web Applications

Descriptor Similarity Service

class smqtk.web.descriptor_service.DescriptorServiceServer(json_config)
Simple server that computes the requested descriptor for a given file and returns it via a JSON structure. See the docstring for the compute_descriptor() method for complete rules on how to form a calling URL.

Standard return JSON:

Additional Configuration

Note
We will look for an environment variable DescriptorService_CONFIG for a string file path to an additional JSON configuration file to consider.

generate_descriptor(de, cd_label)
Generate a descriptor for the content pointed to by the given URI using the specified descriptor generator. Returns a generated descriptor element instance with vector information (smqtk.representation.DescriptorElement).

generator_label_configs = None

get_config()
Return a JSON-compliant dictionary that could be passed to this class's from_config method to produce an instance with identical configuration. In the common case, this involves naming the keys of the dictionary based on the initialization argument names, as if it were to be passed to the constructor via dictionary expansion.

get_default_config()
Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

get_descriptor_inst(label)
Get the cached content descriptor instance for a configuration label.
:type label: str
:rtype: smqtk.descriptor_generator.DescriptorGenerator

is_usable()
Check whether this class is available for use. Since certain plugin implementations may require additional dependencies that may not yet be available on the system, this method should check for those dependencies and return a boolean saying if the implementation is usable.

resolve_data_element(uri)
Given the URI to some data, resolve it down to a DataElement instance.
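As a hedged illustration of calling this service, the sketch below assumes a host, port, generator label, and URL layout; consult the compute_descriptor() docstring for the authoritative URL rules.

    import requests

    # Assumed service address and an assumed generator label "caffe"; the
    # exact URL structure is defined by compute_descriptor()'s docstring.
    uri = "file:///data/images/001_0001.jpg"
    resp = requests.get("http://127.0.0.1:5001/caffe/" + uri)
    print(resp.json())  # standard return JSON structure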
IQR Demo Application
Interactive Query Refinement or "IQR" is a process whereby a user provides an exemplar image or images and a system attempts to locate additional images from an archive that are similar to the exemplar(s). The user then adjudicates the results by identifying those that match their search and those that do not. The system then uses those adjudications to attempt to provide better, more closely matching results refined by the user's input.
SMQTK IQR Workflow
The IQR application is an excellent example application for SMQTK as it makes use of a broad spectrum of SMQTK's capabilities. The DescriptorGenerator algorithm is used to characterize each image in the archive so that it can be indexed. The NearestNeighborsIndex is used to understand the relationship between the images in the archive, and the RelevancyIndex is used to rank results based on the user's positive and negative adjudications. SMQTK comes with a web-based application that implements an IQR system using SMQTK's services, as shown in the SMQTK IQR Workflow figure.
Running the IQR Application
In order to run the IQR demonstration application, you will need an archive of imagery. While SMQTK has facilities for creating indexes that support tens or even hundreds of thousands of images, we'll be using simpler implementations for this example. As a result, we'll use a modest archive of images. The Leeds Butterfly Dataset will serve quite nicely. Download and unzip the archive (which contains over 800 images of different species of butterflies).

SMQTK comes with a script that computes the descriptors for all of the images in your archive and builds up the models needed by the Nearest Neighbors and Relevancy indices:
Train and generate models for the SMQTK IQR Application.

usage: iqr_app_model_generation [-h] -c CONFIG [-t TAB] [-v] [GLOB [GLOB ...]]

Positional Arguments
GLOB: Shell glob to files to add to the configured data set.

Named Arguments
-c, --config: IQR application configuration file.
-t, --tab: The configuration tab to generate the model for. Default: 0.
-v, --verbose: Show debug logging. Default: False.
The CONFIG argument specifies a JSON file that provides a configuration block for each of the SMQTK algorithms (DescriptorGenerator, NearestNeighborsIndex, etc.) required to generate the models and indices that will be required by the application. For convenience, the same CONFIG file will be provided to the web application when it is run.
The SMQTK source repository contains a sample configuration file. It can be found in source/python/smqtk/web/search_app/config.IqrSearchApp.json. The configuration is designed to run from an empty directory and will create the sub-directories and files that it requires when run.

Since this configuration file drives both the generation of the models for the application and the application itself, a closer examination of it is in order.
As a JSON file, the configuration consists of a collection of JSON objects that are used to configure various aspects of the application. Lines 2, 73 and 77 introduce blocks that configure the way the application itself works: setting the username and password, the location of the MongoDB server that the application uses for storing session information and finally the IP address and port that the application listens on.
The array of "tabs" that starts at line 7 is the core of the configuration file. We'll talk in a moment about why this is an array of tabs, but for now we'll examine the single element in the array. The blocks introduced at lines 26, 39, and 77 configure the three main algorithms used by the application: the descriptor used, the nearest neighbors index, and the relevancy index. Each of these blocks is passed to the SMQTK plugin system to create the appropriate instance of the algorithm in question. For example, the nn_index block that starts at line 39 defines the parameters for two different implementations: an LSHNearestNeighborIndex, configured to use Iterative Quantization (paper), to generate an index, and a FlannNearestNeighborsIndex, which uses the Flann library to do so. The type element on line 75 selects FlannNearestNeighborsIndex to be active for this configuration.

Once you have the configuration file set up the way that you like it, you can generate all of the models and indexes required by the application by running the following command:
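For example, following the usage shown above (paths are illustrative, assuming the Leeds images were unpacked into leedsbutterfly/images):

    iqr_app_model_generation -c config.IqrSearchApp.json -t 0 leedsbutterfly/images/*.jpg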
This will generate descriptors for all of the images in the data set and use them to compute the models and indices it requires.
Once it completes, you can run the IqrSearchApp itself. You'll need an instance of MongoDB running on the port and IP address specified by the mongo element on line 73. You can start a Mongo instance (presuming you have it installed) with:
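For example (the data directory path is illustrative):

    mongod --dbpath /path/to/mongo/data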
Once Mongo has been started, you can start the IqrSearchApp with the following command:
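A typical invocation might look like the following; the -a (application name) and -c (configuration file) option names are assumptions based on runApplication's argument listing:

    runApplication -a IqrSearchApp -c config.IqrSearchApp.json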
When the application starts, you should click on the login element and then enter the credentials you specified in the flask_app element of the config file.

Click on the login element to enter your credentials
Once you've logged in, you will be able to select the CSIFT tab of the UI. This tab was named by line 9 in the configuration file and is configured by the first block in the tabs array. The tabs array allows you to configure different combinations of the required algorithms within the same application instance. This is useful, for example, if you want to compare the performance of different descriptors.

Select the CSIFT tab to begin working with the application
To begin the IQR process, drag an exemplar image to the grey load area (marked 1 in the next figure). In this case we've uploaded a picture of a Monarch butterfly (Item 2). Once you've done so, click the Refine element (Item 3) and the system will return a set of images that it believes are similar to the exemplar image, based on the descriptor computed.

IQR Initialization
The next figure shows the set of images returned by the system (on the left) and a random selection of images from the archive (shown by clicking the Toggle Random Results element). As you can see, even with just one exemplar the system is beginning to learn to return Monarch butterflies (or butterflies that look like Monarchs).

Initial Query Results and Random Results
At this point you can begin to refine the query. You do this by marking correct returns at their checkbox and incorrect returns at the “X”. Once you’ve marked a number of returns, you can select the “Refine” element which will use your adjudications to retrain and rerank the results with the goal that you will increasingly see correct results in your result set.
Query Refinement
You can continue this process for as long as you like, until you are satisfied with the results that the query is returning. Once you are happy with the results, you can select the Save IQR State element. This will save a file that contains all of the information required to use the results of the IQR query as an image classifier. The process for doing this is described in the next section.

Using an IQR-Trained Classifier
Before you can use your IQR session as a classifier, you must first train the classifier. You can do this with the iqrTrainClassifier command:

Train a supervised classifier based on an IQR session state dump.
Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session state dump, must exist external to the IQR web-app (uses a non-memory backend). This is needed so that this script might access them for classifier training.
Click the “Save IQR State” button to download the IqrState file encapsulating the descriptors of positively and negatively marked items. These descriptors will be used to train the configured SupervisedClassifier.
Output additional debug logging. Default: False.
As with other commands from SMQTK the config file is a set of configuration blocks stored in a JSON file. An example ships in the SMQTK repository:
In this case the only block required specifies the classifier that will be used, in this case the LibSvmClassifier. We'll assume that you downloaded your IQR session as 1d62a3bb-0b74-479f-be1b-acf03cabf944.IqrState. In that case, the following command will train your classifier, leveraging the descriptors associated with the IQR session that you saved:
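A hedged example invocation (the -c configuration and -i input-state option names are assumptions):

    iqrTrainClassifier -c config.iqrTrainClassifier.json -i 1d62a3bb-0b74-479f-be1b-acf03cabf944.IqrState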
Once you have trained the classifier, you can use the classifyFiles command to actually classify a set of files.

Based on an input, trained classifier configuration, classify a number of media files, whose descriptor is computed by the configured descriptor generator. Input files that classify as the given label are then output to standard out. Thus, this script acts like a filter.
Output additional debug logging. Default: False.

When generating a configuration file, overwrite an existing file. Default: False.
Again, we need to provide a config-block based configuration file for the command. As with iqrTrainClassifier, there is a sample configuration file in the repository.

Note that the classifier block on line 10 is the same as the classifier block in the iqrTrainClassifier configuration file. Further, the descriptor_generator block on line 42 matches the descriptor generator used for the IQR application itself (thus matching the type of descriptor used to train the classifier).

Once you've set up the configuration file to your liking, you can classify a set of files with the following command:
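A hedged example (the -c configuration and -l label option names are assumptions; the glob selects the files to classify):

    classifyFiles -c config.classifyFiles.json -l positive some/butterfly/images/*.jpg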
If you leave off the -l argument, the command will tell you the labels available with the classifier (in this case, positive and negative).

SMQTK's classifyFiles command can use this saved IQR state to classify a set of files (not necessarily the files in your IQR Application ingest); see the classifyFiles entry under Utility Scripts below for the command's full form.

Utilities and Applications
Also part of SMQTK are support utility modules, utility scripts (effectively the “binaries”) and service-oriented and demonstration web applications.
Utility Modules
Various unclassified functionality intended to support the primary goals of SMQTK. See doc-string comments on sub-module classes and functions in the smqtk.utils module.

Utility Scripts
Located in the smqtk.bin module are various scripts intended to provide quick access or generic entry points to common SMQTK functionality. These scripts generally require configuration via a JSON text file, and executable entry points are installed via setup.py. By rule of thumb, scripts that require a configuration also provide an option for outputting a default or example configuration file.

Currently available utility scripts in alphabetical order:
classifier_kfold_validation

classifier_model_validation
Utility for validating a given classifier implementation's model against some labeled testing data, outputting PR and ROC curve plots with area-under-curve score values.

This utility can optionally be used to train a supervised classifier model if the given classifier model configuration does not exist and a second CSV file listing labeled training data is provided. Training will be attempted if train is set to true. If training is performed, we exit after training completes. A SupervisedClassifier sub-classing implementation must be configured.

We expect the test and train CSV files in the column format:

The UUID is that of the descriptor to which the label applies. The label may be any arbitrary string value, but all labels must be consistent in application.

Some metrics presented assume the highest-confidence class as the single predicted class for an element:

The output UUID confusion matrix is a JSON dictionary where the top-level keys are the true labels, and the inner dictionary is the mapping of predicted labels to the UUIDs of the classifications/descriptors that yielded the prediction. Again, this is based on the maximum probability label for a classification result (T=0.5).
Named Arguments
Output additional debug logging. Default: False.

Configuration
classifyFiles
Based on an input, trained classifier configuration, classify a number of media files, whose descriptor is computed by the configured descriptor generator. Input files that classify as the given label are then output to standard out. Thus, this script acts like a filter.

Positional Arguments

Named Arguments
Output additional debug logging. Default: False.

Configuration

Classification
When generating a configuration file, overwrite an existing file. Default: False.
compute_classifications
Script for asynchronously computing classifications for DescriptorElements in a DescriptorIndex specified via a list of UUIDs. Results are output to a CSV file in the format:

CSV column labels are output to the given CSV header file path. Label columns will be in the order reported by the classifier implementation's get_labels method.

Due to using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality-comparable to the UUIDs' string representation.

Named Arguments
Output additional debug logging. Default: False.

Configuration

Input Output Files
compute_hash_codes
Compute LSH hash codes based on the provided functor on all or specific descriptors from the configured index, given a file-list of UUIDs.

When using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality-comparable to the UUIDs' string representation.

We update a key-value store with the results of descriptor hash computation. We assume the keys of the store are the integer hash values and the values of the store are frozenset instances of descriptor UUIDs (hashable-type objects). We also assume that no other source is concurrently modifying this key-value store, due to the need to modify the values of keys.

Named Arguments
Output additional debug logging. Default: False.

Configuration

I/O
compute_many_descriptors
Descriptor computation helper utility. Checks data content type with respect to the configured descriptor generator to skip content that does not match the accepted types. Optionally, we can additionally filter out image content whose image bytes we cannot load via PIL.Image.open.

Named Arguments
Output additional debug logging. Default: False.

Number of files to batch together into a single compute async call. This defines the granularity of the checkpoint file in regards to computation completed. If given 0, we do not batch and will perform a single compute_async call on the configured generator. Default batch size is 0. Default: 0.

If we should check image pixel loading before queueing an input image for processing: if we cannot load the image pixels via PIL.Image.open, the input image is not queued for processing. Default: False.

Configuration

Required Arguments
computeDescriptor
Compute a descriptor vector for a given data file, outputting the generated feature vector to standard out, or to an output file if one was specified (in numpy format).

Positional Arguments

Named Arguments
Output additional debug logging. Default: False.

Force descriptor computation even if an existing descriptor vector was discovered based on the given content descriptor type and data combination. Default: False.

Configuration
createFileIngest
Add a set of local system files to a data set via explicit paths or shell-style glob strings.

Positional Arguments

Named Arguments
Output additional debug logging. Default: False.

Configuration
descriptors_to_svmtrainfile
Utility script to transform a set of descriptors, specified by UUID, with matching class labels, to a test file usable by libSVM utilities for train/test experiments.

The input CSV file is assumed to be of the format:

This is the same as the format requested for other scripts like classifier_model_validation.py.

This is very useful for searching for -c and -g parameter values for a training sample of data using the tools/grid.py script, found in the libSVM source tree. For example:
Named Arguments
Output additional debug logging. Default: False.

Configuration

IO Options
generate_image_transform
Utility for transforming an input image in various standardized ways, saving out those transformed images with standard namings. Transformations used are configurable via a configuration file (JSON).

Configuration details: {
}

Named Arguments
Output additional debug logging. Default: False.

Configuration

Input/Output
iqr_app_model_generation
Train and generate models for the SMQTK IQR Application.

usage: iqr_app_model_generation [-h] -c CONFIG [-t TAB] [-v] [GLOB [GLOB ...]]

Positional Arguments
GLOB: Shell glob to files to add to the configured data set.

Named Arguments
-c, --config: IQR application configuration file.
-t, --tab: The configuration tab to generate the model for. Default: 0.
-v, --verbose: Show debug logging. Default: False.
iqrTrainClassifier
Train a supervised classifier based on an IQR session state dump.

Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session state dump, must exist external to the IQR web-app (uses a non-memory backend). This is needed so that this script might access them for classifier training.

Click the "Save IQR State" button to download the IqrState file encapsulating the descriptors of positively and negatively marked items. These descriptors will be used to train the configured SupervisedClassifier.

Named Arguments
Output additional debug logging. Default: False.

Configuration
make_balltree
Script for building and saving the model for the SkLearnBallTreeHashIndex implementation of HashIndex.

Named Arguments
Output additional debug logging. Default: False.

Configuration
minibatch_kmeans_clusters
Script for generating clusters from descriptors in a given index using the mini-batch KMeans implementation from Scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html).

By the nature of Scikit-learn's MiniBatchKMeans implementation, Euclidean distance is used to measure distance between descriptors.

Named Arguments
Output additional debug logging. Default: False.

Configuration

output
proxyManagerServer
Server for hosting the proxy manager, which hosts proxy object instances.

This takes a simple configuration file that looks like the following:

Named Arguments
Output additional debug logging. Default: False.

Configuration
removeOldFiles
Utility to recursively scan and remove files underneath a given directory if they have not been modified for longer than a set amount of time.

Named Arguments
Display more messages (debugging). Default: False.
runApplication
Generic entry point for running SMQTK web applications defined in the smqtk.web module.

Runs conforming SMQTK Web Applications.

Named Arguments
Output additional debug logging. Default: False.

Configuration

Application Selection
List currently available applications for running. More description is included if SMQTK verbosity is increased (-v | --debug-smqtk). Default: False.

Server options
Turn on server reloading. Default: False.
Turn on server multi-threading. Default: False.
Use global basic authentication as configured. Default: False.

Other options
Turn on server debugging messages ONLY. Default: False.
Turn on SMQTK debugging messages ONLY. Default: False.
summarizePlugins
Print out information about what plugins are currently usable and the documentation headers for each implementation.

Named Arguments
Output additional debug logging. Default: False.

Optionally generate default configuration blocks for each plugin structure and output as JSON to the specified path. Default: False.
train_itq
Tool for training the ITQ functor algorithm's model on descriptors in an index.

By default, we use all descriptors in the configured index (uuids_list_filepath is not given a value).

The uuids_list_filepath configuration property is optional and should be used to specify a sub-set of descriptors in the configured index to train on. This only works if the stored descriptors' UUID is a type of string.

Named Arguments
Output additional debug logging. Default: False.

Configuration
Plugin Architecture
Each of these main components is housed within a distinct sub-module under smqtk and adheres to a plugin pattern for the dynamic discovery of implementations.

In SMQTK, data structures and algorithms are first defined by an abstract interface class that lays out the services that the data structure, or the methods that the algorithm, should provide. This allows users to treat instances of structures and algorithms in a generic way, based on their defined high-level functionality, without needing to know what specific implementation is running underneath. It falls, of course, to the implementations of these interfaces to provide the concrete functionality.

When creating a new data structure or algorithm interface, the pattern is that each interface is defined inside its own sub-module in the __init__.py file. This file also defines a function get_..._impls() (replacing the ... with the name of the interface) that returns a mapping of implementation class names to the implementation class type, by calling the general helper method smqtk.utils.plugin.get_plugins(). This helper method looks for modules defined parallel to the __init__.py file, as well as classes defined in modules listed in an environment variable (defined by the specific call to get_plugins()). The function then extracts classes that extend from the specified interface class, as denoted by a helper variable in the discovered module or by searching attributes exposed by the module. See the doc-string of smqtk.utils.plugin.get_plugins() for more information on how plugin modules are discovered.

Adding a new Interface and Internal Implementation
For example, let's say we're creating a new data representation interface called FooBar. We would create a directory and __init__.py file (python module) to house the interface as follows:
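A plausible layout (the directory names are illustrative, following SMQTK's convention of housing data representation interfaces under smqtk/representation):

    python/
      smqtk/
        representation/
          foo_bar/
            __init__.py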
Since we are making a new data representation interface, our new interface should descend from the smqtk.representation.SmqtkRepresentation interface (algorithm interfaces would descend from smqtk.algorithms.SmqtkAlgorithm). The SmqtkRepresentation base-class descends from the Configurable interface (the interface class sets __metaclass__ = abc.ABCMeta, thus it is not set in the example below).

The __init__.py file for our new sub-module might look something like the following, defining a new abstract class:
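A minimal sketch (the abstract method and the FOO_BAR_* variable names are illustrative assumptions; the get_plugins() arguments follow the signature documented in the Function and Interface Reference below):

    import abc
    import os

    from smqtk.representation import SmqtkRepresentation
    from smqtk.utils.plugin import Pluggable, get_plugins


    class FooBar (SmqtkRepresentation, Pluggable):
        """
        Illustrative abstract interface for FooBar data representations.
        """

        @abc.abstractmethod
        def do_something(self):
            """Hypothetical capability that implementations must provide."""


    def get_foo_bar_impls(reload_modules=False):
        # Discover FooBar implementations defined parallel to this module, as
        # well as in modules listed in the (assumed) FOO_BAR_PATH environment
        # variable, optionally exported via an (assumed) FOO_BAR_CLASS helper
        # variable.
        this_dir = os.path.abspath(os.path.dirname(__file__))
        return get_plugins(__name__, this_dir, "FOO_BAR_PATH",
                           "FOO_BAR_CLASS", FooBar,
                           reload_modules=reload_modules)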
When adding an implementation class, if it is sufficient to be contained in a single file, a new module can be added alongside the interface's __init__.py. Where some_impl.py might look like:
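A matching sketch for some_impl.py (the class body is illustrative; is_usable is required by the Pluggable interface documented below):

    from smqtk.representation.foo_bar import FooBar


    class SomeImpl (FooBar):
        """
        Illustrative concrete implementation of the FooBar interface.
        """

        @classmethod
        def is_usable(cls):
            # This toy implementation has no extra dependencies to check for.
            return True

        def do_something(self):
            return "something"

        def get_config(self):
            # No constructor arguments, so an empty configuration suffices.
            return {}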
Implementation classes can also live inside of a nested sub-module. This is useful when an implementation class requires specific or extensive support utilities (for example, see the DescriptorGenerator implementation ColorDescriptor). In that case the sub-module's __init__.py file should at least expose the concrete implementation classes as module attributes for the plugin getter to discover.

Both Pluggable and Configurable

It is important to note that our new interface, as defined above, descends from both the Configurable interface (transitively, through the SmqtkRepresentation base-class) and the Pluggable interface.

The Configurable interface allows classes to be instantiated via a dictionary with JSON-compliant data types. In conjunction with the plugin getter function (get_foo_bar_impls in our example above), we are able to select and construct specific implementations of an interface via a configuration or during runtime (e.g. via a transcoded JSON object). With this flexibility, an application can set up a pipeline using the high-level interfaces as reference, allowing specific implementations to be swapped in and out via configuration.

Reload Use Warning
While the smqtk.utils.plugin.get_plugins() function allows for reloading discovered modules for potentially new content, this is not recommended under normal conditions. When reloading a plugin module after pickle-serializing an instance of an implementation, deserialization causes an error because the original class type that was pickled is no longer valid, as the reloaded module overwrote the previous plugin class type.

Function and Interface Reference
smqtk.utils.plugin.get_plugins(base_module_str, internal_dir, dir_env_var, helper_var, baseclass_type, warn=True, reload_modules=False)
Discover and return classes found in the SMQTK internal plugin directory and any additional directories specified via an environment variable.

In order to specify additional out-of-SMQTK python modules containing base-class implementations, additions to the given environment variable must be made. Entries must be separated by either a ';' (for Windows) or ':' (for everything else). This is the same as for the PATH environment variable on your platform. Entries should be paths to importable modules containing attributes for potential import.

When looking at module attributes, we acknowledge those that start with an alphanumeric character ('_'-prefixed attributes are hidden from import by this function).

We require that the base class that we are checking for also descends from the Pluggable interface defined above. This allows us to check if a loaded class is_usable.

Within a module we first look for a helper variable by the name provided, which can either be a single class object or an iterable of class objects, to be specifically exported. If the variable is set to None, we skip that module and do not import anything. If the variable is not present, we look at attributes defined in that module for classes that descend from the given base class type. If none of the above are found, or if an exception occurs, the module is skipped.

Returns: Map of discovered class objects descending from type baseclass_type and smqtk.utils.plugin.Pluggable, whose keys are the string names of the class types.
Return type: dict[str, type]
class smqtk.utils.plugin.Pluggable
Interface for classes that have plugin implementations.

is_usable()
Check whether this class is available for use.

Since certain plugin implementations may require additional dependencies that may not yet be available on the system, this method should check for those dependencies and return a boolean saying if the implementation is usable.
class smqtk.utils.configurable_interface.Configurable
Interface for objects that should be configurable via a configuration dictionary consisting of JSON types.

from_config(config_dict, merge_default=True)
Instantiate a new instance of this class given the configuration JSON-compliant dictionary encapsulating initialization arguments. This method should not be called via super unless an instance of the class is desired. When merge_default is true, the given configuration is merged on top of the default configuration provided by get_default_config.

Returns: Constructed instance from the provided config (Configurable).

get_config()
Return a JSON-compliant dictionary that could be passed to this class's from_config method to produce an instance with identical configuration. In the common case, this involves naming the keys of the dictionary based on the initialization argument names, as if it were to be passed to the constructor via dictionary expansion.

get_default_config()
Generate and return a default configuration dictionary for this class. This will be primarily used for generating what the configuration dictionary would look like for this class without instantiating it.

By default, we observe what this class's constructor takes as arguments, turning those argument names into configuration dictionary keys. If any of those arguments have defaults, we will add those values into the configuration dictionary appropriately. The dictionary returned should only contain JSON-compliant value types.

It is not guaranteed that the configuration dictionary returned from this method is valid for construction of an instance of this class.
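A toy sketch of the intended round trip, assuming Configurable can be subclassed directly as shown (the Point class is purely illustrative):

    from smqtk.utils.configurable_interface import Configurable


    class Point (Configurable):
        """Toy Configurable implementation."""

        def __init__(self, x=0, y=0):
            self.x = x
            self.y = y

        def get_config(self):
            # Keys mirror constructor argument names, per the interface contract.
            return {'x': self.x, 'y': self.y}


    c = Point.get_default_config()   # -> {'x': 0, 'y': 0}, via constructor introspection
    c['x'] = 3
    p = Point.from_config(c)         # construct from the configuration dictionary
    assert p.get_config() == {'x': 3, 'y': 0}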
Examples

Simple Feature Computation with ColorDescriptor
The following is a concrete example of performing feature computation for a set of ten butterfly images using the CSIFT descriptor from the ColorDescriptor software package. It assumes you have set up the colordescriptor executable and python library in your PATH and PYTHONPATH. Once set up, the following code will compute a CSIFT descriptor:
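A hedged sketch (the implementation label and the model/work directory configuration keys are assumptions; the plugin getter, DataFileElement, and compute_descriptor follow the interfaces described elsewhere in this document):

    from smqtk.algorithms.descriptor_generator import get_descriptor_generator_impls
    from smqtk.representation.data_element.file_element import DataFileElement

    # Look up the CSIFT ColorDescriptor implementation via the plugin getter
    # (implementation label assumed for illustration).
    gen_cls = get_descriptor_generator_impls()['ColorDescriptor_Image_csift']

    # Construct from a filled-in default configuration (keys assumed).
    c = gen_cls.get_default_config()
    c['model_directory'] = 'csift_model'
    c['work_directory'] = 'csift_work'
    generator = gen_cls.from_config(c)

    # Compute a descriptor for one image file.
    elem = DataFileElement('leedsbutterfly/images/001_0001.jpg')
    descriptor = generator.compute_descriptor(elem)
    print(descriptor.vector())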
Nearest Neighbor Computation with Caffe
The following is a concrete example of performing a nearest neighbor computation using a set of ten butterfly images. This example has been tested using Caffe version rc2 and may work with the master version of Caffe from GitHub.

To generate the required model files image_mean_filepath and network_model_filepath, run the model-download scripts that ship with Caffe (e.g. scripts/download_model_binary.py and data/ilsvrc12/get_ilsvrc_aux.sh). Once this is done, the nearest neighbor index for the butterfly images can be built with the following code:
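A hedged sketch (the nn_index module path, implementation labels, and the prototxt configuration key are assumptions; build_index and nn are the NearestNeighborsIndex methods):

    import glob

    from smqtk.algorithms.descriptor_generator import get_descriptor_generator_impls
    from smqtk.algorithms.nn_index import get_nn_index_impls
    from smqtk.representation.data_element.file_element import DataFileElement

    # Configure the Caffe descriptor generator; the filepath keys follow the
    # model files named above (the prototxt key name is an assumption).
    gen_cls = get_descriptor_generator_impls()['CaffeDescriptorGenerator']
    gen_config = gen_cls.get_default_config()
    gen_config['network_model_filepath'] = 'bvlc_alexnet.caffemodel'
    gen_config['image_mean_filepath'] = 'imagenet_mean.binaryproto'
    gen_config['network_prototxt_filepath'] = 'deploy.prototxt'  # assumed key
    gen = gen_cls.from_config(gen_config)

    # Compute descriptors for the ten butterfly images.
    elements = [DataFileElement(p) for p in sorted(glob.glob('butterflies/*.jpg'))]
    descriptors = [gen.compute_descriptor(e) for e in elements]

    # Build a FLANN-backed nearest neighbors index (implementation label assumed).
    nn_cls = get_nn_index_impls()['FlannNearestNeighborsIndex']
    nn_index = nn_cls.from_config(nn_cls.get_default_config())
    nn_index.build_index(descriptors)

    # Query for the ten nearest neighbors of the first image's descriptor.
    neighbors, distances = nn_index.nn(descriptors[0], n=10)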
NearestNeighborServiceServer Incremental Update Example

Goal and Plan
In this example, we will show how to initially set up an instance of the NearestNeighborServiceServer web API service class such that it can handle incremental updates to its background data. We will also show how to perform incremental updates and confirm that the service recognizes the new data.

For this example, we will use the LSHNearestNeighborIndex implementation as it is one that currently supports live-reloading its component model files. Along with it, we will use the ItqFunctor and PostgresDescriptorIndex implementations as the components of the LSHNearestNeighborIndex. For simplicity, we will not use a specific HashIndex, which causes a LinearHashIndex to be constructed and used at query time.

All scripts used in this example's procedure have a command line interface that uses dash options. Their available options can be listed by giving the -h/--help option. Additional debug logging can be output by providing a -d or -v option, depending on the script.

This example assumes that you have a basic understanding of:
Dependencies
Due to our use of the PostgresDescriptorIndex in this example, a minimum installed version of PostgreSQL 9.4 is required, as is the psycopg2 python module (conda and pip installable). Please check and modify the configuration files for this example to be able to connect to the database of your choosing.

Take a look at the etc/smqtk/postgres/descriptor_element/example_table_init.sql and etc/smqtk/postgres/descriptor_index/example_table_init.sql files, located from the root of the source tree, for table creation examples for element and index storage:

Procedure
[1] Getting and Splitting the data set
For this example we will use the Leeds butterfly data set (see the download_leeds_butterfly.sh script). We will split the data set into an initial sub-set composed of about half of the images from each butterfly category (418 total images, listed in the 2.ingest_files_1.txt file). We will then split the remaining data into two more sub-sets, each composed of about half of the remaining data (each comprising about 1/4 of the original data set, totaling 209 and 205 images, listed in the 3.ingest_files_2.txt and 4.ingest_files_3.txt files respectively).

[2] Computing Initial Ingest
For this example, an “ingest” consists of a set of descriptors in an index and a mapping of hash codes to the descriptors.
In this example, we also train the LSH hash code functor’s model, if it needs one, based on the descriptors computed before computing the hash codes. We are using the ITQ functor which does require a model. It may be the case that the functor of choice does not require a model, or a sufficient model for the functor is already available for use, in which case that step may be skipped.
Our example's initial ingest will use the image files listed in the 2.ingest_files_1.txt text file.

[2a] Computing Descriptors
We will use the script bin/scripts/compute_many_descriptors.py for computing descriptors from a list of file paths. This script will be used again in later sections for additional incremental ingests.

The example configuration file for this script, 2a.config.compute_many_descriptors.json (shown below), should be modified to connect to the appropriate PostgreSQL database and the correct Caffe model files for your system. For this example, we will be using Caffe's bvlc_alexnet network model with the ilsvrc12 image mean.

For running the script, take a look at the example invocation in the file 2a.run.sh:

This step yields two side effects:
[2b] Training ITQ Model
To train the ITQ model, we will use the script ./bin/scripts/train_itq.py. We'll want to train the functor's model using the descriptors computed in step 2a. Since we will be using the whole index (418 descriptors), we will not need to provide the script with an additional list of UUIDs.

The example configuration file for this script, 2b.config.train_itq.json, should be modified to connect to the appropriate PostgreSQL database. 2b.run.sh contains an example call of the training script:

This step produces the following side effects:
[2c] Computing Hash Codes
For this step we will be using the script bin/scripts/compute_hash_codes.py to compute ITQ hash codes for the currently computed descriptors. We will be using the descriptor index we added to before, as well as the ItqFunctor models we trained in the previous step.

This script additionally wants to know the UUIDs of the descriptors to compute hash codes for. We can use the 2a.completed_files.csv file computed earlier in step 2a to get at the UUID (SHA1 checksum) values for the computed files. Remember, as is documented in the DescriptorGenerator interface, descriptor UUIDs are the same as the UUID of the data from which they were generated, thus we can use this file.

We can conveniently extract these UUIDs with the following commands in the script 2c.extract_ingest_uuids.sh, resulting in the file 2c.uuids_for_processing.txt:
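A hedged sketch of the extraction (this assumes 2a.completed_files.csv rows are of the form "filepath,sha1"; the column order is an assumption):

    cut -d, -f2 2a.completed_files.csv > 2c.uuids_for_processing.txt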
With this file, we can now complete the configuration for our computation script. We are not setting a value for hash2uuids_input_filepath because this is the first time we are running this script, and thus we do not have an existing structure to add to.

We can now move forward and run the computation script:
This step produces the following side effects:
[2d] Starting the NearestNeighborServiceServer
Normally, a NearestNeighborsIndex instance would need to have its index built before it can be used. However, we have effectively already done this in the preceding steps, so we are instead able to get right to configuring and starting the NearestNeighborServiceServer. A default configuration may be generated using the generic bin/runApplication.py script (since web applications/servers are plugins) using the command:
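A hedged sketch (the option that emits a default configuration, -g here, is an assumption):

    runApplication -a NearestNeighborServiceServer -g 2d.config.nnss_app.json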
An example configuration has been provided in 2d.config.nnss_app.json. The DescriptorIndex, DescriptorGenerator and LshFunctor configuration sections should be the same as used in the preceding sections.

Before configuring, we copy 2c.hash2uuids.pickle to 2d.hash2uuids.pickle. Since we will be overwriting this file (the 2d version) in steps to come, we want to separate it from the results of step 2c.

Note the highlighted lines for configurations of note for the LSHNearestNeighborIndex implementation. These will be explained below.

Emphasized line explanations:
We can now start the service using:
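For example (the -a and -c option names are assumptions based on runApplication's argument listing):

    runApplication -a NearestNeighborServiceServer -c 2d.config.nnss_app.json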
We can test the server by calling its web API via curl using one of our ingested images, leedsbutterfly/images/001_0001.jpg:
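A hedged example request (the host, port, and URL layout are assumptions; the n=... path segment follows the form discussed below):

    curl http://127.0.0.1:5000/nn/n=10/file://$PWD/leedsbutterfly/images/001_0001.jpg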
If we compare the result neighbor UUIDs to the SHA1 hash signatures of the original files (that the descriptors were computed from), listed in the step 2a result file 2a.completed_files.csv, we find that the above results are all of the class 001, or monarch butterflies.

If we used either of the files leedsbutterfly/images/001_0042.jpg or leedsbutterfly/images/001_0063.jpg, which are not in our initial ingest but are in the subsequent ingests, and set .../n=832/... (the maximum size we will see the ingest grow to), we would see that the API does not return their UUIDs since they have not been ingested yet. We would also see that only 418 neighbors are returned even though we asked for 832, since there are only 418 elements currently in the index. We will use these three files as proof that we are actually expanding the searchable content after each incremental ingest.

We provide a helper bash script, test_in_index.sh, for checking whether a file is findable via the search API. A call of the form:
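For example (argument order assumed: image file, then neighbor count):

    ./test_in_index.sh leedsbutterfly/images/001_0001.jpg 832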
… performs a curl call to the server's default host address and port for the 832 nearest neighbors to the query image file, and checks whether the UUID of the given file (the sha1sum) is in the returned list of UUIDs.
[3] First Incremental Update
Now that we have a live NearestNeighborServiceServer instance running, we can incrementally process the files listed in 3.ingest_files_2.txt, making them available for search without having to shut down or otherwise do anything to the running server instance.

We will be performing the same actions taken in steps 2a and 2c, but with different inputs and outputs:
The following is the updated configuration file for hash code generation. Note the highlighted lines for differences from step 2c (notes to follow):
Line notes:
The provided 3.run.sh script is an example of the commands to run for updating the indices and models:

After calling the compute_hash_codes.py script, the server logging should yield messages (if run in debug/verbose mode) showing that the LSHNearestNeighborIndex updated its model.

We can now test the NearestNeighborServiceServer using the query examples used at the end of step 2d. Using images leedsbutterfly/images/001_0001.jpg and leedsbutterfly/images/001_0042.jpg as our query examples (and .../n=832/...), we can see that both are in the index (each image is the nearest neighbor to itself). We also see that a total of 627 neighbors are returned, which is the current number of elements now in the index after this update. The sha1 of the third image file, leedsbutterfly/images/001_0063.jpg, when used as the query example, is not included in the returned neighbors and thus is not yet found in the index.

[4] Second Incremental Update
Let us repeat the above process again, but using the third increment set (highlighted lines differ from 3.run.sh):

After this, we should be able to query all three example files used before and see that they are all now included in the index. We will now also see that all 832 neighbors requested are returned for each of the queries, which equals the total number of files we have ingested over the above steps. If we increase n for a query, only 832 neighbors are returned, showing that there are 832 elements in the index at this point.

Release Process and Notes
Steps of the SMQTK Release Process
Three types of releases are expected to occur:
- major
- minor
- patch

See the CONTRIBUTING.md file for information on how to contribute features and patches.

The following process should apply when any release that changes the version number occurs.
Create and merge version update branch

Patch Release
Create a new branch off of the release branch named something like release-patch-{NEW_VERSION}.

Merge the version bump branch into the release and master branches.

Major and Minor Releases
Create a new branch off of the master branch named something like release-[major,minor]-{NEW_VERSION}.

Merge the version bump branch into the master branch.

Reset the release branch (--hard) to point to the new master.

Tag new version
Create a new git tag using the new version number (format: v<MAJOR>.<MINOR>.<PATCH>) on the merge commit for the version update branch merger, then push this new tag to GitHub (assuming the origin remote points to SMQTK on GitHub):
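For example (the version number shown is illustrative):

    git tag -a v0.9.0 -m "SMQTK v0.9.0"
    git push origin v0.9.0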
To add the release notes to GitHub, navigate to the tags page on GitHub and click on the “Add release notes” link for the new release tag. Copy and paste this version’s release notes into the description field and the version number should be used as the release title.
Create new version release to PyPI
Make sure the source is checked out on the newest version tag, the repo is clean (no uncommitted files/edits), and the build and dist directories are removed:
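For example:

    rm -rf build dist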
Create the build and dist files for the current version with the following command(s) from the source tree root directory:
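A hedged sketch (assuming the wheel package is installed for bdist_wheel):

    python setup.py sdist bdist_wheel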
Make sure your $HOME/.pypirc file is up-to-date and includes the following section with your username/password:
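For example (a minimal sketch of the section):

    [pypi]
    username = <your-pypi-username>
    password = <your-pypi-password>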
Make sure the twine python package is installed and up-to-date, and then upload the dist packages created with:
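For example:

    twine upload dist/*

Release Notes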
SMQTK v0.2 Release Notes
This is a minor release of SMQTK that provides both new functionality and fixes over the previous version v0.1.
The highlights of this release are new and updated interface classes, an updated plugin system, new HBase and PostgreSQL DataElement implementations, and a new wrapper for Caffe CNN descriptor extraction.
Additional one-off scripts were added for reference as well as a more generally usable utility for listing out available plugins for the running system and environment.
Additional notes about the release are provided below.
Updates / New Features since v0.1
General
Documentation
Plugins
Data Elements
Data Sets
Descriptor Generators
Nearest Neighbors
Web Tools
Python Utilities
Tools / Scripts
Fixes since v0.1
IQR web application demo
Code Index
Descriptor Generators
Nearest Neighbors
Relevancy Index
IQR Utils
Tests
Tools / Scripts
Miscellaneous
SMQTK v0.2.1 Release Notes
This is a minor release with a necessary bug fix for installing SMQTK. This release also has a minor documentation update regarding Caffe AlexNet default model files and how/where to get them.
Updates / New Features since v0.2
Documentation
Fixes since v0.2
Build
SMQTK v0.2.2 Release Notes
This minor release primarily adds classifier algorithm and classification representation support, a new service web application for nearest-neighbors algorithms, as well as additional documentation.

Also, this release adds a few more command line tools; especially of note is iqrTrainClassifier.py, which can train a classifier based on the saved state of the IQR demonstration application (also a new feature).

Updates / New Features since v0.2.1
Classifiers
Classification Elements
Data Elements
Descriptor Elements
Descriptor Generators
Documentation
Tools / Scripts
Web / Services
Fixes since v0.2.1
Custom LibSVM
Descriptor Elements
Data Sets
Docs
Utils
SMQTK v0.3.0 Release Notes
This minor release primarily adds a new modular LSH nearest-neighbor index algorithm implementation. This new implementation strictly replaces the now deprecated and removed ITQNearestNeighborsIndex implementation because of its increased modularity and flexibility. The old ITQNearestNeighborsIndex implementation had been hard-coded, and its previous functionality can be reproduced with the new implementation (ItqFunctor + LinearHashIndex).

The CodeIndex representation interface has also been deprecated, as its function has been replaced by the LSHNearestNeighborIndex implementation.

Updates / New Features since v0.2.2
CodeIndex
Custom LibSVM
DescriptorIndex
Documentation
HashIndex
LshFunctor
NearestNeighborIndex
Tests
Tools / Scripts
Utilities
Web / Services
Fixes since v0.2.2
DescriptorElement
Tools / Scripts
Utilities
Web / Services
SMQTK v0.4.0 Release Notes
This is a minor release that provides various minor updates and fixes as well as a few new command-line tools and a new web service application.
Among the new tools include a couple classifier validation scripts for checking the performance of a classification algorithm fundamentally as well as against a specific test set.
A few MEMEX program specific scripts have been added in a separate directory, defining an ingestion process from an ElasticSearch instance through descriptor and hash code computation.
Finally, a new web service has been added that exposes the IQR process for external tools. The existing IQR demo web application still functions as it did before, but does not yet use this service under the hood.
Updates / New Features since v0.3.0
Classifiers
Compute Functions
Descriptor Index
Documentation
IQR
Tools / Scripts
Utilities
Web
Fixes since v0.3.0
ClassificationElement
HashIndex
SMQTK v0.5.0 Release Notes
This is a minor release that provides minor updates and fixes as well as a new Classifier implementation, new parameters for some existing algorithms and added scripts that were the result of a recent hackathon.
The new classifier implementation, the IndexLabelClassifier, was created for the situation where the resultant vector from a DescriptorGenerator is actually a set of classification probabilities. An example where this may be the case is when a CNN model and configuration for the Caffe implementation yields a class probability (or Softmax) layer.

The specific scripts added from the hackathon are related to classifying entities based on associated image content.
Updates / New Features since v0.4.0
Classifier
Descriptor Generators
Descriptor Index
libSVM
LSH Functors
Scripts
Utilities
Fixes since v0.4.0
CMake
Scripts
SMQTK v0.6.0 Release Notes
This minor release provides bug fixes and minor updates as well as Docker wrapping support for RESTful services, a one-button Docker initialization script for a directory of images, and timed IQR session expiration.
The docker directory is intended to host container Dockerfile packages, as well as other associated scripts relating to docker use of SMQTK. With this release, we provide a Dockerfile, with associated scripts and default configurations, for a container that hosts a Caffe install for descriptor computation, replete with AlexNet model files, as well as the NearestNeighbor and IQR RESTful services. This container can be used with the docker/smqtk_services.run_images.sh script for image directory processing, or with existing model files and descriptor index.

The IQR Controller class has been updated to optionally time-out sessions and clean itself over time. This is required for any service that is going to stick around for any substantial length of time, as resources would otherwise build up and the host machine would run out of RAM.
Updates / New Features since v0.5.0
CMake
Descriptor Index
Docker
IQR
Nearest Neighbors Index
Utilities
Scripts
Web Apps
Fixes since v0.5.0
Descriptor Index
IQR
Utilities
Web Apps
SMQTK v0.6.1 Release Notes
This is a patch release with bug fixes for the Docker wrapping of RESTful services introduced in v0.6.0.
Fixes since v0.6.0
Docker
SMQTK v0.6.2 Release Notes
This is a patch release with a bug fix for Caffe descriptor generation introduced in v0.6.0.

Fixes since v0.6.1
Descriptor Generation
SMQTK v0.7.0 Release Notes
This minor release incorporates various fixes and enhancements to representation and algorithms interfaces and implementations.
A new docker image has been added to wrap the IQR web interface and headless services. This image can either be used as a push-button image-ingestion and IQR interface container, or as a fully featured environment to play around with SMQTK, Caffe deep-learning-based content description and IQR.

A major departure has happened for some representation structures, like DataElements, as they are no longer considered hashable and now have interfaces reflecting their mutability. Representation structures, by their nature of having arbitrary backends, may be modified by external agents interacting in a separate manner with the backend being used. This has also opened up the ability to provide algorithm implementations with DataElement instances instead of filepaths for desired byte content, and many implementations have transitioned over to using this pattern. There is nothing fundamentally wrong with requesting file-path input, however it is restricting as to where configuration files or data models may come from.
Updates / New Features since v0.6.2
Algorithms
Build System
Classifier Interface
Compute Functions
Descriptor Elements
Descriptor Generator
Devops::Ansible
Devops::Docker
Documentation
Girder
Misc.
Representation
Scripts
Utilities
Web
Fixes since v0.6.2
Documentation
Scripts
Metrics
Utilities
Web
SMQTK v0.8.0 Release Notes
This minor release represents the merger of a public release that added a Girder-based implementation of the DataElement interface. We also optimized the use of the PostgreSQL DescriptorIndex implementation to use named cursors for large queries.
Updates / New Features since v0.7.0
Data Structures
Girder
Fixes since v0.7.0
Data Structures
Dependencies
Scripts
Tests
SMQTK v0.8.1 Release Notes
This patch release addresses a bug with PostgreSQL implementations incorrectly calling a helper class.
Fixes since v0.8.0
Descriptor Index Plugins
Utilities
SMQTK v0.9.0 Release Notes
This minor release represents an update to supporting python 3 versions as well as adding connection pooling support to the PostgreSQL helper class.
Updates / New Features since v0.8.1
General
Travis CI
Fixes since v0.8.1
Tests
SMQTK v0.10.0 Release Notes
This minor release represents the merger of public release request 88ABW-2018-3703. This large update adds a number of functionality improvements and API changes, docker image improvements and expansions (see the new classifier service), FAISS algorithm wrapper improvements, NearestNeighborIndex update and removal support, a switch to the py.test testing framework, a generalized classification probability adjustment function, code clean-up, bug fixes and more.

Updates / New Features since v0.9.0
Algorithms
Representations
Docker
IQR module
Scripts
Testing
Utilities module
Web
Fixes since v0.9.0
Algorithms
Representations
Scripts
Setup.py
Tests
Web