Utilities and Applications

Also part of SMQTK are support utility modules, utility scripts (effectively the “binaries”) and service-oriented and demonstration web applications.

Utility Modules

Miscellaneous functionality intended to support the primary goals of SMQTK. See the doc-string comments on sub-module classes and functions in the [smqtk.utils](/python/smqtk/utils) module.

Utility Scripts

Located in the [smqtk.bin](/python/smqtk/bin) module are various scripts intended to provide quick access or generic entry points to common SMQTK functionality. These scripts generally require configuration via a JSON text file, and their executable entry points are installed via setup.py. As a rule of thumb, scripts that require a configuration also provide an option for outputting a default or example configuration file.

Currently available utility scripts in alphabetical order:

classifier_kfold_validation

Helper utility for cross validating a supervised classifier configuration. The classifier used should NOT be configured to save its model since this process requires us to train the classifier multiple times.
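Why a persistent model gets in the way can be seen in a minimal sketch of k-fold splitting. The helper name `kfold_splits` is hypothetical and this is not the utility's actual implementation; it only illustrates that every fold needs a freshly trained classifier.

```python
import random


def kfold_splits(items, num_folds, random_seed=None):
    """Yield (train, test) lists for each of ``num_folds`` folds.

    Hypothetical helper for illustration; not part of SMQTK.
    """
    if num_folds < 2:
        raise ValueError("num_folds must be at least 2")
    shuffled = list(items)
    # A fixed seed makes the fold assignment repeatable between runs.
    random.Random(random_seed).shuffle(shuffled)
    # Deal elements round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Each iteration would train a brand-new classifier on ``train`` and
# evaluate it on ``test``; a model saved by an earlier fold would be stale.
splits = list(kfold_splits(["a", "b", "c", "d", "e", "f"], 3, random_seed=0))
```

With six elements and three folds, each split holds out two elements and trains on the other four.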

  • plugins
    • supervised_classifier
      Supervised Classifier implementation configuration to use. If possible, this should not be set to use a persistent model (this utility will repeatedly train a new model for each fold).
    • descriptor_index
      Index to draw descriptors to classify from.
  • cross_validation
    • truth_labels
      Path to a CSV file containing descriptor UUID to truth label associations. This defines what descriptors are used from the given index. We error if any descriptor UUIDs listed here are not available in the given descriptor index. This file should be in [uuid, label] column format.
    • num_folds
      Number of folds to make for cross validation.
    • random_seed
      Optional fixed seed for the
    • classification_use_multiprocessing
      If we should use multiprocessing (vs threading) when classifying elements.
  • pr_curves
    • enabled
      If Precision/Recall plots should be generated.
    • show
      If we should attempt to show the graph after it has been generated (matplotlib).
    • output_directory
      Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.
    • file_prefix
      String prefix to prepend to standard plot file names.
  • roc_curves
    • enabled
      If ROC curves should be generated.
    • show
      If we should attempt to show the plot after it has been generated (matplotlib).
    • output_directory
      Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.
    • file_prefix
      String prefix to prepend to standard plot file names.
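The nesting described above corresponds to a JSON document shaped roughly like the following. Every value shown is a placeholder assumption (the two plugin blocks are left empty because their content depends on the chosen implementations); the authoritative defaults come from running the script with the -g option.

```python
import json

# Placeholder values only; generate the real defaults with ``-g``.
plot_options = {
    "enabled": True,
    "show": False,
    "output_directory": None,  # None means plots are not saved
    "file_prefix": "",
}

example_config = {
    "plugins": {
        "supervised_classifier": {},  # implementation-specific block
        "descriptor_index": {},       # implementation-specific block
    },
    "cross_validation": {
        "truth_labels": "truth_labels.csv",  # hypothetical path
        "num_folds": 6,
        "random_seed": None,
        "classification_use_multiprocessing": True,
    },
    "pr_curves": dict(plot_options),
    "roc_curves": dict(plot_options),
}

# Python's None/True/False serialise to JSON null/true/false.
config_text = json.dumps(example_config, indent=2, sort_keys=True)
```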

usage: classifier_kfold_validation [-h] [-v] [-c PATH] [-g PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
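The "update the default configuration" behaviour can be pictured as a recursive dictionary merge. The sketch below is an assumption about the general idea, using a hypothetical `merge_config` helper; it is not taken from the SMQTK source.

```python
import copy


def merge_config(default, override):
    """Return a copy of ``default`` updated with values from ``override``.

    Nested dictionaries are merged key by key; any other value in
    ``override`` replaces the default. Hypothetical helper for illustration.
    """
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


default = {"pr_curves": {"enabled": True, "show": False}, "num_folds": 6}
given = {"pr_curves": {"show": True}}
merged = merge_config(default, given)
```

Keys absent from the given configuration keep their default values, so a partial file is enough to override a single setting.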

classifier_model_validation

Utility for validating a given classifier implementation’s model against some labeled testing data, outputting PR and ROC curve plots with area-under-curve score values.

This utility can optionally be used to train a supervised classifier model if the given classifier model configuration does not exist and a second CSV file listing labeled training data is provided. Training will be attempted if train is set to true. If training is performed, we exit after training completes. A SupervisedClassifier sub-classing implementation must be configured.

We expect the test and train CSV files in the column format:

… <UUID>,<label> …

The UUID is of the descriptor to which the label applies. The label may be any arbitrary string value, but all labels must be consistent in application.
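A file in this column format can be read with the standard csv module. The function name `load_truth_labels` is hypothetical; this is a sketch of the format, not the script's own loader.

```python
import csv
import io


def load_truth_labels(csv_file):
    """Map each descriptor UUID string to its label from ``uuid,label`` rows."""
    labels = {}
    for row in csv.reader(csv_file):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        uuid, label = row[0].strip(), row[1].strip()
        labels[uuid] = label
    return labels


# In-memory stand-in for a real file opened with ``open(path, newline="")``.
sample = io.StringIO("abc123,cat\ndef456,dog\n\n789aaa,cat\n")
truth = load_truth_labels(sample)
```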

Some metrics presented assume the highest confidence class as the single predicted class for an element:

  • confusion matrix

The output UUID confusion matrix is a JSON dictionary where the top-level keys are the true labels, and the inner dictionary is the mapping of predicted labels to the UUIDs of the classifications/descriptors that yielded the prediction. Again, this is based on the maximum probability label for a classification result (T=0.5).
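The structure of that dictionary can be sketched as follows. The input shapes (a UUID-to-label map and a UUID-to-confidence map) and the function name are assumptions made for illustration.

```python
def uuid_confusion_matrix(truth, classifications):
    """Build {true_label: {predicted_label: [uuid, ...]}}.

    ``truth`` maps UUID -> true label; ``classifications`` maps
    UUID -> {label: confidence}. Both input shapes are assumptions
    made for this sketch.
    """
    matrix = {}
    for uuid, true_label in truth.items():
        scores = classifications[uuid]
        # The predicted class is the label with the highest confidence.
        predicted = max(scores, key=scores.get)
        matrix.setdefault(true_label, {}).setdefault(predicted, []).append(uuid)
    return matrix


truth = {"u1": "cat", "u2": "cat", "u3": "dog"}
classifications = {
    "u1": {"cat": 0.9, "dog": 0.1},
    "u2": {"cat": 0.3, "dog": 0.7},
    "u3": {"cat": 0.2, "dog": 0.8},
}
matrix = uuid_confusion_matrix(truth, classifications)
```

Here `u2` is a "cat" that was predicted as "dog", so it appears under `matrix["cat"]["dog"]`.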

See Scikit-Learn PR and ROC curve explanations and examples:

usage: classifier_model_validation [-h] [-v] [-c PATH] [-g PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

classifyFiles

Based on an input, trained classifier configuration, classify a number of media files, whose descriptor is computed by the configured descriptor generator. Input files that classify as the given label are then output to standard out. Thus, this script acts like a filter.

usage: classifyFiles [-h] [-v] [-c PATH] [-g PATH] [--overwrite] [-l LABEL]
                     [GLOB [GLOB ...]]

Positional Arguments

GLOB Series of shell globs specifying the files to classify.

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Classification

--overwrite

When generating a configuration file, overwrite an existing file.

Default: False

-l, --label The class to filter by. This is based on the classifier configuration/model used. If this is not provided, we will list the available labels in the provided classifier configuration.

compute_classifications

Script for asynchronously computing classifications for DescriptorElements in a DescriptorIndex specified via a list of UUIDs. Results are output to a CSV file in the format:

uuid, label1_confidence, label2_confidence, …

CSV column labels are output to the given CSV header file path. Label columns will be in the order reported by the classifier implementation's get_labels method.

Due to using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.
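The split between a header file and a data file might look like the sketch below. The helper name, the in-memory inputs, and the choice to write a leading "uuid" header cell are assumptions; only the column order (UUID first, then one confidence per label in the classifier's label order) comes from the description above.

```python
import csv
import io


def write_classifications(labels, results, header_file, data_file):
    """Write label names to ``header_file`` and confidence rows to ``data_file``.

    ``labels`` is the ordered label list (standing in for what the
    classifier's ``get_labels`` method reports); ``results`` maps
    UUID -> {label: confidence}. Hypothetical helper for illustration.
    """
    csv.writer(header_file).writerow(["uuid"] + list(labels))
    writer = csv.writer(data_file)
    for uuid, scores in results.items():
        # Emit confidences in label order, regardless of dict ordering.
        writer.writerow([uuid] + [scores[label] for label in labels])


header_out, data_out = io.StringIO(), io.StringIO()
write_classifications(
    ["cat", "dog"],
    {"u1": {"dog": 0.25, "cat": 0.75}},
    header_out,
    data_out,
)
```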

usage: compute_classifications [-h] [-v] [-c PATH] [-g PATH]
                               [--uuids-list PATH] [--csv-header PATH]
                               [--csv-data PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Input Output Files

--uuids-list Path to the input file listing UUIDs to process.
--csv-header Path to the file to output column header labels.
--csv-data Path to the file to output the CSV data to.

compute_hash_codes

Compute LSH hash codes, using the provided functor, for all descriptors in the configured index or for a specific subset given as a file-list of UUIDs.

When using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.

We update a key-value store with the results of descriptor hash computation. We assume the keys of the store are the integer hash values and the values of the store are frozenset instances of descriptor UUIDs (hashable-type objects). We also assume that no other source is concurrently modifying this key-value store due to the need to modify the values of keys.
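The update pattern described here can be sketched with a plain dictionary standing in for the key-value store. The helper names and the most-significant-bit-first packing are assumptions for illustration, not the script's actual code.

```python
def bits_to_int(bits):
    """Pack an iterable of 0/1 values (most significant first) into an int."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bool(bit))
    return value


def add_hash(store, hash_value, uuid):
    """Record that ``uuid`` hashes to the integer ``hash_value``.

    frozenset is immutable, so the stored value is replaced with a new
    set rather than modified in place. This read-then-write step is why
    a concurrent writer could cause lost updates.
    """
    existing = store.get(hash_value, frozenset())
    store[hash_value] = existing | {uuid}


store = {}  # plain dict standing in for the key-value store
add_hash(store, bits_to_int([1, 0, 1]), "u1")
add_hash(store, bits_to_int([1, 0, 1]), "u2")
add_hash(store, bits_to_int([0, 1, 1]), "u3")
```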

usage: compute_hash_codes [-h] [-v] [-c PATH] [-g PATH] [--uuids-list PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

I/O

--uuids-list Optional path to a file listing UUIDs of descriptors to compute hash codes for. If not provided we compute hash codes for all descriptors in the configured descriptor index.

compute_many_descriptors

Descriptor computation helper utility. Checks data content type with respect to the configured descriptor generator to skip content that does not match the accepted types. Optionally, we can additionally filter out image content whose image bytes we cannot load via PIL.Image.open.

usage: compute_many_descriptors [-h] [-v] [-c PATH] [-g PATH] [-b INT]
                                [--check-image] [-f PATH] [-p PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

-b, --batch-size

Number of files to batch together into a single compute async call. This defines the granularity of the checkpoint file with regard to computation completed. If given 0, we do not batch and will perform a single compute_async call on the configured generator.

Default: 0

--check-image

If we should check image pixel loading before queueing an input image for processing. If we cannot load the image pixels via PIL.Image.open, the input image is not queued for processing.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Required Arguments

-f, --file-list
 Path to a file that lists data file paths. Paths in this file may be relative, but will at some point be coerced into absolute paths based on the current working directory.
-p, --completed-files
 Path to a file into which we add CSV format lines detailing the file paths from the provided file-list that have been computed, along with the UUID for that data (currently the SHA1 checksum of the data).
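Two ideas from the options above, batching the input list and using a SHA1 checksum as the UUID, can be sketched with the standard library. The helper names are hypothetical and the real script's checkpoint handling is not reproduced here.

```python
import hashlib


def sha1_uuid(data):
    """Return the hexadecimal SHA1 checksum of ``data`` (bytes)."""
    return hashlib.sha1(data).hexdigest()


def batches(paths, batch_size):
    """Yield lists of at most ``batch_size`` paths; 0 means one single batch."""
    paths = list(paths)
    if batch_size <= 0:
        yield paths
        return
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


all_paths = ["a.png", "b.png", "c.png", "d.png", "e.png"]
grouped = list(batches(all_paths, 2))
single = list(batches(all_paths, 0))
empty_uuid = sha1_uuid(b"")
```

A smaller batch size means progress is recorded more often, at the cost of more separate compute calls.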

computeDescriptor

Compute a descriptor vector for a given data file, outputting the generated feature vector to standard out, or to an output file if one was specified (in numpy format).

usage: computeDescriptor [-h] [-v] [-c PATH] [-g PATH] [--overwrite]
                         [-o OUTPUT_FILEPATH]
                         [input_file]

Positional Arguments

input_file Data file to compute descriptor on

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

--overwrite

Force descriptor computation even if an existing descriptor vector was discovered based on the given content descriptor type and data combination.

Default: False

-o, --output-filepath
 Optional path to a file to output feature vector to. Otherwise the feature vector is printed to standard out. Output is saved in numpy binary format (.npy suffix recommended).

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

createFileIngest

Add a set of local system files to a data set via explicit paths or shell-style glob strings.

usage: createFileIngest [-h] [-v] [-c PATH] [-g PATH] [GLOB [GLOB ...]]

Positional Arguments

GLOB

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

descriptors_to_svmtrainfile

Utility script to transform a set of descriptors, specified by UUID, with matching class labels, to a test file usable by libSVM utilities for train/test experiments.

The input CSV file is assumed to be of the format:

uuid,label …

This is the same as the format requested for other scripts like classifier_model_validation.py.

This is very useful for searching for -c and -g parameter values for a training sample of data using the tools/grid.py script, found in the libSVM source tree. For example:

<smqtk_source>/TPL/libsvm-3.1-custom/tools/grid.py -log2c -5,15,2 -log2g 3,-15,-2 -v 5 -out libsvm.grid.out -png libsvm.grid.png -t 0 -w1 3.46713615023 -w2 12.2613240418 output_of_this_script.txt

usage: descriptors_to_svmtrainfile [-h] [-v] [-c PATH] [-g PATH] [-f PATH]
                                   [-o PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

IO Options

-f Path to the csv file mapping descriptor UUIDs to their class label. String labels are transformed into integers for libSVM. Integers start at 1 and are applied in the order that labels are seen in this input file.
-o Path to the output file to write libSVM labeled descriptors to.
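The label-to-integer rule and the libSVM line layout (`<label> <index>:<value> ...` with 1-based feature indices) can be sketched as follows. The function name and the in-memory inputs are assumptions standing in for the CSV file and the descriptor index.

```python
def to_libsvm_lines(uuid_label_pairs, vectors):
    """Return libSVM-format lines and the label-to-integer mapping.

    ``uuid_label_pairs`` is an ordered iterable of (uuid, label) rows and
    ``vectors`` maps UUID -> sequence of floats. Both are stand-ins for
    the CSV file and the descriptor index.
    """
    label_ids = {}
    lines = []
    for uuid, label in uuid_label_pairs:
        # Integers start at 1 and follow the order labels are first seen.
        label_id = label_ids.setdefault(label, len(label_ids) + 1)
        # libSVM feature indices are 1-based: "<index>:<value>".
        features = " ".join(
            "%d:%s" % (i, float(v)) for i, v in enumerate(vectors[uuid], start=1)
        )
        lines.append("%d %s" % (label_id, features))
    return lines, label_ids


rows = [("u1", "dog"), ("u2", "cat"), ("u3", "dog")]
vectors = {"u1": [0.5, 1.0], "u2": [0.0, 2.0], "u3": [1.5, 0.25]}
lines, label_ids = to_libsvm_lines(rows, vectors)
```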

generate_image_transform

Utility for transforming an input image in various standardized ways, saving out those transformed images with standard namings. Transformations used are configurable via a configuration file (JSON).

Configuration details:

{
    "crop": {

        "center_levels": null | int
            # If greater than 0, crop out one or more increasingly smaller
            # images from a base image by cutting off increasingly larger
            # portions of the outside perimeter. Cropped image dimensions are
            # determined by the dimensions of the base image and the number
            # of crops to generate.

        "quadrant_pyramid_levels": null | int
            # If greater than 0, generate a number of crops based on a number
            # of quad-tree partitions made based on the given number of
            # levels. Partitions for all levels less than the level provided
            # are also made.

        "tile_shape": null | [width, height]
            # If not null and is a list of two integers, crop out tile
            # windows from the base image that have the width and height
            # specified. If the image width or height is not evenly divisible
            # by the tile width or height, respectively, then we crop out as
            # many tiles as neatly fit starting from the axis origin. The
            # remaining pixels are ignored.

        "tile_stride": null | [x, y]
            # If not null and is a list of two integers, crop out sub-images
            # of the above width and height (if given) with this stride. When
            # this is not provided, the default stride is the same as the
            # tile width and height.

    },

    "brightness_levels": null | int
        # Generate a number of images with different brightness levels using
        # linear interpolation to choose levels between 0 (black) and 1
        # (original image) as well as between 1 and 2.
        # Results will not include brightness level 0, 1 or 2 images.

    "contrast_levels": null | int
        # Generate a number of images with different contrast levels using
        # linear interpolation to choose levels between 0 (black) and 1
        # (original image) as well as between 1 and 2.
        # Results will not include contrast level 0, 1 or 2 images.

}
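The interaction of the tile_shape and tile_stride options can be illustrated by computing the crop windows alone. This is a sketch under the rules stated in the comments above (whole tiles only, starting at the origin, stride defaulting to the tile shape); the function name is hypothetical and no image library is involved.

```python
def tile_boxes(image_size, tile_shape, tile_stride=None):
    """Return (left, upper, right, lower) boxes for whole tiles only.

    Tiling starts at the origin; windows that would extend past the
    image edge are skipped. The stride defaults to the tile shape.
    """
    img_w, img_h = image_size
    tile_w, tile_h = tile_shape
    stride_x, stride_y = tile_stride if tile_stride else tile_shape
    boxes = []
    for y in range(0, img_h - tile_h + 1, stride_y):
        for x in range(0, img_w - tile_w + 1, stride_x):
            boxes.append((x, y, x + tile_w, y + tile_h))
    return boxes


# A 5x4 image with 2x2 tiles: the last pixel column is ignored.
boxes = tile_boxes((5, 4), (2, 2))
# A stride smaller than the tile width produces overlapping windows.
overlapping = tile_boxes((4, 2), (2, 2), tile_stride=(1, 2))
```

The box tuples use the same (left, upper, right, lower) ordering that PIL's Image.crop accepts.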

usage: generate_image_transform [-h] [-v] [-c PATH] [-g PATH] [-i IMAGE]
                                [-o OUTPUT]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Input/Output

-i, --image Image to produce transformations for.
-o, --output Directory to output generated images to. By default, if not told otherwise, we will write output images in the same directory as the source image. Output images share a core filename as the source image, but with extra suffix syntax to differentiate produced images from the original. Output images will share the same image extension as the source image.

iqr_app_model_generation

Train and generate models for the SMQTK IQR Application.

This application takes the same configuration file as the IqrService REST service. To generate a default configuration, please refer to the runApplication tool for the IqrService application:

runApplication -a IqrService -g config.IqrService.json

usage: iqr_app_model_generation [-h] [-v] -c PATH PATH -t TAB GLOB [GLOB ...]

Positional Arguments

GLOB Shell glob to files to add to the configured data set.

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

-c, --config Path to the JSON configuration files. The first file provided should be the configuration file for the IqrSearchDispatcher web-application and the second should be the configuration file for the IqrService web-application.
-t, --tab The configuration “tab” of the IqrSearchDispatcher configuration to use. This informs what dataset to add the input data files to.

iqrTrainClassifier

Train a supervised classifier based on an IQR session state dump.

Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session state dump, must exist external to the IQR web-app (i.e. the web-app must use a non-memory backend). This is needed so that this script can access them for classifier training.

Click the “Save IQR State” button to download the IqrState file encapsulating the descriptors of positively and negatively marked items. These descriptors will be used to train the configured SupervisedClassifier.

usage: iqrTrainClassifier [-h] [-v] [-c PATH] [-g PATH] [-i IQR_STATE]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

-i, --iqr-state
 Path to the ZIP file saved from an IQR session.

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

make_balltree

Script for building and saving the model for the SkLearnBallTreeHashIndex implementation of HashIndex.

usage: make_balltree [-h] [-v] [-c PATH] [-g PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

minibatch_kmeans_clusters

Script for generating clusters from descriptors in a given index using the mini-batch KMeans implementation from Scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html).

By the nature of Scikit-learn’s MiniBatchKMeans implementation, Euclidean distance is used to measure distance between descriptors.

usage: minibatch_kmeans_clusters [-h] [-v] [-c PATH] [-g PATH] [-o PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Output

-o, --output-map
 Path to output the clustering class mapping to. Saved as a pickle file using protocol -1 (the highest protocol available to the running Python interpreter).
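Reading such a file back only needs the standard pickle module. The mapping shape used below (cluster label to a set of descriptor UUIDs) is a hypothetical stand-in, since the exact structure is not described here.

```python
import pickle

# Hypothetical cluster mapping: cluster label -> set of descriptor UUIDs.
cluster_map = {0: {"u1", "u4"}, 1: {"u2"}, 2: {"u3", "u5"}}

# protocol=-1 selects pickle.HIGHEST_PROTOCOL for the running interpreter.
payload = pickle.dumps(cluster_map, protocol=-1)

# pickle.loads detects the protocol automatically; for a file on disk use
# ``pickle.load(open(path, "rb"))`` instead.
restored = pickle.loads(payload)
```

Because the highest protocol depends on the Python version that wrote the file, an older interpreter may not be able to load it.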

proxyManagerServer

Server for hosting a proxy manager, which in turn hosts proxy object instances.

This takes a simple configuration file that looks like the following:

[server]
port = <integer>
authkey = <string>
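A file in this INI-style layout can be parsed with the standard configparser module. The port and key values below are placeholders, not defaults shipped with SMQTK, and this is not necessarily how the server itself reads the file.

```python
import configparser

# Example values are placeholders chosen for this sketch.
EXAMPLE = """\
[server]
port = 5000
authkey = not-a-real-key
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)

# getint converts the textual value to an integer and raises on bad input.
port = parser.getint("server", "port")
authkey = parser.get("server", "authkey")
```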

usage: proxyManagerServer [-h] [-v] [-c PATH] [-g PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

removeOldFiles

Utility to recursively scan and remove files underneath a given directory if they have not been modified for longer than a set amount of time.

usage: removeOldFiles [-h] [-d BASE_DIR] [-i INTERVAL] [-e EXPIRY] [-v]

Named Arguments

-d, --base-dir Starting directory for scan.
-i, --interval Number of seconds between each scan (integer).
-e, --expiry Number of seconds until a file has “expired” (integer).
-v, --verbose

Display more messages (debugging).

Default: False
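A single scan pass of this kind can be sketched with os.walk and file modification times. The function name `remove_expired` is hypothetical; the real script additionally repeats the scan every --interval seconds.

```python
import os
import time


def remove_expired(base_dir, expiry_seconds, now=None):
    """Delete files under ``base_dir`` not modified for ``expiry_seconds``.

    Returns the list of removed paths. Hypothetical helper performing
    one scan pass; directories themselves are left in place.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Compare the file's age against the expiry threshold.
            if now - os.path.getmtime(path) > expiry_seconds:
                os.remove(path)
                removed.append(path)
    return removed
```

Passing `now` explicitly keeps the age comparison consistent across a long scan and makes the function easy to test.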

runApplication

Generic entry point for running SMQTK web applications defined in [smqtk.web](/python/smqtk/web).

Runs conforming SMQTK Web Applications.

usage: runApplication [-h] [-v] [-c PATH] [-g PATH] [-l] [-a APPLICATION] [-r]
                      [-t] [--host HOST] [--port PORT] [--use-basic-auth]
                      [--debug-server] [--debug-smqtk]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Application Selection

-l, --list

List currently available applications for running. More description is included if SMQTK verbosity is increased (-v | --debug-smqtk).

Default: False

-a, --application
 Label of the web application to run.

Server options

-r, --reload

Turn on server reloading.

Default: False

-t, --threaded

Turn on server multi-threading.

Default: False

--host Run host address specification override. This will override all other configuration method specifications.
--port Run port specification override. This will override all other configuration method specifications.
--use-basic-auth

Use global basic authentication as configured.

Default: False

Other options

--debug-server

Turn on server debugging messages ONLY.

Default: False

--debug-smqtk

Turn on SMQTK debugging messages ONLY.

Default: False

summarizePlugins

Print out information about what plugins are currently usable and the documentation headers for each implementation.

usage: summarizePlugins [-h] [-v] [--defaults DEFAULTS]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

--defaults

Optionally generate default configuration blocks for each plugin structure and output as JSON to the specified path.

Default: False

train_itq

Tool for training the ITQ functor algorithm’s model on descriptors in an index.

By default, we use all descriptors in the configured index (uuids_list_filepath is not given a value).

The uuids_list_filepath configuration property is optional and should be used to specify a sub-set of descriptors in the configured index to train on. This only works if the stored descriptors’ UUIDs are strings.

usage: train_itq [-h] [-v] [-c PATH] [-g PATH]

Named Arguments

-v, --verbose

Output additional debug logging.

Default: False

Configuration

-c, --config Path to the JSON configuration file.
-g, --generate-config
 Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.