Utilities and Applications¶

Also part of SMQTK are support utility modules, utility scripts (effectively the “binaries”) and service-oriented and demonstration web applications.

Utility Modules¶

Various unclassified functionality intended to support the primary goals of SMQTK. See the doc-string comments on sub-module classes and functions in the [smqtk.utils](/python/smqtk/utils) module.

Utility Scripts¶

Located in the [smqtk.bin](/python/smqtk/bin) module are various scripts intended to provide quick access or generic entry points to common SMQTK functionality. These scripts generally require configuration via a JSON text file, and executable entry points are installed via setup.py. As a rule of thumb, scripts that require a configuration also provide an option for outputting a default or example configuration file.

Currently available utility scripts in alphabetical order:

classifier_kfold_validation¶

Helper utility for cross validating a supervised classifier configuration. The classifier used should NOT be configured to save its model since this process requires us to train the classifier multiple times.

plugins

supervised_classifier
Supervised Classifier implementation configuration to use. This should not be set to use a persistent model if able (this utility will repeatedly train a new model for each fold).

descriptor_set
Index to draw descriptors to classify from.

cross_validation

truth_labels
Path to a CSV file containing descriptor UUID to truth label associations. This defines which descriptors are used from the given index. We error if any descriptor UUIDs listed here are not available in the given descriptor index. This file should be in [uuid, label] column format.

num_folds
Number of folds to make for cross validation.

random_seed
Optional fixed seed for the

classification_use_multiprocessing
If we should use multiprocessing (vs threading) when classifying elements.

pr_curves

enabled
If Precision/Recall plots should be generated.

show
If we should attempt to show the graph after it has been generated (matplotlib).

output_directory
Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.

file_prefix
String prefix to prepend to standard plot file names.

roc_curves

enabled
If ROC curves should be generated.

show
If we should attempt to show the plot after it has been generated (matplotlib).

output_directory
Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.

file_prefix
String prefix to prepend to standard plot file names.
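The `num_folds` and `random_seed` settings described above can be pictured with a small sketch. This is not SMQTK's implementation: it assumes a seeded shuffle followed by a round-robin split, and the helper name `make_folds` and the sample rows are hypothetical.

```python
import csv
import io
import random


def make_folds(rows, num_folds, random_seed=None):
    """Split (uuid, label) rows into ``num_folds`` roughly equal folds.

    Hypothetical helper: a seeded shuffle followed by round-robin assignment.
    """
    rng = random.Random(random_seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::num_folds] for i in range(num_folds)]


# Truth labels in [uuid, label] column format (made-up values).
truth_csv = io.StringIO("a1,cat\nb2,dog\nc3,cat\nd4,dog\ne5,cat\nf6,dog\n")
rows = [tuple(r) for r in csv.reader(truth_csv)]

folds = make_folds(rows, num_folds=3, random_seed=0)
for i, test_fold in enumerate(folds):
    # Train on every fold except the held-out one.
    train = [r for j, f in enumerate(folds) if j != i for r in f]
    print("fold", i, "train:", len(train), "test:", len(test_fold))
```

Because a new model is trained for every held-out fold, a classifier configured to persist its model would conflict with this loop, which is why the configuration notes above advise against it.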

usage: classifier_kfold_validation [-h] [-v] [-c PATH] [-g PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
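The "update the default configuration with the contents of the given configuration" behavior can be sketched as a dictionary merge. The sketch below assumes nested dictionaries are merged recursively, which is an assumption rather than something this page states; the function name `merge_config` and the sample values are hypothetical.

```python
import copy
import json


def merge_config(default, provided):
    """Return a copy of ``default`` updated recursively with ``provided``."""
    merged = copy.deepcopy(default)
    for key, value in provided.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Otherwise the provided value replaces the default.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical default and user-provided configurations.
default_config = {
    "cross_validation": {"num_folds": 6, "random_seed": None},
    "pr_curves": {"enabled": True, "show": False},
}
user_config = {"cross_validation": {"num_folds": 10}}

merged_config = merge_config(default_config, user_config)
print(json.dumps(merged_config, indent=2, sort_keys=True))
```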

classifier_model_validation¶

Utility for validating a given classifier implementation’s model against some labeled testing data, outputting PR and ROC curve plots with area-under-curve score values.

This utility can optionally be used to train a supervised classifier model if the given classifier model configuration does not exist and a second CSV file listing labeled training data is provided. Training will be attempted if train is set to true. If training is performed, we exit after training completes. A SupervisedClassifier sub-classing implementation must be configured.

We expect the test and train CSV files in the column format:

… <UUID>,<label> …

The UUID is of the descriptor to which the label applies. The label may be any arbitrary string value, but all labels must be consistent in application.

Some metrics presented assume the highest confidence class as the single predicted class for an element:

confusion matrix

The output UUID confusion matrix is a JSON dictionary where the top-level keys are the true labels, and the inner dictionary is the mapping of predicted labels to the UUIDs of the classifications/descriptors that yielded the prediction. Again, this is based on the maximum probability label for a classification result (T=0.5).
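As an illustration of the structure described above, the following sketch builds such a dictionary from made-up truth labels and classification results. The function name `uuid_confusion_matrix` and the input layout are assumptions for the example, not the utility's actual code.

```python
import json


def uuid_confusion_matrix(truth, classifications):
    """Build {true_label: {predicted_label: [uuid, ...]}}.

    ``truth`` maps UUID -> true label; ``classifications`` maps UUID -> a
    {label: confidence} dictionary. The predicted label is the one with the
    highest confidence.
    """
    matrix = {}
    for uuid, true_label in sorted(truth.items()):
        scores = classifications[uuid]
        predicted = max(scores, key=scores.get)
        matrix.setdefault(true_label, {}).setdefault(predicted, []).append(uuid)
    return matrix


truth = {"u1": "cat", "u2": "cat", "u3": "dog"}
classifications = {
    "u1": {"cat": 0.9, "dog": 0.1},
    "u2": {"cat": 0.3, "dog": 0.7},
    "u3": {"cat": 0.2, "dog": 0.8},
}
matrix = uuid_confusion_matrix(truth, classifications)
print(json.dumps(matrix, indent=2, sort_keys=True))
```

Here `u2` is a "cat" that was predicted as "dog", so it appears under `matrix["cat"]["dog"]`.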

See Scikit-Learn PR and ROC curve explanations and examples:

usage: classifier_model_validation [-h] [-v] [-c PATH] [-g PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

classifyFiles¶

Based on an input, trained classifier configuration, classify a number of media files, whose descriptor is computed by the configured descriptor generator. Input files that classify as the given label are then output to standard out. Thus, this script acts like a filter.

usage: classifyFiles [-h] [-v] [-c PATH] [-g PATH] [--overwrite] [-l LABEL]
                     [GLOB [GLOB ...]]

Positional Arguments¶

GLOB: Series of shell globs specifying the files to classify.

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Classification¶

--overwrite

When generating a configuration file, overwrite an existing file.

Default: False

-l, --label

The class to filter by. This is based on the classifier configuration/model used. If this is not provided, we will list the available labels in the provided classifier configuration.

compute_classifications¶

Script for asynchronously computing classifications for DescriptorElements in a DescriptorSet specified via a list of UUIDs. Results are output to a CSV file in the format:

uuid, label1_confidence, label2_confidence, …

CSV column labels are output to the given CSV header file path. Label columns will be in the order reported by the classifier implementation’s get_labels method.

Due to using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.
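A downstream consumer might recombine the two output files roughly as follows. This sketch assumes the header file holds a single CSV row whose first column names the UUID; that layout and the sample values are assumptions, and in-memory buffers stand in for the real files.

```python
import csv
import io

# Stand-ins for the --csv-header and --csv-data output files (made-up content).
header_file = io.StringIO("uuid,cat,dog\n")
data_file = io.StringIO("u1,0.9,0.1\nu2,0.3,0.7\n")

header = next(csv.reader(header_file))
labels = header[1:]  # every column after the UUID is a label confidence

confidences = {}
for row in csv.reader(data_file):
    uuid, values = row[0], [float(v) for v in row[1:]]
    confidences[uuid] = dict(zip(labels, values))

for uuid, scores in confidences.items():
    best = max(scores, key=scores.get)
    print(uuid, best, scores[best])
```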

usage: compute_classifications [-h] [-v] [-c PATH] [-g PATH]
                               [--uuids-list PATH] [--csv-header PATH]
                               [--csv-data PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Input Output Files¶

--uuids-list: Path to the input file listing UUIDs to process.
--csv-header: Path to the file to output column header labels.
--csv-data: Path to the file to output the CSV data to.

compute_hash_codes¶

Compute LSH hash codes based on the provided functor on all or specific descriptors from the configured index given a file-list of UUIDs.

When using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.

We update a key-value store with the results of descriptor hash computation. We assume the keys of the store are the integer hash values and the values of the store are frozenset instances of descriptor UUIDs (hashable-type objects). We also assume that no other source is concurrently modifying this key-value store due to the need to modify the values of keys.
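The store layout described above (integer hash keys, frozenset values) can be sketched with a plain dictionary. The helpers `bits_to_int` and `add_hash` are hypothetical names, and the bit vectors are made up; a real hash functor would derive them from descriptor vectors.

```python
def bits_to_int(bits):
    """Pack a sequence of 0/1 values into an integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bool(bit))
    return value


def add_hash(store, uuid, hash_code):
    """Record ``uuid`` under ``hash_code`` with a read-modify-write update."""
    store[hash_code] = store.get(hash_code, frozenset()) | {uuid}


kv_store = {}  # a plain dict stands in for the configured key-value store
add_hash(kv_store, "u1", bits_to_int([1, 0, 1]))
add_hash(kv_store, "u2", bits_to_int([1, 0, 1]))
add_hash(kv_store, "u3", bits_to_int([0, 1, 1]))
print(kv_store)
```

Each update reads the existing set, extends it, and writes it back, which is why a concurrent writer could cause lost updates.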

usage: compute_hash_codes [-h] [-v] [-c PATH] [-g PATH] [--uuids-list PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

I/O¶

--uuids-list: Optional path to a file listing UUIDs of descriptors to compute hash codes for. If not provided, we compute hash codes for all descriptors in the configured descriptor index.

compute_many_descriptors¶

Descriptor computation helper utility. Checks data content type with respect to the configured descriptor generator to skip content that does not match the accepted types. Optionally, we can additionally filter out image content whose image bytes we cannot load via PIL.Image.open.

usage: compute_many_descriptors [-h] [-v] [-c PATH] [-g PATH] [-b INT]
                                [--check-image] [-f PATH] [-p PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

-b, --batch-size

Number of files to batch together into a single compute async call. This defines the granularity of the checkpoint file in regards to computation completed. If given 0, we do not batch and will perform a single compute_async call on the configured generator. Default batch size is 0.

Default: 0

--check-image

If we should check image pixel loading before queueing an input image for processing. If we cannot load the image pixels via PIL.Image.open, the input image is not queued for processing.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Required Arguments¶

-f, --file-list: Path to a file that lists data file paths. Paths in this file may be relative, but will at some point be coerced into absolute paths based on the current working directory.
-p, --completed-files: Path to a file into which we add CSV format lines detailing filepaths that have been computed from the file-list provided, as the UUID for that data (currently the SHA1 checksum of the data).
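To make the completed-files record concrete, the sketch below derives a SHA1-based UUID for each piece of data and writes one CSV line per file. The column order (file path, then UUID), the helper name `sha1_uuid`, and the sample paths are assumptions for illustration.

```python
import csv
import hashlib
import io


def sha1_uuid(data):
    """Return the hexadecimal SHA1 checksum of ``data`` (bytes)."""
    return hashlib.sha1(data).hexdigest()


# Made-up file paths and contents standing in for files read from disk.
files = {"/data/a.txt": b"hello", "/data/b.txt": b"world"}

completed = io.StringIO()  # stands in for the --completed-files path
writer = csv.writer(completed, lineterminator="\n")
for path, content in sorted(files.items()):
    writer.writerow([path, sha1_uuid(content)])

print(completed.getvalue(), end="")
```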

computeDescriptor¶

Compute a descriptor vector for a given data file, outputting the generated feature vector to standard out, or to an output file if one was specified (in numpy format).

usage: computeDescriptor [-h] [-v] [-c PATH] [-g PATH] [--overwrite]
                         [-o OUTPUT_FILEPATH]
                         [input_file]

Positional Arguments¶

input_file: Data file to compute descriptor on

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

--overwrite

Force descriptor computation even if an existing descriptor vector was discovered based on the given content descriptor type and data combination.

Default: False

-o, --output-filepath

Optional path to a file to output feature vector to. Otherwise the feature vector is printed to standard out. Output is saved in numpy binary format (.npy suffix recommended).

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

createFileIngest¶

Add a set of local system files to a data set via explicit paths or shell-style glob strings.

usage: createFileIngest [-h] [-v] [-c PATH] [-g PATH] [GLOB [GLOB ...]]

Positional Arguments¶

GLOB

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

descriptors_to_svmtrainfile¶

Utility script to transform a set of descriptors, specified by UUID, with matching class labels, to a test file usable by libSVM utilities for train/test experiments.

The input CSV file is assumed to be of the format:

uuid,label …

This is the same as the format requested for other scripts like classifier_model_validation.py.

This is very useful for searching for -c and -g parameter values for a training sample of data using the tools/grid.py script, found in the libSVM source tree. For example:

<smqtk_source>/TPL/libsvm-3.1-custom/tools/grid.py -log2c -5,15,2 -log2g 3,-15,-2 -v 5 -out libsvm.grid.out -png libsvm.grid.png -t 0 -w1 3.46713615023 -w2 12.2613240418 output_of_this_script.txt

usage: descriptors_to_svmtrainfile [-h] [-v] [-c PATH] [-g PATH] [-f PATH]
                                   [-o PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

IO Options¶

-f: Path to the csv file mapping descriptor UUIDs to their class label. String labels are transformed into integers for libSVM. Integers start at 1 and are applied in the order that labels are seen in this input file.
-o: Path to the output file to write libSVM labeled descriptors to.
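The label transformation and output format can be sketched as follows. libSVM's text format is `<label> <index>:<value> ...` with feature indices starting at 1; the helper names, sample rows, and descriptor vectors below are made up for illustration and are not taken from the script itself.

```python
def label_to_int_map(labels):
    """Assign integers starting at 1 in the order labels are first seen."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
    return mapping


def libsvm_line(int_label, vector):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices."""
    features = " ".join("%d:%.12g" % (i, v) for i, v in enumerate(vector, 1))
    return "%d %s" % (int_label, features)


# Made-up (uuid, label) rows and descriptor vectors.
rows = [("u1", "dog"), ("u2", "cat"), ("u3", "dog")]
vectors = {"u1": [0.5, 0.0, 1.25], "u2": [1.0, 2.0, 3.0], "u3": [0.0, 0.0, 0.1]}

label_ids = label_to_int_map(label for _, label in rows)
lines = [libsvm_line(label_ids[label], vectors[uuid]) for uuid, label in rows]
print("\n".join(lines))
```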

generate_image_transform¶

Utility for transforming an input image in various standardized ways, saving out those transformed images with standard namings. Transformations used are configurable via a configuration file (JSON).

Configuration details:

    {
        "crop": {

            "center_levels": null | int
                # If greater than 0, crop out one or more increasingly smaller
                # images from a base image by cutting off increasingly larger
                # portions of the outside perimeter. Cropped image dimensions
                # are determined by the dimensions of the base image and the
                # number of crops to generate.

            "quadrant_pyramid_levels": null | int
                # If greater than 0, generate a number of crops based on a
                # number of quad-tree partitions made based on the given
                # number of levels. Partitions for all levels less than the
                # level provided are also made.

            "tile_shape": null | [width, height]
                # If not null and is a list of two integers, crop out tile
                # windows from the base image that have the width and height
                # specified. If the image width or height is not evenly
                # divisible by the tile width or height, respectively, then
                # crop out as many tiles as neatly fit starting from the axis
                # origin. The remaining pixels are ignored.

            "tile_stride": null | [x, y]
                # If not null and is a list of two integers, crop out
                # sub-images of the above width and height (if given) with
                # this stride. When this is not provided, the default stride
                # is the same as the tile width and height.

        },

        "brightness_levels": null | int
            # Generate a number of images with different brightness levels
            # using linear interpolation to choose levels between 0 (black)
            # and 1 (original image) as well as between 1 and 2.
            # Results will not include brightness level 0, 1 or 2 images.

        "contrast_levels": null | int
            # Generate a number of images with different contrast levels
            # using linear interpolation to choose levels between 0 (black)
            # and 1 (original image) as well as between 1 and 2.
            # Results will not include contrast level 0, 1 or 2 images.

    }
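The tiling rules for `tile_shape` and `tile_stride` can be illustrated by computing crop boxes. This is a sketch of the described behavior, not the utility's code; the function name `tile_boxes` and the (left, upper, right, lower) box convention are assumptions.

```python
def tile_boxes(image_size, tile_shape, tile_stride=None):
    """Return (left, upper, right, lower) boxes for tiles that fit entirely.

    ``tile_stride`` defaults to the tile shape, giving non-overlapping tiles.
    """
    width, height = image_size
    tile_w, tile_h = tile_shape
    stride_x, stride_y = tile_stride if tile_stride else tile_shape
    boxes = []
    # Start at the axis origin and stop once a full tile no longer fits.
    for top in range(0, height - tile_h + 1, stride_y):
        for left in range(0, width - tile_w + 1, stride_x):
            boxes.append((left, top, left + tile_w, top + tile_h))
    return boxes


# A 100x60 image with 40x30 tiles: the right-most 20 pixel columns are ignored.
print(tile_boxes((100, 60), (40, 30)))
# The same tiles with a stride of half the tile size, giving overlapping windows.
print(len(tile_boxes((100, 60), (40, 30), tile_stride=(20, 15))))
```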

usage: generate_image_transform [-h] [-v] [-c PATH] [-g PATH] [-i IMAGE]
                                [-o OUTPUT]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Input/Output¶

-i, --image: Image to produce transformations for.
-o, --output: Directory to output generated images to. By default, if not told otherwise, we will write output images in the same directory as the source image. Output images share a core filename as the source image, but with extra suffix syntax to differentiate produced images from the original. Output images will share the same image extension as the source image.

iqr_app_model_generation¶

Train and generate models for the SMQTK IQR Application.

This application takes the same configuration file as the IqrService REST service. To generate a default configuration, please refer to the runApplication tool for the IqrService application:

runApplication -a IqrService -g config.IqrService.json

usage: iqr_app_model_generation [-h] [-v] -c PATH PATH -t TAB GLOB [GLOB ...]

Positional Arguments¶

GLOB: Shell glob to files to add to the configured data set.

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

-c, --config

Path to the JSON configuration files. The first file provided should be the configuration file for the IqrSearchDispatcher web-application and the second should be the configuration file for the IqrService web-application.

-t, --tab

The configuration “tab” of the IqrSearchDispatcher configuration to use. This informs what dataset to add the input data files to.

iqrTrainClassifier¶

Train a supervised classifier based on an IQR session state dump.

Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session state dump, must exist external to the IQR web-app (uses a non-memory backend). This is needed so that this script might access them for classifier training.

Click the “Save IQR State” button to download the IqrState file encapsulating the descriptors of positively and negatively marked items. These descriptors will be used to train the configured SupervisedClassifier.

usage: iqrTrainClassifier [-h] [-v] [-c PATH] [-g PATH] [-i IQR_STATE]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

-i, --iqr-state

Path to the ZIP file saved from an IQR session.

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

make_balltree¶

Script for building and saving the model for the SkLearnBallTreeHashIndex implementation of HashIndex.

usage: make_balltree [-h] [-v] [-c PATH] [-g PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

minibatch_kmeans_clusters¶

Script for generating clusters from descriptors in a given descriptor set using the mini-batch KMeans implementation from Scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html).

By the nature of Scikit-learn’s MiniBatchKMeans implementation, euclidean distance is used to measure distance between descriptors.

usage: minibatch_kmeans_clusters [-h] [-v] [-c PATH] [-g PATH] [-o PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

output¶

-o, --output-map: Path to output the clustering class mapping to. Saved as a pickle file using protocol -1 (the highest protocol available).
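Reading the "-1 format" as pickle protocol `-1`, a saved mapping could be written and reloaded as sketched below. The mapping layout (cluster label to a set of descriptor UUIDs) is a hypothetical stand-in, since the exact structure is not specified here, and an in-memory buffer stands in for the output file.

```python
import io
import pickle

# Hypothetical mapping of cluster label -> descriptor UUIDs.
cluster_map = {0: {"u1", "u3"}, 1: {"u2"}}

buffer = io.BytesIO()  # stands in for the --output-map file
pickle.dump(cluster_map, buffer, -1)  # -1 selects pickle.HIGHEST_PROTOCOL

buffer.seek(0)
restored = pickle.load(buffer)
print(restored == cluster_map)
```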

proxyManagerServer¶

Server for hosting proxy manager which hosts proxy object instances.

This takes a simple configuration file that looks like the following:

[server]
port = <integer>
authkey = <string>
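A file in this layout can be read with Python's standard `configparser` module. The sketch below only shows the parsing step with made-up values; how the server itself consumes these settings is not shown here.

```python
import configparser

# Made-up values in the layout shown above.
config_text = """
[server]
port = 5000
authkey = example-key
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

port = parser.getint("server", "port")  # parsed as an integer
authkey = parser.get("server", "authkey")  # kept as a string
print(port, authkey)
```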

usage: proxyManagerServer [-h] [-v] [-c PATH] [-g PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

removeOldFiles¶

Utility to recursively scan and remove files underneath a given directory if they have not been modified for longer than a set amount of time.

usage: removeOldFiles [-h] [-d BASE_DIR] [-i INTERVAL] [-e EXPIRY] [-v]

Named Arguments¶

-d, --base-dir

Starting directory for scan.

-i, --interval

Number of seconds between each scan (integer).

-e, --expiry

Number of seconds until a file has “expired” (integer).

-v, --verbose

Display more messages (debugging).

Default: False
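One pass of such a scan can be sketched with `os.walk` and file modification times. The function name `find_expired` is hypothetical, and the sketch runs a single scan in a temporary directory rather than repeating every `--interval` seconds as the utility's options suggest.

```python
import os
import tempfile
import time


def find_expired(base_dir, expiry, now=None):
    """Return paths under ``base_dir`` not modified within ``expiry`` seconds."""
    now = time.time() if now is None else now
    expired = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > expiry:
                expired.append(path)
    return sorted(expired)


with tempfile.TemporaryDirectory() as base:
    old_path = os.path.join(base, "old.txt")
    new_path = os.path.join(base, "sub", "new.txt")
    os.makedirs(os.path.dirname(new_path))
    for path in (old_path, new_path):
        with open(path, "w") as f:
            f.write("x")
    # Back-date one file's modification time by an hour.
    hour_ago = time.time() - 3600
    os.utime(old_path, (hour_ago, hour_ago))

    expired_files = find_expired(base, expiry=600)
    for path in expired_files:
        os.remove(path)
    remaining = sorted(os.listdir(base))
    print([os.path.basename(p) for p in expired_files], remaining)
```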

runApplication¶

Generic entry point for running SMQTK web applications defined in [smqtk.web](/python/smqtk/web).

Runs conforming SMQTK Web Applications.

usage: runApplication [-h] [-v] [-c PATH] [-g PATH] [-l] [-a APPLICATION] [-r]
                      [-t] [--host HOST] [--port PORT] [--use-basic-auth]
                      [--use-simple-cors] [--debug-server] [--debug-smqtk]
                      [--debug-app] [--debug-ns DEBUG_NS]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.

Application Selection¶

-l, --list

List currently available applications for running. More description is included if SMQTK verbosity is increased (-v | --debug-smqtk).

Default: False

-a, --application

Label of the web application to run.

Server options¶

-r, --reload

Turn on server reloading.

Default: False

-t, --threaded

Turn on server multi-threading.

Default: False

--host

Run host address specification override. This will override all other configuration method specifications.

--port

Run port specification override. This will override all other configuration method specifications.

--use-basic-auth

Use global basic authentication as configured.

Default: False

--use-simple-cors

Allow CORS for all domains on all routes. This follows the “Simple Usage” of flask-cors: https://flask-cors.readthedocs.io/en/latest/#simple-usage

Default: False

Other options¶

--debug-server

Turn on server debugging messages ONLY. This is implied when -v|--verbose is enabled.

Default: False

--debug-smqtk

Turn on SMQTK debugging messages ONLY. This is implied when -v|--verbose is enabled.

Default: False

--debug-app

Turn on flask app logger namespace debugging messages ONLY. This is effectively enabled if the flask app is provided with SMQTK and “--debug-smqtk” is passed. This is also implied if -v|--verbose is enabled.

Default: False

--debug-ns

Specify additional python module namespaces to enable debug logging for.

Default: []

summarizePlugins¶

Print out information about what plugins are currently usable and the documentation headers for each implementation.

usage: summarizePlugins [-h] [-v] [--defaults DEFAULTS]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

--defaults

Optionally generate default configuration blocks for each plugin structure and output as JSON to the specified path.

Default: False

train_itq¶

Tool for training the ITQ functor algorithm’s model on descriptors in a set.

By default, we use all descriptors in the configured set (uuids_list_filepath is not given a value).

The uuids_list_filepath configuration property is optional and should be used to specify a sub-set of descriptors in the configured set to train on. This only works if the stored descriptors’ UUID is a type of string.

usage: train_itq [-h] [-v] [-c PATH] [-g PATH]

Named Arguments¶

-v, --verbose

Output additional debug logging.

Default: False

Configuration¶

-c, --config: Path to the JSON configuration file.
-g, --generate-config: Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.