Utilities and Applications¶
Also part of SMQTK are support utility modules, utility scripts (effectively the “binaries”) and service-oriented and demonstration web applications.
Utility Modules¶
Various unclassified functionality intended to support the primary goals of SMQTK.
See doc-string comments on sub-module classes and functions in the [smqtk.utils](/python/smqtk/utils) module.
Utility Scripts¶
Located in the [smqtk.bin](/python/smqtk/bin) module are various scripts intended to provide quick access or generic entry points to common SMQTK functionality.
These scripts generally require configuration via a JSON text file, and executable entry points are installed via the setup.py.
As a rule of thumb, scripts that require a configuration also provide an option for outputting a default or example configuration file.
Currently available utility scripts in alphabetical order:
classifier_kfold_validation¶
Helper utility for cross validating a supervised classifier configuration. The classifier used should NOT be configured to save its model since this process requires us to train the classifier multiple times.
- plugins
- supervised_classifier
- Supervised Classifier implementation configuration to use. This should not be set to use a persistent model if able (this utility will repeatedly train a new model for each fold).
- descriptor_index
- Index to draw descriptors to classify from.
- cross_validation
- truth_labels
- Path to a CSV file containing descriptor UUID to truth label associations. This defines which descriptors are used from the given index. We error if any descriptor UUIDs listed here are not available in the given descriptor index. This file should be in [uuid, label] column format.
- num_folds
- Number of folds to make for cross validation.
- random_seed
- Optional fixed seed for the
- classification_use_multiprocessing
- If we should use multiprocessing (vs threading) when classifying elements.
- pr_curves
- enabled
- If Precision/Recall plots should be generated.
- show
- If we should attempt to show the graph after it has been generated (matplotlib).
- output_directory
- Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.
- file_prefix
- String prefix to prepend to standard plot file names.
- roc_curves
- enabled
- If ROC curves should be generated
- show
- If we should attempt to show the plot after it has been generated (matplotlib).
- output_directory
- Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.
- file_prefix
- String prefix to prepend to standard plot file names.
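The options above can be sketched as a Python dictionary that is then written out as JSON. This is only an illustration: the nesting is inferred from the flattened list above, every value shown is a hypothetical placeholder, and the two plugin blocks are left empty because their contents depend on the implementations chosen. The script's own `-g` option remains the authoritative way to obtain a default configuration.

```python
import json

# Key names come from the option list above; the nesting and all values
# are assumptions made for illustration only.
config = {
    "plugins": {
        # Implementation-specific configuration blocks would go here.
        "supervised_classifier": {},
        "descriptor_index": {},
    },
    "cross_validation": {
        "truth_labels": "truth_labels.csv",  # [uuid, label] columns
        "num_folds": 6,
        "random_seed": None,
        "classification_use_multiprocessing": True,
    },
    "pr_curves": {
        "enabled": True,
        "show": False,
        "output_directory": None,  # None: plots are not saved
        "file_prefix": None,
    },
    "roc_curves": {
        "enabled": True,
        "show": False,
        "output_directory": None,
        "file_prefix": None,
    },
}


def write_config(cfg, path):
    """Write the configuration dictionary to a JSON text file."""
    with open(path, "w") as f:
        json.dump(cfg, f, indent=4, sort_keys=True)
```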
usage: classifier_kfold_validation [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
classifier_model_validation¶
Utility for validating a given classifier implementation’s model against some labeled testing data, outputting PR and ROC curve plots with area-under-curve score values.
This utility can optionally be used to train a supervised classifier model if the given classifier model configuration does not exist and a second CSV file listing labeled training data is provided. Training will be attempted if train is set to true. If training is performed, we exit after training completes. A SupervisedClassifier sub-classing implementation must be configured.
We expect the test and train CSV files in the column format:
… <UUID>,<label> …
The UUID is of the descriptor to which the label applies. The label may be any arbitrary string value, but all labels must be consistent in application.
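As a rough illustration of the expected column format, the sketch below reads `<UUID>,<label>` rows into a dictionary. The function name `read_label_csv` and the sample rows are hypothetical and are not part of SMQTK.

```python
import csv
import io


def read_label_csv(stream):
    """Map each descriptor UUID to its label from <UUID>,<label> rows."""
    labels = {}
    for row in csv.reader(stream):
        if not row:
            continue  # skip blank lines
        uuid, label = row[0].strip(), row[1].strip()
        labels[uuid] = label
    return labels


# Hypothetical sample content standing in for a real truth file.
example = io.StringIO("abc123,cat\ndef456,dog\n")
print(read_label_csv(example))
```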
Some metrics presented assume the highest confidence class as the single predicted class for an element:
- confusion matrix
The output UUID confusion matrix is a JSON dictionary where the top-level keys are the true labels, and the inner dictionary is the mapping of predicted labels to the UUIDs of the classifications/descriptors that yielded the prediction. Again, this is based on the maximum probability label for a classification result (T=0.5).
- See Scikit-Learn PR and ROC curve explanations and examples:
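The UUID confusion matrix structure described above can be sketched as follows. This is a minimal illustration, not the utility's actual code: the function name and the input shapes (a truth mapping plus a per-UUID mapping of label to probability) are assumptions.

```python
from collections import defaultdict


def uuid_confusion_matrix(truth, classifications):
    """Build {true_label: {predicted_label: [uuid, ...]}}.

    truth: dict of uuid -> true label.
    classifications: dict of uuid -> {label: probability}.
    The predicted label is the one with the maximum probability.
    """
    matrix = defaultdict(lambda: defaultdict(list))
    for uuid, label_probs in classifications.items():
        predicted = max(label_probs, key=label_probs.get)
        matrix[truth[uuid]][predicted].append(uuid)
    # Convert to plain dictionaries so the result is JSON-serializable.
    return {true: dict(preds) for true, preds in matrix.items()}
```

The result can be passed directly to `json.dump` since it contains only dictionaries, lists and strings.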
usage: classifier_model_validation [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
classifyFiles¶
Based on an input, trained classifier configuration, classify a number of media files, whose descriptor is computed by the configured descriptor generator. Input files that classify as the given label are then output to standard out. Thus, this script acts like a filter.
usage: classifyFiles [-h] [-v] [-c PATH] [-g PATH] [--overwrite] [-l LABEL]
[GLOB [GLOB ...]]
Positional Arguments¶
GLOB | Series of shell globs specifying the files to classify. |
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
Classification¶
--overwrite | When generating a configuration file, overwrite an existing file. Default: False |
-l, --label | The class to filter by. This is based on the classifier configuration/model used. If this is not provided, we will list the available labels in the provided classifier configuration. |
compute_classifications¶
Script for asynchronously computing classifications for DescriptorElements in a DescriptorIndex specified via a list of UUIDs. Results are output to a CSV file in the format:
uuid, label1_confidence, label2_confidence, …
CSV column labels are output to the given CSV header file path. Label columns will be in the order reported by the classifier implementation's get_labels method.
Due to using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.
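One way to consume the two output files is sketched below. It assumes, without confirmation from the source, that the header file holds a single comma-separated row whose first entry names the UUID column; the function name `read_classifications` is hypothetical.

```python
import csv
import io


def read_classifications(header_stream, data_stream):
    """Return {uuid: {label: confidence}} from the header and data files.

    Assumption: the header is one CSV row of the form
    "uuid,<label1>,<label2>,...", matching the data column order.
    """
    header = next(csv.reader(header_stream))
    labels = [h.strip() for h in header[1:]]
    results = {}
    for row in csv.reader(data_stream):
        if not row:
            continue  # skip blank lines
        uuid = row[0].strip()
        confidences = [float(v) for v in row[1:]]
        results[uuid] = dict(zip(labels, confidences))
    return results
```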
usage: compute_classifications [-h] [-v] [-c PATH] [-g PATH]
[--uuids-list PATH] [--csv-header PATH]
[--csv-data PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
Input Output Files¶
--uuids-list | Path to the input file listing UUIDs to process. |
--csv-header | Path to the file to output column header labels. |
--csv-data | Path to the file to output the CSV data to. |
compute_hash_codes¶
Compute LSH hash codes based on the provided functor on all or specific descriptors from the configured index given a file-list of UUIDs.
When using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.
We update a key-value store with the results of descriptor hash computation. We assume the keys of the store are the integer hash values and the values of the store are frozenset instances of descriptor UUIDs (hashable-type objects).
We also assume that no other source is concurrently modifying this key-value store due to the need to modify the values of keys.
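The update pattern implied above can be sketched with a plain dictionary standing in for the key-value store. Because a frozenset is immutable, adding UUIDs means reading the current value, building a new frozenset and writing it back, which is why concurrent writers would be a problem. The function name and the use of a `dict` are assumptions for illustration.

```python
def add_to_hash_store(store, hash_value, uuids):
    """Merge descriptor UUIDs into the frozenset stored under hash_value.

    store: dict-like mapping of integer hash value -> frozenset of UUIDs.
    This is a read-modify-write sequence, so it is only safe when nothing
    else modifies the store at the same time.
    """
    existing = store.get(hash_value, frozenset())
    store[hash_value] = existing | frozenset(uuids)


# A plain dict stands in for the configured key-value store here.
store = {}
add_to_hash_store(store, 5, ["uuid-a"])
add_to_hash_store(store, 5, ["uuid-b", "uuid-a"])
```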
usage: compute_hash_codes [-h] [-v] [-c PATH] [-g PATH] [--uuids-list PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
I/O¶
--uuids-list | Optional path to a file listing UUIDs of descriptors to compute hash codes for. If not provided, we compute hash codes for all descriptors in the configured descriptor index. |
compute_many_descriptors¶
Descriptor computation helper utility. Checks data content type with respect to the configured descriptor generator to skip content that does not match the accepted types. Optionally, we can additionally filter out image content whose image bytes we cannot load via PIL.Image.open.
usage: compute_many_descriptors [-h] [-v] [-c PATH] [-g PATH] [-b INT]
[--check-image] [-f PATH] [-p PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
-b, --batch-size | Number of files to batch together into a single compute async call. This defines the granularity of the checkpoint file in regards to computation completed. If given 0, we do not batch and will perform a single Default: 0 |
--check-image | If we should check image pixel loading before queueing an input image for processing. If we cannot load the image pixels via Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
Required Arguments¶
-f, --file-list | Path to a file that lists data file paths. Paths in this file may be relative, but will at some point be coerced into absolute paths based on the current working directory. |
-p, --completed-files | Path to a file into which we add CSV format lines detailing filepaths that have been computed from the file-list provided, as the UUID for that data (currently the SHA1 checksum of the data). |
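To illustrate what a completed-files entry might look like, the sketch below derives a UUID as the SHA1 hex digest of the data bytes and writes one CSV row per file. The column order (file path first, then UUID) and both function names are assumptions; only the use of the SHA1 checksum comes from the description above.

```python
import csv
import hashlib
import io


def sha1_uuid(data):
    """Return the SHA1 hex digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


def append_completed(stream, filepath, data):
    """Write one "<filepath>,<uuid>" CSV row (column order is assumed)."""
    csv.writer(stream).writerow([filepath, sha1_uuid(data)])


# In-memory buffer standing in for the --completed-files path.
buf = io.StringIO()
append_completed(buf, "/data/images/example.png", b"abc")
```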
computeDescriptor¶
Compute a descriptor vector for a given data file, outputting the generated feature vector to standard out, or to an output file if one was specified (in numpy format).
usage: computeDescriptor [-h] [-v] [-c PATH] [-g PATH] [--overwrite]
[-o OUTPUT_FILEPATH]
[input_file]
Positional Arguments¶
input_file | Data file to compute descriptor on |
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
--overwrite | Force descriptor computation even if an existing descriptor vector was discovered based on the given content descriptor type and data combination. Default: False |
-o, --output-filepath | Optional path to a file to output feature vector to. Otherwise the feature vector is printed to standard out. Output is saved in numpy binary format (.npy suffix recommended). |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
createFileIngest¶
Add a set of local system files to a data set via explicit paths or shell-style glob strings.
usage: createFileIngest [-h] [-v] [-c PATH] [-g PATH] [GLOB [GLOB ...]]
Positional Arguments¶
GLOB |
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
descriptors_to_svmtrainfile¶
Utility script to transform a set of descriptors, specified by UUID, with matching class labels, to a test file usable by libSVM utilities for train/test experiments.
The input CSV file is assumed to be of the format:
uuid,label …
This is the same as the format requested for other scripts like classifier_model_validation.py.
This is very useful for searching for -c and -g parameter values for a training sample of data using the tools/grid.py script, found in the libSVM source tree. For example:
<smqtk_source>/TPL/libsvm-3.1-custom/tools/grid.py -log2c -5,15,2 -log2g 3,-15,-2 -v 5 -out libsvm.grid.out -png libsvm.grid.png -t 0 -w1 3.46713615023 -w2 12.2613240418 output_of_this_script.txt
usage: descriptors_to_svmtrainfile [-h] [-v] [-c PATH] [-g PATH] [-f PATH]
[-o PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
IO Options¶
-f | Path to the csv file mapping descriptor UUIDs to their class label. String labels are transformed into integers for libSVM. Integers start at 1 and are applied in the order that labels are seen in this input file. |
-o | Path to the output file to write libSVM labeled descriptors to. |
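The label-to-integer transformation described for `-f` can be sketched as below, together with the standard libSVM text layout of `<label> <index>:<value> ...` with 1-based feature indices. Whether the script writes every feature or only non-zero ones is not stated in the source, so the dense output here is an assumption, as is the function name.

```python
def to_libsvm_lines(rows):
    """Convert (label, vector) pairs to libSVM-format text lines.

    String labels are mapped to integers starting at 1, in the order the
    labels are first seen. All features are written (dense output), which
    is an assumption of this sketch.
    """
    label_ids = {}
    lines = []
    for label, vector in rows:
        if label not in label_ids:
            label_ids[label] = len(label_ids) + 1
        features = " ".join(
            "{}:{}".format(i, v) for i, v in enumerate(vector, start=1)
        )
        lines.append("{} {}".format(label_ids[label], features))
    return lines, label_ids
```

For instance, the rows `("dog", [0.5, 1.0])` followed by `("cat", [0.0, 2.0])` would assign `dog` the integer 1 and `cat` the integer 2.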
generate_image_transform¶
Utility for transforming an input image in various standardized ways, saving out those transformed images with standard namings. Transformations used are configurable via a configuration file (JSON).
Configuration details:
{
    "crop": {
        "center_levels": null | int
            # If greater than 0, crop out one or more increasingly smaller
            # images from a base image by cutting off increasingly larger
            # portions of the outside perimeter. Cropped image dimensions are
            # determined by the dimensions of the base image and the number of
            # crops to generate.
        "quadrant_pyramid_levels": null | int
            # If greater than 0, generate a number of crops based on a number
            # of quad-tree partitions made based on the given number of levels.
            # Partitions for all levels less than the level provided are also
            # made.
        "tile_shape": null | [width, height]
            # If not null and is a list of two integers, crop out tile windows
            # from the base image that have the width and height specified.
            # If the image width or height is not evenly divisible by the tile
            # width or height, respectively, then we crop out as many tiles as
            # neatly fit starting from the axis origin. The remaining pixels
            # are ignored.
        "tile_stride": null | [x, y]
            # If not null and is a list of two integers, crop out sub-images of
            # the above width and height (if given) with this stride. When
            # this is not provided, the default stride is the same as the tile
            # width and height.
    },
    "brightness_levels": null | int
        # Generate a number of images with different brightness levels using
        # linear interpolation to choose levels between 0 (black) and 1
        # (original image) as well as between 1 and 2.
        # Results will not include brightness level 0, 1 or 2 images.
    "contrast_levels": null | int
        # Generate a number of images with different contrast levels using
        # linear interpolation to choose levels between 0 (black) and 1
        # (original image) as well as between 1 and 2.
        # Results will not include contrast level 0, 1 or 2 images.
}
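The tile_shape and tile_stride behavior described above can be sketched as a small function that lists crop boxes. This is one plausible reading of the description, not the utility's actual code: the function name, the `(left, top, right, bottom)` box convention and the row-major ordering are assumptions.

```python
def tile_boxes(image_size, tile_shape, tile_stride=None):
    """List (left, top, right, bottom) boxes for tiles that fully fit.

    image_size: (width, height) of the base image.
    tile_shape: (width, height) of each tile.
    tile_stride: (x, y) step between tiles; defaults to the tile shape.
    Tiling starts at the axis origin and leftover pixels are ignored.
    """
    img_w, img_h = image_size
    tile_w, tile_h = tile_shape
    stride_x, stride_y = tile_stride if tile_stride is not None else tile_shape
    boxes = []
    for top in range(0, img_h - tile_h + 1, stride_y):
        for left in range(0, img_w - tile_w + 1, stride_x):
            boxes.append((left, top, left + tile_w, top + tile_h))
    return boxes
```

With a 10x7 image and a 4x3 tile shape, this yields four tiles and ignores the last two pixel columns and the last pixel row; a stride smaller than the tile shape produces overlapping tiles.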
usage: generate_image_transform [-h] [-v] [-c PATH] [-g PATH] [-i IMAGE]
[-o OUTPUT]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
Input/Output¶
-i, --image | Image to produce transformations for. |
-o, --output | Directory to output generated images to. By default, if not told otherwise, we will write output images in the same directory as the source image. Output images share a core filename as the source image, but with extra suffix syntax to differentiate produced images from the original. Output images will share the same image extension as the source image. |
iqr_app_model_generation¶
Train and generate models for the SMQTK IQR Application.
This application takes the same configuration file as the IqrService REST service. To generate a default configuration, please refer to the runApplication tool for the IqrService application:
runApplication -a IqrService -g config.IqrService.json
usage: iqr_app_model_generation [-h] [-v] -c PATH PATH -t TAB GLOB [GLOB ...]
Positional Arguments¶
GLOB | Shell glob to files to add to the configured data set. |
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
-c, --config | Path to the JSON configuration files. The first file provided should be the configuration file for the IqrSearchDispatcher web-application and the second should be the configuration file for the IqrService web-application. |
-t, --tab | The configuration “tab” of the IqrSearchDispatcher configuration to use. This informs what dataset to add the input data files to. |
iqrTrainClassifier¶
Train a supervised classifier based on an IQR session state dump.
Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session state dump, must exist external to the IQR web-app (uses a non-memory backend). This is needed so that this script might access them for classifier training.
Click the “Save IQR State” button to download the IqrState file encapsulating the descriptors of positively and negatively marked items. These descriptors will be used to train the configured SupervisedClassifier.
usage: iqrTrainClassifier [-h] [-v] [-c PATH] [-g PATH] [-i IQR_STATE]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
-i, --iqr-state | Path to the ZIP file saved from an IQR session. |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
make_balltree¶
Script for building and saving the model for the SkLearnBallTreeHashIndex implementation of HashIndex.
usage: make_balltree [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
minibatch_kmeans_clusters¶
Script for generating clusters from descriptors in a given index using the mini-batch KMeans implementation from Scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html).
By the nature of Scikit-learn’s MiniBatchKMeans implementation, euclidean distance is used to measure distance between descriptors.
usage: minibatch_kmeans_clusters [-h] [-v] [-c PATH] [-g PATH] [-o PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
output¶
-o, --output-map | Path to output the clustering class mapping to. Saved as a pickle file using pickle protocol -1 (the highest protocol available). |
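In Python's pickle module, a negative protocol number selects the highest protocol available. The round trip below illustrates this with an in-memory payload; the structure of `cluster_map` (cluster label to a set of descriptor UUIDs) is a hypothetical stand-in, since the source does not describe the exact contents of the saved mapping.

```python
import pickle

# Hypothetical mapping: cluster label -> set of descriptor UUIDs.
cluster_map = {0: {"uuid-a", "uuid-b"}, 1: {"uuid-c"}}

# Protocol -1 selects the highest pickle protocol available.
payload = pickle.dumps(cluster_map, -1)

# Reading the mapping back; with a real file this would be
# pickle.load(open(path, "rb")) on the --output-map path.
restored = pickle.loads(payload)
```

As with any pickle file, only load output maps that come from a trusted source.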
proxyManagerServer¶
Server for hosting proxy manager which hosts proxy object instances.
This takes a simple configuration file that looks like the following:
usage: proxyManagerServer [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
removeOldFiles¶
Utility to recursively scan and remove files underneath a given directory if they have not been modified for longer than a set amount of time.
usage: removeOldFiles [-h] [-d BASE_DIR] [-i INTERVAL] [-e EXPIRY] [-v]
Named Arguments¶
-d, --base-dir | Starting directory for scan. |
-i, --interval | Number of seconds between each scan (integer). |
-e, --expiry | Number of seconds until a file has “expired” (integer). |
-v, --verbose | Display more messages (debugging). Default: False |
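A single scan pass of the kind described above might look like the following sketch, where `base_dir` and `expiry` play the roles of the `-d` and `-e` options. The function name and the `now` parameter are hypothetical, and the repeated scanning controlled by `-i` is left out; it could be approximated by calling this in a loop with `time.sleep`.

```python
import os
import time


def remove_expired(base_dir, expiry, now=None):
    """Remove files under base_dir not modified within `expiry` seconds.

    Returns the list of removed file paths. Directories are left in place.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Age is measured from the file's last modification time.
            if now - os.path.getmtime(path) > expiry:
                os.remove(path)
                removed.append(path)
    return removed
```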
runApplication¶
Generic entry point for running SMQTK web applications defined in [smqtk.web](/python/smqtk/web).
Runs conforming SMQTK Web Applications.
usage: runApplication [-h] [-v] [-c PATH] [-g PATH] [-l] [-a APPLICATION] [-r]
[-t] [--host HOST] [--port PORT] [--use-basic-auth]
[--debug-server] [--debug-smqtk]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |
Application Selection¶
-l, --list | List currently available applications for running. More description is included if SMQTK verbosity is increased (-v | --debug-smqtk). Default: False |
-a, --application | Label of the web application to run. |
Server options¶
-r, --reload | Turn on server reloading. Default: False |
-t, --threaded | Turn on server multi-threading. Default: False |
--host | Run host address specification override. This will override all other configuration method specifications. |
--port | Run port specification override. This will override all other configuration method specifications. |
--use-basic-auth | Use global basic authentication as configured. Default: False |
Other options¶
--debug-server | Turn on server debugging messages ONLY Default: False |
--debug-smqtk | Turn on SMQTK debugging messages ONLY Default: False |
summarizePlugins¶
Print out information about what plugins are currently usable and the documentation headers for each implementation.
usage: summarizePlugins [-h] [-v] [--defaults DEFAULTS]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
--defaults | Optionally generate default configuration blocks for each plugin structure and output as JSON to the specified path. Default: False |
train_itq¶
Tool for training the ITQ functor algorithm’s model on descriptors in an index.
By default, we use all descriptors in the configured index (uuids_list_filepath is not given a value).
The uuids_list_filepath configuration property is optional and should be used to specify a sub-set of descriptors in the configured index to train on. This only works if the stored descriptors’ UUID is a type of string.
usage: train_itq [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
-v, --verbose | Output additional debug logging. Default: False |
Configuration¶
-c, --config | Path to the JSON configuration file. |
-g, --generate-config | Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration. |