Utilities and Applications¶
Also part of SMQTK are support utility modules, utility scripts (effectively the “binaries”) and service-oriented and demonstration web applications.
Utility Modules¶
Various unclassified functionality intended to support the primary goals of SMQTK.
See doc-string comments on sub-module classes and functions in the [smqtk.utils](/python/smqtk/utils) module.
Utility Scripts¶
Located in the [smqtk.bin](/python/smqtk/bin) module are various scripts intended to provide quick access or generic entry points to common SMQTK functionality.
These scripts generally require configuration via a JSON text file, and executable entry points are installed via the setup.py.
As a rule of thumb, scripts that require a configuration also provide an option for outputting a default or example configuration file.
Currently available utility scripts in alphabetical order:
classifier_kfold_validation¶
Helper utility for cross validating a supervised classifier configuration. The classifier used should NOT be configured to save its model since this process requires us to train the classifier multiple times.
- plugins
- supervised_classifier
Supervised Classifier implementation configuration to use. This should not be set to use a persistent model if able (this utility will repeatedly train a new model for each fold).
- descriptor_set
Index to draw descriptors to classify from.
- cross_validation
- truth_labels
Path to a CSV file containing descriptor UUID to truth label associations. This defines which descriptors are used from the given index. We raise an error if any descriptor UUIDs listed here are not available in the given descriptor index. This file should be in [uuid, label] column format.
- num_folds
Number of folds to make for cross validation.
- random_seed
Optional fixed seed for the
- classification_use_multiprocessing
If we should use multiprocessing (vs threading) when classifying elements.
- pr_curves
- enabled
If Precision/Recall plots should be generated.
- show
If we should attempt to show the graph after it has been generated (matplotlib).
- output_directory
Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.
- file_prefix
String prefix to prepend to standard plot file names.
- roc_curves
- enabled
If ROC curves should be generated.
- show
If we should attempt to show the plot after it has been generated (matplotlib).
- output_directory
Directory to save generated plots to. If None, we will not save plots. Otherwise we will create the directory (and required parent directories) if it does not exist.
- file_prefix
String prefix to prepend to standard plot file names.
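The fold construction described by the truth_labels, num_folds and random_seed options can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the make_folds helper, the sample UUIDs and the round-robin dealing strategy are all assumptions made for the example.

```python
import csv
import io
import random


def make_folds(rows, num_folds, random_seed=None):
    """Shuffle (uuid, label) rows and deal them into ``num_folds`` folds."""
    rows = list(rows)
    # A fixed seed makes the shuffle, and therefore the folds, repeatable.
    random.Random(random_seed).shuffle(rows)
    return [rows[i::num_folds] for i in range(num_folds)]


# Stand-in for the truth-label CSV file; UUIDs and labels are made up.
truth_csv = io.StringIO("a1,cat\nb2,dog\nc3,cat\nd4,dog\ne5,cat\nf6,dog\n")
truth_rows = [(uuid, label) for uuid, label in csv.reader(truth_csv)]

folds = make_folds(truth_rows, num_folds=3, random_seed=0)
for i, test_fold in enumerate(folds):
    # Train on every fold except the held-out one, then test on that one.
    train_rows = [r for j, f in enumerate(folds) if j != i for r in f]
    print("fold", i, "train:", len(train_rows), "test:", len(test_fold))
```

Because a new model is trained for every held-out fold, a classifier configured with a persistent model would be overwritten or reused between folds, which is why the configuration notes above advise against it.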
usage: classifier_kfold_validation [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
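The "update the default configuration with the contents of the given configuration" behavior of --generate-config can be pictured as a recursive dictionary merge. The sketch below assumes nested dictionaries are merged key by key; the merge_config helper and the sample values are hypothetical, and SMQTK's own merge routine may differ in detail.

```python
import copy
import json


def merge_config(default, override):
    """Recursively update a copy of ``default`` with values from ``override``."""
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Otherwise the provided value replaces the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical default and user-provided configurations.
default_config = {
    "cross_validation": {"num_folds": 6, "random_seed": None},
    "pr_curves": {"enabled": True, "show": False},
}
user_config = {"cross_validation": {"num_folds": 3}}

merged_config = merge_config(default_config, user_config)
print(json.dumps(merged_config, indent=2))
```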
classifier_model_validation¶
Utility for validating a given classifier implementation’s model against some labeled testing data, outputting PR and ROC curve plots with area-under-curve score values.
This utility can optionally be used to train a supervised classifier model if the given classifier model configuration does not exist and a second CSV file listing labeled training data is provided. Training will be attempted if train is set to true. If training is performed, we exit after training completes. A SupervisedClassifier sub-classing implementation must be configured.
We expect the test and train CSV files in the column format:
… <UUID>,<label> …
The UUID is of the descriptor to which the label applies. The label may be any arbitrary string value, but all labels must be consistent in application.
Some metrics presented assume the highest confidence class as the single predicted class for an element:
confusion matrix
The output UUID confusion matrix is a JSON dictionary where the top-level keys are the true labels, and the inner dictionary is the mapping of predicted labels to the UUIDs of the classifications/descriptors that yielded the prediction. Again, this is based on the maximum probability label for a classification result (T=0.5).
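The shape of the UUID confusion matrix can be illustrated with a small sketch. The input dictionaries and their values are invented for the example; only the output layout (true label, then predicted label, then list of UUIDs) and the maximum-confidence rule come from the description above.

```python
import json

# Hypothetical inputs: the true label for each descriptor UUID, and the
# classification result for each UUID as a {label: confidence} mapping.
true_labels = {"u1": "cat", "u2": "cat", "u3": "dog"}
classifications = {
    "u1": {"cat": 0.9, "dog": 0.1},
    "u2": {"cat": 0.3, "dog": 0.7},
    "u3": {"cat": 0.2, "dog": 0.8},
}

uuid_confusion = {}
for uuid, true_label in sorted(true_labels.items()):
    scores = classifications[uuid]
    # The predicted class is the label with the highest confidence.
    predicted = max(scores, key=scores.get)
    uuid_confusion.setdefault(true_label, {}).setdefault(predicted, []).append(uuid)

print(json.dumps(uuid_confusion, indent=2, sort_keys=True))
```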
- See Scikit-Learn PR and ROC curve explanations and examples:
usage: classifier_model_validation [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
classifyFiles¶
Based on an input, trained classifier configuration, classify a number of media files, whose descriptor is computed by the configured descriptor generator. Input files that classify as the given label are then output to standard out. Thus, this script acts like a filter.
usage: classifyFiles [-h] [-v] [-c PATH] [-g PATH] [--overwrite] [-l LABEL]
[GLOB [GLOB ...]]
Positional Arguments¶
- GLOB
Series of shell globs specifying the files to classify.
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
Classification¶
- --overwrite
When generating a configuration file, overwrite an existing file.
Default: False
- -l, --label
The class to filter by. This is based on the classifier configuration/model used. If this is not provided, we will list the available labels in the provided classifier configuration.
compute_classifications¶
Script for asynchronously computing classifications for DescriptorElements in a DescriptorSet specified via a list of UUIDs. Results are output to a CSV file in the format:
uuid, label1_confidence, label2_confidence, …
CSV column labels are output to the given CSV header file path. Label columns will be in the order reported by the classifier implementation's get_labels method.
Due to using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.
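The split between a header file and a data file can be sketched as below. In-memory buffers stand in for the --csv-header and --csv-data paths, the labels and confidences are invented, and whether the real header file includes the leading uuid column is an assumption.

```python
import csv
import io

# Hypothetical label order, as a classifier's get_labels method might report it.
labels = ["cat", "dog"]
# Hypothetical results: descriptor UUID -> {label: confidence}.
results = {
    "u1": {"cat": 0.9, "dog": 0.1},
    "u2": {"dog": 0.75, "cat": 0.25},
}

header_out = io.StringIO()  # stands in for the --csv-header file
data_out = io.StringIO()    # stands in for the --csv-data file

csv.writer(header_out).writerow(["uuid"] + labels)
data_writer = csv.writer(data_out)
for uuid in sorted(results):
    # Confidence columns always follow the label order, not the dict order.
    data_writer.writerow([uuid] + [results[uuid][label] for label in labels])

print(header_out.getvalue(), end="")
print(data_out.getvalue(), end="")
```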
usage: compute_classifications [-h] [-v] [-c PATH] [-g PATH]
[--uuids-list PATH] [--csv-header PATH]
[--csv-data PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
Input Output Files¶
- --uuids-list
Path to the input file listing UUIDs to process.
- --csv-header
Path to the file to output column header labels.
- --csv-data
Path to the file to output the CSV data to.
compute_hash_codes¶
Compute LSH hash codes based on the provided functor on all or specific descriptors from the configured index given a file-list of UUIDs.
When using an input file-list of UUIDs, we require that the UUIDs of indexed descriptors be strings, or equality comparable to the UUIDs’ string representation.
We update a key-value store with the results of descriptor hash computation. We assume the keys of the store are the integer hash values and the values of the store are frozenset instances of descriptor UUIDs (hashable-type objects).
We also assume that no other source is concurrently modifying this key-value store due to the need to modify the values of keys.
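The update pattern on the key-value store can be sketched with a plain dictionary. The add_hash_entries helper, the hash values and the UUIDs are hypothetical; the real script works against a configured SMQTK key-value store rather than a dict.

```python
def add_hash_entries(kv_store, hash_to_uuids):
    """Merge newly computed {hash: uuids} results into ``kv_store`` in place."""
    for hash_int, uuids in hash_to_uuids.items():
        existing = kv_store.get(hash_int, frozenset())
        # frozenset is immutable, so a new combined set replaces the old value.
        kv_store[hash_int] = existing | frozenset(uuids)
    return kv_store


# A plain dict stands in for the key-value store; hashes and UUIDs are made up.
store = {0b1011: frozenset({"u1"})}
add_hash_entries(store, {0b1011: {"u2"}, 0b0110: {"u3"}})
print(store)
```

The read-then-write step on each key is what makes concurrent modification unsafe: two writers reading the same old value would each store a set that is missing the other's additions.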
usage: compute_hash_codes [-h] [-v] [-c PATH] [-g PATH] [--uuids-list PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
I/O¶
- --uuids-list
Optional path to a file listing UUIDs of descriptors to compute hash codes for. If not provided, we compute hash codes for all descriptors in the configured descriptor index.
compute_many_descriptors¶
Descriptor computation helper utility. Checks data content type with respect to the configured descriptor generator to skip content that does not match the accepted types. Optionally, we can additionally filter out image content whose image bytes we cannot load via PIL.Image.open.
usage: compute_many_descriptors [-h] [-v] [-c PATH] [-g PATH] [-b INT]
[--check-image] [-f PATH] [-p PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
- -b, --batch-size
Number of files to batch together into a single compute async call. This defines the granularity of the checkpoint file with regard to computation completed. If given 0, we do not batch and will perform a single compute_async call on the configured generator.
Default: 0
- --check-image
If we should check image pixel loading before queueing an input image for processing. If we cannot load the image pixels via PIL.Image.open, the input image is not queued for processing.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
Required Arguments¶
- -f, --file-list
Path to a file that lists data file paths. Paths in this file may be relative, but will at some point be coerced into absolute paths based on the current working directory.
- -p, --completed-files
Path to a file into which we add CSV format lines detailing filepaths that have been computed from the file-list provided, as the UUID for that data (currently the SHA1 checksum of the data).
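The effect of --batch-size on how the file list is grouped can be sketched as follows. The iter_batches helper and the file names are hypothetical; the sketch only illustrates the grouping, with each yielded list corresponding to one compute call and one checkpoint update.

```python
def iter_batches(paths, batch_size):
    """Yield lists of paths; a batch_size of 0 yields everything at once."""
    paths = list(paths)
    if batch_size <= 0:
        # No batching: a single call covers the whole file list.
        yield paths
        return
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


file_paths = ["a.png", "b.png", "c.png", "d.png", "e.png"]
batches = list(iter_batches(file_paths, 2))
print(batches)
```

Smaller batches mean the completed-files checkpoint is updated more often, so less work is repeated if a run is interrupted.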
computeDescriptor¶
Compute a descriptor vector for a given data file, outputting the generated feature vector to standard out, or to an output file if one was specified (in numpy format).
usage: computeDescriptor [-h] [-v] [-c PATH] [-g PATH] [--overwrite]
[-o OUTPUT_FILEPATH]
[input_file]
Positional Arguments¶
- input_file
Data file to compute descriptor on
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
- --overwrite
Force descriptor computation even if an existing descriptor vector was discovered based on the given content descriptor type and data combination.
Default: False
- -o, --output-filepath
Optional path to a file to output feature vector to. Otherwise the feature vector is printed to standard out. Output is saved in numpy binary format (.npy suffix recommended).
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
createFileIngest¶
Add a set of local system files to a data set via explicit paths or shell-style glob strings.
usage: createFileIngest [-h] [-v] [-c PATH] [-g PATH] [GLOB [GLOB ...]]
Positional Arguments¶
- GLOB
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
descriptors_to_svmtrainfile¶
Utility script to transform a set of descriptors, specified by UUID, with matching class labels, to a text file usable by libSVM utilities for train/test experiments.
The input CSV file is assumed to be of the format:
uuid,label …
This is the same as the format requested for other scripts like classifier_model_validation.py.
This is very useful for searching for -c and -g parameter values for a training sample of data using the tools/grid.py script, found in the libSVM source tree. For example:
<smqtk_source>/TPL/libsvm-3.1-custom/tools/grid.py -log2c -5,15,2 -log2g 3,-15,-2 -v 5 -out libsvm.grid.out -png libsvm.grid.png -t 0 -w1 3.46713615023 -w2 12.2613240418 output_of_this_script.txt
usage: descriptors_to_svmtrainfile [-h] [-v] [-c PATH] [-g PATH] [-f PATH]
[-o PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
IO Options¶
- -f
Path to the csv file mapping descriptor UUIDs to their class label. String labels are transformed into integers for libSVM. Integers start at 1 and are applied in the order that labels are seen in this input file.
- -o
Path to the output file to write libSVM labeled descriptors to.
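The label-to-integer mapping and the libSVM line layout can be sketched as below. The to_libsvm_lines helper, the UUIDs and the vectors are hypothetical, and the number formatting is an arbitrary choice for the example; the libSVM text format itself is one line per sample, an integer label followed by 1-based index:value pairs.

```python
def to_libsvm_lines(uuid_labels, vectors):
    """Format labeled vectors as libSVM lines: '<int_label> 1:v1 2:v2 ...'."""
    label_to_int = {}
    lines = []
    for uuid, label in uuid_labels:
        # Integer labels start at 1 and follow first-seen order in the input.
        int_label = label_to_int.setdefault(label, len(label_to_int) + 1)
        features = " ".join(
            "%d:%.10g" % (i, v) for i, v in enumerate(vectors[uuid], start=1)
        )
        lines.append("%d %s" % (int_label, features))
    return lines, label_to_int


# Made-up CSV rows (uuid, label) and descriptor vectors keyed by UUID.
uuid_labels = [("u1", "dog"), ("u2", "cat"), ("u3", "dog")]
vectors = {"u1": [0.5, 1.0], "u2": [0.0, 2.0], "u3": [1.5, 0.25]}

svm_lines, label_map = to_libsvm_lines(uuid_labels, vectors)
print("\n".join(svm_lines))
print(label_map)
```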
generate_image_transform¶
Utility for transforming an input image in various standardized ways, saving out those transformed images with standard namings. Transformations used are configurable via a configuration file (JSON).
Configuration details:
{
    "crop": {
        "center_levels": null | int
            # If greater than 0, crop out one or more increasingly smaller
            # images from a base image by cutting off increasingly larger
            # portions of the outside perimeter. Cropped image dimensions are
            # determined by the dimensions of the base image and the number
            # of crops to generate.
        "quadrant_pyramid_levels": null | int
            # If greater than 0, generate a number of crops based on a number
            # of quad-tree partitions made based on the given number of
            # levels. Partitions for all levels less than the level provided
            # are also made.
        "tile_shape": null | [width, height]
            # If not null and is a list of two integers, crop out tile windows
            # from the base image that have the width and height specified.
            # If the image width or height is not evenly divisible by the tile
            # width or height, respectively, then we crop out as many tiles as
            # neatly fit starting from the axis origin. The remaining pixels
            # are ignored.
        "tile_stride": null | [x, y]
            # If not null and is a list of two integers, crop out sub-images
            # of the above width and height (if given) with this stride. When
            # this is not provided, the default stride is the same as the tile
            # width and height.
    },
    "brightness_levels": null | int
        # Generate a number of images with different brightness levels using
        # linear interpolation to choose levels between 0 (black) and 1
        # (original image) as well as between 1 and 2.
        # Results will not include brightness level 0, 1 or 2 images.
    "contrast_levels": null | int
        # Generate a number of images with different contrast levels using
        # linear interpolation to choose levels between 0 (black) and 1
        # (original image) as well as between 1 and 2.
        # Results will not include contrast level 0, 1 or 2 images.
}
usage: generate_image_transform [-h] [-v] [-c PATH] [-g PATH] [-i IMAGE]
[-o OUTPUT]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
Input/Output¶
- -i, --image
Image to produce transformations for.
- -o, --output
Directory to output generated images to. By default, if not told otherwise, we will write output images in the same directory as the source image. Output images share a core filename as the source image, but with extra suffix syntax to differentiate produced images from the original. Output images will share the same image extension as the source image.
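The tile_shape and tile_stride behavior described in the configuration details can be sketched as a computation of crop boxes. The tile_boxes helper and the box layout (left, upper, right, lower) are assumptions for illustration; the sketch only covers which whole tiles fit, starting from the origin, and does not touch image data.

```python
def tile_boxes(image_size, tile_shape, tile_stride=None):
    """Return (left, upper, right, lower) boxes of whole tiles in the image."""
    width, height = image_size
    tile_w, tile_h = tile_shape
    # Without an explicit stride, tiles are laid out edge to edge.
    stride_x, stride_y = tile_stride if tile_stride else tile_shape
    boxes = []
    for y in range(0, height - tile_h + 1, stride_y):
        for x in range(0, width - tile_w + 1, stride_x):
            boxes.append((x, y, x + tile_w, y + tile_h))
    return boxes


# A 10x7 image with 4x3 tiles: two columns and two rows fit; the rest is ignored.
boxes = tile_boxes((10, 7), (4, 3))
print(boxes)

# A stride smaller than the tile width produces overlapping tiles.
overlapping = tile_boxes((10, 7), (4, 3), tile_stride=(2, 3))
print(len(overlapping))
```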
iqr_app_model_generation¶
Train and generate models for the SMQTK IQR Application.
This application takes the same configuration file as the IqrService REST service. To generate a default configuration, please refer to the runApplication tool for the IqrService application:
runApplication -a IqrService -g config.IqrService.json
usage: iqr_app_model_generation [-h] [-v] -c PATH PATH -t TAB GLOB [GLOB ...]
Positional Arguments¶
- GLOB
Shell glob to files to add to the configured data set.
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
- -c, --config
Path to the JSON configuration files. The first file provided should be the configuration file for the IqrSearchDispatcher web-application and the second should be the configuration file for the IqrService web-application.
- -t, --tab
The configuration “tab” of the IqrSearchDispatcher configuration to use. This informs what dataset to add the input data files to.
iqrTrainClassifier¶
Train a supervised classifier based on an IQR session state dump.
Descriptors used in IQR, and thus referenced via their UUIDs in the IQR session state dump, must exist external to the IQR web-app (uses a non-memory backend). This is needed so that this script might access them for classifier training.
Click the “Save IQR State” button to download the IqrState file encapsulating the descriptors of positively and negatively marked items. These descriptors will be used to train the configured SupervisedClassifier.
usage: iqrTrainClassifier [-h] [-v] [-c PATH] [-g PATH] [-i IQR_STATE]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
- -i, --iqr-state
Path to the ZIP file saved from an IQR session.
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
make_balltree¶
Script for building and saving the model for the SkLearnBallTreeHashIndex implementation of HashIndex.
usage: make_balltree [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
minibatch_kmeans_clusters¶
Script for generating clusters from descriptors in a given descriptor set using the mini-batch KMeans implementation from Scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html).
By the nature of Scikit-learn’s MiniBatchKMeans implementation, euclidean distance is used to measure distance between descriptors.
usage: minibatch_kmeans_clusters [-h] [-v] [-c PATH] [-g PATH] [-o PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
output¶
- -o, --output-map
Path to output the clustering class mapping to. Saved as a pickle file using pickle protocol -1 (the highest protocol available).
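Reading the "-1 format" as Python's pickle protocol argument, where -1 selects the highest available protocol, saving and loading the mapping can be sketched as below. The structure of cluster_map (cluster label to a set of descriptor UUIDs) is an assumption; the actual mapping written by the script may be shaped differently.

```python
import pickle

# Hypothetical clustering result: cluster label -> set of descriptor UUIDs.
cluster_map = {0: {"u1", "u3"}, 1: {"u2"}}

# Protocol -1 is shorthand for pickle.HIGHEST_PROTOCOL.
payload = pickle.dumps(cluster_map, -1)

# pickle detects the protocol on load, so no protocol argument is needed.
restored = pickle.loads(payload)
print(restored)
```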
proxyManagerServer¶
Server for hosting proxy manager which hosts proxy object instances.
This takes a simple configuration file that looks like the following:
usage: proxyManagerServer [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
removeOldFiles¶
Utility to recursively scan and remove files underneath a given directory if they have not been modified for longer than a set amount of time.
usage: removeOldFiles [-h] [-d BASE_DIR] [-i INTERVAL] [-e EXPIRY] [-v]
Named Arguments¶
- -d, --base-dir
Starting directory for scan.
- -i, --interval
Number of seconds between each scan (integer).
- -e, --expiry
Number of seconds until a file has “expired” (integer).
- -v, --verbose
Display more messages (debugging).
Default: False
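A single scan pass of the kind this utility performs can be sketched with the standard library. The remove_old_files helper is hypothetical and uses file modification time as the expiry criterion, as the description above states; the demo creates its own temporary directory so nothing outside it is touched.

```python
import os
import tempfile
import time


def remove_old_files(base_dir, expiry_seconds, now=None):
    """Delete files under ``base_dir`` not modified within ``expiry_seconds``."""
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > expiry_seconds:
                os.remove(path)
                removed.append(path)
    return removed


with tempfile.TemporaryDirectory() as base_dir:
    old_path = os.path.join(base_dir, "old.txt")
    new_path = os.path.join(base_dir, "new.txt")
    for path in (old_path, new_path):
        with open(path, "w") as f:
            f.write("x")
    # Backdate one file by an hour so it counts as expired.
    hour_ago = time.time() - 3600
    os.utime(old_path, (hour_ago, hour_ago))
    removed = remove_old_files(base_dir, expiry_seconds=600)
    remaining = sorted(os.listdir(base_dir))

print(removed, remaining)
```

Repeating such a pass every --interval seconds, for example with time.sleep between passes, gives the periodic behavior the options describe.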
runApplication¶
Generic entry point for running SMQTK web applications defined in [smqtk.web](/python/smqtk/web).
Runs conforming SMQTK Web Applications.
usage: runApplication [-h] [-v] [-c PATH] [-g PATH] [-l] [-a APPLICATION] [-r]
[-t] [--host HOST] [--port PORT] [--use-basic-auth]
[--use-simple-cors] [--debug-server] [--debug-smqtk]
[--debug-app] [--debug-ns DEBUG_NS]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.
Application Selection¶
- -l, --list
List currently available applications for running. More description is included if SMQTK verbosity is increased (-v | --debug-smqtk)
Default: False
- -a, --application
Label of the web application to run.
Server options¶
- -r, --reload
Turn on server reloading.
Default: False
- -t, --threaded
Turn on server multi-threading.
Default: False
- --host
Run host address specification override. This will override all other configuration method specifications.
- --port
Run port specification override. This will override all other configuration method specifications.
- --use-basic-auth
Use global basic authentication as configured.
Default: False
- --use-simple-cors
Allow CORS for all domains on all routes. This follows the “Simple Usage” of flask-cors: https://flask-cors.readthedocs.io/en/latest/#simple-usage
Default: False
Other options¶
- --debug-server
Turn on server debugging messages ONLY. This is implied when -v|--verbose is enabled.
Default: False
- --debug-smqtk
Turn on SMQTK debugging messages ONLY. This is implied when -v|--verbose is enabled.
Default: False
- --debug-app
Turn on flask app logger namespace debugging messages ONLY. This is effectively enabled if the flask app is provided with SMQTK and “--debug-smqtk” is passed. This is also implied if -v|--verbose is enabled.
Default: False
- --debug-ns
Specify additional python module namespaces to enable debug logging for.
Default: []
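Enabling debug logging per namespace, as the --debug-* options describe, maps naturally onto Python's logging hierarchy. The sketch below is an assumption about the general approach, not the tool's code: the enable_debug_namespaces helper is hypothetical, and the namespace names are only examples.

```python
import logging


def enable_debug_namespaces(namespaces):
    """Set the DEBUG level on each named logger; child loggers inherit it."""
    for namespace in namespaces:
        logging.getLogger(namespace).setLevel(logging.DEBUG)


# Everything else stays at WARNING unless explicitly enabled.
logging.basicConfig(level=logging.WARNING)
enable_debug_namespaces(["smqtk", "werkzeug"])

# A logger under an enabled namespace now emits debug messages.
logging.getLogger("smqtk.web").debug("debug logging is on for this namespace")
```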
summarizePlugins¶
Print out information about what plugins are currently usable and the documentation headers for each implementation.
usage: summarizePlugins [-h] [-v] [--defaults DEFAULTS]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
- --defaults
Optionally generate default configuration blocks for each plugin structure and output as JSON to the specified path.
Default: False
train_itq¶
Tool for training the ITQ functor algorithm’s model on descriptors in a set.
By default, we use all descriptors in the configured set (uuids_list_filepath is not given a value).
The uuids_list_filepath configuration property is optional and should be used to specify a sub-set of descriptors in the configured set to train on. This only works if the stored descriptors’ UUID is a type of string.
usage: train_itq [-h] [-v] [-c PATH] [-g PATH]
Named Arguments¶
- -v, --verbose
Output additional debug logging.
Default: False
Configuration¶
- -c, --config
Path to the JSON configuration file.
- -g, --generate-config
Optionally generate a default configuration file at the specified path. If a configuration file was provided, we update the default configuration with the contents of the given configuration.