NearestNeighborServiceServer Incremental Update Example

Goal and Plan

In this example, we will show how to initially set up an instance of the NearestNeighborServiceServer web API service class such that it can handle incremental updates to its background data. We will also show how to perform incremental updates and confirm that the service recognizes this new data.

For this example, we will use the LSHNearestNeighborIndex implementation as it is one that currently supports live-reloading its component model files. Along with it, we will use the ItqFunctor and PostgresDescriptorIndex implementations as the components of the LSHNearestNeighborIndex. For simplicity, we will not use a specific HashIndex, which causes a LinearHashIndex to be constructed and used at query time.

All scripts used in this example’s procedure have a command line interface that uses dash options. Their available options can be listed by giving the -h/--help option. Additional debug logging can be enabled by providing a -d or -v option, depending on the script.
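
For example, to list the options of the descriptor computation script used below (run from the root of the source tree):

$ bin/scripts/compute_many_descriptors.py --help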

This example assumes that you have a basic understanding of:

  • JSON configuration files
  • how to use the bin/runApplication.py script
  • SMQTK’s NearestNeighborServiceServer application and algorithmic/data-structure components:
    • NearestNeighborsIndex, specifically the LSHNearestNeighborIndex implementation
    • DescriptorIndex abstraction and implementations with an updatable persistent storage mechanism (e.g. PostgresDescriptorIndex)
    • LshFunctor abstraction and implementations

Dependencies

Due to our use of the PostgresDescriptorIndex in this example, a minimum installed version of PostgreSQL 9.4 is required, as is the psycopg2 python module (conda and pip installable). Please check and modify the configuration files for this example to be able to connect to the database of your choosing.

Take a look at the etc/smqtk/postgres/descriptor_element/example_table_init.sql and etc/smqtk/postgres/descriptor_index/example_table_init.sql files, located from the root of the source tree, for table creation examples for element and index storage:

$ psql postgres -f etc/smqtk/postgres/descriptor_element/example_table_init.sql
$ psql postgres -f etc/smqtk/postgres/descriptor_index/example_table_init.sql
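
If you wish to confirm that the tables were created in the target database (the default postgres database in the commands above), you can, for example, list its tables:

$ psql postgres -c '\dt'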

Procedure

[1] Getting and Splitting the data set

For this example we will use the Leeds butterfly data set (see the download_leeds_butterfly.sh script). We will split the data set into an initial sub-set composed of about half of the images from each butterfly category (418 total images, listed in the 2.ingest_files_1.txt file). We will then split the remaining data into two more sub-sets, each composed of about half of what remains (about 1/4 of the original data set each, totaling 209 and 205 images, listed in the 3.ingest_files_2.txt and 4.ingest_files_3.txt files respectively).
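
The file lists used below are provided with the example, but as an illustration, a minimal sketch of one way to produce such per-category splits is shown here. It assumes the images were downloaded to leedsbutterfly/images/ and that file names are prefixed with their category number (e.g. 001_0001.jpg); the paths and split logic are illustrative only.

#!/usr/bin/env bash
# Sketch only: split each butterfly category roughly in half for the initial
# ingest, then split the remainder roughly in half for the two incremental
# ingest lists.  Paths and naming are assumptions, not part of the example.
set -e
for category in $(ls leedsbutterfly/images/ | cut -d_ -f1 | sort -u); do
  files=( leedsbutterfly/images/${category}_*.jpg )
  n=${#files[@]}
  half=$(( n / 2 ))
  quarter=$(( (n - half) / 2 ))
  printf '%s\n' "${files[@]:0:half}"                 >> 2.ingest_files_1.txt
  printf '%s\n' "${files[@]:half:quarter}"           >> 3.ingest_files_2.txt
  printf '%s\n' "${files[@]:$(( half + quarter ))}"  >> 4.ingest_files_3.txt
done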

[2] Computing Initial Ingest

For this example, an “ingest” consists of a set of descriptors in an index and a mapping of hash codes to the descriptors.

In this example, we also train the LSH hash code functor’s model, if it needs one, based on the descriptors computed before computing the hash codes. We are using the ITQ functor which does require a model. It may be the case that the functor of choice does not require a model, or a sufficient model for the functor is already available for use, in which case that step may be skipped.

Our example’s initial ingest will use the image files listed in the 2.ingest_files_1.txt text file.

[2a] Computing Descriptors

We will use the script bin/scripts/compute_many_descriptors.py for computing descriptors from a list of file paths. This script will be used again in later sections for additional incremental ingests.

The example configuration file for this script, 2a.config.compute_many_descriptors.json (shown below), should be modified to connect to the appropriate PostgreSQL database and to point to the correct Caffe model files for your system. For this example, we will be using Caffe’s bvlc_alexnet network model with the ilsvrc12 image mean.

{
    "descriptor_factory": {
        "PostgresDescriptorElement": {
            "binary_col": "vector",
            "db_host": "/dev/shm",
            "db_name": "postgres",
            "db_pass": null,
            "db_port": null,
            "db_user": null,
            "table_name": "descriptors",
            "type_col": "type_str",
            "uuid_col": "uid"
        },
        "type": "PostgresDescriptorElement"
    },
    "descriptor_generator": {
        "CaffeDescriptorGenerator": {
            "batch_size": 256,
            "data_layer": "data",
            "gpu_device_id": 0,
            "image_mean_filepath": "/home/purg/dev/caffe/source/data/ilsvrc12/imagenet_mean.binaryproto",
            "load_truncated_images": false,
            "network_is_bgr": true,
            "network_model_filepath": "/home/purg/dev/caffe/source/models/bvlc_alexnet/bvlc_alexnet.caffemodel",
            "network_prototxt_filepath": "/home/purg/dev/caffe/source/models/bvlc_alexnet/deploy.prototxt",
            "pixel_rescale": null,
            "return_layer": "fc7",
            "use_gpu": false
        },
        "type": "CaffeDescriptorGenerator"
    },
    "descriptor_index": {
        "PostgresDescriptorIndex": {
            "db_host": "/dev/shm",
            "db_name": "postgres",
            "db_pass": null,
            "db_port": null,
            "db_user": null,
            "element_col": "element",
            "multiquery_batch_size": 1000,
            "pickle_protocol": -1,
            "read_only": false,
            "table_name": "descriptor_index",
            "uuid_col": "uid"
        },
        "type": "PostgresDescriptorIndex"
    }
}

For running the script, take a look at the example invocation in the file 2a.run.sh:

#!/usr/bin/env bash
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SCRIPT_DIR}"

../../../bin/scripts/compute_many_descriptors.py \
  -v \
  -c 2a.config.compute_many_descriptors.json \
  -f 2.ingest_files_1.txt \
  --completed-files 2a.completed_files.csv

This step yields two side effects:

  • Computed descriptors are saved in the configured implementation’s persistent storage (a PostgreSQL database in our case)
  • A file is generated that maps input files to their DataElement UUID values, otherwise known as their SHA1 checksum values (2a.completed_files.csv for us).
    • This file will be used later as a convenient way of getting at the UUIDs of descriptors processed for a particular ingest.
    • Other uses of this file may include:
      • interfacing with other systems that use file paths as the primary identifier of base data files
      • quickly back-referencing the original file for a given UUID, as UUIDs for descriptor and classification elements are currently the same as those of the original files they were computed from (see the example lookups below).
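
For example, assuming each row of this CSV is of the form "<file path>,<SHA1>" (as implied by the cut command used in step 2c below), simple grep calls can look up a file’s UUID or back-reference a UUID to its source file:

$ grep 001_0001.jpg 2a.completed_files.csv
$ grep 84f62ef716fb73586231016ec64cfeed82305bba 2a.completed_files.csv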

[2b] Training ITQ Model

To train the ITQ model, we will use the script: ./bin/scripts/train_itq.py. We’ll want to train the functor’s model using the descriptors computed in step 2a. Since we will be using the whole index (418 descriptors), we will not need to provide the script with an additional list of UUIDs.

The example configuration file for this script, 2b.config.train_itq.json, should be modified to connect to the appropriate PostgreSQL database.

{
    "descriptor_index": {
        "PostgresDescriptorIndex": {
            "db_host": "/dev/shm",
            "db_name": "postgres",
            "db_pass": null,
            "db_port": null,
            "db_user": null,
            "element_col": "element",
            "multiquery_batch_size": 1000,
            "pickle_protocol": -1,
            "read_only": false,
            "table_name": "descriptor_index",
            "uuid_col": "uid"
        },
        "type": "PostgresDescriptorIndex"
    },
    "itq_config": {
        "bit_length": 256,
        "itq_iterations": 50,
        "mean_vec_filepath": "2b.itq.256bit.mean_vec.npy",
        "random_seed": 0,
        "rotation_filepath": "2b.itq.256bit.rotation.npy"
    },
    "parallel": {
        "index_load_cores": 4,
        "use_multiprocessing": true
    },
    "uuids_list_filepath": null
}

2b.run.sh contains an example call of the training script:

#!/usr/bin/env bash
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SCRIPT_DIR}"

../../../bin/scripts/train_itq.py -v -c 2b.config.train_itq.json

This step produces the following side effects:

  • Writes the two file components of the model as configured.
    • We configured the output files:
      • 2b.itq.256bit.mean_vec.npy
      • 2b.itq.256bit.rotation.npy

[2c] Computing Hash Codes

For this step we will be using the script bin/scripts/compute_hash_codes.py to compute ITQ hash codes for the currently computed descriptors. We will be using the descriptor index we added to before as well as the ItqFunctor models we trained in the previous step.

This script additionally wants to know the UUIDs of the descriptors to compute hash codes for. We can use the 2a.completed_files.csv file generated earlier in step 2a to get the UUID (SHA1 checksum) values for the computed files. Remember, as documented in the DescriptorGenerator interface, a descriptor’s UUID is the same as the UUID of the data it was generated from, thus we can use this file.

We can conveniently extract these UUIDs with the following commands in script 2c.extract_ingest_uuids.sh, resulting in the file 2c.uuids_for_processing.txt:

#!/usr/bin/env bash
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SCRIPT_DIR}"

cat 2a.completed_files.csv | cut -d',' -f2 >2c.uuids_for_processing.txt

With this file, we can now complete the configuration for our computation script:

{
    "plugins": {
        "descriptor_index": {
            "PostgresDescriptorIndex": {
                "db_host": "/dev/shm",
                "db_name": "postgres",
                "db_pass": null,
                "db_port": null,
                "db_user": null,
                "element_col": "element",
                "multiquery_batch_size": 1000,
                "pickle_protocol": -1,
                "read_only": false,
                "table_name": "descriptor_index",
                "uuid_col": "uid"
            },
            "type": "PostgresDescriptorIndex"
        },
        "lsh_functor": {
            "ItqFunctor": {
                "bit_length": 256,
                "itq_iterations": 50,
                "mean_vec_filepath": "2b.itq.256bit.mean_vec.npy",
                "random_seed": 0,
                "rotation_filepath": "2b.itq.256bit.rotation.npy"
            },
            "type": "ItqFunctor"
        }
    },
    "utility": {
        "hash2uuids_input_filepath": null,
        "hash2uuids_output_filepath": "2c.hash2uuids.pickle",
        "pickle_protocol": -1,
        "report_interval": 1.0,
        "use_multiprocessing": true,
        "uuid_list_filepath": "2c.uuids_for_processing.txt"
    }
}

We are not setting a value for hash2uuids_input_filepath because this is the first time we are running this script, thus we do not have an existing structure to add to.

We can now move forward and run the computation script:

#!/usr/bin/env bash
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SCRIPT_DIR}"

../../../bin/scripts/compute_hash_codes.py -v -c 2c.config.compute_hash_codes.json

This step produces the following side effects:

  • Writes the file 2c.hash2uuids.pickle
    • This file will be copied and used in configuring the LSHNearestNeighborIndex for the NearestNeighborServiceServer

[2d] Starting the NearestNeighborServiceServer

Normally, a NearestNeighborsIndex instance would need to have its index built before it can be used. However, we have effectively already done this in the preceding steps, so we are instead able to get right to configuring and starting the NearestNeighborServiceServer. A default configuration may be generated using the generic bin/runApplication.py script (since web applications/servers are plugins) using the command:

$ runApplication.py -a NearestNeighborServiceServer -g 2d.config.nnss_app.json

An example configuration has been provided in 2d.config.nnss_app.json. The DescriptorIndex, DescriptorGenerator and LshFunctor configuration sections should be the same as those used in the preceding sections.

Before configuring, we are copying 2c.hash2uuids.pickle to 2d.hash2uuids.pickle. Since we will be overwriting this file (the 2d version) in steps to come, we want to separate it from the results of step 2c.
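
For example:

$ cp 2c.hash2uuids.pickle 2d.hash2uuids.pickle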

Several settings of particular note for the LSHNearestNeighborIndex implementation are explained below the configuration.

{
    "descriptor_factory": {
        "PostgresDescriptorElement": {
            "binary_col": "vector",
            "db_host": "/dev/shm",
            "db_name": "postgres",
            "db_pass": null,
            "db_port": null,
            "db_user": null,
            "table_name": "descriptors",
            "type_col": "type_str",
            "uuid_col": "uid"
        },
        "type": "PostgresDescriptorElement"
    },
    "descriptor_generator": {
        "CaffeDescriptorGenerator": {
            "batch_size": 256,
            "data_layer": "data",
            "gpu_device_id": 0,
            "image_mean_filepath": "/home/purg/dev/caffe/source/data/ilsvrc12/imagenet_mean.binaryproto",
            "load_truncated_images": false,
            "network_is_bgr": true,
            "network_model_filepath": "/home/purg/dev/caffe/source/models/bvlc_alexnet/bvlc_alexnet.caffemodel",
            "network_prototxt_filepath": "/home/purg/dev/caffe/source/models/bvlc_alexnet/deploy.prototxt",
            "pixel_rescale": null,
            "return_layer": "fc7",
            "use_gpu": false
        },
        "type": "CaffeDescriptorGenerator"
    },
    "flask_app": {
        "BASIC_AUTH_PASSWORD": "demo",
        "BASIC_AUTH_USERNAME": "demo",
        "SECRET_KEY": "MySuperUltraSecret"
    },
    "nn_index": {
        "LSHNearestNeighborIndex": {
            "descriptor_index": {
                "PostgresDescriptorIndex": {
                    "db_host": "/dev/shm",
                    "db_name": "postgres",
                    "db_pass": null,
                    "db_port": null,
                    "db_user": null,
                    "element_col": "element",
                    "multiquery_batch_size": 1000,
                    "pickle_protocol": -1,
                    "read_only": false,
                    "table_name": "descriptor_index",
                    "uuid_col": "uid"
                },
                "type": "PostgresDescriptorIndex"
            },
            "distance_method": "hik",
            "hash2uuid_cache_filepath": "2d.hash2uuids.pickle",
            "hash_index": {
                "type": null
            },
            "hash_index_comment": "'hash_index' may also be null to default to a linear index built at query time.",
            "live_reload": true,
            "lsh_functor": {
                "ItqFunctor": {
                    "bit_length": 256,
                    "itq_iterations": 50,
                    "mean_vec_filepath": "2b.itq.256bit.mean_vec.npy",
                    "random_seed": 0,
                    "rotation_filepath": "2b.itq.256bit.rotation.npy"
                },
                "type": "ItqFunctor"
            },
            "read_only": true,
            "reload_mon_interval": 0.1,
            "reload_settle_window": 1.0
        },
        "type": "LSHNearestNeighborIndex"
    },
    "server": {
        "host": "127.0.0.1",
        "port": 5000
    }
}

Notable configuration settings:

  • distance_method: we are using the hik (histogram intersection) distance method, as it has been experimentally shown to outperform other distance metrics for AlexNet descriptors.
  • hash2uuid_cache_filepath: we are using the output generated during step 2c (copied to 2d.hash2uuids.pickle). This file will be updated during incremental updates, along with the configured DescriptorIndex.
  • hash_index: we are choosing not to use a pre-computed HashIndex, so a LinearHashIndex will be created and used at query time. Other implementations may incorporate live-reload functionality in the future.
  • live_reload: we are telling the LSHNearestNeighborIndex to reload its implementation-specific model files when it detects that they have changed.
    • The hash2uuid_cache_filepath file above is this implementation’s only model file, and it will be updated via the bin/scripts/compute_hash_codes.py script.
  • read_only: we are telling the implementation to make sure it does not write to any of its resources.

We can now start the service using:

$ runApplication.py -a NearestNeighborServiceServer -c 2d.config.nnss_app.json

We can test the server by calling its web api via curl using one of our ingested images, leedsbutterfly/images/001_0001.jpg:

$ curl http://127.0.0.1:5000/nn/n=10/file:///home/purg/data/smqtk/leedsbutterfly/images/001_0001.jpg
{
  "distances": [
    -2440.0882132202387,
    -1900.5749250203371,
    -1825.7734497860074,
    -1771.708476960659,
    -1753.6621350347996,
    -1729.6928340941668,
    -1684.2977819740772,
    -1627.438737615943,
    -1608.4607088603079,
    -1536.5930510759354
  ],
  "message": "execution nominal",
  "neighbors": [
    "84f62ef716fb73586231016ec64cfeed82305bba",
    "ad4af38cf36467f46a3d698c1720f927ff729ed7",
    "2dffc1798596bc8be7f0af8629208c28606bba65",
    "8f5b4541f1993a7c69892844e568642247e4acf2",
    "e1e5f3e21d8e3312a4c59371f3ad8c49a619bbca",
    "e8627a1a3a5a55727fe76848ba980c989bcef103",
    "750e88705efeee2f12193b45fb34ec10565699f9",
    "e21b695a99fee6ff5af8d2b86d4c3e8fe3295575",
    "0af474b31fc8002fa9b9a2324617227069649f43",
    "7da0501f7d6322aef0323c34002d37a986a3bf74"
  ],
  "reference_uri": "file:///home/purg/data/smqtk/leedsbutterfly/images/001_0001.jpg",
  "success": true
}

If we compare the resulting neighbor UUIDs to the SHA1 hash signatures of the original files (from which the descriptors were computed), listed in the step 2a result file 2a.completed_files.csv, we find that the above results are all of class 001, or monarch butterflies.

If we used either of the files leedsbutterfly/images/001_0042.jpg or leedsbutterfly/images/001_0063.jpg, which are not in our initial ingest but are in the subsequent ingests, and set .../n=832/... (the maximum size we will see the ingest grow to), we would see that the API does not return their UUIDs since they have not been ingested yet. We would also see that only 418 neighbors are returned even though we asked for 832, since there are only 418 elements currently in the index. We will use these three files as proof that we are actually expanding the searchable content after each incremental ingest.

We provide a helper bash script, test_in_index.sh, for checking whether a file is findable via the search API. A call of the form:

$ ./test_in_index.sh leedsbutterfly/images/001_0001.jpg 832

… performs a curl call to the server’s default host address and port for the 832 nearest neighbors to the query image file, and checks whether the UUID of the given file (its sha1sum) is in the returned list of UUIDs.
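
The helper’s contents are not reproduced here, but a minimal sketch of what such a helper might look like is given below, assuming the default host and port configured above and using sha1sum to derive the query file’s UUID; the actual script provided with the example may differ in detail.

#!/usr/bin/env bash
# Hypothetical sketch of a test_in_index.sh-style helper.
# Usage: ./test_in_index.sh <image file> <n>
set -e
FILE="$1"
N="$2"
SHA1="$(sha1sum "${FILE}" | cut -d' ' -f1)"
if curl -s "http://127.0.0.1:5000/nn/n=${N}/file://$(readlink -f "${FILE}")" \
     | grep -q "${SHA1}"
then
  echo "${FILE} (${SHA1}): found in index"
else
  echo "${FILE} (${SHA1}): NOT found in index"
fi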

[3] First Incremental Update

Now that we have a live NearestNeighborServiceServer instance running, we can incrementally process the files listed in 3.ingest_files_2.txt, making them available for search without having to shut down or otherwise do anything to the running server instance.

We will be performing the same actions taken in steps 2a and 2c, but with different inputs and outputs:

  1. Compute descriptors for files listed in 3.ingest_files_2.txt using script compute_many_descriptors.py, outputting file 3.completed_files.csv.
  2. Create a list of descriptor UUIDs just computed (see 2c.extract_ingest_uuids.sh) and compute hash codes for those descriptors, overwriting 2d.hash2uuids.pickle (which causes the server’s LSHNearestNeighborIndex instance to update itself).

The following is the updated configuration file for hash code generation. Note the differences from the configuration used in step 2c (notes to follow):

{
    "plugins": {
        "descriptor_index": {
            "PostgresDescriptorIndex": {
                "db_host": "/dev/shm",
                "db_name": "postgres",
                "db_pass": null,
                "db_port": null,
                "db_user": null,
                "element_col": "element",
                "multiquery_batch_size": 1000,
                "pickle_protocol": -1,
                "read_only": false,
                "table_name": "descriptor_index",
                "uuid_col": "uid"
            },
            "type": "PostgresDescriptorIndex"
        },
        "lsh_functor": {
            "ItqFunctor": {
                "bit_length": 256,
                "itq_iterations": 50,
                "mean_vec_filepath": "2b.itq.256bit.mean_vec.npy",
                "random_seed": 0,
                "rotation_filepath": "2b.itq.256bit.rotation.npy"
            },
            "type": "ItqFunctor"
        }
    },
    "utility": {
        "hash2uuids_input_filepath": "2d.hash2uuids.pickle",
        "hash2uuids_output_filepath": "2d.hash2uuids.pickle",
        "pickle_protocol": -1,
        "report_interval": 1.0,
        "use_multiprocessing": true,
        "uuid_list_filepath": "3.uuids_for_processing.txt"
    }
}

Configuration notes:

  • hash2uuids_input_filepath and hash2uuids_output_filepath are both set to the model file (2d.hash2uuids.pickle) that the server’s LSHNearestNeighborIndex implementation was configured to use.
  • uuid_list_filepath should be set to the descriptor UUIDs file generated from 3.completed_files.csv (see 2c.extract_ingest_uuids.sh).

The provided 3.run.sh script is an example of the commands to run for updating the indices and models:

#!/usr/bin/env bash
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SCRIPT_DIR}"

# Compute descriptors for new files, outputting a file that matches input
# files to their SHA1 checksum values (their UUIDs)
../../../bin/scripts/compute_many_descriptors.py \
  -d \
  -c 2a.config.compute_many_descriptors.json \
  -f 3.ingest_files_2.txt \
  --completed-files 3.completed_files.csv

# Extract UUIDs of files/descriptors just generated
cat 3.completed_files.csv | cut -d, -f2 > 3.uuids_for_processing.txt

# Compute hash codes for descriptors just generated, updating the target
# hash2uuids model file.
../../../bin/scripts/compute_hash_codes.py -v -c 3.config.compute_hash_codes.json

After calling the compute_hash_codes.py script, the server logging should yield messages (if run in debug/verbose mode) showing that the LSHNearestNeighborIndex updated its model.

We can now test the NearestNeighborServiceServer using the query examples used at the end of step 2d. Using images leedsbutterfly/images/001_0001.jpg and leedsbutterfly/images/001_0042.jpg as our query examples (and .../n=832/...), we can see that both are in the index (each image is the nearest neighbor to itself). We also see that a total of 627 neighbors are returned, which is the number of elements now in the index after this update. The SHA1 of the third image file, leedsbutterfly/images/001_0063.jpg, when used as the query example, is not included in the returned neighbors and is thus not yet in the index.
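
For example, using the test_in_index.sh helper described at the end of step 2d (exact output depends on the helper):

$ ./test_in_index.sh leedsbutterfly/images/001_0001.jpg 832   # initial ingest: found
$ ./test_in_index.sh leedsbutterfly/images/001_0042.jpg 832   # this increment: found
$ ./test_in_index.sh leedsbutterfly/images/001_0063.jpg 832   # not yet ingested: not found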

[4] Second Incremental Update

Let us repeat the above process again, but using the third increment set (note the input and output file names that differ from 3.run.sh):

#!/usr/bin/env bash
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SCRIPT_DIR}"

# Compute descriptors for new files, outputting a file that matches input
# files to their SHA1 checksum values (their UUIDs)
../../../bin/scripts/compute_many_descriptors.py \
  -d \
  -c 2a.config.compute_many_descriptors.json \
  -f 4.ingest_files_3.txt \
  --completed-files 4.completed_files.csv

# Extract UUIDs of files/descriptors just generated
cat 4.completed_files.csv | cut -d, -f2 > 4.uuids_for_processing.txt

# Compute hash codes for descriptors just generated, updating the target
# hash2uuids model file.
../../../bin/scripts/compute_hash_codes.py -v -c 4.config.compute_hash_codes.json

After this, we should be able to query all three example files used before and see that they are all now included in the index. We will now also see that all 832 neighbors requested are returned for each of the queries, which equals the total number of files we have ingested over the above steps. If we increase n for a query, only 832 neighbors are returned, showing that there are 832 elements in the index at this point.
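
For example, the same test_in_index.sh calls as before should now all report their query files as found:

$ ./test_in_index.sh leedsbutterfly/images/001_0001.jpg 832   # found
$ ./test_in_index.sh leedsbutterfly/images/001_0042.jpg 832   # found
$ ./test_in_index.sh leedsbutterfly/images/001_0063.jpg 832   # found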