WEBVTT

00:06.809 --> 00:09.390
Hello , everybody . Good afternoon .

00:10.579 --> 00:12.689
All right , thank you for joining us

00:12.689 --> 00:14.911
today for our quarterly capability demo

00:14.911 --> 00:17.090
of the JATIC program . Uh , we are

00:17.090 --> 00:19.312
excited , very excited to show you some

00:19.312 --> 00:21.201
of our new developments from this

00:21.201 --> 00:23.201
increment . Um , I'm going to begin

00:23.201 --> 00:25.257
with some quick background about the

00:25.257 --> 00:27.590
program as people are trickling in , uh ,

00:27.590 --> 00:29.701
for those of you for whom this is

00:29.701 --> 00:31.368
your first time attending .

00:33.840 --> 00:35.840
Uh , so for some quick background ,

00:35.840 --> 00:37.896
JATIC was created to help address the

00:37.896 --> 00:39.951
novel test and evaluation challenges

00:39.951 --> 00:42.590
for AI-enabled systems . Slide . Um , a

00:42.590 --> 00:44.534
few different documents identified

00:44.534 --> 00:46.368
these gaps , which really helped

00:46.368 --> 00:48.423
motivate and shape the design of our

00:48.423 --> 00:50.646
program . The first one is the National

00:50.646 --> 00:52.590
Security Commission on AI's final

00:52.590 --> 00:54.534
report . The second one is the DOD

00:54.534 --> 00:56.701
Responsible AI Strategy and Implementation

00:56.701 --> 00:58.757
Pathway , and the last one is CDAO's

00:58.757 --> 01:00.979
own AI test and evaluation capabilities

01:00.979 --> 01:03.740
gap study . Uh , these documents found

01:03.740 --> 01:06.980
that better AI T&E was a necessary

01:06.980 --> 01:08.813
condition for the DOD to develop

01:08.813 --> 01:10.980
responsible AI and to continue leading

01:10.980 --> 01:13.091
its adoption , and that AI was a class of

01:13.091 --> 01:14.758
technology that created novel

01:14.758 --> 01:16.869
challenges and capability gaps within

01:16.869 --> 01:18.869
its testing and evaluation , as I'm

01:18.869 --> 01:20.924
sure you guys are all aware , and in

01:20.924 --> 01:22.813
particular there was a consistent

01:22.813 --> 01:24.647
desire across the department for

01:24.647 --> 01:26.813
centralized investment in AI tools and

01:26.813 --> 01:29.036
resources to help fill these new gaps .

01:29.209 --> 01:31.959
So JATIC was created in FY23 to address

01:31.959 --> 01:35.239
some of these needs . On top of that ,

01:35.519 --> 01:37.519
in October of last year , the White

01:37.519 --> 01:39.630
House released the AI executive order

01:39.630 --> 01:41.741
which laid out a myriad of actions to

01:41.741 --> 01:43.908
ensure that we're developing and using

01:43.908 --> 01:46.019
AI in the right way , and many of the

01:46.019 --> 01:45.849
lines of effort have to do with T&E .

01:45.959 --> 01:48.279
So we have now moved to align JATIC with

01:48.279 --> 01:50.430
several of these crucial activities

01:50.430 --> 01:54.410
slide . And so these reports

01:54.410 --> 01:57.040
and directives found a very large

01:57.040 --> 01:59.349
number of capability gaps within AI T&E

01:59.349 --> 02:02.480
across a lot of different capabilities

02:02.480 --> 02:04.519
from tools to technical enablers to

02:04.519 --> 02:06.959
infrastructure to workforce , and

02:06.959 --> 02:09.070
addressing all of these problems is a

02:09.070 --> 02:11.070
very , very large effort . Uh , JATIC

02:11.070 --> 02:13.403
seeks to address one small part of that .

02:13.403 --> 02:15.720
So our program objective is to develop

02:15.720 --> 02:18.199
software to accelerate and enable AI

02:18.199 --> 02:20.366
test and evaluation for testers across

02:20.366 --> 02:22.310
the DOD Enterprise , ultimately in

02:22.310 --> 02:24.088
order to provide insight on the

02:24.088 --> 02:25.977
performance , effectiveness , and

02:25.977 --> 02:27.977
robustness of the DoD's AI-enabled

02:27.977 --> 02:31.690
systems . Slide . So the software that we

02:31.690 --> 02:33.801
develop within our program adheres to

02:33.801 --> 02:35.912
some core design principles . First ,

02:35.912 --> 02:37.968
all of the JATIC tools that you'll see

02:37.968 --> 02:40.079
today are freely available across the

02:40.079 --> 02:42.134
DoD Enterprise . All of our tools are

02:42.134 --> 02:44.134
designed to be widely interoperable

02:44.134 --> 02:46.190
with different MLOps platforms and

02:46.190 --> 02:47.968
tools . They are designed to be

02:47.968 --> 02:50.023
straightforward to deploy so you can

02:50.023 --> 02:49.919
bring tools to your data instead of

02:50.130 --> 02:52.130
having to do it the other way around .

02:52.130 --> 02:54.186
They are designed to be accessible ,

02:54.186 --> 02:56.408
well documented , and easy to use , and

02:56.408 --> 02:58.241
also to be mature , stable , and

02:58.241 --> 03:00.463
hardened software to be able to support

03:00.463 --> 03:02.408
high-consequence AI T&E decisions .

03:02.408 --> 03:05.360
Slide ? And you can find all of our

03:05.360 --> 03:07.638
tools that you see today on our Gitlab ,

03:07.638 --> 03:09.749
gitlab.jatic.net . You can also visit

03:09.749 --> 03:11.720
our public website at

03:11.940 --> 03:14.399
cdao.pages.jatic.net/public . Next .

03:16.500 --> 03:18.669
Um , so for some quick background on

03:18.669 --> 03:20.891
the demo today , uh , today's demo will

03:20.891 --> 03:22.947
feature the persona Isaac , who is a

03:22.947 --> 03:25.669
senior AI T&E engineer . Isaac will be

03:25.669 --> 03:27.979
using JATIC tools within an operationally

03:27.979 --> 03:29.812
realistic workflow to perform AI

03:29.812 --> 03:31.923
testing for his program . And again ,

03:31.923 --> 03:34.146
all the tools that you see today , um ,

03:34.146 --> 03:36.201
you can find them immediately at our

03:36.201 --> 03:38.990
GitLab at gitlab.jatic.net . So if there's

03:38.990 --> 03:40.768
any particular tool that you're

03:40.768 --> 03:42.768
interested in , you can immediately

03:42.768 --> 03:44.546
find and download it . Um , and

03:44.546 --> 03:46.490
throughout the demo as we show off

03:46.490 --> 03:48.379
these different tools , we really

03:48.379 --> 03:50.434
encourage you to think about them in

03:50.434 --> 03:52.657
the context of your program . This is a

03:52.657 --> 03:54.712
persona-based demo , so we're really

03:54.712 --> 03:56.879
trying to show how these things can be

03:56.879 --> 03:58.768
useful within an actual realistic

03:58.768 --> 04:00.879
workflow . And so consider how an AI T&E

04:00.879 --> 04:02.934
engineer might get use out of any of

04:02.934 --> 04:04.990
these tools or might integrate these

04:04.990 --> 04:07.157
tools into their workflow . During the

04:07.157 --> 04:08.934
demo , please feel free to drop

04:08.934 --> 04:11.101
questions in the chat , and we'll also

04:11.101 --> 04:11.000
have some time for questions at the end

04:11.000 --> 04:13.167
of the demonstration . And with that ,

04:13.167 --> 04:15.389
I'll pass it off to Austin to begin the

04:15.389 --> 04:18.850
demo . Thank you .

04:19.220 --> 04:21.331
So today we will be demonstrating how

04:21.331 --> 04:24.179
Isaac , our senior T&E engineer , uses

04:24.179 --> 04:26.660
a panel-based wizard to create a custom

04:26.660 --> 04:29.339
AI testing pipeline and deploy this

04:29.339 --> 04:31.579
testing pipeline to perform

04:31.579 --> 04:33.357
repeatable testing and generate

04:33.357 --> 04:35.859
detailed test reports . We will begin

04:35.859 --> 04:37.859
by providing some background on who

04:37.859 --> 04:40.220
Isaac is , what his expertise is , and

04:40.220 --> 04:43.890
what the mission is . Our persona Isaac

04:43.890 --> 04:45.769
is a senior test and evaluation

04:45.769 --> 04:47.410
engineer . He has degrees in

04:47.410 --> 04:49.640
mathematics and data science and has 5

04:49.640 --> 04:51.640
years of experience in the field of

04:51.640 --> 04:54.350
Model T&E . Isaac is a government

04:54.350 --> 04:56.670
civilian employee working for a

04:56.670 --> 04:59.470
DoD organization and as a senior T&E

04:59.470 --> 05:01.192
engineer , Isaac is focused on

05:01.192 --> 05:03.414
measuring and comparing the performance

05:03.414 --> 05:05.081
of models by verifying system

05:05.081 --> 05:07.137
requirements and quantifying risks .

05:07.410 --> 05:09.243
Isaac works for a program called

05:09.243 --> 05:12.109
Project Birds Eye , a hypothetical DoD

05:12.109 --> 05:14.519
program seeking to field AI models to

05:14.519 --> 05:16.679
identify structures of interest from

05:16.679 --> 05:19.809
high-altitude UAV imagery . The project

05:19.809 --> 05:21.940
leverages computer vision object

05:21.940 --> 05:24.250
detection to identify and classify

05:24.540 --> 05:26.762
potential structures of interest in the

05:26.762 --> 05:28.762
imagery . Project Birdseye receives

05:28.762 --> 05:30.929
externally trained models from vendors

05:30.929 --> 05:32.873
and tests them for suitability for

05:32.873 --> 05:35.096
deployment in the mission . Most models

05:35.096 --> 05:37.096
the program receives are trained on

05:37.096 --> 05:39.151
data collected in the US . However ,

05:39.151 --> 05:41.373
the program expects to deploy models to

05:41.373 --> 05:43.540
many geographic regions , so they have

05:43.540 --> 05:45.373
collected a set of operationally

05:45.373 --> 05:47.373
relevant data from multiple regions

05:47.373 --> 05:49.207
with which to perform their

05:49.207 --> 05:52.390
testing . In Isaac's role as a senior

05:52.390 --> 05:54.223
test and evaluation engineer for

05:54.223 --> 05:56.446
Project Birdseye , Isaac is expected to

05:56.446 --> 05:58.612
process the externally provided models

05:58.612 --> 06:00.723
and develop tests that evaluate those

06:00.723 --> 06:03.057
models according to system requirements .

06:03.057 --> 06:04.668
He is provided with a system

06:04.668 --> 06:06.779
requirements document , a system risk

06:06.779 --> 06:09.130
assessment , vendor provided models ,

06:09.230 --> 06:11.059
and a held out operationally

06:11.059 --> 06:13.940
representative testing data set . From

06:13.940 --> 06:16.160
these inputs , he will produce a test

06:16.160 --> 06:18.160
plan to evaluate the models against

06:18.160 --> 06:20.739
requirements , detailed test reports on

06:20.739 --> 06:22.579
the model's performance , and

06:22.579 --> 06:24.690
recommendations to the program office

06:24.690 --> 06:26.690
about the needs for additional data

06:26.690 --> 06:28.857
collection , data labeling , and model

06:28.857 --> 06:32.480
retraining . The workflow for Project

06:32.480 --> 06:35.040
Birds Eye consists of two phases : the

06:35.040 --> 06:37.096
dataset analysis phase and the model

06:37.096 --> 06:38.984
evaluation phase . During dataset

06:38.984 --> 06:41.151
analysis , Isaac will perform tests on

06:41.151 --> 06:43.207
held out data from both the data set

06:43.207 --> 06:45.207
that was originally used to develop

06:45.207 --> 06:47.318
models , as well as the operationally

06:47.318 --> 06:49.151
relevant data set . The analysis

06:49.151 --> 06:51.373
provided by Isaac allows the program to

06:51.373 --> 06:53.429
identify early on whether either the

06:53.429 --> 06:55.484
development or operational data sets

06:55.484 --> 06:57.651
contain undesirable features that need

06:57.651 --> 06:59.762
to be corrected prior to moving on to

06:59.762 --> 07:01.984
the evaluation of models . During model

07:01.984 --> 07:03.984
evaluation , Isaac will test models

07:03.984 --> 07:05.818
against the previously validated

07:05.818 --> 07:07.984
testing data sets , as well as apply a

07:07.984 --> 07:09.984
series of targeted perturbations to

07:09.984 --> 07:11.929
test the model's response to known

07:11.929 --> 07:13.929
risks . If the model is found to be

07:13.929 --> 07:15.818
insufficient according to mission

07:15.818 --> 07:17.929
requirements , then Isaac may provide

07:17.929 --> 07:20.096
feedback to Project Birdseye about the

07:20.096 --> 07:22.151
specific deficiencies which they may

07:22.151 --> 07:24.096
then use in making decisions about

07:24.096 --> 07:26.040
model retraining . If the model is

07:26.040 --> 07:28.262
found to achieve the thresholds defined

07:28.262 --> 07:30.096
by the mission , then Isaac will

07:30.096 --> 07:30.079
provide his analysis to the project

07:30.079 --> 07:32.246
along with the recommendation that the

07:32.246 --> 07:33.801
model be approved for use .

07:40.529 --> 07:42.696
In order to evaluate the model against

07:42.696 --> 07:44.862
defined requirements , Isaac developed

07:44.862 --> 07:47.196
a test plan to structure his assessment .

07:47.196 --> 07:49.307
This test plan that I'll present here

07:49.307 --> 07:51.362
has been developed for demonstration

07:51.362 --> 07:53.585
purposes only and is not intended to be

07:53.585 --> 07:55.807
used for real testing . The sections of

07:55.807 --> 07:58.790
the test plan first introduce the

07:58.790 --> 08:01.070
document's purpose , describe the

08:01.070 --> 08:03.910
mission need , describe the

08:03.910 --> 08:06.230
current mission , describe the

08:06.230 --> 08:08.429
proposed mission , and give a system

08:08.429 --> 08:11.609
description for Project Birds Eye . The

08:11.609 --> 08:15.130
next section defines the system

08:15.130 --> 08:17.297
boundaries and in scope conditions for

08:17.297 --> 08:19.519
the test plan . Within this test plan ,

08:19.519 --> 08:21.741
Isaac focuses on testing and evaluating

08:21.741 --> 08:23.908
the AI algorithm , so the test plan is

08:23.908 --> 08:25.963
scoped to consider only the AI model

08:25.963 --> 08:28.609
without any external components . Isaac

08:28.609 --> 08:30.553
further defines the conditions and

08:30.553 --> 08:32.609
expected use that are within scope ,

08:32.609 --> 08:34.331
such as the UAV and the sensor

08:34.331 --> 08:36.289
parameters , as well as in scope

08:36.289 --> 08:38.400
operational conditions such as haze ,

08:38.489 --> 08:42.328
fog , and rain . Next , Isaac

08:42.328 --> 08:44.272
documents the findings of the risk

08:44.272 --> 08:46.495
assessment , which focuses on different

08:46.495 --> 08:48.717
performance risks of the AI algorithm .

08:48.729 --> 08:50.507
These risks include things like

08:50.507 --> 08:52.729
turbulence , which may introduce camera

08:52.729 --> 08:54.396
jitter that impacts the model

08:54.396 --> 08:56.618
performance , training data that is not

08:56.618 --> 08:58.507
representative of the operational

08:58.507 --> 08:58.369
environment , reducing model

08:58.369 --> 09:00.480
performance , or data that is labeled

09:00.480 --> 09:02.480
incorrectly , which can degrade the

09:02.480 --> 09:05.380
quality of tests or training . Using

09:05.380 --> 09:08.820
this risk assessment and the system

09:08.820 --> 09:10.876
performance requirements that he has

09:10.876 --> 09:13.098
been provided , Isaac develops tests to

09:13.098 --> 09:15.098
collect quantifiable data about the

09:15.098 --> 09:16.876
system performance around these

09:16.876 --> 09:18.764
requirements and around the risks

09:18.764 --> 09:20.764
presented . Within this test plan ,

09:20.764 --> 09:23.150
Isaac plans to run the following tests :

09:23.150 --> 09:25.261
identify duplicate and near duplicate

09:25.261 --> 09:27.428
data within the operational data set ,

09:27.510 --> 09:30.270
identify potential outliers and under

09:30.270 --> 09:32.179
sampled regions of the data set .

09:32.510 --> 09:34.399
Identify correlations between the

09:34.399 --> 09:36.630
operational data set metadata and

09:36.630 --> 09:38.852
object classes to detect potential data

09:38.852 --> 09:41.650
set imbalance . Identify classes which

09:41.650 --> 09:43.872
may be under sampled in the operational

09:43.872 --> 09:46.719
data set . Identify low value test data

09:46.719 --> 09:48.609
in the operational data set , and

09:48.609 --> 09:50.831
identify potential errors in the labels

09:50.831 --> 09:52.831
of the operational data set . He'll

09:52.831 --> 09:54.887
also develop tests for measuring the

09:54.887 --> 09:56.998
per class performance of the model on

09:56.998 --> 09:59.053
clean test data to establish a model

09:59.053 --> 10:01.165
baseline . He will evaluate the model

10:01.165 --> 10:03.331
performance against increasing degrees

10:03.331 --> 10:05.109
of camera jitter in the X and Y

10:05.109 --> 10:07.220
directions , and he will evaluate the

10:07.220 --> 10:09.276
model performance against increasing

10:09.276 --> 10:11.770
degrees of average blur . For each test ,

10:11.880 --> 10:14.047
Isaac will identify the purpose of the

10:14.047 --> 10:16.320
test . He'll identify which data sets

10:16.320 --> 10:18.653
or models the test should be applied to .

10:18.880 --> 10:21.080
He will identify which thresholds must

10:21.080 --> 10:22.913
be achieved for the system to be

10:22.913 --> 10:24.989
passing , and he'll identify any

10:24.989 --> 10:27.211
actions that will be taken based on the

10:27.211 --> 10:29.322
information collected above . He will

10:29.322 --> 10:31.433
also identify any tools or techniques

10:31.433 --> 10:33.545
he used to run the test , and he will

10:33.545 --> 10:35.669
identify what inputs the tools need

10:35.909 --> 10:38.242
and the outputs expected from the tools .
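
For reference, the per-test fields Isaac just described could be captured as a structured record like the sketch below; every field name and value here is hypothetical, not taken from his actual test plan (only the 0.5 / 0.35 thresholds appear later in the demo).

```python
# Illustrative shape of one test-plan entry; all names/values are made up.
test_entry = {
    "name": "baseline_per_class_performance",
    "purpose": "Establish per-class model performance on clean test data",
    "applies_to": {"datasets": ["operational_v1"], "models": ["all_vendor_models"]},
    "thresholds": {"overall_mAP": 0.5, "min_class_mAP": 0.35},  # pass criteria
    "actions": "Report deficiencies; flag classes for collection or retraining",
    "tools": ["object_detection_scorer"],         # hypothetical tool name
    "inputs": ["images", "ground_truth_labels"],  # what the tool needs
    "outputs": ["per_class_mAP", "overall_mAP"],  # what the tool produces
}
```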

10:38.530 --> 10:40.863
After Isaac is happy with his test plan ,

10:40.863 --> 10:43.086
he submits it to his project leadership

10:43.086 --> 10:45.308
for their review and approval . Once he

10:45.308 --> 10:47.419
has his test plan back , he can begin

10:47.419 --> 10:50.979
his testing . Isaac performs all of

10:50.979 --> 10:53.059
his development and testing on the

10:53.059 --> 10:55.450
Nabari platform , a managed environment

10:55.450 --> 10:57.561
developed within the JATIC program that

10:57.561 --> 10:59.506
allows him to quickly start up the

10:59.506 --> 11:01.672
compute resources he needs to perform

11:01.672 --> 11:03.783
his testing . To enable his testing ,

11:03.783 --> 11:05.979
Isaac has created a series of low code

11:05.979 --> 11:08.549
interfaces for quickly configuring

11:08.549 --> 11:10.739
repeatable testing pipelines . For

11:10.739 --> 11:12.961
today's demo , Isaac will configure two

11:12.961 --> 11:14.795
pipelines , one for his data set

11:14.795 --> 11:16.850
analysis tests and one for his model

11:16.850 --> 11:19.020
evaluations . We will begin with the

11:19.020 --> 11:21.131
configuration of the dataset analysis

11:21.131 --> 11:22.131
pipeline .

11:25.330 --> 11:27.640
Isaac will use this low code panel

11:27.640 --> 11:29.909
application for the configuration of an

11:29.909 --> 11:32.076
automated data set analysis pipeline .

11:32.250 --> 11:34.530
The pipeline configuration is separated

11:34.530 --> 11:36.752
into different stages based on the type

11:36.752 --> 11:38.974
of analysis provided and the tools that

11:38.974 --> 11:40.974
provide the capability . Isaac will

11:40.974 --> 11:43.086
configure this pipeline's options

11:43.086 --> 11:45.141
to match the tests outlined in

11:45.141 --> 11:45.090
his plan .

11:52.580 --> 11:55.200
In order to provide useful data about

11:55.200 --> 11:57.367
model performance , data sets used for

11:57.367 --> 11:59.119
T&E must be high quality and

11:59.119 --> 12:01.008
representative of the operational

12:01.008 --> 12:03.063
environment . The objectives of this

12:03.063 --> 12:05.286
data set analysis phase are to validate

12:05.286 --> 12:07.508
the overall quality of the data sets in

12:07.508 --> 12:09.675
their labels , increase the efficiency

12:09.675 --> 12:11.619
of model testing by optimizing the

12:11.619 --> 12:13.730
testing data set , assess the

12:13.730 --> 12:15.508
coverage of the data set across

12:15.508 --> 12:17.675
structure classes , and identify areas

12:17.675 --> 12:19.730
where further data collection may be

12:19.730 --> 12:22.063
necessary based on the current coverage .

12:23.010 --> 12:25.239
The first few tests in his dataset

12:25.239 --> 12:27.950
analysis test plan will be provided by

12:27.950 --> 12:29.950
the Data Analysis Metrics library ,

12:30.039 --> 12:32.559
also known as DAML . DAML is a JATIC

12:32.559 --> 12:34.781
product that provides metrics and tools

12:34.781 --> 12:36.948
to help characterize data sets using 4

12:36.948 --> 12:40.239
types of analyses : linting , bias ,

12:41.349 --> 12:44.549
feasibility , and dataset shift .

12:45.390 --> 12:47.260
On this page there are two main

12:47.260 --> 12:48.871
sections . The first for the

12:48.871 --> 12:51.038
development dataset and the second for

12:51.038 --> 12:53.260
the operational dataset , since many of

12:53.260 --> 12:54.760
the DAML tests can compute

12:54.760 --> 12:56.816
relationships between these datasets

12:56.816 --> 12:59.149
such as distance or drift . Since

12:59.479 --> 13:01.701
Isaac is performing dataset analysis to

13:01.701 --> 13:03.868
validate only his held out operational

13:03.868 --> 13:06.280
dataset , he will be running each of

13:06.280 --> 13:08.289
these tests only on the operational

13:08.520 --> 13:11.590
data set . The first test he will

13:11.590 --> 13:13.989
configure is to identify duplicates .

13:14.150 --> 13:17.830
Duplicates are undesirable

13:17.830 --> 13:19.886
in the data set , and this test will

13:19.886 --> 13:22.052
identify them for later removal and to

13:22.052 --> 13:24.219
enable that test , he just selects the

13:24.219 --> 13:25.989
duplicate box in the test configuration .
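
As a rough illustration of what a duplicate test like this can do under the hood, the sketch below finds exact and near duplicates with a simple average-hash comparison. This is a generic stand-in, not DAML's actual implementation, and the directory path is hypothetical.

```python
# Minimal exact/near-duplicate screen via average hashing (illustrative only).
from pathlib import Path
import numpy as np
from PIL import Image

def average_hash(path, size=8):
    """Downscale to grayscale size x size, threshold against the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    px = np.asarray(img, dtype=np.float32)
    return (px > px.mean()).flatten()

paths = sorted(Path("operational_dataset").glob("*.png"))  # hypothetical path
hashes = [average_hash(p) for p in paths]
for i in range(len(paths)):
    for j in range(i + 1, len(paths)):
        dist = int(np.count_nonzero(hashes[i] != hashes[j]))  # Hamming distance
        if dist == 0:
            print("exact duplicate:", paths[i].name, paths[j].name)
        elif dist <= 5:  # small distance -> near-duplicate candidate
            print("near duplicate:", paths[i].name, paths[j].name)
```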

13:26.919 --> 13:29.820
The next test is to

13:29.820 --> 13:32.042
find outliers within the data set in

13:32.042 --> 13:34.059
order to help identify data points

13:34.059 --> 13:36.059
which are not representative of the

13:36.059 --> 13:38.115
underlying data distribution or data

13:38.115 --> 13:39.948
regions which have not been well

13:39.948 --> 13:42.170
represented , either to identify data

13:42.170 --> 13:44.448
to prune or do further data collection .

13:44.460 --> 13:46.682
To enable this test , Isaac will select

13:46.682 --> 13:49.016
the outlier option in the configuration .
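
A generic version of such an outlier test might score per-image feature vectors with an isolation forest, as in this sketch. The embeddings here are random placeholders and the technique is a stand-in, not DAML's actual method.

```python
# Illustrative outlier screen over per-image feature vectors.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
features = rng.normal(size=(8000, 64))  # placeholder embeddings, one row per image

detector = IsolationForest(contamination=0.03, random_state=0)  # ~3% flagged
flags = detector.fit_predict(features)  # -1 = outlier, 1 = inlier
outliers = np.flatnonzero(flags == -1)
print(f"{outliers.size} candidate outliers for data science review")
```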

13:50.840 --> 13:53.119
Next , Isaac has a test to measure the

13:53.119 --> 13:55.286
co-occurrence of metadata factors with

13:55.286 --> 13:57.270
class labels in order to identify

13:57.270 --> 13:59.492
metadata factors from which a model may

13:59.492 --> 14:01.326
learn shortcuts , ensure that no

14:01.326 --> 14:02.826
metadata factors correlate

14:02.826 --> 14:04.992
disproportionately with the particular

14:04.992 --> 14:07.214
classes , and improve the understanding

14:07.214 --> 14:09.437
of the operational environment and data

14:09.437 --> 14:11.603
set distributions . This is useful for

14:11.603 --> 14:13.659
determining whether any correlations

14:13.659 --> 14:15.992
are disproportionate and

14:15.992 --> 14:18.159
may allow the model to learn shortcuts

14:18.159 --> 14:20.326
or to decide if future data collection

14:20.326 --> 14:22.437
events are required in regions of low

14:22.437 --> 14:24.659
coverage . And to enable this test , we

14:24.659 --> 14:26.492
select the balance option in the

14:26.492 --> 14:28.429
configuration app .
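
The co-occurrence idea can be sketched with off-the-shelf mutual information between each discrete metadata factor and the class labels, as below. The factors, classes, and data are invented for illustration, and this is not DAML's balance implementation.

```python
# Illustrative balance check: MI between metadata factors and class labels.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
classes = rng.choice(["stadium", "hospital", "place_of_worship"], size=8000)
metadata = {  # hypothetical discrete metadata factors
    "region": rng.choice(["us", "eu", "apac"], size=8000),
    "time_of_day": rng.choice(["day", "night"], size=8000),
}

for factor, values in metadata.items():
    mi = mutual_info_score(values, classes)  # high MI -> possible shortcut
    print(f"MI({factor}; class) = {mi:.4f}")
```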

14:28.429 --> 14:31.619
Our last test provided by DAML is to identify

14:31.619 --> 14:33.508
classes which may have been under

14:33.508 --> 14:35.675
sampled . Proper coverage ensures that

14:35.675 --> 14:37.897
models have enough information to learn

14:37.897 --> 14:39.897
to detect and classify each object

14:39.897 --> 14:41.952
class . Again , this informs whether

14:41.952 --> 14:43.841
future data collection events are

14:43.841 --> 14:45.786
needed , and to add this , we just

14:45.786 --> 14:47.786
select the coverage option .
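
In its simplest form, a coverage test reduces to counting labeled instances per class against an adequacy threshold, as in this sketch; the counts and threshold here are invented.

```python
# Illustrative per-class coverage check.
from collections import Counter

labels = ["stadium"] * 900 + ["hospital"] * 450 + ["place_of_worship"] * 60
MIN_SAMPLES = 100  # hypothetical adequacy threshold

for cls, n in sorted(Counter(labels).items(), key=lambda kv: kv[1]):
    status = "UNDER-SAMPLED" if n < MIN_SAMPLES else "ok"
    print(f"{cls:18s} {n:5d}  {status}")
```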


14:56.450 --> 15:00.219
Hot mic .

15:00.219 --> 15:04.210
Can somebody mute ? OK .

15:04.619 --> 15:06.786
Uh , we will export that configuration

15:06.786 --> 15:08.786
to save it , and

15:08.786 --> 15:11.280
move on to the next stage . The next

15:11.280 --> 15:13.280
stage in the configuration pipeline

15:13.280 --> 15:15.391
enables the use of Survivor , a JATIC

15:15.391 --> 15:17.280
product which identifies specific

15:17.280 --> 15:19.391
subsets within a larger data set that

15:19.391 --> 15:21.113
have the largest impact on the

15:21.113 --> 15:23.336
performance of a model . Survivor works

15:23.336 --> 15:25.669
by generating predictions for each data

15:25.669 --> 15:27.789
point with a set of models of similar

15:27.789 --> 15:30.229
quality . Data points which all models

15:30.229 --> 15:32.451
predict correctly are considered easy ,

15:32.451 --> 15:34.396
while data points which all models

15:34.396 --> 15:36.729
predict incorrectly are considered hard .

15:36.729 --> 15:38.673
Data points where there is a large

15:38.673 --> 15:40.507
proportion of both incorrect and

15:40.507 --> 15:42.618
correct predictions across all models

15:42.618 --> 15:44.618
are considered on the bubble . Data

15:44.618 --> 15:46.507
which is on the bubble is a major

15:46.507 --> 15:48.396
driver of the differences between

15:48.396 --> 15:48.109
different models' performance and

15:48.109 --> 15:50.387
provides the most value during testing .
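
The easy / hard / on-the-bubble split just described can be expressed over a models-by-datapoints correctness matrix, as in this sketch. The matrix is random placeholder data and Survivor's real logic may differ.

```python
# Illustrative easy/hard/on-the-bubble categorization.
import numpy as np

rng = np.random.default_rng(0)
correct = rng.random((5, 1000)) > 0.4  # 5 models x 1000 points (placeholder)

frac = correct.mean(axis=0)        # fraction of models correct per point
easy = frac == 1.0                 # all models correct -> low value
hard = frac == 0.0                 # all models incorrect -> low value
bubble = ~easy & ~hard             # mixed -> drives model differences

print(f"easy={easy.sum()} hard={hard.sum()} on-the-bubble={bubble.sum()}")
```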

15:50.780 --> 15:52.836
The remaining easy and hard data are

15:52.836 --> 15:55.058
considered low value and are candidates

15:55.058 --> 15:57.169
for pruning or identifying regions of

15:57.169 --> 15:58.836
additional sampling . In this

15:58.836 --> 16:01.002
implementation , Survivor will use all

16:01.002 --> 16:02.780
available models to perform its

16:02.780 --> 16:05.640
analysis . So Isaac will use Survivor

16:05.640 --> 16:07.849
to identify data instances which are

16:07.849 --> 16:10.016
low value and prune them from the data

16:10.016 --> 16:12.369
set or decide whether further data

16:12.369 --> 16:14.369
collection is needed in the regions

16:14.369 --> 16:16.313
where models have universally poor

16:16.313 --> 16:18.369
performance . We'll use the default

16:18.369 --> 16:20.258
settings here and add that to our

16:20.258 --> 16:23.770
configuration . The last stage in our

16:23.770 --> 16:26.140
data set analysis pipeline is the

16:26.140 --> 16:28.307
configuration of the Real label tool .

16:28.307 --> 16:30.196
One of the challenges with object

16:30.196 --> 16:32.084
detection data is the accuracy of

16:32.084 --> 16:34.307
labels that are generated when the data

16:34.307 --> 16:36.251
set is collected . Real label is a

16:36.251 --> 16:38.473
quality assurance tool developed within

16:38.473 --> 16:40.529
the JDI program designed to identify

16:40.529 --> 16:42.751
potentially missing or erroneous ground

16:42.751 --> 16:44.862
truth labels in object detection data

16:44.862 --> 16:46.584
sets . Real label does this by

16:46.584 --> 16:48.751
aggregating the inferences of multiple

16:48.751 --> 16:50.940
models and identifying instances of

16:50.940 --> 16:52.829
strong disagreement between those

16:52.829 --> 16:54.996
models and the provided ground truth .

16:54.996 --> 16:57.539
In this implementation , Real

16:57.539 --> 16:59.595
label is configured to run using the

16:59.595 --> 17:01.650
results of all models that have been

17:01.650 --> 17:03.872
provided . Isaac will use Real label to

17:03.872 --> 17:05.761
do his last data analysis test to

17:05.761 --> 17:07.928
identify ground truth labels which may

17:07.928 --> 17:09.872
be erroneous to help improve label

17:09.872 --> 17:11.983
quality and the accuracy of testing .

17:11.983 --> 17:14.206
We will use the default intersection over

17:14.206 --> 17:16.372
union threshold value and confidence ,

17:16.372 --> 17:18.261
and we will run with ground truth

17:18.261 --> 17:20.483
indicating we are looking for errors in

17:20.483 --> 17:22.539
the provided ground truth data set .
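
The intersection-over-union score that threshold applies to can be computed as in this minimal sketch for axis-aligned boxes; this is the standard formula, not Real Label's internals.

```python
# Standard IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# A ground-truth box that no aggregated model prediction overlaps above
# the IoU threshold would be flagged as a potential label error.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```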

17:22.539 --> 17:24.706
We'll export that configuration and

17:24.706 --> 17:26.817
move on to the last page which allows

17:26.817 --> 17:30.020
Isaac to download his configured

17:30.020 --> 17:33.260
pipeline . We can load that and as you

17:33.260 --> 17:35.459
can see here , this exported JSON

17:35.459 --> 17:37.626
defines the entire testing pipeline in

17:37.626 --> 17:39.570
a format which can be ingested and

17:39.570 --> 17:41.699
easily executed . It is also a format

17:41.699 --> 17:43.810
which allows Isaac to verify the

17:43.810 --> 17:45.810
test you've configured , and he can

17:45.810 --> 17:47.977
share , he can save or share this file

17:47.977 --> 17:50.199
to enable repeatable experiments across

17:50.199 --> 17:51.199
the program .
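
The demo doesn't show the schema of the exported file, but a pipeline configuration of this kind might look roughly like the sketch below; every key, tool name, and value is hypothetical.

```python
# Hypothetical shape of the exported dataset-analysis pipeline config.
import json

pipeline = {
    "pipeline": "dataset_analysis",
    "stages": [
        {"tool": "daml", "tests": ["duplicates", "outliers", "balance", "coverage"]},
        {"tool": "survivor", "models": "all", "settings": "default"},
        {"tool": "real_label", "iou_threshold": 0.5, "confidence": 0.5,
         "reference": "ground_truth"},
    ],
}

with open("dataset_analysis_pipeline.json", "w") as f:
    json.dump(pipeline, f, indent=2)  # save or share for repeatable runs
```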

17:56.420 --> 17:58.587
Going back to our homepage , Isaac may

17:58.587 --> 18:01.500
now load the dataset analysis dashboard ,

18:03.550 --> 18:05.494
which manages the tests that he is

18:05.494 --> 18:08.459
going to run . The dashboard has been

18:08.459 --> 18:10.515
configured to automatically populate

18:10.515 --> 18:12.403
with the relevant development and

18:12.403 --> 18:15.270
operational data sets . To deploy his

18:15.270 --> 18:17.103
customized testing pipeline , he

18:17.103 --> 18:19.326
uploads the JSON file configured in the

18:19.326 --> 18:22.650
previous step , which will

18:22.650 --> 18:24.872
automatically load the appropriate test

18:24.872 --> 18:27.689
stages in code . To run his tests ,

18:27.829 --> 18:29.718
Isaac will choose the appropriate

18:29.718 --> 18:32.051
dataset for development and operational ,

18:32.051 --> 18:33.996
and he will click the run analysis

18:33.996 --> 18:36.051
button , which triggers a run of his

18:36.051 --> 18:38.273
previously configured data set analysis

18:38.273 --> 18:40.329
tools . Please note that we have

18:40.329 --> 18:42.385
pre-run these tests and

18:42.385 --> 18:44.496
cached the results , as many of these

18:44.496 --> 18:46.440
tests can be quite computationally

18:46.440 --> 18:48.551
expensive and may take a long time to

18:48.551 --> 18:50.718
run . When the run is complete , he is

18:50.718 --> 18:52.829
presented with top level metrics that

18:52.829 --> 18:55.051
provide a summary of what each analysis

18:55.051 --> 18:54.560
provides , as well as a link to the

18:54.560 --> 18:56.920
Gradient report detailing the findings .

18:57.349 --> 18:59.460
Gradient is a JATIC product which can

18:59.460 --> 19:01.516
ingest the outputs from our analysis

19:01.516 --> 19:03.182
and programmatically generate

19:03.182 --> 19:05.405
PowerPoint test reports , model cards ,

19:05.405 --> 19:07.738
and data cards in a standardized format .
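
Gradient's own API isn't shown in the demo; as a generic illustration of programmatic PowerPoint generation with one slide per test, a sketch using the open-source python-pptx package might look like this. The test names and summaries are placeholders.

```python
# Illustrative one-slide-per-test report; NOT Gradient's actual API.
from pptx import Presentation
from pptx.util import Inches

results = {  # placeholder summaries per test
    "Duplicates": "0 exact, 22 near duplicates flagged",
    "Outliers": "217 of 8000 images flagged",
}

prs = Presentation()
for test_name, summary in results.items():
    slide = prs.slides.add_slide(prs.slide_layouts[5])  # title-only layout
    slide.shapes.title.text = f"Test: {test_name}"
    body = slide.shapes.add_textbox(Inches(1), Inches(2.5), Inches(8), Inches(2))
    body.text_frame.text = summary

prs.save("dataset_analysis_report.pptx")
```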

19:07.738 --> 19:09.738
Isaac will download the most recent

19:09.738 --> 19:11.738
report and then load the PowerPoint

19:11.738 --> 19:14.339
view to do an in-depth review of the

19:14.339 --> 19:16.339
findings produced by his analysis .

19:17.089 --> 19:19.170
The Gradient report has

19:19.170 --> 19:21.392
been configured such that there is one slide

19:21.392 --> 19:23.503
per test in his test plan , so we can

19:23.503 --> 19:25.989
review the gradient report and our test

19:25.989 --> 19:27.930
plan side by side ,

19:31.439 --> 19:33.910
in order to compare what the analysis

19:33.910 --> 19:36.069
provides and what information we are

19:36.069 --> 19:38.125
expecting and what decisions we will

19:38.125 --> 19:40.260
make with it . The first test

19:40.260 --> 19:43.199
identifies duplicates . The duplicate

19:43.199 --> 19:45.199
detection function within DAML has

19:45.199 --> 19:48.199
identified no exact duplicates and 22

19:48.199 --> 19:50.088
near duplicates in a set of 8000

19:50.088 --> 19:52.421
images which are displayed to the right .

19:52.421 --> 19:54.199
Isaac will report the near

19:54.199 --> 19:56.421
duplicates to the data science team for

19:56.421 --> 19:58.643
their review , at which point they will

19:58.643 --> 20:00.755
decide to remove one of them from the

20:00.755 --> 20:03.219
data set . The next test , the outlier

20:03.219 --> 20:06.479
detection function , has identified 217

20:06.479 --> 20:09.400
outliers out of 8000 images . In these

20:09.400 --> 20:11.456
sample images , it is seen that most

20:11.456 --> 20:13.622
of these examples show that there are

20:13.622 --> 20:15.844
no structures of interest , and this is

20:15.844 --> 20:17.956
the case for most outliers . Based on

20:17.956 --> 20:20.011
these findings , Isaac sends off the

20:20.011 --> 20:22.011
subsets of outlier data to the data

20:22.011 --> 20:24.233
science team for their review . Most of

20:24.233 --> 20:26.344
these images will be removed from the

20:26.344 --> 20:26.160
operational testing data set .

20:30.880 --> 20:32.869
The next test , balance , measured the

20:32.869 --> 20:34.813
co-occurrence between the metadata

20:34.813 --> 20:36.591
factors and classes . The graph

20:36.591 --> 20:38.813
displays the mutual information between

20:38.813 --> 20:40.702
each of these pairs with metadata

20:40.702 --> 20:42.989
factors on the x axis and classes on

20:42.989 --> 20:45.589
the y axis . The analysis shows that

20:45.589 --> 20:47.811
there was no strong correlation between

20:47.811 --> 20:49.811
any of the metadata factors and the

20:49.811 --> 20:51.867
classes as indicated by the low heat

20:51.867 --> 20:54.089
map values . Thus , it is unlikely that

20:54.089 --> 20:55.922
models could have

20:55.922 --> 20:58.145
learned any detection or classification

20:58.145 --> 20:59.922
shortcuts due to the metadata .

21:03.339 --> 21:05.459
The next test identified the level of

21:05.459 --> 21:07.515
coverage that the dataset had across

21:07.515 --> 21:10.250
different classes . The total uncovered

21:10.250 --> 21:12.250
refers to the number of data points

21:12.250 --> 21:14.361
within the data set which do not have

21:14.361 --> 21:16.449
adequate coverage . The image labels

21:16.449 --> 21:18.560
indicated the classes for which there

21:18.560 --> 21:20.782
are the most uncovered labels . In this

21:20.782 --> 21:23.400
case , the place of worship class has the

21:23.400 --> 21:25.511
poorest coverage among the structure

21:25.511 --> 21:27.567
classes . Isaac shares these results

21:27.567 --> 21:29.289
with the data science team and

21:29.289 --> 21:31.400
indicates that further collection may

21:31.400 --> 21:33.567
be necessary in poorly covered classes

21:33.567 --> 21:33.400
from the dataset , such as place of

21:33.400 --> 21:34.400
worship .

21:38.439 --> 21:41.189
Survivor helped identify which data was

21:41.189 --> 21:42.911
driving differences in models'

21:42.911 --> 21:44.967
performance and which data was not .

21:44.967 --> 21:47.829
As a reminder ,

21:47.829 --> 21:50.150
easy data was data that models

21:50.150 --> 21:52.150
performed uniformly well on , while

21:52.150 --> 21:54.469
hard data was data that models

21:54.469 --> 21:56.469
performed uniformly poorly on . The

21:56.469 --> 21:58.580
remaining on-the-bubble data

21:58.580 --> 22:00.802
made up about 31% of the data , and

22:00.802 --> 22:02.525
this drove the majority of the

22:02.525 --> 22:04.580
differences in performance during

22:04.580 --> 22:07.109
the model evaluation . Isaac sends off

22:07.109 --> 22:09.469
the data subsets of the easy and hard

22:09.469 --> 22:11.691
data to the data science team for their

22:11.691 --> 22:13.747
review . Most of the easy images and

22:13.747 --> 22:15.525
many of the hard images will be

22:15.525 --> 22:17.747
eliminated from the operational testing

22:17.747 --> 22:19.747
set entirely . Analysis of the hard

22:19.747 --> 22:21.969
images may identify areas where further

22:21.969 --> 22:24.136
data collection is needed for training

22:24.136 --> 22:23.910
and testing .

22:28.959 --> 22:31.181
The final data set analysis test , Real

22:31.181 --> 22:33.070
label identified potential errors

22:33.070 --> 22:34.681
within the labels . For this

22:34.681 --> 22:36.848
demonstration , we only ran Real label

22:36.848 --> 22:39.126
on the images labeled Airport Terminal .

22:39.126 --> 22:40.903
Among those images , Real label

22:40.903 --> 22:43.126
identified 37 true positives , 11 false

22:43.126 --> 22:45.126
positives , and zero false negative

22:45.126 --> 22:46.959
labels . Isaac will send off the

22:46.959 --> 22:49.126
potential false positive labels to the

22:49.126 --> 22:51.348
labeling team for a second review where

22:51.348 --> 22:51.310
the detected errors will be corrected

22:51.310 --> 22:53.189
in the next iteration of the

22:53.189 --> 22:54.467
operational data set .

23:00.430 --> 23:02.900
That concludes Isaac's first run of the

23:02.900 --> 23:05.219
data set analysis pipeline . Using the

23:05.219 --> 23:07.108
dashboard , Isaac can repeat this

23:07.108 --> 23:09.380
process as many times as he needs

23:09.380 --> 23:11.859
using different data sets . Each data

23:11.859 --> 23:13.692
set will be run through the same

23:13.692 --> 23:15.803
pipeline , which will run each of the

23:15.803 --> 23:18.026
selected tests and generate a report in

23:18.026 --> 23:20.137
the same standardized format for easy

23:20.137 --> 23:22.248
comparison . After this round of data

23:22.248 --> 23:24.248
set analysis , Isaac has identified

23:24.248 --> 23:26.359
duplicates , outliers , and low value

23:26.359 --> 23:28.359
data points for potential pruning .

23:28.359 --> 23:30.248
Identified some classes with poor

23:30.248 --> 23:32.359
coverage , validated that none of the

23:32.359 --> 23:34.415
metadata factors correlated strongly

23:34.415 --> 23:36.470
with the class label , and validated

23:36.470 --> 23:38.637
the accuracy of the

23:38.637 --> 23:40.692
ground truth labels . He will send a

23:40.692 --> 23:42.915
summary of this information back to his

23:42.915 --> 23:45.130
program and recommend that more data

23:45.130 --> 23:46.574
within

23:46.574 --> 23:48.852
underrepresented regions is collected .

23:53.030 --> 23:55.252
After reporting his recommendations for

23:55.252 --> 23:57.474
dataset updates and potentially new

23:57.474 --> 23:59.697
data set collection , the Birdseye team

23:59.697 --> 24:01.752
is able to validate his findings and

24:01.752 --> 24:03.974
update the development and testing data

24:03.974 --> 24:06.197
set to a version 2.0 , which has pruned

24:06.197 --> 24:08.030
duplicates as well as subsets of

24:08.030 --> 24:10.252
outliers and low value data , has added

24:10.252 --> 24:12.420
more instances for classes with poor

24:12.420 --> 24:14.476
coverage and corrected several label

24:14.476 --> 24:16.364
errors . Isaac will be using this

24:16.364 --> 24:18.420
version 2.0 for his model evaluation

24:18.420 --> 24:21.619
testing . To begin his model evaluation

24:21.619 --> 24:24.060
testing , he returns to his test plan .

24:26.130 --> 24:28.297
Model testing measures the performance

24:28.297 --> 24:30.352
of models against a held out testing

24:30.352 --> 24:32.519
data set . The objectives of the model

24:32.519 --> 24:34.630
testing phase are to identify whether

24:34.630 --> 24:36.852
models meet the established performance

24:36.852 --> 24:38.963
requirements , quantify the impact of

24:38.963 --> 24:40.741
various in scope risks on model

24:40.741 --> 24:42.574
performance , characterize model

24:42.574 --> 24:44.408
performance across the different

24:44.408 --> 24:46.241
structures and conditions in the

24:46.241 --> 24:48.408
operational environment , and identify

24:48.408 --> 24:50.241
gaps in model performance on the

24:50.241 --> 24:52.408
development and operational data set .

24:52.408 --> 24:54.297
As before , Isaac will be

24:54.297 --> 24:56.130
configuring his model evaluation

24:56.130 --> 24:58.297
testing using Nabari and his specially

24:58.297 --> 25:00.469
crafted object detection model

25:00.469 --> 25:03.280
evaluation configuration app . Similar

25:03.280 --> 25:05.391
to the dataset analysis configuration

25:05.391 --> 25:07.169
app , this will enable Isaac to

25:07.169 --> 25:09.391
configure a model evaluation pipeline .

25:10.280 --> 25:12.359
This pipeline will evaluate baseline

25:12.359 --> 25:14.303
model performance as well as model

25:14.303 --> 25:16.248
performance around the major risks

25:16.248 --> 25:18.359
identified within the risk assessment :

25:18.359 --> 25:20.415
camera jitter and camera blur . Each

25:20.415 --> 25:22.026
test will be run on both the

25:22.026 --> 25:24.248
operational and development data set in

25:24.248 --> 25:26.248
order to understand potential model

25:26.248 --> 25:28.415
overfitting and performance gaps . Our

25:28.415 --> 25:30.359
first test evaluating the baseline

25:30.359 --> 25:32.829
model performance seeks to identify the

25:32.829 --> 25:34.900
overall best performing models ,

25:35.390 --> 25:37.790
overall model performance metrics , the

25:37.790 --> 25:39.623
model performance metrics across

25:39.623 --> 25:41.846
classes of interest , and the frequency

25:41.846 --> 25:44.150
of different model failure modes such

25:44.150 --> 25:46.483
as misclassification , missed detection ,

25:46.483 --> 25:48.829
and hallucination . This will be run on

25:48.829 --> 25:51.051
all models that are provided by vendors

25:51.051 --> 25:53.162
as well as internal baseline models .

25:53.162 --> 25:55.750
It will be run using both operational

25:55.750 --> 25:58.010
and development data sets . And it is

25:58.010 --> 25:59.788
looking to identify models that

25:59.788 --> 26:01.621
maintain an overall mean average

26:01.621 --> 26:04.050
precision of greater than 0.5 and a

26:04.050 --> 26:06.217
minimum class mean average precision of

26:06.217 --> 26:09.209
greater than 0.35 . This information

26:09.209 --> 26:11.376
will be used to determine which models

26:11.376 --> 26:13.098
meet mission requirements and

26:13.098 --> 26:15.209
determine whether any additional data

26:15.209 --> 26:16.987
collection or model training is

26:16.987 --> 26:18.709
necessary to improve the model

26:18.709 --> 26:20.598
performance . This test is run by

26:20.598 --> 26:22.376
default in the model evaluation

26:22.376 --> 26:24.209
pipeline and does not need to be

26:24.209 --> 26:28.189
specially configured .
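
The pass/fail rule just stated (overall mAP above 0.5, every class above 0.35) reduces to a check like this sketch; the per-class AP values are made up.

```python
# Illustrative check of the stated thresholds; AP values are made up.
per_class_ap = {"stadium": 0.81, "hospital": 0.42, "military_facility": 0.31}

overall_map = sum(per_class_ap.values()) / len(per_class_ap)  # mean of class APs
failing = [c for c, ap in per_class_ap.items() if ap <= 0.35]

print(f"overall mAP = {overall_map:.3f}  pass = {overall_map > 0.5}")
print("classes below the 0.35 per-class threshold:", failing or "none")
```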

26:28.189 --> 26:30.300
The next stage in our pipeline configuration app

26:30.300 --> 26:32.300
provides the capability to evaluate

26:32.300 --> 26:34.522
model performance under various degrees

26:34.522 --> 26:36.300
of operationally realistic data

26:36.300 --> 26:38.522
degradation . One particular risk noted

26:38.522 --> 26:40.467
within the risk assessment was the

26:40.467 --> 26:42.411
occurrence of camera jitter due to

26:42.411 --> 26:44.300
turbulence affecting the aircraft

26:44.300 --> 26:46.245
carrying the sensor . This test is

26:46.245 --> 26:48.411
performed using the natural robustness

26:48.411 --> 26:50.356
toolkit , which is a JATIC product

26:50.356 --> 26:52.245
designed to provide operationally

26:52.245 --> 26:54.300
realistic perturbations for testing ,

26:54.300 --> 26:56.522
including the recently developed Camera

26:56.522 --> 26:56.349
jitter optical transfer function .

26:57.130 --> 26:59.241
Isaac will use this tool to configure

26:59.241 --> 27:01.439
his test 6.2 to evaluate the model

27:01.439 --> 27:03.495
performance against camera jitter in

27:03.495 --> 27:05.717
order to quantify the impact of various

27:05.717 --> 27:08.050
levels of jitter on mission performance .

27:08.050 --> 27:09.828
Again , this will be run on all

27:09.828 --> 27:12.050
provided models . It will be run on all

27:12.050 --> 27:14.280
relevant data sets , and we'll be

27:14.280 --> 27:16.280
looking for models that maintain an

27:16.280 --> 27:18.169
overall mean average precision of

27:18.169 --> 27:20.224
greater than 0.5 and a minimum class

27:20.224 --> 27:22.391
mean average precision of greater than

27:22.391 --> 27:24.849
0.35 . This information is used to

27:24.849 --> 27:26.793
document the impact of the various

27:26.793 --> 27:28.627
levels of jitter within the risk

27:28.627 --> 27:30.770
assessment and determine whether the

27:30.770 --> 27:32.810
level of performance degradation is

27:32.810 --> 27:36.670
acceptable . In this stage , Isaac will

27:36.670 --> 27:38.503
be configuring a sweep of jitter

27:38.503 --> 27:40.614
intensity where the model performance

27:40.614 --> 27:42.670
will be calculated for each level of

27:42.670 --> 27:45.089
intensity . We will start with a value

27:45.089 --> 27:48.380
of 0.0001 .

27:49.050 --> 27:50.469
And stop at a value of

27:50.469 --> 27:53.750
0.0005 . He will do this

27:53.750 --> 27:57.229
over eight steps for a good amount of

27:57.229 --> 27:59.619
discretization , and he can preview

28:00.229 --> 28:02.270
what the effect of the worst case

28:02.270 --> 28:04.270
perturbation would be in the window

28:04.270 --> 28:06.550
here and ensure that this level of

28:06.550 --> 28:09.270
jitter is realistic within the

28:09.270 --> 28:11.739
operational limits of the mission . He

28:11.739 --> 28:14.099
can add this test stage to his list of

28:14.099 --> 28:16.500
tests , and he can repeat this process

28:16.500 --> 28:18.722
for jitter in the Y direction using the

28:18.722 --> 28:21.170
same starting and ending parameters ,

28:21.380 --> 28:24.609
and he can again test the settings

28:24.609 --> 28:27.219
and ensure that the level of jitter

28:27.219 --> 28:29.380
observed in the worst case is within

28:29.380 --> 28:31.713
the operational limits that are expected .

28:31.910 --> 28:33.650
He can add that to his test stages .
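
The configured sweep (0.0001 to 0.0005 over eight steps) amounts to a loop like the sketch below; the perturber and scorer calls are commented-out hypothetical stand-ins, not NRTK's actual interface.

```python
# Illustrative jitter-intensity sweep: 8 evenly spaced levels.
import numpy as np

for sigma in np.linspace(1e-4, 5e-4, num=8):
    # jittered = jitter_otf(test_images, sigma, axis="x")  # hypothetical NRTK call
    # m = evaluate_map(model, jittered, labels)            # hypothetical scorer
    # print(f"sigma={sigma:.5f}  mAP={m:.3f}  pass={m > 0.5}")
    print(f"evaluating at jitter sigma = {sigma:.5f}")
```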

28:39.410 --> 28:41.577
A second operational degradation noted

28:41.577 --> 28:43.521
within the risk assessment was the

28:43.521 --> 28:45.577
occurrence of camera blur . His last

28:45.577 --> 28:47.243
test is to evaluate the model

28:47.243 --> 28:49.188
performance against camera blur in

28:49.188 --> 28:51.410
order to quantify the impact of various

28:51.410 --> 28:53.077
levels of blur on the mission

28:53.077 --> 28:55.132
performance . Again , this is run on

28:55.132 --> 28:57.354
all models and data sets and is looking

28:57.354 --> 28:59.354
for models that maintain an overall

28:59.354 --> 29:01.466
mean average precision of greater than

29:01.466 --> 29:03.521
0.5 and a minimum class mean average

29:03.521 --> 29:05.890
precision of greater than 0.35 . This

29:05.890 --> 29:07.834
information is used to document the

29:07.834 --> 29:10.057
impact of various levels of blur within

29:10.057 --> 29:11.946
the risk assessment and determine

29:11.946 --> 29:13.557
whether the

29:13.557 --> 29:15.779
performance degradation is acceptable .

29:15.779 --> 29:17.668
To include this test in his testing

29:17.668 --> 29:19.390
pipeline , he will add another

29:19.390 --> 29:21.168
perturbation of average blur to

29:21.168 --> 29:23.279
simulate the effect of camera blur in

29:23.279 --> 29:25.112
the image . He will sweep across

29:25.112 --> 29:27.112
different levels of blur using

29:27.112 --> 29:29.279
different kernel sizes starting from 1

29:29.279 --> 29:31.334
going to 9 , and again , he can test

29:31.334 --> 29:33.446
the perturber settings to ensure that

29:33.446 --> 29:36.920
the blur in this example image fits

29:36.920 --> 29:39.031
what is expected within the mission .
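
The average-blur sweep can be sketched with OpenCV's box filter over kernel sizes 1 through 9, as below; the image is a random placeholder and the scoring call is a hypothetical stand-in.

```python
# Illustrative average (box) blur sweep over kernel sizes 1..9.
import cv2
import numpy as np

image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # placeholder

for k in range(1, 10):  # kernel sizes 1 through 9
    blurred = cv2.blur(image, (k, k))
    # m = evaluate_map(model, blurred, labels)  # hypothetical scorer
    print(f"kernel={k}  mean pixel={blurred.mean():.1f}")
```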

29:39.640 --> 29:41.807
He can add this test stage and export

29:41.807 --> 29:43.918
the configuration which will add each

29:43.918 --> 29:45.862
of these perturbation tests to his

29:45.862 --> 29:49.739
pipeline . Finally , Isaac can

29:49.739 --> 29:51.906
save his configuration to a JSON file ,

29:52.140 --> 29:54.251
which is easily parsable by the model

29:54.251 --> 29:58.020
evaluation dashboard . In the

29:58.020 --> 30:00.131
configuration , note that there are 4

30:00.131 --> 30:02.076
testing stages shown , one for the

30:02.076 --> 30:04.670
evaluation on clean data , two for the

30:04.670 --> 30:07.349
evaluation on different levels of

30:07.349 --> 30:09.489
camera jitter in the X and Y

30:09.489 --> 30:11.989
directions , and one for evaluations

30:11.989 --> 30:14.322
across different levels of average blur .

30:14.430 --> 30:16.486
For simplicity , we have limited the

30:16.486 --> 30:18.541
number of degraded conditions we are

30:18.541 --> 30:20.652
testing within this phase . However ,

30:20.652 --> 30:22.374
we could easily add additional

30:22.374 --> 30:24.541
corruptions either independently or in

30:24.541 --> 30:26.763
combination with each other in order to

30:26.763 --> 30:26.550
evaluate performance across different

30:26.550 --> 30:28.883
types of degraded conditions identified .

30:32.670 --> 30:35.520
Returning to our Nabari home page , we

30:35.520 --> 30:37.464
can load into our model evaluation

30:37.464 --> 30:40.979
dashboard , which manages the tests

30:40.979 --> 30:43.689
that Isaac will run . The dashboard has

30:43.689 --> 30:45.522
been configured to automatically

30:45.522 --> 30:47.578
populate with new models as they are

30:47.578 --> 30:49.689
provided by external partners as well

30:49.689 --> 30:51.856
as the relevant testing data sets . To

30:51.856 --> 30:54.170
deploy his customized pipeline , he

30:54.170 --> 30:56.392
will upload the JSON file configured in

30:56.392 --> 30:59.199
the previous step , and the dashboard

30:59.199 --> 31:01.421
will automatically load the appropriate

31:01.421 --> 31:04.079
test stages in code . Now , to use this

31:04.079 --> 31:06.246
testing pipeline ,

31:06.246 --> 31:08.301
Isaac only needs to select the model

31:08.301 --> 31:10.357
that he wants to run his tests for .

31:10.357 --> 31:12.357
and click the run analysis button .

31:13.949 --> 31:16.300
Again , we have

31:16.300 --> 31:19.930
pre-computed this information . Oh

31:23.989 --> 31:26.211
We've precomputed the information and I

31:26.540 --> 31:27.929
pulled the wrong cache .

31:29.760 --> 31:32.979
But , uh ,

31:36.319 --> 31:38.541
because this takes a long time to run ,

31:39.510 --> 31:42.069
um , what will happen is that a new row

31:42.069 --> 31:44.180
will appear in the dashboard . Let me

31:44.180 --> 31:45.513
just start this again .

31:55.170 --> 31:58.339
Here we go . Right .

32:06.050 --> 32:07.839
Sorry . One moment .

32:40.390 --> 32:42.557
OK , well , the dashboard seems to not

32:42.557 --> 32:44.560
be working right now , but , uh ,

32:45.290 --> 32:47.512
normally ,

32:47.512 --> 32:49.790
as with the data set analysis ,

32:49.790 --> 32:51.846
we would see rows appear for each of

32:51.846 --> 32:55.250
our tests , and Isaac can

32:55.250 --> 32:57.520
quickly review based on these views ,

32:57.650 --> 33:00.089
he can quickly compare different models

33:00.089 --> 33:01.922
against each other , and he will be

33:01.922 --> 33:03.978
provided with a Gradient report . He

33:03.978 --> 33:05.978
can review in detail the

33:05.978 --> 33:08.033
results of the different evaluations

33:08.033 --> 33:10.256
performed by each stage of the pipeline

33:10.256 --> 33:10.255
and get more insight into the

33:10.255 --> 33:12.255
sensitivity of the model to various

33:12.255 --> 33:14.311
corruptions in data and get a better

33:14.311 --> 33:16.477
understanding of where the limitations

33:16.477 --> 33:19.680
of the model may lie . He may download

33:19.680 --> 33:23.520
these reports and

33:23.520 --> 33:26.890
review them next to his tests . As

33:26.890 --> 33:29.630
before , he can compare models

33:30.550 --> 33:34.010
using the identically formatted

33:34.329 --> 33:38.329
test reports . Uh , in

33:38.329 --> 33:40.218
this case , these are two

33:40.218 --> 33:42.479
reports that we generated that compare

33:42.750 --> 33:45.290
two models of different iterations

33:45.290 --> 33:48.489
trained on similar data sets . And

33:48.489 --> 33:50.600
we can step through each of the tests

33:50.600 --> 33:52.545
configured in the previous phase .

33:52.545 --> 33:54.656
First , Isaac will compare the models

33:54.656 --> 33:56.711
along the baseline performance . The

33:56.711 --> 33:58.878
orange bar within each graph shows the

33:58.878 --> 33:58.810
model's average performance , while the

33:58.810 --> 34:00.977
blue bars show the model's performance

34:00.977 --> 34:03.143
on a particular class . The horizontal

34:03.143 --> 34:05.839
red line represents the minimum

34:06.489 --> 34:08.489
acceptable threshold on a per class

34:08.489 --> 34:10.610
performance at 0.35 , while the black

34:10.610 --> 34:12.777
line represents the minimum acceptable

34:12.777 --> 34:14.888
threshold for the overall performance

34:14.888 --> 34:17.699
at 0.5 . As we can see , for both models ,

34:18.209 --> 34:20.153
the mean average precision overall

34:20.153 --> 34:22.810
exceeds the required threshold of 0.5 .

34:23.060 --> 34:25.060
However , we can see quite a bit of

34:25.060 --> 34:27.060
variation in performance across the

34:27.060 --> 34:28.949
classes within both models , with

34:28.949 --> 34:30.616
similar overall trends in the

34:30.616 --> 34:33.179
difficulty of various classes . Both

34:33.179 --> 34:35.235
models perform exceptionally well on

34:35.235 --> 34:36.957
stadiums and struggle with the

34:36.957 --> 34:39.123
hospitals and military facilities . In

34:39.123 --> 34:41.235
particular , both models fail to meet

34:41.235 --> 34:43.290
the minimum acceptable threshold for

34:43.290 --> 34:45.235
mean average precision for military

34:45.235 --> 34:47.290
facilities . Isaac will report these

34:47.290 --> 34:49.235
results to the model providers and

34:49.235 --> 34:51.457
recommend that more data

34:51.457 --> 34:53.623
for these classes be added to

34:53.623 --> 34:55.457
the training set . In our next test ,

34:55.457 --> 34:57.790
evaluating the effects of camera jitter ,

34:57.790 --> 34:59.846
he compares the model performance in

34:59.846 --> 35:02.068
cases of camera jitter using the NRTK's

35:02.068 --> 35:04.235
Jitter OTF perturber . As we can see ,

35:04.235 --> 35:06.401
the performance of both models rapidly

35:06.401 --> 35:08.320
drops off as jitter increases and

35:08.320 --> 35:10.320
quickly drops below the performance

35:10.320 --> 35:12.487
threshold as jitter increases beyond a

35:12.487 --> 35:15.429
value of about 0.0003 .

35:19.010 --> 35:21.177
This is for the X direction and in the

35:21.177 --> 35:23.288
Y direction we see a similar result .

35:23.288 --> 35:25.510
The models quickly become ineffective .

35:25.510 --> 35:27.959
These results help Isaac quantify the

35:27.959 --> 35:30.015
impact of different levels of camera

35:30.015 --> 35:32.126
jitter on the operational performance

35:32.126 --> 35:34.015
to add to the risk assessment and

35:34.015 --> 35:36.300
support a more accurate assessment of

35:36.300 --> 35:38.189
the severity of the degradation .

35:38.989 --> 35:40.629
Finally , Isaac compares the

35:40.629 --> 35:43.399
performance of both models . When they

35:43.399 --> 35:45.649
encounter camera blur , he finds that

35:45.649 --> 35:47.816
while both models' performance degrade

35:47.816 --> 35:50.050
under severe blur

35:50.050 --> 35:52.659
degradations , the version 4 of

35:52.659 --> 35:54.826
the model is significantly more robust

35:54.826 --> 35:56.937
since it is able to remain performant

35:56.937 --> 35:59.459
up to a kernel size of 5 rather than

35:59.459 --> 36:01.879
the V3 , which fails at a kernel size

36:01.879 --> 36:04.290
of about 3 . Overall , the model

36:04.290 --> 36:06.290
evaluation results find that the V4

36:06.290 --> 36:08.457
model is slightly more performant than

36:08.457 --> 36:10.530
the V3 . But both suffer from

36:10.530 --> 36:12.429
the significant issue of not meeting

36:12.429 --> 36:14.540
performance requirements on detecting

36:14.540 --> 36:16.485
military facilities , which likely

36:16.485 --> 36:18.651
means that they will not be able to be

36:18.651 --> 36:18.629
fielded as is .

36:27.040 --> 36:29.207
On the dashboard view , Isaac would be

36:29.207 --> 36:31.429
able to run multiple

36:31.429 --> 36:33.318
evaluations in succession without

36:33.318 --> 36:36.010
having to rewrite any code , and be

36:36.010 --> 36:37.621
able to ensure that the same

36:37.621 --> 36:39.732
evaluations are done for each model .

36:39.732 --> 36:41.899
After this round of model evaluation ,

36:41.899 --> 36:44.066
Isaac has identified which models meet

36:44.066 --> 36:45.788
the performance requirements ,

36:45.788 --> 36:47.566
quantified the impact of common

36:47.566 --> 36:49.788
degradations to model performance , and

36:49.788 --> 36:49.729
identified the structure classes which

36:49.729 --> 36:52.389
models perform poorly on . He's also

36:52.389 --> 36:54.459
quantified the model performance gap

36:54.459 --> 36:56.459
between operational and development

36:56.459 --> 36:58.590
data . Isaac will send a summary of

36:58.590 --> 37:00.646
this information back to his program

37:00.646 --> 37:02.590
and recommend that further data is

37:02.590 --> 37:04.646
collected for the structure classes

37:04.646 --> 37:06.646
that underperform , and that further development

37:06.646 --> 37:08.534
data is collected , which is more

37:08.534 --> 37:10.812
representative of the operational data .

37:14.889 --> 37:17.056
As demonstrated today , Isaac was able

37:17.056 --> 37:19.278
to quickly configure and deploy testing

37:19.278 --> 37:21.159
pipelines using JATIC tools . The

37:21.159 --> 37:23.326
pipelines that we demonstrated today ,

37:23.326 --> 37:25.603
as well as their configuration wizards ,

37:25.603 --> 37:27.603
are just a few examples of low code

37:27.603 --> 37:29.437
applications which can be easily

37:29.437 --> 37:31.326
developed . The T&E functionality

37:31.326 --> 37:33.437
within the JATIC tools can be integrated

37:33.437 --> 37:35.381
into various types of ML platforms

37:35.381 --> 37:37.548
tailored to the specific project . The

37:37.548 --> 37:39.770
analysis provided by these products can

37:39.770 --> 37:41.715
help provide valuable insights for

37:41.715 --> 37:43.937
making critical decisions such as where

37:43.937 --> 37:46.159
to invest resources , whether to deploy

37:46.159 --> 37:46.149
models , or how the models should be

37:46.149 --> 37:49.280
used . Thank you for attending the

37:49.280 --> 37:51.502
demonstration today . If you would like

37:51.502 --> 37:53.502
to learn more about the tools shown

37:53.502 --> 37:55.724
today or the JATIC program , you can visit

37:55.724 --> 37:57.891
our website or our GitLab seen here .

37:57.949 --> 37:59.838
On our websites , you can quickly

37:59.838 --> 38:02.005
install each of the tools that you saw

38:02.005 --> 38:04.116
demonstrated today . We now have some

38:04.116 --> 38:04.629
time for some questions .

