
improving precision #393

Merged: 7 commits into aws:main on Jul 11, 2023

Conversation

sudiptoguha (Contributor)

Issue #, if available: 390

Description of changes: The main change in this PR is to improve precision. ThresholdedRCF had the capacity to use different streaming normalizations/transformations -- these transformations are now standardized and smoothed. As a consequence, it is feasible to use transformation A to determine significance even though the goal is to use transformation B. This is an extension of the multi-mode capability and, by default, improves the precision of the results significantly.

In addition, the PR removes unused (and unlikely to be used) code that has been a remnant of versions 1.0 and 2.0. It also adds more tests for branch coverage, especially for RandomCutTree.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sudiptoguha sudiptoguha requested a review from jotok June 30, 2023 16:11
@kaituo (Collaborator) left a comment

Read up to Java/parkservices/src/main/java/com/amazon/randomcutforest/parkservices/ErrorHandler.java. Will continue.

@@ -105,8 +105,8 @@ public static <R> void assignAndRecompute(List<Weighted<Integer>> sampledPoints,
double minDist = Double.MAX_VALUE;
int minDistNbr = -1;
for (int i = 0; i < clusters.size(); i++) {
// will check for negative distances
Collaborator

where will you check for negative distances?

Contributor Author

The function clusters.get(i).distance(getPoint.apply(point.index), distance) checks for negative distances for GenericMultiCenter (MultiCenter). Will add a check to distance() in Center.

Collaborator

Can you update your comment to include where you add a check for negative distances?

Contributor Author

Yes. It's in Center.java: 0ed94fc
GenericMultiCenter.java already had the checks.
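For illustration, a minimal sketch of the kind of guard being discussed; the class shape, field, and exception choice here are assumptions rather than the actual Center.java change in 0ed94fc:

```java
import java.util.function.BiFunction;

// Illustrative sketch only; not the library's Center class.
class CenterSketch {
    private final float[] representative;

    CenterSketch(float[] representative) {
        this.representative = representative;
    }

    // Reject user-supplied distance functions that return negative values,
    // since nearest-center assignment assumes distances are non-negative.
    double distance(float[] point, BiFunction<float[], float[], Double> distanceFunction) {
        double dist = distanceFunction.apply(representative, point);
        if (dist < 0) {
            throw new IllegalArgumentException("distance function returned a negative value");
        }
        return dist;
    }
}
```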

@@ -272,7 +272,6 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {

if (Arrays.equals(point, oldPoint)) {
increaseLeafMass(leafNode);
- checkArgument(!nodeStore.freeNodeManager.isEmpty(), "incorrect/impossible state");
Collaborator

Why don't you need the check anymore?

Contributor Author

This was verifying that when a leaf node is duplicated there is a free internal node available. But if addPoint() succeeds then there must be a free internal node, so the check was not achieving anything. In subsequent PRs we should check the NodeStore / IndexInterval manager directly.

@@ -311,7 +310,6 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {
parentPath.push(new int[] { node, sibling });
}

- checkArgument(savedDim != Integer.MAX_VALUE, () -> " cut failed at index " + pointIndex);
Collaborator

Why don't you need the check anymore?

Contributor Author

The function randomCut() now has full branch coverage, and this check is superfluous.

@@ -154,7 +154,7 @@ void printResult(BufferedWriter file, ForecastDescriptor result, int current, in
float[] lowerError = result.getObservedErrorDistribution().lower;
DiVector rmse = result.getErrorRMSE();
float[] mean = result.getErrorMean();
- float[] calibration = result.getCalibration();
+ float[] calibration = result.getIntervalPrecision();
Collaborator

Should we rename the local variable to intervalPrecision too?

Contributor Author

yes. Changing.

Contributor Author

fixed in 0ed94fc

for (int i = 0; i < inputLength; i++) {
actuals[errorIndex][i] = (float) input[i];
if (sequenceIndex > 0) {
// sequenceIndex indicates the first empty place for input
Collaborator

You meant inputIndex indicates the first empty place for input, right?

Contributor Author

Yes, you're correct. Will fix comments with the refactor.

actuals[errorIndex][i] = (float) input[i];
if (sequenceIndex > 0) {
// sequenceIndex indicates the first empty place for input
// note the predictions have already been stored
Collaborator

Where do you store predictions? I thought you store predictions between line 192~197.

Contributor Author

You're correct -- I meant to say that the corresponding forecasts would have already been stored (in the previous steps). The current (most recent) forecasts cannot be measured for accuracy.

@@ -161,36 +165,40 @@ public ErrorHandler(int errorHorizon, int forecastHorizon, int sequenceIndex, do
public void update(ForecastDescriptor descriptor, Calibration calibrationMethod) {
int arrayLength = pastForecasts.length;
int length = pastForecasts[0].values.length;
- int errorIndex = sequenceIndex % (arrayLength);
+ int storedForecastIndex = sequenceIndex % (arrayLength);
Collaborator

storedForecastIndex is the index used to store forecasts, while inputIndex is the index used to store actuals, right?

As sequenceIndex increments, both values cycle through the range [0, arrayLength - 1], but inputIndex always trails storedForecastIndex by one slot (since you subtract 1 before applying the modulo), wrapping around at the array boundary.

Here is a simple example to illustrate the difference. Let's say arrayLength is 5 and sequenceIndex is 3:

storedForecastIndex = 3 % 5 = 3
inputIndex = (3 + 5 - 1) % 5 = 2

So the storedForecastIndex for this sequenceIndex would be 3, and the inputIndex would be 2.
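For illustration, a throwaway sketch of the index arithmetic walked through above (the class is purely illustrative, not ErrorHandler code):

```java
public class CircularIndexDemo {
    public static void main(String[] args) {
        int arrayLength = 5;
        for (int sequenceIndex = 1; sequenceIndex <= 7; sequenceIndex++) {
            // slot where the forecasts produced at this step are stored
            int storedForecastIndex = sequenceIndex % arrayLength;
            // slot where the actuals are stored: always one slot behind,
            // wrapping around at the array boundary
            int inputIndex = (sequenceIndex + arrayLength - 1) % arrayLength;
            System.out.printf("seq=%d  forecast slot=%d  actuals slot=%d%n",
                    sequenceIndex, storedForecastIndex, inputIndex);
        }
    }
}
```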

Contributor Author

Yes -- though this code may have to be refactored. In the next set of PRs/commits we would split the update() step into
(i) store actuals, (ii) calibrate, (iii) store the most recent forecasts. The most recent forecasts are not useful for calibration since we have not yet seen any corresponding actuals. The update to sequenceIndex has to occur before calibration, so that calibration becomes state dependent and can be invoked from anywhere.

checkArgument(inputLength <= errorDeviations.length, "deviations should be at least as long as input lengths");
for (int i = 0; i < forecastHorizon; i++) {
// this is the only place where the newer (possibly shorter) horizon matters
int len = (sequenceIndex > errorHorizon + i + 1) ? errorHorizon : sequenceIndex - i - 1;
Collaborator

Should we create a method for this line? It is used in two places in ErrorHandler.

Contributor Author

Done in ff99de3
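For reference, a minimal sketch of what such a helper could look like; the name and signature are assumptions, and the actual extraction is in ff99de3:

```java
// Sketch only: number of usable error samples for forecast step i,
// capped at the error horizon once enough observations have been seen.
static int usableErrorLength(int sequenceIndex, int errorHorizon, int i) {
    return (sequenceIndex > errorHorizon + i + 1) ? errorHorizon : sequenceIndex - i - 1;
}
```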

@@ -454,7 +488,8 @@ protected <P extends AnomalyDescriptor> boolean isSignificant(boolean significan
shiftAmount += multiplier * DEFAULT_NORMALIZATION_PRECISION
* (scaleFactor + (Math.abs(a) + Math.abs(b)) / 2);
}
- answer = significantScore && delta > 1e-6 || (delta > shiftAmount + DEFAULT_NORMALIZATION_PRECISION);
+ answer = (significantScore && delta > 1e-6 || (delta > shiftAmount + DEFAULT_NORMALIZATION_PRECISION))
+         && (delta > noiseFactor * result.getDeviations()[baseDimensions + y]);
Collaborator

delta is the scaled difference between actual and expected values, right? result.getDeviations()[baseDimensions + y] is the deviation of the diff between successive actual points, right? If both are yes, why do we compare them? They don't seem comparable.

Or are you comparing the deviation of new points from expected points with some measure of variability in the data? If the new points deviate significantly compared to the usual variability, it may be considered an anomaly. If yes, why do you use result.getDeviations()[baseDimensions + y] instead of other deviations?

Contributor Author

It's the latter. The variability is the mean deviation between successive points -- typical for random, opaque-box noise. The thinking is that there is a natural delta between subsequent events, so if the difference between actual and predicted is of similar magnitude, then we should probably not trigger an anomaly.
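For illustration, a simplified restatement of the combined condition in the diff above; variable names are simplified, and this is a sketch rather than the PR's exact code:

```java
// delta: scaled difference between actual and expected values.
// successiveDeviation: running deviation of the differences between
// successive actual points (the natural step-to-step noise).
static boolean isSignificantDeviation(boolean significantScore, double delta, double shiftAmount,
        double normalizationPrecision, double noiseFactor, double successiveDeviation) {
    boolean clearsThreshold = (significantScore && delta > 1e-6)
            || (delta > shiftAmount + normalizationPrecision);
    // do not flag differences that are comparable to ordinary noise
    boolean clearsNoiseFloor = delta > noiseFactor * successiveDeviation;
    return clearsThreshold && clearsNoiseFloor;
}
```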

+ Math.abs(candidate.high[i] - ideal.high[i]);
}
return (differentialRemainder > DEFAULT_DIFFERENTIAL_FACTOR * lastAnomalyScore)
&& differentialRemainder * dimensions / difference > workingThreshold;
Collaborator

Does the line of code make it harder to trigger an anomaly the further out it is from the last anomaly? If difference is large, meaning a larger gap since the last anomaly, the factor dimensions / difference would be small. As this factor is multiplied with differentialRemainder to be compared against the workingThreshold, a larger gap makes it more difficult to surpass the threshold and thus trigger an anomaly. If my interpretation is right, what's the rationale behind the logic?

Contributor Author

Your observation is correct -- but difference is bounded by dimensions (the size of the shingle). What the code checks is the contribution of the new part of the shingle, extrapolated to the overall shingle.
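For illustration, toy numbers for that extrapolation; the values are made up and only the comparison mirrors the line in the diff:

```java
public class ShingleScalingDemo {
    public static void main(String[] args) {
        double dimensions = 8;              // shingle size
        double difference = 2;              // entries that are new since the last anomaly
        double differentialRemainder = 1.5; // summed contribution of those new entries
        double workingThreshold = 4.0;
        // the contribution of the new part of the shingle is extrapolated
        // to the whole shingle before being compared against the threshold
        boolean trigger = differentialRemainder * dimensions / difference > workingThreshold;
        System.out.println(trigger); // 1.5 * 8 / 2 = 6.0 > 4.0 -> true
    }
}
```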

Collaborator

I still don't understand the intuition behind making it harder to trigger an anomaly the further out it is from the last anomaly. Will ask you offline.

Contributor Author

I get it -- your question is "does it make it 'harder' to trigger?" The answer is not really -- it more or less keeps the same threshold but scales up the new component. This could be an issue to dig into deeper in the next PR and actually make it 'harder' by raising the trigger. I believe in prior versions the trigger was set higher -- we need measurements on how this affects the number of anomalies. In the next PR the goal would be to tag/eliminate drifts/runs of anomalies, so this could come in handy.

Collaborator

Got it. I noticed that the differential remainder is being scaled by dimensions / difference. However, I believe this scaling factor may not be correctly representing the "complete" difference if we are only considering a part of the array (difference to dimensions-1).

Should we consider scaling the differential remainder by dimensions / (dimensions - difference) instead? This would ensure that the differential remainder correctly represents the proportion of the total range we're considering when compared to the workingThreshold.

Contributor Author

We should consider alternate scalings as well -- and get back to your original point of making it 'harder'. The number of entries being summed is difference, though ... the loop runs from dimensions - difference to dimensions (I agree this is not ideal -- perhaps a result of multiple edits). Will address in the next PR.

if (outputAfter.isPresent()) {
startNormalization = Optional.of(min(DEFAULT_START_NORMALIZATION, outputAfter.get()));
} else {
// startNormalization = Optional.of(max(1, (int) (sampleSize *
Collaborator

should we do something in the branch?

Contributor Author

Oops, I should remove the comment. The default is 10 -- tying the value to the default fraction would connect the two parameters outputAfter and defaultFraction, and the parameters of many of the existing tests would have to change. Maybe think about this in a change touching fewer files.

deviationList[i] = new Deviation(timeDecay);
}
- usedDeviations = max(usedDeviations, deviationList.length - deviationList.length / 3);
+ usedDeviations = max(usedDeviations, deviationList.length - 2 * deviationList.length / 5);
Collaborator

There are 5 stats, but you only initialized 3 stats here. Is it a problem?

Contributor Author

No, the first two should have a higher time decay. The last three have a lower time decay (which is the smoothing).
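For illustration, a sketch of the two-tier decay scheme described above; the Deviation stand-in and the 0.4 factor are assumptions, not the PR's actual classes or constants:

```java
public class DecayTiersDemo {
    // stand-in for the library's exponentially discounted statistic
    static final class Deviation {
        final double discount;
        Deviation(double discount) { this.discount = discount; }
    }

    public static void main(String[] args) {
        double timeDecay = 0.01;
        Deviation[] deviationList = new Deviation[5];
        for (int i = 0; i < deviationList.length; i++) {
            // first two statistics react quickly (higher time decay);
            // the last three are smoothed (lower time decay)
            double discount = (i < 2) ? timeDecay : 0.4 * timeDecay;
            deviationList[i] = new Deviation(discount);
        }
        for (Deviation d : deviationList) {
            System.out.println(d.discount);
        }
    }
}
```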

@@ -378,7 +376,7 @@ public double[] getScale() {
scale[inputLength] = (weightTime == 0) ? 0 : 1.0 / weightTime;
if (normalizeTime) {
scale[inputLength] *= NORMALIZATION_SCALING_FACTOR
- * (timeStampDeviations[2].getMean() + DEFAULT_NORMALIZATION_PRECISION);
+ * (timeStampDeviations[4].getMean() + DEFAULT_NORMALIZATION_PRECISION);
Collaborator

Instead of using a hard-coded number, can you use getTimeGapDifference()? It would improve readability.

Contributor Author

next PR?

Comment on lines 258 to 262
errorHandler.updateActuals(description.getCurrentInput(), description.getPostDeviations());
errorHandler.augmentDescriptor(description);
timedForecast = extrapolate(forecastHorizon);
errorHandler.updateForecasts(timedForecast.rangeVector);
description.setTimedForecast(timedForecast);
Collaborator

Can you make this a method, since other places also use these lines?

Contributor Author

Good suggestion. Actually we can take that a step further -- now both the single-step process and processSequentially call the exact same function and differ only in the placement of the caching step. The main goal was to eliminate the repeated caching on/off, and that is now clear. Done in e28e68a.
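For reference, a hypothetical shape of the shared step; the method name is an assumption and it relies on the surrounding class's fields (errorHandler, forecastHorizon), with the actual refactor in e28e68a:

```java
// Sketch only: the sequence from lines 258-262 pulled into one place so that
// both the single-step path and processSequentially can call it.
private void updateErrorsAndForecast(ForecastDescriptor description) {
    errorHandler.updateActuals(description.getCurrentInput(), description.getPostDeviations());
    errorHandler.augmentDescriptor(description);
    var timedForecast = extrapolate(forecastHorizon);
    errorHandler.updateForecasts(timedForecast.rangeVector);
    description.setTimedForecast(timedForecast);
}
```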

@jotok jotok merged commit 5ab5a97 into aws:main Jul 11, 2023
1 check passed