
improving precision #393

Merged: 7 commits into aws:main on Jul 11, 2023

Conversation

sudiptoguha (Contributor)

Issue #, if available: 390

Description of changes: The main change in this PR is to improve precision. ThresholdedRCF had the capacity to use different streaming normalizations/transformations -- these transformations are now standardized and smoothed. As a consequence, it is feasible to use transformation A to determine significance even though the goal is to use transformation B. This is an extension of the multi-mode capability and, by default, improves the precision of the results significantly.

In addition, the PR removes unused (and unlikely to be used) code that has been a remnant of versions 1.0 and 2.0. It also adds more tests for branch coverage, especially for RandomCutTree.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sudiptoguha sudiptoguha requested a review from jotok June 30, 2023 16:11
@kaituo (Collaborator) left a comment

Read up to Java/parkservices/src/main/java/com/amazon/randomcutforest/parkservices/ErrorHandler.java. Will continue.

@@ -105,8 +105,8 @@ public static <R> void assignAndRecompute(List<Weighted<Integer>> sampledPoints,
double minDist = Double.MAX_VALUE;
int minDistNbr = -1;
for (int i = 0; i < clusters.size(); i++) {
// will check for negative distances
Collaborator

where will you check for negative distances?

Contributor Author

The function clusters.get(i).distance(getPoint.apply(point.index), distance) checks for negative distances for GenericMultiCenter (MultiCenter). Will add a check to distance() in Center.

Collaborator

Can you update your comment to include where you add a check for negative distances?

Contributor Author

Yes. It's in Center.java: 0ed94fc
GenericMultiCenter.java already had the checks.
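For illustration, a minimal sketch of the kind of guard being discussed; the class shape, field, and exception choice here are assumptions rather than the actual Center.java change in 0ed94fc:

```java
import java.util.function.BiFunction;

// Illustrative sketch only; not the library's Center class.
class CenterSketch {
    private final float[] representative;

    CenterSketch(float[] representative) {
        this.representative = representative;
    }

    // Reject user-supplied distance functions that return negative values,
    // since nearest-center assignment assumes distances are non-negative.
    double distance(float[] point, BiFunction<float[], float[], Double> distanceFunction) {
        double dist = distanceFunction.apply(representative, point);
        if (dist < 0) {
            throw new IllegalArgumentException("distance function returned a negative value");
        }
        return dist;
    }
}
```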

@@ -272,7 +272,6 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {

if (Arrays.equals(point, oldPoint)) {
increaseLeafMass(leafNode);
- checkArgument(!nodeStore.freeNodeManager.isEmpty(), "incorrect/impossible state");
Collaborator

Why don't you need the check anymore?

Contributor Author

This was verifying that when a leaf node is duplicated there is a free internal node available. But if addPoint() succeeds then there must be a free internal node, so the check was not achieving anything. In subsequent PRs we should check the NodeStore / IndexInterval manager directly.

@@ -311,7 +310,6 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {
parentPath.push(new int[] { node, sibling });
}

- checkArgument(savedDim != Integer.MAX_VALUE, () -> " cut failed at index " + pointIndex);
Collaborator

Why don't you need the check anymore?

Contributor Author

The function randomCut() now has full branch coverage, and this check is superfluous.

@@ -154,7 +154,7 @@ void printResult(BufferedWriter file, ForecastDescriptor result, int current, in
float[] lowerError = result.getObservedErrorDistribution().lower;
DiVector rmse = result.getErrorRMSE();
float[] mean = result.getErrorMean();
- float[] calibration = result.getCalibration();
+ float[] calibration = result.getIntervalPrecision();
Collaborator

Should we rename the local variable to intervalPrecision too?

Contributor Author

yes. Changing.

Contributor Author

fixed in 0ed94fc

for (int i = 0; i < inputLength; i++) {
actuals[errorIndex][i] = (float) input[i];
if (sequenceIndex > 0) {
// sequenceIndex indicates the first empty place for input
Collaborator

You meant inputIndex indicates the first empty place for input, right?

Contributor Author

Yes, you're correct. Will fix comments with the refactor.

actuals[errorIndex][i] = (float) input[i];
if (sequenceIndex > 0) {
// sequenceIndex indicates the first empty place for input
// note the predictions have already been stored
Collaborator

Where do you store predictions? I thought you store predictions between line 192~197.

Contributor Author

You're correct -- I meant to say that the corresponding forecasts would have already been stored (in the previous steps). The current (most recent) forecasts cannot be measured for accuracy.

@@ -161,36 +165,40 @@ public ErrorHandler(int errorHorizon, int forecastHorizon, int sequenceIndex, do
public void update(ForecastDescriptor descriptor, Calibration calibrationMethod) {
int arrayLength = pastForecasts.length;
int length = pastForecasts[0].values.length;
- int errorIndex = sequenceIndex % (arrayLength);
+ int storedForecastIndex = sequenceIndex % (arrayLength);
Collaborator

storedForecastIndex is the index used to store forecasts, while inputIndex is the index used to store actuals, right?

As sequenceIndex increments, both values cycle through the range [0, arrayLength - 1], but inputIndex always trails storedForecastIndex by one slot (since you subtract 1 before applying the modulo), wrapping around at the array boundary.

Here is a simple example to illustrate the difference. Let's say arrayLength is 5 and sequenceIndex is 3:

storedForecastIndex = 3 % 5 = 3
inputIndex = (3 + 5 - 1) % 5 = 2

So the storedForecastIndex for this sequenceIndex would be 3, and the inputIndex would be 2.
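For illustration, a throwaway sketch of the index arithmetic walked through above (the class is purely illustrative, not ErrorHandler code):

```java
public class CircularIndexDemo {
    public static void main(String[] args) {
        int arrayLength = 5;
        for (int sequenceIndex = 1; sequenceIndex <= 7; sequenceIndex++) {
            // slot where the forecasts produced at this step are stored
            int storedForecastIndex = sequenceIndex % arrayLength;
            // slot where the actuals are stored: always one slot behind,
            // wrapping around at the array boundary
            int inputIndex = (sequenceIndex + arrayLength - 1) % arrayLength;
            System.out.printf("seq=%d  forecast slot=%d  actuals slot=%d%n",
                    sequenceIndex, storedForecastIndex, inputIndex);
        }
    }
}
```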

Contributor Author

Yes -- though this code may have to be refactored. In the next set of PRs/commits we would split the update() step into
(i) store actuals, (ii) calibrate, (iii) store the most recent forecasts. The most recent forecasts are not useful for calibration since we have not yet seen any corresponding actuals. The update to sequenceIndex has to occur before calibration, so that calibration becomes state dependent and can be invoked from anywhere.

checkArgument(inputLength <= errorDeviations.length, "deviations should be at least as long as input lengths");
for (int i = 0; i < forecastHorizon; i++) {
// this is the only place where the newer (possibly shorter) horizon matters
int len = (sequenceIndex > errorHorizon + i + 1) ? errorHorizon : sequenceIndex - i - 1;
Collaborator

Should we create a method for this line? It is used in two places in ErrorHandler.

Contributor Author

Done in ff99de3
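For reference, a minimal sketch of what such a helper could look like; the name and signature are assumptions, and the actual extraction is in ff99de3:

```java
// Sketch only: number of usable error samples for forecast step i,
// capped at the error horizon once enough observations have been seen.
static int usableErrorLength(int sequenceIndex, int errorHorizon, int i) {
    return (sequenceIndex > errorHorizon + i + 1) ? errorHorizon : sequenceIndex - i - 1;
}
```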

@@ -454,7 +488,8 @@ protected <P extends AnomalyDescriptor> boolean isSignificant(boolean significan
shiftAmount += multiplier * DEFAULT_NORMALIZATION_PRECISION
* (scaleFactor + (Math.abs(a) + Math.abs(b)) / 2);
}
- answer = significantScore && delta > 1e-6 || (delta > shiftAmount + DEFAULT_NORMALIZATION_PRECISION);
+ answer = (significantScore && delta > 1e-6 || (delta > shiftAmount + DEFAULT_NORMALIZATION_PRECISION))
+         && (delta > noiseFactor * result.getDeviations()[baseDimensions + y]);
Collaborator

delta is the scaled difference between actual and expected values, right? result.getDeviations()[baseDimensions + y] is the deviation of the diff between successive actual points, right? If both are yes, why do we compare them? They don't seem comparable.

Or are you comparing the deviation of new points from expected points with some measure of variability in the data? If the new points deviate significantly compared to the usual variability, it may be considered an anomaly. If yes, why do you use result.getDeviations()[baseDimensions + y] instead of other deviations?

Contributor Author

It's the latter. The variability is the mean deviation between successive points -- typical for random, opaque-box noise. The thinking is that there is a natural delta between subsequent events, so if the difference between actual and predicted is of similar magnitude, then we should probably not trigger an anomaly.
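For illustration, a simplified restatement of the combined condition in the diff above; variable names are simplified, and this is a sketch rather than the PR's exact code:

```java
// delta: scaled difference between actual and expected values.
// successiveDeviation: running deviation of the differences between
// successive actual points (the natural step-to-step noise).
static boolean isSignificantDeviation(boolean significantScore, double delta, double shiftAmount,
        double normalizationPrecision, double noiseFactor, double successiveDeviation) {
    boolean clearsThreshold = (significantScore && delta > 1e-6)
            || (delta > shiftAmount + normalizationPrecision);
    // do not flag differences that are comparable to ordinary noise
    boolean clearsNoiseFloor = delta > noiseFactor * successiveDeviation;
    return clearsThreshold && clearsNoiseFloor;
}
```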

+ Math.abs(candidate.high[i] - ideal.high[i]);
}
return (differentialRemainder > DEFAULT_DIFFERENTIAL_FACTOR * lastAnomalyScore)
&& differentialRemainder * dimensions / difference > workingThreshold;
Collaborator

Does the line of code make it harder to trigger an anomaly the further out it is from the last anomaly? If difference is large, meaning a larger gap since the last anomaly, the factor dimensions / difference would be small. As this factor is multiplied with differentialRemainder to be compared against the workingThreshold, a larger gap makes it more difficult to surpass the threshold and thus trigger an anomaly. If my interpretation is right, what's the rationale behind the logic?

Contributor Author

Your observation is correct -- but difference is bounded by dimensions (the size of the shingle). What the code checks is the contribution of the new part of the shingle, extrapolated to the overall shingle.
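For illustration, toy numbers for that extrapolation; the values are made up and only the comparison mirrors the line in the diff:

```java
public class ShingleScalingDemo {
    public static void main(String[] args) {
        double dimensions = 8;              // shingle size
        double difference = 2;              // entries that are new since the last anomaly
        double differentialRemainder = 1.5; // summed contribution of those new entries
        double workingThreshold = 4.0;
        // the contribution of the new part of the shingle is extrapolated
        // to the whole shingle before being compared against the threshold
        boolean trigger = differentialRemainder * dimensions / difference > workingThreshold;
        System.out.println(trigger); // 1.5 * 8 / 2 = 6.0 > 4.0 -> true
    }
}
```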

Collaborator

I still don't understand the intuition behind making it harder to trigger an anomaly the further out it is from the last anomaly. Will ask you offline.

Contributor Author

I get it -- your question is "does it make it 'harder' to trigger?" The answer is not really -- it more or less keeps the same threshold but scales up the new component. This could be an issue to dig into deeper in the next PR and actually make it 'harder' by raising the trigger. I believe in prior versions the trigger was set higher -- we need measurements on how this affects the number of anomalies. In the next PR the goal would be to tag/eliminate drifts/runs of anomalies, so this could come in handy.

Collaborator

Got it. I noticed that the differential remainder is being scaled by dimensions / difference. However, I believe this scaling factor may not be correctly representing the "complete" difference if we are only considering a part of the array (difference to dimensions-1).

Should we consider scaling the differential remainder by dimensions / (dimensions - difference) instead? This would ensure that the differential remainder correctly represents the proportion of the total range we're considering when compared to the workingThreshold.

Contributor Author

We should consider alternate scalings as well -- and get back to your original point of making it 'harder'. The number of entries being summed is difference, though ... the loop runs from dimensions - difference to dimensions (I agree this is not ideal -- perhaps a result of multiple edits). Will address in the next PR.

if (outputAfter.isPresent()) {
startNormalization = Optional.of(min(DEFAULT_START_NORMALIZATION, outputAfter.get()));
} else {
// startNormalization = Optional.of(max(1, (int) (sampleSize *
Collaborator

should we do something in the branch?

Contributor Author

Oops, I should remove the comment. The default is 10 -- tying the value to the default fraction would connect the two parameters outputAfter and defaultFraction, and the parameters of many of the existing tests would have to change. Maybe think about this in a change touching fewer files.

deviationList[i] = new Deviation(timeDecay);
}
- usedDeviations = max(usedDeviations, deviationList.length - deviationList.length / 3);
+ usedDeviations = max(usedDeviations, deviationList.length - 2 * deviationList.length / 5);
Collaborator

There are 5 stats, but you only initialized 3 stats here. Is it a problem?

Contributor Author

No, the first two should have a higher time decay. The last three have a lower time decay (which is the smoothing).
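For illustration, a sketch of the two-tier decay scheme described above; the Deviation stand-in and the 0.4 factor are assumptions, not the PR's actual classes or constants:

```java
public class DecayTiersDemo {
    // stand-in for the library's exponentially discounted statistic
    static final class Deviation {
        final double discount;
        Deviation(double discount) { this.discount = discount; }
    }

    public static void main(String[] args) {
        double timeDecay = 0.01;
        Deviation[] deviationList = new Deviation[5];
        for (int i = 0; i < deviationList.length; i++) {
            // first two statistics react quickly (higher time decay);
            // the last three are smoothed (lower time decay)
            double discount = (i < 2) ? timeDecay : 0.4 * timeDecay;
            deviationList[i] = new Deviation(discount);
        }
        for (Deviation d : deviationList) {
            System.out.println(d.discount);
        }
    }
}
```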

@@ -378,7 +376,7 @@ public double[] getScale() {
scale[inputLength] = (weightTime == 0) ? 0 : 1.0 / weightTime;
if (normalizeTime) {
scale[inputLength] *= NORMALIZATION_SCALING_FACTOR
- * (timeStampDeviations[2].getMean() + DEFAULT_NORMALIZATION_PRECISION);
+ * (timeStampDeviations[4].getMean() + DEFAULT_NORMALIZATION_PRECISION);
Collaborator

Instead of using a hard-coded number, can you use getTimeGapDifference()? It would improve readability.

Contributor Author

next PR?

Comment on lines 258 to 262
errorHandler.updateActuals(description.getCurrentInput(), description.getPostDeviations());
errorHandler.augmentDescriptor(description);
timedForecast = extrapolate(forecastHorizon);
errorHandler.updateForecasts(timedForecast.rangeVector);
description.setTimedForecast(timedForecast);
Collaborator

Can you make this a method, since other places also use these lines?

Contributor Author

Good suggestion. Actually we can take that a step further -- now both the single-step process and processSequentially call the exact same function and differ only in the placement of the caching step. The main goal was to eliminate the repeated caching on/off, and that is now clear. Done in e28e68a.
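For reference, a hypothetical shape of the shared step; the method name is an assumption and it relies on the surrounding class's fields (errorHandler, forecastHorizon), with the actual refactor in e28e68a:

```java
// Sketch only: the sequence from lines 258-262 pulled into one place so that
// both the single-step path and processSequentially can call it.
private void updateErrorsAndForecast(ForecastDescriptor description) {
    errorHandler.updateActuals(description.getCurrentInput(), description.getPostDeviations());
    errorHandler.augmentDescriptor(description);
    var timedForecast = extrapolate(forecastHorizon);
    errorHandler.updateForecasts(timedForecast.rangeVector);
    description.setTimedForecast(timedForecast);
}
```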

@jotok jotok merged commit 5ab5a97 into aws:main Jul 11, 2023
1 check passed