improving precision #393
Conversation
Read up to Java/parkservices/src/main/java/com/amazon/randomcutforest/parkservices/ErrorHandler.java. Will continue.
@@ -105,8 +105,8 @@ public static <R> void assignAndRecompute(List<Weighted<Integer>> sampledPoints,
double minDist = Double.MAX_VALUE;
int minDistNbr = -1;
for (int i = 0; i < clusters.size(); i++) {
    // will check for negative distances
Where will you check for negative distances?
The function clusters.get(i).distance(getPoint.apply(point.index), distance) checks for negative distances for GenericMultiCenter (MultiCenter). Will add a check to distance() in Center.
Can you update your comment to include where you add a check for negative distances?
Yes. It's in Center.java (0ed94fc).
GenericMultiCenter.java already had the checks.
@@ -272,7 +272,6 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {
  if (Arrays.equals(point, oldPoint)) {
      increaseLeafMass(leafNode);
-     checkArgument(!nodeStore.freeNodeManager.isEmpty(), "incorrect/impossible state");
Why don't you need the check anymore?
This was verifying that when a leaf node is duplicated there is a free internal node available. But if addPoint() succeeds, then there must be a free internal node, so the check was not achieving anything. In subsequent PRs we should check the NodeStore/IndexInterval manager directly.
@@ -311,7 +310,6 @@ public Integer addPoint(Integer pointIndex, long sequenceIndex) {
      parentPath.push(new int[] { node, sibling });
  }
- checkArgument(savedDim != Integer.MAX_VALUE, () -> " cut failed at index " + pointIndex);
Why don't you need the check anymore?
The function randomCut() now has perfect branch coverage, so this check is superfluous.
@@ -154,7 +154,7 @@ void printResult(BufferedWriter file, ForecastDescriptor result, int current, in
  float[] lowerError = result.getObservedErrorDistribution().lower;
  DiVector rmse = result.getErrorRMSE();
  float[] mean = result.getErrorMean();
- float[] calibration = result.getCalibration();
+ float[] calibration = result.getIntervalPrecision();
Should we rename the local variable to intervalPrecision too?
Yes, changing.
Fixed in 0ed94fc.
for (int i = 0; i < inputLength; i++) {
    actuals[errorIndex][i] = (float) input[i];
    if (sequenceIndex > 0) {
        // sequenceIndex indicates the first empty place for input
You meant inputIndex indicates the first empty place for input, right?
Yes, you're correct. Will fix comments with the refactor.
actuals[errorIndex][i] = (float) input[i];
if (sequenceIndex > 0) {
    // sequenceIndex indicates the first empty place for input
    // note the predictions have already been stored
Where do you store predictions? I thought you store predictions between line 192~197.
You're correct -- I meant to say that the corresponding forecasts would have already been stored (in the previous steps). The current (most recent) forecasts cannot yet be measured for accuracy.
@@ -161,36 +165,40 @@ public ErrorHandler(int errorHorizon, int forecastHorizon, int sequenceIndex, do
  public void update(ForecastDescriptor descriptor, Calibration calibrationMethod) {
      int arrayLength = pastForecasts.length;
      int length = pastForecasts[0].values.length;
-     int errorIndex = sequenceIndex % (arrayLength);
+     int storedForecastIndex = sequenceIndex % (arrayLength);
storedForecastIndex is the index used to store forecasts, while inputIndex is the index used to store actuals, right?
As sequenceIndex increments, both values cycle through the range [0, arrayLength - 1], but inputIndex is always one step behind storedForecastIndex in the circular array (wrapping to the end when it passes the beginning), given that you subtract 1 before applying the modulo operation, right?
Here is a simple example to illustrate the difference. Let's say arrayLength is 5, and sequenceIndex is 3:
storedForecastIndex = 3 % 5 = 3
inputIndex = (3 + 5 - 1) % 5 = 2
So the storedForecastIndex for this sequenceIndex would be 3, and the inputIndex would be 2.
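The circular-buffer arithmetic in the example above can be sketched as follows (the helper names follow the discussion; this is only an illustration, not the library's code):

```java
public class CircularIndexDemo {
    // Slot where the newest forecasts are stored.
    static int storedForecastIndex(int sequenceIndex, int arrayLength) {
        return sequenceIndex % arrayLength;
    }

    // Slot where actuals are stored: one step behind storedForecastIndex
    // in the circular array, because 1 is subtracted before the modulo.
    static int inputIndex(int sequenceIndex, int arrayLength) {
        return (sequenceIndex + arrayLength - 1) % arrayLength;
    }

    public static void main(String[] args) {
        int arrayLength = 5;
        for (int sequenceIndex = 0; sequenceIndex < 7; sequenceIndex++) {
            System.out.println("seq " + sequenceIndex
                    + " -> forecast slot " + storedForecastIndex(sequenceIndex, arrayLength)
                    + ", input slot " + inputIndex(sequenceIndex, arrayLength));
        }
    }
}
```

For sequenceIndex = 3 and arrayLength = 5 this reproduces the worked example: forecast slot 3, input slot 2.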
Yes -- though this code may have to be refactored. In the next set of PRs/commits we would split the update() step into
(i) store actuals, (ii) calibrate, and (iii) store the most recent forecasts. The most recent forecasts are not useful for calibration since we have not yet seen any corresponding actuals. The update to sequenceIndex has to occur before calibration -- so that calibration is state dependent and can be invoked from anywhere.
checkArgument(inputLength <= errorDeviations.length, "deviations should be at least as long as input lengths");
for (int i = 0; i < forecastHorizon; i++) {
    // this is the only place where the newer (possibly shorter) horizon matters
    int len = (sequenceIndex > errorHorizon + i + 1) ? errorHorizon : sequenceIndex - i - 1;
Should we extract a method for this line? It is used twice in ErrorHandler.
Done in ff99de3
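The extracted helper might look roughly like the sketch below (the method and parameter names are guesses for illustration; the actual refactor in ff99de3 may differ):

```java
public final class ObservedLengths {
    private ObservedLengths() {
    }

    // Number of observed errors available for horizon step i:
    // once enough actuals have arrived it is capped at errorHorizon,
    // otherwise it is limited by how many actuals have been seen so far.
    static int observedErrorLength(int sequenceIndex, int errorHorizon, int i) {
        return (sequenceIndex > errorHorizon + i + 1) ? errorHorizon : sequenceIndex - i - 1;
    }
}
```

Centralizing the expression keeps the two call sites in ErrorHandler from drifting apart if the horizon logic changes.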
@@ -454,7 +488,8 @@ protected <P extends AnomalyDescriptor> boolean isSignificant(boolean significan
      shiftAmount += multiplier * DEFAULT_NORMALIZATION_PRECISION
              * (scaleFactor + (Math.abs(a) + Math.abs(b)) / 2);
  }
- answer = significantScore && delta > 1e-6 || (delta > shiftAmount + DEFAULT_NORMALIZATION_PRECISION);
+ answer = (significantScore && delta > 1e-6 || (delta > shiftAmount + DEFAULT_NORMALIZATION_PRECISION))
+         && (delta > noiseFactor * result.getDeviations()[baseDimensions + y]);
delta is the scaled difference between actual and expected values, right? And result.getDeviations()[baseDimensions + y] is the deviation of the difference between successive actual points, right? If both are yes, why do we compare them? They don't seem comparable.
Or are you comparing the deviation of new points from expected points against some measure of variability in the data? If the new points deviate significantly compared to the usual variability, it may be considered an anomaly. If yes, why do you use result.getDeviations()[baseDimensions + y] instead of other deviations?
It's the latter. The variability is the mean deviation between successive points -- this is typical for random opaque-box noise. The thinking is that there is a natural delta between subsequent events; so if we see a similar difference between actual and predicted, we should probably not trigger an anomaly.
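The extra conjunct can be sketched in isolation (the names noiseFactor and successiveDiffDeviation are illustrative stand-ins for the fields used in the PR):

```java
public class NoiseGate {
    // Suppress an otherwise-significant score when the actual-vs-expected
    // delta is no larger than the typical jump between successive points,
    // scaled by a noise factor.
    static boolean exceedsNoiseFloor(double delta, double noiseFactor,
            double successiveDiffDeviation) {
        return delta > noiseFactor * successiveDiffDeviation;
    }
}
```

With noiseFactor = 1, a delta of 5 against a typical successive jump of 2 passes the gate, while a delta of 1.5 is absorbed as ordinary variability.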
        + Math.abs(candidate.high[i] - ideal.high[i]);
}
return (differentialRemainder > DEFAULT_DIFFERENTIAL_FACTOR * lastAnomalyScore)
        && differentialRemainder * dimensions / difference > workingThreshold;
Does the line of code make it harder to trigger an anomaly the further out it is from the last anomaly? If difference is large, meaning a larger gap since the last anomaly, the factor dimensions / difference would be small. As this factor is multiplied with differentialRemainder to be compared against the workingThreshold, a larger gap makes it more difficult to surpass the threshold and thus trigger an anomaly. If my interpretation is right, what's the rationale behind the logic?
Your observation is correct -- but difference is bounded by dimensions (the size of the shingle). What the code checks is the contribution of the new part of the shingle, extrapolated to the overall shingle.
I still don't understand the intuition behind making it harder to trigger an anomaly the further out it is from the last anomaly. Will ask you offline.
I get it -- your question is "does it make it 'harder' to trigger?". The answer is not really -- it more or less keeps the same threshold but scales up the new component. This could be an issue to dig deeper into in the next PR, and actually make it 'harder' by raising the trigger. I believe in prior versions the trigger was set higher -- we need some measurements on how this affects the number of anomalies. In the next PR the goal would be to tag/eliminate drifts/runs of anomalies -- so this could come in handy.
Got it. I noticed that the differential remainder is being scaled by dimensions / difference. However, I believe this scaling factor may not be correctly representing the "complete" difference if we are only considering a part of the array (difference to dimensions-1).
Should we consider scaling the differential remainder by dimensions / (dimensions - difference) instead? This would ensure that the differential remainder correctly represents the proportion of the total range we're considering when compared to the workingThreshold.
We should consider alternate scalings as well -- and get back to your original point of making it 'harder'. The number of entries being summed up is difference, though: the loop runs from dimensions - difference to dimensions (I agree this is not ideal -- perhaps a result of multiple edits). Will address in the next PR.
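A small numeric sketch of the extrapolation under discussion (the values are chosen only for illustration):

```java
public class ShingleScalingDemo {
    // Extrapolate the contribution of the newest `difference` shingle
    // entries to the full shingle of size `dimensions`, before the
    // result is compared against the working threshold.
    static double extrapolated(double differentialRemainder, int dimensions, int difference) {
        return differentialRemainder * dimensions / difference;
    }

    public static void main(String[] args) {
        // With a shingle of 8 and 2 new entries, a remainder of 0.5
        // is scaled up by a factor of 8 / 2 = 4.
        System.out.println(extrapolated(0.5, 8, 2));
    }
}
```

Since difference is bounded by dimensions, the scale factor is at least 1; it grows as the gap since the last anomaly shrinks, which is the behavior the reviewer is probing.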
if (outputAfter.isPresent()) {
    startNormalization = Optional.of(min(DEFAULT_START_NORMALIZATION, outputAfter.get()));
} else {
    // startNormalization = Optional.of(max(1, (int) (sampleSize *
Should we do something in the else branch?
Oops -- should remove the comment. The default is 10. Sending the value to the default fraction coupled the two parameters outputAfter and defaultFraction, and the parameters of many existing tests would have had to change. Maybe think about this with fewer files/changes.
      deviationList[i] = new Deviation(timeDecay);
  }
- usedDeviations = max(usedDeviations, deviationList.length - deviationList.length / 3);
+ usedDeviations = max(usedDeviations, deviationList.length - 2 * deviationList.length / 5);
There are 5 stats, but you only initialized 3 stats here. Is it a problem?
No, the first two should have higher time decay. The last three have lower time decay (which is the smoothing).
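To illustrate why the two decay regimes behave differently, here is a minimal sketch of an exponentially time-decayed mean (the real Deviation class in the library carries more state; this toy class only shows that a higher timeDecay tracks recent values faster while a lower one smooths):

```java
public class DecayedMean {
    private final double timeDecay; // larger -> older values are forgotten faster
    private double weightedSum = 0;
    private double totalWeight = 0;

    DecayedMean(double timeDecay) {
        this.timeDecay = timeDecay;
    }

    // Each update discounts all previous observations by exp(-timeDecay)
    // before folding in the new value with weight 1.
    void update(double value) {
        double factor = Math.exp(-timeDecay);
        weightedSum = weightedSum * factor + value;
        totalWeight = totalWeight * factor + 1;
    }

    double mean() {
        return (totalWeight == 0) ? 0 : weightedSum / totalWeight;
    }
}
```

Feeding a run of 0s followed by 1s, a DecayedMean with timeDecay = 1 moves toward 1 much faster than one with timeDecay = 0.01, which is the smoothing behavior attributed to the last three deviations.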
@@ -378,7 +376,7 @@ public double[] getScale() {
  scale[inputLength] = (weightTime == 0) ? 0 : 1.0 / weightTime;
  if (normalizeTime) {
      scale[inputLength] *= NORMALIZATION_SCALING_FACTOR
-             * (timeStampDeviations[2].getMean() + DEFAULT_NORMALIZATION_PRECISION);
+             * (timeStampDeviations[4].getMean() + DEFAULT_NORMALIZATION_PRECISION);
Instead of using hard coded number, can you use getTimeGapDifference()? It improves readability.
next PR?
errorHandler.updateActuals(description.getCurrentInput(), description.getPostDeviations());
errorHandler.augmentDescriptor(description);
timedForecast = extrapolate(forecastHorizon);
errorHandler.updateForecasts(timedForecast.rangeVector);
description.setTimedForecast(timedForecast);
Can you make this a method, since other places also use these lines?
Good suggestion. Actually we can take that a step further -- now both the single-step process and processSequentially call the exact same function and differ only in the placement of the caching step. The main goal was to eliminate the repeated caching on/off, and that is now clear. Done in e28e68a
Issue #, if available: 390
Description of changes: The main change in the PR is to improve precision. ThresholdedRCF had the capacity to use different streaming normalizations/transformations -- these transformations are now standardized and smoothened. As a consequence, it is feasible to use transformation A to determine significance, even though the goal is to use transformation B. This is an extension of the multi-mode capability and by default, improves the precision of the results significantly.
In addition, the PR removes unused (and unlikely to be used) code that has been a remnant of versions 1.0 and 2.0. It also adds more tests for branch coverage, especially for RandomCutTree.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.