Performance degradation on PCA
I was getting unexpectedly poor performance from PCA on the Exathlon data (using the VUS-PR metric):
| Method | TSB-AD - latest commit | Values reported in the paper | Difference |
|---|---|---|---|
| TranAD | 0.95 | 0.10 | 0.86 |
| OFA | 0.85 | 0.58 | 0.28 |
| CNN | 0.95 | 0.68 | 0.27 |
| LSTMAD | 0.96 | 0.82 | 0.14 |
| OmniAnomaly | 0.97 | 0.84 | 0.13 |
| USAD | 0.97 | 0.84 | 0.13 |
| RobustPCA | 0.81 | 0.77 | 0.04 |
| AnomalyTransformer | 0.14 | 0.10 | 0.04 |
| AutoEncoder | 0.91 | 0.91 | 0.00 |
| IForest | 0.32 | 0.35 | -0.04 |
| PCA | 0.53 | 0.95 | -0.42 |
The first column of values is what I obtained with the latest commit of TSB-AD; the second is the values from the publication.
The improvement in most methods can probably be ascribed to the fixes (e.g. use of correct hyperparameters) since publication. However, there might be a bug affecting PCA and possibly other methods as well:
Bug (in PCA?)
At least part of the issue might be normalization introduced in a79f315:

    # models/PCA.py
    X = Window(window = self.slidingWindow).convert(X)
    if self.normalize:
        if n_features == 1:
            X = zscore(X, axis=0, ddof=0)
        else:
            X = zscore(X, axis=1, ddof=1)  # <--- 2nd issue
    # validate inputs X and y (optional)
    X = check_array(X)
    self._set_n_classes(y)
    # PCA is recommended to use on the standardized data (zero mean and
    # unit variance).
    if self.standardization:  # <--- 1st issue
        X, self.scaler_ = standardizer(X, keep_scalar=True)

In the example of PCA above:
1. Normalization is applied twice, once for the `normalization` flag and once for the `standardization` flag.
2. `X = zscore(X, axis=1, ddof=1)` seems to apply normalization independently to each window. Each window contains multiple features and has the form `[feat0_t0, feat0_t1, ..., feat0_tw, feat1_t0, feat1_t1, ..., feat1_tw, ...]`.
It is not clear to me what the second point is trying to achieve, but it does not seem correct for PCA. Applying the z-score independently to each time window destroys, for example, the information about the absolute magnitude of features across time windows, and overall seems only to remove information from the data.
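To illustrate the concern, here is a minimal sketch (not TSB-AD code) of what a row-wise z-score does to flattened windows. The helper `zscore_rows` and the toy windows are assumptions made up for this example; the helper is meant to mirror `zscore(X, axis=1, ddof=1)` using plain NumPy.

```python
import numpy as np


def zscore_rows(X, ddof=1):
    """Row-wise z-score, intended to mirror zscore(X, axis=1, ddof=ddof)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=ddof, keepdims=True)
    return (X - mean) / std


# Toy flattened window of a 2-feature series with window length 3, laid out as
# [feat0_t0, feat0_t1, feat0_t2, feat1_t0, feat1_t1, feat1_t2].
normal_window = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])

# Hypothetical anomalous window: same shape, but scaled 100x and shifted by 500.
large_window = 100.0 * normal_window + 500.0

X = np.vstack([normal_window, large_window])
Z = zscore_rows(X)

# After per-window normalization the two rows are numerically the same,
# so the difference in magnitude is no longer visible to PCA.
rows_identical = bool(np.allclose(Z[0], Z[1]))
print(rows_identical)
```

Any affine change of a whole window (scale and offset) is cancelled by the per-row z-score, which is why the two rows coincide here.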
Also, I think normalization might not be needed for IsolationForest at all? For the other methods, I can't tell for sure without looking into them more.
Fix
Two parts to this:
1. In my opinion, PCA should not have both "normalization" and "standardization". I think removing the "normalization" code and renaming "standardization" to "normalization" would be appropriate for consistency with the other methods, even though I think "standardization" is actually the better term here.
2. I think `X = zscore(X, axis=1, ddof=1)` is a bug, at least for some methods (PCA, IForest). Naively, I believe the changes from a79f315 could be replaced by using the `StandardScaler` in `standardizer`, applied along the columns, for all the methods involved.
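As a rough sketch of the column-wise alternative, assuming the windows are already flattened into rows: the `ColumnStandardizer` class below is a hypothetical NumPy stand-in for a `StandardScaler`-style transform (population standard deviation, constant columns left unscaled), not the actual `standardizer` implementation, and the toy data is made up.

```python
import numpy as np


class ColumnStandardizer:
    """Hypothetical stand-in for a StandardScaler-style transform.

    Statistics are computed per column (i.e. per feature/time-offset position)
    across all windows, so differences between windows are preserved.
    """

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        std = X.std(axis=0)  # population std (ddof=0)
        # Leave constant columns unscaled instead of dividing by zero.
        self.scale_ = np.where(std == 0.0, 1.0, std)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_


# Toy flattened windows: two small-magnitude rows and one large-magnitude row.
X_train = np.array([
    [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    [2.0, 3.0, 4.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 1000.0, 2000.0, 3000.0],
])

scaler = ColumnStandardizer().fit(X_train)
Z_train = scaler.transform(X_train)

# Each column now has zero mean and unit variance, yet the large window
# still differs from the small ones in every column.
large_window_stands_out = bool(np.all(Z_train[2] > Z_train[0]))
print(large_window_stands_out)
```

The fitted `scaler` can be reused on test windows with `scaler.transform(X_test)`, which is the role `self.scaler_` appears to play in the PCA snippet.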
I can prepare a PR for either part 1 alone or parts 1 and 2 together, if that is welcome.
PS: thank you for an awesome project of such a large scope <3