CyrenThreatIntelligence v3.0.3: Fix duplicate data ingestion #13631

Open
mazamizo21 wants to merge 1 commit into Azure:master from Data443:feature/cyren-v3.0.3-dedup-fix-ms

@mazamizo21
Contributor

Summary

Follow-up fix to PR #13603 (v3.0.2). While v3.0.2 correctly changed the paging type from Offset to PersistentToken, the combination of small page sizes (count=100) and frequent polling (queryWindowInMin=15) still caused significant duplicate data ingestion in production.

Problem

The Cyren IP Reputation feed contains approximately 800 static indicators and the Malware URLs feed approximately 200 indicators. With the v3.0.2 configuration:

  • count=100 caused 8+ page requests per poll cycle to fetch all indicators
  • queryWindowInMin=15 triggered polling every 15 minutes (96 times/day)
  • Observed impact: 304,000 rows ingested in 24 hours with only 198 unique IPs — a 1,535:1 duplicate ratio
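The figures in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative and not part of the connector, and the inputs are the numbers quoted in this PR.

```python
import math

def pages_per_poll(feed_size: int, page_size: int) -> int:
    """Number of page requests needed to fetch the whole feed once."""
    return math.ceil(feed_size / page_size)

def polls_per_day(query_window_min: int) -> int:
    """Number of poll cycles in 24 hours for a given polling interval."""
    return (24 * 60) // query_window_min

def duplicate_ratio(total_rows: int, unique_indicators: int) -> float:
    """Average number of ingested rows per unique indicator."""
    return total_rows / unique_indicators

print(pages_per_poll(800, 100))             # pages for ~800 indicators at count=100
print(polls_per_day(15))                    # polls per day at queryWindowInMin=15
print(round(duplicate_ratio(304_000, 198))) # rows per unique IP, rounded
```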

Changes

| Parameter | v3.0.2 (Before) | v3.0.3 (After) | Rationale |
| --- | --- | --- | --- |
| count | 100 | 1000 | Fetch all indicators in a single page; no multi-page re-fetching needed |
| queryWindowInMin | 15 | 360 | Poll every 6 hours; threat intelligence indicators are relatively static |
| pagingType | PersistentToken | PersistentToken | No change; correct paging type preserved from v3.0.2 |
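A small sketch of how the parameter change could be sanity-checked. The dictionaries below are a simplified, hypothetical view of one poller's settings: they use the three parameter names from the table, but the real Cyren_PollerConfig.json has more fields and its own nesting, which is not reproduced here.

```python
# Hypothetical flattened settings for a single poller (not the real file layout).
V3_0_2 = {"count": 100, "queryWindowInMin": 15, "pagingType": "PersistentToken"}
V3_0_3 = {"count": 1000, "queryWindowInMin": 360, "pagingType": "PersistentToken"}

def changed_keys(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every setting whose value differs."""
    return {k: (before[k], after.get(k)) for k in before if before[k] != after.get(k)}

def fits_in_one_page(config: dict, feed_size: int) -> bool:
    """True when the page size is large enough to return the whole feed at once."""
    return feed_size <= config["count"]

print(changed_keys(V3_0_2, V3_0_3))
print(fits_in_one_page(V3_0_3, 800))
```

If a feed ever grows past 1000 indicators, `fits_in_one_page` would return False again and multi-page fetching would resume, so the check is worth rerunning against the current feed size.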

Expected Impact

  • ~99% reduction in rows ingested per day (about 98.9% by the figures below)
  • Before: ~304,000 rows/day → After: ~3,200 rows/day (4 polls × ~800 records)
  • Already validated on a live Sentinel workspace (Cyren-Final-2)
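The projected row count and the reduction can be derived from the stated inputs. This is a sketch that takes the PR's own figures at face value (about 800 records per poll, 304,000 rows/day before); the helper names are illustrative only.

```python
def projected_rows_per_day(records_per_poll: int, query_window_min: int) -> int:
    """Rows ingested per day if every poll ingests the full set of records once."""
    polls = (24 * 60) // query_window_min
    return polls * records_per_poll

def reduction(before_rows: int, after_rows: int) -> float:
    """Fractional reduction in ingested rows, between 0 and 1."""
    return 1 - after_rows / before_rows

after = projected_rows_per_day(800, 360)
print(after)                               # 4 polls x 800 records
print(f"{reduction(304_000, after):.1%}")  # reduction relative to the observed 304,000 rows
```

With these inputs the reduction works out to roughly 98.9%. The estimate counts only the ~800-record feed, as the bullet above does.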

Files Changed

| File | Change |
| --- | --- |
| Cyren_PollerConfig.json | count: 100→1000, queryWindowInMin: 15→360 (both pollers) |
| Package/mainTemplate.json | Same config changes + _solutionVersion: 3.0.2→3.0.3 |
| Package/3.0.3.zip | New package with updated mainTemplate.json + createUiDefinition.json |
| ReleaseNotes.md | Added v3.0.3 entry |

All previous package versions preserved: 3.0.0.zip, 3.0.1.zip, 3.0.2.zip

Verification

  • Extracted 3.0.3.zip and confirmed all values match source files
  • Live connector patched and validated in production workspace
  • Old zip files verified unchanged (SHA-256 matches upstream)
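The two file-level checks above (extracting the package and comparing SHA-256 digests) could be scripted along these lines. This is a sketch rather than the procedure actually used for this PR. The member name passed to `read_json_from_zip` is an assumption about the archive layout, and the reference digests for the older zips would have to come from the upstream repository.

```python
import hashlib
import json
import zipfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def read_json_from_zip(zip_path: Path, member: str) -> dict:
    """Load one JSON member from a zip archive without extracting it to disk."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            return json.load(fh)

def is_unchanged(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a known-good value (case-insensitive)."""
    return sha256_of(path) == expected_hex.lower()
```

A typical use would be to call `is_unchanged(Path("Package/3.0.2.zip"), upstream_digest)` for each older package, and to compare values read via `read_json_from_zip` against the source files.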

Related

…up to Azure#13603)

Changes in this PR:
- Increased 'count' from 100 to 1000 in both IP Reputation and Malware URLs pollers
  (Cyren IP Rep feed has ~800 indicators, Malware URLs ~200 — all fit in one page)
- Increased 'queryWindowInMin' from 15 to 360 minutes (6 hours)
  (Threat intelligence feeds are relatively static and do not require frequent polling)
- Preserved PersistentToken paging from v3.0.2
- Added 3.0.3.zip package (all previous versions preserved: 3.0.0, 3.0.1, 3.0.2)
- Updated ReleaseNotes.md

Root cause of duplication:
With count=100, the connector made 8+ page requests per poll cycle to fetch all ~800
indicators. Combined with 15-minute polling, this re-ingested the same data 96 times
per day. Observed: 304,000 rows with only 198 unique IPs (1,535:1 duplicate ratio).

Files changed:
- Cyren_PollerConfig.json: count 100→1000, queryWindowInMin 15→360
- Package/mainTemplate.json: Same fixes + version bump to 3.0.3
- Package/3.0.3.zip: Updated package with all changes
- ReleaseNotes.md: Added 3.0.3 entry
@mazamizo21 mazamizo21 requested review from a team as code owners February 13, 2026 13:57
@v-shukore v-shukore added the Solution label (Solution specialty review needed) Feb 16, 2026
@mazamizo21
Contributor Author

Additional Evidence — Production Duplicate Analysis

Observed Problem in Production

After deploying v3.0.2 (PR #13603 — paging type fix), we observed massive duplicate data in the Log Analytics workspace:

  • 304,000+ rows ingested over a monitoring period
  • Only 198 unique IP indicators in the feed
  • Duplicate ratio: 1,535:1 — each indicator was ingested ~1,535 times
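For reference, a duplicate ratio like the one above can be measured from exported rows by counting occurrences per indicator. This is a generic sketch: it assumes the indicator values (for example, IP strings) have already been pulled out of the workspace into a Python list, and it does not reflect the query actually used for this analysis.

```python
from collections import Counter

def duplicate_stats(indicator_values):
    """Return (total_rows, unique_indicators, rows_per_unique) for a list of values."""
    counts = Counter(indicator_values)
    total = sum(counts.values())
    unique = len(counts)
    ratio = total / unique if unique else 0.0
    return total, unique, ratio

# Toy input using documentation-range addresses: 4 rows, 2 distinct indicators.
rows = ["192.0.2.1", "192.0.2.1", "192.0.2.1", "198.51.100.7"]
print(duplicate_stats(rows))
```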

Root Cause

The combination of count=100 (page size) and queryWindowInMin=15 (poll interval) created a perfect storm:

  1. IP Rep feed has ~800 indicators → 8+ pages per request
  2. CCF polls every 15 minutes → 96 requests/day
  3. Feed is largely static (same indicators) → every poll re-ingests everything
  4. Result: 800 indicators × 96 polls/day × 8 pages = massive duplication

Fix Applied

| Parameter | Before | After | Why |
| --- | --- | --- | --- |
| count | 100 | 1000 | Reduces pages from 8+ to 1, eliminating redundant API calls |
| queryWindowInMin | 15 | 360 | 6-hour window matches feed update frequency. Cyren feeds update slowly; demo key data has been static since Dec 2025 |

Validation

  • Tested with production API key: feed returns same ~800 indicators regardless of time window
  • With count=1000, all indicators fit in a single page (no paging needed)
  • With queryWindowInMin=360, polling drops from 96x/day to 4x/day
  • Net effect: roughly a 99% reduction in ingested rows (by the PR's figures, ~304,000 → ~3,200 rows/day)

@mazamizo21
Contributor Author

Hi @v-maheshbh — just checking in on this one. All 23 CI checks are passing clean. The PR fixes a duplicate data ingestion issue we observed in production (1,535:1 duplicate ratio due to polling window overlap).

Could you take a look when you get a chance? Happy to hop on a call if you'd like to walk through the changes.

Thanks!
