Introduce batching to MISP feed output bot #2473
Open
arvchristos wants to merge 1 commit into certtools:develop from
Conversation
f4e5d77 to d88ca48
kamil-certat (Contributor) requested changes on Apr 16, 2024
Hey, this is a very good idea. I've recently started using the Feed Output bot, and I had similar thoughts that the way it currently works may not be the best.
However, I have some concerns about the PR:
- Have you tested your bot with a feed producing data very slowly? Or very quickly? Line 123 checks the size of the source queue, which holds events incoming to the bot, not how many events have been accumulated. If I haven't missed anything, if your bot never gets 100 events in the source queue, it will never save data. And if your bot quickly gets 100 events in the queue (e.g. after not running for some time), it will do the same as the original code: save on every event.
- You acknowledge messages and then keep events in a list (`event_batch`). This means they are in-memory only, and any surprising restart/crash will cause data loss.
- You haven't implemented the `shutdown` method, so any restart or reload will not flush the batch, and so cause data loss.
- You have made an assumption that the source queue is always Redis-compatible. While it's the most common case, it can theoretically be e.g. AMQP (see `pipeline.py`).
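To make the shutdown concern concrete, here is a minimal standalone sketch of the batching-plus-flush pattern (not IntelMQ code; the feed writer is a stub, and `saved` stands in for the feed on disk — only `event_batch` and `shutdown` mirror names discussed here):

```python
class BatchingSketch:
    """Sketch of in-memory batching with an explicit shutdown flush.

    Without shutdown(), any reload/restart silently drops the buffered
    events, because they were already acknowledged from the source queue.
    """

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.event_batch = []   # in-memory only: lost on crash unless flushed
        self.saved = []         # stand-in for the MISP feed written to disk

    def process(self, event):
        # Acknowledge + buffer; write the feed only when the batch is full
        self.event_batch.append(event)
        if len(self.event_batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.event_batch:
            self.saved.extend(self.event_batch)  # one feed write per batch
            self.event_batch = []

    def shutdown(self):
        # The point above: flush the remainder on reload/restart
        self.flush()
```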
I have a few suggestions:
- This is in fact a kind of caching. Let's re-use CacheMixin from `intelmq/lib/mixins/cache.py` to connect with Redis. It doesn't have support for storing events in a queue, but you can then use the client by getting it from `cache_get_redis_instance`.
- Store events in a separate Redis queue. This moves the responsibility for keeping data away from the bot.
- I would suggest changing the logic a little: let's use `interval_event` and `batch_size` as two conditions, and when any of them is set and its threshold is reached, create a new event and save. That would be a little more logical to me.
  - E.g. for very slow feeds, the interval would hit first, and we would save the few collected events e.g. every hour.
  - For quick feeds, the batch limit would hit first, and to avoid creating very big MISP events, we should also create a new MISP event for every batch.
- Removing the `interval_event` value should trigger a new MISP event only when the batch size is reached.
- Removing the `batch_size` (= setting it to 0) should preserve the current behaviour: save immediately, but create a new event only when the interval is reached.
- Setting `batch_size` to 1 should act as saving every event immediately in a new MISP event.
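The dual-trigger logic described here could be sketched roughly as follows (a standalone illustration, not IntelMQ code; the clock and the feed writer are stubs, and only the `interval_event`/`batch_size` names follow the discussion):

```python
import time


class DualTriggerBuffer:
    """Flush when EITHER the batch size OR the time interval is reached.

    batch_size == 0 disables the size trigger (interval-only flushing);
    batch_size == 1 flushes every event immediately;
    interval_event == 0 disables the time trigger (size-only flushing).
    """

    def __init__(self, batch_size=100, interval_event=3600,
                 clock=time.monotonic):
        self.batch_size = batch_size
        self.interval_event = interval_event
        self.clock = clock          # injectable for testing
        self.buffer = []
        self.flushes = []           # each flush would become a new MISP event
        self.last_flush = clock()

    def add(self, event):
        self.buffer.append(event)
        size_hit = self.batch_size and len(self.buffer) >= self.batch_size
        interval_hit = (self.interval_event
                        and self.clock() - self.last_flush >= self.interval_event)
        if size_hit or interval_hit:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushes.append(list(self.buffer))  # new MISP event per batch
            self.buffer = []
        self.last_flush = self.clock()
```

A fake clock makes the interval path easy to exercise without real waiting, which is why the clock is an injectable parameter here.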
What do you think?
Member
And prevent data loss too.
Description
Implementation of batch mode for the MISP feed output bot, to resolve slow performance when the queue has many events. The existing code is prone to performance issues: the feed-writing code is executed for every event in the queue, making the bot extremely slow as events arrive and the feed becomes larger.
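As a back-of-the-envelope model (not the bot's actual code): if the whole feed is regenerated after every event, the total number of records written grows quadratically with the feed size, while batching reduces it by roughly the batch size:

```python
import math


def per_event_writes(n):
    """Records written if the whole feed is rewritten after every event."""
    return n * (n + 1) // 2  # 1 + 2 + ... + n


def batched_writes(n, batch_size):
    """Records written if the feed is rewritten once per batch
    (including a final flush of the partial last batch)."""
    flushes = math.ceil(n / batch_size)
    return sum(min((i + 1) * batch_size, n) for i in range(flushes))
```

For the 8k-event feed mentioned below, `per_event_writes(8000)` is roughly 100x `batched_writes(8000, 100)`, which matches the order-of-magnitude speedup reported in the benchmark.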
Motivation
We are trying to create feeds based on AlienVault OTX pulses containing thousands of IOCs per day. This is basically not possible with the current MISP feed output bot performance.
Fix
With this MR, batched feed creation is introduced. The user can now configure the batch size using the `batch_size` parameter. Batch functionality is based on the actual internal queue used by the bot.
Benchmark
On an average server, before this improvement a feed of 8k events required several hours to create; it now requires less than 5 minutes (depending on the available resources).
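For reference, a hypothetical `runtime.yaml` entry enabling the new parameter might look like the following (the module path and the existing parameter names follow the MISP feed output bot as I understand it; the bot id and all values are illustrative):

```yaml
misp-feed-output:
  group: Output
  module: intelmq.bots.outputs.misp.output_feed
  parameters:
    output_dir: /opt/intelmq/var/lib/bots/misp-feed
    interval_event: 1 hour
    batch_size: 100   # new parameter introduced by this PR
```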