Why ?
- We needlessly store, in Messagev3 properties, information that is either already in the headers or easy to obtain. This takes up space on Cassandra.
Table sizes (not tiered, with per-message entries) for 66 million emails (3 nodes, RF=3):
messagev3 table 17 GB
imapuidtable table 10 GB
messageidtable table 7 GB
email_query_view_received_at table 2 GB
firstunseen table 287 MB
thread_lookup_3 table 6 GB
We see a footprint of ~2-3KB (replicated, tiered) per message.
We can expect a 33% reduction of the messagev3 size by removing the content description and properties fields, which translates to a 10-13% overall space saving. At scale, for 10 billion messages, this means 20 TB -> 18 TB. That is a lot of space for data that is useful only for IMAP FETCH BODYSTRUCTURE and could easily be recomputed.
- We count lines with an unoptimized input stream, reading byte by byte, for each message with content type text/* (PERF KILLER!), while the line count is useful only upon IMAP FETCH BODYSTRUCTURE. We'd rather move it to read time.
- Finally, MessageStorer triggers parsing for each and every message. We could easily carry over (after removing PropertyBuilder) the content type and trigger this expensive parsing if and only if the content type is multipart/* or content-disposition is attachment in the main headers, saving CPU on the write path.
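The last point boils down to a cheap check on the top-level headers before deciding whether to parse. A minimal sketch of that check, assuming the raw header values are already at hand; the class and method names (`ParsingDecision`, `needsFullParsing`) are hypothetical and not part of the existing code base:

```java
import java.util.Locale;

// Hypothetical helper: decides from the main headers alone whether the
// expensive full MIME parsing is worth triggering on the write path.
final class ParsingDecision {
    private ParsingDecision() {}

    // Both arguments are raw header values; null means the header is absent.
    static boolean needsFullParsing(String contentType, String contentDisposition) {
        String type = normalize(contentType);
        String disposition = normalize(contentDisposition);
        // Parse only for multipart/* messages or when the main headers
        // declare the body itself to be an attachment.
        return type.startsWith("multipart/") || disposition.startsWith("attachment");
    }

    private static String normalize(String headerValue) {
        return headerValue == null ? "" : headerValue.trim().toLowerCase(Locale.ROOT);
    }
}
```

With such a check, a plain text/* message with no content-disposition would skip parsing entirely on APPEND / reception.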
How ?
Remove PropertyBuilder from Message POJOs.
IMAP FETCH BODYSTRUCTURE operates on full content: we can easily recompute this in MessageResult POJO when (and only when) needed.
Take care to still carry over contentType and ContentDescription for the separate but related MessageStorer optimization.
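Recomputing the text line count at read time could rely on buffered reads rather than one `read()` call per byte. A minimal sketch, assuming a line is counted per LF byte (the exact counting rule used today is not stated here); `LazyLineCounter` is a hypothetical name:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: counts lines in a message body using chunked reads,
// meant to be called only when IMAP FETCH BODYSTRUCTURE needs the value.
final class LazyLineCounter {
    private LazyLineCounter() {}

    // Assumption: a "line" is terminated by LF, so CRLF counts once.
    static long countLines(InputStream in) {
        byte[] buffer = new byte[8192];
        long lines = 0;
        try {
            int read;
            // Read in 8 KB chunks instead of byte per byte.
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buffer[i] == '\n') {
                        lines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Since the value is derived from the full content, nothing needs to be stored for it at write time.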
Expected gains
Significant CPU gains for text/* message APPEND / reception
~10% data reduction on Cassandra