adding support for pdf-object-hashing#146
Conversation
adding: - scan_pdf_obj_hash.py - requirements.txt to pull in the main branch for the most up-to-date pdf object hashing library - updating scanners.yaml to also run this scanner on PDFs
adding a time out
| oletools==0.60.1 | ||
| opencv-python==4.8.1.78 | ||
| opencv-contrib-python==4.8.1.78 | ||
| pdf-object-hashing @ git+https://github.com/0xkyle/pdf_object_hashing.git |
There was a problem hiding this comment.
So I have to ask as I don't know the answer at all but we may want to see if we're cool with using this package (I know it's yours) but currently it's Apache 2.0 license so just wanted to make sure. I know MIT is cool but I personally don't know much about apache license types and wanted to atleast point it out.
There was a problem hiding this comment.
I don't know much about license types either, but it looks like strelka itself is under apache 2.0 https://github.com/target/strelka/blob/master/LICENSE
| self.event["object_hash"] = obj_hash | ||
| self.event["hash_string"] = obj_hash_str | ||
| else: | ||
| self.event["object_hash"] = False |
There was a problem hiding this comment.
I don't know if this is just considered 'pythonic', but switching up the data type here gives me pause. Up above, we set it to a string "error" when there's a problem, but here we just set it to false. I'm assuming we're doing this so that we can quickly see that there isn't an object hash or a hash string, but then we'd need to check the string in case it exists to see if its an error or not anyway.
Could we track the error and/or flag presence in a field other than object_hash/hash_string, and only populate those keys in the dict if we have a meaningful value to provide?
| limit: 2000 | ||
| pdf_to_png: True | ||
| no_object_extraction: True | ||
| 'ScanPdfObjHash': |
There was a problem hiding this comment.
Should we use our existing PDF scanner instead?
There was a problem hiding this comment.
I’m open to migrating this into the existing scanner, but keeping it separate makes sense to me given that the object hashing has a much narrower scope than the overall PDF scanner. I also like the idea of not impacting existing PDF processes in the scanner (pdf->png) or having the hashing guardrails (timeouts, object thresholds) affect the main PDF scanner. That being said, if you have ideas on how to proceed, I’m all ears.
| from hashlib import md5 | ||
| from pdf_object_hashing import pdf_object as po | ||
|
|
||
| class ScanPdfObjHash(strelka.Scanner): |
There was a problem hiding this comment.
Does this do something for every object? How long is it and is it expensive? We've observed in the past that naively getting every object is expensive and creates a bunch of noise. See #116 for more context, and we can also chat further internally if needed
There was a problem hiding this comment.
responded internally
dlynch-sublime
left a comment
There was a problem hiding this comment.
changes look good for the comments I raised; I'm assuming you and @rw-access are working out resolutions to his comments
|
Chatted with @rw-access and we're going to go ahead with this version and monitor for any hiccups. |
adding:
Describe the change
This adds in support for PDF object hashing, a method to create a structural hash for PDF objects (think imphash or ja3). This allows us to match complete matches with the
object_hashvalue and partial matches by matching parts of thehash_stringDescribe testing procedures
I build the scanner in our strelka docker instance on a VM, and used oneshot to send the file to the strelka scanner. After that was working I added the fields to the platform and tested there, testing against ~200 PDFs attached to emails.
Sample output
Checklist