1. The current YARA syntax only lets a condition count the total number of matches of a string (for example #phone); it cannot inspect what was actually matched.
For example, with a regular expression for phone numbers, a rule can test how many times the pattern matched, but it cannot constrain how many distinct phone numbers were matched.
rule example
{
    strings:
        $phone = /\d{11}/
    condition:
        #phone > 10
}
2. A new syntax is needed to express this.
#cond cardinality {min_times} {op} {count_limit}
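Here {min_times} is the occurrence threshold for a single matched value, {op} is a comparison operator, and {count_limit} is the bound on the number of distinct matched values that pass the threshold.
The following expression means that more than 20 distinct phone numbers each appear more than 10 times, and fewer than 50 distinct phone numbers each appear more than 5 times.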
rule example
{
    strings:
        $phone = /\d{11}/
    condition:
        #phone cardinality 10 > 20 and #phone cardinality 5 < 50
}
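To make the intended semantics concrete, here is a minimal Python sketch of what `#phone cardinality {min_times}` could evaluate to. It is not part of YARA; the function name `cardinality` is hypothetical, and the threshold is assumed to be strict ("more than min_times"), following the wording above. An inclusive threshold would only change `>` to `>=`.

```python
import re
from collections import Counter

def cardinality(matches, min_times):
    """Number of distinct matched values that occur more than min_times times.

    Hypothetical helper: the strict comparison is an assumption taken from
    the prose description of the proposed syntax.
    """
    freq = Counter(matches)
    return sum(1 for n in freq.values() if n > min_times)

# Three distinct numbers occurring 3, 1 and 2 times respectively.
data = "13800000000 " * 3 + "13900000000 " + "13700000000 " * 2
phones = re.findall(r"\d{11}", data)

total_matches = len(phones)              # what #phone reports today
distinct_over_1 = cardinality(phones, 1)  # numbers seen more than once
```

With this sample, `#phone` alone cannot tell six different numbers apart from one number repeated six times, while the cardinality value can.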
3. About performance.
Because of the limited data structures available in the C language, implementing the frequency calculation inside YARA is relatively complex. Therefore only option C below is provided: YARA exposes the matches, and the frequency calculation itself, including the optimizations in A and B, is left to the DLL caller.
A. Lazy calculation of match frequencies
B. No repeated frequency calculations
C. User calculates the frequencies
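On the caller side, options A and B could look roughly like the following Python sketch. The class `MatchFrequencies` and its `builds` counter are hypothetical and exist only for illustration; the input is assumed to be a plain list of matched strings already obtained from the scan, and the strict "more than min_times" threshold is again an assumption.

```python
from collections import Counter

class MatchFrequencies:
    """Hypothetical caller-side helper for the matched values of one string."""

    def __init__(self, matches):
        self._matches = matches
        self._freq = None   # A: nothing is computed until a query needs it
        self.builds = 0     # illustration only: how often the table was built

    def _table(self):
        if self._freq is None:
            self._freq = Counter(self._matches)
            self.builds += 1
        return self._freq   # B: later queries reuse the same table

    def cardinality(self, min_times):
        # Distinct values occurring more than min_times times (assumed strict).
        return sum(1 for n in self._table().values() if n > min_times)

freqs = MatchFrequencies(["a", "a", "a", "b", "b", "c"])
# Equivalent of: #s cardinality 1 > 1 and #s cardinality 2 < 2
result = freqs.cardinality(1) > 1 and freqs.cardinality(2) < 2
```

Both sub-expressions share one frequency table, so a condition that uses several cardinality terms pays for the counting pass only once, and a rule that never reaches a cardinality term pays nothing.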
4. About implementation.



