Description
When symbols are extracted from the binary files of the dataset, the base64 encoding of the function prototype is used to build the ground truth of identical functions.
However, across different compilers and platforms the function prototype does not stay the same.
As a result, the algorithm may put the same function (with a different prototype) in the repulsion file in the training and validation sets.
Since this is a frequent case, I believe it may have affected the evaluation significantly.
For example:
WideToChar(wchar_t const*, char*, unsigned long)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp
WideToChar(wchar_t const*, char*, unsigned int)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk=
These two prototypes each exist in 22 distinct files of the dataset. They refer to the same function, but training will try to make them look different.
QuickOpen::ReadRaw(RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
QuickOpen::ReadRaw( RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
The first prototype appeared in 64 files, while the second appeared in 41 different files. Some compilers simply put a space before the parameter, and this is enough to cause trouble in training.
There are many cases like this.
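The mismatch is easy to reproduce. The sketch below assumes the label is simply the standard base64 encoding of the UTF-8 prototype string; the helper name `label` is hypothetical and not taken from the project's code.

```python
import base64

def label(prototype: str) -> str:
    # Hypothetical helper: base64 of the full prototype, as described above.
    return base64.b64encode(prototype.encode("utf-8")).decode("ascii")

# The two spellings of the same function differ by a single space.
a = label("QuickOpen::ReadRaw(RawRead&)")
b = label("QuickOpen::ReadRaw( RawRead&)")

print(a)       # UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
print(b)       # UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
print(a == b)  # False: treated as two different functions
```

Because the labels are compared as exact strings, any formatting or type-width difference in the prototype produces a distinct label.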
Suggestion: Use the name of the function as a symbol (without the parameters)
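
A minimal sketch of this suggestion, assuming the prototypes are demangled strings like the ones above. The helpers `function_name` and `symbol_label` are hypothetical names, not part of the existing code. The parameter list is located by matching the last closing parenthesis backwards, so names such as `operator()` and trailing qualifiers such as `const` are handled.

```python
import base64

def function_name(prototype: str) -> str:
    """Return the prototype with its trailing parameter list removed."""
    end = prototype.rfind(")")
    if end == -1:
        return prototype.strip()  # no parameter list at all
    depth = 0
    # Walk backwards from the last ')' to its matching '('.
    for i in range(end, -1, -1):
        ch = prototype[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
            if depth == 0:
                return prototype[:i].strip()
    return prototype.strip()  # unbalanced parentheses: leave unchanged

def symbol_label(prototype: str) -> str:
    # Hypothetical label: base64 of the name only, without the parameters.
    name = function_name(prototype)
    return base64.b64encode(name.encode("utf-8")).decode("ascii")

same_wide = (
    symbol_label("WideToChar(wchar_t const*, char*, unsigned long)")
    == symbol_label("WideToChar(wchar_t const*, char*, unsigned int)")
)
same_raw = (
    symbol_label("QuickOpen::ReadRaw(RawRead&)")
    == symbol_label("QuickOpen::ReadRaw( RawRead&)")
)
print(same_wide, same_raw)  # True True
```

One trade-off to keep in mind: dropping the parameters also merges genuinely different overloads that share a name, so they would be treated as the same function.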