Inefficient creation of repulsion sets #17

When symbols are extracted from the binary files of the dataset, the base64 encoding of the function prototype is used to build the ground truth of identical functions.
However, across different compilers and platforms the prototype of a function does not stay the same.
As a result, the algorithm may put the same function (with a different prototype) in the repulsion file of the training and validation sets.
Since this case is frequent, I believe it may have affected the evaluation significantly.

For example:
WideToChar(wchar_t const*, char*, unsigned long)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGxvbmcp
WideToChar(wchar_t const*, char*, unsigned int)
BASE64: V2lkZVRvQ2hhcih3Y2hhcl90IGNvbnN0KiwgY2hhciosIHVuc2lnbmVkIGludCk=
These two prototypes each appear in 22 distinct files of the dataset. They refer to the same function (only the type of the last parameter differs, `unsigned long` versus `unsigned int`), but the training will try to make them look different.

QuickOpen::ReadRaw(RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
QuickOpen::ReadRaw( RawRead&)
BASE64: UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
The first prototype appears in 64 files, while the second appears in 41 different files. Some compilers simply emit a space before the parameter, which is enough to change the base64 label and cause trouble during training.

There are many more cases like these.
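The mismatch is easy to reproduce. The following is a minimal sketch, assuming the labels are the plain base64 encoding of the UTF-8 prototype string; the helper name `label` is hypothetical and not taken from the repository.

```python
import base64


def label(prototype: str) -> str:
    """Base64-encode a prototype string (assumed to match how the labels above were built)."""
    return base64.b64encode(prototype.encode("utf-8")).decode("ascii")


# Same function, but one compiler emits a space before the parameter.
a = label("QuickOpen::ReadRaw(RawRead&)")
b = label("QuickOpen::ReadRaw( RawRead&)")

print(a)       # UXVpY2tPcGVuOjpSZWFkUmF3KFJhd1JlYWQmKQ==
print(b)       # UXVpY2tPcGVuOjpSZWFkUmF3KCBSYXdSZWFkJik=
print(a == b)  # False: the two spellings get different ground-truth labels
```

Because the labels are compared as opaque strings, any whitespace or type-spelling difference is treated as a different function.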

Suggestion: use the name of the function alone (without the parameters) as the symbol.
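One possible way to derive such a symbol is sketched below. It assumes the prototypes are demangled C++ strings whose parameter list is the last parenthesized group; the function name `function_name` is hypothetical and not part of the project.

```python
def function_name(prototype: str) -> str:
    """Return the prototype with its trailing parameter list removed."""
    end = prototype.rfind(")")
    if end == -1:
        # No parameter list at all: use the string as-is.
        return prototype.strip()
    depth = 0
    # Walk backwards from the last ')' to its matching '(' so that nested
    # parentheses (e.g. function-pointer parameters) are skipped correctly.
    for i in range(end, -1, -1):
        ch = prototype[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
            if depth == 0:
                return prototype[:i].strip()
    # Unbalanced parentheses: fall back to the original string.
    return prototype.strip()


variants = [
    "WideToChar(wchar_t const*, char*, unsigned long)",
    "WideToChar(wchar_t const*, char*, unsigned int)",
    "QuickOpen::ReadRaw(RawRead&)",
    "QuickOpen::ReadRaw( RawRead&)",
]
names = {function_name(v) for v in variants}
print(sorted(names))  # ['QuickOpen::ReadRaw', 'WideToChar']
```

One trade-off to keep in mind: dropping the parameters also merges genuine C++ overloads that share a name, so some distinct functions would end up with the same label.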
