Training and Testing Set Generation

Machine learning requires testing sets that resemble the training sets in some respects but differ in others. For example, when dealing with molecules, the two sets should have similar distributions of properties such as logP and molecular weight. At the same time, the compounds in the two sets should be structurally distinct: analogs of the compounds in the testing set should not appear in the training set, so that the two sets remain independent. As a further complication, we need the same ratio of actives to inactives in the testing and training sets. What algorithm can generate sets like these?
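One possible starting point, offered as a minimal sketch rather than a definitive answer, is a cluster-based split with random restarts. It assumes each compound already carries a structural cluster label (for example from scaffold or similarity clustering, which is not shown here), an activity flag, and a logP value. Whole clusters are assigned to one side only, so analogs never straddle the split, and the trial that best matches the active ratio and mean logP of the two sets is kept. The function name `split_by_cluster`, the record keys, and the equal weighting of the two score terms are all hypothetical choices made for illustration. Comparing mean logP is only a crude proxy for comparing full distributions.

```python
import random
from collections import defaultdict
from statistics import mean


def split_by_cluster(records, test_fraction=0.2, n_trials=200, seed=0):
    """Split records into train/test indices without dividing any cluster.

    records: list of dicts with keys 'cluster', 'active' (bool), 'logp' (float).
    These keys are assumptions made for this sketch.
    """
    rng = random.Random(seed)

    # Group compound indices by their (precomputed) structural cluster.
    clusters = defaultdict(list)
    for i, rec in enumerate(records):
        clusters[rec["cluster"]].append(i)
    cluster_ids = sorted(clusters)
    target = test_fraction * len(records)

    def active_rate(idx):
        return mean(1.0 if records[i]["active"] else 0.0 for i in idx)

    def mean_logp(idx):
        return mean(records[i]["logp"] for i in idx)

    best = None
    for _ in range(n_trials):
        order = cluster_ids[:]
        rng.shuffle(order)
        train, test = [], []
        for cid in order:
            # A whole cluster goes to the test set while it still fits,
            # otherwise to the training set.
            if len(test) + len(clusters[cid]) <= target:
                test.extend(clusters[cid])
            else:
                train.extend(clusters[cid])
        if not train or not test:
            continue
        # Lower is better: mismatch in active ratio plus mismatch in mean logP.
        score = (abs(active_rate(train) - active_rate(test))
                 + abs(mean_logp(train) - mean_logp(test)))
        if best is None or score < best[0]:
            best = (score, sorted(train), sorted(test))

    if best is None:
        raise ValueError("no valid split found; check test_fraction and cluster sizes")
    return best[1], best[2]


# Synthetic demo data: 10 clusters of 4 compounds, one active per cluster.
records = [
    {"cluster": c, "active": j == 0, "logp": 1.0 + 0.1 * c + 0.05 * j}
    for c in range(10)
    for j in range(4)
]
train_idx, test_idx = split_by_cluster(records, test_fraction=0.2)
```

Further properties such as molecular weight could be added as extra terms in the score, ideally after scaling each property so that no single term dominates.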
