December 4, 2011

Circular fingerprints in CDK

I've been working on some code that computes circular fingerprints for all atoms in a molecule with the help of the CDK java library.
The code implements the circular fingerprints from a paper by Xing et al. (J. Chem. Inf. Comput. Sci. 2003, 43, 870-879).

If anyone who has the skills and time would turn this into a proper CDK function which could be implemented into the CDK library that would probably be helpful for other people. Here follows my code with assumes that it's applied in java class where this is an AtomContainer.

 
// Calculates the atom environment descriptor of all atoms
public void calculateAtomEnvironmentDescriptor() throws CDKException{
        //maxLevel is how many bonds from the atom we will count atom types
 int maxLevel = 5;
 IAtomType[] atomTypes = null;
 int natoms = this.getAtomCount();
 //do atom type matching
 IAtomTypeMatcher atm = SybylAtomTypeMatcher.getInstance(NoNotificationChemObjectBuilder.getInstance());
 InputStream ins = this.getClass().getClassLoader().getResourceAsStream("org/openscience/cdk/dict/data/sybyl-atom-types.owl");
 AtomTypeFactory factory = AtomTypeFactory.getInstance(ins,"owl",NoNotificationChemObjectBuilder.getInstance());
 atomTypes = factory.getAllAtomTypes();
 //map atomtypes to the atomIndex integer array
 TreeMap map = new TreeMap();
    for (int i = 0; i < atomTypes.length; i++) { 
        map.put(atomTypes[i].getAtomTypeName(),new Integer(i));
    }
    int[] atomIndex = new int[natoms]; //array of atom type integers
    for (int i = 0; i < natoms; i++) {
        try {
            IAtomType a = atm.findMatchingAtomType(this,this.getAtom(i));
            if ( a != null) {
                Object mappedType = map.get(a.getAtomTypeName());
                if (mappedType != null) 
                    atomIndex[i] = ((Integer) mappedType).intValue();
                else {
                    //System.out.println(a.getAtomTypeName() + " not found in " + map);
                    atomIndex[i] = -1;
                }    
            } else //atom type not found 
             atomIndex[i] = -1;
        } catch (Exception x) {
            x.printStackTrace();
            throw new CDKException(x.getMessage() + "\ninitConnectionMatrix");
        }                
    }
    //compute bond distances between all atoms
    int[][] aMatrix = PathTools.computeFloydAPSP(AdjacencyMatrix.getMatrix(this));
    //assign values to the results arrays for all atoms
 int L = (atomTypes.length +1) ;
 int [][] result = new int[natoms][L*(maxLevel)+2]; //create result array
 for (int i = 0; i < natoms; i++) {
  //for every atom, iterate through its connections to all other atoms
  for (int j=0; j < natoms; j++) {
   if (aMatrix[i][j] == 0) 
    result[i][1] = atomIndex[j]; //atom j is atom i
         else if (aMatrix[i][j] > 0 && aMatrix[i][j] <= maxLevel){ //j is not atom i and bonds less or equal to maxlevel
          if (atomIndex[j] >= 0) //atom type defined in factory
              result[i][L*(aMatrix[i][j]-1)+atomIndex[j]+2]++;
          else if (atomIndex[j] == -1) //-1, unknown  type
           result[i][L*(aMatrix[i][j]-1)+(L-1)+2]++;
         }
  }
  //checksum for easy comparison        
  for (int j = 1; j < result[i].length; j++) result[i][0] += result[i][j];
 }
 if (debug == 1){
        //print out the names of atom types for reference use
     System.out.print("Sum\tAtomType\t");
  for (int j=0; j < factory.getSize(); j++) {
   System.out.print(j);
   System.out.print(".");
      System.out.print(atomTypes[j].getAtomTypeName());
      System.out.print("\t");
     }       
     System.out.println("");
     for (int i = 0; i < natoms; i++)
      System.out.println(Arrays.toString(result[i]));
 }
}
Big thanks to Nina Jeliazkova who gave me a preliminary version of the code from her old Ambit code. I have rewritten her code so now it's bug free and faster.