December 4, 2011

Circular fingerprints in CDK

I've been working on some code that computes circular fingerprints for all atoms in a molecule with the help of the CDK java library.
The code implements the circular fingerprints from a paper by Xing et al. (J. Chem. Inf. Comput. Sci. 2003, 43, 870-879).

If anyone who has the skills and time would turn this into a proper CDK function which could be implemented into the CDK library that would probably be helpful for other people. Here follows my code with assumes that it's applied in java class where this is an AtomContainer.

 
// Calculates the atom environment descriptor of all atoms
public void calculateAtomEnvironmentDescriptor() throws CDKException{
        //maxLevel is how many bonds from the atom we will count atom types
 int maxLevel = 5;
 IAtomType[] atomTypes = null;
 int natoms = this.getAtomCount();
 //do atom type matching
 IAtomTypeMatcher atm = SybylAtomTypeMatcher.getInstance(NoNotificationChemObjectBuilder.getInstance());
 InputStream ins = this.getClass().getClassLoader().getResourceAsStream("org/openscience/cdk/dict/data/sybyl-atom-types.owl");
 AtomTypeFactory factory = AtomTypeFactory.getInstance(ins,"owl",NoNotificationChemObjectBuilder.getInstance());
 atomTypes = factory.getAllAtomTypes();
 //map atomtypes to the atomIndex integer array
 TreeMap map = new TreeMap();
    for (int i = 0; i < atomTypes.length; i++) { 
        map.put(atomTypes[i].getAtomTypeName(),new Integer(i));
    }
    int[] atomIndex = new int[natoms]; //array of atom type integers
    for (int i = 0; i < natoms; i++) {
        try {
            IAtomType a = atm.findMatchingAtomType(this,this.getAtom(i));
            if ( a != null) {
                Object mappedType = map.get(a.getAtomTypeName());
                if (mappedType != null) 
                    atomIndex[i] = ((Integer) mappedType).intValue();
                else {
                    //System.out.println(a.getAtomTypeName() + " not found in " + map);
                    atomIndex[i] = -1;
                }    
            } else //atom type not found 
             atomIndex[i] = -1;
        } catch (Exception x) {
            x.printStackTrace();
            throw new CDKException(x.getMessage() + "\ninitConnectionMatrix");
        }                
    }
    //compute bond distances between all atoms
    int[][] aMatrix = PathTools.computeFloydAPSP(AdjacencyMatrix.getMatrix(this));
    //assign values to the results arrays for all atoms
 int L = (atomTypes.length +1) ;
 int [][] result = new int[natoms][L*(maxLevel)+2]; //create result array
 for (int i = 0; i < natoms; i++) {
  //for every atom, iterate through its connections to all other atoms
  for (int j=0; j < natoms; j++) {
   if (aMatrix[i][j] == 0) 
    result[i][1] = atomIndex[j]; //atom j is atom i
         else if (aMatrix[i][j] > 0 && aMatrix[i][j] <= maxLevel){ //j is not atom i and bonds less or equal to maxlevel
          if (atomIndex[j] >= 0) //atom type defined in factory
              result[i][L*(aMatrix[i][j]-1)+atomIndex[j]+2]++;
          else if (atomIndex[j] == -1) //-1, unknown  type
           result[i][L*(aMatrix[i][j]-1)+(L-1)+2]++;
         }
  }
  //checksum for easy comparison        
  for (int j = 1; j < result[i].length; j++) result[i][0] += result[i][j];
 }
 if (debug == 1){
        //print out the names of atom types for reference use
     System.out.print("Sum\tAtomType\t");
  for (int j=0; j < factory.getSize(); j++) {
   System.out.print(j);
   System.out.print(".");
      System.out.print(atomTypes[j].getAtomTypeName());
      System.out.print("\t");
     }       
     System.out.println("");
     for (int i = 0; i < natoms; i++)
      System.out.println(Arrays.toString(result[i]));
 }
}
Big thanks to Nina Jeliazkova who gave me a preliminary version of the code from her old Ambit code. I have rewritten her code so now it's bug free and faster.

4 comments:

  1. Looks interesting. I have been working on CDK fingerprints (here: https://github.com/jonalv/cdk) I must take a close look at this when I have some time.

    ReplyDelete
  2. These are atomic spherical fingerprints, not traditional molecular fingerprints. I've also verified against many molecules that they work as they should.

    ReplyDelete
  3. Patrik, such atomic fingerprints can easily be used to create molecular fingerprints... this is, for example, done with atomic signatures. Effectively, every unique atomic fingerprint becomes a bit in the molecular fingerprint.

    If this codes makes it into the CDK, it would be the third kind of atomic environment descriptor, after the HOSE code and the signature. Can you explain why this circular fingerprint is more interesting than the others?

    ReplyDelete
  4. Well, turning this into molecular fingerprints maybe could be applied using a small MaxLevel (3 or something like that). When the level reaches 5 or 6 the vectors soon become unique (unless you have a really standard fragment)

    This fingerprint is interesting for two reasons:
    1. It is coded to generate atomic (not molecular) fingerprints. CDK is good at molecular properties but atomic are often not very well represented among descriptors (and non-existent among fingerprints)
    2. It has been validated to be applicable to the prediction of atomic properties.

    ReplyDelete