The code implements the circular fingerprints from a paper by Xing et al. (J. Chem. Inf. Comput. Sci. 2003, 43, 870-879).
If anyone with the skills and time wants to turn this into a proper CDK function that could be included in the CDK library, that would probably be helpful for other people. Here is my code, which assumes that it lives in a Java class where "this" is an AtomContainer.
Big thanks to Nina Jeliazkova, who gave me a preliminary version of the code from her old Ambit code. I have rewritten her code so that it is now bug free and faster.

// Needs java.io.InputStream, java.util.Arrays and java.util.TreeMap, plus the CDK classes used below.
// Calculates the atom environment descriptor of all atoms
public void calculateAtomEnvironmentDescriptor() throws CDKException {
    // maxLevel is how many bonds from the atom we will count atom types
    int maxLevel = 5;
    IAtomType[] atomTypes = null;
    int natoms = this.getAtomCount();

    // do atom type matching
    IAtomTypeMatcher atm = SybylAtomTypeMatcher.getInstance(NoNotificationChemObjectBuilder.getInstance());
    InputStream ins = this.getClass().getClassLoader().getResourceAsStream("org/openscience/cdk/dict/data/sybyl-atom-types.owl");
    AtomTypeFactory factory = AtomTypeFactory.getInstance(ins, "owl", NoNotificationChemObjectBuilder.getInstance());
    atomTypes = factory.getAllAtomTypes();

    // map atom types to the atomIndex integer array
    TreeMap<String, Integer> map = new TreeMap<String, Integer>();
    for (int i = 0; i < atomTypes.length; i++) {
        map.put(atomTypes[i].getAtomTypeName(), Integer.valueOf(i));
    }

    int[] atomIndex = new int[natoms]; // array of atom type integers
    for (int i = 0; i < natoms; i++) {
        try {
            IAtomType a = atm.findMatchingAtomType(this, this.getAtom(i));
            if (a != null) {
                Object mappedType = map.get(a.getAtomTypeName());
                if (mappedType != null)
                    atomIndex[i] = ((Integer) mappedType).intValue();
                else {
                    // System.out.println(a.getAtomTypeName() + " not found in " + map);
                    atomIndex[i] = -1;
                }
            } else // atom type not found
                atomIndex[i] = -1;
        } catch (Exception x) {
            x.printStackTrace();
            throw new CDKException(x.getMessage() + "\ninitConnectionMatrix");
        }
    }

    // compute bond distances between all atoms
    int[][] aMatrix = PathTools.computeFloydAPSP(AdjacencyMatrix.getMatrix(this));

    // assign values to the results arrays for all atoms
    // each level has one slot per atom type plus one slot for unknown types
    int L = (atomTypes.length + 1);
    int[][] result = new int[natoms][L * (maxLevel) + 2]; // create result array
    for (int i = 0; i < natoms; i++) {
        // for every atom, iterate through its connections to all other atoms
        for (int j = 0; j < natoms; j++) {
            if (aMatrix[i][j] == 0)
                result[i][1] = atomIndex[j]; // atom j is atom i
            else if (aMatrix[i][j] > 0 && aMatrix[i][j] <= maxLevel) {
                // j is not atom i and bonds less or equal to maxLevel
                if (atomIndex[j] >= 0) // atom type defined in factory
                    result[i][L * (aMatrix[i][j] - 1) + atomIndex[j] + 2]++;
                else if (atomIndex[j] == -1) // -1, unknown type
                    result[i][L * (aMatrix[i][j] - 1) + (L - 1) + 2]++;
            }
        }
        // checksum for easy comparison
        for (int j = 1; j < result[i].length; j++)
            result[i][0] += result[i][j];
    }

    // debug is assumed to be an int field of the enclosing class
    if (debug == 1) {
        // print out the names of atom types for reference use
        System.out.print("Sum\tAtomType\t");
        for (int j = 0; j < factory.getSize(); j++) {
            System.out.print(j);
            System.out.print(".");
            System.out.print(atomTypes[j].getAtomTypeName());
            System.out.print("\t");
        }
        System.out.println("");
        for (int i = 0; i < natoms; i++)
            System.out.println(Arrays.toString(result[i]));
    }
}
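To make the layout of the result array easier to follow, here is a minimal sketch of the same counting scheme without any CDK dependency. The class name AtomEnvironmentSketch, its method names, and the toy C-C-O example are my own assumptions for illustration. A plain Floyd-Warshall loop stands in for PathTools.computeFloydAPSP, and hand-written type indices stand in for the Sybyl atom typing.

```java
import java.util.Arrays;

// Hypothetical, CDK-free sketch of the counting scheme (names are illustrative only).
class AtomEnvironmentSketch {

    // All-pairs shortest path lengths in bonds, computed with Floyd-Warshall
    // from a symmetric 0/1 adjacency matrix.
    static int[][] bondDistances(int[][] adjacency) {
        int n = adjacency.length;
        int inf = Integer.MAX_VALUE / 2; // large value that cannot overflow when added once
        int[][] d = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                d[i][j] = (i == j) ? 0 : (adjacency[i][j] != 0 ? 1 : inf);
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (d[i][k] + d[k][j] < d[i][j])
                        d[i][j] = d[i][k] + d[k][j];
        return d;
    }

    // One row per atom: [checksum, own type, counts for level 1, ..., counts for maxLevel].
    // Each level has nTypes slots plus one trailing slot for unknown types (index -1).
    static int[][] fingerprints(int[][] dist, int[] typeIndex, int nTypes, int maxLevel) {
        int n = typeIndex.length;
        int L = nTypes + 1;
        int[][] result = new int[n][L * maxLevel + 2];
        for (int i = 0; i < n; i++) {
            result[i][1] = typeIndex[i];
            for (int j = 0; j < n; j++) {
                int d = dist[i][j];
                if (d >= 1 && d <= maxLevel) {
                    int slot = typeIndex[j] >= 0 ? typeIndex[j] : L - 1;
                    result[i][L * (d - 1) + slot + 2]++;
                }
            }
            // checksum over everything except slot 0 itself
            for (int c = 1; c < result[i].length; c++)
                result[i][0] += result[i][c];
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy heavy-atom chain C-C-O with made-up type indices: C = 0, O = 1.
        int[][] adjacency = {{0, 1, 0}, {1, 0, 1}, {0, 1, 0}};
        int[] typeIndex = {0, 0, 1};
        int[][] fp = fingerprints(bondDistances(adjacency), typeIndex, 2, 2);
        for (int[] row : fp)
            System.out.println(Arrays.toString(row));
        // First atom:  [2, 0, 1, 0, 0, 0, 1, 0]  (one C at distance 1, one O at distance 2)
        // Last atom:   [3, 1, 1, 0, 0, 1, 0, 0]  (one C at distance 1, one C at distance 2)
    }
}
```

With two atom types and maxLevel = 2, each vector has 3 * 2 + 2 = 8 entries, which matches the L*(maxLevel)+2 sizing in the code above.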
Looks interesting. I have been working on CDK fingerprints (here: https://github.com/jonalv/cdk). I must take a closer look at this when I have some time.
These are atomic spherical fingerprints, not traditional molecular fingerprints. I've also verified against many molecules that they work as they should.
Patrik, such atomic fingerprints can easily be used to create molecular fingerprints... this is, for example, done with atomic signatures. Effectively, every unique atomic fingerprint becomes a bit in the molecular fingerprint.
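One way to read "every unique atomic fingerprint becomes a bit" is to hash each atom vector and set the corresponding position in a fixed-length bit set. The sketch below does that. It is only an assumption about how the idea could be realised, not how CDK signatures do it. The class MolecularFromAtomicSketch, the method fold, and the bit length are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: fold per-atom vectors into one molecular bit set.
class MolecularFromAtomicSketch {

    // Each atom vector is hashed and mapped to one of nBits positions.
    // Identical atom vectors always hit the same bit; different vectors
    // may collide, as with any hashed fingerprint.
    static BitSet fold(int[][] atomVectors, int nBits) {
        BitSet bits = new BitSet(nBits);
        for (int[] vector : atomVectors) {
            int hash = Arrays.hashCode(vector);
            bits.set(Math.floorMod(hash, nBits)); // floorMod keeps the index non-negative
        }
        return bits;
    }

    public static void main(String[] args) {
        // Made-up atom vectors; two of the three atoms share an environment.
        int[][] atomVectors = {
            {2, 0, 1, 0, 0, 0, 1, 0},
            {2, 0, 1, 0, 0, 0, 1, 0},
            {3, 1, 1, 0, 0, 1, 0, 0}
        };
        BitSet fingerprint = fold(atomVectors, 1024);
        System.out.println(fingerprint);
    }
}
```

The checksum in slot 0 is hashed along with the rest of the vector here. Dropping it before hashing would not change which vectors count as identical.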
If this code makes it into the CDK, it would be the third kind of atomic environment descriptor, after the HOSE code and the signature. Can you explain why this circular fingerprint is more interesting than the others?
Well, turning this into molecular fingerprints could perhaps work with a small maxLevel (3 or something like that). Once the level reaches 5 or 6, the vectors quickly become unique (unless you have a really standard fragment).
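The effect of the level on uniqueness can be checked by truncating the atom vectors at a given level and counting how many distinct ones remain. Below is a rough sketch of that check. The class EnvironmentUniquenessSketch and the method distinctAtLevel are hypothetical names, and the vectors are written by hand for two atoms of a five-carbon chain, with a single atom type plus the unknown slot.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: count distinct atom environments up to a given level.
class EnvironmentUniquenessSketch {

    // slotsPerLevel is the number of atom types plus one (the L of the main code).
    // Slot 0 (the checksum over the full vector) is skipped, because it also
    // depends on the levels beyond the cut-off.
    static int distinctAtLevel(int[][] vectors, int slotsPerLevel, int level) {
        Set<List<Integer>> seen = new HashSet<>();
        for (int[] v : vectors) {
            int end = Math.min(v.length, slotsPerLevel * level + 2);
            List<Integer> key = new ArrayList<>();
            for (int c = 1; c < end; c++)
                key.add(v[c]);
            seen.add(key);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Second and middle atom of a chain C-C-C-C-C, maxLevel = 2:
        // both have two carbons at distance 1, but one vs two carbons at distance 2.
        int[][] vectors = {
            {3, 0, 2, 0, 1, 0},
            {4, 0, 2, 0, 2, 0}
        };
        System.out.println(distinctAtLevel(vectors, 2, 1)); // 1: same at level 1
        System.out.println(distinctAtLevel(vectors, 2, 2)); // 2: different at level 2
    }
}
```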
This fingerprint is interesting for two reasons:
1. It is coded to generate atomic (not molecular) fingerprints. CDK is good at molecular properties, but atomic properties are often poorly represented among descriptors (and non-existent among fingerprints).
2. It has been validated to be applicable to the prediction of atomic properties.