{
"cells": [
{
"cell_type": "markdown",
"id": "216aeebb",
"metadata": {},
"source": [
"# Getting started"
]
},
{
"cell_type": "markdown",
"id": "931c4a3a",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "markdown",
"id": "a82abbf0",
"metadata": {},
"source": [
"To install the last version of `ugropy` uploaded to [PyPI](https://pypi.org/project/ugropy/) simply do:\n",
"\n",
"```\n",
"pip install ugropy\n",
"```\n",
"\n",
"If you are working on a google colab use:\n",
"\n",
"```\n",
"%pip install ugropy\n",
"```\n",
"\n",
"If you are a developer who wants to contribute to `ugropy`, you can install\n",
"it in development mode by cloning the repository and running:\n",
"\n",
"```\n",
"pip install -e .\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "fe3933f4",
"metadata": {},
"source": [
"## Citing ugropy"
]
},
{
"cell_type": "markdown",
"id": "8edd9cd1",
"metadata": {},
"source": [
"If you use `ugropy` in your research, please cite the following [article](https://doi.org/10.1021/acs.iecr.5c02552):\n",
"\n",
"```\n",
"@article{brandolin2025ugropy,\n",
" title={Ugropy: An Extensible Python Package for Thermodynamic Model Functional Group Identification via Mathematical Optimization},\n",
" author={Brandol{\\'\\i}n, Salvador E and Benelli, Federico E and Magario, Ivana and Scilipoti, Jos{\\'e} A},\n",
" journal={Industrial \\& Engineering Chemistry Research},\n",
" volume={64},\n",
" number={35},\n",
" pages={17217--17227},\n",
" year={2025},\n",
" publisher={ACS Publications},\n",
" doi = {10.1021/acs.iecr.5c02552}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "d1d908bc",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"source": [
"## About the algorithm"
]
},
{
"cell_type": "markdown",
"id": "62412ff7",
"metadata": {},
"source": [
"First, `ugropy`'s algorithm relies on three major dependency libraries:\n",
"\n",
"- [RDKit](https://www.rdkit.org/): for handling chemical structures and performing substructure searches.\n",
"- [PuLP](https://github.com/coin-or/pulp): for formulating and solving the integer optimization problems.\n",
"- [PubChemPy](https://github.com/mcs07/PubChemPy): for retrieving chemical information (SMILES) from the PubChem database.\n",
"\n",
"Without the capabilities of these libraries, `ugropy` would not be able to\n",
"perform its core functions effectively. `RDKit` allows for efficient\n",
"manipulation of chemical structures, while `PuLP` provides a powerful framework\n",
"for solving the optimization problems that arise in functional group\n",
"identification. `PubChemPy` enables access to SMILES strings from the\n",
"molecules' names.\n",
"\n",
"`ugropy`'s algorithm is based on the concept of fragmentation, which involves\n",
"selecting a set of fragments (functional groups) that can be used to represent\n",
"the molecule from a list of structures.\n",
"\n",
"The `RDKit` library is used to perform substructure searches, for example,\n",
"let's imagine that we have a very simple fragmentation model that only detects\n",
"the presence of methyl (CH3) and ethyl (CH2) fragments."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2474c9fa",
"metadata": {},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"\n",
"\n",
"# Definition of the fragments of our model\n",
"ch3 = Chem.MolFromSmarts(\"[CX4H3]\")\n",
"ch2 = Chem.MolFromSmarts(\"[CX4H2]\")"
]
},
{
"cell_type": "markdown",
"id": "77078f51",
"metadata": {},
"source": [
"In the majority of fragmentation models, we usually have to specify\n",
"which fragments are used to represent a molecule and the number of times that\n",
"fragments are present in the molecule. For example, if we want to fragment the\n",
"n-hexane molecule with our simple fragmentation model, we can simply do it with\n",
"the `RDKit` library as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8436e8c0",
"metadata": {},
"outputs": [],
"source": [
"# Molecule to fragment\n",
"hexane = Chem.MolFromSmiles(\"CCCCCC\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "500eaac9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((0,), (5,))"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ch3_detections = hexane.GetSubstructMatches(ch3)\n",
"\n",
"ch3_detections"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "b30589f7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1,), (2,), (3,), (4,))"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ch2_detections = hexane.GetSubstructMatches(ch2)\n",
"\n",
"ch2_detections"
]
},
{
"cell_type": "markdown",
"id": "aa78ea1a",
"metadata": {},
"source": [
"`RDKit` is returning a tuple of tuples for each fragment, where each inner\n",
"tuple contains the indices of the atoms that match the fragment. The length of\n",
"the outer tuple indicates how many times the fragment is present in the\n",
"molecule. In this case, we have 2 methyl groups and 4 ethyl groups in the\n",
"n-hexane molecule. For that we can directly build a dictionary that represents\n",
"the result of the n-hexane fragmentation with our simple model:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "79416177",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'CH3': 2, 'CH2': 4}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"solution = {\n",
" \"CH3\": len(ch3_detections),\n",
" \"CH2\": len(ch2_detections),\n",
"}\n",
"\n",
"solution"
]
},
{
"cell_type": "markdown",
"id": "bd6cdbae",
"metadata": {},
"source": [
"This simple molecule with this simple fragmentation model was the start of the\n",
"development of `ugropy`:\n",
"\n",
" \"Well... now we just need to add all the fragments of a model like UNIFAC,\n",
" make a direct detection with RDKit for each one and that's all... right?\".\n",
"\n",
"Well, no. The problem is that some fragments can overlap on the molecule's\n",
"atoms when performing a direct detection. Let's check the following example:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "59d1608e",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from rdkit import Chem\n",
"from rdkit.Chem.Draw import rdMolDraw2D\n",
"from IPython.display import SVG\n",
"\n",
"\n",
"mol = Chem.MolFromSmiles(\"c1ccccc1Cc2ccccc2\")\n",
"\n",
"drawer = rdMolDraw2D.MolDraw2DSVG(700, 200)\n",
"opts = drawer.drawOptions()\n",
"\n",
"opts.addAtomIndices = True\n",
"\n",
"drawer.DrawMolecule(mol)\n",
"drawer.FinishDrawing()\n",
"\n",
"SVG(drawer.GetDrawingText())"
]
},
{
"cell_type": "markdown",
"id": "cc72aa5c",
"metadata": {},
"source": [
"In many fragmentation models like UNIFAC, there are fragments to differentiate\n",
"between aliphatic and aromatic carbons. Let's imagine the following\n",
"fragmentation model:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "80426abf",
"metadata": {},
"outputs": [],
"source": [
"# Definition of the fragments of our model\n",
"\n",
"# Aliphatic CH2\n",
"ch2 = Chem.MolFromSmarts(\"[CX4H2]\")\n",
"\n",
"# Aromatic CH\n",
"ach = Chem.MolFromSmarts(\"[cH]\")\n",
"\n",
"# Aromatic C without H\n",
"ac = Chem.MolFromSmarts(\"[cH0]\")\n",
"\n",
"# Aromatic C bonded with aliphatic CH2\n",
"acch2 = Chem.MolFromSmarts(\"[cH0][CX4H2]\")\n",
"\n",
"# List of fragments of our model\n",
"fragments = [ch2, ach, ac, acch2]\n",
"fragments_names = [\"CH2\", \"ACH\", \"AC\", \"ACCH2\"]"
]
},
{
"cell_type": "markdown",
"id": "54896ea5",
"metadata": {},
"source": [
"Let's analyze the molecule with our model doing a direct detection with\n",
"`RDKit`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8bb40c53",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"====================================================================\n",
"Atoms occupied by each fragment:\n",
"--------------------------------------------------------------------\n",
"CH2: ((6,),)\n",
"ACH: ((0,), (1,), (2,), (3,), (4,), (8,), (9,), (10,), (11,), (12,))\n",
"AC: ((5,), (7,))\n",
"ACCH2: ((5, 6), (7, 6))\n",
"\n",
"====================================================================\n",
"Occurrence of each fragment:\n",
"--------------------------------------------------------------------\n",
"{'CH2': 1, 'ACH': 10, 'AC': 2, 'ACCH2': 2}\n"
]
}
],
"source": [
"mol = Chem.MolFromSmiles(\"c1ccccc1Cc2ccccc2\")\n",
"occurrences = {}\n",
"\n",
"print(\"====================================================================\")\n",
"print(\"Atoms occupied by each fragment:\")\n",
"print(\"--------------------------------------------------------------------\")\n",
"for fragment, name in zip(fragments, fragments_names):\n",
" detections = mol.GetSubstructMatches(fragment)\n",
" print(f\"{name}: {detections}\")\n",
" occurrences[name] = len(detections)\n",
"\n",
"print(\"\\n====================================================================\")\n",
"print(\"Occurrence of each fragment:\")\n",
"print(\"--------------------------------------------------------------------\")\n",
"print(occurrences)"
]
},
{
"cell_type": "markdown",
"id": "e72ddf24",
"metadata": {},
"source": [
"Anyone with some experience with the UNIFAC model will know that the correct\n",
"fragmentation for this molecule is:\n",
"\n",
"```\n",
"{'ACH': 10, 'AC': 1, 'ACCH2': 1}\n",
"```\n",
"\n",
"You may notice that the direct detection of `ACCH2` fragment is getting two\n",
"detections. Moreover, the direct detection of `AC`, `ACCH2` and `CH2` are\n",
"overlapping in the same three atoms of the molecule.\n",
"\n",
"The `ugropy`'s algorithm creates an integer (binary) optimization to solve this\n",
"decision problem of which fragments use to occupy the atoms that present\n",
"overlaps. This mathematical optimization problem is called a Set Cover Problem,\n",
"and it is solved with the `PuLP` library.\n",
"\n",
"For more details about the algorithm, please refer to the [ugropy's article](https://doi.org/10.1021/acs.iecr.5c02552):\n",
"\n",
"\n",
"```\n",
"@article{brandolin2025ugropy,\n",
" title={Ugropy: An Extensible Python Package for Thermodynamic Model Functional Group Identification via Mathematical Optimization},\n",
" author={Brandol{\\'\\i}n, Salvador E and Benelli, Federico E and Magario, Ivana and Scilipoti, Jos{\\'e} A},\n",
" journal={Industrial \\& Engineering Chemistry Research},\n",
" volume={64},\n",
" number={35},\n",
" pages={17217--17227},\n",
" year={2025},\n",
" publisher={ACS Publications},\n",
" doi = {10.1021/acs.iecr.5c02552}\n",
"}\n",
"```\n",
"\n",
"\n",
"Now, let's use `ugropy` to obtain the correct fragmentation of the molecule\n",
"with the UNIFAC model."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d129de1d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ACH': 10, 'AC': 1, 'ACCH2': 1}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from ugropy import unifac\n",
"\n",
"\n",
"solution = unifac.get_groups(\"c1ccccc1Cc2ccccc2\", \"smiles\")\n",
"\n",
"solution.subgroups"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d50b442b",
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"solution.draw(width=700)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ugropy (3.12.13)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}