liambrydon e26d5a21e3 [Enhancement] Refactor Dataset (#184)
# Summary
I have introduced a new base class `Dataset` in `ml.py` which all datasets should subclass. It stores the dataset as a polars DataFrame, with the column names and number of columns determined by the subclass. It implements generic behaviour such as `add_row`, indexing and dataset saving (the old `at` and `limit` helpers are replaced by indexing, see below), and declares the abstract methods required of subclasses: `X`, `y` and `generate_dataset`.
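
A minimal sketch of the base class shape this describes (method bodies are illustrative, not the actual `ml.py` implementation):

```python
from abc import ABC, abstractmethod

import polars as pl


class Dataset(ABC):
    """Base class; subclasses fix the column layout."""

    def __init__(self, columns: list[str], data=None):
        # data may be a list of lists, an existing polars DataFrame,
        # or None for an empty dataset with the given columns.
        if isinstance(data, pl.DataFrame):
            self.df = data
        elif data is None:
            self.df = pl.DataFrame(schema=columns)
        else:
            self.df = pl.DataFrame(data, schema=columns, orient="row")

    @abstractmethod
    def X(self):
        """The input to an ML model."""

    @abstractmethod
    def y(self):
        """The target for an ML model."""

    @staticmethod
    @abstractmethod
    def generate_dataset(*args, **kwargs):
        """Return an initialised Dataset subclass."""
```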

Two subclasses currently exist: `RuleBasedDataset` for the MLRR models and `EnviFormerDataset` for the enviFormer models.

# Old Dataset to New RuleBasedDataset Functionality Translation

- [x] \_\_init\_\_
    - self.columns and self.num_labels moved to base Dataset class
    - self.data moved to base class and renamed self.df, along with initialisation from a list or from another DataFrame
    - struct_features, triggered and observed remain the same
- [x] \_block\_indices
    - function moved to base Dataset class
- [x] structure_id
    - stays in RuleBasedDataset, now requires an index for the row of interest
- [x] add_row
    - moved to base Dataset class, now calls add_rows so one or more rows can be added at a time (see the sketch after this list)
- [x] times_triggered
    - stays in RuleBasedDataset, now does a lookup using polars df.filter
- [x] struct_features (see init)
- [x] triggered (see init)
- [x] observed (see init)
- [x] at
    - removed in favour of indexing with getitem
- [x] limit
    - removed in favour of indexing with getitem
- [x] classification_dataset
    - stays in RuleBasedDataset, largely the same but with new dataset construction using add_rows
- [x] generate_dataset
    - stays in RuleBasedDataset, largely the same but with new dataset construction using add_rows
- [x] X
    - moved to base Dataset as @abstractmethod, RuleBasedDataset implementation functionally the same but uses polars
- [x] trig
    - stays in RuleBasedDataset, functionally the same but uses polars
- [x] y
    - moved to base Dataset as @abstractmethod, RuleBasedDataset implementation functionally the same but uses polars
- [x] \_\_getitem\_\_
    - moved to base Dataset, now passes item to the dataframe for polars to handle
- [x] to_arff
    - stays in RuleBasedDataset, functionally the same but uses polars
- [x] \_\_repr\_\_
    - moved to base Dataset
- [x] \_\_iter\_\_
    - moved to base Dataset, now uses polars iter_rows
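
As a rough illustration of the add_row/add_rows relationship and the polars-based times_triggered lookup, here is a hedged sketch; the method bodies and the trigger-column encoding are assumptions, not the actual code:

```python
import polars as pl


class RowOps:
    """Sketch of methods from the refactor; `df` is the dataset's
    underlying polars DataFrame."""

    df: pl.DataFrame

    def add_rows(self, rows: list[list]) -> None:
        # Each row's length is checked against the column count;
        # column order is presumed to match the existing DataFrame.
        for row in rows:
            if len(row) != len(self.df.columns):
                raise ValueError("row length does not match column count")
        new = pl.DataFrame(rows, schema=self.df.columns, orient="row")
        self.df = pl.concat([self.df, new])

    def add_row(self, row: list) -> None:
        # add_row is now a thin wrapper around add_rows.
        self.add_rows([row])

    def times_triggered(self, rule_column: str) -> int:
        # RuleBasedDataset-style lookup via df.filter; treating a
        # value of 1 as "triggered" is an assumption.
        return self.df.filter(pl.col(rule_column) == 1).height
```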

# Base Dataset class Features
The following functions are available in the base Dataset class:

- init - Create the dataset from a list of columns and data as a list of lists. Can also create a dataset from a polars DataFrame, which is essential for recreating itself during indexing, or an empty dataset by passing column names only.
- add_rows - Add rows to the Dataset; the length of each new row is checked against the column count, but the column order is presumed to match the existing dataframe
- add_row - Add one row, see add_rows
- block_indices - Returns the indices of columns whose names start with the given prefix
- columns - Property, returns dataframe.columns
- shape - Property, returns dataframe.shape
- X - Abstract method to be implemented by the subclasses, it should represent the input to a ML model
- y - Abstract method to be implemented by the subclasses, it should represent the target for a ML model
- generate_dataset - Abstract and static method to be implemented by the subclasses, should return an initialised subclass of Dataset
- iter - returns the iterable from dataframe.iter_rows()
- getitem - passes the item argument to the dataframe. If the result of indexing the dataframe is another dataframe, it is packaged into a new Dataset of the same subclass. If the result is something else (int, float, polars Series), it is returned as-is. See the sketch after this list.
- save - Pickle and save the dataframe to the given path
- load - Static method to load the dataset from the given path
- to_numpy - returns the dataframe as a numpy array. Required for compatibility with training the ECC model
- repr - return a representation of the dataset
- len - return the length of the dataframe
- iter_rows - Return dataframe.iter_rows with arguments passed through. Mainly used to get the named iterable, which returns rows of the dataframe as a dict of column name: column value instead of a tuple of column values.
- filter - passes to dataframe.filter and recreates self with the result
- select - passes to dataframe.select and recreates self with the result
- with_columns - passes to dataframe.with_columns and recreates self with the result
- sort - passes to dataframe.sort and recreates self with the result
- item - passes to dataframe.item
- fill_nan - fills the dataframe's NaN values with the given value
- height - Property, returns the height (number of rows) of the dataframe
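
A hedged sketch of the indexing and delegation pattern described above, assuming the subclass constructor accepts a DataFrame via `data=`:

```python
import pickle

import polars as pl


class WrapOps:
    """Sketch of __getitem__, filter-style delegation and save."""

    df: pl.DataFrame

    def __getitem__(self, item):
        result = self.df[item]
        if isinstance(result, pl.DataFrame):
            # DataFrame results are re-wrapped in the same subclass.
            return type(self)(data=result)
        # Scalars and polars Series come back unchanged.
        return result

    def filter(self, *args, **kwargs):
        # Delegates to polars and recreates self with the result;
        # select, with_columns and sort follow the same pattern.
        return type(self)(data=self.df.filter(*args, **kwargs))

    def save(self, path: str) -> None:
        # Pickle the underlying DataFrame to the given path.
        with open(path, "wb") as f:
            pickle.dump(self.df, f)
```

Recreating `type(self)` rather than a fixed class is what lets `RuleBasedDataset` and `EnviFormerDataset` share this behaviour without overriding it.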

- [x] App domain
- [x] MACCS alternatives

Co-authored-by: Liam Brydon <62733830+MyCreativityOutlet@users.noreply.github.com>
Reviewed-on: enviPath/enviPy#184
Reviewed-by: jebus <lorsbach@envipath.com>
Co-authored-by: liambrydon <lbry121@aucklanduni.ac.nz>
Co-committed-by: liambrydon <lbry121@aucklanduni.ac.nz>

enviPy

Local Development Setup

These instructions will guide you through setting up the project for local development.

Prerequisites

  • Python 3.11 or later
  • uv - A fast Python package installer and resolver.
  • Docker and Docker Compose - Required for running the PostgreSQL database.
  • Git

Note: This application uses Django's ArrayField, which requires PostgreSQL. Docker is the recommended way to run PostgreSQL locally.
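
For context, ArrayField comes from django.contrib.postgres and maps to a native PostgreSQL array column, which SQLite and MySQL do not provide. A hypothetical model illustrating the dependency:

  from django.contrib.postgres.fields import ArrayField
  from django.db import models

  class Structure(models.Model):
      # "Structure" is a hypothetical example model, not one of the
      # app's actual models.
      fingerprint = ArrayField(models.IntegerField())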

1. Install Dependencies

This project uses uv to manage dependencies and poethepoet for task running. First, install uv if you don't have it yet.

Then, sync the project dependencies. This will create a virtual environment in .venv and install all necessary packages, including poethepoet.

uv sync --dev

Note on RDKit: If you have a different version of rdkit installed globally, the dependency installation may fail. If this happens, uninstall the global version and run uv sync again.

2. Set Up Environment File

Copy the example environment file for local setup:

cp .env.local.example .env

This file contains the necessary environment variables for local development.

3. Quick Setup with Poe

The easiest way to set up the development environment is by using the poe task runner, which is executed via uv run.

uv run poe setup

This single command will:

  1. Start the PostgreSQL database using Docker Compose.
  2. Run database migrations.
  3. Bootstrap initial data (anonymous user, default packages, models).

After setup, start the development server:

uv run poe dev

The application will be available at http://localhost:8000.

Other useful Poe commands:

You can list all available commands by running uv run poe --help.

uv run poe db-up         # Start PostgreSQL only
uv run poe db-down       # Stop PostgreSQL
uv run poe migrate       # Run migrations only
uv run poe bootstrap     # Bootstrap data only
uv run poe shell         # Open the Django shell
uv run poe clean         # Remove database volumes (WARNING: destroys all data)

Troubleshooting

  • Docker Connection Error: If you see an error like open //./pipe/dockerDesktopLinuxEngine: The system cannot find the file specified (on Windows), it likely means your Docker Desktop application is not running. Please start Docker Desktop and try the command again.

  • SSH Keys for Git Dependencies: Some dependencies are installed from private git repositories and require SSH authentication. Ensure your SSH keys are configured correctly for Git.

    • For a general guide, see GitHub's official documentation.

    • Windows Users: If uv sync hangs while fetching git dependencies, you may need to explicitly configure Git to use the Windows OpenSSH client and use the ssh-agent to manage your key's passphrase.

      1. Point Git to the correct SSH executable:
        git config --global core.sshCommand "C:/Windows/System32/OpenSSH/ssh.exe"
        
      2. Enable and use the SSH agent:
        # Run these commands in an administrator PowerShell
        Get-Service ssh-agent | Set-Service -StartupType Automatic -PassThru | Start-Service
        
        # Add your key to the agent. It will prompt for the passphrase once.
        ssh-add
        