# Summary
I have introduced a new base class `Dataset` in `ml.py` which all datasets should subclass. It stores the dataset as a polars DataFrame, with the column names and number of columns determined by the subclass. It implements generic functionality such as `add_row`/`add_rows`, indexing and dataset saving/loading, and declares the abstract methods required of subclasses: `X`, `y` and `generate_dataset`.
Two subclasses currently exist: `RuleBasedDataset` for the MLRR models and `EnviFormerDataset` for the enviFormer models.
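A minimal sketch of the shape this takes, assuming a `(columns, data)` constructor (the real `ml.py` implementation may differ in its details):
```python
from abc import ABC, abstractmethod

import polars as pl


class Dataset(ABC):
    """Base class; stores the data as a polars DataFrame."""

    def __init__(self, columns, data=None):
        # Accept an existing DataFrame, a list of rows, or nothing (empty dataset).
        if isinstance(data, pl.DataFrame):
            self.df = data
        else:
            self.df = pl.DataFrame(data or [], schema=columns, orient="row")

    def add_rows(self, rows):
        # Rows are assumed to follow the existing column order.
        new = pl.DataFrame(rows, schema=self.df.columns, orient="row")
        # vertical_relaxed lets polars coerce dtypes when appending to an empty frame.
        self.df = pl.concat([self.df, new], how="vertical_relaxed")

    def add_row(self, row):
        self.add_rows([row])

    @abstractmethod
    def X(self):
        """Input features for an ML model."""

    @abstractmethod
    def y(self):
        """Targets for an ML model."""

    @staticmethod
    @abstractmethod
    def generate_dataset(*args, **kwargs):
        """Return an initialised Dataset subclass."""
```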
# Old Dataset to New RuleBasedDataset Functionality Translation
- [x] \_\_init\_\_
- self.columns and self.num_labels moved to base Dataset class
- self.data moved to the base class as self.df, which can be initialised from a list of rows or from another DataFrame
- struct_features, triggered and observed remain the same
- [x] \_block\_indices
- function moved to base Dataset class
- [x] structure_id
- stays in RuleBasedDataset, now requires an index for the row of interest
- [x] add_row
- moved to base Dataset class, now calls add_rows so one or more rows can be added at a time
- [x] times_triggered
- stays in RuleBasedDataset, now does a lookup using polars df.filter (see the sketch after this list)
- [x] struct_features (see init)
- [x] triggered (see init)
- [x] observed (see init)
- [x] at
- removed in favour of indexing with getitem
- [x] limit
- removed in favour of indexing with getitem
- [x] classification_dataset
- stays in RuleBasedDataset, largely the same, just with the new dataset construction using add_rows
- [x] generate_dataset
- stays in RuleBasedDataset, largely the same, just with the new dataset construction using add_rows
- [x] X
- moved to base Dataset as an @abstractmethod; the RuleBasedDataset implementation is functionally the same but uses polars
- [x] trig
- stays in RuleBasedDataset, functionally the same but uses polars
- [x] y
- moved to base Dataset as an @abstractmethod; the RuleBasedDataset implementation is functionally the same but uses polars
- [x] \_\_getitem\_\_
- moved to base Dataset, now passes the item to the dataframe for polars to handle
- [x] to_arff
- stays in RuleBasedDataset, functionally the same but uses polars
- [x] \_\_repr\_\_
- moved to base Dataset
- [x] \_\_iter\_\_
- moved to base Dataset, now uses polars iter_rows
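To illustrate the new lookup style, here is a rough sketch of two of the `RuleBasedDataset` helpers above, continuing the base-class sketch from the Summary; the `structure_id` column name and the 0/1 encoding of the rule columns are assumptions made for the example, not the actual schema:
```python
import polars as pl


class RuleBasedDataset(Dataset):
    """Sketch of the rule-based (MLRR) dataset subclass."""

    def structure_id(self, index: int):
        # Now takes an explicit index for the row of interest.
        return self.df[index, "structure_id"]

    def times_triggered(self, rule_column: str) -> int:
        # Lookup via df.filter: count the rows in which the rule fired.
        return self.df.filter(pl.col(rule_column) == 1).height
```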
# Base Dataset class Features
The following functions are available in the base Dataset class:
- init - Create the dataset from a list of column names and data as a list of lists. It can also be created from a polars DataFrame, which is essential for recreating itself during indexing, or as an empty dataset by passing only column names.
- add_rows - Add rows to the Dataset. The length of the new rows is checked, but the column order is presumed to match the existing dataframe.
- add_row - Add one row, see add_rows
- block_indices - Returns the indices of the columns whose names start with the given prefix
- columns - Property, returns dataframe.columns
- shape - Property, returns dataframe.shape
- X - Abstract method to be implemented by the subclasses, it should represent the input to a ML model
- y - Abstract method to be implemented by the subclasses, it should represent the target for a ML model
- generate_dataset - Abstract and static method to be implemented by the subclasses, should return an initialised subclass of Dataset
- iter - returns the iterable from dataframe.iter_rows()
- getitem - passes the item argument to the dataframe. If the result of indexing the dataframe is another dataframe, the new dataframe is packaged into a new Dataset of the same subclass. If the result of indexing is something else (int, float, polars Series) the result is returned directly (see the sketch after this list).
- save - Pickle and save the dataframe to the given path
- load - Static method to load the dataset from the given path
- to_numpy - returns the dataframe as a numpy array. Required for compatibility with training of the ECC model
- repr - return a representation of the dataset
- len - return the length of the dataframe
- iter_rows - Returns dataframe.iter_rows with arguments passed through. Mainly used to get the named iterator, which returns rows of the dataframe as dicts of column name: column value instead of tuples of column values.
- filter - passes to dataframe.filter and recreates self with the result
- select - passes to dataframe.select and recreates self with the result
- with_columns - passes to dataframe.with_columns and recreates self with the result
- sort - passes to dataframe.sort and recreates self with the result
- item - passes to dataframe.item
- fill_nan - fills the dataframe NaNs with the given value
- height - Property, returns the height (number of rows) of the dataframe
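The "recreates self with the result" behaviour shared by getitem, filter, select, with_columns and sort can be sketched as below; the `_wrap` helper is a hypothetical name, and the `(columns, data)` constructor is carried over from the earlier sketch. This is what lets `dataset.filter(...)` or `dataset[rows]` come back as the same Dataset subclass rather than a bare DataFrame.
```python
import polars as pl


class Dataset:
    """Condensed version of the base-class sketch, focusing on indexing."""

    def __init__(self, columns, data=None):
        if isinstance(data, pl.DataFrame):
            self.df = data
        else:
            self.df = pl.DataFrame(data or [], schema=columns, orient="row")

    def _wrap(self, result):
        # Hypothetical helper: DataFrame results are packaged back into the
        # same Dataset subclass; scalars and Series are returned unchanged.
        if isinstance(result, pl.DataFrame):
            return type(self)(result.columns, result)
        return result

    def __getitem__(self, item):
        return self._wrap(self.df[item])

    def filter(self, *args, **kwargs):
        return self._wrap(self.df.filter(*args, **kwargs))

    def select(self, *args, **kwargs):
        return self._wrap(self.df.select(*args, **kwargs))

    def with_columns(self, *args, **kwargs):
        return self._wrap(self.df.with_columns(*args, **kwargs))

    def sort(self, *args, **kwargs):
        return self._wrap(self.df.sort(*args, **kwargs))
```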
- [x] App domain
- [x] MACCS alternatives
Co-authored-by: Liam Brydon <62733830+MyCreativityOutlet@users.noreply.github.com>
Reviewed-on: enviPath/enviPy#184
Reviewed-by: jebus <lorsbach@envipath.com>
Co-authored-by: liambrydon <lbry121@aucklanduni.ac.nz>
Co-committed-by: liambrydon <lbry121@aucklanduni.ac.nz>
Attached file listing (93 lines, 3.1 KiB, TOML):
```toml
[project]
name = "envipy"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "celery>=5.5.2",
    "django>=5.2.1",
    "django-extensions>=4.1",
    "django-model-utils>=5.0.0",
    "django-ninja>=1.4.1",
    "django-oauth-toolkit>=3.0.1",
    "django-polymorphic>=4.1.0",
    "enviformer",
    "envipy-additional-information",
    "envipy-ambit>=0.1.0",
    "envipy-plugins",
    "epam-indigo>=1.30.1",
    "gunicorn>=23.0.0",
    "networkx>=3.4.2",
    "psycopg2-binary>=2.9.10",
    "python-dotenv>=1.1.0",
    "rdkit>=2025.3.2",
    "redis>=6.1.0",
    "requests>=2.32.3",
    "scikit-learn>=1.6.1",
    "sentry-sdk[django]>=2.32.0",
    "setuptools>=80.8.0",
    "polars==1.35.1",
]

[tool.uv.sources]
enviformer = { git = "ssh://git@git.envipath.com/enviPath/enviformer.git", rev = "v0.1.4" }
envipy-plugins = { git = "ssh://git@git.envipath.com/enviPath/enviPy-plugins.git", rev = "v0.1.0" }
envipy-additional-information = { git = "ssh://git@git.envipath.com/enviPath/enviPy-additional-information.git", rev = "v0.1.7" }
envipy-ambit = { git = "ssh://git@git.envipath.com/enviPath/enviPy-ambit.git" }

[project.optional-dependencies]
ms-login = ["msal>=1.33.0"]
dev = [
    "celery-stubs==0.1.3",
    "django-stubs>=5.2.4",
    "poethepoet>=0.37.0",
    "pre-commit>=4.3.0",
    "ruff>=0.13.3",
]

[tool.ruff]
line-length = 100

[tool.ruff.lint]
# Allow fix for all enabled rules (when `--fix`) is provided.
fixable = ["ALL"]
# Allow unused variables when underscore-prefixed.
dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

[tool.ruff.format]
docstring-code-format = true

# 4. Ignore `E402` (import violations) in all `__init__.py` files, and in selected subdirectories.
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402"]
"**/{tests,docs,tools}/*" = ["E402"]

[tool.poe.tasks]
# Main tasks
setup = { sequence = ["db-up", "migrate", "bootstrap"], help = "Complete setup: start database, run migrations, and bootstrap data" }
dev = { cmd = "python manage.py runserver", help = "Start the development server", deps = ["db-up"] }

# Database tasks
db-up = { cmd = "docker compose -f docker-compose.dev.yml up -d", help = "Start PostgreSQL database using Docker Compose" }
db-down = { cmd = "docker compose -f docker-compose.dev.yml down", help = "Stop PostgreSQL database" }

# Full cleanup tasks
clean = { sequence = ["clean-db"], help = "Remove model files and database volumes (WARNING: destroys all data!)" }
clean-db = { cmd = "docker compose -f docker-compose.dev.yml down -v", help = "Removes the database container and volume." }

# Django tasks
migrate = { cmd = "python manage.py migrate", help = "Run database migrations" }
bootstrap = { shell = """
echo "Bootstrapping initial data..."
echo "This will take a bit ⏱️. Get yourself some coffee..."
python manage.py bootstrap
echo "✓ Bootstrap complete"
echo ""
echo "Default admin credentials:"
echo " Username: admin"
echo " Email: admin@envipath.com"
echo " Password: SuperSafe"
""", help = "Bootstrap initial data (anonymous user, packages, models)" }
shell = { cmd = "python manage.py shell", help = "Open Django shell" }
```