forked from enviPath/enviPy
# Summary
I have introduced a new base `class Dataset` in `ml.py` which all datasets should subclass. It stores the dataset as a polars DataFrame, with the column names and number of columns determined by the subclass. It implements generic methods such as `add_row`, `add_rows`, indexing and dataset saving. It also declares the abstract methods that subclasses must implement: `X`, `y` and `generate_dataset`.
Two subclasses currently exist: `RuleBasedDataset` for the MLRR models and `EnviFormerDataset` for the enviFormer models.
# Old Dataset to New RuleBasedDataset Functionality Translation
- [x] \_\_init\_\_
  - self.columns and self.num_labels moved to base Dataset class
  - self.data moved to base Dataset class under the name self.df, along with initialising from a list or from another DataFrame
  - struct_features, triggered and observed remain the same
- [x] \_block\_indices
  - function moved to base Dataset class
- [x] structure_id
  - stays in RuleBasedDataset, now requires an index for the row of interest
- [x] add_row
  - moved to base Dataset class, now calls add_rows so one or more rows can be added at a time
- [x] times_triggered
  - stays in RuleBasedDataset, now does a lookup using polars df.filter
- [x] struct_features (see init)
- [x] triggered (see init)
- [x] observed (see init)
- [x] at
  - removed in favour of indexing with getitem
- [x] limit
  - removed in favour of indexing with getitem
- [x] classification_dataset
  - stays in RuleBasedDataset, largely the same, just with new dataset construction using add_rows
- [x] generate_dataset
  - stays in RuleBasedDataset, largely the same, just with new dataset construction using add_rows
- [x] X
  - moved to base Dataset class as an `@abstractmethod`, RuleBasedDataset implementation functionally the same but uses polars
- [x] trig
  - stays in RuleBasedDataset, functionally the same but uses polars
- [x] y
  - moved to base Dataset class as an `@abstractmethod`, RuleBasedDataset implementation functionally the same but uses polars
- [x] \_\_getitem\_\_
  - moved to base Dataset class, now passes item to the dataframe for polars to handle
- [x] to_arff
  - stays in RuleBasedDataset, functionally the same but uses polars
- [x] \_\_repr\_\_
  - moved to base Dataset class
- [x] \_\_iter\_\_
  - moved to base Dataset class, now uses polars iter_rows
# Base Dataset class Features
The following functions are available in the base Dataset class:
- init - Creates the dataset from a list of column names and data given as a list of lists, or from an existing polars DataFrame; the latter is essential for the dataset to recreate itself during indexing. Passing only the column names creates an empty dataset.
- add_rows - Adds rows to the Dataset. The length of each new row is checked against the existing dataframe, but the column order is presumed to match.
- add_row - Add one row, see add_rows
- block_indices - Returns the column indices that start with the given prefix
- columns - Property, returns dataframe.columns
- shape - Property, returns dataframe.shape
- X - Abstract method to be implemented by the subclasses, it should represent the input to a ML model
- y - Abstract method to be implemented by the subclasses, it should represent the target for a ML model
- generate_dataset - Abstract and static method to be implemented by the subclasses, should return an initialised subclass of Dataset
- iter - returns the iterable from dataframe.iter_rows()
- getitem - passes the item argument to the dataframe. If the result of indexing the dataframe is another dataframe, the new dataframe is packaged into a new Dataset of the same subclass. If the result of indexing is something else (int, float, polars Series), it is returned as is.
- save - Pickle and save the dataframe to the given path
- load - Static method to load the dataset from the given path
- to_numpy - returns the dataframe as a numpy array. Required for compatibility with training of the ECC model
- repr - return a representation of the dataset
- len - return the length of the dataframe
- iter_rows - Returns dataframe.iter_rows() with arguments passed through. Mainly used with `named=True`, which yields each row as a dict of column name to column value instead of a tuple of column values.
- filter - passes to dataframe.filter and recreates self with the result
- select - passes to dataframe.select and recreates self with the result
- with_columns - passes to dataframe.with_columns and recreates self with the result
- sort - passes to dataframe.sort and recreates self with the result
- item - passes to dataframe.item
- fill_nan - fills the NaN values in the dataframe with the given value
- height - Property, returns the height (number of rows) of the dataframe
- [x] App domain
- [x] MACCS alternatives
Co-authored-by: Liam Brydon <62733830+MyCreativityOutlet@users.noreply.github.com>
Reviewed-on: enviPath/enviPy#184
Reviewed-by: jebus <lorsbach@envipath.com>
Co-authored-by: liambrydon <lbry121@aucklanduni.ac.nz>
Co-committed-by: liambrydon <lbry121@aucklanduni.ac.nz>
80 lines
3.2 KiB
Python
from collections import defaultdict
from datetime import datetime
from tempfile import TemporaryDirectory
from django.test import TestCase, tag
from epdb.logic import PackageManager
from epdb.models import User, EnviFormer, Package, Setting
from epdb.tasks import predict_simple, predict


def measure_predict(mod, pathway_pk=None):
    # Measure and return the prediction time
    start = datetime.now()
    if pathway_pk:
        s = Setting()
        s.model = mod
        s.model_threshold = 0.2
        s.max_depth = 4
        s.max_nodes = 20
        s.save()
        pred_result = predict.delay(pathway_pk, s.pk, limit=s.max_depth)
    else:
        pred_result = predict_simple.delay(mod.pk, "C1=CC=C(CSCC2=CC=CC=C2)C=C1")
    _ = pred_result.get()
    return round((datetime.now() - start).total_seconds(), 2)


@tag("slow")
class EnviFormerTest(TestCase):
    fixtures = ["test_fixtures.jsonl.gz"]

    @classmethod
    def setUpClass(cls):
        super(EnviFormerTest, cls).setUpClass()
        cls.user = User.objects.get(username="anonymous")
        cls.package = PackageManager.create_package(cls.user, "Anon Test Package", "No Desc")
        cls.BBD_SUBSET = Package.objects.get(name="Fixtures")

    def test_model_flow(self):
        """Test the full flow of EnviFormer, dataset build -> model finetune -> model evaluate -> model inference"""
        with TemporaryDirectory() as tmpdir:
            with self.settings(MODEL_DIR=tmpdir):
                threshold = float(0.5)
                data_package_objs = [self.BBD_SUBSET]
                eval_packages_objs = [self.BBD_SUBSET]
                mod = EnviFormer.create(self.package, data_package_objs, threshold=threshold)

                mod.build_dataset()
                mod.build_model()
                mod.evaluate_model(True, eval_packages_objs, n_splits=2)

                mod.predict("CCN(CC)C(=O)C1=CC(=CC=C1)C")

    def test_predict_runtime(self):
        with TemporaryDirectory() as tmpdir:
            with self.settings(MODEL_DIR=tmpdir):
                threshold = float(0.5)
                data_package_objs = [self.BBD_SUBSET]
                mods = []
                for _ in range(4):
                    mod = EnviFormer.create(self.package, data_package_objs, threshold=threshold)
                    mod.build_dataset()
                    mod.build_model()
                    mods.append(mod)

                # Test prediction time drops after first prediction
                times = [measure_predict(mods[0]) for _ in range(5)]
                print(f"First prediction took {times[0]} seconds, subsequent ones took {times[1:]}")

                # Test pathway prediction
                times = [measure_predict(mods[1], self.BBD_SUBSET.pathways[0].pk) for _ in range(5)]
                print(f"First pathway prediction took {times[0]} seconds, subsequent ones took {times[1:]}")

                # Test eviction by performing three predictions with every model, twice.
                times = defaultdict(list)
                for _ in range(2):  # Eviction should cause the second iteration here to have to reload the models
                    for mod in mods:
                        for _ in range(3):
                            times[mod.pk].append(measure_predict(mod))
                print(times)