augernet.feature_assembly
Feature Assembly Module
Provides runtime feature selection and assembly for GNN training.
During data preparation, ALL possible node features are computed and stored
as separate attributes on each PyG Data object. At training time, the user
selects which features to include via a list of integer keys, and this module
concatenates and scales them into data.x.
Feature Catalog
| Key | Name | Dim | Description |
|---|---|---|---|
| 0 | skipatom_200 | 200 | SkipAtom atom-type embedding (200-dim) |
| 1 | skipatom_30 | 30 | SkipAtom atom-type embedding (30-dim) |
| 2 | onehot | 5 | Element one-hot encoding (H, C, N, O, F) |
| 3 | atomic_be | 1 | Isolated-atom 1s BE (Hartree, raw) |
| 4 | mol_be | 1 | Molecular CEBE for C, atomic for others (Hartree, raw) |
| 5 | e_score | 1 | Electronegativity-difference score (raw) |
| 6 | env_onehot | 36 | Carbon environment one-hot (NUM_CARBON_CATEGORIES) |
| 7 | morgan_fp | 256 | Per-atom Morgan fingerprint (ECFP2, radius=1) |
Only the category_feature ([1,0,0], [0,1,0], [0,0,1]) is placed in
data.x at preparation time. Everything else lives in data.<name>
attributes and is assembled here at training time.
Usage
>>> from augernet.feature_assembly import assemble_node_features, parse_feature_keys
>>> feature_keys_parsed = parse_feature_keys('035')  # [0, 3, 5]
>>> # Before creating the DataLoader: modifies data.x in place
>>> for data in data_list:
...     assemble_node_features(data, feature_keys_parsed)
assemble_dataset(data_list, feature_keys, norm_stats=None)
Apply assemble_node_features to every graph in a list (in-place).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_list | list | List of PyG Data objects. | required |
| feature_keys | sequence of int | Which features to include. | required |
| norm_stats | dict | Dataset-wide CEBE normalisation stats, forwarded to the per-graph assembly step. | None |
Source code in src/augernet/feature_assembly.py
assemble_node_features(data, feature_keys, inplace=True, norm_stats=None)
Concatenate selected node features into data.x.
The existing data.x (category_feature, shape [N, 3]) is kept as the
first columns. Selected features are scaled and appended.
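The concatenation described above can be sketched without PyG. This is a minimal illustration under stated assumptions, not the module's implementation: plain nested lists stand in for tensors, `assemble_rows` and `FEATURE_ATTRS` are hypothetical names, the toy values are made up, and the scaling step is omitted because its details are not given here.

```python
from types import SimpleNamespace

# Hypothetical subset of the catalog: key -> attribute name on the graph object.
FEATURE_ATTRS = {2: "onehot", 3: "atomic_be", 5: "e_score"}


def assemble_rows(data, feature_keys):
    """Append the selected per-node features to each row of data.x.

    Plain-list stand-in for tensor concatenation along the feature axis.
    Scaling is deliberately left out of this sketch.
    """
    rows = [list(row) for row in data.x]  # category_feature columns come first
    for key in feature_keys:
        feature = getattr(data, FEATURE_ATTRS[key])
        for row, extra in zip(rows, feature):
            row.extend(extra)  # widen each node's row by this feature's columns
    data.x = rows
    return data


# Toy graph with two nodes (all values are illustrative only).
graph = SimpleNamespace(
    x=[[1, 0, 0], [0, 1, 0]],
    onehot=[[0, 1, 0, 0, 0], [1, 0, 0, 0, 0]],
    atomic_be=[[10.5], [0.5]],
    e_score=[[0.25], [-0.25]],
)
assemble_rows(graph, [3, 5])
print(graph.x)
```

Each node row keeps its three category columns and gains one column per selected scalar feature, so the rows here end up five columns wide.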
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Data | A single graph. Must have feature attributes set during preparation. | required |
| feature_keys | sequence of int | Which features to include (see FEATURE_NAMES). | required |
| inplace | bool | If True, modifies the graph in place. | True |
| norm_stats | dict | | None |
Returns:
| Name | Type | Description |
|---|---|---|
| data | Data | The (possibly modified) Data object. |
compute_feature_tag(feature_keys)
Compute a compact filename-safe tag from sorted feature keys.
>>> compute_feature_tag([3, 0, 5])
'035'
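One way to produce such a tag is to sort the keys and join their digits. The helper below is a hypothetical stand-in named `feature_tag`, written only to illustrate the idea; the actual function may differ in validation or edge-case handling.

```python
def feature_tag(feature_keys):
    """Join the sorted single-digit keys into a compact string."""
    return "".join(str(key) for key in sorted(feature_keys))


print(feature_tag([3, 0, 5]))
```

Because the keys are sorted first, the same feature set yields the same tag regardless of the order in which the keys were given.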
describe_features(feature_keys)
Return a human-readable description of the selected feature set.
>>> describe_features([0, 3, 5])
'skipatom_200 (200) + atomic_be (1) + e_score (1)'
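The description format can be reproduced from a lookup of names and dimensions. This is a sketch under assumptions: `CATALOG` and `describe` are hypothetical names, and only three catalog entries are copied in to keep it short.

```python
# Names and dimensions copied from the feature catalog (three entries only).
CATALOG = {
    0: ("skipatom_200", 200),
    3: ("atomic_be", 1),
    5: ("e_score", 1),
}


def describe(feature_keys):
    """Format each selected feature as 'name (dim)' and join with ' + '."""
    parts = []
    for key in feature_keys:
        name, dim = CATALOG[key]
        parts.append(f"{name} ({dim})")
    return " + ".join(parts)


print(describe([0, 3, 5]))
```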
get_feature_dim(data, feature_keys)
Compute the total node-feature dimension that assemble_node_features
will produce (category_feature columns + selected feature columns).
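The total is the width of the existing category columns plus the width of each selected feature. The sketch below assumes plain nested lists in place of tensors and uses hypothetical names (`feature_dim`, `FEATURE_ATTRS`); with real tensors the widths would presumably come from the tensor shapes instead of `len`.

```python
from types import SimpleNamespace

# Hypothetical subset of the catalog: key -> attribute name on the graph object.
FEATURE_ATTRS = {0: "skipatom_200", 3: "atomic_be", 5: "e_score"}


def feature_dim(data, feature_keys):
    """Sum the column counts of data.x and every selected feature."""
    total = len(data.x[0])  # category_feature width (3 in this document)
    for key in feature_keys:
        total += len(getattr(data, FEATURE_ATTRS[key])[0])
    return total


# One-node toy graph with zero-filled features of the catalog's widths.
graph = SimpleNamespace(
    x=[[1, 0, 0]],
    skipatom_200=[[0.0] * 200],
    atomic_be=[[0.0]],
    e_score=[[0.0]],
)
print(feature_dim(graph, [0, 3, 5]))  # 3 + 200 + 1 + 1 = 205
```

A value computed this way is what a model's input layer size would need to match after assembly.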
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Data | A single graph from the dataset (used to read tensor shapes). | required |
| feature_keys | sequence of int | Feature keys to include. | required |
Returns:
| Type | Description |
|---|---|
| int | Total node-feature dimension. |
parse_feature_keys(tag)
Parse a compact feature-key string into a sorted list of ints.
Each character in the string is one feature key digit.
>>> parse_feature_keys('035')
[0, 3, 5]
>>> parse_feature_keys('7')
[7]
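Since every catalog key (0 to 7) is a single digit, the string can be split character by character. The function below is a hypothetical re-creation named `parse_keys`, shown only to make the one-character-per-key rule concrete; the real function may add validation that this sketch lacks.

```python
def parse_keys(tag):
    """Turn each character of the tag into an int and return them sorted."""
    return sorted(int(ch) for ch in tag)


print(parse_keys("035"))
print(parse_keys("530"))  # input order does not matter; the result is sorted
```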