ID conventions
Note
While grammar and model IDs can contain any unique combination of alphanumeric characters (and hyphens), we encourage encoding information in the IDs by adhering to a naming scheme. This helps keep large seqgra analyses with many different grammars and architectures organized.
Grammar ID scheme
Note
Grammar IDs are defined by the root ID attribute of the data definition XML file, see Data definition for details. They are used as folder names in the seqgra output folder structure.
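The grammar ID is thus just the id attribute of the root element of a data definition file. As a quick illustration (not a seqgra API; the file name data-definition.xml is a placeholder), it can be read with Python's standard library:

    import xml.etree.ElementTree as ET

    def read_grammar_id(path: str) -> str:
        # The root element's "id" attribute serves as the grammar ID
        # and therefore as the folder name in the output folder structure.
        return ET.parse(path).getroot().get("id")

    print(read_grammar_id("data-definition.xml"))  # e.g., mc2-dna1000-exp-sox2-oct4-1000k-s1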
[task]-[input-space]-[sim|exp]-[grammar-descriptor]-[data-set-size]-[simulation-seed]
task: mc for multi-class classification, ml for multi-label classification, followed by number of classes/labels
input-space: dna for DNA alphabet, protein for protein alphabet, followed by width of input window
sim|exp: sim for simulated data, exp for experimental data
grammar-descriptor: concisely describes grammar or experimental data
data-set-size: total data set size (sum of training, validation, and test sets), usually using k for thousand, e.g., 10k, 50k, 2000k
simulation-seed: random seed used when simulating the data, always prefixed by s, e.g., s1, s5, s17
Examples:
mc2-dna1000-exp-sox2-oct4-1000k-s1:
mc2: multi-class classification task with 2 classes
dna1000: DNA input sequence space, 1000 nt input sequence window
exp: experimental data
sox2-oct4: experimental data descriptor
1000k: data set contains 1,000,000 examples; sum of training, validation, and test sets
s1: simulation seed 1
ml50-dna150-sim-homer-interaction-order-90k-s4:
ml50: multi-label classification task with 50 labels
dna150: DNA input sequence space, 150 nt input sequence window
sim: simulated data
homer-interaction-order: grammar descriptor
90k: data set contains 90,000 examples; sum of training, validation, and test sets
s4: simulation seed 4
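The components of a scheme-conforming grammar ID can also be recovered programmatically. The sketch below is purely illustrative and not part of seqgra; it assumes the data set size uses the k suffix, as recommended above:

    import re

    # Decompose a grammar ID following the scheme
    # [task]-[input-space]-[sim|exp]-[grammar-descriptor]-[data-set-size]-[simulation-seed]
    GRAMMAR_ID_PATTERN = re.compile(
        r"^(?P<task>(?:mc|ml)\d+)-"
        r"(?P<input_space>(?:dna|protein)\d+)-"
        r"(?P<origin>sim|exp)-"
        r"(?P<descriptor>.+)-"
        r"(?P<size>\d+k)-"
        r"(?P<seed>s\d+)$"
    )

    def parse_grammar_id(grammar_id: str) -> dict:
        match = GRAMMAR_ID_PATTERN.match(grammar_id)
        if match is None:
            raise ValueError(f"not a scheme-conforming grammar ID: {grammar_id}")
        return match.groupdict()

    # {'task': 'ml50', 'input_space': 'dna150', 'origin': 'sim',
    #  'descriptor': 'homer-interaction-order', 'size': '90k', 'seed': 's4'}
    print(parse_grammar_id("ml50-dna150-sim-homer-interaction-order-90k-s4"))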
Model ID scheme
[library]-[task]-[input-space]-[model-descriptor]-[model-seed]
library: machine learning library the model is implemented in, either torch for PyTorch, tf for TensorFlow, or boc for Bayes Optimal Classifier
task: mc for multi-class classification, ml for multi-label classification, followed by number of classes/labels
input-space: dna for DNA alphabet, protein for protein alphabet, followed by width of input window
model-descriptor: concisely describes model architecture, following its own scheme (see below)
model-seed: random seed used when training the model, always prefixed by s, e.g., s1, s5, s17
Examples:
torch-ml2-dna1000-conv10w-conv10w-gmp-fc5-s2:
torch: model implemented using PyTorch library
ml2: multi-label classification task with 2 labels
dna1000: DNA input sequence space, 1000 nt input sequence window
conv10w-conv10w-gmp-fc5: model descriptor, following its own scheme (see below)
s2: model seed 2
tf-mc10-dna150-conv10-do03-conv10-fc5-do03-s3:
tf: model implemented using TensorFlow library
mc10: multi-class classification task with 10 classes
dna150: DNA input sequence space, 150 nt input sequence window
conv10-do03-conv10-fc5-do03: model descriptor, following its own scheme (see below)
s3: model seed 3
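Analogously, a scheme-conforming model ID can be split into its parts; this is again just an illustrative sketch, not a seqgra function:

    import re

    # Decompose a model ID following the scheme
    # [library]-[task]-[input-space]-[model-descriptor]-[model-seed]
    MODEL_ID_PATTERN = re.compile(
        r"^(?P<library>torch|tf|boc)-"
        r"(?P<task>(?:mc|ml)\d+)-"
        r"(?P<input_space>(?:dna|protein)\d+)-"
        r"(?P<descriptor>.+)-"
        r"(?P<seed>s\d+)$"
    )

    def parse_model_id(model_id: str) -> dict:
        match = MODEL_ID_PATTERN.match(model_id)
        if match is None:
            raise ValueError(f"not a scheme-conforming model ID: {model_id}")
        return match.groupdict()

    # {'library': 'tf', 'task': 'mc10', 'input_space': 'dna150',
    #  'descriptor': 'conv10-do03-conv10-fc5-do03', 'seed': 's3'}
    print(parse_model_id("tf-mc10-dna150-conv10-do03-conv10-fc5-do03-s3"))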
Model descriptor scheme
Model IDs contain a model descriptor, which conveys as much information about the architecture as possible while remaining concise. The output layer is never specified, as it is determined by the classification task.
General rules:
conv: convolutional layer
conv10: convolutional layer with 10 11-nt wide filters
conv1xn: convolutional layer with 1 3-nt wide filter
conv5n: convolutional layer with 5 5-nt wide filters
conv50w: convolutional layer with 50 21-nt wide filters
conv2xw: convolutional layer with 2 41-nt wide filters
conv100xxw: convolutional layer with 100 81-nt wide filters
fc: dense or fully connected layer
fc10: fully connected layer with 10 units
gmp: global max pooling operation
do: dropout layer
do03: dropout layer with 30% dropout rate
bn: batch normalization layer
Examples:
conv10-do03-conv10-fc5-do03: architecture with
conv10: convolutional layer with 10 11-nt wide filters,
do03: dropout layer with 30% dropout rate,
conv10: convolutional layer with 10 11-nt wide filters,
fc5: fully connected layer with 5 units,
do03: dropout layer with 30% dropout rate,
and output layer (always unspecified)
conv10w-conv10w-gmp-fc5: architecture with
conv10w: convolutional layer with 10 21-nt wide filters,
conv10w: convolutional layer with 10 21-nt wide filters,
gmp: global max pooling operation,
fc5: fully connected layer with 5 units,
and output layer (always unspecified)
deepsea: known architecture, adjusted to fit the classification task and input data
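The conv examples above imply a suffix convention for filter widths (xn = 3 nt, n = 5 nt, no suffix = 11 nt, w = 21 nt, xw = 41 nt, xxw = 81 nt). The following sketch makes that mapping explicit; it is inferred from the examples and is not a seqgra API:

    import re

    # Filter widths in nt implied by the conv suffix examples above.
    CONV_WIDTHS_NT = {"xn": 3, "n": 5, "": 11, "w": 21, "xw": 41, "xxw": 81}

    CONV_TOKEN = re.compile(r"^conv(?P<filters>\d+)(?P<suffix>xxw|xw|xn|w|n)?$")

    def describe_conv(token: str) -> str:
        # Turn a token such as 'conv10w' into a human-readable description.
        match = CONV_TOKEN.match(token)
        if match is None:
            raise ValueError(f"not a conv token: {token}")
        filters = int(match.group("filters"))
        width = CONV_WIDTHS_NT[match.group("suffix") or ""]
        return f"convolutional layer with {filters} {width}-nt wide filters"

    for token in "conv10w-conv10w-gmp-fc5".split("-"):
        if token.startswith("conv"):
            print(token, "->", describe_conv(token))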