ID conventions

Note

While grammar and model IDs can be any unique combination of alphanumeric characters and hyphens, we encourage encoding information into the IDs by adhering to a naming scheme. This helps keep large seqgra analyses with many different grammars and architectures organized.

Grammar ID scheme

Note

Grammar IDs are defined by the root ID attribute of the data definition XML file (see Data definition for details). They are used as folder names in the seqgra output folder structure.

[task]-[input-space]-[sim|exp]-[grammar-descriptor]-[data-set-size]-[simulation-seed]

  • task: mc for multi-class classification, ml for multi-label classification, followed by number of classes/labels

  • input-space: dna for DNA alphabet, protein for protein alphabet, followed by width of input window

  • sim|exp: sim for simulated data, exp for experimental data

  • grammar-descriptor: concisely describes grammar or experimental data

  • data-set-size: total data set size (sum of training, validation, and test sets), usually using k for thousand, e.g., 10k, 50k, 2000k

  • simulation-seed: random seed used when simulating the data, always prefixed by s, e.g., s1, s5, s17

Examples:

mc2-dna1000-exp-sox2-oct4-1000k-s1:
  • mc2: multi-class classification task with 2 classes

  • dna1000: DNA input sequence space, 1000 nt input sequence window

  • exp: experimental data

  • sox2-oct4: experimental data descriptor

  • 1000k: data set contains 1,000,000 examples; sum of training, validation, and test sets

  • s1: simulation seed 1

ml50-dna150-sim-homer-interaction-order-90k-s4:
  • ml50: multi-label classification task with 50 labels

  • dna150: DNA input sequence space, 150 nt input sequence window

  • sim: simulated data

  • homer-interaction-order: grammar descriptor

  • 90k: data set contains 90,000 examples; sum of training, validation, and test sets

  • s4: simulation seed 4
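
Grammar IDs that follow this scheme can be assembled programmatically. The sketch below is purely illustrative; the helper function is hypothetical and not part of the seqgra API:

    # Hypothetical helper, not part of the seqgra API: assemble a grammar ID
    # from its components, following the scheme above.
    def make_grammar_id(task: str, num_classes: int, alphabet: str, window: int,
                        origin: str, descriptor: str, size: str, seed: int) -> str:
        assert task in ("mc", "ml")            # multi-class or multi-label
        assert alphabet in ("dna", "protein")  # input space
        assert origin in ("sim", "exp")        # simulated or experimental data
        return f"{task}{num_classes}-{alphabet}{window}-{origin}-{descriptor}-{size}-s{seed}"

    # reproduces the first example above
    print(make_grammar_id("mc", 2, "dna", 1000, "exp", "sox2-oct4", "1000k", 1))
    # -> mc2-dna1000-exp-sox2-oct4-1000k-s1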

Model ID scheme

[library]-[task]-[input-space]-[model-descriptor]-[model-seed]

  • library: machine learning library the model is implemented in; torch for PyTorch, tf for TensorFlow, or boc for the Bayes Optimal Classifier

  • task: mc for multi-class classification, ml for multi-label classification, followed by number of classes/labels

  • input-space: dna for DNA alphabet, protein for protein alphabet, followed by width of input window

  • model-descriptor: concisely describes model architecture, following its own scheme (see below)

  • model-seed: random seed used when training the model, always prefixed by s, e.g., s1, s5, s17

Examples:

torch-ml2-dna1000-conv10w-conv10w-gmp-fc5-s2:
  • torch: model implemented using PyTorch library

  • ml2: multi-label classification task with 2 labels

  • dna1000: DNA input sequence space, 1000 nt input sequence window

  • conv10w-conv10w-gmp-fc5: model descriptor, following its own scheme (see below)

  • s2: model seed 2

tf-mc10-dna150-conv10-do03-conv10-fc5-do03-s3:
  • tf: model implemented using TensorFlow library

  • mc10: multi-class classification task with 10 classes

  • dna150: DNA input sequence space, 150 nt input sequence window

  • conv10-do03-conv10-fc5-do03: model descriptor, following its own scheme (see below)

  • s3: model seed 3
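
Model IDs can be split back into their components. Because the model descriptor may itself contain hyphens, only the leading fields and the trailing seed sit at fixed positions. The following parser is a sketch, not part of the seqgra API:

    # Hypothetical helper, not part of the seqgra API: split a model ID
    # into its components. The model descriptor may contain hyphens, so
    # everything between the third field and the seed is joined back together.
    def parse_model_id(model_id: str) -> dict:
        parts = model_id.split("-")
        return {
            "library": parts[0],      # torch, tf, or boc
            "task": parts[1],         # e.g., ml2, mc10
            "input_space": parts[2],  # e.g., dna1000, dna150
            "model_descriptor": "-".join(parts[3:-1]),
            "model_seed": parts[-1],  # e.g., s2, s3
        }

    print(parse_model_id("torch-ml2-dna1000-conv10w-conv10w-gmp-fc5-s2")["model_descriptor"])
    # -> conv10w-conv10w-gmp-fc5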

Model descriptor scheme

Model IDs contain a model descriptor, which conveys as much information about the architecture as possible while remaining concise. The output layer is never specified, as it is determined by the classification task.

General rules:

  • conv: convolutional layer

    • conv10: convolutional layer with 10 11-nt wide filters

    • conv1xn: convolutional layer with 1 3-nt wide filter

    • conv5n: convolutional layer with 5 5-nt wide filters

    • conv50w: convolutional layer with 50 21-nt wide filters

    • conv2xw: convolutional layer with 2 41-nt wide filters

    • conv100xxw: convolutional layer with 100 81-nt wide filters

  • fc: dense or fully connected layer

    • fc10: fully connected layer with 10 units

  • gmp: global max pooling operation

  • do: dropout layer

    • do03: dropout layer with 30% dropout rate

  • bn: batch normalization layer
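
The filter widths implied by the conv suffixes above follow a fixed mapping (no suffix: 11 nt, xn: 3 nt, n: 5 nt, w: 21 nt, xw: 41 nt, xxw: 81 nt). The following Python sketch expands a descriptor into readable layer descriptions; it is illustrative only and not part of the seqgra API, and it assumes dropout tokens encode the rate with one digit after an implicit decimal point (do03 means 0.3):

    import re

    # filter widths implied by the conv suffixes listed above
    CONV_WIDTHS = {"xn": 3, "n": 5, "": 11, "w": 21, "xw": 41, "xxw": 81}

    # Hypothetical helper, not part of the seqgra API: expand a model
    # descriptor such as "conv10-do03-conv10-fc5-do03" into readable
    # layer descriptions. The output layer is implicit and never listed.
    def expand_descriptor(descriptor: str) -> list:
        layers = []
        for token in descriptor.split("-"):
            if match := re.fullmatch(r"conv(\d+)(xn|n|xxw|xw|w)?", token):
                width = CONV_WIDTHS[match.group(2) or ""]
                layers.append(f"convolutional layer with {match.group(1)} {width}-nt wide filters")
            elif match := re.fullmatch(r"fc(\d+)", token):
                layers.append(f"fully connected layer with {match.group(1)} units")
            elif match := re.fullmatch(r"do(\d+)", token):
                rate = int(match.group(1)) / 10  # do03 -> 0.3
                layers.append(f"dropout layer with {rate:.0%} dropout rate")
            elif token == "gmp":
                layers.append("global max pooling operation")
            elif token == "bn":
                layers.append("batch normalization layer")
            else:
                layers.append(f"known architecture: {token}")  # e.g., deepsea
        return layers

    print(expand_descriptor("conv10w-conv10w-gmp-fc5"))
    # -> ['convolutional layer with 10 21-nt wide filters',
    #     'convolutional layer with 10 21-nt wide filters',
    #     'global max pooling operation',
    #     'fully connected layer with 5 units']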

Examples:

conv10-do03-conv10-fc5-do03: architecture with

  • conv10: convolutional layer with 10 11-nt wide filters,

  • do03: dropout layer with 30% dropout rate,

  • conv10: convolutional layer with 10 11-nt wide filters,

  • fc5: fully connected layer with 5 units,

  • do03: dropout layer with 30% dropout rate,

  • and output layer (always unspecified)

conv10w-conv10w-gmp-fc5: architecture with

  • conv10w: convolutional layer with 10 21-nt wide filters,

  • conv10w: convolutional layer with 10 21-nt wide filters,

  • gmp: global max pooling operation,

  • fc5: fully connected layer with 5 units,

  • and output layer (always unspecified)

deepsea: known architecture, adjusted to fit the classification task and input data