
Commit 8645348

Merge pull request mala-project#630 from RandomDefaultUser/descriptors_on_the_fly
Implement on-the-fly descriptor calculation
2 parents 1588555 + ac255b2 commit 8645348

26 files changed: +1078 −155 lines

docs/source/advanced_usage/trainingmodel.rst

Lines changed: 12 additions & 3 deletions

@@ -170,10 +170,11 @@ data sets have to be saved - in-memory implementations are currently developed.
 To use the data shuffling (also shown in example
 ``advanced/ex02_shuffle_data.py``), you can use the ``DataShuffler`` class.
-The syntax is very easy, you create a ``DataShufller`` object,
+The syntax is very easy: you create a ``DataShuffler`` object,
 which provides the same ``add_snapshot`` functionalities as the ``DataHandler``
-object, and shuffle the data once you have added all snapshots in question,
-i.e.,
+object, and shuffle the data once you have added all snapshots in question.
+Just as with the ``DataHandler`` class, on-the-fly calculation of bispectrum
+descriptors is supported.

 .. code-block:: python

@@ -187,6 +188,14 @@ i.e.,
     data_shuffler.shuffle_snapshots(complete_save_path="../",
                                     save_name="Be_shuffled*")

+By using the ``shuffle_to_temporary`` keyword, you can shuffle the data to
+temporary files, which can be deleted after the training run. This is useful
+if you want to shuffle the data right before training and do not plan to
+re-use shuffled data files for multiple training runs. As detailed in
+``advanced/ex02_shuffle_data.py``, access to temporary files is provided via
+``data_shuffler.temporary_shuffled_snapshots[...]``, which is a list containing
+``mala.Snapshot`` objects.
+
 The seed ``parameters.data.shuffling_seed`` ensures reproducibility of data
 sets. The ``shuffle_snapshots`` function has a path handling ability akin to
 the ``DataConverter`` class. Further, via the ``number_of_shuffled_snapshots``
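The role of ``shuffling_seed`` and ``number_of_shuffled_snapshots`` described above can be illustrated with a short, self-contained sketch in plain Python. This is a conceptual illustration only, not the MALA implementation; the pooling-and-chunking logic is an assumption made for demonstration purposes:

```python
import random

def shuffle_to_files(gridpoints, number_of_shuffled_snapshots, seed):
    # Conceptually mimic the DataShuffler: pool all grid points from all
    # snapshots, shuffle them with a fixed seed (reproducibility), and split
    # the result into equally sized "shuffled snapshot" chunks.
    rng = random.Random(seed)
    pool = list(gridpoints)
    rng.shuffle(pool)
    chunk_size = len(pool) // number_of_shuffled_snapshots
    return [
        pool[i * chunk_size : (i + 1) * chunk_size]
        for i in range(number_of_shuffled_snapshots)
    ]

# Two toy "snapshots" of labelled grid points.
snapshot0 = [("snap0", i) for i in range(4)]
snapshot1 = [("snap1", i) for i in range(4)]

shuffled = shuffle_to_files(
    snapshot0 + snapshot1, number_of_shuffled_snapshots=2, seed=1234
)

# The same seed yields identical shuffled data sets on every run.
assert shuffled == shuffle_to_files(
    snapshot0 + snapshot1, number_of_shuffled_snapshots=2, seed=1234
)
```

The resulting chunks will generally mix grid points from both snapshots, which is the property the shuffling is meant to provide for mini-batch training.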

docs/source/basic_usage/more_data.rst

Lines changed: 15 additions & 6 deletions

@@ -49,7 +49,13 @@ MALA can be used to process raw data into ready-to-use data for ML-DFT model
 creation. For this, the ``DataConverter`` class can be used, as also shown
 in the example ``basic/ex03_preprocess_data``.
 The first thing when converting data is to select how the data should be
-processed. Up until now, MALA operates with bispectrum descriptors as
+processed. As outlined in :doc:`the training documentation <trainingmodel>`,
+there are two ways to provide descriptor data to MALA models. One can either
+precompute files containing descriptors with the ``DataConverter`` class or
+compute descriptor data on-the-fly by providing MALA-generated JSON files
+containing simulation output information. These JSON files can also be
+generated by the ``DataConverter`` class.
+Up until now, MALA operates with bispectrum descriptors as
 input data (=descriptors) and LDOS as output data (=targets). Their
 calculation is performed via

@@ -73,6 +79,8 @@ values are included in the energy grid upon which the LDOS is sampled,
 ``ldos_gridoffset_ev`` determines the lowest energy value sampled. These values
 are chosen for the ``pp.x`` simulation and have to be given here.

+If descriptors are precomputed, then hyperparameters for their calculation
+have to be provided.
 For the bispectrum calculation, ``bispectrum_cutoff`` gives the radius of
 the cutoff sphere from which information on the atomic structure is incorporated
 into the bispectrum descriptor vector at each point in space, whereas

@@ -99,19 +107,20 @@ and fill it with data, e.g., by
         descriptor_input_path=outfile,
         target_input_type=".cube",
         target_input_path=ldosfile,
-        additional_info_input_type="espresso-out",
-        additional_info_input_path=outfile,
+        simulation_output_type="espresso-out",
+        simulation_output_path=outfile,
         target_units="1/(Ry*Bohr^3)")

 The ``add_snapshot`` function can be called multiple times to add
 multiple snapshots to MALA.
 For regular Quantum ESPRESSO calculations, the ``descriptor_input_type``
 and ``target_input_type`` will always be ``"espresso-out"`` and ``".cube"``,
 respectively, and the ``target_units`` will always be ``"1/(Ry*Bohr^3)"``.
-The paths have to be modified accordingly. ``additional_info_input_*`` refers
+The paths have to be modified accordingly. ``simulation_output_*`` refers
 to the calculation output file - MALA provides an interface to condense
 the entire, verbose simulation output to ``.json`` files for further
-processing. In the preceding section, we had to specify calculation output
+processing or on-the-fly descriptor calculation.
+In the preceding section, we had to specify calculation output
 files a number of times - instead, we can use the reduced ``.json`` files
 if we let them be created by the ``DataConverter`` class.

@@ -121,7 +130,7 @@ Once data is provided, the conversion itself is simple.

     data_converter.convert_snapshots(descriptor_save_path="./",
                                      target_save_path="./",
-                                     additional_info_save_path="./",
+                                     simulation_output_save_path="./",
                                      naming_scheme="Be_snapshot*.npy",
                                      descriptor_calculation_kwargs=
                                      {"working_directory": data_path})
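As a side note on the ``ldos_gridsize``, ``ldos_gridspacing_ev``, and ``ldos_gridoffset_ev`` parameters discussed above, the energy grid they describe can be written out explicitly. This is a sketch assuming a simple uniform grid starting at the offset; it is meant to clarify the parameters, not to reproduce MALA's internals:

```python
def ldos_energy_grid(gridsize, gridspacing_ev, gridoffset_ev):
    # Energy values (in eV) at which the LDOS is sampled: `gridsize` evenly
    # spaced points, starting at the lowest sampled energy `gridoffset_ev`.
    return [gridoffset_ev + i * gridspacing_ev for i in range(gridsize)]

# Parameter values matching the Be examples used throughout the MALA docs.
grid = ldos_energy_grid(gridsize=11, gridspacing_ev=2.5, gridoffset_ev=-5)
print(grid[0], grid[-1])  # -5.0 20.0
```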

docs/source/basic_usage/trainingmodel.rst

Lines changed: 30 additions & 5 deletions

@@ -89,10 +89,14 @@ As with any ML library, MALA is a data-driven framework. So before we can
 train a model, we need to add data. The central object to manage data for any
 MALA workflow is the ``DataHandler`` class.

-MALA manages data "per snapshot". One snapshot is one atomic configuration,
-for which volumetric input and output data has been calculated. Data has to
-be added to the ``DataHandler`` object per snapshot, pointing to the
-where the volumetric data files are saved on disk. This is done via
+MALA manages data "per snapshot". One snapshot is an atomic configuration with
+associated volumetric data. Snapshots have to be added to the ``DataHandler``
+object. There are two ways to provide snapshot data, which are selected by
+providing the respective types of data files.
+
+1. Precomputed descriptors: The LDOS is sampled and the volumetric descriptor
+   data is precomputed into either OpenPMD or numpy files
+   (as described :doc:`here <more_data>`), and both can be loaded for training.

 .. code-block:: python

@@ -102,12 +106,33 @@ where the volumetric data files are saved on disk. This is done via
     data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                               "Be_snapshot1.out.npy", data_path, "va")

+2. On-the-fly descriptors: The LDOS is sampled into either OpenPMD or numpy
+   files, while the volumetric descriptor data is computed on-the-fly during
+   training or shuffling. The starting point for the descriptor calculation in
+   this case is the simulation output saved in a JSON file. This mode is only
+   recommended if a GPU-enabled LAMMPS version is available. If this route is
+   used, then descriptor calculation hyperparameters need to be set before
+   adding snapshots; see the :doc:`data conversion manual <more_data>` for
+   details.
+
+.. code-block:: python
+
+    # Bispectrum parameters.
+    parameters.descriptors.descriptor_type = "Bispectrum"
+    parameters.descriptors.bispectrum_twojmax = 10
+    parameters.descriptors.bispectrum_cutoff = 4.67637
+
+    data_handler = mala.DataHandler(parameters)
+    data_handler.add_snapshot("Be_snapshot0.info.json", data_path,
+                              "Be_snapshot0.out.npy", data_path, "tr")
+    data_handler.add_snapshot("Be_snapshot1.info.json", data_path,
+                              "Be_snapshot1.out.npy", data_path, "va")
+
 The ``"tr"`` and ``"va"`` flags signal that the respective snapshots are added
 as training and validation data, respectively. Training data is data the model
 is directly tuned on; validation data is data used to verify the model
 performance during the run time and make sure that no overfitting occurs.
 After data has been added to the ``DataHandler``, it has to be actually loaded
-and scaled via
+(or in the case of on-the-fly usage, computed) and scaled via

 .. code-block:: python
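The scaling step mentioned above can be made concrete. The sketch below shows what a feature-wise standardization of descriptor data looks like (MALA's ``feature-wise-standard`` input rescaling option suggests this behaviour; the code is an illustration of the concept, not MALA's implementation):

```python
def feature_wise_standardize(data):
    # Shift each feature column to zero mean and scale it to unit variance,
    # with statistics computed over all grid points.
    n_rows, n_cols = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
    stds = [
        (sum((row[j] - means[j]) ** 2 for row in data) / n_rows) ** 0.5
        for j in range(n_cols)
    ]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in data
    ]

# Three grid points with two descriptor features each.
descriptors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = feature_wise_standardize(descriptors)

# Every feature column now has mean 0 (and unit variance).
for j in range(2):
    assert abs(sum(row[j] for row in scaled)) < 1e-12
```

Standardizing per feature keeps descriptor components with very different magnitudes from dominating the network input.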

examples/advanced/ex02_shuffle_data.py

Lines changed: 100 additions & 2 deletions

@@ -1,5 +1,3 @@
-import os
-
 import mala

 from mala.datahandling.data_repo import data_path
@@ -10,6 +8,106 @@
 easily done in memory.
 """

+# There are two ways to perform shuffling. One can either shuffle snapshots
+# on the fly and create snapshot-like, temporary files to be used directly
+# for training, or save them to permanent files and load them as needed.
+# The latter is useful when multiple networks are to be trained with the
+# same shuffled data; the former may save hard-disk space by deleting
+# shuffled data right after usage.
+
+
+# 1. Shuffling snapshots on the fly.
+
+# Define parameters for network training after shuffling.
+parameters = mala.Parameters()
+parameters.verbosity = 1
+parameters.data.input_rescaling_type = "feature-wise-standard"
+parameters.data.output_rescaling_type = "minmax"
+parameters.network.layer_activations = ["ReLU"]
+
+# No real training, just showing how shuffling directly before training works.
+parameters.running.max_number_epochs = 5
+parameters.running.mini_batch_size = 40
+parameters.running.learning_rate = 0.00001
+parameters.running.optimizer = "Adam"
+parameters.targets.target_type = "LDOS"
+parameters.targets.ldos_gridsize = 11
+parameters.targets.ldos_gridspacing_ev = 2.5
+parameters.targets.ldos_gridoffset_ev = -5
+parameters.data.use_lazy_loading = True
+
+# This ensures reproducibility of the created data sets.
+parameters.data.shuffling_seed = 1234
+
+data_shuffler = mala.DataShuffler(parameters)
+
+# Instead of precomputed snapshots, on-the-fly calculated ones may be used
+# as well. Here, we use precomputed ones.
+data_shuffler.add_snapshot(
+    "Be_snapshot0.in.npy",
+    data_path,
+    "Be_snapshot0.out.npy",
+    data_path,
+)
+data_shuffler.add_snapshot(
+    "Be_snapshot1.in.npy",
+    data_path,
+    "Be_snapshot1.out.npy",
+    data_path,
+)
+# On-the-fly snapshots can be added as well.
+# data_shuffler.add_snapshot(
+#     "Be_snapshot2.info.json",
+#     data_path,
+#     "Be_snapshot2.out.npy",
+#     data_path,
+# )
+
+
+# Shuffle the snapshots using the "shuffle_to_temporary" option.
+data_shuffler.shuffle_snapshots(
+    complete_save_path=data_path,
+    save_name="Be_shuffled*",
+    shuffle_to_temporary=True,
+    number_of_shuffled_snapshots=2,
+)
+
+# The data shuffler provides a list with these temporary, snapshot-like
+# objects that can then be used in a network training.
+data_handler = mala.DataHandler(parameters)
+data_handler.add_snapshot(
+    data_shuffler.temporary_shuffled_snapshots[0].input_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[0].input_npy_directory,
+    data_shuffler.temporary_shuffled_snapshots[0].output_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[0].output_npy_directory,
+    "tr",
+)
+data_handler.add_snapshot(
+    data_shuffler.temporary_shuffled_snapshots[1].input_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[1].input_npy_directory,
+    data_shuffler.temporary_shuffled_snapshots[1].output_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[1].output_npy_directory,
+    "va",
+)
+data_handler.prepare_data()
+
+# Now we can train a network in the standard MALA way.
+parameters.network.layer_sizes = [
+    data_handler.input_dimension,
+    100,
+    data_handler.output_dimension,
+]
+network = mala.Network(parameters)
+trainer = mala.Trainer(parameters, network, data_handler)
+trainer.train_network()
+
+# Afterwards, the temporary files should be deleted.
+data_shuffler.delete_temporary_shuffled_snapshots()
+
+# 2. Shuffling snapshots to permanent files.
+# After shuffling, the standard approach to MALA data loading can/should
+# be used to train a network.

 parameters = mala.Parameters()
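The temporary-file lifecycle the example relies on (create shuffled files, train on them, delete them) can be sketched with the standard library alone. This illustrates the pattern, not MALA's actual implementation:

```python
import os
import tempfile

# Create a stand-in for a temporary shuffled snapshot file, analogous to
# shuffle_snapshots(..., shuffle_to_temporary=True).
fd, path = tempfile.mkstemp(suffix=".in.npy")
os.close(fd)
assert os.path.exists(path)

# ... a training run would read its data from `path` here ...

# Clean up afterwards, analogous to delete_temporary_shuffled_snapshots().
os.remove(path)
assert not os.path.exists(path)
```

Explicit cleanup matters here because `tempfile.mkstemp` (unlike `TemporaryFile`) leaves the file on disk until it is removed.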

examples/advanced/ex10_convert_numpy_openpmd.py

Lines changed: 9 additions & 9 deletions

@@ -20,15 +20,15 @@
         target_input_path=os.path.join(
             data_path, "Be_snapshot{}.out.npy".format(snapshot)
         ),
-        additional_info_input_type=None,
-        additional_info_input_path=None,
+        simulation_output_type=None,
+        simulation_output_path=None,
         target_units=None,
     )

 data_converter.convert_snapshots(
     descriptor_save_path="./",
     target_save_path="./",
-    additional_info_save_path="./",
+    simulation_output_save_path="./",
     naming_scheme="converted_from_numpy_*.h5",
     descriptor_calculation_kwargs={"working_directory": "./"},
 )
@@ -43,15 +43,15 @@
         descriptor_input_path="converted_from_numpy_{}.in.h5".format(snapshot),
         target_input_type="openpmd",
         target_input_path="converted_from_numpy_{}.out.h5".format(snapshot),
-        additional_info_input_type=None,
-        additional_info_input_path=None,
+        simulation_output_type=None,
+        simulation_output_path=None,
         target_units=None,
     )

 data_converter.convert_snapshots(
     descriptor_save_path="./",
     target_save_path="./",
-    additional_info_save_path="./",
+    simulation_output_save_path="./",
     naming_scheme="verify_against_original_numpy_data_*.npy",
     descriptor_calculation_kwargs={"working_directory": "./"},
 )
@@ -84,15 +84,15 @@
         target_input_path=os.path.join(
             data_path, "Be_snapshot{}.out.h5".format(snapshot)
         ),
-        additional_info_input_type=None,
-        additional_info_input_path=None,
+        simulation_output_type=None,
+        simulation_output_path=None,
         target_units=None,
     )

 data_converter.convert_snapshots(
     descriptor_save_path="./",
     target_save_path="./",
-    additional_info_save_path="./",
+    simulation_output_save_path="./",
     naming_scheme="converted_from_openpmd_*.npy",
     descriptor_calculation_kwargs={"working_directory": "./"},
 )

examples/basic/ex01_train_network.py

Lines changed: 29 additions & 2 deletions

@@ -48,16 +48,43 @@
 # Data has to be added to the MALA workflow. The central object for this
 # is the DataHandler class, which takes care of all data needs. After data
 # has been added, it is loaded and scaled with the prepare_data function.
-####################
+#
+# There are two ways to add descriptor (=input) data. One is to provide
+# precomputed numpy files. This makes sense when training multiple models
+# with the same data. Alternatively, MALA-generated JSON files containing
+# information about a simulation can be provided, with which MALA will
+# automatically generate volumetric descriptor input. This is useful when,
+# e.g., descriptor hyperparameters are to be optimized. In order to use that
+# feature, an existing GPU-ready LAMMPS version is recommended. Creation
+# of MALA JSON files is shown in ex03_preprocess_data.
+####################

 data_handler = mala.DataHandler(parameters)
-# Add a snapshot we want to use in to the list.
+
+# Add precomputed snapshots.
 data_handler.add_snapshot(
     "Be_snapshot0.in.npy", data_path, "Be_snapshot0.out.npy", data_path, "tr"
 )
 data_handler.add_snapshot(
     "Be_snapshot1.in.npy", data_path, "Be_snapshot1.out.npy", data_path, "va"
 )
+# Add snapshots with "raw" (=MALA formatted) JSON; computation of descriptors
+# will be performed "on-the-fly".
+# data_handler.add_snapshot(
+#     "Be_snapshot0.info.json",
+#     data_path,
+#     "Be_snapshot0.out.npy",
+#     data_path,
+#     "tr",
+# )
+# data_handler.add_snapshot(
+#     "Be_snapshot1.info.json",
+#     data_path,
+#     "Be_snapshot1.out.npy",
+#     data_path,
+#     "va",
+# )
+
 data_handler.prepare_data()

 ####################
