
Commit 8645348

Merge pull request mala-project#630 from RandomDefaultUser/descriptors_on_the_fly
Implement on-the-fly descriptor calculation
2 parents 1588555 + ac255b2 commit 8645348

26 files changed: +1078 −155 lines

docs/source/advanced_usage/trainingmodel.rst

Lines changed: 12 additions & 3 deletions

@@ -170,10 +170,11 @@ data sets have to be saved - in-memory implementations are currently developed.
 To use the data shuffling (also shown in example
 ``advanced/ex02_shuffle_data.py``), you can use the ``DataShuffler`` class.
-The syntax is very easy, you create a ``DataShufller`` object,
+The syntax is very easy: you create a ``DataShuffler`` object,
 which provides the same ``add_snapshot`` functionalities as the ``DataHandler``
-object, and shuffle the data once you have added all snapshots in question,
-i.e.,
+object, and shuffle the data once you have added all snapshots in question.
+Just as with the ``DataHandler`` class, on-the-fly calculation of bispectrum
+descriptors is supported.

 .. code-block:: python

@@ -187,6 +188,14 @@ i.e.,
     data_shuffler.shuffle_snapshots(complete_save_path="../",
                                     save_name="Be_shuffled*")

+By using the ``shuffle_to_temporary`` keyword, you can shuffle the data to
+temporary files, which can be deleted after the training run. This is useful
+if you want to shuffle the data right before training and do not plan to
+re-use shuffled data files for multiple training runs. As detailed in
+``advanced/ex02_shuffle_data.py``, access to temporary files is provided via
+``data_shuffler.temporary_shuffled_snapshots[...]``, which is a list containing
+``mala.Snapshot`` objects.
+
 The seed ``parameters.data.shuffling_seed`` ensures reproducibility of data
 sets. The ``shuffle_snapshots`` function has a path handling ability akin to
 the ``DataConverter`` class. Further, via the ``number_of_shuffled_snapshots``
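The role of ``shuffling_seed`` and ``number_of_shuffled_snapshots`` described above can be illustrated with a short, self-contained sketch in plain Python. This is a conceptual illustration only, not the MALA implementation; the pooling-and-chunking logic is an assumption made for demonstration purposes:

```python
import random

def shuffle_to_files(gridpoints, number_of_shuffled_snapshots, seed):
    # Conceptually mimic the DataShuffler: pool all grid points from all
    # snapshots, shuffle them with a fixed seed (reproducibility), and split
    # the result into equally sized "shuffled snapshot" chunks.
    rng = random.Random(seed)
    pool = list(gridpoints)
    rng.shuffle(pool)
    chunk_size = len(pool) // number_of_shuffled_snapshots
    return [
        pool[i * chunk_size : (i + 1) * chunk_size]
        for i in range(number_of_shuffled_snapshots)
    ]

# Two toy "snapshots" of labelled grid points.
snapshot0 = [("snap0", i) for i in range(4)]
snapshot1 = [("snap1", i) for i in range(4)]

shuffled = shuffle_to_files(
    snapshot0 + snapshot1, number_of_shuffled_snapshots=2, seed=1234
)

# The same seed yields identical shuffled data sets on every run.
assert shuffled == shuffle_to_files(
    snapshot0 + snapshot1, number_of_shuffled_snapshots=2, seed=1234
)
```

The resulting chunks will generally mix grid points from both snapshots, which is the property the shuffling is meant to provide for mini-batch training.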

docs/source/basic_usage/more_data.rst

Lines changed: 15 additions & 6 deletions

@@ -49,7 +49,13 @@ MALA can be used to process raw data into ready-to-use data for ML-DFT model
 creation. For this, the ``DataConverter`` class can be used, as also shown
 in the example ``basic/ex03_preprocess_data``.
 The first thing when converting data is to select how the data should be
-processed. Up until now, MALA operates with bispectrum descriptors as
+processed. As outlined in :doc:`the training documentation <trainingmodel>`,
+there are two ways to provide descriptor data to MALA models. One can either
+precompute files containing descriptors with the ``DataConverter`` class or
+compute descriptor data on-the-fly by providing MALA-generated JSON files
+containing simulation output information. These JSON files can also be
+generated by the ``DataConverter`` class.
+Up until now, MALA operates with bispectrum descriptors as
 input data (=descriptors) and LDOS as output data (=targets). Their
 calculation is performed via

@@ -73,6 +79,8 @@ values are included in the energy grid upon which the LDOS is sampled,
 ``ldos_gridoffset_ev`` determines the lowest energy value sampled. These values
 are chosen for the ``pp.x`` simulation and have to be given here.

+If descriptors are precomputed, then hyperparameters for their calculation
+have to be provided.
 For the bispectrum calculation, ``bispectrum_cutoff`` gives the radius of
 the cutoff sphere from which information on the atomic structure is incorporated
 into the bispectrum descriptor vector at each point in space, whereas

@@ -99,19 +107,20 @@ and fill it with data, e.g., by
         descriptor_input_path=outfile,
         target_input_type=".cube",
         target_input_path=ldosfile,
-        additional_info_input_type="espresso-out",
-        additional_info_input_path=outfile,
+        simulation_output_type="espresso-out",
+        simulation_output_path=outfile,
         target_units="1/(Ry*Bohr^3)")

 The ``add_snapshot`` function can be called multiple times to add
 multiple snapshots to MALA.
 For regular Quantum ESPRESSO calculations, the ``descriptor_input_type``
 and ``target_input_type`` will always be ``"espresso-out"`` and ``".cube"``,
 respectively, and the ``target_units`` will always be ``"1/(Ry*Bohr^3)"``.
-The paths have to be modified accordingly. ``additional_info_input_*`` refers
+The paths have to be modified accordingly. ``simulation_output_*`` refers
 to the calculation output file - MALA provides an interface to condense
 the entire, verbose simulation output to ``.json`` files for further
-processing. In the preceding section, we had to specify calculation output
+processing or on-the-fly descriptor calculation.
+In the preceding section, we had to specify calculation output
 files a number of times - instead, we can use the reduced ``.json`` files
 if we let them be created by the ``DataConverter`` class.

@@ -121,7 +130,7 @@ Once data is provided, the conversion itself is simple.

     data_converter.convert_snapshots(descriptor_save_path="./",
                                      target_save_path="./",
-                                     additional_info_save_path="./",
+                                     simulation_output_save_path="./",
                                      naming_scheme="Be_snapshot*.npy",
                                      descriptor_calculation_kwargs=
                                      {"working_directory": data_path})
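As a side note on the ``ldos_gridsize``, ``ldos_gridspacing_ev``, and ``ldos_gridoffset_ev`` parameters discussed above, the energy grid they describe can be written out explicitly. This is a sketch assuming a simple uniform grid starting at the offset; it is meant to clarify the parameters, not to reproduce MALA's internals:

```python
def ldos_energy_grid(gridsize, gridspacing_ev, gridoffset_ev):
    # Energy values (in eV) at which the LDOS is sampled: `gridsize` evenly
    # spaced points, starting at the lowest sampled energy `gridoffset_ev`.
    return [gridoffset_ev + i * gridspacing_ev for i in range(gridsize)]

# Parameter values matching the Be examples used throughout the MALA docs.
grid = ldos_energy_grid(gridsize=11, gridspacing_ev=2.5, gridoffset_ev=-5)
print(grid[0], grid[-1])  # -5.0 20.0
```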

docs/source/basic_usage/trainingmodel.rst

Lines changed: 30 additions & 5 deletions

@@ -89,10 +89,14 @@ As with any ML library, MALA is a data-driven framework. So before we can
 train a model, we need to add data. The central object to manage data for any
 MALA workflow is the ``DataHandler`` class.

-MALA manages data "per snapshot". One snapshot is one atomic configuration,
-for which volumetric input and output data has been calculated. Data has to
-be added to the ``DataHandler`` object per snapshot, pointing to the
-where the volumetric data files are saved on disk. This is done via
+MALA manages data "per snapshot". One snapshot is an atomic configuration with
+associated volumetric data. Snapshots have to be added to the ``DataHandler``
+object. There are two ways to provide snapshot data, which are selected by
+providing the respective types of data files.
+
+1. Precomputed descriptors: The LDOS is sampled and the volumetric descriptor
+   data is precomputed into either OpenPMD or numpy files
+   (as described :doc:`here <more_data>`), and both can be loaded for training.

 .. code-block:: python

@@ -102,12 +106,33 @@ where the volumetric data files are saved on disk. This is done via
     data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                               "Be_snapshot1.out.npy", data_path, "va")

+2. On-the-fly descriptors: The LDOS is sampled into either OpenPMD or numpy
+   files, while the volumetric descriptor data is computed on-the-fly during
+   training or shuffling. The starting point for the descriptor calculation in
+   this case is the simulation output saved in a JSON file. This mode is only
+   recommended if a GPU-enabled LAMMPS version is available. If this route is
+   used, then descriptor calculation hyperparameters need to be set before
+   adding snapshots; see the :doc:`data conversion manual <more_data>` for
+   details.
+
+.. code-block:: python
+
+    # Bispectrum parameters.
+    parameters.descriptors.descriptor_type = "Bispectrum"
+    parameters.descriptors.bispectrum_twojmax = 10
+    parameters.descriptors.bispectrum_cutoff = 4.67637
+
+    data_handler = mala.DataHandler(parameters)
+    data_handler.add_snapshot("Be_snapshot0.info.json", data_path,
+                              "Be_snapshot0.out.npy", data_path, "tr")
+    data_handler.add_snapshot("Be_snapshot1.info.json", data_path,
+                              "Be_snapshot1.out.npy", data_path, "va")
+
 The ``"tr"`` and ``"va"`` flags signal that the respective snapshots are added
 as training and validation data, respectively. Training data is data the model
 is directly tuned on; validation data is data used to verify the model
 performance during the run time and make sure that no overfitting occurs.
 After data has been added to the ``DataHandler``, it has to be actually loaded
-and scaled via
+(or in the case of on-the-fly usage, computed) and scaled via

 .. code-block:: python
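The scaling step mentioned above can be made concrete. The sketch below shows what a feature-wise standardization of descriptor data looks like (MALA's ``feature-wise-standard`` input rescaling option suggests this behaviour; the code is an illustration of the concept, not MALA's implementation):

```python
def feature_wise_standardize(data):
    # Shift each feature column to zero mean and scale it to unit variance,
    # with statistics computed over all grid points.
    n_rows, n_cols = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
    stds = [
        (sum((row[j] - means[j]) ** 2 for row in data) / n_rows) ** 0.5
        for j in range(n_cols)
    ]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in data
    ]

# Three grid points with two descriptor features each.
descriptors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = feature_wise_standardize(descriptors)

# Every feature column now has mean 0 (and unit variance).
for j in range(2):
    assert abs(sum(row[j] for row in scaled)) < 1e-12
```

Standardizing per feature keeps descriptor components with very different magnitudes from dominating the network input.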

examples/advanced/ex02_shuffle_data.py

Lines changed: 100 additions & 2 deletions

@@ -1,5 +1,3 @@
-import os
-
 import mala

 from mala.datahandling.data_repo import data_path
@@ -10,6 +8,106 @@
 easily done in memory.
 """

+# There are two ways to perform shuffling. One can either shuffle snapshots
+# on the fly and create snapshot-like, temporary files to be used directly
+# for training, or save them to permanent files and load them as needed.
+# The latter is useful when multiple networks are to be trained with the
+# same shuffled data; the former may save hard-disk space by deleting
+# shuffled data right after usage.
+
+
+# 1. Shuffling snapshots on the fly.
+
+# Define parameters for network training after shuffling.
+parameters = mala.Parameters()
+parameters.verbosity = 1
+parameters.data.input_rescaling_type = "feature-wise-standard"
+parameters.data.output_rescaling_type = "minmax"
+parameters.network.layer_activations = ["ReLU"]
+
+# No real training, just showing how shuffling directly before training works.
+parameters.running.max_number_epochs = 5
+parameters.running.mini_batch_size = 40
+parameters.running.learning_rate = 0.00001
+parameters.running.optimizer = "Adam"
+parameters.targets.target_type = "LDOS"
+parameters.targets.ldos_gridsize = 11
+parameters.targets.ldos_gridspacing_ev = 2.5
+parameters.targets.ldos_gridoffset_ev = -5
+parameters.data.use_lazy_loading = True
+
+# This ensures reproducibility of the created data sets.
+parameters.data.shuffling_seed = 1234
+
+data_shuffler = mala.DataShuffler(parameters)
+
+# Instead of precomputed snapshots, on-the-fly calculated ones may be used
+# as well. Here, we use precomputed ones.
+data_shuffler.add_snapshot(
+    "Be_snapshot0.in.npy",
+    data_path,
+    "Be_snapshot0.out.npy",
+    data_path,
+)
+data_shuffler.add_snapshot(
+    "Be_snapshot1.in.npy",
+    data_path,
+    "Be_snapshot1.out.npy",
+    data_path,
+)
+# On-the-fly snapshots can be added as well.
+# data_shuffler.add_snapshot(
+#     "Be_snapshot2.info.json",
+#     data_path,
+#     "Be_snapshot2.out.npy",
+#     data_path,
+# )
+
+
+# Shuffle the snapshots using the "shuffle_to_temporary" option.
+data_shuffler.shuffle_snapshots(
+    complete_save_path=data_path,
+    save_name="Be_shuffled*",
+    shuffle_to_temporary=True,
+    number_of_shuffled_snapshots=2,
+)
+
+# The data shuffler provides a list with these temporary, snapshot-like
+# objects that can then be used in a network training.
+data_handler = mala.DataHandler(parameters)
+data_handler.add_snapshot(
+    data_shuffler.temporary_shuffled_snapshots[0].input_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[0].input_npy_directory,
+    data_shuffler.temporary_shuffled_snapshots[0].output_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[0].output_npy_directory,
+    "tr",
+)
+data_handler.add_snapshot(
+    data_shuffler.temporary_shuffled_snapshots[1].input_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[1].input_npy_directory,
+    data_shuffler.temporary_shuffled_snapshots[1].output_npy_file,
+    data_shuffler.temporary_shuffled_snapshots[1].output_npy_directory,
+    "va",
+)
+data_handler.prepare_data()
+
+# Now we can train a network in the standard MALA way.
+parameters.network.layer_sizes = [
+    data_handler.input_dimension,
+    100,
+    data_handler.output_dimension,
+]
+network = mala.Network(parameters)
+trainer = mala.Trainer(parameters, network, data_handler)
+trainer.train_network()
+
+# Afterwards, the temporary files should be deleted.
+data_shuffler.delete_temporary_shuffled_snapshots()
+
+# 2. Shuffling snapshots to permanent files.
+# After shuffling, the standard approach to MALA data loading can/should
+# be used to train a network.

 parameters = mala.Parameters()
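The temporary-file lifecycle the example relies on (create shuffled files, train on them, delete them) can be sketched with the standard library alone. This illustrates the pattern, not MALA's actual implementation:

```python
import os
import tempfile

# Create a stand-in for a temporary shuffled snapshot file, analogous to
# shuffle_snapshots(..., shuffle_to_temporary=True).
fd, path = tempfile.mkstemp(suffix=".in.npy")
os.close(fd)
assert os.path.exists(path)

# ... a training run would read its data from `path` here ...

# Clean up afterwards, analogous to delete_temporary_shuffled_snapshots().
os.remove(path)
assert not os.path.exists(path)
```

Explicit cleanup matters here because `tempfile.mkstemp` (unlike `TemporaryFile`) leaves the file on disk until it is removed.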

examples/advanced/ex10_convert_numpy_openpmd.py

Lines changed: 9 additions & 9 deletions

@@ -20,15 +20,15 @@
         target_input_path=os.path.join(
             data_path, "Be_snapshot{}.out.npy".format(snapshot)
         ),
-        additional_info_input_type=None,
-        additional_info_input_path=None,
+        simulation_output_type=None,
+        simulation_output_path=None,
         target_units=None,
     )

 data_converter.convert_snapshots(
     descriptor_save_path="./",
     target_save_path="./",
-    additional_info_save_path="./",
+    simulation_output_save_path="./",
     naming_scheme="converted_from_numpy_*.h5",
     descriptor_calculation_kwargs={"working_directory": "./"},
 )
@@ -43,15 +43,15 @@
         descriptor_input_path="converted_from_numpy_{}.in.h5".format(snapshot),
         target_input_type="openpmd",
         target_input_path="converted_from_numpy_{}.out.h5".format(snapshot),
-        additional_info_input_type=None,
-        additional_info_input_path=None,
+        simulation_output_type=None,
+        simulation_output_path=None,
         target_units=None,
     )

 data_converter.convert_snapshots(
     descriptor_save_path="./",
     target_save_path="./",
-    additional_info_save_path="./",
+    simulation_output_save_path="./",
     naming_scheme="verify_against_original_numpy_data_*.npy",
     descriptor_calculation_kwargs={"working_directory": "./"},
 )
@@ -84,15 +84,15 @@
         target_input_path=os.path.join(
             data_path, "Be_snapshot{}.out.h5".format(snapshot)
         ),
-        additional_info_input_type=None,
-        additional_info_input_path=None,
+        simulation_output_type=None,
+        simulation_output_path=None,
         target_units=None,
     )

 data_converter.convert_snapshots(
     descriptor_save_path="./",
     target_save_path="./",
-    additional_info_save_path="./",
+    simulation_output_save_path="./",
     naming_scheme="converted_from_openpmd_*.npy",
     descriptor_calculation_kwargs={"working_directory": "./"},
 )

examples/basic/ex01_train_network.py

Lines changed: 29 additions & 2 deletions

@@ -48,16 +48,43 @@
 # Data has to be added to the MALA workflow. The central object for this
 # is the DataHandler class, which takes care of all data needs. After data
 # has been added, it is loaded and scaled with the prepare_data function.
-####################
+#
+# There are two ways to add descriptor (=input) data. One is to provide
+# precomputed numpy files. This makes sense when training multiple models
+# with the same data. Alternatively, MALA-generated JSON files containing
+# information about a simulation can be provided, with which MALA will
+# automatically generate volumetric descriptor input. This is useful when,
+# e.g., descriptor hyperparameters are to be optimized. In order to use that
+# feature, an existing GPU-ready LAMMPS version is recommended. Creation
+# of MALA JSON files is shown in ex03_preprocess_data.
+####################

 data_handler = mala.DataHandler(parameters)
-# Add a snapshot we want to use in to the list.
+
+# Add precomputed snapshots.
 data_handler.add_snapshot(
     "Be_snapshot0.in.npy", data_path, "Be_snapshot0.out.npy", data_path, "tr"
 )
 data_handler.add_snapshot(
     "Be_snapshot1.in.npy", data_path, "Be_snapshot1.out.npy", data_path, "va"
 )
+# Add snapshots with "raw" (=MALA formatted) JSON; computation of descriptors
+# will be performed "on-the-fly".
+# data_handler.add_snapshot(
+#     "Be_snapshot0.info.json",
+#     data_path,
+#     "Be_snapshot0.out.npy",
+#     data_path,
+#     "tr",
+# )
+# data_handler.add_snapshot(
+#     "Be_snapshot1.info.json",
+#     data_path,
+#     "Be_snapshot1.out.npy",
+#     data_path,
+#     "va",
+# )
+
 data_handler.prepare_data()

 ####################
