# simanager

A simple (but flexible) manager for simulations and parameter scans with a CERNy flavour.

A code made not with love, but with hatred. Because when you stare into the abyss of the black background of a terminal window that tells you that your simulation has failed again, you can only hate. And when you use your hate just like a Sith Lord, you can create something powerful.
## Description

`simanager` is a simple manager for your simulations. It is designed to automatically create and manage a directory structure for your simulations, in order to keep them organized and easy to find, and to guarantee reproducibility.
Currently, it (tries to) support the following execution environments:

* local execution on your machine;
* SLURM clusters;
* HTCondor clusters.
## Installation

`simanager` is available on PyPI, so you can install it with:

```bash
pip install simanager
```
However, it is generally recommended to install it directly from the GitHub repository in editable mode:

```bash
git clone https://github.com/carlidel/simanager
cd simanager
pip install -e .
```

in order to have the latest version, be able to customize/contribute to the code, and be able to use the (despicable) CLI command:

```bash
simanager self-update
```

which updates the package to the latest version available on GitHub by executing a `git pull` in the `simanager` folder. (I know that this is not the best way to do it, but it is the simplest one, and it works for now.)
## Usage

`simanager` expects a simulation study to be structured in a precise way. One must first create a master directory, which will contain:

* a main script in bash, which will be used to launch the simulation;
* a master parameter file in YAML, which will contain the parameters of the simulation and will be specialized by the manager for each simulation;
* all support files needed by the simulation (it is currently expected that the main script will launch a Python script, which then loads the YAML parameter file, so the current defaults and examples are for Python simulations).
After the master study is constructed, one can define a set of parameters to be varied. To do that, the `ParameterInspection` dataclass is defined in the `simanager/parameter_inspection.py` file. The `ParameterInspection` dataclass is a container for the parameters to be varied, and it is used to generate a list of specialized parameter files, which will be used by the manager to launch the simulations.

The best way to create a simulation study is to compose, in the desired root directory, a `simulation_study.yaml` file containing the parameters of the study. This file will be used to construct a `SimulationStudy` dataclass, which will be used by the manager to create the directory structure and launch the simulations.
Example of a `simulation_study.yaml` file:

```yaml
# simulation parameters
study_name: "test_local"
original_folder: "/home/camontan/cernbox/work/code/generic_study/tests/example_master_study"
main_file: "main_script.sh"
config_file: "params.yaml"

# The following parameters are used to generate the study
parameters_inspected:
  - parameter_name: "numeric_parameters/max_attempts"
    inspection_method: "range"
    min_value: 1
    max_value: 4
    combination_idx: 0
    combination_method: "meshgrid"
    parameter_file_name: "mxatt"

  - parameter_name: "numeric_parameters/timeout_seconds"
    inspection_method: "linspace"
    min_value: 1
    max_value: 2
    n_samples: 4
    combination_idx: 0
    combination_method: "meshgrid"
    parameter_file_name: "tout"

  - parameter_name: "numeric_list"
    inspection_method: "custom"
    values: [[1, 2, 3], [4, 5, 6]]
    parameter_file_name: "nlist"
```
Then one can load the folder with the `SimulationStudy` dataclass:

```python
import simanager as sim

study = sim.SimulationStudy.load_folder("./")
```

The `SimulationStudy` dataclass will be used by the manager to create the directory structure and launch the simulations. The manager can be used as follows:

```python
study.initialize_folders()
study.print_sim_status()
```
The `SimulationStudy` can then be passed to three different executor functions, which will launch the simulations in the desired environment, namely:

* `sim.job_run_local`, which will launch the simulations on the local machine;
* `sim.job_run_slurm`, which will launch the simulations on a SLURM cluster;
* `sim.job_run_htcondor`, which will launch the simulations on an HTCondor cluster.
## Overview of `SimulationStudy`

### Introduction

`SimulationStudy` is a dataclass that represents a simulation study. It is used to create the directory structure of the study and to launch the simulations. It is also used to keep track of the status of the simulations and to easily inspect the results of the study.

A `SimulationStudy` is created by loading a folder containing a `simulation_study.yaml` file, which holds the parameters of the study. The `simulation_study.yaml` file is expected to sit at the root of the study. The `SimulationStudy` dataclass is then used by the manager to create the directory structure and launch the simulations.

The `SimulationStudy` dataclass contains a list of `ParameterInspection` dataclasses, which are used to generate the specialized parameter files for each simulation. The `ParameterInspection` dataclass is a container for the parameters to be varied, and it is used to generate the list of specialized parameter files that the manager turns into the various simulation cases to specialize and launch.
### Structure of `simulation_study.yaml`

Here is a template of a `simulation_study.yaml` file:

```yaml
study_name: "name_of_the_study"
original_folder: "$STUDYPATH/master_study"
main_file: "main_script.sh"
config_file: "params.yaml"

# The following parameters are used to generate the study
parameters_inspected:
  - parameter_name: "parameters/seed"
    inspection_method: "range"
    min_value: 1
    max_value: 4
    combination_idx: 0
    combination_method: "meshgrid"
    parameter_file_name: "mxatt"
  # ...other parameters...

test_case:
  - parameters/num_particles: 1

environ_dict:
  WEIRDPATH: "/this/is/a/weird_path"
```
The `study_name` key specifies the name of the study, and consequently names the folder containing the simulation files and the directory structure.

The `original_folder` key specifies the path of the master study, which will be copied into the study folders and then specialized accordingly. In the `original_folder` there must be:

* The `main_file`, a bash script that will launch/execute the simulation or run other simulation files.
* The `config_file`, a YAML file containing the various parameters of the simulation. It is expected that the `SimulationStudy` only scans parameters that are already present in the `config_file`.
* All other support files needed by the simulation (it is currently expected that the main script will launch a Python script, which then loads the YAML parameter file, so the current defaults and examples are for Python simulations).
The `parameters_inspected` list contains a list of `ParameterInspection` dataclasses, and specifies the parameters to be varied. Refer to the `ParameterInspection` page for more information on how to specify the parameters to be varied.

The `test_case` key specifies a list of parameters to be used for a test case. This way, one can easily set up a quick test case to check that the simulation is working as expected. The test case will be placed in a `test` folder, which will be created in the `study_name/scan` folder, next to the other simulation cases.

The `environ_dict` key specifies a dictionary of environment variables to be used for path expansion. Refer to the `ParameterInspection` page for more information on how to use this feature.
### Important guidelines to follow

* The job in the `original_folder` must ultimately generate all its output files in a folder named `output_files`, located in the same folder as the main script. This is necessary to easily transfer the output files to the desired location and to easily inspect the results of the study.
* In the case of remote job executions, the state of the simulations has to be explicitly checked with the command `simanager status`, which prints the status of the simulations. This is necessary because remote job execution does not allow for immediate inspection of the simulation status; it is instead probed (partially) by the manager by checking for the presence of flag files in the `remote_touch_files` folder.
* In the case of HTCondor executions, where EOS files might need to be staged in, it is possible to leverage an internal routine for EOS-compliant stage-in. Currently, file paths specified at the first level of depth in the parameter file are detected and moved to a folder named `eos_files` on the scratch disk of the remote machine. This is done by the `simanager run_htcondor` command, which stages in the EOS files before launching the simulations.
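To make the "first level of depth" rule concrete, here is a hypothetical sketch of such a detection pass. The function name and the `/eos/` prefix check are assumptions for illustration, not simanager's actual implementation (which lives in `simanager/job_run_htcondor.py`):

```python
def find_eos_paths(params: dict) -> list:
    """Return first-level string values that look like EOS paths."""
    return [v for v in params.values()
            if isinstance(v, str) and v.startswith("/eos/")]

params = {
    "input_file": "/eos/user/c/someone/data.h5",   # detected (first level)
    "output_file": "out.txt",                      # not an EOS path
    "settings": {"aux": "/eos/user/c/someone/a"},  # nested value: ignored
}
print(find_eos_paths(params))  # only the first-level EOS path is found
```

Note how the nested `settings` value is skipped: only top-level keys of the parameter file are scanned.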
## Getting started quickly with the CLI tools

### 1. Bootstrap a new study from a template

The `simanager` CLI tool can be used to create a new study from a template. The template is a folder containing a `simulation_study.yaml` file, alongside some support files. The CLI tool can be used as follows:

```bash
simanager copy-template -n <name_of_the_root_folder>
```

This creates a new folder with the name specified by the `-n` flag, and copies the template files into it. The `-n` flag is optional; if not specified, the folder will be named `sim_study`.

Note that the following commands must be executed from within the folder containing the `simulation_study.yaml` file.
### 2. Initialize the folder structure

Once the `simulation_study.yaml` file is ready, the folder structure can be initialized by means of the `simanager` CLI tool:

```bash
simanager create
```

This will create the folder structure of the study and copy the `original_folder` into the `study_name/scan` folder, alongside some support folders. The `SimulationStudy` is now ready to be run on the various platforms.
### 3. Run the study

The `SimulationStudy` can then be run with three different executor commands, which launch the simulations in the desired environment:

```bash
simanager run_local
simanager run_htcondor
simanager run_slurm
```

These commands will look by default for a `run_config.yaml` file in the root folder of the study, containing the arguments of the executor functions. The `run_config.yaml` file is optional; if not present, the executor functions will use the default arguments. Refer to the docstrings of the executor functions for more information on the arguments.
### 4. Cat out and err files

From within the directory of the `simulation_study.yaml`, you can run the commands `simanager cat-out` and `simanager cat-err` to automatically print on the terminal the content of the out and err files of the various jobs, which by default are placed into… an `out` and an `err` folder!
### 5. NUKE IT ALL

From within the directory of the `simulation_study.yaml`, you can run the command `simanager nuke` to delete the entire study folder. This command is DANGEROUS, so use it with caution.
## Using `parameter_inspection`

### Overview and examples

The dataclass `ParameterInspection` is used to inspect the parameters of a `SimulationStudy`. It interprets the list of parameters to inspect indicated in the `simulation_study.yaml` file. These scan instructions are expected to be placed under the `parameters_inspected` key of the YAML file, and to follow this format:

```yaml
- parameter_name:       # Full dictionary key of the parameter
  inspection_method:    # Method used to inspect the parameter
  min_value:            # Minimum value of the parameter
  max_value:            # Maximum value of the parameter
  n_samples:            # Number of samples to be generated
  values:               # Custom values to be used
  combination_idx:      # Index if the parameter is part of a combination of parameters
  combination_method:   # Method used to combine multiple parameters
  force_type:           # Force the type of the parameter
  parameter_file_name:  # Name to use in the folder name
```

Note that some of these keys are optional, while others are mandatory depending on the inspection method used. For a better overview of the different inspection methods, please refer to the docstring of the `ParameterInspection` dataclass.
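As a quick illustration of the numeric inspection methods, the docstring maps them to plain NumPy calls (`range` → `np.arange`, `linspace` → `np.linspace`, `logspace` → `np.logspace`, `custom` → the user-provided list). A sketch of the values they generate, not simanager's own code:

```python
import numpy as np

# range: min_value=1, max_value=4 (np.arange endpoint semantics apply)
print(np.arange(1, 4))        # [1 2 3]
# linspace: 4 samples between min_value=1 and max_value=2, endpoints included
print(np.linspace(1, 2, 4))
# logspace: 3 samples from 10**0 to 10**2
print(np.logspace(0, 2, 3))   # [  1.  10. 100.]
```
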
Let us assume that we have the following parameter file in our master study:

```yaml
# parameters.yaml
input_file: "initial_conditions_1.txt"
output_file: "output_file.txt"
my_seed: 1234

settings:
  num_particles: 256
  magic_numbers: [1, 2, 3]
  a_random_float: 0.5

simulation_parameters:
  max_attempts: 100
  timeout_seconds: 600
```

We can then define the `parameters_inspected` key in the `simulation_study.yaml` file as follows, in order to achieve the following parameter scans.
#### Simple scan of individual parameters

```yaml
parameters_inspected:
  - parameter_name: "settings/a_random_float"
    inspection_method: "linspace"
    min_value: 0.0
    max_value: 6.0
    n_samples: 10
    force_type: "float"
    parameter_file_name: "rndf"

  - parameter_name: "my_seed"
    inspection_method: "custom"
    values: [1, 2, 3, 4, 5]
    force_type: "int"
    parameter_file_name: "s"
```

This will generate a total of 50 parameter combinations, whose folder names will follow the convention `case_rndf_<value>_s_<value>`.
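The arithmetic behind the 50 cases can be checked with a quick sketch (illustrative only; simanager builds the grid internally):

```python
import numpy as np
from itertools import product

a_random_float = np.linspace(0.0, 6.0, 10)  # 10 linspace samples
my_seed = [1, 2, 3, 4, 5]                   # 5 custom values

# every value of one scan is paired with every value of the other
cases = list(product(a_random_float, my_seed))
print(len(cases))  # 10 * 5 = 50 specialized parameter files
```
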
Note how the `parameter_name` key specifies the full dictionary key of the parameter to be inspected. When a parameter is in a nested dictionary, like the `a_random_float` parameter, a path-like notation is used.

The `force_type` key forces the type of the parameter, which is useful when the parameter is not a float or an integer, but a string or a list. In this case the `force_type` key is not strictly necessary, since the parameters already have the expected types, but it is used for demonstration purposes. The `custom` inspection method specifies a list of values to be used for the parameter scan. The `linspace` inspection method generates a linearly spaced list of values between `min_value` and `max_value`. Refer to the docstring of the `ParameterInspection` dataclass for more information and alternative inspection methods.
#### Combination of parameters

```yaml
parameters_inspected:
  - parameter_name: "settings/num_particles"
    inspection_method: "custom"
    values: [15, 18, 42, 256]
    combination_idx: 0
    combination_method: "individual"
    parameter_file_name: "np"

  - parameter_name: "settings/magic_numbers"
    inspection_method: "custom"
    values: [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    combination_idx: 0
    combination_method: "individual"
    parameter_file_name: "mn"
```

This will generate a total of 4 parameter combinations, corresponding to a zip-like combination of the two lists of values. The `combination_idx` key specifies the index of the parameter combination, while the `combination_method` key specifies the method used to combine the parameters. In this case, the `individual` method generates a zip-like combination of the parameters. Refer to the docstring of the `ParameterInspection` dataclass for more information and alternative combination methods.
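The `individual` combination behaves like Python's `zip`: parameters sharing the same `combination_idx` advance together. A quick sketch of the pairing (illustrative, not simanager's code):

```python
num_particles = [15, 18, 42, 256]
magic_numbers = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# zip-style pairing: element i of one list goes with element i of the other
cases = list(zip(num_particles, magic_numbers))
print(len(cases))   # 4 combined cases
print(cases[0])     # (15, [1, 2, 3])
```
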
#### Specialization of a path-like parameter

```yaml
parameters_inspected:
  - parameter_name: "input_file"
    inspection_method: "custom"
    values: [
      "$STUDYPATH/../input_data/initial_conditions_1.txt",
      "$STUDYPATH/../input_data/initial_conditions_2.txt",
      "$STUDYPATH/../input_data/initial_conditions_3.txt",
      "$WEIRDPATH/initial_conditions_3.txt"
    ]
    force_type: "path"
    parameter_file_name: "in"

environ_dict:
  WEIRDPATH: "/this/is/a/weird_path"
```

This will generate a total of 4 parameter combinations, corresponding to the 4 different paths specified in the `values` key. Note that the `force_type` key forces the type of the parameter to be a path, which is necessary to enforce the correct path expansion. The `environ_dict` key specifies a dictionary of environment variables to be used for the path expansion; refer to the `SimulationStudy` docstring for that term. In this case, the `WEIRDPATH` environment variable is used to expand the path `$WEIRDPATH/initial_conditions_3.txt`. The variable `STUDYPATH` is automatically defined by the manager as the path where the file `simulation_study.yaml` is located.

Consider using this method also for a single path parameter, since it is more robust to changes in the folder structure of the study. If only one value is provided, the `parameter_file_name` key is not necessary; if it is not provided, the case name will not be needlessly lengthened.
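A minimal sketch of the kind of `$VAR` expansion described above, assuming simple textual substitution; the function name is hypothetical and simanager's real expansion logic may differ:

```python
import os

def expand_path(value: str, environ_dict: dict) -> str:
    """Replace $VAR occurrences using environ_dict, then the OS environment."""
    env = {**os.environ, **environ_dict}  # environ_dict takes precedence
    # replace longest names first, so $STUDYPATH2 is not clobbered by $STUDYPATH
    for name in sorted(env, key=len, reverse=True):
        value = value.replace(f"${name}", env[name])
    return value

print(expand_path("$WEIRDPATH/initial_conditions_3.txt",
                  {"WEIRDPATH": "/this/is/a/weird_path"}))
```
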
## Running on HTCondor

### Introduction

Launching stuff on HTCondor is always an adventure. A painful adventure. One of those adventures that you would rather avoid, but you know that you will have to face sooner or later.

One would expect coherence, stability, and a certain degree of predictability from an HPC system, but HTCondor is not like that. HTCondor is a beast. A mighty and mysterious beast that you have to tame, and that will bite you multiple times before you can even start to think that you are in control. And then set you on fire.

A bit like a cat. A cat that can set you on fire. And that can also fly. So, a dragon. HTCondor is like the Winged Dragon of Ra. Therefore, in order to tame it, we need to sing a song to it. A song that will make it happy, and that will make it want to help us. A song that will make it want to fly.
(Verse)
Ra, the sun, our guiding light,
In HPC’s realm, we take flight.
Python code, to you we send,
With your blessing, may it ascend.
(Chorus)
Ra, Ra, radiant one,
Guide our code till the work is done.
In this HPC domain, your grace we seek,
Speed our simulations, make them peak.
(Verse)
As we submit and scripts take form,
Grant us favor, keep us warm.
May our jobs, like sunbeams, shine,
In your brilliance, may they align.
(Chorus)
Ra, Ra, radiant one,
Guide our code till the work is done.
In this HPC domain, your grace we seek,
Speed our simulations, make them peak.
(Verse)
In the sacred space of computation’s might,
Watch over us, day and night.
When the simulations reach their end,
To you, Ra, our thanks we send.
(Chorus)
Ra, Ra, radiant one,
Guide our code till the work is done.
In this HPC domain, your grace we seek,
Speed our simulations, make them peak.
(~Made with ChatGPT 3.5)
### "Ideal" HTCondor setup (as of today)

The following is a summary of the things that I expect to have in place for a "perfect" HTCondor setup. It is based on my experience with HTCondor and is not exhaustive; it is just a summary of the things and assumptions around which I based the development of the `simanager` package.

When I want to run a Python-based simulation on HTCondor, I expect to have the following things well configured:

* **The usage of a CVMFS environment**, which will be used as the base to run the simulation. For example, I now always construct my simulations around an LCG release, which contains, among various compilers and software amenities, a full-fledged scientific Python installation (N.B., it is the one that SWAN uses!).
* **Consistency with the binary release**: every LCG release is built against a specific version of the OS, and it is expected to be run on that version of the OS. As of today, the entire CERN computing environment is slowly moving to AlmaLinux 9; therefore, that version should be used (and we will see how containers can help us with that).
* **Carefulness in picking CUDA and non-CUDA flavours**: the LCG releases are built with and without CUDA support, the two versions are not fully interchangeable, and the CUDA version can cause issues if run on a machine without CUDA support. Therefore, it is important to pick the right version.
Example of a CVMFS environment:

```bash
# Example of a good standard CVMFS environment for modern Python
source /cvmfs/sft.cern.ch/lcg/views/LCG_104a/x86_64-el9-gcc11-opt/setup.sh
# CUDA version for GPU simulations
source /cvmfs/sft.cern.ch/lcg/views/LCG_104a_cuda/x86_64-el9-gcc11-opt/setup.sh
```
* **A Python `venv` built from CVMFS, living on AFS.** In order to install my own packages, while also benefitting from the standard scientific distribution available in the LCG view, and to avoid clogging my AFS space, I use Python virtual environments with the `--system-site-packages` option on, after having sourced the LCG view I want to base them on. This way, I can compose my personal Python environment with libraries under development. The process to compose such an environment looks like this:

  ```bash
  # WHILE ON lxplus9, IN ORDER TO USE AlmaLinux 9
  # source the LCG view
  source /cvmfs/sft.cern.ch/lcg/views/LCG_104a/x86_64-el9-gcc11-opt/setup.sh
  # cd into the project folder where you want to create the venv
  cd /afs/cern.ch/work/c/camontan/public/my_project
  # create the venv
  python3 -m venv --system-site-packages project_venv
  # activate the venv
  source project_venv/bin/activate
  # install the packages you need
  pip install xsuite
  # install the packages you are developing
  pip install -e /afs/cern.ch/work/c/camontan/public/fantasy_circular_collider
  ```

  It is necessary to execute this process on a machine running AlmaLinux 9, in order to avoid mismatched-binaries shenanigans. This is why I always use lxplus9 for this!!!

  In 90% of cases, this environment will also work with the CUDA version of the LCG view, but it is not guaranteed… Be careful and make an offering to the mighty Ra before trying to run anything too fancy.
* **Render unto EOS what is EOS's. Render unto AFS what is AFS's.** As of today, a standard HTCondor job can only "freely" access AFS directories (as long as AFS is not kicking the bucket with too-warm sectors in the shared filesystem), while EOS cannot be accessed directly and requires a stage-in/stage-out approach instead. A "robust" workflow should therefore follow these good/must practices:

  * Ideally, transfer everything immediately to the scratch disk by means of the submit file;
  * Perform only read operations on AFS directories; never write directly to AFS;
  * Have a well-defined list of files to be read from EOS, and transfer them to the scratch disk by means of a standard method such as `eos cp` or `xrdcp`;
  * At the end of the job, transfer the output files from the scratch disk to EOS by means of a standard method such as `eos cp`, `xrdcp`, or well-defined instructions in the submit file.
With all of these "simple" rules and guidelines in place, you will easily understand the crooked structure and logic of the `job_run_htcondor` function, as well as the various arguments that it can take.

Right now, the structure of the function forces you to follow these guidelines… I would have loved to make it more flexible, but the second I tried, I started to have very intense nightmares of never-ending queues of jobs that would crash over and over and over and over again. And while it is true that one must imagine Sisyphus happy, I would rather not be Sisyphus in the first place.
### The `job_run_htcondor` function

The `job_run_htcondor` function runs a given `SimulationStudy` on HTCondor. It can be run either by means of the CLI command `simanager run_htcondor`, or by importing it in a Python script and calling it directly. If the CLI command is used, a `run_config.yaml` file can be used to store the arguments of the function, avoiding having to type them every time. Refer to the `study_template` folder for a simple example of a `run_config.yaml` file.
When launched with default specialization instructions, the function goes through the following steps:

1. Specialize all main scripts in the parameter scan folders with default initial and final instructions tailored for an HTCondor-flavoured job (inspect `simanager/job_run_htcondor.py` for more details);
2. Compose, in an `htcondor_support` folder, the submit file for the job, the list of folders to queue, and an EOS stage-in support script. The submit file makes use of the `MY.WantOS` flag to ensure that the job will always run on AlmaLinux 9, either on a bare-metal installation or inside a container (100% resource usage with this method!!);
3. Launch the jobs.

The jobs will then be executed on HTCondor:

1. The desired CVMFS environment and venv will be loaded;
2. The automatic stage-in script will detect the presence of EOS-based paths in the various configuration files, and will stage in the necessary files from EOS to the scratch disk;
3. The main script will then be executed;
4. The output, expected to be either a .pkl or a .hdf5 file (want more? feel free to extend the final-instruction default!), will be automatically transferred to the desired EOS path, with the simulation case name appended to the file name;
5. A symbolic link to the output file on EOS will be created in the folder of the simulation case on AFS.
For more details on the various arguments of the function, refer to the docstring of the `job_run_htcondor` function in `simanager/job_run_htcondor.py`.
### Extra CLI feature: easily cat the out and err files

As mentioned above, from within the directory of the `simulation_study.yaml` you can run the commands `simanager cat-out` and `simanager cat-err` to automatically print on the terminal the content of the out and err files of the various jobs, which by default are placed into… an `out` and an `err` folder!

This is extremely useful to quickly check the status of your jobs that just died, and to see on the fly how the almighty Ra is blessing your simulations by setting them on fire in yet another creative way that you would never have imagined.
## Running locally and on Slurm

Differently from the HTCondor function, these two other functions are much more intuitive to work with. Just follow the docstrings, and you should be fine… I hope.
## Docstrings

### `parameter_inspection`

```python
class simanager.parameter_inspection.ParameterInspection(
    parameter_name: str,
    inspection_method: str,
    min_value: float | None = None,
    max_value: float | None = None,
    n_samples: int | None = None,
    values: list | None = None,
    combination_idx: int = -1,
    combination_method: str | None = None,
    parameter_file_name: str | None = None,
    force_type: str | None = None,
)
```

Bases: `object`

`ParameterInspection` defines a parameter or a combination of parameters to be varied in a simulation study.

**Parameters:**

* `parameter_name` (str): Name of the parameter to be varied. Must be a parameter name contained in the YAML master file. The naming convention for nested parameters is `parent_parameter/child_parameter`.
* `inspection_method` (str): Method to be used for parameter inspection. Must be one of the following:
  * `'range'`: inspect the parameter by means of `np.arange`.
  * `'linspace'`: inspect the parameter by means of `np.linspace`.
  * `'logspace'`: inspect the parameter by means of `np.logspace`.
  * `'custom'`: inspect the parameter by means of a custom list of values.
* `min_value` (float, optional): Minimum value of the parameter to be inspected. Must be specified if `inspection_method` is `range`, `linspace`, or `logspace`.
* `max_value` (float, optional): Maximum value of the parameter to be inspected. Must be specified if `inspection_method` is `range`, `linspace`, or `logspace`.
* `n_samples` (int, optional): Number of samples to be inspected. Must be specified if `inspection_method` is `linspace` or `logspace`.
* `values` (list, optional): List of values to be inspected. Must be specified if `inspection_method` is `custom`.
* `combination_idx` (int, optional): Index of the parameter if one wants to combine it with other parameter scans. If `combination_idx` is -1, the parameter is not combined with other parameter scans.
* `combination_method` (str, optional): Method to be used for parameter combination. Must be one of the following:
  * `'meshgrid'`: combine the parameter with other parameter scans by means of `np.meshgrid`.
  * `'individual'`: combine the parameter with other parameter scans as if they were combined with a `zip` function.
  * `'product'`: combine the parameter with other parameter scans by means of `itertools.product`.
* `parameter_file_name` (str, optional): Name of the parameter to be used when composing the folder name of the simulation. If None, the `parameter_name` is used.
* `force_type` (str, optional): Force the type of the parameter to be inspected. Must be one of the following:
  * `'int'`: force the parameter to be an integer.
  * `'float'`: force the parameter to be a float.
  * `'bool'`: force the parameter to be a boolean.
  * `'str'`: force the parameter to be a string.
  * `'path'`: the parameter is a path; the path is expanded if it contains environment variables.
  * `'none'`: don't force the type of the parameter in any situation.

  If None, the type of the parameter follows an internal default. By default None.

**Attributes and class methods:**

* `parameter_name: str`
* `inspection_method: str`
* `min_value: float = None`
* `max_value: float = None`
* `n_samples: int = None`
* `values: list = None`
* `combination_idx: int = -1`
* `combination_method: str = None`
* `parameter_file_name: str = None`
* `force_type: str = None`
* `classmethod from_dict(dictionary)`
* `classmethod from_yaml(yaml_path: str)`