📦 Structure Sampling
Structure sampling is used to select, generate, or filter candidate configurations for NEP training and molecular dynamics workflows.
Script Location: Scripts/sample_structures/
Interactive Mode
Sampling tasks are usually easier to run from the interactive menu. From the main menu, choose 2) Sample Structures. The submenu is:
+------------------------------------------------------+
| SAMPLE STRUCTURE TOOLS |
+------------------------------------------------------+
| 201) Sample structures from extxyz |
| 202) PyNEP sampling [deprecated] |
| 203) FPS sampling by NepTrain [preferred] |
| 204) Perturb structure |
| 205) Select max force deviation structs |
+------------------------------------------------------+
| 000) Return to the main menu |
+------------------------------------------------------+
Input the function number:
Available entries:
| Menu | Method | Use Case |
|---|---|---|
| 201 | Uniform/random sampling | quick frame selection from a trajectory |
| 202 | PyNEP notice | prints a notice; use gpumdkit.sh -pynep to run PyNEP |
| 203 | NepTrain FPS | descriptor-based FPS with NepTrain |
| 204 | Perturb structure | generate perturbed structures from POSCAR/CONTCAR |
| 205 | Force-deviation selection | select structures with high model deviation |
Uniform and Random Sampling
From interactive mode, choose 201. You will see:
Input <extxyz_file> <sampling_method> <num_samples> [skip_num]
[skip_num]: number of initial frames to skip, default value is 0
Sampling_method: 'uniform' or 'random'
Example: train.xyz uniform 50
------------>>
Example input:
or:
Arguments:
| Argument | Meaning |
|---|---|
| dump.xyz | input trajectory |
| uniform / random | sampling method |
| 50 / 100 | number of selected structures |
| 500 | optional number of initial frames to skip |
Output:
sampled_structures.xyz
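The difference between the two sampling methods can be sketched in a few lines of Python. This is an illustration only, not the code behind menu entry 201: the function name `sample_frame_indices`, the `seed` parameter, and the choice to include both the first and last available frame in uniform mode are assumptions made for the example.

```python
import random

def sample_frame_indices(num_frames, method, num_samples, skip_num=0, seed=None):
    """Return sorted frame indices chosen from frames skip_num .. num_frames-1.

    Hypothetical helper; the real script may space or draw frames differently.
    """
    candidates = list(range(skip_num, num_frames))
    if not 0 < num_samples <= len(candidates):
        raise ValueError("num_samples must be between 1 and the available frames")
    if method == "uniform":
        if num_samples == 1:
            return [candidates[0]]
        # Evenly spaced positions from the first to the last available frame.
        last = len(candidates) - 1
        return [candidates[(i * last) // (num_samples - 1)] for i in range(num_samples)]
    if method == "random":
        # Draw without replacement; a seed makes the selection reproducible.
        rng = random.Random(seed)
        return sorted(rng.sample(candidates, num_samples))
    raise ValueError("method must be 'uniform' or 'random'")

# Usage: 50 evenly spaced frames from a 1000-frame trajectory, skipping the first 500.
indices = sample_frame_indices(1000, "uniform", 50, skip_num=500)
```

Uniform sampling preserves the time ordering and coverage of the trajectory, while random sampling avoids picking up periodic artifacts at the cost of reproducibility unless a seed is fixed.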
Farthest Point Sampling with NepTrain
This entry uses NepTrain descriptors for FPS. Choose 203 in interactive mode, or run the script directly:
python Scripts/sample_structures/neptrain_select_structs.py dump.xyz train.xyz nep.txt
From interactive mode, choose 203. You will see:
Inputs:
| File | Meaning |
|---|---|
| dump.xyz | candidate structures |
| train.xyz | existing training set |
| nep.txt | current NEP model used to compute descriptors |
Outputs:
- selected.xyz
- select.png
- pca_sample.txt
- pca_train.txt
- pca_selected.txt
During execution, choose one of two selection modes:
- select until the descriptor distance is below a threshold;
- select a specified number of structures.
The script then prints a selection prompt:
Choose selection method:
1) Select structures based on minimum distance
2) Select structures based on number of structures
------------>>
This function requires the NepTrain package. If you use this function, we recommend citing the NepTrain paper printed by the script.
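The two selection modes correspond to two stopping rules for the same greedy farthest-point loop. The sketch below shows that idea on plain descriptor vectors. It is not NepTrain's implementation: the function name `farthest_point_sampling`, its parameters, and the use of Euclidean distance are assumptions for illustration.

```python
import math

def farthest_point_sampling(candidates, existing, max_select=None, min_distance=None):
    """Greedy FPS over descriptor vectors; returns indices into `candidates`.

    Hypothetical sketch: stops after `max_select` picks, or once the farthest
    remaining candidate is closer than `min_distance` to the selected set.
    """
    if max_select is None and min_distance is None:
        raise ValueError("give max_select, min_distance, or both")
    # Distance from each candidate to its nearest point in the existing set.
    if existing:
        nearest = [min(math.dist(c, e) for e in existing) for c in candidates]
    else:
        nearest = [math.inf] * len(candidates)
    limit = len(candidates) if max_select is None else min(max_select, len(candidates))
    selected = []
    while len(selected) < limit:
        best = max(range(len(candidates)), key=lambda i: nearest[i])
        if min_distance is not None and nearest[best] < min_distance:
            break  # everything left is already well represented
        selected.append(best)
        # The new pick may now be the nearest neighbor of other candidates.
        for i, c in enumerate(candidates):
            nearest[i] = min(nearest[i], math.dist(c, candidates[best]))
        nearest[best] = -math.inf  # never pick the same candidate twice
    return selected

# Usage: one training point at the origin, four candidate descriptors.
train = [(0.0, 0.0)]
pool = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (10.0, 0.0)]
picked = farthest_point_sampling(pool, train, min_distance=1.0)
```

Seeding the distances with the existing training set is what keeps the selection from re-adding configurations the model has already seen.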
Deprecated PyNEP FPS
PyNEP FPS is kept for compatibility, but it is only exposed through the direct -pynep entry.
If you choose 202 from the interactive menu, it prints:
+-------------------------------------------------+
| Function 202 is no longer supported here. |
| PyNEP package is no longer actively maintained. |
| Please use 203) NepTrain sampling instead. |
| If you still need PyNEP compatibility, run: |
| gpumdkit.sh -pynep |
+-------------------------------------------------+
Serial PyNEP
Parallel PyNEP
The last argument is the number of CPU threads.
GPUMDkit Entry
Then input:
This entry requires pynep and is kept for compatibility.
Structure Perturbation
Use perturbation when you want to generate initial structures around a known configuration. Choose 204 in interactive mode, or run the script directly:
From interactive mode, choose 204. You will see:
Input <input.vasp> <pert_num> <cell_pert_fraction> <atom_pert_distance> <atom_pert_style>
The default parameters for perturb are 20 0.03 0.2 uniform
Example: POSCAR 20 0.03 0.2 uniform
------------>>
Arguments:
| Argument | Meaning |
|---|---|
| POSCAR | input VASP structure |
| 20 | number of perturbed structures |
| 0.03 | cell perturbation fraction |
| 0.2 | atom perturbation distance in Angstrom |
| uniform | perturbation style: normal, uniform, or const |
Output:
POSCAR_01.vasp, POSCAR_02.vasp, ...
This function requires dpdata. If you use this function, we recommend citing the dpdata package.
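To make the roles of the cell perturbation fraction, the atom perturbation distance, and the perturbation style concrete, here is a minimal standard-library sketch. It is not dpdata's code: the helper names `random_displacement` and `perturb_structure`, the symmetric strain matrix, and the exact definition of each style are assumptions chosen for illustration.

```python
import math
import random

def random_displacement(distance, style, rng):
    """Return one 3D displacement for the given style (illustrative definitions)."""
    gauss = [rng.gauss(0.0, 1.0) for _ in range(3)]
    if style == "normal":
        # Assumption: Gaussian components scaled so the RMS length is `distance`.
        return [g * distance / math.sqrt(3.0) for g in gauss]
    norm = math.sqrt(sum(g * g for g in gauss))
    unit = [g / norm for g in gauss]  # random direction on the unit sphere
    if style == "const":
        length = distance  # fixed length, random direction
    elif style == "uniform":
        length = distance * rng.random() ** (1.0 / 3.0)  # uniform inside a sphere
    else:
        raise ValueError("style must be 'normal', 'uniform', or 'const'")
    return [length * u for u in unit]

def perturb_structure(cell, positions, cell_pert_fraction, atom_pert_distance,
                      style="uniform", seed=None):
    """Return (new_cell, new_positions) for one perturbed copy (Cartesian, Angstrom)."""
    rng = random.Random(seed)
    e = [rng.uniform(-cell_pert_fraction, cell_pert_fraction) for _ in range(6)]
    # Symmetric strain matrix close to the identity.
    strain = [
        [1.0 + e[0], 0.5 * e[5], 0.5 * e[4]],
        [0.5 * e[5], 1.0 + e[1], 0.5 * e[3]],
        [0.5 * e[4], 0.5 * e[3], 1.0 + e[2]],
    ]

    def deform(vec):
        # Row vector times the strain matrix.
        return [sum(vec[k] * strain[k][j] for k in range(3)) for j in range(3)]

    new_cell = [deform(row) for row in cell]
    new_positions = []
    for pos in positions:
        shift = random_displacement(atom_pert_distance, style, rng)
        new_positions.append([p + s for p, s in zip(deform(pos), shift)])
    return new_cell, new_positions

# Usage: one perturbed copy of a cubic cell with the default-like parameters.
cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
new_cell, new_pos = perturb_structure(cell, [[1.0, 1.0, 1.0]], 0.03, 0.2, "uniform", seed=1)
```

With values such as 0.03 and 0.2, each lattice vector changes by at most a few percent and each atom moves by at most a fraction of an Angstrom, which is why larger values quickly produce unphysical contacts.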
Force-Deviation Selection
Use this together with the GPUMD active command. The active command uses a committee model approach: multiple potentials predict forces for the same structure, and GPUMD records the maximum force deviation. select_max_modev.py selects structures with large force deviations from active.out and active.xyz, then writes them to selected.xyz.
Choose 205 in interactive mode, or run the script directly:
From interactive mode, choose 205. You will see:
+----------------------------------------------------+
| Select max force deviation structs from active.xyz |
| generated by the active command in gpumd. |
+----------------------------------------------------+
Input <structs_num> <threshold> (eg. 200 0.15)
------------>>
Arguments:
| Argument | Meaning |
|---|---|
| 200 | maximum number of structures to keep |
| 0.15 | minimum force deviation threshold |
Required files:
- active.out
- active.xyz
Output:
selected.xyz
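The interplay of the two arguments can be sketched as a filter followed by a cap. This example works on an in-memory list of per-frame maximum force deviations rather than parsing active.out, and it is not the code of select_max_modev.py: the function name `select_by_force_deviation` and the choice to keep the largest deviations first are assumptions.

```python
def select_by_force_deviation(deviations, structs_num, threshold):
    """Return frame indices with deviation >= threshold, largest first.

    Hypothetical sketch: at most `structs_num` indices are returned.
    """
    above = [(dev, i) for i, dev in enumerate(deviations) if dev >= threshold]
    # Sort by decreasing deviation; ties keep the earlier frame first.
    above.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in above[:structs_num]]

# Usage: five frames, keep at most two with a deviation of at least 0.15.
max_force_dev = [0.05, 0.30, 0.16, 0.10, 0.22]
keep = select_by_force_deviation(max_force_dev, structs_num=2, threshold=0.15)
```

If fewer frames pass the threshold than `structs_num` allows, all of them are kept, so the threshold controls quality and the count only limits the labeling cost.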
Frame Range Extraction
frame_range.py is an independent CLI tool and is not part of the interactive sampling menu 201–205. Use it when you want to keep only a fraction of a trajectory before sampling, for example after equilibration.
gpumdkit.sh -frame_range dump.xyz 0 0.8
python Scripts/sample_structures/frame_range.py dump.xyz 0 0.8
This writes frames from 0% to 80% of the trajectory. The range arguments are trajectory fractions from 0 to 1.
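As a rough illustration of how fractional arguments map to frame indices, consider the sketch below. The function name `frame_range_indices` and the rounding convention (truncating toward zero, end index exclusive) are assumptions; frame_range.py may round differently at the boundaries.

```python
def frame_range_indices(num_frames, start_frac, end_frac):
    """Return the range of frame indices covered by [start_frac, end_frac)."""
    if not 0.0 <= start_frac <= end_frac <= 1.0:
        raise ValueError("fractions must satisfy 0 <= start <= end <= 1")
    start = int(num_frames * start_frac)
    end = int(num_frames * end_frac)
    return range(start, end)

# Usage: fractions 0 and 0.8 on a 1000-frame trajectory.
frames = frame_range_indices(1000, 0, 0.8)
```

To discard an equilibration period instead, pass a nonzero start fraction, for example 0.2 and 1.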
Example Commands
An example preprocessing-and-sampling sequence is shown below. The first two commands are analyzer tools; the final step is structure sampling.
gpumdkit.sh -min_dist_pbc dump.xyz
gpumdkit.sh -filter_box dump.xyz 13
python Scripts/sample_structures/neptrain_select_structs.py dump.xyz train.xyz nep.txt
The final step can also be run from interactive mode by choosing 2) Sample Structures -> 203.
Common Mistakes
| Problem | Recommendation |
|---|---|
| FPS selects too many similar structures | increase the distance threshold or select fewer structures |
| Perturbed structures are unphysical | reduce cell_pert_fraction and atom_pert_distance, then check minimum distances |
| active.out and active.xyz do not match | regenerate both from the same GPUMD active run |
| PCA plot looks strange | check whether nep.txt matches the chemical species in both datasets |