diffractem.dataset module

class diffractem.dataset.Dataset[source]

Bases: object

Stacks(**kwargs)[source]

Context manager to handle the opening and closing of stacks. Returns the opened data stacks, which are automatically closed once the context is left. Arguments are passed to open_stacks. Example:

    with ds.Stacks(readonly=True, chunking='dataset') as stk:
        center = stk.beam_center.compute()
        print('Have', center.shape[0], 'centers.')

This context manager is deprecated, and using it is strongly discouraged.

add_stack(label, stack, overwrite=False, set_diff_stack=False, persist=True, rechunk=True)[source]

Adds a data stack to the data set.

The new data stack can be either a dask array or a numpy array. The only restriction is that the length of its first dimension (i.e. the total number of shots) has to equal that of the rest of the dataset. The stack is not stored to disk yet, but it is placed under the control of the dataset object.

If the new data is a numpy array, it will be turned into a dask array with appropriate properties. By default (persist=True), it will be eagerly persisted, that is, a copy will be made and the dask graph will be simplified.

Parameters:
  • label (str) – Label for the new stack

  • stack (Union[da.Array, np.ndarray, h5py.Dataset]) – New data stack

  • overwrite (bool, optional) – Overwrite, if an identically named stack exists already. Defaults to False.

  • set_diff_stack (bool, optional) – Set the new stack as the ‘diffraction data’ stack, which will receive some special treatment (e.g. it is never loaded into memory). Defaults to False.

  • persist (bool, optional) – If the stack is a numpy array, persist the resulting dask array right away. There is little reason not to do so, except for some edge cases. Defaults to True.

  • rechunk (bool, optional) – If the stack is a dask array with a chunk along the first dimension that does not match the dataset’s overall chunking, rechunk it. This is highly recommended. Defaults to True.

aggregate(by=('sample', 'region', 'run', 'crystal_id'), how='sum', file_suffix='_agg.h5', file_prefix='', new_folder=None, query=None, exclude_stacks=None)[source]

Aggregate sub-sets of stacks (like individual diffraction movies) using different aggregation functions.

Each set of shots with identical values in the columns specified in by is squashed into a single shot, using the aggregation functions given in how, which can differ between stacks. Unlike for the stacks, fields in the shot list that are inconsistent within a group are simply dropped. The function returns a new dataset containing the aggregated data and leaves the existing one untouched.

The typical application is to sum sub-stacks of dose fractionation movies, or shots with different tilt angles (quasi-precession). If you are somewhat familiar with pandas, it is similar to a DataFrame.groupby(by).agg operation.

In most cases (well-ordered data sets), this function should just work. More pathological ones are not sufficiently tested, though some sanity checks and precautions are taken.

As an example: setting by=[‘sample’, ‘region’, ‘run’, ‘crystal_id’] (which is the default) will aggregate over all shots taken in a single run, and if you set how=‘sum’, the stacks will be added.
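The grouping logic described above can be illustrated with plain pandas and NumPy. This is only a minimal sketch of the idea, not diffractem's implementation: the shot table, the grid column and the tiny 2x2 stack are made-up illustration data, and only the how=‘sum’ case is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical shot table: two crystals with three movie frames each.
shots = pd.DataFrame({
    "sample": ["s1"] * 6,
    "region": [0] * 6,
    "run": [0] * 6,
    "crystal_id": [1, 1, 1, 2, 2, 2],
    "frame": [0, 1, 2, 0, 1, 2],   # differs within a group
    "grid": ["A"] * 6,             # constant within a group
})

# Hypothetical stack: one 2x2 "pattern" per shot (first axis = shots).
stack = np.arange(24, dtype=float).reshape(6, 2, 2)

by = ["sample", "region", "run", "crystal_id"]
grouped = shots.groupby(by, sort=True)

# Integer group label for every shot, numbered in sorted key order.
group_id = grouped.ngroup().to_numpy()

# how='sum': add up all patterns that belong to the same group.
agg_stack = np.zeros((group_id.max() + 1,) + stack.shape[1:])
np.add.at(agg_stack, group_id, stack)

# Shot-table columns that are inconsistent within any group are dropped.
others = [c for c in shots.columns if c not in by]
keep = [c for c in others if (grouped[c].nunique() <= 1).all()]
agg_shots = shots.groupby(by, sort=True, as_index=False)[keep].first()

print(agg_stack.shape)           # one aggregated pattern per crystal
print(list(agg_shots.columns))   # key columns plus the consistent ones
```

In this toy data the frame column varies within each crystal and therefore disappears from the aggregated shot table, while grid survives.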

Parameters:
  • by (Union[list, tuple], optional) – shot table columns to group by for aggregation. Defaults to (‘sample’, ‘region’, ‘run’, ‘crystal_id’).

  • how (Union[dict, str], optional) – string specifying the aggregation method for stacks. Allowed values are ‘mean’, ‘sum’, ‘first’, ‘last’. You can also specify a dict with different values for each stack, like {‘raw_counts’: ‘sum’, ‘nPeaks’: ‘first’}. Defaults to ‘sum’.

  • file_suffix (str, optional) – as in change_filenames. Defaults to ‘_agg.h5’.

  • file_prefix (str, optional) – as in change_filenames. Defaults to ‘’.

  • new_folder (Union[str, None], optional) – as in change_filenames. Defaults to None.

  • query (Union[str, None], optional) – additional query to sub-select data before aggregation (as in select or get_selection). E.g. query=‘frame >= 1 and frame < 5’ would only aggregate frames 1 to 4. Defaults to None.

  • exclude_stacks (Optional[list], optional) – Exclude stacks from the aggregated dataset. Defaults to None.

Returns:

Dataset containing the aggregated data

Return type:

Dataset

change_filenames(file_suffix='.h5', file_prefix='', new_folder=None, fn_map=None, keep_raw=True)[source]

Change file names in all lists using some handy modifications.

The old file names are copied to a “file_raw” column, if not already present (can be overridden with keep_raw).

Parameters:
  • file_suffix (Optional[str], optional) – add suffix to file, INCLUDING file extension, e.g. ‘_modified.h5’. Defaults to ‘.h5’, i.e., no change is made except for the file extension being fixed to h5.

  • file_prefix (str, optional) – add prefix to actual filenames (not folder/full path!), e.g. ‘aggregated_’. Defaults to ‘’, i.e., no prefix.

  • new_folder (Union[str, None], optional) – If not None, changes the file folders to this path. Defaults to None.

  • fn_map (Union[pd.DataFrame, None], optional) – if not None, expects an explicit table (pd.DataFrame) with columns ‘file’ and ‘file_new’ that manually maps old to new filenames. All other parameters are ignored, if provided. Defaults to None.

  • keep_raw (bool, optional) – If True (default), does not change the file_raw column in the shot list, unless there is none yet (in which case the old file names are always copied to file_raw). Defaults to True.
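How file_suffix, file_prefix and new_folder might combine can be sketched with the standard library alone. The helper renamed below is hypothetical and only reflects one reading of the parameter descriptions above; it is not diffractem's own code and ignores fn_map and keep_raw.

```python
from pathlib import PurePosixPath
from typing import Optional


def renamed(path: str, file_suffix: str = ".h5", file_prefix: str = "",
            new_folder: Optional[str] = None) -> str:
    """Hypothetical helper: build a new file name from an old one."""
    old = PurePosixPath(path)
    # new_folder replaces the folder part; otherwise the folder is kept.
    folder = PurePosixPath(new_folder) if new_folder is not None else old.parent
    # The suffix replaces the old extension, the prefix goes before the name.
    return str(folder / (file_prefix + old.stem + file_suffix))


agg_name = renamed("data/run_001.h5", file_suffix="_agg.h5")
fixed_ext = renamed("data/run_001.nxs")
moved = renamed("data/run_001.h5", file_prefix="aggregated_", new_folder="proc")
print(agg_name, fixed_ext, moved)
```

With the default arguments only the extension changes, which matches the description of the ‘.h5’ default.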

close_files()[source]

Closes all HDF5 files.

Note that this might have side effects: if stacks are accessible that depend on non-persisted HDF5 datasets in the files, they will not be usable anymore after issuing this command and cause trouble especially for the distributed scheduler. So don’t close the files unless you really have to.

close_stacks()

Closes all HDF5 files.

Note that this might have side effects: if stacks are accessible that depend on non-persisted HDF5 datasets in the files, they will not be usable anymore after issuing this command and cause trouble especially for the distributed scheduler. So don’t close the files unless you really have to.

compute_and_save(diff_stack_label=None, list_file=None, client=None, exclude_stacks=None, overwrite=False, persist_diff=True, persist_all=False, compression=32004, store_features=True)[source]

Compound method to fully compute a dataset and write it to disk.

It is designed for completely writing HDF5 files from scratch, not for appending to or modifying existing ones, in which case you have to use the more fine-grained methods for data storage. The following steps are taken:

  • Initialize the HDF5 files (using init_files)

  • Store the metadata tables (shots, features, peaks, predictions)

  • Compute/store all non-diffraction-data stacks (using store_stacks). If this step takes too long, make sure that computation-heavy but small stacks are already persisted in memory.

  • Compute/store the diffraction data set (identified by diff_stack_label) using store_stack_fast.

  • Write a list file which can be used to reload the dataset or to feed into CrystFEL.

Parameters:
  • diff_stack_label (Optional[str], optional) – Label of the diffraction data stack. If None, use the one stored in diff_stack_label. Defaults to None.

  • list_file (Optional[str], optional) – Name of the list file to be written. Defaults to None.

  • client (Optional[Client], optional) – dask.distributed client for computation of the diffraction data. Defaults to None.

  • exclude_stacks (Union[str,List[str]], optional) – Labels of data stacks to exclude. Defaults to None.

  • overwrite (bool, optional) – Overwrite existing files. Defaults to False.

  • persist_diff (bool, optional) – Changes the dask array underlying the diffraction data stack from the computed one to the one stored in the HDF5 file. This is different from persisting to memory (as is done otherwise), as it persists the data from disk: if you access it using e.g. .compute(), it will be loaded from disk instead of being recomputed. Defaults to True.

  • persist_all (bool, optional) – Changes dask arrays underlying all stacks from the computed one to the one stored in the HDF5 file. Defaults to False.

  • compression (Union[str, int], optional) – HDF5 compression filter to use. Common choices are ‘gzip’, ‘none’, or 32004, which is the lz4 filter often used for diffraction data. Defaults to 32004.

  • store_features (bool, optional) – store/overwrite the feature table into the files. Defaults to True.

compute_pattern_info(opts, client=None, output_file='image_info.h5')[source]

Computes the diffraction pattern information (center, peaks, virtual dark field etc.) for the diffraction data stack of the data set. Encapsulates proc2d.get_pattern_info, automatically merging its outcome into the dataset. Also writes a diffraction pattern info file, which is a fully valid diffractem HDF5 file, just without the actual diffraction patterns (hence very small); it can be used for indexing without the actual data files, e.g. on a remote cluster.

Parameters:
  • opts (PreprocOpts, str) – PreProcOpts object or filename of a preprocessing options yaml file.

  • client (dask.distributed.Client, optional) – dask.distributed Client object. Supply if you have a cluster running already. If None, creates one (with default settings) for this task specifically, which is shut down after completion. This is a bit inefficient and does not allow custom settings (such as a scratch drive), so starting a cluster explicitly and supplying it here might be a good idea. Defaults to None.

  • output_file (str, optional) – File name of pattern info HDF5 file. Defaults to ‘image_info.h5’.

copy(file_suffix='_copy.h5', file_prefix='', new_folder=None)[source]

Makes a (deep) copy of a dataset, changing the file names.

Internally, this just calls get_selection with query=’True’.

Parameters:
  • file_suffix (Optional[str], optional) – as in change_filenames. Defaults to ‘_copy.h5’.

  • file_prefix (str, optional) – as in change_filenames. Defaults to ‘’.

  • new_folder (Union[str, None], optional) – as in change_filenames. Defaults to None.

Returns:

Copy of the dataset

Return type:

Dataset

data_pattern: str

Path to data stacks in HDF5 files. % can be used as placeholder (as in CrystFEL). Default /%/data
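The % placeholder can be pictured as a simple string substitution. The sketch below is an assumption for illustration only: the helper expand_pattern is hypothetical, and the subset name ‘entry’ is borrowed from the default of get_map(file, subset='entry').

```python
def expand_pattern(pattern: str, subset: str) -> str:
    """Hypothetical helper: replace the % placeholder by a subset name."""
    return pattern.replace("%", subset)


# Assumed subset name 'entry'; the pattern is the documented default.
data_path = expand_pattern("/%/data", "entry")
print(data_path)
```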

delete_stack(label, from_files=False)[source]

Delete a data stack from the dataset

Parameters:
  • label (str) – label of the stack to delete

  • from_files (bool, optional) – Also delete stack from the data files. Note that this will actually not free up disk space. It works if the files are open in writable mode. Defaults to False.

property diff_data: Array

Returns the diffraction data stack (as identified by the diff_stack_label property).

Return type:

Array

property diff_stack_label

Label of stack which holds the diffraction data.

property features: DataFrame

List of features (that is e.g. crystals). Each feature can have one or many shots associated with it.

Return type:

DataFrame

property file_handles: dict

Handles to the HDF5 files as a dict with keys matching the file name, if files are open. Otherwise returns None (for each file).

Return type:

dict

property files: list

List of HDF5 files which the Dataset is based on. Note that these files do not have to actually exist; but they will be written if any of the writing functions is called. Change the file names and directories using change_filenames, or direct editing of the shot table (discouraged)

Return type:

list

classmethod from_files(files, open_stacks=True, chunking='hdf5', persist_meta=True, init_stacks=False, load_tables=True, diff_stack_label='raw_counts', validate_files=False, unique_features=True, **kwargs)[source]

Create a Dataset object from HDF5 file(s) stored on disk.

There is some flexibility with regards to how to define the input files. You can specify them by

  • a .lst file name, which contains a simple list of H5 files (on separate lines). If the .lst file has CrystFEL-style event indicators in it, it will be loaded, and the events present in the list will be selected, the others not.

  • a glob pattern (like: ‘data/*.h5’)

  • a python iterable of files.

  • a simple HDF5 file path

In any case, the shot list and feature list are loaded to memory. Using the arguments you can specify what should happen to the stacks.
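The four kinds of file specification listed above could be told apart roughly as follows. This is a standard-library sketch under stated assumptions, not the logic from_files actually uses: resolve_files is a hypothetical helper, and the list file contents (including the event indicators after the file names) are made up for the demonstration.

```python
import glob
import os
import tempfile


def resolve_files(spec):
    """Hypothetical helper: turn a file specification into a list of names."""
    if not isinstance(spec, str):
        return list(spec)                      # python iterable of files
    if spec.endswith(".lst"):
        with open(spec) as fh:
            # Keep only the file name; anything after it (e.g. an event
            # indicator) is ignored in this sketch.
            names = [line.split()[0] for line in fh if line.strip()]
        return list(dict.fromkeys(names))      # unique, order preserved
    if any(ch in spec for ch in "*?["):
        return sorted(glob.glob(spec))         # glob pattern
    return [spec]                              # single HDF5 file path


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.h5", "b.h5"):
        open(os.path.join(tmp, name), "w").close()
    lst = os.path.join(tmp, "files.lst")
    with open(lst, "w") as fh:
        fh.write("a.h5 //0\na.h5 //1\nb.h5 //0\n")

    from_glob = [os.path.basename(f)
                 for f in resolve_files(os.path.join(tmp, "*.h5"))]
    from_lst = resolve_files(lst)
    from_iterable = resolve_files(("a.h5", "b.h5"))
    from_single = resolve_files("a.h5")

print(from_glob, from_lst, from_iterable, from_single)
```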

Parameters:
  • files (Union[list, str, tuple]) – File specification as described above.

  • open_stacks (bool, optional) – Open the data stacks. This means that open handles to the HDF5 files (in readonly mode) are kept within the Dataset object. Defaults to True.

  • chunking (Union[int, str], optional) – See documentation of open_stacks. Defaults to ‘hdf5’, that is, look up a recommended value in the HDF5 file.

  • persist_meta (bool, optional) – Right away persists the data stacks, that is, loads the actual data into memory instead of just holding references to the HDF5 files. Diffraction data (identified by 3D stacks) is automatically excluded. Defaults to True.

  • init_stacks (bool, optional) – Initialize stacks, that is, briefly open the data stacks, check their lengths, and close the files again. Viable option if you need/want to set open_stacks=False for some reason. Defaults to False.

  • load_tables (bool, optional) – Also load peaks and prediction tables from the HDF5 files. Defaults to True (will likely be changed to False).

  • diff_stack_label (str, optional) – Label of the diffraction data stack. Defaults to ‘raw_counts’.

  • validate_files (bool, optional) – Validate the HDF5 files (that is, check for required groups and datasets) before attempting to open them. Defaults to False.

  • unique_features (bool, optional) – Only keeps one copy of each feature/crystal in the feature table, if region, sample name, and crystal ID match. Set to False, if you took multiple runs from the same region with different features, e.g. for non-feature-matched multi-tilt serial ED. Defaults to True.

  • **kwargs – Dataset attributes to be set right away.

Returns:

new Dataset object read from files

Return type:

Dataset

classmethod from_list(files, open_stacks=True, chunking='hdf5', persist_meta=True, init_stacks=False, load_tables=True, diff_stack_label='raw_counts', validate_files=False, unique_features=True, **kwargs)

Create a Dataset object from HDF5 file(s) stored on disk.

There is some flexibility with regards to how to define the input files. You can specify them by

  • a .lst file name, which contains a simple list of H5 files (on separate lines). If the .lst file has CrystFEL-style event indicators in it, it will be loaded, and the events present in the list will be selected, the others not.

  • a glob pattern (like: ‘data/*.h5’)

  • a python iterable of files.

  • a simple HDF5 file path

In any case, the shot list and feature list are loaded to memory. Using the arguments you can specify what should happen to the stacks.

Parameters:
  • files (Union[list, str, tuple]) – File specification as described above.

  • open_stacks (bool, optional) – Open the data stacks. This means that open handles to the HDF5 files (in readonly mode) are kept within the Dataset object. Defaults to True.

  • chunking (Union[int, str], optional) – See documentation of open_stacks. Defaults to ‘hdf5’, that is, look up a recommended value in the HDF5 file.

  • persist_meta (bool, optional) – Right away persists the data stacks, that is, loads the actual data into memory instead of just holding references to the HDF5 files. Diffraction data (identified by 3D stacks) is automatically excluded. Defaults to True.

  • init_stacks (bool, optional) – Initialize stacks, that is, briefly open the data stacks, check their lengths, and close the files again. Viable option if you need/want to set open_stacks=False for some reason. Defaults to False.

  • load_tables (bool, optional) – Also load peaks and prediction tables from the HDF5 files. Defaults to True (will likely be changed to False).

  • diff_stack_label (str, optional) – Label of the diffraction data stack. Defaults to ‘raw_counts’.

  • validate_files (bool, optional) – Validate the HDF5 files (that is, check for required groups and datasets) before attempting to open them. Defaults to False.

  • unique_features (bool, optional) – Only keeps one copy of each feature/crystal in the feature table, if region, sample name, and crystal ID match. Set to False, if you took multiple runs from the same region with different features, e.g. for non-feature-matched multi-tilt serial ED. Defaults to True.

  • **kwargs – Dataset attributes to be set right away.

Returns:

new Dataset object read from files

Return type:

Dataset

get_indexing_solution(stream, sol_file, legacy=False, det_shift=None, beam_center=None, pixel_size=1, img_size=(0, 0))[source]

Writes a .sol file containing an indexing solution from a stream file that has been generated using this dataset, or another which holds patterns from the same set of crystals. This is identified by the shot table columns [sample, region, crystal_id, run] being identical.

Typically, you will want to use this function when “broadcasting” the indexing results you’ve obtained with one aggregation of a dose-fractionation movie to another aggregation, or even the dataset containing all the single shots.

NB: If you just want to simply generate a .sol file from a .stream, keeping all file and event identifiers, you might rather want to use the sol2stream command line tool, which is faster and simpler.

Parameters:
  • stream (Union[str, StreamParser]) – Stream file holding the indexing solution

  • sol_file (str) – Output solution file

  • legacy (bool, optional) – Writes .sol file compatible with older electron-adapted CrystFEL versions, where the .sol file does not contain cell information. Defaults to False.

  • det_shift (list, optional) – List of stream file extra header field names holding a detector shift in mm to be added on top of that found by the indexer. Defaults to None.

  • beam_center (list, optional) – List of stream file extra header field names holding the beam center in pixels to be added on top of that found by the indexer. Defaults to None.

  • pixel_size (float, optional) – Required if using beam_center. Defaults to 1.

  • img_size (Union[Tuple, List], optional) – (x, y) size of images in pixels. Required if using beam_center. Defaults to (0,0).

get_map(file, subset='entry')[source]

get_meta(path='/%/instrument/detector/collection/shutter_time')[source]

Gets an instrument metadata field in NeXus format from the HDF5 files.

As those metadata are per-file and not per-shot, a series is returned which then can be joined into the dataset manually. If you want to have this done automatically, use merge_meta instead.

Parameters:

path (str, optional) – Path to metadata to be grabbed. Can include CrystFEL-style % placeholder. Defaults to ‘/%/instrument/detector/collection/shutter_time’.

Returns:

pandas Series holding the metadata for each file.

Return type:

pd.Series

get_random_subset(N=10, seed=None)[source]

Returns a randomized subset of the dataset containing N shots.

Parameters:
  • N (int, optional) – Sample size. Defaults to 10.

  • seed (int, optional) – If not None, seeds the random number generator with this number, which makes the subset reproducible across calls. Defaults to None.

Returns:

random subset of this dataset.

Return type:

Dataset
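The effect of the seed argument can be illustrated with the standard library. The function below is a hypothetical stand-in that only picks shot indices; how diffractem draws its subset internally is not stated here and may differ.

```python
import random


def random_shot_indices(n_shots, N=10, seed=None):
    """Hypothetical helper: draw N distinct shot indices out of n_shots."""
    rng = random.Random(seed)   # seed=None gives a different draw every time
    return sorted(rng.sample(range(n_shots), N))


first = random_shot_indices(1000, N=10, seed=42)
second = random_shot_indices(1000, N=10, seed=42)
print(first == second)   # the same seed reproduces the same subset
```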

get_selection(query=None, file_suffix='_sel.h5', file_prefix='', new_folder=None, reset_id=True)[source]

Returns a new dataset object by applying a selection.

By default, returns a new Dataset object, including all shots with selected == True in the current shot list. Optionally, a different query string can be supplied (which leaves the selection unaffected). The file names of the new data set will be changed, to avoid collisions. This can be controlled with the file_suffix and file_prefix parameters. Otherwise, the returned dataset will include everything from the existing one.

Parameters:
  • query (Union[str, None], optional) – Optional query string, as in the select method. Defaults to None, that is, use the selected column in the shot list.

  • file_suffix (Optional[str], optional) – as in change_filenames. Defaults to ‘_sel.h5’.

  • file_prefix (str, optional) – as in change_filenames. Defaults to ‘’.

  • new_folder (Union[str, None], optional) – as in change_filenames. Defaults to None.

  • reset_id (bool, optional) – reset the shot_in_subset numbering of the new dataset (see reset_id). Defaults to True.

Returns:

New dataset with all the same attributes, but containing only the desired sub-selection of shots.

Return type:

Dataset

init_files(overwrite=False, keep_features=False, exclude_list=())[source]

Initialize set of HDF5 files to store the Dataset.

Makes new files corresponding to the shot list, by creating the files with the basic structure, and copying over instrument metadata and maps (but not shot list, data arrays,…) from the raw files (as stored in file_raw).

Parameters:
  • overwrite (bool, optional) – Overwrite files if existing already. Defaults to False.

  • keep_features (bool, optional) – Copy over the (full) feature list. Usually not required, as it will be later stored using store_stacks. Defaults to False.

  • exclude_list (tuple, optional) – Custom list of HDF5 groups or datasets to exclude from copying. Please consult documentation of nexus.copy_h5 for help. Defaults to ().

init_shot_table(files, stack_label='raw_counts')[source]

init_stacks(**kwargs)[source]

Opens files briefly in readonly mode to check stack names, shapes, etc., and closes them again right away.

Parameters:

**kwargs – any arguments are passed to open_stacks

instrument_pattern: str

Path to instrument metadata in HDF5 files. % can be used as placeholder (as in CrystFEL). Default /%/instrument

load_tables(shots=False, features=False, files=None, unique_features=False)[source]

Load pandas metadata tables from the HDF5 files. Set the argument for the table you want to load to True.

Parameters:
  • shots (bool, optional) – Get shot table. Defaults to False.

  • features (bool, optional) – Get feature table. Defaults to False.

  • files (bool, optional) – Only include sub selection of files - usually not a good idea. Uses all files of dataset if None. Defaults to None.

  • unique_features (bool, optional) – only keep one copy of each feature, if crystal ID, region and sample match. Set to False if you took multiple runs on the same regions with different features. Defaults to False.

map_pattern: str

Path to map and feature data in HDF5 files. % can be used as placeholder (as in CrystFEL). Default /%/map

merge_meta(path='%/instrument/detector/collection/shutter_time')[source]

Gets an instrument metadata field in NeXus format from the HDF5 files, and merges it into the shot table of the data set.

Note, that the name of the new column in the shot table will correspond to the HDF5 dataset name, ignoring the group (as included in the full path). E.g., for the default value, it will be just ‘shutter_time’.

Parameters:

path (str, optional) – Path to metadata to be grabbed. Can include CrystFEL-style % placeholder. Defaults to ‘%/instrument/detector/collection/shutter_time’.

merge_pattern_info(ds_from, merge_cols=None, by=('sample', 'region', 'run', 'crystal_id'), persist=True)[source]

Merge shot-table and CXI peak data from another data set into this one, based on matching of the shot table columns specified in “by”. Default is (‘sample’, ‘region’, ‘run’, ‘crystal_id’), which matches the shot information based on individual crystals.

The typical application of this function is to take over diffraction pattern information such as pattern center and peak positions from an aggregated data set (where each pattern corresponds to exactly one shot) to a full data set (where each pattern often corresponds to many shots, such as frames of a diffraction movie).

In this case you’d call the method like: ds_all.merge_pattern_info(ds_agg), where ds_all is the full data set and ds_agg is the aggregated data set to get the information from.

Parameters:
  • ds_from (Union[Dataset, str]) – Diffractem Dataset to take information from, or filename of h5 or list file. Especially friendly for h5 files written by get_image_info.

  • merge_cols (Optional[List[str]], optional) – Shot table columns to take over from other data set. If None, all columns are taken over which are not present in the shot table currently. Defaults to None.

  • by (Union[List[str], Tuple[str]], optional) – Shot table columns to match by. Defaults to (‘sample’, ‘region’, ‘run’, ‘crystal_id’).

  • persist (bool, optional) – Persist the merged CXI peak data to memory. Defaults to True.
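The shot-table part of this merge resembles a many-to-one join in pandas. The following is a minimal sketch with made-up tables and column names (center_x, num_peaks); it does not cover the CXI peak stacks and is not diffractem's own code.

```python
import pandas as pd

by = ["sample", "region", "run", "crystal_id"]

# Hypothetical full data set: two crystals, two movie frames each.
shots_all = pd.DataFrame({
    "sample": ["s1"] * 4,
    "region": [0] * 4,
    "run": [0] * 4,
    "crystal_id": [1, 1, 2, 2],
    "frame": [0, 1, 0, 1],
})

# Hypothetical aggregated data set: one row per crystal, with pattern info.
shots_agg = pd.DataFrame({
    "sample": ["s1"] * 2,
    "region": [0] * 2,
    "run": [0] * 2,
    "crystal_id": [1, 2],
    "center_x": [512.3, 498.7],
    "num_peaks": [41, 17],
})

# merge_cols=None: take over every column not yet present in the target.
merge_cols = [c for c in shots_agg.columns if c not in shots_all.columns]

# Each shot of the full set picks up the info of its crystal.
merged = shots_all.merge(shots_agg[by + merge_cols], on=by, how="left",
                         validate="many_to_one")
print(merged)
```

The validate argument makes pandas raise an error if the key columns are not unique in the aggregated table, which is a cheap sanity check for this kind of broadcast.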

merge_stream(streamfile)[source]

Loads a CrystFEL stream file and merges its contents into the dataset.

Parameters:

streamfile (Union[StreamParser, str]) – stream file name, or StreamParser object.

open_stacks(labels=None, checklen=True, init=False, readonly=True, swmr=False, chunking='dataset')[source]

Opens data stacks from HDF5 (NeXus) files (found by the “data_pattern” attribute), and assigns dask array objects to them. After opening, the arrays or parts of them can be accessed through the stacks attribute, or directly using a dataset.stack syntax, and loaded using the .compute() or .persist() method of the arrays.

A critical point here is how the chunking of the dask arrays is done. Especially for the initial opening of raw data, this is crucial (as in: orders of magnitude) for the performance of downstream tasks. You have several options, listed here in decreasing order of recommendation:

  • ‘dataset’ to use what is set in the current dataset zchunks property (default). This will not work for a fresh dataset, in which case you have to specify it from scratch.

  • ‘hdf5’ to use the chunksize recommended in the HDF5 file (‘recommended_zchunks’ attribute) of the data stacks group.

  • an integer number for a defined (approximate) chunk size, which ignores shots with frame number < -1. This means that after a get_selection command, or anything else that filters out dummy shots, equal chunk sizes are achieved. This is the recommended way of chunking for totally from-scratch datasets which don’t yet have the recommended_zchunks attribute set. Something of the order of 10 is often a good choice if you want to work with the set as is; if you want to aggregate early on, choose something bigger (rather 100). If your dataset comprises diffraction movies, this should be an integer multiple of the number of frames within each.

  • an iterable to explicitly set the chunk sizes

  • ‘existing’ to use the chunking of an already-existing stack which is about to be overwritten. Should usually be the same as ‘dataset’, but still works if your stacks have inconsistent chunking.

  • ‘auto’ to use the dask automatic mode, with inevitably sub-optimal results.
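Why the chunk height should be an integer multiple of the number of frames per movie can be seen with a little arithmetic. The sketch below builds an explicit tuple of chunk sizes (as one might pass for the iterable option) and checks whether any movie is split across chunks. The helper names are hypothetical, and the handling of dummy shots described above is deliberately left out.

```python
from itertools import accumulate


def chunk_sizes(n_shots, chunk_height):
    """Hypothetical helper: chunk sizes along the first (shot) dimension."""
    full, rest = divmod(n_shots, chunk_height)
    return (chunk_height,) * full + ((rest,) if rest else ())


def movies_intact(chunks, frames_per_movie):
    """True if every chunk boundary coincides with the end of a movie."""
    return all(b % frames_per_movie == 0 for b in accumulate(chunks))


frames_per_movie = 5
good = chunk_sizes(n_shots=25, chunk_height=10)   # multiple of 5
bad = chunk_sizes(n_shots=25, chunk_height=8)     # not a multiple of 5

print(good, movies_intact(good, frames_per_movie))
print(bad, movies_intact(bad, frames_per_movie))
```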

Parameters:
  • labels (Union[None, list], optional) – List of stacks to open. To open all stacks, set to None. Defaults to None.

  • checklen (bool, optional) – check if the stack heights (first dimension) are equal to the shot list length. Defaults to True.

  • init (bool, optional) – do not load stacks, just make empty dask arrays. Defaults to False.

  • readonly (bool, optional) – open HDF5 files in read-only mode. Defaults to True.

  • swmr (bool, optional) – open HDF5 files in SWMR mode. Defaults to False.

  • chunking (Union[int, str, list, tuple], optional) – Chunking specification, using one of the options described above. Defaults to ‘dataset’.

parallel_io: bool

Toggles if parallel I/O is attempted for datasets spanning many files. Note that this is independent of dask.distributed-based parallelization as in store_stack_fast. Default True, which is overridden if the Dataset comprises a single file only.

property peak_data: Dict[str, Array]

Stored Bragg reflection data in CXI format, if present. Otherwise raises error.

Return type:

Dict[str, Array]

property peaks: DataFrame

List of found diffraction peaks. Deprecated. Please store peaks in CXI-format stacks. Note that peak positions in this table must follow CrystFEL convention, that is, integer numbers specify the pixel edges, not centers. This is in contrast to CXI convention, where integer numbers correspond to pixel centers.

Return type:

DataFrame
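The two conventions mentioned above differ by half a pixel. As a sketch, assuming both conventions count from the same first pixel: its center sits at 0 in CXI convention (integers are centers) and at 0.5 in CrystFEL convention (integers are edges). The helper names are hypothetical.

```python
def cxi_to_crystfel(coord):
    """Pixel-center convention -> pixel-edge convention."""
    return coord + 0.5


def crystfel_to_cxi(coord):
    """Pixel-edge convention -> pixel-center convention."""
    return coord - 0.5


center_of_first_pixel = cxi_to_crystfel(0.0)
round_trip = crystfel_to_cxi(cxi_to_crystfel(10.25))
print(center_of_first_pixel, round_trip)
```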

persist_stacks(labels=None, exclude=None, include_3d=False, scheduler='threading')[source]

Persist the stacks to memory (locally and/or on the cluster workers), that is, they are computed, but not converted to numpy arrays; they remain dask arrays that are immediately available, without an actual task graph. It is recommended to have as many stacks persisted as possible. The diffraction data stack is automatically excluded, as are any 3D arrays (by default).

Note

There are important subtleties about which dask scheduler to use here. If you have a dask.distributed cluster running (and you often will), the underlying dask.persist() function, if called without parameters, will compute and persist the data on the workers of the cluster, not the local machine. For our typical applications (making access to small meta stacks faster and less error-prone), that’s the wrong choice. Hence, scheduler=‘threading’ by default (you might as well use ‘single-threaded’). However, there might be cases where persisting on the workers makes sense; in that case, just set the scheduler argument to your client object.

Parameters:
  • labels (Union[None, str, list], optional) – Labels of stacks to persist (None: all except for the one set in diff_stack_label). Defaults to None.

  • exclude (Union[None, str, list], optional) – Stacks to exclude. Defaults to None.

  • include_3d (bool, optional) – Include 3D stacks. Defaults to False.

  • scheduler (Union[str, Client], optional) – What scheduler to use. Defaults to ‘threading’.

property predict: DataFrame

List of predictions. Deprecated. Please store predictions in StreamParser objects.

Return type:

DataFrame

rechunk_stacks(chunk_height)[source]

reset_id(keep_raw=True)[source]

Resets shot_in_subset and Event columns to continuous numbering. Useful after dataset reduction. The old Event strings are copied to an “Event_raw” column, if not already present (can be overridden with keep_raw).

Parameters:

keep_raw (bool, optional) – if True (default), does not change the Event_raw column in the shot list, unless there is none yet (in which case the old Event IDs are always copied to Event_raw)

result_pattern: str

Path to result data (peaks, predictions) in HDF5 files. % can be used as placeholder (as in CrystFEL). Default /%/results. Note that storing results in this way is discouraged and deprecated.

select(query='True')[source]

Sets the ‘selected’ column of the shot list by a string query (e.g. ‘num_peaks > 30 and frame == 1’). See pandas documentation for ‘query’ and ‘eval’. If you want to add another criterion to the existing selection, you can also use something like ‘selected and hit == 1’.

Parameters:

query (str) – Query string. If left empty, defaults to ‘True’, which selects all shots.
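The query strings follow the pandas expression syntax, so their effect can be tried out on any DataFrame. The sketch below uses a made-up shot table and plain pandas eval and query calls; it is meant to illustrate the syntax, not to show diffractem's internals.

```python
import pandas as pd

# Hypothetical shot table with a few of the columns used in the examples.
shots = pd.DataFrame({
    "num_peaks": [12, 45, 60, 31],
    "frame": [1, 1, 0, 1],
    "hit": [0, 1, 1, 0],
})

# Set the selection from scratch.
shots["selected"] = shots.eval("num_peaks > 30 and frame == 1")
first_selection = shots["selected"].tolist()

# Narrow down the existing selection with another criterion.
shots["selected"] = shots.eval("selected and hit == 1")
second_selection = shots["selected"].tolist()

# Rows that are currently selected.
subset = shots.query("selected")
print(first_selection, second_selection, subset.index.tolist())
```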

property shots: DataFrame

Shot list. Can be overwritten only if index and ID columns of the shots are identical to the existing one.

Return type:

DataFrame

shots_pattern: str

Path to shot table data in HDF5 files. % can be used as placeholder (as in CrystFEL). Default /%/shots

property stacks: dict

Dictionary of data stacks of the Dataset.

Return type:

dict

stacks_to_shots(stack_labels, shot_labels=None)[source]

store_stack_fast(label=None, client=None, sync=True, compression=32004)[source]

Store (and compute) a single stack to HDF5 file(s), using a dask.distributed cluster.

This allows for proper parallel computation (on single or many machines) and is wa(aaa)y faster than the standard store_stacks, which only works with threads. Typically, you’ll want to use this method to store a processed diffraction data stack.

Note

If the stack to be stored depends on computationally heavy (but memory-fitting) dask arrays which you want to retain outside this computation (e.g. to store them using store_stacks), make sure they are persisted before calling this function. Otherwise, they will be re-calculated from scratch.

Parameters:
  • label (Optional[str]) – Label of the stack to be computed and stored. If None, use the value stored in diff_stack_label. Defaults to None

  • client (Optional[Client], optional) – dask.distributed client connected to a cluster to perform the computation on. Defaults to None.

  • sync (bool, optional) – if True (default), computes and stores immediately, and returns a pandas dataframe containing metadata of everything stored, for validation. If False, returns a list of dask.delayed objects which encapsulate the computation/storage. Defaults to True.

  • compression (Union[int, str], optional) – HDF5 compression filter to use. Common choices are ‘gzip’, ‘none’, or 32004, which is the lz4 filter often used for diffraction data. Defaults to 32004.

Returns:

pandas DataFrame holding ID columns of the computed shots. They can be merged with the shot list to cross-check if everything went ok. If sync=False, a list of futures to tuples (file, subset, path, idcs) for each dask array chunk is returned instead.

Return type:

pd.DataFrame
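The cross-check mentioned under Returns could look roughly like the following pandas sketch. The ID column names (‘file’, ‘Event’) and all values are assumptions made for illustration; the data frame called stored stands in for the return value of store_stack_fast.

```python
import pandas as pd

# Hypothetical shot list (ID columns only).
shots = pd.DataFrame({
    "file": ["a.h5", "a.h5", "b.h5"],
    "Event": ["entry//0", "entry//1", "entry//0"],
})

# Stand-in for the returned metadata; one shot is deliberately missing.
stored = pd.DataFrame({
    "file": ["a.h5", "b.h5"],
    "Event": ["entry//0", "entry//0"],
})

# A left merge with indicator=True marks shots that have no stored counterpart.
check = shots.merge(stored, on=["file", "Event"], how="left", indicator=True)
missing = check[check["_merge"] == "left_only"]

print(missing[["file", "Event"]])
```

An empty missing table would indicate that every shot in the list was written.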

store_stacks(labels=None, exclude=None, overwrite=False, compression=32004, lazy=False, data_pattern=None, progress_bar=True, scheduler='threading', **kwargs)[source]

Stores stacks with given labels to the HDF5 data files. For stacks that are not persisted, the actual computation is done at this point.

Note

This way of computing and storing data is restricted to threading (which does not help much) or single-threaded computation, i.e. it’s not recommended for heavy lifting, like computing corrected/aggregated/modified diffraction patterns. In this case, better use true parallelism provided by store_stack_fast, which uses dask.distributed for scheduling.

Parameters:
  • labels (Union[None, str, list], optional) – Stacks to be written. If None, write all stacks, including the diffraction data stack. Defaults to None.

  • exclude (Union[None, str, list], optional) – Stacks to exclude. It might be wise to set the diffraction data stack here. Defaults to None.

  • overwrite (bool, optional) – Overwrite existing stacks (HDF5 datasets) in the files. Defaults to False.

  • compression (Union[str, int], optional) – HDF5 compression filter to use. Common choices are ‘gzip’, ‘none’, or 32004, which is the lz4 filter often used for diffraction data. Defaults to 32004.

  • lazy (bool, optional) – Instead of computing and storing the arrays, return a list of dask arrays and HDF5 data sets, which can be inserted into dask.array.store. Defaults to False.

  • data_pattern (Union[None,str], optional) – store stacks to this data path (% is replaced by subset) instead of standard data path if not None. Note that stacks stored this way will not be retrievable through Dataset objects. Defaults to None.

  • progress_bar (bool, optional) – show a progress bar during calculation/storing. To prevent a mess, disable if you’re running store_stacks in multiple processes simultaneously. Defaults to True.

  • scheduler (str, optional) – dask scheduler to be used. Can be ‘threading’ or ‘single-threaded’. It is not possible to use ‘multiprocessing’ due to conflicting access to HDF5 files. (If you want true parallel computation, you have to use store_stack_fast instead.) Defaults to ‘threading’.

  • **kwargs – Will be forwarded to h5py.create_dataset

Returns:

None if lazy=False. If lazy=True, dask arrays and HDF5 datasets (da.Array, h5py.Dataset) to pass to dask.array.store.

store_tables(shots=None, features=None)[source]

Stores the metadata tables (shots, features) into HDF5 files.

For each of the tables, it can be automatically determined whether they have changed and should be stored. However, this only works if no in-place changes have been made, so don’t rely on it too much. If you want this, leave the argument at None. Otherwise, explicitly specify True or False (strongly recommended).

Parameters:
  • shots (Union[None, bool], optional) – Store shot table. Defaults to None.

  • features (Union[None, bool], optional) – Store feature table. Defaults to None.

transform_stack_groups(stacks, func=<function Dataset.<lambda>>, by=('sample', 'region', 'run', 'crystal_id'))[source]

For all data stacks listed in stacks, transforms the sub-stacks within the groups defined by the by argument, using the function given in func.

The dimensions of each sub-stack must not change in the process. Note that, unlike for get_selection or aggregate, this happens in place, i.e., the stacks will be overwritten by a transformed version! If this is not what you want, first make a copy of your data set, using copy.

A typical application is to calculate a cumulative sum of patterns within each diffraction movie. This is what the default parameters for by and func do. Many other transformations are possible, e.g. directly calculating the difference between frames, the difference of each frame with respect to the first, normalizing them to something, etc.

Parameters:
  • stacks (Union[List[str], str]) – Name(s) of data stacks to be transformed

  • func (Callable[[ndarray], ndarray]) – Function applied to each sub-stack. Must act on a numpy array and return one of the same dimensions. Defaults to lambda x: np.cumsum(x, axis=0).

  • by (Union[List[str], Tuple[str]]) – Shot table columns to identify groups - similar to how it’s done in aggregate. Defaults to (‘sample’, ‘region’, ‘run’, ‘crystal_id’).
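The group-wise transformation can be illustrated with an eager NumPy/pandas sketch. It is not how diffractem operates internally (which works on dask stacks); it only shows what “apply func to each sub-stack within a group” means. The shot table, the stack contents, and the single grouping column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical shot table: two "movies" with 3 and 2 frames.
shots = pd.DataFrame({
    "crystal_id": [0, 0, 0, 1, 1],
    "frame": [0, 1, 2, 0, 1],
})

# One 2x2 pattern per shot, all ones, so that sums are easy to follow.
stack = np.ones((5, 2, 2))


def func(x):
    """Cumulative sum along the stacked axis, like the documented default."""
    return np.cumsum(x, axis=0)


transformed = np.empty_like(stack)
# .indices maps each group key to the row positions belonging to that group.
for key, idx in shots.groupby("crystal_id").indices.items():
    sub = func(stack[idx])
    # The sub-stack dimensions must not change.
    assert sub.shape == stack[idx].shape
    transformed[idx] = sub

print(transformed[:, 0, 0])
```

The cumulative sum restarts at the first frame of each group, so the second movie is not affected by the first.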

update_det_shift(opt_file='preproc.yaml', panel='p0')[source]

Updates the lab-frame detector shift in the shot table, as required by CrystFEL to account for a varying direct beam position. As the dataset object has no idea about the lab-frame geometry, you’ll need to supply it, either from a diffractem options file (.yaml) or a CrystFEL geometry file (.geom). Detector distortions are also accounted for here. The column names in the shot tables are automatically determined from the options/geometry file.

Note that this method does not automatically store the shot table afterwards - to do so, run ds.store_tables(shots=True) right afterwards.

Parameters:
  • opt_file (str, optional) – Options file name. Can be a diffractem PreProcOpts file (.yaml) or a CrystFEL geometry file (.geom) - as determined by the file extension. Defaults to ‘preproc.yaml’.

  • panel (str, optional) – Label of panel in CrystFEL geometry file to which the center coordinates of the dataset refer. Defaults to p0.

view(shot=0, Imax=30, log=False)[source]

Interactive viewing widget for use in Jupyter notebooks.

Parameters:
  • shot (int, optional) – Shot number to show initially. Defaults to 0.

  • Imax (int, optional) – Maximum intensity to be shown initially. Defaults to 30.

  • log (bool, optional) – Toggles initial logarithmic display. Defaults to False.

write_list(listfile, append=False)[source]

Writes the files in the dataset into a list file, with one file per line.

Parameters:

listfile (str) – list file name

write_virtual_file(filename='virtual', diff_stack_label='zero_image', virtual_size=1024)[source]

Generates a virtual HDF5 file containing the metadata of the dataset, but not the actual diffraction data. Instead of the diffraction stack, a virtual dummy stack is created that does not actually contain data.

The peak positions in the virtual file are changed, such that they refer to a “virtual” geometry, corresponding to a square detector with a size given by virtual_size. On this detector, the pattern is centered.
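The re-centering idea can be sketched as a simple coordinate shift. This assumes that a peak keeps its offset from the beam center and that the center is placed at virtual_size / 2 in both directions; the exact pixel conventions used by diffractem may differ, and the helper to_virtual is hypothetical.

```python
def to_virtual(x, y, center_x, center_y, virtual_size=1024):
    """Shift a peak position so that the beam center lands in the middle
    of a square virtual detector (assumed convention, see text above)."""
    half = virtual_size / 2
    return x - center_x + half, y - center_y + half


# Made-up peak position and beam center on the real detector.
vx, vy = to_virtual(x=800.0, y=300.0, center_x=780.5, center_y=260.0)
print(vx, vy)
```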

Note that this functionality is mostly deprecated in favor of using the data files directly, or the image info file generated by proc2d.get_pattern_info.

Parameters:
  • filename (str) – [description]

  • diff_stack_label (str) – [description]

  • virtual_size (int, optional) – [description]. Defaults to 1024.

property zchunks: tuple

Chunks of dask arrays holding the stacks along their first (that is, stacked) axis.

Return type:

tuple