disdrodb.l0 package

Subpackages

disdrodb.l0.readers package

Submodules

disdrodb.l0.l0a_processing module

Functions to process raw text files into DISDRODB L0A Apache Parquet.

disdrodb.l0.l0a_processing.cast_column_dtypes(df: DataFrame, sensor_name: str, verbose: bool = False) → DataFrame[source]

Convert ‘object’ dataframe columns into DISDRODB L0A dtype standards.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe with corrected columns types.

Return type:

pd.DataFrame

disdrodb.l0.l0a_processing.coerce_corrupted_values_to_nan(df: DataFrame, sensor_name: str, verbose: bool = False) → DataFrame[source]

Coerce corrupted values in dataframe numeric columns to np.nan.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe whose numeric columns have corrupted values replaced by np.nan.

Return type:

pd.DataFrame
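
The coercion idea can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the helper name coerce_to_float_or_nan and the sample values are hypothetical, and plain lists stand in for dataframe columns. In pandas, a comparable effect is available through pandas.to_numeric(..., errors="coerce").

```python
import math


def coerce_to_float_or_nan(values):
    """Convert each raw string to float; unparseable entries become NaN."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            # Corrupted or empty field: keep the row, but mark the value as missing.
            cleaned.append(math.nan)
    return cleaned


# Hypothetical raw column read as strings from a text file.
raw_column = ["0.25", "1.5", "##corrupt##", "", "3"]
numeric_column = coerce_to_float_or_nan(raw_column)
```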

disdrodb.l0.l0a_processing.concatenate_dataframe(list_df: list, verbose: bool = False) → DataFrame[source]

Concatenate a list of dataframes.

Parameters:

list_df (list) – List of dataframes.
verbose (bool, optional) – If True, print messages. If False, print nothing.

Returns:

Concatenated dataframe.

Return type:

pd.DataFrame

Raises:

ValueError – If the concatenation cannot be done.

disdrodb.l0.l0a_processing.drop_time_periods(df, time_periods)[source]: Drop problematic time_period.

disdrodb.l0.l0a_processing.drop_timesteps(df, timesteps)[source]: Drop problematic time steps.

disdrodb.l0.l0a_processing.preprocess_reader_kwargs(reader_kwargs: dict) → dict[source]

Preprocess arguments required to read raw text file into Pandas.

Parameters:: reader_kwargs (dict) – Initial parameter dictionary.
Returns:: Parameter dictionary that matches either Pandas or Dask.
Return type:: dict

disdrodb.l0.l0a_processing.process_raw_file(filepath, column_names, reader_kwargs, df_sanitizer_fun, sensor_name, verbose=True, issue_dict={})[source]

Read and parse a raw text file into an L0A dataframe.

Parameters:

filepath (str) – File path
column_names (list) – Columns names.
reader_kwargs (dict) – Pandas read_csv arguments.
df_sanitizer_fun (object, optional) – Sanitizer function to format the dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information. The default is True.
issue_dict (dict) – Issue dictionary providing information on the timesteps to remove. The default is an empty dictionary {}. Valid issue_dict keys are ‘timesteps’ and ‘time_periods’. Valid issue_dict values are lists of datetime64 values (with second accuracy). To correctly format and check the validity of the issue_dict, use the disdrodb.l0.issue.check_issue_dict function.

Returns:

Dataframe

Return type:

pd.DataFrame

disdrodb.l0.l0a_processing.read_raw_data(filepath: str, column_names: list, reader_kwargs: dict) → DataFrame[source]

Read raw data into a dataframe.

Parameters:

filepath (str) – Raw file path.
column_names (list) – Column names.
reader_kwargs (dict) – Pandas pd.read_csv arguments.

Returns:

Pandas dataframe.

Return type:

pandas.DataFrame

disdrodb.l0.l0a_processing.read_raw_file_list(file_list: list | str, column_names: list, reader_kwargs: dict, sensor_name: str, verbose: bool, df_sanitizer_fun: object | None = None) → DataFrame[source]

Read and parse a list of raw files into a dataframe.

Parameters:

file_list (Union[list,str]) – File(s) path(s)
column_names (list) – Columns names.
reader_kwargs (dict) – Pandas read_csv arguments.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.
df_sanitizer_fun (object, optional) – Sanitizer function to format the dataframe.

Returns:

Dataframe

Return type:

pd.DataFrame

Raises:

ValueError – Input parameters can not be used or the raw file can not be processed.

disdrodb.l0.l0a_processing.remove_corrupted_rows(df)[source]

Remove corrupted rows by checking conversion of raw fields to numeric.

Note: The delimiters at the start and end of the raw arrays must be stripped beforehand.

disdrodb.l0.l0a_processing.remove_duplicated_timesteps(df: DataFrame, verbose: bool = False)[source]

Remove duplicated timesteps.

Only the first occurrence of each timestep is kept.

Parameters:

df (pd.DataFrame) – Input dataframe.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe with valid unique timesteps.

Return type:

pd.DataFrame
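
The "keep the first occurrence" rule can be sketched as follows. This is an illustrative, standard-library version: the helper keep_first_per_timestep, the rainfall_rate column, and the use of a list of dictionaries instead of a dataframe are all assumptions made for the example. In pandas, DataFrame.drop_duplicates(subset="time", keep="first") expresses the same rule.

```python
from datetime import datetime


def keep_first_per_timestep(rows):
    """Return the rows with unique "time" values, keeping the first occurrence."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["time"] not in seen:
            seen.add(row["time"])
            unique_rows.append(row)
    return unique_rows


# Hypothetical records; the third row repeats the first timestep.
rows = [
    {"time": datetime(2023, 1, 1, 0, 0, 0), "rainfall_rate": 0.1},
    {"time": datetime(2023, 1, 1, 0, 0, 30), "rainfall_rate": 0.2},
    {"time": datetime(2023, 1, 1, 0, 0, 0), "rainfall_rate": 9.9},
]
unique_rows = keep_first_per_timestep(rows)
```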

disdrodb.l0.l0a_processing.remove_issue_timesteps(df, issue_dict, verbose=False)[source]

Drop dataframe rows with timesteps listed in the issue dictionary.

Parameters:

df (pd.DataFrame) – Input dataframe.
issue_dict (dict) – Issue dictionary

Returns:

Dataframe with problematic timesteps removed.

Return type:

pd.DataFrame
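
A possible shape for an issue dictionary, and the filtering it implies, is sketched below. Only the two keys (‘timesteps’ and ‘time_periods’) come from this reference. The (start, end) pair layout of each period, the inclusive bounds, the is_flagged helper, and the use of datetime.datetime instead of datetime64 are assumptions for illustration.

```python
from datetime import datetime

issue_dict = {
    "timesteps": [datetime(2023, 1, 1, 0, 1, 0)],
    # Assumption: each period is a (start, end) pair with inclusive bounds.
    "time_periods": [
        (datetime(2023, 1, 1, 0, 3, 0), datetime(2023, 1, 1, 0, 4, 0)),
    ],
}


def is_flagged(time, issue_dict):
    """Return True if the time is listed as a timestep or falls within a period."""
    if time in issue_dict.get("timesteps", []):
        return True
    periods = issue_dict.get("time_periods", [])
    return any(start <= time <= end for start, end in periods)


# One sample per minute, from 00:00 to 00:05.
times = [datetime(2023, 1, 1, 0, minute, 0) for minute in range(6)]
kept_times = [t for t in times if not is_flagged(t, issue_dict)]
```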

disdrodb.l0.l0a_processing.remove_rows_with_missing_time(df: DataFrame, verbose: bool = False)[source]

Remove dataframe rows where the “time” is NaT.

Parameters:

df (pd.DataFrame) – Input dataframe.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe with valid timesteps.

Return type:

pd.DataFrame

disdrodb.l0.l0a_processing.replace_nan_flags(df, sensor_name, verbose)[source]

Set values corresponding to nan_flags to np.nan.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe without nan_flags values.

Return type:

pd.DataFrame

disdrodb.l0.l0a_processing.set_nan_outside_data_range(df, sensor_name, verbose)[source]

Set values outside the data range as np.nan.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe without values outside the expected data range.

Return type:

pd.DataFrame
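
The range check can be illustrated with a short sketch. The data_range dictionary, the air_temperature variable, the limits, and the inclusive bounds are hypothetical; in the library the valid ranges are looked up from the sensor name, which this example does not reproduce.

```python
import math

# Hypothetical valid range; the real ranges depend on the sensor configuration.
data_range = {"air_temperature": (-40.0, 70.0)}


def mask_outside_range(values, lower, upper):
    """Replace values outside [lower, upper] with NaN (bounds assumed inclusive)."""
    return [v if lower <= v <= upper else math.nan for v in values]


temperatures = [12.5, -99.0, 70.0, 250.0]
lower, upper = data_range["air_temperature"]
masked = mask_outside_range(temperatures, lower, upper)
```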

disdrodb.l0.l0a_processing.set_nan_unvalid_values(df, sensor_name, verbose)[source]

Set invalid (class) values to np.nan.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe without invalid values.

Return type:

pd.DataFrame

disdrodb.l0.l0a_processing.strip_delimiter_from_raw_arrays(df)[source]: Remove the first and last delimiter occurrence from the raw array fields.

disdrodb.l0.l0a_processing.strip_string_spaces(df: DataFrame, sensor_name: str, verbose: bool = False) → DataFrame[source]

Strip leading/trailing spaces from dataframe string columns.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataframe with string columns without leading/trailing spaces.

Return type:

pd.DataFrame

disdrodb.l0.l0a_processing.write_l0a(df: DataFrame, fpath: str, force: bool = False, verbose: bool = False)[source]

Save the dataframe into an Apache Parquet file.

Parameters:

df (pd.DataFrame) – Input dataframe.
fpath (str) – Output file path.
force (bool, optional) – Whether to overwrite existing data. If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.
verbose (bool, optional) – Whether to print detailed processing information. The default is False.

Raises:

ValueError – The input dataframe can not be written as an Apache Parquet file.
NotImplementedError – The input dataframe can not be processed.

disdrodb.l0.l0b_processing module

Functions to process DISDRODB L0A files into DISDRODB L0B netCDF files.

disdrodb.l0.l0b_processing.add_dataset_crs_coords(ds)[source]: Add the CRS coordinate to the xr.Dataset

disdrodb.l0.l0b_processing.add_dataset_missing_variables(ds, missing_vars, sensor_name)[source]: Add missing Dataset variables as nan DataArrays.

disdrodb.l0.l0b_processing.convert_object_variables_to_string(ds: Dataset) → Dataset[source]

Convert variables with object dtype to string.

Parameters:: ds (xr.Dataset) – Input dataset.
Returns:: Output dataset.
Return type:: xr.Dataset

disdrodb.l0.l0b_processing.create_l0b_from_l0a(df: DataFrame, attrs: dict, verbose: bool = False) → Dataset[source]

Transform the L0A dataframe to the L0B xr.Dataset.

Parameters:

df (pd.DataFrame) – DISDRODB L0A dataframe.
attrs (dict) – Station metadata.
verbose (bool, optional) – Whether to print detailed processing information. The default is False.

Returns:

DISDRODB L0B dataset.

Return type:

xr.Dataset

Raises:

ValueError – Error if the DISDRODB L0B xarray dataset can not be created.

disdrodb.l0.l0b_processing.format_string_array(string: str, n_values: int) → array[source]

Split a string with multiple numbers separated by a delimiter into a 1D array.

For example, format_string_array("2,44,22,33", 4) returns [ 2. 44. 22. 33.]

If the string is empty (""), an array of zeros is returned.
If the number of values is not n_values, an array of np.nan is returned.

The function strips potential delimiters at the start and end before splitting.

Parameters:

string (str) – Input string
n_values (int) – Expected length of the output array.

Returns:

array of float

Return type:

np.array
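
The documented behaviour can be mirrored in a dependency-free sketch. It is not the library function: format_string_list is a hypothetical name, it returns a plain list instead of a NumPy array, and the delimiter is passed explicitly rather than inferred from the string.

```python
import math


def format_string_list(string, n_values, delimiter=","):
    """Split a delimited string of numbers into a list of n_values floats."""
    if string == "":
        # Empty string: return zeros.
        return [0.0] * n_values
    # Strip delimiters at the start and end before splitting.
    parts = string.strip(delimiter).split(delimiter)
    if len(parts) != n_values:
        # Unexpected number of values: return NaN placeholders.
        return [math.nan] * n_values
    return [float(part) for part in parts]


values = format_string_list("2,44,22,33", 4)
```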

disdrodb.l0.l0b_processing.get_bin_coords(sensor_name: str) → dict[source]

Retrieve diameter (and velocity) bin coordinates.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with coordinate arrays.
Return type:: dict

disdrodb.l0.l0b_processing.infer_split_str(string: str) → str[source]

Infer the delimiter inside a string.

Parameters:: string (str) – Input string.
Returns:: Inferred delimiter.
Return type:: str

disdrodb.l0.l0b_processing.preprocess_raw_netcdf(ds, dict_names, sensor_name)[source]

Preprocess a raw netCDF to improve compatibility with DISDRODB standards.

This function checks the validity of dict_names, then renames and subsets the data accordingly. If some variables specified in dict_names are missing, it adds them as NaN DataArrays.

Parameters:

ds (xr.Dataset) – Raw netCDF to be converted to DISDRODB standards.
dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.
sensor_name (str) – Sensor name.

Returns:

ds – xarray Dataset with DISDRODB-compliant variable naming conventions.

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.process_raw_nc(filepath, dict_names, ds_sanitizer_fun, sensor_name, verbose, attrs)[source]

Read and convert a raw netCDF into a DISDRODB L0B netCDF.

Parameters:

filepath (str) – netCDF file path.
dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.
ds_sanitizer_fun (function) – Sanitizer function to do ad-hoc processing of the xr.Dataset.
attrs (dict) – Global metadata to attach as global attributes to the xr.Dataset.
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

L0B xr.Dataset

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.rechunk_dataset(ds: Dataset, encoding_dict: dict) → Dataset[source]

Coerce the dataset arrays to have the chunk size specified in the encoding dictionary.

Parameters:

ds (xr.Dataset) – Input xarray dataset
encoding_dict (dict) – Dictionary containing the encoding to write the xarray dataset as a netCDF.

Returns:

Output xarray dataset

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.rename_dataset(ds, dict_names)[source]: Rename Dataset variables, coordinates and dimensions.

disdrodb.l0.l0b_processing.replace_custom_nan_flags(ds, dict_nan_flags)[source]

Set values corresponding to nan_flags to np.nan.

Parameters:

ds (xr.Dataset) – Input xarray dataset
dict_nan_flags (dict) – Dictionary with nan flags value to set as np.nan

Returns:

Dataset without nan_flags values.

Return type:

xr.Dataset
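
A minimal sketch of flag replacement is given below. The layout of dict_nan_flags (variable name mapped to a list of flag values) is an assumption, as are the replace_flags helper, the variable names, and the use of a dictionary of lists in place of an xr.Dataset.

```python
import math

# Assumed layout: variable name -> list of values to be treated as missing.
dict_nan_flags = {"rainfall_rate": [-9999, 9999]}

# Hypothetical data; a dictionary of lists stands in for the dataset variables.
data = {
    "rainfall_rate": [0.4, -9999, 1.2, 9999],
    "n_particles": [10, 0, 3, 7],
}


def replace_flags(data, dict_nan_flags):
    """Return a copy of the data with flagged values replaced by NaN."""
    cleaned = {}
    for name, values in data.items():
        flags = dict_nan_flags.get(name, [])
        cleaned[name] = [math.nan if v in flags else v for v in values]
    return cleaned


cleaned = replace_flags(data, dict_nan_flags)
```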

disdrodb.l0.l0b_processing.replace_nan_flags(ds, sensor_name, verbose)[source]

Set values corresponding to nan_flags to np.nan.

Parameters:

ds (xr.Dataset) – Input xarray dataset
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataset without nan_flags values.

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.reshape_raw_spectrum(arr: array, dims_order: list, dims_size_dict: dict, n_timesteps: int) → array[source]

Reshape the raw spectrum to a 2D+time array.

The array has dimensions [“time”] + dims_order

Parameters:

arr (np.array) – Input array.
dims_order (list) –
The order of dimensions in the raw spectrum.

Examples:
- OTT Parsivel spectrum [v1d1 … v1d32, v2d1, …, v2d32] –> dims_order = ["diameter_bin_center", "velocity_bin_center"]
- Thies LPM spectrum [v1d1 … v20d1, v1d2, …, v20d2] –> dims_order = ["velocity_bin_center", "diameter_bin_center"]
dims_size_dict (dict) –
Dictionary with the number of bins for each dimension.

For OTT_Parsivel:

{"diameter_bin_center": 32,
"velocity_bin_center": 32}

For Thies_LPM:

{"diameter_bin_center": 22,
"velocity_bin_center": 20}
n_timesteps (int) – Number of timesteps.

Returns:

Output array.

Return type:

np.array

Raises:

ValueError – Impossible to reshape the raw_spectrum matrix

disdrodb.l0.l0b_processing.retrieve_l0b_arrays(df: DataFrame, sensor_name: str, verbose: bool = False) → dict[source]

Retrieves the L0B data matrix.

Parameters:

df (pd.DataFrame) – Input dataframe
sensor_name (str) – Name of the sensor

Returns:

Dictionary with data arrays.

Return type:

dict

disdrodb.l0.l0b_processing.sanitize_encodings_dict(encoding_dict: dict, ds: Dataset) → dict[source]

Ensure that the chunk size does not exceed the array shape.

Parameters:

encoding_dict (dict) – Dictionary containing the encoding to write DISDRODB L0B netCDFs.
ds (xr.Dataset) – Input dataset.

Returns:

Encoding dictionary.

Return type:

dict
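
The chunk-size check can be sketched without xarray. In this illustration the array shapes are supplied in a plain dictionary instead of being read from the dataset, the helper clip_chunksizes and the variable name are hypothetical, and the encoding is assumed to store its chunk sizes under the "chunksizes" key used by netCDF4 encodings.

```python
import copy


def clip_chunksizes(encoding_dict, shapes):
    """Return a copy of the encodings with chunk sizes clipped to the array shapes."""
    sanitized = copy.deepcopy(encoding_dict)
    for name, encoding in sanitized.items():
        chunksizes = encoding.get("chunksizes")
        if chunksizes is None or name not in shapes:
            continue
        # A chunk cannot be larger than the array along any dimension.
        encoding["chunksizes"] = tuple(
            min(chunk, size) for chunk, size in zip(chunksizes, shapes[name])
        )
    return sanitized


# Hypothetical encoding: 5000 timesteps per chunk, but only 120 are available.
encoding_dict = {"raw_drop_number": {"chunksizes": (5000, 32, 32), "zlib": True}}
shapes = {"raw_drop_number": (120, 32, 32)}
sanitized = clip_chunksizes(encoding_dict, shapes)
```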

disdrodb.l0.l0b_processing.set_coordinate_attributes(ds)[source]

disdrodb.l0.l0b_processing.set_dataset_attrs(ds, sensor_name)[source]: Set variable and coordinates attributes.

disdrodb.l0.l0b_processing.set_encodings(ds: Dataset, sensor_name: str) → Dataset[source]

Apply the encodings to the xarray Dataset.

Parameters:

ds (xr.Dataset) – Input xarray dataset.
sensor_name (str) – Name of the sensor.

Returns:

Output xarray dataset.

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.set_nan_outside_data_range(ds, sensor_name, verbose)[source]

Set values outside the data range as np.nan.

Parameters:

ds (xr.Dataset) – Input xarray dataset
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataset without values outside the expected data range.

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.set_nan_unvalid_values(ds, sensor_name, verbose)[source]

Set invalid (class) values to np.nan.

Parameters:

ds (xr.Dataset) – Input xarray dataset
sensor_name (str) – Name of the sensor.
verbose (bool) – Whether to print detailed processing information.

Returns:

Dataset without invalid values.

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.set_variable_attributes(ds: Dataset, sensor_name: str) → Dataset[source]

Set attributes to each xr.Dataset variable.

Parameters:

ds (xr.Dataset) – Input dataset.
sensor_name (str) – Name of the sensor.

Returns:

Output dataset.

Return type:

xr.Dataset

disdrodb.l0.l0b_processing.subset_dataset(ds, dict_names, sensor_name)[source]

disdrodb.l0.l0b_processing.write_l0b(ds: Dataset, fpath: str, force=False) → None[source]

Save the xarray dataset into a NetCDF file.

Parameters:

ds (xr.Dataset) – Input xarray dataset.
fpath (str) – Output file path.
sensor_name (str) – Name of the sensor.
force (bool, optional) – Whether to overwrite existing data. If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.

disdrodb.l0.l0b_concat module

disdrodb.l0.l0b_concat.run_disdrodb_l0b_concat(disdrodb_dir, data_sources=None, campaign_names=None, station_names=None, remove_l0b=False, verbose=False)[source]

Concatenate the L0B files of the DISDRODB archive.

This function is called by the run_disdrodb_l0b_concat script.

disdrodb.l0.l0b_concat.run_disdrodb_l0b_concat_station(disdrodb_dir, data_source, campaign_name, station_name, remove_l0b=False, verbose=False)[source]

Concatenate the L0B files of a single DISDRODB station.

This function runs the run_disdrodb_l0b_concat_station script in the terminal.

disdrodb.l0.l0_processing module

disdrodb.l0.l0_processing.click_l0_archive_options(function: object)[source]

Click command line arguments for L0 processing archiving of a station.

Parameters:: function (object) – Function.

disdrodb.l0.l0_processing.click_l0_processing_options(function: object)[source]

Click command line default parameters for L0 processing options.

Parameters:: function (object) – Function.

disdrodb.l0.l0_processing.click_l0_station_arguments(function: object)[source]

Click command line arguments for L0 processing of a station.

Parameters:: function (object) – Function.

disdrodb.l0.l0_processing.click_l0_stations_options(function: object)[source]

Click command line options for DISDRODB archive L0 processing.

Parameters:: function (object) – Function.

disdrodb.l0.l0_processing.click_l0b_concat_options(function: object)[source]

Click command line default parameters for L0B concatenation.

Parameters:: function (object) – Function.

disdrodb.l0.l0_processing.run_disdrodb_l0(disdrodb_dir, data_sources=None, campaign_names=None, station_names=None, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = False, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]

Run the L0 processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB stations, it runs the processing of the stations matching the provided data_sources, campaign_names and station_names.

Parameters:

disdrodb_dir (str) – Base directory of DISDRODB. Format: <…>/DISDRODB
data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None.
campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None.
station_names (list) – Station names to process. The default is None.
l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.
l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.
l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is False.
remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.
remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. This takes place only if l0b_concat=True. The default is False.
force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information to the terminal. The default is False.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process uses a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files. For L0B, it processes just the first 100 rows of 3 L0A files. The default is False.
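
The station selection described for run_disdrodb_l0 can be pictured with a small sketch. It is an illustration under stated assumptions, not the library's logic: the helper select_stations is hypothetical, stations are represented as (data_source, campaign_name, station_name) tuples, the three filters are assumed to combine with a logical AND, and None is assumed to mean "no filter".

```python
def select_stations(available, data_sources=None, campaign_names=None, station_names=None):
    """Return the stations matching every filter that is not None."""

    def matches(value, allowed):
        return allowed is None or value in allowed

    return [
        (data_source, campaign_name, station_name)
        for data_source, campaign_name, station_name in available
        if matches(data_source, data_sources)
        and matches(campaign_name, campaign_names)
        and matches(station_name, station_names)
    ]


# Hypothetical archive content.
available = [
    ("SOURCE_A", "CAMPAIGN_1", "station_1"),
    ("SOURCE_A", "CAMPAIGN_2", "station_1"),
    ("SOURCE_B", "CAMPAIGN_3", "station_9"),
]
selected = select_stations(available, data_sources=["SOURCE_A"])
```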

disdrodb.l0.l0_processing.run_disdrodb_l0_station(disdrodb_dir, data_source, campaign_name, station_name, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = True, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]

Run the L0 processing of a specific DISDRODB station from the terminal.

Parameters:

disdrodb_dir (str) – Base directory of DISDRODB. Format: <…>/DISDRODB
data_source (str) – Institution name (when campaign data spans more than 1 country), or country (when all campaigns (or sensor networks) are inside a given country). Must be UPPER CASE.
campaign_name (str) – Campaign name. Must be UPPER CASE.
station_name (str) – Station name.
l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.
l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.
l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is True.
remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.
remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. This takes place only if l0b_concat=True. The default is False.
force (bool) – If True, overwrite existing data in the destination directories. If False, raise an error if data already exist in the destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information to the terminal. The default is False.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process uses a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files for each station. For L0B, it processes just the first 100 rows of 3 L0A files for each station. The default is False.

disdrodb.l0.l0_processing.run_disdrodb_l0a(disdrodb_dir, data_sources=None, campaign_names=None, station_names=None, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]

disdrodb.l0.l0_processing.run_disdrodb_l0a_station(disdrodb_dir, data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]: Run the L0A processing of a station by calling run_disdrodb_l0a_station in the terminal.

disdrodb.l0.l0_processing.run_disdrodb_l0b(disdrodb_dir, data_sources=None, campaign_names=None, station_names=None, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]

disdrodb.l0.l0_processing.run_disdrodb_l0b_station(disdrodb_dir, data_source, campaign_name, station_name, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]: Run the L0B processing of a station by calling run_disdrodb_l0b_station in the terminal.

disdrodb.l0.l0_processing.run_l0a(raw_dir, processed_dir, station_name, glob_patterns, column_names, reader_kwargs, df_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]

Run the L0A processing for a specific DISDRODB station.

Parameters:

raw_dir (str) –
The directory path where all the raw content of a specific campaign is stored. The path must have the following structure:

<…>/DISDRODB/Raw/<data_source>/<campaign_name>

Inside the raw_dir directory, it is required to adopt the following structure:
- /data/<station_name>/<raw_files>
- /metadata/<station_name>.yaml

Important points:
- For each <station_name> there must be a corresponding YAML file in the metadata subfolder.
- The <campaign_name> must semantically match between:
  - the raw_dir and processed_dir directory paths;
  - the key ‘campaign_name’ within the metadata YAML files.
- The campaign_name is expected to be UPPER CASE.
processed_dir (str) –
The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure:

<…>/DISDRODB/Processed/<data_source>/<campaign_name>

For testing purposes, this function also accepts a directory path simply ending with <campaign_name> (e.g. /tmp/<campaign_name>).
station_name (str) – Station name.
glob_patterns (str) – Glob pattern to search data files in <raw_dir>/data/<station_name>.
column_names (list) – Column names of the raw text file.
reader_kwargs (dict) – Pandas read_csv arguments to open the text file.
df_sanitizer_fun (object, optional) – Sanitizer function to format the dataframe into DISDRODB L0A standard.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed LocalCluster. If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
debugging_mode (bool) – If True, it reduces the amount of data to process. It processes just the first 100 rows of 3 raw data files. The default is False.
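
The raw_dir layout requirement (one metadata YAML file per station folder) can be checked with a short sketch. The helper stations_missing_metadata is hypothetical and not part of the library, and the placeholder names DATA_SOURCE, CAMPAIGN_NAME, station_1 and station_2 exist only for the example, which builds a throwaway directory tree.

```python
import tempfile
from pathlib import Path


def stations_missing_metadata(raw_dir):
    """List the station folders in <raw_dir>/data lacking <raw_dir>/metadata/<station>.yaml."""
    raw_dir = Path(raw_dir)
    stations = sorted(p.name for p in (raw_dir / "data").iterdir() if p.is_dir())
    return [
        station
        for station in stations
        if not (raw_dir / "metadata" / f"{station}.yaml").is_file()
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway campaign directory following the documented layout.
    raw_dir = Path(tmp) / "DISDRODB" / "Raw" / "DATA_SOURCE" / "CAMPAIGN_NAME"
    (raw_dir / "data" / "station_1").mkdir(parents=True)
    (raw_dir / "data" / "station_2").mkdir(parents=True)
    (raw_dir / "metadata").mkdir()
    (raw_dir / "metadata" / "station_1.yaml").write_text("campaign_name: CAMPAIGN_NAME\n")
    missing = stations_missing_metadata(raw_dir)
```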

disdrodb.l0.l0_processing.run_l0b(processed_dir, station_name, parallel, force, verbose, debugging_mode)[source]

Run the L0B processing for a specific DISDRODB station.

Parameters:

raw_dir (str) –
The directory path where all the raw content of a specific campaign is stored. The path must have the following structure:

<…>/DISDRODB/Raw/<data_source>/<campaign_name>

Inside the raw_dir directory, it is required to adopt the following structure:
- /data/<station_name>/<raw_files>
- /metadata/<station_name>.yaml

Important points:
- For each <station_name> there must be a corresponding YAML file in the metadata subfolder.
- The <campaign_name> must semantically match between:
  - the raw_dir and processed_dir directory paths;
  - the key ‘campaign_name’ within the metadata YAML files.
- The campaign_name is expected to be UPPER CASE.
processed_dir (str) –
The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure:

<…>/DISDRODB/Processed/<data_source>/<campaign_name>

For testing purposes, this function also accepts a directory path simply ending with <campaign_name> (e.g. /tmp/<campaign_name>).
station_name (str) – Station name
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information into terminal. The default is True.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed LocalCluster. Ensure that threads_per_worker (the number of threads per process) is set to 1 to avoid HDF errors. Also ensure that the HDF5_USE_FILE_LOCKING environment variable is set to False. If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. It processes just 3 raw data files. The default is False.
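
The settings recommended for parallel L0B processing might be prepared as in the sketch below. The variable name cluster_kwargs and the choice of one worker per CPU are assumptions. The dask.distributed calls are left as comments so that the snippet runs without dask installed.

```python
import os

# Disable HDF5 file locking before any worker process opens a netCDF file.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

# One single-threaded worker process per CPU (os.cpu_count() may return None).
cluster_kwargs = {
    "n_workers": os.cpu_count() or 1,
    "threads_per_worker": 1,
    "processes": True,
}

# With dask.distributed installed, the settings could be used as follows:
# from dask.distributed import Client, LocalCluster
# client = Client(LocalCluster(**cluster_kwargs))
```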

disdrodb.l0.l0_processing.run_l0b_from_nc(raw_dir, processed_dir, station_name, glob_patterns, dict_names, ds_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]

Run the L0B processing for a specific DISDRODB station with raw netCDFs.

Parameters:

raw_dir (str) –
The directory path where all the raw content of a specific campaign is stored. The path must have the following structure:

<…>/DISDRODB/Raw/<data_source>/<campaign_name>

Inside the raw_dir directory, it is required to adopt the following structure:
- /data/<station_name>/<raw_files>
- /metadata/<station_name>.yaml

Important points:
- For each <station_name> there must be a corresponding YAML file in the metadata subfolder.
- The <campaign_name> must semantically match between:
  - the raw_dir and processed_dir directory paths;
  - the key ‘campaign_name’ within the metadata YAML files.
- The campaign_name is expected to be UPPER CASE.
processed_dir (str) –
The desired directory path for the processed DISDRODB L0B products. The path should have the following structure:

<…>/DISDRODB/Processed/<data_source>/<campaign_name>

For testing purposes, this function also accepts a directory path simply ending with <campaign_name> (e.g. /tmp/<campaign_name>).
station_name (str) – Station name
glob_patterns (str) – Glob pattern to search data files in <raw_dir>/data/<station_name>. Example: glob_patterns = “*.nc”
dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.
ds_sanitizer_fun (object, optional) – Sanitizer function to format the raw netCDF into DISDRODB L0B standard.
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed LocalCluster. If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. It processes just the first 3 raw netCDF files. The default is False.

disdrodb.l0.l0_reader module

disdrodb.l0.l0_reader.available_readers(data_sources=None, reader_path=False)[source]: Retrieve available readers information.

disdrodb.l0.l0_reader.check_available_readers()[source]: Check the arguments of all readers in the package.

disdrodb.l0.l0_reader.check_reader_arguments(reader)[source]: Check that the reader has the expected input arguments.

disdrodb.l0.l0_reader.check_reader_exists(reader_data_source: str, reader_name: str) → str[source]

Check that the provided data source and reader name exist within the available readers.

Run get_available_readers_dict() to get the list of all available readers.

Parameters:

reader_data_source (str) – The directory within which the reader_name is located in the disdrodb.l0.readers directory.
reader_name (str) – Campaign name

Returns:

The reader name if the reader exists. Otherwise, an error is raised.

Return type:

str

Raises:

ValueError – Error if the reader name provided for the campaign has not been found.

disdrodb.l0.l0_reader.get_available_readers_dict() → dict[source]

Return the description of the readers included in the current release of DISDRODB.

Returns:: The dictionary has the following schema {“data_source”: {“reader_name”: “reader_file_path”}}
Return type:: dict
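
The documented schema lends itself to a simple nested lookup. The following is a minimal sketch, not the DISDRODB implementation: the dictionary contents (SOURCE_A, CAMPAIGN_1, the file paths) and the helper reader_exists are hypothetical and only illustrate the {“data_source”: {“reader_name”: “reader_file_path”}} layout.

```python
# Hypothetical readers dictionary following the documented schema:
# {"data_source": {"reader_name": "reader_file_path"}}
readers = {
    "SOURCE_A": {
        "CAMPAIGN_1": "/path/to/readers/SOURCE_A/CAMPAIGN_1.py",
        "CAMPAIGN_2": "/path/to/readers/SOURCE_A/CAMPAIGN_2.py",
    },
    "SOURCE_B": {"CAMPAIGN_3": "/path/to/readers/SOURCE_B/CAMPAIGN_3.py"},
}


def reader_exists(readers: dict, data_source: str, reader_name: str) -> str:
    """Return reader_name if it is listed under data_source, else raise ValueError."""
    if data_source not in readers:
        raise ValueError(f"Unknown data source: {data_source!r}")
    if reader_name not in readers[data_source]:
        raise ValueError(f"Reader {reader_name!r} not found for {data_source!r}")
    return reader_name


print(reader_exists(readers, "SOURCE_A", "CAMPAIGN_1"))
```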

disdrodb.l0.l0_reader.get_reader(reader_data_source: str, reader_name: str) → object[source]

Returns the reader function based on input parameters.

Parameters:

reader_data_source (str) – The directory within which the reader_name is located in the disdrodb.l0.readers directory.
reader_name (str) – The reader name.

Returns:

The reader() function

Return type:

object

disdrodb.l0.l0_reader.get_reader_from_metadata_reader_key(reader_data_source_name)[source]

Retrieve the reader from the reader metadata value.

The convention for metadata reader key: <data_source/reader_name> in disdrodb.l0.readers
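
A minimal sketch of how a key following the <data_source/reader_name> convention could be split into its two parts. The helper name split_reader_key and the example key are hypothetical; the actual parsing performed by the package may differ.

```python
def split_reader_key(reader_key: str) -> tuple:
    """Split a '<data_source>/<reader_name>' key into its two components."""
    parts = reader_key.split("/")
    # Exactly two non-empty components are expected.
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Expected '<data_source>/<reader_name>', got {reader_key!r}"
        )
    data_source, reader_name = parts
    return data_source, reader_name


data_source, reader_name = split_reader_key("SOURCE_A/CAMPAIGN_1")
```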

disdrodb.l0.l0_reader.get_station_reader(disdrodb_dir, data_source, campaign_name, station_name)[source]: Retrieve the reader from the station metadata information.

disdrodb.l0.l0_reader.is_documented_by(original)[source]

Wrapper function to apply generic docstring to the decorated function.

Parameters:: original (function) – Function to take the docstring from.
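
A decorator of this kind is usually only a few lines long. The sketch below shows the general technique under a hypothetical name (copy_docstring_from); it is an illustration and not necessarily how is_documented_by is written.

```python
def copy_docstring_from(original):
    """Return a decorator that copies the docstring of `original`."""

    def decorator(target):
        # Only the docstring is replaced; the function body is untouched.
        target.__doc__ = original.__doc__
        return target

    return decorator


def generic_docstring():
    """Generic description shared by several functions."""


@copy_docstring_from(generic_docstring)
def reader():
    return "ok"
```

With this pattern a long, shared parameter description is written once and attached to every function that should expose it.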

disdrodb.l0.l0_reader.reader_generic_docstring()[source]

Script to convert the raw data to L0A format.

Parameters:

raw_dir (str) –
The directory path where all the raw content of a specific campaign is stored. The path must have the following structure:

<…>/DISDRODB/Raw/<data_source>/<campaign_name>.

Inside the raw_dir directory, it is required to adopt the following structure:
- /data/<station_name>/<raw_files>
- /metadata/<station_name>.yaml
Important points:
- For each <station_name> there must be a corresponding YAML file in the metadata subfolder.
- The <campaign_name> must semantically match between:
  - the raw_dir and processed_dir directory paths;
  - the key ‘campaign_name’ within the metadata YAML files.
- The campaign_name is expected to be UPPER CASE.
processed_dir (str) –
The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure:

<…>/DISDRODB/Processed/<data_source>/<campaign_name>

For testing purposes, this function exceptionally also accepts a directory path simply ending with <campaign_name> (i.e. /tmp/<campaign_name>).
station_name (str) – Station name
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information into terminal. The default is True.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed LocalCluster. If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. It processes just the first 3 raw data files. The default is False.

disdrodb.l0.check_configs module

disdrodb.l0.check_configs.check_bin_consistency(sensor_name: str) → None[source]

Check bin consistency from config file.

The first and last bins are not checked.

Parameters:: sensor_name (str) – Name of the sensor.

disdrodb.l0.check_configs.check_sensor_configs(sensor_name: str) → None[source]

Check the validity of the sensor configurations.

Parameters:: sensor_name (str) – Name of the sensor.

disdrodb.l0.check_configs.check_variable_keys_consistency(sensor_name: str) → None[source]

Check attributes consistency from config file.

Parameters:: sensor_name (str) – Name of the sensor.

disdrodb.l0.check_metadata module

disdrodb.l0.check_metadata.check_archive_metadata_campaign_name(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_compliance(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_data_source(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_geolocation(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_keys(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_reader(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_sensor_name(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_archive_metadata_station_name(disdrodb_dir)[source]

disdrodb.l0.check_metadata.check_metadata_geolocation(metadata) → None[source]: Identify metadata with missing or wrong geolocation.

disdrodb.l0.check_metadata.get_archive_metadata_key_value(disdrodb_dir, key, return_tuple=True)[source]: Return the values of a metadata key for all the archive.

disdrodb.l0.check_metadata.identify_empty_metadata_keys(metadata_fpaths: list, keys: str | list) → None[source]

Identify empty metadata keys.

Parameters:

metadata_fpaths (list) – Input YAML file paths.
keys (Union[str,list]) – Attributes to verify the presence.

disdrodb.l0.check_metadata.identify_missing_metadata_coords(metadata_fpaths: str) → None[source]

Identify missing coordinates.

Parameters:: metadata_fpaths (str) – Input YAML file path.
Raises:: TypeError – Error if latitude or longitude coordinates are not present or are wrongly formatted.

disdrodb.l0.check_metadata.read_yaml(fpath: str) → dict[source]

Read YAML file.

Parameters:: fpath (str) – Input YAML file path.
Returns:: Attributes read from the YAML file.
Return type:: dict

disdrodb.l0.check_standards module

disdrodb.l0.check_standards.check_l0a_column_names(df: DataFrame, sensor_name: str) → None[source]

Checks that the dataframe columns respect the DISDRODB standards.

Parameters:

df (pd.DataFrame) – Input dataframe.
sensor_name (str) – Name of the sensor.

Raises:

ValueError – Error if some columns do not meet the DISDRODB standards or if the ‘time’ column is missing in the dataframe.

disdrodb.l0.check_standards.check_l0a_standards(df: DataFrame, sensor_name: str, verbose: bool = True) → None[source]

Checks that a file respects the DISDRODB L0A standards.

Parameters:

df (pd.DataFrame) – L0A dataframe.
sensor_name (str) – Name of the sensor.
verbose (bool, optional) – Whether to print detailed processing information. The default is True.

Raises:

ValueError – Error if some columns have inconsistent values.

disdrodb.l0.check_standards.check_l0b_standards(x: str) → None[source]

disdrodb.l0.check_standards.check_sensor_name(sensor_name: str) → None[source]

Check sensor name.

Parameters:

sensor_name (str) – Name of the sensor.

Raises:

TypeError – Error if sensor_name is not a string.
ValueError – Error if the input sensor name has not been found in the list of available sensors.

disdrodb.l0.io module

disdrodb.l0.io.check_glob_pattern(pattern: str) → None[source]

Check that the input parameter is a string and that it can be used as a glob pattern.

Parameters:

pattern (str) – String to be checked.

Raises:

TypeError – The input parameter is not a string.
ValueError – The input parameter can not be used as pattern.

disdrodb.l0.io.check_glob_patterns(patterns: str | list) → list[source]: Check if the glob patterns are valid.

disdrodb.l0.io.check_processed_dir(processed_dir)[source]

disdrodb.l0.io.check_raw_dir(raw_dir: str, verbose: bool = False) → None[source]

Check validity of raw_dir.

Steps:
1. Check that ‘raw_dir’ is a valid directory path.
2. Check that ‘raw_dir’ follows the expected directory structure.
3. Check that each station_name directory contains data.
4. Check that for each station_name the mandatory metadata.yml is specified.
5. Check that for each station_name the mandatory issue.yml is specified.

Parameters:

raw_dir (str) – Input raw directory
verbose (bool, optional) – Whether to print detailed processing information. The default is False.
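
The directory checks listed in the steps can be sketched with the standard library alone. This is an illustration, not the check_raw_dir implementation: the helper find_raw_dir_problems is hypothetical, it assumes the data/<station_name>/ and metadata/<station_name>.yaml layout described in the reader docstring, it reports problems as a list instead of raising, and it omits the issue file check because the location of that file is not stated here.

```python
import tempfile
from pathlib import Path


def find_raw_dir_problems(raw_dir: str) -> list:
    """Return a list of problems found in raw_dir (empty if none)."""
    root = Path(raw_dir)
    if not root.is_dir():
        return [f"{raw_dir} is not a directory"]
    problems = []
    data_dir = root / "data"
    metadata_dir = root / "metadata"
    for required in (data_dir, metadata_dir):
        if not required.is_dir():
            problems.append(f"missing directory: {required.name}")
    if problems:
        return problems
    for station_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Each station directory must contain at least one raw file.
        if not any(f.is_file() for f in station_dir.rglob("*")):
            problems.append(f"no data for station {station_dir.name}")
        # Each station needs a matching metadata YAML file.
        if not (metadata_dir / f"{station_dir.name}.yaml").is_file():
            problems.append(f"missing metadata for station {station_dir.name}")
    return problems


# Build a small throwaway archive to exercise the checks.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "DISDRODB" / "Raw" / "SOURCE_A" / "CAMPAIGN_1"
    (raw_dir / "data" / "station_1").mkdir(parents=True)
    (raw_dir / "data" / "station_2").mkdir(parents=True)
    (raw_dir / "metadata").mkdir()
    (raw_dir / "data" / "station_1" / "file.txt").write_text("raw")
    (raw_dir / "metadata" / "station_1.yaml").write_text("station_name: station_1\n")
    problems = find_raw_dir_problems(str(raw_dir))
```

Collecting every problem before reporting is a design choice of the sketch; it lets a user fix a whole campaign directory in one pass.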

disdrodb.l0.io.create_directory_structure(processed_dir, product_level, station_name, force, verbose=False)[source]: Create directory structure for L0B and higher DISDRODB products.

disdrodb.l0.io.create_initial_directory_structure(raw_dir, processed_dir, station_name, force, verbose=False, product_level='L0A')[source]

Create directory structure for the first L0 DISDRODB product.

If the input data are raw text files –> product_level = “L0A” (run_l0a)
If the input data are raw netCDF files –> product_level = “L0B” (run_l0b_nc)

disdrodb.l0.io.get_L0A_dir(processed_dir: str, station_name: str) → str[source]

Define L0A directory.

Parameters:

processed_dir (str) – Path of the processed directory
station_name (str) – Name of the station

Returns:

L0A directory path.

Return type:

str

disdrodb.l0.io.get_L0A_fname(df, processed_dir, station_name: str) → str[source]

Define L0A file name.

Parameters:

df (pd.DataFrame) – L0A DataFrame
processed_dir (str) – Path of the processed directory
station_name (str) – Name of the station

Returns:

L0A file name.

Return type:

str

disdrodb.l0.io.get_L0A_fpath(df: DataFrame, processed_dir: str, station_name: str) → str[source]

Define L0A file path.

Parameters:

df (pd.DataFrame) – L0A DataFrame.
processed_dir (str) – Path of the processed directory.
station_name (str) – Name of the station.

Returns:

L0A file path.

Return type:

str

disdrodb.l0.io.get_L0B_dir(processed_dir: str, station_name: str) → str[source]

Define L0B directory.

Parameters:

processed_dir (str) – Path of the processed directory
station_name (str) – Name of the station

Returns:

Path of the L0B directory

Return type:

str

disdrodb.l0.io.get_L0B_fname(ds, processed_dir, station_name: str) → str[source]

Define L0B file name.

Parameters:

ds (xr.Dataset) – L0B xarray Dataset
processed_dir (str) – Path of the processed directory
station_name (str) – Name of the station

Returns:

L0B file name.

Return type:

str

disdrodb.l0.io.get_L0B_fpath(ds: Dataset, processed_dir: str, station_name: str, l0b_concat=False) → str[source]

Define L0B file path.

Parameters:

ds (xr.Dataset) – L0B xarray Dataset.
processed_dir (str) – Path of the processed directory.
station_name (str) – ID of the station
l0b_concat (bool) – If False, the file is specified inside the station directory. If True, the file is specified outside the station directory.

Returns:

L0B file path.

Return type:

str

disdrodb.l0.io.get_campaign_name(path: str) → str[source]

Return the campaign name from a file or directory path.

Current assumption: no data_source, campaign_name, station_name or file contain the word DISDRODB!

Parameters:: path (str) – path can be a campaign_dir (‘raw_dir’ or ‘processed_dir’), or a DISDRODB file path.
Returns:: Name of the campaign.
Return type:: str

disdrodb.l0.io.get_data_source(path: str) → str[source]

Return the data_source from a file or directory path.

Current assumption: no data_source, campaign_name, station_name or file contain the word DISDRODB!

Parameters:: path (str) – path can be a campaign_dir (‘raw_dir’ or ‘processed_dir’), or a DISDRODB file path.
Returns:: Name of the data source.
Return type:: str

disdrodb.l0.io.get_dataframe_min_max_time(df: DataFrame)[source]

Retrieves dataframe starting and ending time.

Parameters:: df (pd.DataFrame) – Input dataframe
Returns:: (starting_time, ending_time)
Return type:: tuple

disdrodb.l0.io.get_dataset_min_max_time(ds: Dataset)[source]

Retrieves dataset starting and ending time.

Parameters:: ds (xr.Dataset) – Input dataset
Returns:: (starting_time, ending_time)
Return type:: tuple

disdrodb.l0.io.get_disdrodb_dir(path: str) → str[source]

Return the disdrodb base directory from a file or directory path.

Current assumption: no data_source, campaign_name, station_name or file contain the word DISDRODB!

Parameters:: path (str) – path can be a campaign_dir (‘raw_dir’ or ‘processed_dir’), or a DISDRODB file path.
Returns:: Path of the DISDRODB directory.
Return type:: str

disdrodb.l0.io.get_disdrodb_path(path: str) → str[source]

Return the path from the disdrodb_dir directory.

Current assumption: no data_source, campaign_name, station_name or file contain the word DISDRODB!

Parameters:: path (str) – path can be a campaign_dir (‘raw_dir’ or ‘processed_dir’), or a DISDRODB file path.
Returns:: Path inside the DISDRODB archive. Format: DISDRODB/<Raw or Processed>/<data_source>/…
Return type:: str
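
The path helpers above all rely on the stated assumption that the word DISDRODB appears only once in a path. A minimal sketch of that idea follows; the function split_disdrodb_path, its returned keys and the example path are hypothetical, and it assumes POSIX-style paths shaped like <…>/DISDRODB/<Raw or Processed>/<data_source>/<campaign_name>/….

```python
from pathlib import PurePosixPath


def split_disdrodb_path(path: str) -> dict:
    """Extract archive components from a path containing one 'DISDRODB' part."""
    parts = PurePosixPath(path).parts
    if parts.count("DISDRODB") != 1:
        raise ValueError("Expected exactly one 'DISDRODB' component in the path")
    idx = parts.index("DISDRODB")
    # inside = ('DISDRODB', 'Raw' or 'Processed', <data_source>, <campaign_name>, ...)
    inside = parts[idx:]
    if len(inside) < 4:
        raise ValueError("Path too short to contain data_source and campaign_name")
    return {
        "disdrodb_dir": str(PurePosixPath(*parts[: idx + 1])),
        "disdrodb_path": str(PurePosixPath(*inside)),
        "data_source": inside[2],
        "campaign_name": inside[3],
    }


info = split_disdrodb_path(
    "/tmp/archive/DISDRODB/Raw/SOURCE_A/CAMPAIGN_1/data/station_1"
)
```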

disdrodb.l0.io.get_l0a_file_list(processed_dir, station_name, debugging_mode)[source]

Retrieve L0A files for a given station.

Parameters:

processed_dir (str) – Directory of the campaign where to search for the L0A files. Format <..>/DISDRODB/Processed/<data_source>/<campaign_name>
station_name (str) – ID of the station
debugging_mode (bool, optional) – If True, it selects a maximum of 3 files for debugging purposes. The default is False.

Returns:

list_fpaths – List of L0A file paths.

Return type:

list

disdrodb.l0.io.get_raw_file_list(raw_dir, station_name, glob_patterns, verbose=False, debugging_mode=False)[source]

Get the list of files from a directory based on input parameters.

Currently concatenates all files provided by the glob patterns. In future, this might be modified to enable DISDRODB processing when raw data are separated in multiple files.

Parameters:

raw_dir (str) – Directory of the campaign where to search for files. Format <..>/DISDRODB/Raw/<data_source>/<campaign_name>
station_name (str) – ID of the station
verbose (bool, optional) – Whether to print detailed processing information. The default is False.
debugging_mode (bool, optional) – If True, it selects a maximum of 3 files for debugging purposes. The default is False.

Returns:

list_fpaths – List of file paths.

Return type:

list
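
A minimal sketch of collecting files from several glob patterns and truncating the list in debugging mode. The helper list_raw_files is hypothetical, and it assumes the patterns are relative to <raw_dir>/data/<station_name>; the package may resolve patterns differently.

```python
import glob
import os
import tempfile


def list_raw_files(raw_dir, station_name, glob_patterns, debugging_mode=False):
    """Collect sorted file paths matching glob_patterns under data/<station_name>."""
    if isinstance(glob_patterns, str):
        glob_patterns = [glob_patterns]
    station_dir = os.path.join(raw_dir, "data", station_name)
    fpaths = []
    for pattern in glob_patterns:
        fpaths.extend(glob.glob(os.path.join(station_dir, pattern)))
    # Remove duplicates matched by more than one pattern and sort the result.
    fpaths = sorted(set(fpaths))
    if not fpaths:
        raise ValueError(f"No file found for station {station_name!r}")
    if debugging_mode:
        # Keep at most 3 files, as described for debugging_mode.
        fpaths = fpaths[:3]
    return fpaths


with tempfile.TemporaryDirectory() as raw_dir:
    station_dir = os.path.join(raw_dir, "data", "station_1")
    os.makedirs(station_dir)
    for i in range(5):
        open(os.path.join(station_dir, f"file_{i}.txt"), "w").close()
    all_files = [
        os.path.basename(p) for p in list_raw_files(raw_dir, "station_1", "*.txt")
    ]
    few_files = [
        os.path.basename(p)
        for p in list_raw_files(raw_dir, "station_1", ["*.txt"], debugging_mode=True)
    ]
```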

disdrodb.l0.io.read_L0A_dataframe(fpaths: str | list, verbose: bool = False, debugging_mode: bool = False) → DataFrame[source]

Read DISDRODB L0A Apache Parquet file(s).

Parameters:

fpaths (str or list) – Either a single file path or a list of file paths.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
debugging_mode (bool) – If True, it reduces the amount of data to process. If fpaths is a list, it reads only the first 3 files. For each file, it selects only the first 100 rows. The default is False.

Returns:

L0A Dataframe.

Return type:

pd.DataFrame

disdrodb.l0.issue module

class disdrodb.l0.issue.NoDatesSafeLoader(stream)[source]

Bases: SafeLoader

classmethod remove_implicit_resolver(tag_to_remove)[source]

Remove implicit resolvers for a particular tag

Takes care not to modify resolvers in super classes.

We want to load datetimes as strings, not dates, because we go on to serialise as json which doesn’t have the advanced types of yaml, and leads to incompatibilities down the track.

disdrodb.l0.issue.check_issue_dict(issue_dict)[source]: Check validity of the issue dictionary

disdrodb.l0.issue.check_issue_file(fpath: str) → None[source]

Check issue YAML file validity.

Parameters:: fpath (str) – Issue YAML file path.

disdrodb.l0.issue.check_time_periods(time_periods)[source]: Check time_periods validity.

disdrodb.l0.issue.check_timesteps(timesteps)[source]

Check timesteps validity.

It expects timesteps string in YYYY-mm-dd HH:MM:SS format with second accuracy. If timesteps is None, return None.
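
The expected format can be validated with datetime.strptime. The sketch below is an illustration under a hypothetical name (parse_timesteps); it assumes a single string or a list of strings is accepted and returns parsed datetimes, which may differ from what check_timesteps returns.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timesteps(timesteps):
    """Validate timesteps given as 'YYYY-mm-dd HH:MM:SS' strings."""
    if timesteps is None:
        return None
    if isinstance(timesteps, str):
        timesteps = [timesteps]
    parsed = []
    for value in timesteps:
        # strptime tolerates missing zero padding, so enforce the full
        # 19-character form explicitly.
        if len(value) != 19:
            raise ValueError(f"Expected 'YYYY-mm-dd HH:MM:SS', got {value!r}")
        parsed.append(datetime.strptime(value, TIME_FORMAT))
    return parsed


steps = parse_timesteps(["2018-12-07 14:15:00", "2018-12-07 14:17:00"])
```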

disdrodb.l0.issue.is_numpy_array_datetime(arr)[source]

disdrodb.l0.issue.is_numpy_array_string(arr)[source]: Check if the numpy array contains strings.

disdrodb.l0.issue.load_yaml_without_date_parsing(filepath)[source]: Read a YAML file without converting automatically date string to datetime.

disdrodb.l0.issue.read_issue(raw_dir: str, station_name: str) → dict[source]

Read YAML issue file.

Parameters:

raw_dir (str) – Path of the campaign raw directory.
station_name (str) – Station name.

Returns:

Issue dictionary.

Return type:

dict

disdrodb.l0.issue.read_issue_file(fpath: str) → dict[source]

Read YAML issue file.

Parameters:: fpath (str) – Filepath of the issue YAML.
Returns:: Issue dictionary.
Return type:: dict

disdrodb.l0.issue.write_default_issue(fpath: str) → None[source]

Write an empty issue YAML file.

Parameters:: fpath (str) – Filepath of the issue YAML to write.

disdrodb.l0.issue.write_issue_dict(fpath: str, issue_dict: dict) → None[source]

Write the issue YAML file.

Parameters:

fpath (str) – Filepath of the issue YAML to write.
issue_dict (dict) – Issue dictionary

disdrodb.l0.metadata module

disdrodb.l0.metadata.add_missing_metadata_keys(metadata)[source]: Add missing keys to the metadata dictionary.

disdrodb.l0.metadata.check_metadata_compliance(disdrodb_dir, data_source, campaign_name, station_name)[source]: Check DISDRODB metadata compliance.

disdrodb.l0.metadata.create_campaign_default_metadata(disdrodb_dir, campaign_name, data_source)[source]

Create default YAML metadata files for all stations within a campaign.

Use the function with caution to avoid overwriting existing YAML files.

disdrodb.l0.metadata.get_default_metadata_dict() → dict[source]

Get DISDRODB metadata default values.

Returns:: Dictionary of standard attributes
Return type:: dict

disdrodb.l0.metadata.get_metadata_missing_keys(metadata)[source]: Return the DISDRODB metadata keys which are missing.

disdrodb.l0.metadata.get_metadata_unvalid_keys(metadata)[source]: Return the DISDRODB metadata keys which are not valid.

disdrodb.l0.metadata.get_valid_metadata_keys() → list[source]

Get DISDRODB valid metadata list.

Returns:: List of valid metadata keys
Return type:: list

disdrodb.l0.metadata.read_metadata(campaign_dir: str, station_name: str) → dict[source]

Read YAML metadata file.

Parameters:

campaign_dir (str) – Path of the campaign directory
station_name (str) – ID of the station.

Returns:

Dictionary of the metadata.

Return type:

dict

disdrodb.l0.metadata.remove_unvalid_metadata_keys(metadata)[source]: Remove unvalid keys from the metadata dictionary.

disdrodb.l0.metadata.sort_metadata_dictionary(metadata)[source]: Sort the keys of the metadata dictionary by valid_metadata_keys list order.

disdrodb.l0.metadata.write_default_metadata(fpath: str) → None[source]

Create default YAML metadata file at the specified filepath.

Parameters:: fpath (str) – File path

disdrodb.l0.metadata.write_metadata(metadata, fpath)[source]: Write dictionary to YAML file.

disdrodb.l0.standards module

disdrodb.l0.standards.available_sensor_name() → sorted[source]

Get available names of sensors.

Returns:: Sorted list of the available sensors
Return type:: sorted

disdrodb.l0.standards.get_L0A_encodings_dict(sensor_name: str) → dict[source]

Get a dictionary containing the L0A encodings

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: L0A encodings
Return type:: dict

disdrodb.l0.standards.get_L0B_encodings_dict(sensor_name: str) → dict[source]

Get a dictionary containing the encoding to write L0B netCDFs.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Encoding to write L0B netCDFs
Return type:: dict

disdrodb.l0.standards.get_configs_dir(sensor_name: str) → str[source]

Retrieve configs directory.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Config directory.
Return type:: str
Raises:: ValueError – Error if the config directory does not exist.

disdrodb.l0.standards.get_coords_attrs_dict(ds)[source]: Return dictionary with DISDRODB coordinates attributes.

disdrodb.l0.standards.get_data_format_dict(sensor_name: str) → dict[source]

Get a dictionary containing the data format of each sensor variable.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Data format of each sensor variable
Return type:: dict

disdrodb.l0.standards.get_data_range_dict(sensor_name: str) → dict[source]

Get the variable data range.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the expected data value range for each data field. It excludes variables without specified data_range key.
Return type:: dict

disdrodb.l0.standards.get_description_dict(sensor_name: str) → dict[source]

Get a dictionary containing the description of each sensor variable.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Description of each sensor variable.
Return type:: dict

disdrodb.l0.standards.get_diameter_bin_center(sensor_name: str) → list[source]

Get diameter bin center.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Diameter bin center
Return type:: list

disdrodb.l0.standards.get_diameter_bin_lower(sensor_name: str) → list[source]

Get diameter bin lower bound.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Diameter bin lower bound
Return type:: list

disdrodb.l0.standards.get_diameter_bin_upper(sensor_name: str) → list[source]

Get diameter bin upper bound.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Diameter bin upper bound
Return type:: list

disdrodb.l0.standards.get_diameter_bin_width(sensor_name: str) → list[source]

Get diameter bin width.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Diameter bin width
Return type:: list

disdrodb.l0.standards.get_diameter_bins_dict(sensor_name: str) → dict[source]

Get dictionary with sensor_name diameter bins information.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: sensor_name diameter bins information
Return type:: dict
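
The four bin quantities (lower bound, upper bound, center, width) are related in the usual way. The sketch below assumes that the center is the midpoint of the bounds and that the width is their difference; the key names and the bin values are illustrative only and are not taken from any sensor configuration or from the dictionary returned by get_diameter_bins_dict.

```python
def build_bins_dict(lower, upper):
    """Derive bin centers and widths from lower and upper bounds."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    if any(lo >= up for lo, up in zip(lower, upper)):
        raise ValueError("each lower bound must be smaller than its upper bound")
    return {
        "lower": list(lower),
        "upper": list(upper),
        # Assumption: the center is the midpoint of the two bounds.
        "center": [(lo + up) / 2 for lo, up in zip(lower, upper)],
        "width": [up - lo for lo, up in zip(lower, upper)],
    }


# Illustrative values only.
bins = build_bins_dict([0.0, 0.25, 0.5], [0.25, 0.5, 1.0])
```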

disdrodb.l0.standards.get_dims_size_dict(sensor_name: str) → dict[source]

Get the number of bins for each dimension.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the number of bins for each dimension.
Return type:: dict

disdrodb.l0.standards.get_field_nchar_dict(sensor_name: str) → dict[source]

Get the total number of characters from the instrument default string standards.

Important note: it also accounts for the comma and the minus sign!

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the expected number of characters for each data field.
Return type:: dict

disdrodb.l0.standards.get_field_ndigits_decimals_dict(sensor_name: dict) → dict[source]

Get number of digits on the right side of the comma from the instrument default string standards.

Example: 123,45 -> 45 –> 2 decimal digits

Parameters:: sensor_name (dict) – Name of the sensor.
Returns:: Dictionary with the expected number of decimal digits for each data field.
Return type:: dict

disdrodb.l0.standards.get_field_ndigits_dict(sensor_name: str) → dict[source]

Get number of digits from the instrument default string standards.

Important note: it excludes the comma but counts the minus sign!

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the expected number of digits for each data field.
Return type:: dict

disdrodb.l0.standards.get_field_ndigits_natural_dict(sensor_name: str) → dict[source]

Get number of digits on the left side of the comma from the instrument default string standards.

Example: 123,45 -> 123 –> 3 natural digits

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the expected number of natural digits for each data field.
Return type:: dict
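
The four counting conventions used by the get_field_* functions (total characters, digits, natural digits, decimal digits) can be illustrated on a single field string. This is a sketch under assumptions: the helper field_string_stats is hypothetical, the separator is taken to be the comma used in the examples, and the minus sign is assumed not to count as a natural digit.

```python
def field_string_stats(string: str, separator: str = ",") -> dict:
    """Count characters and digits of a numeric field string such as '123,45'."""
    natural, _, decimals = string.partition(separator)
    return {
        # Every character, separator and minus sign included.
        "nchar": len(string),
        # The separator is excluded, the minus sign is counted.
        "ndigits": len(string.replace(separator, "")),
        # Digits on the left side of the separator (sign ignored: assumption).
        "ndigits_natural": sum(c.isdigit() for c in natural),
        # Digits on the right side of the separator.
        "ndigits_decimals": sum(c.isdigit() for c in decimals),
    }


stats = field_string_stats("123,45")
negative_stats = field_string_stats("-12,5")
```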

disdrodb.l0.standards.get_l0a_dtype(sensor_name: str) → dict[source]

Get a dictionary containing the L0A dtype.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: L0A dtype
Return type:: dict

disdrodb.l0.standards.get_long_name_dict(sensor_name: str) → dict[source]

Get a dictionary containing the long name of each sensor variable.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Long name of each sensor variable.
Return type:: dict

disdrodb.l0.standards.get_n_diameter_bins(sensor_name)[source]: Get the number of diameter bins.

disdrodb.l0.standards.get_n_velocity_bins(sensor_name)[source]: Get the number of velocity bins.

disdrodb.l0.standards.get_nan_flags_dict(sensor_name: str) → dict[source]

Get the variable nan_flags.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the expected nan_flags list for each data field. It excludes variables without specified nan_flags key.
Return type:: dict

disdrodb.l0.standards.get_raw_array_dims_order(sensor_name: str) → dict[source]

Get the dimension order of the raw fields.

The order of dimension specified for raw_drop_number controls the reshaping of the precipitation raw spectrum.

Examples

OTT Parsivel spectrum [v1d1 … v1d32, v2d1, …, v2d32] –> dimension_order = [“velocity_bin_center”, “diameter_bin_center”]
Thies LPM spectrum [v1d1 … v20d1, v1d2, …, v20d2] –> dimension_order = [“diameter_bin_center”, “velocity_bin_center”]

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Dimension order dictionary
Return type:: dict
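
How the dimension order drives the reshaping can be shown with plain lists. The sketch below uses a hypothetical helper (reshape_spectrum) and a toy spectrum with 2 velocity bins and 3 diameter bins; it assumes a row-major reshape in which the first name in dimension_order indexes the rows, and it does not reproduce the package's own reshaping code.

```python
def reshape_spectrum(values, n_velocity, n_diameter, dimension_order):
    """Reshape a flat spectrum into rows indexed by the first dimension."""
    sizes = {"velocity_bin_center": n_velocity, "diameter_bin_center": n_diameter}
    n_rows, n_cols = (sizes[name] for name in dimension_order)
    if len(values) != n_rows * n_cols:
        raise ValueError("Unexpected number of values in the raw spectrum")
    # Row-major split: consecutive values fill one row at a time.
    return [list(values[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# Parsivel-like ordering: the diameter index varies fastest.
parsivel_like = reshape_spectrum(
    ["v1d1", "v1d2", "v1d3", "v2d1", "v2d2", "v2d3"],
    n_velocity=2,
    n_diameter=3,
    dimension_order=["velocity_bin_center", "diameter_bin_center"],
)

# Thies-like ordering: the velocity index varies fastest.
thies_like = reshape_spectrum(
    ["v1d1", "v2d1", "v1d2", "v2d2", "v1d3", "v2d3"],
    n_velocity=2,
    n_diameter=3,
    dimension_order=["diameter_bin_center", "velocity_bin_center"],
)
```

In the first case the result is indexed as [velocity][diameter], in the second as [diameter][velocity], which is why the order must be recorded per sensor.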

disdrodb.l0.standards.get_raw_array_nvalues(sensor_name: str) → dict[source]

Get a dictionary with the number of values expected for each raw array.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Field definition.
Return type:: dict

disdrodb.l0.standards.get_sensor_variables(sensor_name: str) → list[source]

Get sensor variable names list.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: List of the variables values
Return type:: list

disdrodb.l0.standards.get_time_encoding() → dict[source]

Create time encoding

Returns:: Time encoding
Return type:: dict

disdrodb.l0.standards.get_units_dict(sensor_name: str) → dict[source]

Get a dictionary containing the unit of each sensor variable.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Unit of each sensor variable
Return type:: dict

disdrodb.l0.standards.get_valid_coordinates_names(sensor_name)[source]: Get list of valid coordinates.

disdrodb.l0.standards.get_valid_dimension_names(sensor_name)[source]: Get list of valid dimension names.

disdrodb.l0.standards.get_valid_names(sensor_name)[source]

disdrodb.l0.standards.get_valid_values_dict(sensor_name: str) → dict[source]

Get the list of valid values for a variable.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Dictionary with the expected values for specific variables. It excludes variables without specified valid_values key.
Return type:: dict

disdrodb.l0.standards.get_valid_variable_names(sensor_name)[source]: Get list of valid variables.

disdrodb.l0.standards.get_variables_dict(sensor_name: str) → dict[source]

Get a dictionary containing the variable name of the sensor field numbers.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: Variables names
Return type:: dict

disdrodb.l0.standards.get_variables_dimension(sensor_name: str)[source]: Returns a dictionary with the variable dimensions of a L0B product.

disdrodb.l0.standards.get_velocity_bin_center(sensor_name: str) → list[source]

Get velocity bin center.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Velocity bin center
Return type:: list

disdrodb.l0.standards.get_velocity_bin_lower(sensor_name: str) → list[source]

Get velocity bin lower bound.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Velocity bin lower bound.
Return type:: list

disdrodb.l0.standards.get_velocity_bin_upper(sensor_name: str) → list[source]

Get velocity bin upper bound.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Velocity bin upper bound
Return type:: list

disdrodb.l0.standards.get_velocity_bin_width(sensor_name: str) → list[source]

Get velocity bin width.

Parameters:: sensor_name (str) – Name of the sensor
Returns:: Velocity bin width
Return type:: list

disdrodb.l0.standards.get_velocity_bins_dict(sensor_name: str) → dict[source]

Get dictionary with sensor_name velocity bins information.

Parameters:: sensor_name (str) – Name of the sensor.
Returns:: sensor_name velocity bins information
Return type:: dict

disdrodb.l0.standards.read_config_yml(sensor_name: str, filename: str) → dict[source]

Read a config yaml file and return the dictionary.

Parameters:

sensor_name (str) – Name of the sensor.
filename (str) – Name of the file.

Returns:

Content of the config file.

Return type:

dict

Raises:

ValueError – Error if file does not exist.

disdrodb.l0.standards.set_disdrodb_attrs(ds, product_level: str)[source]

Add DISDRODB processing information to the netCDF global attributes.

It assumes station metadata are already added to the dataset.

Parameters:

ds (xarray dataset) – Dataset
product_level (str) – DISDRODB product_level

Returns:

Dataset

Return type:

xarray dataset

disdrodb.l0.summary module

disdrodb.l0.template_tools module

disdrodb.l0.template_tools.arr_has_constant_nchar(arr: array) → bool[source]

Check if the content of an array has a constant number of characters

Parameters:: arr (numpy.ndarray) – The array to analyse
Returns:: True if the number of characters is constant
Return type:: bool

disdrodb.l0.template_tools.check_column_names(column_names: list, sensor_name: str) → None[source]

Checks that the column names respect the DISDRODB standards.

Parameters:

column_names (list) – List of columns names.
sensor_name (str) – Name of the sensor.

Raises:

TypeError – Error if some columns do not meet the DISDRODB standards.

disdrodb.l0.template_tools.get_decimal_ndigits(string: str) → int[source]

Get the number of decimal digits.

Parameters:: string (str) – Input string
Returns:: The number of decimal digits.
Return type:: int

disdrodb.l0.template_tools.get_df_columns_unique_values_dict(df: DataFrame, column_indices: int | slice | list | None = None, column_names: bool = True)[source]

Create a dictionary {column: unique values}

Parameters:

df (pd.DataFrame) – Input dataframe
column_indices (Union[int,slice,list], optional) – column indices
column_names (bool, optional) – If true, print the column name, by default True

disdrodb.l0.template_tools.get_natural_ndigits(string: str) → int[source]

Get the number of natural digits.

Parameters:: string (str) – Input string
Returns:: The number of natural digits.
Return type:: int

disdrodb.l0.template_tools.get_nchar(string: str) → int[source]

Get the number of characters.

Parameters:: string (str) – Input string
Returns:: Number of characters
Return type:: int

disdrodb.l0.template_tools.get_ndigits(string: str) → int[source]

Get the number of digits.

Parameters:: string (str) – Input string
Returns:: Number of digits
Return type:: int

disdrodb.l0.template_tools.get_possible_keys(dict_options: dict, desired_value: str) → set[source]

Get the possible keys from the input values

Parameters:

dict_options (dict) – Input dictionary
desired_value (str) – Input value

Returns:

Keys whose value matches the desired input value.

Return type:

set
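
Selecting the keys whose value matches a target is a one-line set comprehension. The sketch below uses a hypothetical function name (possible_keys) and made-up dictionary entries; it illustrates the described behaviour and is not the package source.

```python
def possible_keys(dict_options: dict, desired_value) -> set:
    """Return the set of keys whose value equals desired_value."""
    return {key for key, value in dict_options.items() if value == desired_value}


# Made-up options: variable name -> expected number of characters.
options = {"var_a": 4, "var_b": 2, "var_c": 4}
matches = possible_keys(options, 4)
```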

disdrodb.l0.template_tools.infer_column_names(df: DataFrame, sensor_name: str, row_idx: int = 1)[source]

Try to guess the dataframe columns names based on string characteristics.

Parameters:

df (pd.DataFrame) – The dataframe to analyse
sensor_name (str) – Name of the sensor
row_idx (int, optional) – The row index of the dataframe, by default 1

Returns:

Dictionary with the keys being the column id and the values being the guessed column names

Return type:

dict

disdrodb.l0.template_tools.print_df_column_names(df: DataFrame) → None[source]

Print dataframe columns names

Parameters:: df (dataframe) – The dataframe
Returns:: Nothing
Return type:: None

disdrodb.l0.template_tools.print_df_columns_unique_values(df: DataFrame, column_indices: int | slice | list | None = None, column_names: bool = True) → None[source]

Print columns’ unique values

Parameters:

df (pd.DataFrame) – Input dataframe
column_indices (Union[int,slice,list], optional) – column indices
column_names (bool, optional) – If true, print the column name, by default True

disdrodb.l0.template_tools.print_df_first_n_rows(df: DataFrame, n: int = 5, column_names: bool = True) → None[source]

Print the first n rows of the dataframe by column.

Parameters:

df (pd.DataFrame) – Input dataframe
n (int, optional) – Number of rows, by default 5
column_names (bool, optional) – If true, the column names are printed, by default True

disdrodb.l0.template_tools.print_df_random_n_rows(df: DataFrame, n: int = 5, with_column_names: bool = True) → None[source]

Print the content of the dataframe by column, randomly chosen

Parameters:

df (dataframe) – The dataframe
n (int, optional) – The number of rows to print, by default 5
with_column_names (bool, optional) – If true, print the column name, by default True

Returns:

Nothing

Return type:

None

disdrodb.l0.template_tools.print_df_summary_stats(df: DataFrame, column_indices: int | slice | list | None = None, column_names: bool = True)[source]

Create a columns statistics summary.

Parameters:

df (pd.DataFrame) – Input dataframe
column_indices (Union[int,slice,list], optional) – column indices
column_names (bool, optional) – If true, print the column name, by default True

Raises:

ValueError – If the column types are not numeric.

disdrodb.l0.template_tools.print_df_with_any_nan_rows(df: DataFrame) → None[source]

Print the rows containing any NaN value

Parameters:: df (pd.DataFrame) – Input dataframe.

disdrodb.l0.template_tools.print_valid_L0_column_names(sensor_name: str) → None[source]

Print valid columns names from the standard.

Parameters:: sensor_name (str) – Name of the sensor.

disdrodb.l0.template_tools.search_possible_columns(string: str, sensor_name: str) → list[source]

Find the possible columns for a string

Parameters:

string (str) – Input string
sensor_name (str) – Name of the sensor

Returns:

List of possible columns

Return type:

list

disdrodb.l0.template_tools.str_has_decimal_digits(string: str) → bool[source]

Check if a string has decimals

Parameters:: string (str) – Input string
Returns:: True if string has decimal digits.
Return type:: bool

disdrodb.l0.template_tools.str_is_integer(string: str) → bool[source]

Check if a string is an integer

Parameters:: string (str) – Input string
Returns:: True if integer.
Return type:: bool

disdrodb.l0.template_tools.str_is_not_number(string: str) → bool[source]

Check if a string is not numeric

Parameters:: string (str) – Input string
Returns:: True if not float.
Return type:: bool

disdrodb.l0.template_tools.str_is_number(string: str) → bool[source]

Check if a string is numeric

Parameters:: string (str) – Input string
Returns:: True if float.
Return type:: bool
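
The string checks above can be approximated by attempting a conversion and catching the failure. The sketch below is an assumption about what such predicates might look like, not the disdrodb source: the helper names are hypothetical, and the library may treat edge cases (for example "nan", "1e5" or surrounding whitespace) differently.

```python
def looks_like_number(string: str) -> bool:
    """True if the string can be parsed as a float (hypothetical helper)."""
    try:
        float(string)
    except ValueError:
        return False
    return True


def looks_like_integer(string: str) -> bool:
    """True if the string can be parsed as an int (hypothetical helper)."""
    try:
        int(string)
    except ValueError:
        return False
    return True


def has_decimal_digits(string: str) -> bool:
    """True if the string is numeric and contains a decimal point (assumed rule)."""
    return looks_like_number(string) and "." in string


def looks_like_text(string: str) -> bool:
    """True if the string cannot be parsed as a float (hypothetical helper)."""
    return not looks_like_number(string)
```

Predicates of this kind are useful when guessing column names from raw text, because each raw field can be classified as integer, decimal or text before it is compared with the expected format of a sensor variable.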

Module contents

disdrodb.l0.available_readers(data_sources=None, reader_path=False)[source]: Retrieve available readers information.

disdrodb.l0.check_archive_metadata_compliance(disdrodb_dir)[source]

disdrodb.l0.check_archive_metadata_geolocation(disdrodb_dir)[source]

disdrodb.l0.run_disdrodb_l0(disdrodb_dir, data_sources=None, campaign_names=None, station_names=None, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = False, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]

Run the L0 processing of DISDRODB stations.

This function launches the processing of many DISDRODB stations with a single command. From the list of all available DISDRODB stations, it processes the stations matching the provided data_sources, campaign_names and station_names.

Parameters:

disdrodb_dir (str) – Base directory of DISDRODB. Format: <…>/DISDRODB
data_sources (list) – Name of data source(s) to process. The name(s) must be UPPER CASE. If campaign_names and station_names are not specified, process all stations. The default is None
campaign_names (list) – Name of the campaign(s) to process. The name(s) must be UPPER CASE. The default is None
station_names (list) – Station names to process. The default is None
l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.
l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.
l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is False.
remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.
remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. It takes place only if l0b_concat=True. The default is False.
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process uses a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files. For L0B, it processes just the first 100 rows of 3 L0A files. The default is False.
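
The station selection described above can be pictured as a filter over (data_source, campaign_name, station_name) tuples. The sketch below is hypothetical: the function select_stations, the sample inventory, and the assumption that the three filters are combined with a logical AND (with None matching everything) are illustrative and not taken from the disdrodb source.

```python
from typing import Iterable, Optional


def select_stations(
    available_stations: Iterable[tuple],
    data_sources: Optional[list] = None,
    campaign_names: Optional[list] = None,
    station_names: Optional[list] = None,
) -> list:
    """Keep the (data_source, campaign_name, station_name) tuples matching every filter.

    A filter left to None matches everything (assumed semantics).
    """
    selected = []
    for data_source, campaign_name, station_name in available_stations:
        if data_sources is not None and data_source not in data_sources:
            continue
        if campaign_names is not None and campaign_name not in campaign_names:
            continue
        if station_names is not None and station_name not in station_names:
            continue
        selected.append((data_source, campaign_name, station_name))
    return selected


# Hypothetical station inventory.
stations = [
    ("SOURCE_A", "CAMPAIGN_1", "station_10"),
    ("SOURCE_A", "CAMPAIGN_2", "station_20"),
    ("SOURCE_B", "CAMPAIGN_3", "station_30"),
]
only_source_a = select_stations(stations, data_sources=["SOURCE_A"])
everything = select_stations(stations)
```

With no filter given, every station is kept, which mirrors the documented behaviour of processing all stations when nothing is specified.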

disdrodb.l0.run_disdrodb_l0_station(disdrodb_dir, data_source, campaign_name, station_name, l0a_processing: bool = True, l0b_processing: bool = True, l0b_concat: bool = True, remove_l0a: bool = False, remove_l0b: bool = False, force: bool = False, verbose: bool = False, debugging_mode: bool = False, parallel: bool = True)[source]

Run the L0 processing of a specific DISDRODB station from the terminal.

Parameters:

disdrodb_dir (str) – Base directory of DISDRODB. Format: <…>/DISDRODB
data_source (str) – Institution name (when campaign data spans more than 1 country), or country (when all campaigns (or sensor networks) are inside a given country). Must be UPPER CASE.
campaign_name (str) – Campaign name. Must be UPPER CASE.
station_name (str) – Station name
l0a_processing (bool) – Whether to launch processing to generate L0A Apache Parquet file(s) from raw data. The default is True.
l0b_processing (bool) – Whether to launch processing to generate L0B netCDF4 file(s) from L0A data. The default is True.
l0b_concat (bool) – Whether to concatenate all raw files into a single L0B netCDF file. If l0b_concat=True, all raw files will be saved into a single L0B netCDF file. If l0b_concat=False, each raw file will be converted into the corresponding L0B netCDF file. The default is True.
remove_l0a (bool) – Whether to remove the L0A files after having generated the L0B netCDF products. The default is False.
remove_l0b (bool) – Whether to remove the L0B files after having concatenated all L0B netCDF files. It takes place only if l0b_concat=True. The default is False.
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. Each process uses a single thread to avoid issues with the HDF/netCDF library. By default, the number of processes is defined with os.cpu_count(). If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. For L0A, it processes just the first 3 raw data files for each station. For L0B, it processes just the first 100 rows of 3 L0A files for each station. The default is False.

disdrodb.l0.run_l0a(raw_dir, processed_dir, station_name, glob_patterns, column_names, reader_kwargs, df_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]

Run the L0A processing for a specific DISDRODB station.

Parameters:

raw_dir (str) –
The directory path where all the raw content of a specific campaign is stored. The path must have the following structure:

<…>/DISDRODB/Raw/<data_source>/<campaign_name>

Inside the raw_dir directory, it is required to adopt the following structure:
- /data/<station_name>/<raw_files>
- /metadata/<station_name>.yaml

Important points:
- For each <station_name> there must be a corresponding YAML file in the metadata subfolder.
- The <campaign_name> must semantically match between:
  - the raw_dir and processed_dir directory paths;
  - the key ‘campaign_name’ within the metadata YAML files.
- The campaign_name is expected to be UPPER CASE.
processed_dir (str) –
The desired directory path for the processed DISDRODB L0A and L0B products. The path should have the following structure:

<…>/DISDRODB/Processed/<data_source>/<campaign_name>

For testing purposes, this function exceptionally also accepts a directory path simply ending with <campaign_name> (e.g. /tmp/<campaign_name>).
station_name (str) – Station name
glob_patterns (str) – Glob pattern to search data files in <raw_dir>/data/<station_name>
column_names (list) – Columns names of the raw text file.
reader_kwargs (dict) – Pandas read_csv arguments to open the text file.
df_sanitizer_fun (object, optional) – Sanitizer function to format the dataframe into DISDRODB L0A standard.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed LocalCluster. If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
debugging_mode (bool) – If True, it reduces the amount of data to process. It processes just the first 100 rows of 3 raw data files. The default is False.
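
The raw_dir layout required above can be checked before launching a run. The following minimal sketch, using only the standard library, lists the stations under data/ that lack a matching metadata/<station_name>.yaml file. The function name find_stations_missing_metadata and the placeholder directory names are hypothetical; disdrodb performs its own checks, which are not reproduced here.

```python
import tempfile
from pathlib import Path


def find_stations_missing_metadata(raw_dir: Path) -> list:
    """List station names under data/ with no metadata/<station_name>.yaml file."""
    data_dir = raw_dir / "data"
    metadata_dir = raw_dir / "metadata"
    station_names = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
    return [
        name
        for name in station_names
        if not (metadata_dir / f"{name}.yaml").is_file()
    ]


# Build a throwaway campaign directory (placeholder names) to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "DISDRODB" / "Raw" / "DATA_SOURCE" / "CAMPAIGN_NAME"
    for station in ("station_a", "station_b"):
        (raw_dir / "data" / station).mkdir(parents=True)
    (raw_dir / "metadata").mkdir()
    (raw_dir / "metadata" / "station_a.yaml").write_text("campaign_name: CAMPAIGN_NAME\n")
    # station_b has a data directory but no metadata file.
    missing = find_stations_missing_metadata(raw_dir)
```

Running such a check first makes a missing YAML file visible as a short list of station names rather than as a failure partway through processing.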

disdrodb.l0.run_l0b_from_nc(raw_dir, processed_dir, station_name, glob_patterns, dict_names, ds_sanitizer_fun, parallel, verbose, force, debugging_mode)[source]

Run the L0B processing for a specific DISDRODB station with raw netCDFs.

Parameters:

raw_dir (str) –
The directory path where all the raw content of a specific campaign is stored. The path must have the following structure:

<…>/DISDRODB/Raw/<data_source>/<campaign_name>

Inside the raw_dir directory, it is required to adopt the following structure:
- /data/<station_name>/<raw_files>
- /metadata/<station_name>.yaml

Important points:
- For each <station_name> there must be a corresponding YAML file in the metadata subfolder.
- The <campaign_name> must semantically match between:
  - the raw_dir and processed_dir directory paths;
  - the key ‘campaign_name’ within the metadata YAML files.
- The campaign_name is expected to be UPPER CASE.
processed_dir (str) –
The desired directory path for the processed DISDRODB L0B products. The path should have the following structure:

<…>/DISDRODB/Processed/<data_source>/<campaign_name>

For testing purposes, this function exceptionally also accepts a directory path simply ending with <campaign_name> (e.g. /tmp/<campaign_name>).
station_name (str) – Station name
glob_patterns (str) – Glob pattern to search data files in <raw_dir>/data/<station_name>. Example: glob_patterns = “*.nc”
dict_names (dict) – Dictionary mapping raw netCDF variables/coordinates/dimension names to DISDRODB standards.
ds_sanitizer_fun (object, optional) – Sanitizer function to format the raw netCDF into DISDRODB L0B standard.
force (bool) – If True, overwrite existing data into destination directories. If False, raise an error if there are already data into destination directories. The default is False.
verbose (bool) – Whether to print detailed processing information into terminal. The default is False.
parallel (bool) – If True, the files are processed simultaneously in multiple processes. The number of simultaneous processes can be customized using the dask.distributed LocalCluster. If False, the files are processed sequentially in a single process, and multi-threading is automatically exploited to speed up I/O tasks.
debugging_mode (bool) – If True, it reduces the amount of data to process. It processes just the first 3 raw netCDF files. The default is False.
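
To make the interplay of glob_patterns and debugging_mode concrete, the sketch below searches <raw_dir>/data/<station_name> with a glob pattern and, in debugging mode, keeps only the first 3 matches. The function list_raw_files is hypothetical, and sorting the paths before taking the first 3 is an assumption; the documentation does not state how disdrodb orders the files.

```python
import tempfile
from pathlib import Path


def list_raw_files(
    raw_dir: Path,
    station_name: str,
    glob_pattern: str,
    debugging_mode: bool = False,
) -> list:
    """Return the sorted files matching glob_pattern in <raw_dir>/data/<station_name>.

    In debugging mode only the first 3 files are kept (ordering is assumed).
    """
    station_dir = raw_dir / "data" / station_name
    filepaths = sorted(station_dir.glob(glob_pattern))
    return filepaths[:3] if debugging_mode else filepaths


# Create a temporary station directory with five netCDF-named files and one text file.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    station_dir = raw_dir / "data" / "station_a"
    station_dir.mkdir(parents=True)
    for index in range(5):
        (station_dir / f"file_{index}.nc").touch()
    (station_dir / "notes.txt").touch()
    all_files = [p.name for p in list_raw_files(raw_dir, "station_a", "*.nc")]
    debug_files = [
        p.name
        for p in list_raw_files(raw_dir, "station_a", "*.nc", debugging_mode=True)
    ]
```

The text file is ignored because it does not match the "*.nc" pattern, and the debugging run sees a reduced subset of the netCDF files.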