API#
This is the API documentation for the momics package
Loading module#
The loading module provides utilities for importing and handling various data formats used in marine omics workflows. Metadata are handled in the separate module momics.metadata.
Parquet data#
This submodule contains functions for loading and processing data stored in Parquet format.
- momics.loader.parquets.load_parquets(folder: str) Dict[str, DataFrame]#
Loads all .parquet files in a folder and stores them in a dictionary.
The keys of the dictionary are the file names without the .parquet extension. If the remaining name still contains a ‘.’, only the part after the last ‘.’ is used as the key.
Example
metagoflow_analyses.go_slim.parquet -> key = “go_slim”
- Parameters:
folder (str) – The path to the folder containing the .parquet files.
- Returns:
A dictionary containing the data frames of the .parquet files.
- Return type:
dict
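The key-naming rule described for load_parquets can be sketched in plain Python. This is a minimal illustration of the rule only, not the package's implementation; the helper name key_from_filename is hypothetical.

```python
from pathlib import Path


def key_from_filename(filename: str) -> str:
    """Derive a dictionary key from a .parquet file name (hypothetical helper)."""
    # Drop the final extension (".parquet"), then keep only the text
    # after the last remaining dot, if there is one.
    stem = Path(filename).stem
    return stem.split(".")[-1]


print(key_from_filename("metagoflow_analyses.go_slim.parquet"))  # go_slim
print(key_from_filename("taxonomy.parquet"))  # taxonomy
```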
- momics.loader.parquets.load_parquets_udal()#
Loads Parquet files into a dictionary by looping over UDAL calls.
Ro-crates#
This submodule provides tools for working with RO-Crate metadata packages.
- momics.loader.ro_crates.extract_all_datafiles(metadata: Dict) list#
Extracts all data files from the metadata.
- Parameters:
metadata (Dict) – The metadata in JSON format.
- Returns:
A list of dictionaries containing data file information.
- Return type:
List[Dict]
- momics.loader.ro_crates.extract_data_by_name(metadata: Dict, name: str) Dict#
Extracts data from metadata based on the name.
- Parameters:
metadata (Dict) – The metadata in JSON format.
name (str) – The name of the data to extract.
- Returns:
The extracted data.
- Return type:
Dict
- momics.loader.ro_crates.get_rocrate_data(metadata_json: Dict, data_id: str)#
Retrieves RO-Crate data file based on metadata.
- Parameters:
metadata_json (Dict) – The metadata in JSON format.
data_id (str) – The ID of the data file.
- Returns:
The content of the data file.
- Return type:
str
- momics.loader.ro_crates.get_rocrate_metadata_gh(sample_id: str) Dict#
Retrieves RO-Crate metadata from a GitHub repository.
- Parameters:
sample_id (str) – The ID of the sample.
- Returns:
The metadata in JSON format.
- Return type:
Dict
Utils#
This submodule includes helper functions used during the data loading process.
- momics.loader.utils.bytes_to_df(data: bytes, sep: str = '\t') Dict[str, DataFrame]#
Convert a dictionary of bytes to a dictionary of DataFrames.
- Parameters:
data (Dict[str, bytes]) – A dictionary where keys are filenames and values are byte strings.
sep (str) – The separator used in the data files. Default is tab ('\t').
- Returns:
A dictionary where keys are filenames and values are DataFrames.
- Return type:
Dict[str, pd.DataFrame]
Complex networks#
Implementation of networkx functions for complex networks analysis, specifically tailored for marine omics data. Examples include taxonomy co-occurrence networks and biosynthetic gene cluster analysis.
- momics.networks.build_interaction_graphs(correlation_data: dict, pos_cutoff: float = 0.5, neg_cutoff: float = -0.5, p_val_cutoff: float = 0.05) Dict#
Build interaction graphs from correlation data.
- Parameters:
correlation_data (dict) – A dictionary containing correlation data for different factors.
pos_cutoff (float) – The positive correlation cutoff.
neg_cutoff (float) – The negative correlation cutoff.
p_val_cutoff (float) – The p-value cutoff.
- Returns:
A dictionary containing network results for each factor.
- Return type:
Dict
- momics.networks.interaction_to_graph(df: DataFrame, pos_cutoff: float = 0.8, neg_cutoff: float = -0.6) Tuple[List[str], List[Tuple[str, str]], List[Tuple[str, str]]]#
Create a network from the correlation matrix.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing correlation values.
pos_cutoff (float) – Positive correlation cutoff.
neg_cutoff (float) – Negative correlation cutoff.
- Returns:
nodes (list) – List of node indices.
edges_pos (list) – List of positive edges.
edges_neg (list) – List of negative edges.
- Return type:
Tuple[List[str], List[Tuple[str, str]], List[Tuple[str, str]]]
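The cutoff logic described for interaction_to_graph can be sketched without pandas or networkx. The example below is an assumption-laden illustration, not the package's code: the helper name threshold_edges is hypothetical, the correlation table is a nested dict rather than a DataFrame, and whether the real comparison is inclusive is not stated in this reference.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def threshold_edges(
    corr: Dict[str, Dict[str, float]],
    pos_cutoff: float = 0.8,
    neg_cutoff: float = -0.6,
) -> Tuple[List[str], List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split a symmetric correlation table into positive and negative edges."""
    nodes = list(corr)
    edges_pos: List[Tuple[str, str]] = []
    edges_neg: List[Tuple[str, str]] = []
    # Visit each unordered pair once, which also skips the diagonal.
    for a, b in combinations(nodes, 2):
        value = corr[a][b]
        if value >= pos_cutoff:  # inclusive comparison is an assumption
            edges_pos.append((a, b))
        elif value <= neg_cutoff:  # inclusive comparison is an assumption
            edges_neg.append((a, b))
    return nodes, edges_pos, edges_neg


corr = {
    "taxA": {"taxA": 1.0, "taxB": 0.9, "taxC": -0.7},
    "taxB": {"taxA": 0.9, "taxB": 1.0, "taxC": 0.1},
    "taxC": {"taxA": -0.7, "taxB": 0.1, "taxC": 1.0},
}
nodes, edges_pos, edges_neg = threshold_edges(corr)
print(edges_pos)  # [('taxA', 'taxB')]
print(edges_neg)  # [('taxA', 'taxC')]
```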
- momics.networks.interaction_to_graph_with_pvals(df: DataFrame, pvals_df: DataFrame, pos_cutoff: float = 0.8, neg_cutoff: float = -0.6, p_val_cutoff: float = 0.05) tuple#
Create a network from the correlation matrix and p-values.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing correlation values.
pvals_df (pd.DataFrame) – The DataFrame containing p-values.
pos_cutoff (float) – Positive correlation cutoff.
neg_cutoff (float) – Negative correlation cutoff.
p_val_cutoff (float) – The p-value cutoff.
- Returns:
nodes (list) – List of node indices.
edges_pos (list) – List of positive edges with p-values.
edges_neg (list) – List of negative edges with p-values.
- Return type:
tuple
- momics.networks.pairwise_jaccard_lower_triangle(network_results: dict, edge_type: str = 'all') DataFrame#
Calculate pairwise Jaccard similarity for the lower triangle of all group comparisons. Returns a DataFrame with columns: group1, group2, jaccard_similarity.
If edge_type is ‘all’, it calculates Jaccard similarity for all edges in the graphs.
- Parameters:
network_results (dict) – Dictionary containing network results for each group. Keys are ‘graph’, ‘nodes’, and lists of specific edges from the graph.
edge_type (str) – dict key for list of edges to consider (or ‘all’).
- Returns:
DataFrame containing pairwise Jaccard similarity.
- Return type:
pd.DataFrame
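As a rough illustration of the pairwise comparison described for pairwise_jaccard_lower_triangle, the sketch below computes the Jaccard similarity |A ∩ B| / |A ∪ B| between the edge sets of every pair of groups. It works on plain lists of edges instead of the network_results dictionary; the helper names, the treatment of edges as undirected, and the value returned for two empty edge sets are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def jaccard(a: Set[frozenset], b: Set[frozenset]) -> float:
    """Jaccard similarity of two sets; 0.0 for two empty sets (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(edges_by_group: Dict[str, List[Edge]]) -> List[Tuple[str, str, float]]:
    # Treat edges as undirected so (u, v) and (v, u) count as the same edge.
    edge_sets = {
        group: {frozenset(edge) for edge in edges}
        for group, edges in edges_by_group.items()
    }
    # combinations() yields each unordered pair of groups once, which is
    # exactly what one triangle of the similarity matrix holds.
    return [
        (g1, g2, jaccard(edge_sets[g1], edge_sets[g2]))
        for g1, g2 in combinations(edge_sets, 2)
    ]


edges_by_group = {
    "spring": [("taxA", "taxB"), ("taxB", "taxC")],
    "summer": [("taxB", "taxA"), ("taxC", "taxD")],
}
for g1, g2, score in pairwise_jaccard(edges_by_group):
    print(f"{g1} vs {g2}: {score:.3f}")  # spring vs summer: 0.333
```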
Constants#
This submodule defines various constants used throughout the momics package.
Diversity module#
This module offers methods for calculating and analyzing biodiversity metrics from omics data.
- momics.diversity.alpha_diversity_parametrized(tables_dict: Dict[str, DataFrame], table_name: str, metadata: DataFrame) DataFrame#
Calculates the alpha diversity for a list of tables and merges with metadata.
- Parameters:
tables_dict (Dict[str, pd.DataFrame]) – A dictionary of DataFrames containing species abundances.
table_name (str) – The name of the table.
metadata (pd.DataFrame) – A DataFrame containing metadata.
- Returns:
A DataFrame containing the alpha diversity and metadata.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the index names of the input DataFrame and metadata do not match.
- momics.diversity.alpha_input(tables_dict: Dict[str, DataFrame], table_name: str) DataFrame#
Prepares the input data for alpha diversity calculation.
- Parameters:
tables_dict (Dict[str, pd.DataFrame]) – A dictionary of DataFrames containing species abundances.
table_name (str) – The name of the table to process.
- Returns:
A pivot table with species abundances indexed by the key column of the functional table and index column converted to columns.
- Return type:
pd.DataFrame
- momics.diversity.beta_diversity_parametrized(df: DataFrame, taxon: str, metric: str = 'braycurtis') DataFrame#
Calculates the beta diversity for a DataFrame.
- Parameters:
df (pd.DataFrame) – A DataFrame containing species abundances.
taxon (str) – The taxon to use for the beta diversity calculation.
metric (str, optional) – The distance metric to use. Defaults to “braycurtis”.
- Returns:
A DataFrame containing the beta diversity distances.
- Return type:
pd.DataFrame
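For readers unfamiliar with the default metric, the Bray-Curtis dissimilarity between two abundance vectors u and v is the sum of |u_i - v_i| divided by the sum of (u_i + v_i). The sketch below computes it for one pair of samples in plain Python. It is an illustration of the formula only; the function name and the choice to return 0.0 for two all-zero samples are assumptions, and the package itself may delegate to another library.

```python
from typing import Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two all-zero samples have no defined dissimilarity; 0.0 is an assumption.
    return numerator / denominator if denominator else 0.0


sample_1 = [1, 0, 3]
sample_2 = [1, 2, 1]
print(bray_curtis(sample_1, sample_2))  # 0.5
print(bray_curtis(sample_1, sample_1))  # 0.0
```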
- momics.diversity.calculate_alpha_diversity(df: DataFrame, factors: DataFrame) DataFrame#
Calculates the alpha diversity (Shannon index) for a DataFrame.
- Parameters:
df (pd.DataFrame) – A DataFrame containing species abundances.
factors (pd.DataFrame) – A DataFrame containing additional factors to merge.
- Returns:
A DataFrame containing the Shannon index and additional factors.
- Return type:
pd.DataFrame
- momics.diversity.calculate_shannon_index(df: DataFrame) Series#
Applies the Shannon index calculation to each row of a DataFrame.
- Parameters:
df (pd.DataFrame) – A DataFrame containing species abundances.
- Returns:
A Series containing the Shannon index for each row.
- Return type:
pd.Series
- momics.diversity.diversity_input(df: DataFrame, kind: str = 'alpha', taxon: str = 'ncbi_tax_id') DataFrame#
Prepare input for diversity analysis.
- Parameters:
df (pd.DataFrame) – The input dataframe.
kind (str) – The type of diversity analysis. Either ‘alpha’ or ‘beta’.
taxon (str) – The column name containing the taxon IDs.
- Returns:
The input for diversity analysis.
- Return type:
pd.DataFrame
- momics.diversity.find_taxa_in_table(table: DataFrame, tax_level: str, search_term: str | int, ncbi_tax_id: bool = False, exact_match: bool = False) DataFrame#
Find taxa in the given table at the specified taxonomic level matching the search term.
- Parameters:
table (pd.DataFrame) – DataFrame containing taxonomic data.
tax_level (str) – Taxonomic level to search (‘all’ for all levels).
search_term (str|int) – Term to search for.
ncbi_tax_id (bool) – If True, search by NCBI taxonomic ID.
exact_match (bool) – If True, perform exact match; otherwise, use substring match.
- Returns:
DataFrame containing matching taxa.
- Return type:
pd.DataFrame
- momics.diversity.get_key_column(table_name: str) str#
Returns the key column name based on the table name.
- Parameters:
table_name (str) – The name of the table.
- Returns:
The key column name.
- Return type:
str
- Raises:
ValueError – If the table name is unknown.
- momics.diversity.run_permanova(data: DataFrame, metadata: DataFrame, permanova_factor: str, permanova_group: List[str], permanova_additional_factors: List[str], permutations: int = 999, verbose: bool = False) Dict[str, DataFrame]#
Run PERMANOVA on the given data and metadata.
- Parameters:
data (pd.DataFrame) – DataFrame containing the abundance data.
metadata (pd.DataFrame) – DataFrame containing the metadata.
permanova_factor (str) – The factor to use for PERMANOVA.
permanova_group (List[str]) – List of groups to include in the analysis.
permanova_additional_factors (List[str]) – Additional factors to test.
permutations (int) – Number of permutations for PERMANOVA. Default is 999.
verbose (bool) – If True, print detailed output.
- Returns:
Dictionary containing PERMANOVA results for each factor.
- Return type:
Dict[str, pd.DataFrame]
- momics.diversity.shannon_index(row: Series) float#
Calculates the Shannon index for a given row of data.
- Parameters:
row (pd.Series) – A row of data containing species abundances.
- Returns:
The Shannon index value.
- Return type:
float
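The Shannon index of a sample is H = -Σ p_i · log(p_i), where p_i is the relative abundance of taxon i. The sketch below is a plain-Python illustration of that formula, not the package's implementation: the function name is hypothetical, and the use of the natural logarithm is an assumption, since the logarithm base is not stated in this reference.

```python
import math
from typing import Iterable


def shannon(abundances: Iterable[float]) -> float:
    """Shannon index of one sample, using the natural logarithm (assumed base)."""
    # Zero abundances contribute nothing to the sum, so drop them up front.
    counts = [x for x in abundances if x > 0]
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty sample has no diversity; this value is an assumption
    return -sum((x / total) * math.log(x / total) for x in counts)


# Four equally abundant taxa give H = ln(4).
print(round(shannon([10, 10, 10, 10]), 4))  # 1.3863
```

An even community maximises the index for a given number of taxa, while a sample dominated by a single taxon scores close to zero.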
- momics.diversity.update_subset_indicator(indicator, df)#
Update the subset indicator with the number of unique index ids.
- momics.diversity.update_taxa_count_indicator(indicator, df)#
Update the taxa count indicator with the number of unique taxa.
Galaxy integration#
This module enables integration with the Galaxy platform for workflow automation and reproducibility. It is kept minimal because we expect to use a direct Galaxy installation once the demos are deployed in the final VRE.
- class momics.galaxy.Gecco(params)#
Bases:
object
A class to interact with the GECCO tool in the Galaxy platform for comparative genomics.
This class manages user authentication, history and dataset selection, file uploads, and submission of jobs to the GECCO tool via the Galaxy API.
- Parameters:
params (dict) – Dictionary of Panel widgets and parameters required for interaction.
- handle_create_history(clicks)#
Creates a new history in Galaxy if requested by the user.
- Parameters:
clicks (int) – Number of times the create history button has been clicked.
- handle_get_datasets(clicks)#
Retrieves available datasets from Galaxy and updates the selection widget.
- Parameters:
clicks (int) – Number of times the login button has been clicked.
- handle_get_histories(clicks)#
Retrieves the user’s Galaxy histories and updates the selection widget.
- Parameters:
clicks (int) – Number of times the login button has been clicked.
- handle_login(clicks)#
Handles user login and retrieves relevant data.
- Parameters:
clicks (int) – Number of times the login button has been clicked.
- handle_submit_gecco(clicks)#
Submits a job to the GECCO tool in Galaxy with the selected parameters.
- Parameters:
clicks (int) – Number of times the submit button has been clicked.
- handle_update_current_file_name(value)#
Updates the current file name and ID based on the selected dataset.
- Parameters:
value (tuple) – Tuple containing the file name and file ID.
- handle_update_current_history_id(value)#
Updates the current history ID based on the selected history.
- Parameters:
value (dict) – Dictionary containing history information.
- handle_update_current_history_name(value)#
Updates the current history name based on the selected history.
- Parameters:
value (dict) – Dictionary containing history information.
- handle_upload_dataset(clicks)#
Uploads a dataset to Galaxy if the user chooses to upload from local source.
- Parameters:
clicks (int) – Number of times the upload button has been clicked.
- class momics.galaxy.RemGalaxy(url_var_name: str, api_key_var_name: str)#
Bases:
object
- clean_histories_for_display()#
Cleans the histories for display.
- Returns:
A list of cleaned histories.
- Return type:
List
- get_datasets(name: str = None)#
Retrieves datasets with an optional name filter.
- Parameters:
name (str, optional) – The name to filter datasets. Defaults to None.
- Returns:
A list of dataset names.
- Return type:
List[str]
- get_datasets_by_key(key: str, value: str | List[str]) Tuple[List, List, List]#
Retrieves datasets by a specific key and value.
- Parameters:
key (str) – The key to filter datasets.
value (str | List[str]) – The value to filter datasets.
- Returns:
A tuple containing: - A list of datasets (tuples) that match the key and value. - A list of dataset names that match the key and value. - A list of dataset IDs that match the key and value.
- Return type:
Tuple[List[str], List[str], List[str]]
- get_histories()#
Retrieves all histories.
- Returns:
A list of histories.
- Return type:
List
- set_dataset(dataset_id: str)#
Sets the dataset ID.
- Parameters:
dataset_id (str) – The ID of the dataset.
- set_galaxy_env(url_var_name: str, api_key_var_name: str) List#
Sets the Galaxy environment variables.
- Parameters:
url_var_name (str) – The name of the environment variable containing the Galaxy URL.
api_key_var_name (str) – The name of the environment variable containing the Galaxy API key.
- Returns:
A list of environment variables.
- Return type:
List
- set_history(create: bool = True, hid: str = None, hname: str = None)#
Sets the history.
- Parameters:
create (bool, optional) – Whether to create a new history. Defaults to True.
hid (str, optional) – The ID of the history. Defaults to None.
hname (str, optional) – The name of the history. Defaults to None.
- set_tool(tool_id: str)#
Sets the tool ID.
- Parameters:
tool_id (str) – The ID of the tool.
- show_job_status(job_id: str)#
Shows the status of a job.
- Parameters:
job_id (str) – The ID of the job.
- upload_file(file_path: str)#
Uploads a file.
- Parameters:
file_path (str) – The path to the file to upload.
Metadata module#
This module provides tools for handling and analyzing metadata in omics datasets.
Methods to manipulate, concatenate, merge and enrich metadata files from EMO-BON samplings.
Some of these methods are temporary workarounds for poor or incomplete validation of the metadata tables.
Hopefully, they will not be needed forever.
- momics.metadata.clean_metadata(metadata: DataFrame, terms: Dict[str, str]) DataFrame#
Clean the metadata DataFrame by filtering and renaming columns.
- Parameters:
metadata (pd.DataFrame) – The metadata DataFrame to clean.
terms (Dict[str, str]) – A dictionary where keys are original column names and values are new column names.
- Returns:
The cleaned metadata DataFrame with filtered and renamed columns.
- Return type:
pd.DataFrame
- momics.metadata.enhance_metadata(metadata: DataFrame, df_validation: DataFrame = None) DataFrame#
Enhance the metadata DataFrame by processing the ‘collection_date’ column and extracting the season. This function also optionally filters the metadata based on the ‘ref_code’ values in the df_validation DataFrame.
- Parameters:
metadata (pd.DataFrame) – The metadata DataFrame to enhance.
df_validation (pd.DataFrame, optional) – The DataFrame containing valid samples for filtering.
- Returns:
The enhanced metadata DataFrame.
- Return type:
pd.DataFrame
- momics.metadata.extract_season(metadata: DataFrame) DataFrame#
Add a ‘season’ column to the metadata DataFrame. This function determines the season based on the ‘month’ and ‘day’ columns and adds it as a new column to the DataFrame.
- Parameters:
metadata (pd.DataFrame) – The metadata DataFrame containing ‘month’ and ‘day’ columns.
- Returns:
The updated metadata DataFrame with a new ‘season’ column.
- Return type:
pd.DataFrame
- momics.metadata.extract_season_single(row)#
Determine the season based on the month and day. This function is used as a helper function for the apply method.
- momics.metadata.fill_na_for_object_columns(df)#
Fill NA values with ‘NA’ for object columns in the dataframe.
- Parameters:
df (pd.DataFrame) – The input dataframe.
- Returns:
The dataframe with NA values filled for object columns.
- Return type:
pd.DataFrame
- momics.metadata.filter_data(df: DataFrame, filtered_metadata: DataFrame) DataFrame#
Filter the DataFrame based on the filtered metadata. This function filters the DataFrame columns based on the index values in the filtered metadata.
- Parameters:
df (pd.DataFrame) – The DataFrame to filter.
filtered_metadata (pd.DataFrame) – The filtered metadata DataFrame.
- Returns:
The filtered DataFrame.
- Return type:
pd.DataFrame
- momics.metadata.filter_metadata_table(metadata_df: DataFrame, selected_factors: Dict[str, List[str]]) DataFrame#
Filter the metadata DataFrame based on selected factors and their values.
- Parameters:
metadata_df (pd.DataFrame) – The metadata DataFrame to filter.
selected_factors (Dict[str, List[str]]) – A dictionary where keys are factor names and values are lists of selected values. If ‘All’ is in the list, that factor will not be filtered.
- Returns:
The filtered metadata DataFrame.
- Return type:
pd.DataFrame
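The ‘All’ convention described for selected_factors can be illustrated on a list of dictionaries standing in for metadata rows. This is a hedged sketch of the filtering rule only; the helper name filter_rows and the example column names are hypothetical, and the real function operates on a pandas DataFrame.

```python
from typing import Dict, List


def filter_rows(rows: List[dict], selected: Dict[str, List[str]]) -> List[dict]:
    """Keep rows whose value for every factor is among the selected values."""
    kept = rows
    for factor, values in selected.items():
        if "All" in values:
            continue  # 'All' means this factor is not filtered
        kept = [row for row in kept if row[factor] in values]
    return kept


rows = [
    {"ref_code": "S1", "season": "Summer", "country": "Greece"},
    {"ref_code": "S2", "season": "Winter", "country": "Greece"},
    {"ref_code": "S3", "season": "Summer", "country": "Norway"},
]
selected = {"season": ["Summer"], "country": ["All"]}
print([row["ref_code"] for row in filter_rows(rows, selected)])  # ['S1', 'S3']
```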
- momics.metadata.filter_metadata_terms(df: DataFrame, terms: list) DataFrame#
Filter the metadata terms in the DataFrame.
- Parameters:
df (pd.DataFrame) – The DataFrame containing metadata.
terms (list) – A list of terms to keep in the DataFrame.
- Returns:
The DataFrame filtered to only include the specified terms.
- Return type:
pd.DataFrame
- momics.metadata.get_metadata(folder)#
- momics.metadata.get_metadata_udal()#
Load metadata from the UDAL API
- momics.metadata.merge_source_mat_id_to_data(df_dict: Dict[str, DataFrame], metadata: DataFrame) Dict[str, DataFrame]#
Merge the ‘source_mat_id’ from metadata to each DataFrame in df_dict based on ‘ref_code’. This function assumes that each DataFrame in df_dict has a ‘ref_code’ column that matches the ‘ref_code’ in metadata.
- Parameters:
df_dict (Dict[str, pd.DataFrame]) – A dictionary where keys are DataFrame names and values are DataFrames.
metadata (pd.DataFrame) – The metadata DataFrame containing ‘source_mat_id’ and ‘ref_code’ columns.
- Returns:
A dictionary where each DataFrame has been merged with ‘source_mat_id’ from metadata.
- Return type:
Dict[str, pd.DataFrame]
- momics.metadata.process_collection_date(metadata: DataFrame) DataFrame#
Process the ‘collection_date’ column in the metadata DataFrame. This function converts the ‘collection_date’ column to datetime format, extracts the year, month, and day, and adds them as new columns. It also converts the month number to the month name (abbreviated).
- Parameters:
metadata (pd.DataFrame) – The metadata DataFrame containing the ‘collection_date’ column.
- Returns:
The updated metadata DataFrame with new columns for year, month, and day.
- Return type:
pd.DataFrame
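The date handling described for process_collection_date can be sketched for a single value with the standard library. This is only an illustration under stated assumptions: the helper name, the ISO 8601 input format, and the output key names (in particular month_name) are hypothetical and not taken from the package.

```python
from datetime import datetime

# English abbreviations are spelled out so the result does not depend on locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def split_collection_date(value: str) -> dict:
    """Parse an ISO date string into year, month, abbreviated month name, and day."""
    parsed = datetime.fromisoformat(value)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "month_name": MONTH_ABBR[parsed.month - 1],
        "day": parsed.day,
    }


print(split_collection_date("2021-08-23"))
# {'year': 2021, 'month': 8, 'month_name': 'Aug', 'day': 23}
```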
- momics.metadata.rename_metadata_terms_vre(df: DataFrame, hash: Dict[str, str]) DataFrame#
Rename metadata terms to make names more readable to the user.
- Parameters:
df (pd.DataFrame) – The DataFrame containing metadata.
hash (Dict[str, str]) – A dictionary where keys are original column names and values are new column names.
- Returns:
The DataFrame with renamed columns.
- Return type:
pd.DataFrame
Panel dashboard module#
This module provides some of the utilities for building interactive dashboards using the Panel library. These are specific to the FAIR-EASE use case; more methods and widgets can be found directly in the demo workflow notebooks.
- momics.panel_utils.close_server(server, env)#
- momics.panel_utils.create_indicators_diversity() Tuple[Progress, Number]#
Creates indicators for RAM usage.
- Returns:
A FlexBox containing RAM usage indicators.
- Return type:
pn.FlexBox
- momics.panel_utils.create_indicators_landing_page(df: DataFrame) List[Number]#
Generates a list of indicators for the landing page based on the provided aggregated DataFrame.
- Parameters:
df (pd.DataFrame) – A DataFrame containing aggregated data with ‘COMPLETED’ and ‘queued’ columns.
- Returns:
A list of Panel indicators displaying the number of sequenced samples.
- Return type:
List[pn.indicators.Number]
- momics.panel_utils.diversity_select_widgets(cat_columns: List[str], num_columns: List[str]) Tuple[Select, Select, Select, Select, Select, Checkbox]#
Creates selection widgets for alpha and beta diversity analysis.
- Parameters:
cat_columns (List[str]) – A list of categorical column names for alpha diversity.
num_columns (List[str]) – A list of numerical column names for beta diversity.
- Returns:
A tuple containing selection widgets for alpha and beta diversity analysis.
- Return type:
Tuple[pn.widgets.Select, pn.widgets.Select, pn.widgets.Select, pn.widgets.Select, pn.widgets.Select, pn.widgets.Checkbox]
- momics.panel_utils.is_port_in_use(port: int) bool#
- momics.panel_utils.serve_app(template, env, name='panel app')#
- momics.panel_utils.tax_finder_selector() Tuple[Select, Select, TextInput, Checkbox, Checkbox]#
Plotting module#
This module contains functions for visualizing omics data using various plotting libraries, such as matplotlib, seaborn, and hvplot.
Constants#
- PLOT_FACE_COLOR: str
The face color for the plot.
TODO: - Returns should be plt.figure and not pn.pane.Matplotlib, as already implemented for beta_plot_pc() function.
- momics.plotting.alpha_plot(tables_dict: Dict[str, DataFrame], table_name: str, factor: str, metadata: DataFrame, order: str = 'factor', backend: str = 'hvplot') Matplotlib | HoloViews#
Creates an alpha diversity plot.
- Parameters:
tables_dict (Dict[str, pd.DataFrame]) – A dictionary of DataFrames containing species abundances.
table_name (str) – The name of the table to process.
factor (str) – The column name to group by.
metadata (pd.DataFrame) – A DataFrame containing metadata.
order (str) – The order of sorting the data. Can be “factor” or “value”.
backend (str) – The plotting backend to use. Can be “matplotlib” or “hvplot”.
- Returns:
A pane containing the alpha diversity plot.
- Return type:
Union[pn.pane.Matplotlib, pn.pane.HoloViews]
- momics.plotting.av_alpha_plot(tables_dict: Dict[str, DataFrame], table_name: str, factor: str, metadata: DataFrame, order: str = 'factor', backend: str = 'hvplot') Matplotlib | HoloViews#
Creates an average alpha diversity plot.
- Parameters:
tables_dict (Dict[str, pd.DataFrame]) – A dictionary of DataFrames containing species abundances.
table_name (str) – The name of the table to process.
factor (str) – The column name to group by.
metadata (pd.DataFrame) – A DataFrame containing metadata.
- Returns:
A pane containing the average alpha diversity plot.
- Return type:
Union[pn.pane.Matplotlib, pn.pane.HoloViews]
- momics.plotting.beta_plot(tables_dict: Dict[str, DataFrame], table_name: str, norm: bool, taxon: str = 'ncbi_tax_id', backend: str = 'hvplot') Matplotlib | HoloViews#
Creates a beta diversity heatmap plot.
- Parameters:
tables_dict (Dict[str, pd.DataFrame]) – A dictionary of DataFrames containing species abundances.
table_name (str) – The name of the table to process.
taxon (str, optional) – The taxon level for beta diversity calculation. Defaults to “ncbi_tax_id”.
norm (bool) – Whether to normalize the data.
backend (str) – The plotting backend to use. Can be “matplotlib” or “hvplot”.
- Returns:
A pane containing the beta diversity heatmap plot.
- Return type:
Union[pn.pane.Matplotlib, pn.pane.HoloViews]
- momics.plotting.beta_plot_pc(tables_dict: Dict[str, DataFrame], metadata: DataFrame, table_name: str, factor: str, taxon: str = 'ncbi_tax_id') Tuple[Scatter, Tuple[float, float]]#
Creates a beta diversity PCoA plot.
- Parameters:
tables_dict (Dict[str, pd.DataFrame]) – A dictionary of DataFrames containing species abundances.
metadata (pd.DataFrame) – A DataFrame containing metadata.
table_name (str) – The name of the table to process.
factor (str) – The column name to color the points by.
taxon (str, optional) – The taxon level for beta diversity calculation. Defaults to “ncbi_tax_id”.
- Returns:
A tuple containing the beta diversity PCoA plot and the explained variance for PC1 and PC2.
- Return type:
Tuple[hv.element.Scatter, Tuple[float, float]]
- momics.plotting.beta_plot_pc_granular(filtered_data: DataFrame, metadata: DataFrame, factor: str) Tuple[Scatter, Tuple[float, float]]#
Creates a beta diversity PCoA plot.
- Parameters:
filtered_data (pd.DataFrame) – A DataFrame containing the filtered data.
metadata (pd.DataFrame) – A DataFrame containing metadata.
factor (str) – The column name to color the points by.
- Returns:
A tuple containing the beta diversity PCoA plot and the explained variance.
- Return type:
Tuple[hv.element.Scatter, Tuple[float, float]]
- momics.plotting.change_legend_labels(ax: axis, labels: List[str]) axis#
Changes the labels of a legend on a given matplotlib axis.
- Parameters:
ax (plt.axis) – The matplotlib axis object whose legend labels need to be changed.
labels (List[str]) – A list of new labels to be set for the legend.
- Returns:
The matplotlib axis object with updated legend labels.
- Return type:
plt.axis
- momics.plotting.cut_xaxis_labels(ax: axis, n: int = 15) axis#
Changes the x-tick labels by cutting them short.
- Parameters:
ax (plt.axis) – The axes whose x-tick labels are shortened.
n (int) – The maximum number of characters kept in each x-tick label.
- Returns:
The axes with the new x-tick labels.
- Return type:
plt.axis
- momics.plotting.fold_legend_labels_from_series(df: Series, max_len: int = 30) List[str]#
Folds a list of labels to a maximum length from a Series.
- Parameters:
df (pd.Series) – The Series from which to extract unique labels.
max_len (int, optional) – The maximum length of a label. Defaults to 30.
- Returns:
The folded list of labels.
- Return type:
List[str]
- momics.plotting.get_sankey(df, cat_cols=[], value_cols='', title='Sankey Diagram')#
- momics.plotting.hvplot_alpha_diversity(alpha: DataFrame, factor: str) Bars#
Creates a horizontal bar plot for alpha diversity using hvplot.
- Parameters:
alpha (pd.DataFrame) – DataFrame containing alpha diversity data.
factor (str) – The column name to group by.
- Returns:
A horizontal bar plot of alpha diversity.
- Return type:
hv.element.Bars
- momics.plotting.hvplot_average_per_factor(alpha: DataFrame, factor: str) Bars#
Creates a horizontal bar plot for alpha diversity using hvplot.
- Parameters:
alpha (pd.DataFrame) – DataFrame containing alpha diversity data.
factor (str) – The column name to group by.
- Returns:
A horizontal bar plot of alpha diversity.
- Return type:
hv.element.Bars
- momics.plotting.hvplot_bgcs_violin(df: DataFrame, normalize: bool = False) Overlay#
Creates a violin plot for BGC probabilities by type using hvplot.
- Parameters:
df (pd.DataFrame) – A DataFrame containing BGC data with columns ‘type’, ‘average_p’, and ‘max_p’.
normalize (bool) – Whether to normalize the y-axis to the range [0, 1].
- Returns:
An overlay of swarm and violin plots.
- Return type:
hv.Overlay
- momics.plotting.hvplot_heatmap(df: DataFrame, taxon: str, norm: bool = False) HeatMap#
Creates a heatmap plot for beta diversity using hvplot.
- Parameters:
df (pd.DataFrame) – DataFrame containing beta diversity distances.
taxon (str) – The taxon level for beta diversity calculation.
norm (bool) – Whether to normalize the data.
- Returns:
A heatmap plot of beta diversity.
- Return type:
hv.element.HeatMap
- momics.plotting.hvplot_plot_pcoa_black(pcoa_df: DataFrame, color_by: str = None, explained_variance: Tuple[float, float] = None, **kwargs) Scatter#
Plots a PCoA plot with optional coloring using hvplot.
- Parameters:
pcoa_df (pd.DataFrame) – A DataFrame containing PCoA results.
color_by (str, optional) – The column name to color the points by. Defaults to None.
- Returns:
The PCoA plot.
- Return type:
hv.element.Scatter
- momics.plotting.mpl_alpha_diversity(alpha_df: DataFrame, factor: str = None) Figure#
Plots the Shannon index grouped by a factor.
- Parameters:
alpha_df (pd.DataFrame) – A DataFrame containing alpha diversity results.
factor (str, optional) – The column name to group by. Defaults to None.
- Returns:
The Shannon index plot.
- Return type:
plt.Figure
- momics.plotting.mpl_average_per_factor(df: DataFrame, factor: str = None) Figure#
Plots the average Shannon index grouped by a factor.
- Parameters:
df (pd.DataFrame) – A DataFrame containing alpha diversity results.
factor (str, optional) – The column name to group by. Defaults to None.
- Returns:
The average Shannon index plot.
- Return type:
plt.Figure
- momics.plotting.mpl_bgcs_violin(df: DataFrame, normalize: bool = False) Figure#
- momics.plotting.mpl_plot_heatmap(df: DataFrame, taxon: str, norm=False) Figure#
Creates a heatmap plot for beta diversity.
- Parameters:
df (pd.DataFrame) – A DataFrame containing beta diversity distances.
taxon (str) – The taxon level for beta diversity calculation.
norm (bool) – Whether to normalize the data.
- Returns:
The heatmap plot.
- Return type:
plt.Figure
- momics.plotting.plot_domain_abundance(filtered_domains: Series, abundance_min: int) Bars#
Plot a histogram of the number of Pfam domains in the feature table using hvplot.
- Parameters:
filtered_domains (pd.Series) – A Series containing domain names as the index and their abundances as values.
abundance_min (int) – The minimum abundance threshold for domains to be included in the plot.
- Returns:
A horizontal bar plot of domain abundances.
- Return type:
hv.element.Bars
- momics.plotting.plot_network(network_results, association_data, alpha=0.5)#
- momics.plotting.plot_pcoa_black(pcoa_df: DataFrame, color_by: str = None) Figure#
Plots a PCoA plot with optional coloring.
- Parameters:
pcoa_df (pd.DataFrame) – A DataFrame containing PCoA results.
color_by (str, optional) – The column name to color the points by. Defaults to None.
- Returns:
The PCoA plot.
- Return type:
plt.Figure
- momics.plotting.plot_tsne(X_embedded: ndarray, kmeans) Scatter#
Plot the t-SNE embedding of the clusters using hvplot.
- Parameters:
X_embedded (np.ndarray) – The t-SNE embedded coordinates.
kmeans – The fitted KMeans object containing cluster labels.
- Returns:
The t-SNE plot of domain clusters.
- Return type:
hv.element.Scatter
Statistical module#
This module provides functions for performing statistical analyses on omics data.
- momics.stats.plot_association_histogram(assoc_data: Dict, bins: int = 50) None#
Plot a histogram of the correlation values for each factor.
- Parameters:
assoc_data (Dict) – A dictionary containing correlation data for each factor.
bins (int) – The number of bins to use for the histogram.
- Returns:
None
- momics.stats.plot_fdr(correlations: Dict, pval_cutoff: float) None#
Plot the FDR-corrected p-values against the raw p-values for each factor.
- Parameters:
correlations (dict) – A dictionary containing correlation data for each factor.
pval_cutoff (float) – The p-value cutoff for significance.
- Returns:
None
- momics.stats.spearman_from_taxonomy(split_taxonomy: Dict) Dict#
Compute Spearman correlation and p-values for the full taxonomy split by a factor. Refer to momics.taxonomy.split_taxonomic_data for more information.
- Parameters:
split_taxonomy (dict) – A dictionary containing dataframes for each factor.
- Returns:
A dictionary containing Spearman correlation and p-values for each factor.
- Return type:
dict
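The per-group correlation idea can be sketched with plain pandas. Spearman correlation is the Pearson correlation of ranks, so ranking first and then calling `corr()` is enough. The helper name `spearman_sketch`, the table layout (samples in rows, taxa in columns) and the group names are assumptions for illustration; p-values are omitted here, and the momics implementation may differ.

```python
import pandas as pd

def spearman_sketch(split_tables: dict) -> dict:
    """Spearman correlation between columns, computed separately per group."""
    # Ranking each column and taking the Pearson correlation of the ranks
    # is the definition of Spearman's rank correlation.
    return {name: table.rank().corr() for name, table in split_tables.items()}

# Hypothetical group with three taxa observed in four samples
group = pd.DataFrame({
    "taxon_a": [1, 2, 3, 4],
    "taxon_b": [10, 20, 40, 80],   # increases with taxon_a
    "taxon_c": [9, 7, 5, 1],       # decreases with taxon_a
})
result = spearman_sketch({"summer": group})
corr = result["summer"]
```

In this toy table `taxon_b` is a monotonic increasing function of `taxon_a`, so their rank correlation is 1, while `taxon_c` gives -1.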
Taxonomy module#
This module provides tools for handling and analyzing taxonomic information in omics datasets.
- momics.taxonomy.aggregate_by_taxonomic_level(df: DataFrame, level: str) DataFrame#
Aggregates the DataFrame by a specific taxonomic level and sums abundances across samples.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing taxonomic information.
level (str) – The taxonomic level to aggregate by (e.g., ‘phylum’, ‘class’, etc.).
- Returns:
A DataFrame aggregated by the specified taxonomic level.
- Return type:
pd.DataFrame
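As a rough illustration of the aggregation described above, the following sketch groups a small table by one rank column and sums the abundances with plain pandas. The column names and the helper `aggregate_sketch` are assumptions made for the example; the actual momics function may expect a different table layout.

```python
import pandas as pd

def aggregate_sketch(df: pd.DataFrame, level: str, sample_cols: list) -> pd.DataFrame:
    """Sum abundances of all rows that share the same value at `level`."""
    return df.groupby(level)[sample_cols].sum()

# Hypothetical table with one column per rank and one column per sample
table = pd.DataFrame({
    "phylum": ["Proteobacteria", "Proteobacteria", "Bacteroidota"],
    "class": ["Alphaproteobacteria", "Gammaproteobacteria", "Bacteroidia"],
    "sample_A": [10, 5, 3],
    "sample_B": [0, 7, 2],
})
by_phylum = aggregate_sketch(table, "phylum", ["sample_A", "sample_B"])
```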
- momics.taxonomy.clean_tax_row(row: str) str#
Cleans the taxonomic rows for both EMO-BON and MGnify formats of taxonomic concats.
- Parameters:
row (str) – The input taxonomic row for a taxonomy DF as a string.
- Returns:
The cleaned taxonomic row.
- Return type:
str
- momics.taxonomy.compute_bray_curtis(df: DataFrame, skip_cols: int = 0, direction: str = 'samples') DataFrame#
Compute the Bray-Curtis dissimilarity matrix and return it as a pandas DataFrame. The dissimilarity is computed between samples or between taxa, depending on direction.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing sample counts.
skip_cols (int) – Number of columns to skip (e.g., taxonomic information).
direction (str) – Direction of the dissimilarity calculation, ‘samples’ or ‘taxa’.
- Returns:
A DataFrame containing the Bray-Curtis dissimilarity matrix.
- Return type:
pd.DataFrame
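For reference, the Bray-Curtis dissimilarity between two count vectors u and v is sum(|u_i - v_i|) / sum(u_i + v_i). A minimal NumPy sketch of the pairwise computation between samples follows. It assumes samples are in columns and that no pair of samples is empty (which would divide by zero); `bray_curtis_sketch` is a hypothetical helper, not the momics implementation.

```python
import numpy as np
import pandas as pd

def bray_curtis_sketch(counts: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Bray-Curtis dissimilarity between the columns (samples)."""
    values = counts.to_numpy(dtype=float).T  # one row per sample
    # Broadcast over all sample pairs, then sum over the feature axis
    abs_diff = np.abs(values[:, None, :] - values[None, :, :]).sum(axis=2)
    totals = (values[:, None, :] + values[None, :, :]).sum(axis=2)
    return pd.DataFrame(abs_diff / totals, index=counts.columns, columns=counts.columns)

counts = pd.DataFrame(
    {"s1": [10, 0, 5], "s2": [10, 0, 5], "s3": [0, 8, 0]},
    index=["taxon_a", "taxon_b", "taxon_c"],
)
dist = bray_curtis_sketch(counts)
```

Here `s1` and `s2` are identical (dissimilarity 0), while `s1` and `s3` share no taxa (dissimilarity 1).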
- momics.taxonomy.fdr_pvals(p_spearman_df: DataFrame, pval_cutoff: float) DataFrame#
Apply FDR correction to the p-values DataFrame using Benjamini/Hochberg (non-negative) method. This function extracts the upper triangle of the p-values DataFrame.
- Parameters:
p_spearman_df (pd.DataFrame) – DataFrame containing p-values.
pval_cutoff (float) – P-value cutoff for FDR correction.
- Returns:
DataFrame with FDR corrected p-values.
- Return type:
pd.DataFrame
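The two steps named above, extracting the upper triangle and applying the Benjamini-Hochberg correction, can be sketched in NumPy as follows. This is an independent illustration under the assumption of a symmetric p-value matrix; `bh_adjust` and the taxon labels are hypothetical, and the momics function may rely on a library routine instead.

```python
import numpy as np
import pandas as pd

def bh_adjust(pvals: np.ndarray) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values for a 1-D array."""
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

labels = ["t1", "t2", "t3"]
p_df = pd.DataFrame(
    [[0.0, 0.01, 0.04], [0.01, 0.0, 0.03], [0.04, 0.03, 0.0]],
    index=labels, columns=labels,
)
# Upper triangle without the diagonal: each pair is tested once
rows, cols = np.triu_indices(len(labels), k=1)
raw = p_df.to_numpy()[rows, cols]
adj = bh_adjust(raw)
significant = adj < 0.05  # compare against a chosen cutoff
```

Working on the upper triangle only avoids counting each symmetric pair twice, which would otherwise inflate the number of tests.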
- momics.taxonomy.fill_taxonomy_placeholders(df: DataFrame, taxonomy_ranks: list) DataFrame#
Fill higher missing taxonomy levels in a DataFrame with placeholders like ‘unclassified_<lower_rank_value>’.
No downwards propagation is done, only upwards filling.
- Parameters:
df (pd.DataFrame) – A DataFrame containing taxonomy columns.
taxonomy_ranks (list) – Ordered list of taxonomy column names, from higher to lower rank.
- Returns:
The DataFrame with placeholders filled.
- Return type:
pd.DataFrame
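One possible reading of the upward filling is sketched below: a missing higher rank receives a placeholder built from the nearest lower rank that has a value, and missing lower ranks are left untouched. The helper `fill_up_sketch`, the rank columns and the choice of "nearest lower original value" are assumptions for illustration; the momics function may resolve consecutive gaps differently.

```python
import pandas as pd

def fill_up_sketch(df: pd.DataFrame, taxonomy_ranks: list) -> pd.DataFrame:
    """Fill missing higher ranks from the nearest classified lower rank."""
    out = df.copy()
    for idx in out.index:
        lower_value = None
        # Walk from the lowest rank up to the highest
        for rank in reversed(taxonomy_ranks):
            value = df.at[idx, rank]
            if pd.isna(value):
                if lower_value is not None:
                    out.at[idx, rank] = f"unclassified_{lower_value}"
            else:
                lower_value = value
    return out

ranks = ["phylum", "class", "genus"]  # ordered from higher to lower rank
taxa = pd.DataFrame({
    "phylum": ["Proteobacteria", None, "Bacteroidota"],
    "class": [None, None, None],
    "genus": ["Vibrio", "Pelagibacter", None],
})
filled = fill_up_sketch(taxa, ranks)
```

In the third row nothing is filled, because only lower ranks are missing and no downward propagation takes place.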
- momics.taxonomy.logger = <Logger momics.taxonomy (INFO)>#
Some functions were originally developed by Andrzej Tkacz at CCMAR-Algarve.
- momics.taxonomy.map_taxa_up(df: DataFrame, taxon: str, tax_level: str, tax_id: int) DataFrame#
Map all lower taxa to the specified taxonomic level in the DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame containing taxonomic data.
taxon (str) – The taxon to map up.
tax_level (str) – The taxonomic level to map to.
tax_id (int) – The NCBI taxonomic ID to map to.
- Returns:
DataFrame with lower taxa mapped to the specified taxonomic level.
- Return type:
pd.DataFrame
- momics.taxonomy.normalize_abundance(df: DataFrame, method: str = 'tss_sqrt', rarefy_depth: int = None) DataFrame#
Normalize the abundance DataFrame using specified method.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing taxonomic information.
method (str) – Normalization method. Options: ‘tss’, ‘tss_sqrt’, ‘rarefy’. Defaults to ‘tss_sqrt’.
rarefy_depth (int, optional) – Depth for rarefaction. If None, uses min sample sum. Defaults to None.
- Returns:
A DataFrame with normalized abundance values.
- Return type:
pd.DataFrame
- Raises:
IndexError – If the DataFrame does not have a multiindex with ‘taxonomic_concat’ and ‘ncbi_tax_id’.
TypeError – If the DataFrame does not contain numeric values for normalization.
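To make the ‘tss’ and ‘tss_sqrt’ options concrete, here is a small sketch of total sum scaling with an optional square-root transform. It assumes samples are in columns and ignores the multiindex requirement mentioned above; `tss_sketch` is a hypothetical helper rather than the momics code.

```python
import numpy as np
import pandas as pd

def tss_sketch(df: pd.DataFrame, sqrt: bool = False) -> pd.DataFrame:
    """Divide each sample (column) by its total; optionally take the square root."""
    relative = df / df.sum(axis=0)
    return np.sqrt(relative) if sqrt else relative

counts = pd.DataFrame(
    {"s1": [1, 3], "s2": [4, 12]},
    index=["taxon_a", "taxon_b"],
)
tss = tss_sketch(counts)
tss_sqrt = tss_sketch(counts, sqrt=True)
```

After total sum scaling every sample sums to 1, so samples with different sequencing depths become comparable; the square root then dampens the influence of very abundant taxa.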
- momics.taxonomy.pivot_taxonomic_data(df: DataFrame) DataFrame#
Prepares the taxonomic data (LSU and SSU tables) for analysis. Apart from pivoting, normalization of the pivot is optional. Methods include:
None: no normalization.
tss_sqrt: Total Sum Scaling and Square Root Transformation.
rarefy: rarefaction to a specified depth, if None, min of sample sums is used.
TODO: refactor scaling to a new method and offer different options.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing taxonomic information.
normalize (str, optional) – Normalization method. Options: None, ‘tss_sqrt’, ‘rarefy’. Defaults to None.
rarefy_depth (int, optional) – Depth for rarefaction. If None, uses min sample sum. Defaults to None.
- Returns:
A pivot table with taxonomic data.
- Return type:
pd.DataFrame
- momics.taxonomy.prevalence_cutoff(df: DataFrame, percent: float = 10, skip_columns: int = 2) DataFrame#
Apply a prevalence cutoff to the DataFrame, removing features that do not appear in at least a certain percentage of samples. This is useful for filtering out low-prevalence features that may not be biologically relevant.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing feature abundances.
percent (float) – The prevalence threshold as a percentage.
skip_columns (int) – The number of columns to skip (e.g., taxonomic information).
- Returns:
A filtered DataFrame with low-prevalence features removed.
- Return type:
pd.DataFrame
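The prevalence filter can be illustrated with a few lines of pandas. The sketch assumes that every column is a sample (so it has no equivalent of skip_columns) and that a feature counts as present when its abundance is greater than zero; `prevalence_sketch` is a hypothetical name.

```python
import pandas as pd

def prevalence_sketch(df: pd.DataFrame, percent: float = 10.0) -> pd.DataFrame:
    """Keep features present (> 0) in at least `percent` % of the samples."""
    n_present = (df > 0).sum(axis=1)
    min_samples = percent / 100 * df.shape[1]
    return df[n_present >= min_samples]

counts = pd.DataFrame(
    {"s1": [5, 0, 1], "s2": [2, 0, 0], "s3": [1, 3, 4], "s4": [0, 0, 0]},
    index=["a", "b", "c"],
)
kept = prevalence_sketch(counts, percent=50)
```

With four samples and a 50 % threshold, a feature must occur in at least two samples, so feature `b` (present in one sample only) is removed.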
- momics.taxonomy.prevalence_cutoff_taxonomy(df: DataFrame, percent: float = 10) DataFrame#
Apply a prevalence cutoff to the taxonomy DataFrame, which is not pivoted, removing taxa with low abundance in each sample separately.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing feature abundances.
percent (float) – The prevalence threshold as a percentage.
- Returns:
A filtered DataFrame with low-prevalence features removed.
- Return type:
pd.DataFrame
- momics.taxonomy.rarefy_table(df: DataFrame, depth: int = None, axis: int = 1) DataFrame#
Rarefy an abundance table to a given depth. If depth is None, uses the minimum sample sum across all samples. This function is a wrapper around the skbio.stats.subsample_counts function.
- Parameters:
df (pd.DataFrame) – The abundance table (rows: features, columns: samples).
depth (int, optional) – Rarefaction depth. If None, uses the minimum sample sum.
axis (int) – 1 if samples are in columns, 0 if samples are in rows.
- Returns:
A rarefied DataFrame. Samples are ALWAYS in columns.
- Return type:
pd.DataFrame
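Rarefaction draws the same number of reads from every sample without replacement. The sketch below shows the idea with NumPy's multivariate hypergeometric sampler instead of skbio, purely as an illustration; `rarefy_sketch` and its `seed` parameter are assumptions, and samples with fewer reads than `depth` would raise an error here.

```python
import numpy as np
import pandas as pd

def rarefy_sketch(df: pd.DataFrame, depth: int = None, seed: int = 0) -> pd.DataFrame:
    """Subsample every column (sample) to `depth` reads without replacement."""
    rng = np.random.default_rng(seed)
    if depth is None:
        depth = int(df.sum(axis=0).min())  # smallest sample total
    rarefied = {}
    for sample in df.columns:
        counts = df[sample].to_numpy(dtype=np.int64)
        rarefied[sample] = rng.multivariate_hypergeometric(counts, depth)
    return pd.DataFrame(rarefied, index=df.index)

counts = pd.DataFrame(
    {"s1": [50, 30, 20], "s2": [5, 10, 5]},
    index=["taxon_a", "taxon_b", "taxon_c"],
)
rarefied = rarefy_sketch(counts)
```

With the default depth, the smallest sample (`s2`) keeps all of its reads and only the larger samples are subsampled.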
- momics.taxonomy.remove_high_taxa(df: DataFrame, taxonomy_ranks: list, tax_level: str = 'phylum', strict: bool = True) DataFrame#
Remove high level taxa from the dataframe.
- Parameters:
df (pd.DataFrame) – DataFrame containing taxonomic data.
taxonomy_ranks (list) – List of taxonomic ranks in order (e.g., [‘phylum’, ‘class’, ‘order’, …]).
tax_level (str) – The taxonomic level to filter by (e.g., ‘phylum’, ‘class’, ‘order’, etc.).
strict (bool) – If True, the lower taxa are all mapped to the tax_level. For instance, tax_level=’phylum’ will map all the more granular assignments (class, order, etc) to the phylum level.
- Returns:
DataFrame with rows where the specified taxonomic level is not None.
- Return type:
pd.DataFrame
- momics.taxonomy.separate_taxonomy(df: DataFrame, eukaryota_keywords: List[str] = None) Dict[str, DataFrame]#
Separate the taxonomic data into different categories based on the index names.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing taxonomic information (LSU/SSU tables).
eukaryota_keywords (List[str]) – List of keywords to filter Eukaryota data.
- Returns:
A dictionary containing separate DataFrames for Prokaryotes and Eukaryota.
- Return type:
Dict[str, pd.DataFrame]
- momics.taxonomy.separate_taxonomy_eukaryota(df: DataFrame, eukaryota_keywords: List[str])#
Separate Eukaryota data into different files based on specific keywords.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing taxonomic information (LSU/SSU tables).
eukaryota_keywords (List[str]) – List of keywords to filter Eukaryota data.
- Example keywords:
eukaryota_keywords = [‘Discoba’, ‘Stramenopiles’, ‘Rhizaria’, ‘Alveolata’, ‘Amorphea’, ‘Archaeoplastida’, ‘Excavata’]
- momics.taxonomy.split_metadata(metadata: DataFrame, factor: str) Dict[str, list]#
Splits the metadata ref codes into a dictionary whose keys are the factor values and whose values are lists of the corresponding ref codes.
- Parameters:
metadata (pd.DataFrame) – The input DataFrame containing metadata.
factor (str) – The column name to split the metadata by.
- Returns:
A dictionary with keys as unique values of the factor and values as lists of ref codes.
- Return type:
Dict[str, list]
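The grouping can be pictured with a pandas `groupby`. The sketch assumes the metadata table has a ‘ref_code’ column; the helper name, the ‘season’ factor and the sample codes are made up for the example.

```python
import pandas as pd

def split_metadata_sketch(metadata: pd.DataFrame, factor: str) -> dict:
    """Map each value of `factor` to the list of ref codes that carry it."""
    return metadata.groupby(factor)["ref_code"].apply(list).to_dict()

# Hypothetical metadata table
metadata = pd.DataFrame({
    "ref_code": ["sample_1", "sample_2", "sample_3"],
    "season": ["summer", "winter", "summer"],
})
groups = split_metadata_sketch(metadata, "season")
```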
- momics.taxonomy.split_taxonomic_data(taxonomy: DataFrame, groups: Dict[str, list]) Dict[str, DataFrame]#
Splits the taxonomic data into a dictionary of DataFrames for each group. The split is based on the ref_code column, which needs to be present in the DataFrame.
- Parameters:
taxonomy (pd.DataFrame) – The input DataFrame containing taxonomic information.
groups (Dict[str, list]) – A dictionary where keys are unique values of a factor which correspond to the groups to split by.
- Returns:
A dictionary with keys as unique values of the factor and values as DataFrames containing taxonomic data for each group.
- Return type:
Dict[str, pd.DataFrame]
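A minimal sketch of a split by ref_code, assuming a long (non-pivoted) table with a ‘ref_code’ column and a groups dictionary of the kind returned by split_metadata. The helper name, the ‘abundance’ column and the example values are hypothetical.

```python
import pandas as pd

def split_by_ref_code(taxonomy: pd.DataFrame, groups: dict) -> dict:
    """Return one DataFrame per group, keeping rows whose ref_code is in the group."""
    return {
        name: taxonomy[taxonomy["ref_code"].isin(codes)]
        for name, codes in groups.items()
    }

taxonomy = pd.DataFrame({
    "ref_code": ["sample_1", "sample_1", "sample_2", "sample_3"],
    "ncbi_tax_id": [1224, 976, 1224, 976],
    "abundance": [10, 4, 7, 2],
})
groups = {"summer": ["sample_1", "sample_3"], "winter": ["sample_2"]}
split = split_by_ref_code(taxonomy, groups)
```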
- momics.taxonomy.split_taxonomic_data_pivoted(taxonomy: DataFrame, groups: Dict[str, list]) Dict[str, DataFrame]#
Splits the taxonomic data into a dictionary of DataFrames for each group. The split is based on the column names, which need to match between the taxonomy DataFrame and the groups lists. The DataFrame should have ‘ncbi_tax_id’ and ‘taxonomic_concat’ columns, which will serve as the index of the resulting DataFrames.
- Parameters:
taxonomy (pd.DataFrame) – The input DataFrame containing taxonomic information.
groups (Dict[str, list]) – A dictionary where keys are unique values of a factor which correspond to the groups to split by.
- Returns:
A dictionary with keys as unique values of the factor and values as DataFrames with separate columns for each taxonomic rank.
- Return type:
Dict[str, pd.DataFrame]
- momics.taxonomy.split_taxonomy(index_name: str) List[str]#
Splits the taxonomic string into its components and removes prefixes.
- Parameters:
index_name (str) – The taxonomic string to split.
- Returns:
A list of taxonomic levels.
- Return type:
List[str]
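As an illustration of splitting and prefix removal, the sketch below assumes a semicolon-separated string with rank prefixes such as ‘sk__’ or ‘p__’. Both the separator and the prefix pattern are assumptions about the input format, and `split_taxonomy_sketch` is a hypothetical stand-in for the real function.

```python
import re

def split_taxonomy_sketch(index_name: str) -> list:
    """Split on ';' and drop a leading rank prefix such as 'sk__' or 'p__'."""
    parts = index_name.split(";")
    return [re.sub(r"^[a-z]{1,2}__", "", part) for part in parts]

levels = split_taxonomy_sketch("sk__Bacteria;k__;p__Proteobacteria")
```

An empty rank (here the kingdom) is kept as an empty string so that the position of every level in the list stays aligned with its rank.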
- momics.taxonomy.taxon_in_table(df: DataFrame, taxonomy_ranks: list, taxon: str, tax_level: str) int#
Check if a taxon exists in the DataFrame at the specified taxonomic level.
- Parameters:
df (pd.DataFrame) – DataFrame containing taxonomic data.
taxon (str) – The taxon to check for.
tax_level (str) – The taxonomic level to check against.
- Returns:
The index of the taxon in the DataFrame, or -1 if not found.
- Return type:
int
Utilities of all sorts#
This module contains miscellaneous utility functions used throughout the momics package.
- momics.utils.check_index_names(df1: DataFrame, df2: DataFrame) bool#
Check if two DataFrames have the same index name.
- Parameters:
df1 (pd.DataFrame) – The first DataFrame.
df2 (pd.DataFrame) – The second DataFrame.
- Returns:
True if both DataFrames have the same index name, False otherwise.
- Return type:
bool
- momics.utils.get_notebook_environment()#
Determine if the notebook is running in VS Code or JupyterLab.
- Returns:
The environment in which the notebook is running (‘vscode’, ‘jupyter:binder’, ‘jupyter:local’ or ‘unknown’).
- Return type:
str
- momics.utils.init_setup()#
Initializes the setup environment.
This function checks if the current environment is IPython (such as Google Colab). If it is, it runs the setup for IPython environments. Otherwise, it runs the setup for local environments.
- momics.utils.install_colab_packages()#
- momics.utils.is_ipython()#
- momics.utils.load_and_clean(valid_samples: DataFrame = None) Tuple[DataFrame, DataFrame]#
- momics.utils.memory_load()#
Get the memory usage of the current process.
- Returns:
A tuple containing used_gb (float), the amount of memory currently used by the process in gigabytes, and total_gb (float), the total amount of memory available in gigabytes.
- Return type:
tuple
- momics.utils.memory_usage()#
Get the memory usage of the current process.
- Returns:
A list of tuples containing the names of the objects in the current environment and their corresponding sizes in bytes.
- Return type:
list
- momics.utils.reconfig_logger(format='%(levelname)s | %(name)s | %(message)s', level=20)#
(Re-)configure logging
- momics.utils.setup_ipython()#
Setup the IPython environment.
This function installs the momics package and other dependencies for the IPython environment.
- momics.utils.taxonomy_common_preprocess01(df, high_taxon, mapping, prevalence_cutoff_value, taxonomy_ranks, pivot=False)#