Parametrized co-occurrence networks

Complex networks provide metrics and to identify kestone species bridges/bottleneck species within the community. EMO-BON data are unified in methods but highly diverse in respect to the metadata (sampling station, temperature, season etc.), which can be interpreted as control/treatment. To refrase a particular question in respect to seasonality, is there structural difference in taxonomy buildup in the communities between spring, summer, autumn and winter.

We will construct association matrices between taxa over the series of sampling events grouped by the factor (season in this case).

Recipe¶

Load data
Remove high taxa (non-identified sequences)
Pivot table from sampling events to taxa.
Remove low abundance taxa
Rarefy, or normalize
Remove replicates
Split to groups on chosen factor
Calculate associations (Bray-curits dissimilarity, Spearman’s correlation, etc.)
False discovery rate correction
Build and analyse network per group

1. Loading data and metadata¶

import os
from typing import Dict
import numpy as np
import pandas as pd

from skbio.stats import subsample_counts
from skbio.diversity import beta_diversity
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests
import networkx as nx

from momics.taxonomy import (
    pivot_taxonomic_data,
    separate_taxonomy)
from skbio.diversity import beta_diversity
from momics.taxonomy import (
    remove_high_taxa,
    pivot_taxonomic_data,
    prevalence_cutoff,
    rarefy_table,
    split_metadata,
    split_taxonomic_data,
    split_taxonomic_data_pivoted,
    compute_bray_curtis,
    fdr_pvals,
)
from momics.networks import interaction_to_graph, interaction_to_graph_with_pvals

from mgo.udal import UDAL

# All low level functions are imported from the momics package
from momics.loader import load_parquets_udal
from momics.metadata import get_metadata_udal, enhance_metadata

import seaborn as sns
import matplotlib.pyplot as plt

All the methods which are not part of marine-omics package are defined below. Documnetation and context for the marine-omics (momics) methods can be found on readthedocs.io.

def get_data():
    return load_parquets_udal()


# Load and merge metadata
def get_full_metadata():
    return get_metadata_udal()


def get_valid_samples():
    url = "https://raw.githubusercontent.com/emo-bon/momics-demos/refs/heads/main/data/shipment_b1b2_181.csv"
    df_valid = pd.read_csv(
        url, header=0, index_col=0,
    )
    return df_valid

# Load metadata
full_metadata = get_full_metadata()

# filter the metadata only for valid 181 samples
valid_samples = get_valid_samples()
full_metadata = enhance_metadata(full_metadata, valid_samples)

mgf_parquet_dfs = get_data()
ssu = mgf_parquet_dfs['ssu']
ssu.head()

assert full_metadata.shape[0] == 181

print(full_metadata.shape)

(181, 142)

Batch 1+2 contain 181 valid samples, which should be the first value printed above. Now we convert object columns to categorical variables

for col in full_metadata.columns:
    # print(full_metadata[col].isinstance(pd.CategoricalDtype()))
    # check if object dtype
    if full_metadata[col].dtype == 'object':
        # convert to categorical
        full_metadata[col] = full_metadata[col].astype('category')

# example how to crosscheck the abve operation
isinstance(full_metadata['season'].dtype, pd.CategoricalDtype)

True

2. Remove high taxa¶

Because many sequences are not well classified, information that the sequence is from superKingdom Bacteria is not helpful and does not add information value to our analysis. Here we remove all taxa which are only identified on phylum level and higher.

print("Original DataFrame shape:", ssu.shape)
ssu = remove_high_taxa(ssu, tax_level='phylum')
print("Filtered DataFrame shape:", ssu.shape)

Original DataFrame shape: (111320, 11)
Filtered DataFrame shape: (110286, 11)

3. Pivot tables¶

Pivoting converts taxonomy table which rows contain each taxon identified in every sample to table with rows of unique taxa with columns representing samples and values of abundances per sample.

ssu_pivot = pivot_taxonomic_data(ssu, normalize=None, rarefy_depth=None)
ssu_pivot.head()

The pivot_taxonomic_data can normalize or rarefy the table too, but split these steps, because of additional removal of low abundance taxa which needs to happen before normalizing/rarefying.

4. Remove low prevalence taxa¶

The cutoff selected here is 10 %, which means that all taxa which do not appear at least in 10 % of all the samples in the taxonomy table will be removed.

ssu_filt = prevalence_cutoff(ssu_pivot, percent=10, skip_columns=2)
ssu_filt.head()

5. Rarefy or normalize¶

Many normalization techniques are readily available to mornalize taxonomy data. For the purposes of co-occurrence networks, it is reccomended to rarefy [refs]. Please correct if wrong.

ssu_rarefied = ssu_filt.copy()
ssu_rarefied.iloc[:, 2:] = rarefy_table(ssu_filt.iloc[:, 2:], depth=None, axis=1)
ssu_rarefied.head()

6. Remove replicates¶

Anything which has count > 1 is replicate of some sort. Note that this information comes from the enhanced metadata information, because replicate_info is not in the original metadata. It is a combination of observatory, sample type, date, and filter information.

It should be considered to filter replicates before steps 4. and 5. Identify replicates:

full_metadata['replicate_info'].value_counts()

replicate_info
VB_water_column_2021-08-23_NA           4
VB_water_column_2021-06-21_NA           4
VB_water_column_2021-12-17_NA           4
VB_water_column_2021-10-18_NA           4
AAOT_water_column_2021-10-15_3-200      2
                                       ..
OSD74_water_column_2021-08-31_3-200     1
HCMR-1_water_column_2021-06-28_3-200    1
OOB_soft_sediment_2021-06-08_NA         1
ESC68N_water_column_2021-11-04_0.2-3    1
ESC68N_water_column_2021-11-04_3-200    1
Name: count, Length: 90, dtype: int64

Remove the replicates:

filtered_metadata = full_metadata.drop_duplicates(subset='replicate_info', keep='first')
filtered_metadata.shape

(90, 142)

7. Split to groups on chosen `factor`¶

Split metadata to groups based on unique values in your factor.

FACTOR = 'season'
groups = split_metadata(
    filtered_metadata,
    FACTOR
)

# remove groups which have less than 2 members (bad for your statistics :)
for groups_key in list(groups.keys()):
    print(f"{groups_key}: {len(groups[groups_key])} samples")
    if len(groups[groups_key]) < 3:
        del groups[groups_key]
        print(f"Warning: {groups_key} has less than 3 samples, therefore removed.")

Spring: 2 samples
Warning: Spring has less than 3 samples, therefore removed.
Summer: 35 samples
Autumn: 44 samples
Winter: 9 samples

Split pivoted taxonomy data according to groups of ref_codes defined above:

# pivot and filter the ssu
split_taxonomy = split_taxonomic_data_pivoted(
    ssu_rarefied,
    groups
)

for key, value in split_taxonomy.items():
    print(f"{key}: {value.shape[0]} rows, {value.shape[1]} columns")
# split_taxonomy.keys(),

print(ssu_rarefied.shape)
split_taxonomy['Autumn'].iloc[:, 2:].head()

8. Calculate associations (Bray-curtis and Spearman’s)¶

Bray-curtis dissimilarity:

bray_taxa = {}
for factor, df in split_taxonomy.items():
    bray_taxa[factor] = compute_bray_curtis(df, direction='taxa')

Spearman correlation:

spearman_taxa = {}
# Compute Spearman correlation
for factor, df in split_taxonomy.items():
    corr, p_spearman = spearmanr(df.iloc[:, 2:].T)
    assert corr.shape == p_spearman.shape, "Spearman correlation and p-values must have the same shape."
    corr_df = pd.DataFrame(
        corr,
        index=df['ncbi_tax_id'],
        columns=df['ncbi_tax_id']
    )
    p_spearman_df = pd.DataFrame(
        p_spearman,
        index=df['ncbi_tax_id'],
        columns=df['ncbi_tax_id']
    )
    d = {
        'correlation': corr_df,
        'p_vals': p_spearman_df
    }
    spearman_taxa[factor] = d
    assert spearman_taxa[factor]['correlation'].shape == spearman_taxa[factor]['p_vals'].shape, "Spearman correlation and p-values must have the same shape."


spearman_taxa['Summer']['correlation'].head()

9. False discovery rate (FDR) correction¶

Calculate FDR¶

We have more than 1000 taxa in the tables, which means 1000*1000 / 2 association values. Too many to statistically trust that any low p-value is not just statistical coincidence. Therefore False Discovery Rate correction is absolutely essential here.

for factor, d in spearman_taxa.items():
    pvals_fdr = fdr_pvals(d['p_vals'])
    spearman_taxa[factor]['p_vals_fdr'] = pvals_fdr

Let’s crosscheck one particular set of DFs for one factor value, which should all have the same shape:

season = 'Summer'
spearman_taxa[season]['correlation'].shape, spearman_taxa[season]['p_vals'].shape, spearman_taxa[season]['p_vals_fdr'].shape

((1312, 1312), (1312, 1312), (1312, 1312))

Display histograms FDR corrected p-values¶

For Bray-curtis dissimilarity we plot only the histogram. Does it make sense to do FDR? How?

# histogram of the correlation values for setting graph cutoffs
plt.figure(figsize=(10, 5))
for factor, df in bray_taxa.items():
    plt.hist(df.values.flatten(), bins=50, alpha=0.5, label=factor, log=True)

plt.title("Histogram of Correlation Values")
plt.xlabel("Correlation")
plt.ylabel("Frequency")
plt.legend()
plt.show()

We performed FDR on Spearman’s correlations, so both correlation values and FDR curves are displayed.

# histogram of the correlation values for setting graph cutoffs
plt.figure(figsize=(10, 5))
for factor, dict_df in spearman_taxa.items():
    plt.hist(dict_df['correlation'].values.flatten(), bins=50, alpha=0.5, label=factor, log=True)
plt.title("Histogram of Correlation Values")
plt.xlabel("Correlation")
plt.ylabel("Frequency")
plt.legend()
plt.show()

plt.figure(figsize=(10, 5))
for factor, dict_df in spearman_taxa.items():
    plt.scatter(
        dict_df['p_vals'].values.flatten()[::10],  # downsample for better visibility and speed
        dict_df['p_vals_fdr'].values.flatten()[::10],
        alpha=0.5,
        label=factor,
    )
plt.axvline(0.05, color='gray', linestyle='--', label='Raw p=0.05')
plt.axhline(0.05, color='black', linestyle='--', label='FDR p=0.05')
plt.xlabel('Raw p-value')
plt.ylabel('FDR-corrected p-value')
plt.title(f'Comparison of Raw and FDR-corrected p-values for {factor}')
plt.legend()
plt.show()

The FDR curve shows huge inflation of significant associations if taken from the raw data. How many?

significant_raw = dict_df['p_vals'].values.flatten() < 0.05
significant_fdr = dict_df['p_vals_fdr'].values.flatten() < 0.05

removed = np.sum(significant_raw & ~significant_fdr)
significant = np.sum(significant_fdr)

print(f"Total associations significant before FDR: {np.sum(significant_raw)}")
print(f"Total associations significant after FDR: {significant}")
print(f"Associations significant before FDR but not after: {removed}")

Total associations significant before FDR: 96562
Total associations significant after FDR: 9442
Associations significant before FDR but not after: 87120

10. Build network for each factor value group¶

Nodes and edges for positive and negative associations (Example)¶

Example of function generating list of nodes and significant positive/negative edges to construct the graph of.

for factor, dict_df in spearman_taxa.items():
    print(f"factor: {factor}")
    nodes, edges_pos, edges_neg = interaction_to_graph_with_pvals(
        dict_df['correlation'],
        dict_df['p_vals_fdr'],
        pos_cutoff=0.7,
        neg_cutoff=-0.5,
        p_val_cutoff=0.05
    )
    break

factor: Summer

Number of positive edges: 4391
Number of negative edges: 739

Network generation and metrics calculation¶

At the same time as we generate the graph, we calculate several descriptive metrics of themetwork (implement saving of the network, if you need the graphs for later). We are going to look at:

degree centrality, which measures the number of direct connections a node (taxon) has. High degree centrality suggests a taxon is highly connected and may play a central role in the community. Therefore represents a so called keystone species.
betweenness, which represents how often a node appears on the shortest paths between other nodes. Taxa with high/low betweenness may act as bridges or bottlenecks in the network, respectively.

network_results = {}
for factor, dict_df in spearman_taxa.items():
    print(f"Factor: {factor}")
    nodes, edges_pos, edges_neg = interaction_to_graph_with_pvals(dict_df['correlation'], dict_df['p_vals_fdr'], pos_cutoff=0.7, neg_cutoff=-0.5, p_val_cutoff=0.05)

    G = nx.Graph(
        mode = factor,
    )

    G.add_nodes_from(nodes)
    G.add_edges_from(edges_pos, color='green')
    G.add_edges_from(edges_neg, color='red')

    network_results[factor] = {
        "graph": G,
        "nodes": nodes,
        "edges_pos": edges_pos,
        "edges_neg": edges_neg
    }

    colors = nx.get_edge_attributes(G, 'color')

    degree_centrality = nx.degree_centrality(G)

    network_results[factor]['degree_centrality'] = sorted(degree_centrality.items(),
                                                          key=lambda x: x[1],
                                                          reverse=True)[:10]
    
    betweenness = nx.betweenness_centrality(G)

    network_results[factor]['top_betweenness'] = sorted(betweenness.items(),
                                                    key=lambda x: x[1],
                                                    reverse=True)[:10]
    network_results[factor]['bottom_betweenness'] = sorted(betweenness.items(),
                                                           key=lambda x: x[1])[:10]
    network_results[factor]['total_nodes'] = G.number_of_nodes()
    network_results[factor]['total_edges'] = G.number_of_edges()

Factor: Summer

Number of positive edges: 4391
Number of negative edges: 739

Factor: Autumn

Number of positive edges: 1753
Number of negative edges: 1063

Factor: Winter

Number of positive edges: 9265
Number of negative edges: 177

Compile dataframes from the dictionary. The dataframe structure is not ideal yet, feel free to open a PR here.

DF = pd.DataFrame(columns=[FACTOR, 'centrality', 'top_betweenness', 'bottom_betweenness', 'total_nodes', 'total_edges'])
for factor, dict_results in network_results.items():
    DF = pd.concat([DF, pd.DataFrame([{
        FACTOR: factor,
        'centrality': dict_results['degree_centrality'],
        'top_betweenness': dict_results['top_betweenness'],
        'bottom_betweenness': dict_results['bottom_betweenness'],
        'total_nodes': dict_results['total_nodes'],
        'total_edges': dict_results['total_edges']
    }])], ignore_index=True)
DF.head()


df_centrality = pd.DataFrame(columns=[FACTOR, 'node', 'centrality'])

for factor, dict_results in network_results.items():
    for node, centrality in dict_results['degree_centrality']:
        df_centrality = pd.concat([df_centrality, pd.DataFrame({
            FACTOR: [factor],
            'node': [node],
            'centrality': [centrality]
        })], ignore_index=True)

# Set hierarchical (MultiIndex) with 'factor' above 'node'
# df_centrality.set_index([FACTOR, 'node'], inplace=True)
df_centrality

The node identifiers correspond to the NCBI tax_id. Is any node shared in the high centrality table?

sum(df_centrality['node'].value_counts() > 1)

1

Whole table looks like this

DF.head()

Jaccard similarity of edge sets¶

This allows to compare networks pair-wise. Since we implemented coloring of the positive and negative associated edges, we can perfor this 3 times (for total, positive, and negative). Three seasons pair-wise give three nuber as well.

First we calculate all the necessary sets of edges:

# this is a pair-wise comparison of the networks
edges_summer_pos = set(network_results['Summer']['edges_pos'])
edges_summer_neg = set(network_results['Summer']['edges_neg'])
edges_summer_total = set(network_results['Summer']['graph'].edges())


edges_autumn_pos = set(network_results['Autumn']['edges_pos'])
edges_autumn_neg = set(network_results['Autumn']['edges_neg'])
edges_autumn_total = set(network_results['Autumn']['graph'].edges())

edges_winter_pos = set(network_results['Winter']['edges_pos'])
edges_winter_neg = set(network_results['Winter']['edges_neg'])
edges_winter_total = set(network_results['Winter']['graph'].edges())

And now Jaccard distances

print('Analyse edges identified with POSITIVE correlation:')
print('###########################################')
intersection = edges_summer_pos & edges_autumn_pos
union = edges_summer_pos | edges_autumn_pos
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Summer and Autumn: {jaccard_similarity:.4f}")

intersection = edges_summer_pos & edges_winter_pos
union = edges_summer_pos | edges_winter_pos
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Summer and Winter: {jaccard_similarity:.4f}")

intersection = edges_autumn_pos & edges_winter_pos
union = edges_autumn_pos | edges_winter_pos
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Autumn and Winter: {jaccard_similarity:.4f}\n")


print('Analyse all edges:')
print('###########################################')
intersection = edges_summer_total & edges_autumn_total
union = edges_summer_total | edges_autumn_total
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Summer and Autumn: {jaccard_similarity:.4f}")

intersection = edges_summer_total & edges_winter_total
union = edges_summer_total | edges_winter_total
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Summer and Winter: {jaccard_similarity:.4f}")

intersection = edges_autumn_total & edges_winter_total
union = edges_autumn_total | edges_winter_total
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Autumn and Winter: {jaccard_similarity:.4f}\n")


print('Analyse edges identified with NEGATIVE correlation:')
print('###########################################')
intersection = edges_summer_neg & edges_autumn_neg
union = edges_summer_neg | edges_autumn_neg
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Summer and Autumn: {jaccard_similarity:.4f}")

intersection = edges_summer_neg & edges_winter_neg
union = edges_summer_neg | edges_winter_neg
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Summer and Winter: {jaccard_similarity:.4f}")

intersection = edges_autumn_neg & edges_winter_neg
union = edges_autumn_neg | edges_winter_neg
jaccard_similarity = len(intersection) / len(union)
print(f"Jaccard similarity between Autumn and Winter: {jaccard_similarity:.4f}\n")

Analyse edges identified with POSITIVE correlation:
###########################################
Jaccard similarity between Summer and Autumn: 0.0641
Jaccard similarity between Summer and Winter: 0.0171
Jaccard similarity between Autumn and Winter: 0.0131

Analyse all edges:
###########################################
Jaccard similarity between Summer and Autumn: 0.0660
Jaccard similarity between Summer and Winter: 0.0172
Jaccard similarity between Autumn and Winter: 0.0136

Analyse edges identified with NEGATIVE correlation:
###########################################
Jaccard similarity between Summer and Autumn: 0.0726
Jaccard similarity between Summer and Winter: 0.0189
Jaccard similarity between Autumn and Winter: 0.0189

Graph example visualization¶

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for ax, season in zip(axes, ['Summer', 'Autumn', 'Winter']):
    G = network_results[season]['graph']
    colors = nx.get_edge_attributes(G, 'color')
    pos = nx.spring_layout(G, k=0.2, iterations=50, seed=42)
    nx.draw_networkx_nodes(G, pos, ax=ax, alpha=0.2, node_color='grey', node_size=15)
    nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.2, edge_color=list(colors.values()))
    ax.set_title(season)
    ax.axis('off')

plt.tight_layout()
plt.show()

If you do not do the FDR and remove high taxa, the networks will be hard to distinguis visually, mostly because of all the false positive association hits. Quite surprisingly, out of 1000*1000 tables, we get a clear visual separation of the community interaction patterns.

Summary¶

None of this answers any of the possible question you want to ask, but with the growing literature applying complex network methods to metagenomics data in combination of unique unified sampling and sequencing approach by EMO-BON, this dataset is ideal to push the boundaries of monitoring using the above described tool set.

Factors can be any categorical variable from the metadata table. However alternative approach to color and evaluate the associations is to color the graph but some knowledge about the taxa, such r- or K- community species classification etc.

Dashboards

Integrate MGnify API for data access

Extra Analysis

GECCO analysis over time (EXAMPLE)