
Parametrized co-occurrence networks

CCMAR-Algarve

Complex networks provide metrics to identify keystone species and bridge/bottleneck species within a community. EMO-BON data are unified in methods but highly diverse with respect to the metadata (sampling station, temperature, season, etc.), which can be interpreted as control/treatment groups. To rephrase this as a particular question with respect to seasonality: is there a structural difference in the taxonomic build-up of the communities between spring, summer, autumn and winter?

We will construct association matrices between taxa over the series of sampling events grouped by the factor (season in this case).

Recipe

  1. Load data
  2. Remove high taxa (non-identified sequences)
  3. Pivot table from sampling events to taxa.
  4. Remove low abundance taxa
  5. Rarefy, or normalize
  6. Remove replicates
  7. Split to groups on chosen factor
  8. Calculate associations (Bray-Curtis dissimilarity, Spearman's correlation, etc.)
  9. False discovery rate correction
  10. Build and analyse network per group

1. Loading data and metadata

All the methods which are not part of the marine-omics package are defined below. Documentation and context for the marine-omics (momics) methods can be found on readthedocs.io.
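For completeness, a sketch of the third-party imports this notebook relies on. The momics helpers (get_full_metadata, get_valid_samples, enhance_metadata, get_data, remove_high_taxa, pivot_taxonomic_data, prevalence_cutoff, rarefy_table, split_metadata, split_taxonomic_data_pivoted, compute_bray_curtis, fdr_pvals, interaction_to_graph_with_pvals) come from the marine-omics package; their exact import paths are not shown here, see the readthedocs.io documentation.

# core third-party imports assumed throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from scipy.stats import spearmanr
# the momics helpers used below are imported from the marine-omics package
# (see its documentation for the exact module paths)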

# Load metadata
full_metadata = get_full_metadata()

# filter the metadata only for valid 181 samples
valid_samples = get_valid_samples()
full_metadata = enhance_metadata(full_metadata, valid_samples)

mgf_parquet_dfs = get_data()
ssu = mgf_parquet_dfs['ssu']
ssu.head()
assert full_metadata.shape[0] == 181

print(full_metadata.shape)
(181, 142)

Batches 1+2 contain 181 valid samples, which should be the first value printed above. Now we convert object columns to categorical variables:

for col in full_metadata.columns:
    # convert object-dtype columns to categorical
    if full_metadata[col].dtype == 'object':
        full_metadata[col] = full_metadata[col].astype('category')

# example of how to cross-check the above operation
isinstance(full_metadata['season'].dtype, pd.CategoricalDtype)
True

2. Remove high taxa

Because many sequences are not well classified, the information that a sequence belongs to superkingdom Bacteria is not helpful and does not add value to our analysis. Here we remove all taxa which are identified only at phylum level or higher.
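As an illustration only (the actual remove_high_taxa implementation and the ssu column names may differ), a filter of this kind can be sketched as:

# hypothetical sketch: drop rows whose deepest assigned rank is phylum or higher;
# the 'lowest_rank' column name is an assumption, not the real ssu schema
HIGH_RANKS = {'superkingdom', 'kingdom', 'phylum'}

def remove_high_taxa_sketch(df, rank_col='lowest_rank'):
    return df[~df[rank_col].isin(HIGH_RANKS)].copy()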

print("Original DataFrame shape:", ssu.shape)
ssu = remove_high_taxa(ssu, tax_level='phylum')
print("Filtered DataFrame shape:", ssu.shape)
Original DataFrame shape: (111320, 11)
Filtered DataFrame shape: (110286, 11)

3. Pivot tables

Pivoting converts the taxonomy table, in which each row is one taxon identified in one sample, into a table with one row per unique taxon, one column per sample, and abundances as values.
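Under assumed column names (ref_code for the sampling event, ncbi_tax_id for the taxon, abundance for the counts; only ncbi_tax_id is certain from the rest of this notebook), the operation is essentially a pandas pivot_table:

# sketch of the pivot with plain pandas; the real pivot_taxonomic_data keeps
# additional taxonomy columns and can also normalize/rarefy
pivot_sketch = ssu.pivot_table(
    index='ncbi_tax_id',    # one row per unique taxon
    columns='ref_code',     # one column per sampling event
    values='abundance',     # abundance per taxon per sample
    aggfunc='sum',
    fill_value=0,
).reset_index()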

ssu_pivot = pivot_taxonomic_data(ssu, normalize=None, rarefy_depth=None)
ssu_pivot.head()

pivot_taxonomic_data can also normalize or rarefy the table, but we split these steps here, because the removal of low-abundance taxa needs to happen before normalizing/rarefying.

4. Remove low prevalence taxa

The cutoff selected here is 10 %, which means that all taxa which do not appear in at least 10 % of the samples in the taxonomy table will be removed.
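A minimal sketch of such a prevalence filter (skip_columns mirrors the two leading taxonomy columns of the pivoted table; this is an illustration, not the momics implementation):

# keep taxa present (count > 0) in at least `percent` % of the samples
def prevalence_cutoff_sketch(df, percent=10, skip_columns=2):
    counts = df.iloc[:, skip_columns:]
    prevalence = (counts > 0).sum(axis=1) / counts.shape[1] * 100
    return df[prevalence >= percent].copy()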

ssu_filt = prevalence_cutoff(ssu_pivot, percent=10, skip_columns=2)
ssu_filt.head()

5. Rarefy or normalize

Many normalization techniques are readily available to normalize taxonomy data. For the purposes of co-occurrence networks, it is recommended to rarefy [refs]. Please correct if wrong.
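Rarefaction subsamples each sample without replacement down to a common sequencing depth, so that library sizes become comparable. A minimal sketch, assuming samples are in columns and that depth=None defaults to the smallest library size (an assumption about rarefy_table):

# sketch of rarefaction: subsample every sample column without replacement
rng = np.random.default_rng(42)

def rarefy_sketch(counts, depth=None):
    depth = depth or int(counts.sum(axis=0).min())       # default: smallest library size
    out = counts.copy()
    for col in counts.columns:
        sample = counts[col].to_numpy().astype(int)
        pool = np.repeat(np.arange(len(sample)), sample)  # one entry per read, labelled by taxon row
        keep = rng.choice(pool, size=depth, replace=False)
        out[col] = np.bincount(keep, minlength=len(sample))
    return out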

ssu_rarefied = ssu_filt.copy()
ssu_rarefied.iloc[:, 2:] = rarefy_table(ssu_filt.iloc[:, 2:], depth=None, axis=1)
ssu_rarefied.head()

6. Remove replicates

Anything which has a count > 1 is a replicate of some sort. Note that this information comes from the enhanced metadata, because replicate_info is not in the original metadata; it is a combination of observatory, sample type, date, and filter information.
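A sketch of how such a key could be assembled; the metadata column names used here (obs_id, env_package, collection_date, size_frac) are hypothetical and only illustrate the concatenation idea:

# hypothetical sketch: observatory + sample type + date + filter fraction
def make_replicate_info(meta):
    frac = meta['size_frac'].astype(str).replace('nan', 'NA')
    return (
        meta['obs_id'].astype(str)
        + '_' + meta['env_package'].astype(str)
        + '_' + meta['collection_date'].astype(str)
        + '_' + frac
    )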

It may be better to filter replicates before steps 4 and 5. Identify the replicates:

full_metadata['replicate_info'].value_counts()
replicate_info
VB_water_column_2021-08-23_NA           4
VB_water_column_2021-06-21_NA           4
VB_water_column_2021-12-17_NA           4
VB_water_column_2021-10-18_NA           4
AAOT_water_column_2021-10-15_3-200      2
                                       ..
OSD74_water_column_2021-08-31_3-200     1
HCMR-1_water_column_2021-06-28_3-200    1
OOB_soft_sediment_2021-06-08_NA         1
ESC68N_water_column_2021-11-04_0.2-3    1
ESC68N_water_column_2021-11-04_3-200    1
Name: count, Length: 90, dtype: int64

Remove the replicates:

filtered_metadata = full_metadata.drop_duplicates(subset='replicate_info', keep='first')
filtered_metadata.shape
(90, 142)

7. Split to groups on chosen factor

Split the metadata into groups based on the unique values of your factor.
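split_metadata presumably returns a dictionary mapping each factor level to the ref_codes of the samples in that group; a minimal sketch with a pandas groupby (the ref_code column name is an assumption):

# sketch: {factor level -> list of sample identifiers}
def split_metadata_sketch(meta, factor, id_col='ref_code'):
    return {
        level: group[id_col].tolist()
        for level, group in meta.groupby(factor, observed=True)
    }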

FACTOR = 'season'
groups = split_metadata(
    filtered_metadata,
    FACTOR
)

# remove groups which have fewer than 3 members (bad for your statistics :)
for groups_key in list(groups.keys()):
    print(f"{groups_key}: {len(groups[groups_key])} samples")
    if len(groups[groups_key]) < 3:
        del groups[groups_key]
        print(f"Warning: {groups_key} has less than 3 samples, therefore removed.")
Spring: 2 samples
Warning: Spring has less than 3 samples, therefore removed.
Summer: 35 samples
Autumn: 44 samples
Winter: 9 samples

Split pivoted taxonomy data according to groups of ref_codes defined above:

# pivot and filter the ssu
split_taxonomy = split_taxonomic_data_pivoted(
    ssu_rarefied,
    groups
)

for key, value in split_taxonomy.items():
    print(f"{key}: {value.shape[0]} rows, {value.shape[1]} columns")
# split_taxonomy.keys(),

print(ssu_rarefied.shape)
split_taxonomy['Autumn'].iloc[:, 2:].head()

8. Calculate associations (Bray-Curtis and Spearman's)

Bray-Curtis dissimilarity:

bray_taxa = {}
for factor, df in split_taxonomy.items():
    bray_taxa[factor] = compute_bray_curtis(df, direction='taxa')
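compute_bray_curtis with direction='taxa' presumably returns a taxa-by-taxa dissimilarity matrix; an equivalent sketch with scipy, assuming the first two columns of each split table hold taxonomy identifiers:

# sketch of a taxa-by-taxa Bray-Curtis dissimilarity matrix with scipy
from scipy.spatial.distance import pdist, squareform

def bray_curtis_taxa_sketch(df, skip_columns=2):
    counts = df.iloc[:, skip_columns:].to_numpy(dtype=float)   # rows = taxa, columns = samples
    dist = squareform(pdist(counts, metric='braycurtis'))
    return pd.DataFrame(dist, index=df['ncbi_tax_id'], columns=df['ncbi_tax_id'])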

Spearman correlation:

spearman_taxa = {}
# Compute Spearman correlation
for factor, df in split_taxonomy.items():
    corr, p_spearman = spearmanr(df.iloc[:, 2:].T)
    assert corr.shape == p_spearman.shape, "Spearman correlation and p-values must have the same shape."
    corr_df = pd.DataFrame(
        corr,
        index=df['ncbi_tax_id'],
        columns=df['ncbi_tax_id']
    )
    p_spearman_df = pd.DataFrame(
        p_spearman,
        index=df['ncbi_tax_id'],
        columns=df['ncbi_tax_id']
    )
    d = {
        'correlation': corr_df,
        'p_vals': p_spearman_df
    }
    spearman_taxa[factor] = d
    assert spearman_taxa[factor]['correlation'].shape == spearman_taxa[factor]['p_vals'].shape, "Spearman correlation and p-values must have the same shape."


spearman_taxa['Summer']['correlation'].head()

9. False discovery rate (FDR) correction

Calculate FDR

We have more than 1000 taxa in the tables, which means roughly 1000 * 1000 / 2 association values. That is far too many to trust that any individual low p-value is not just a statistical coincidence, therefore False Discovery Rate correction is absolutely essential here.
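fdr_pvals presumably applies a Benjamini-Hochberg correction; a sketch of such a correction over the unique (off-diagonal) pairs of a symmetric p-value matrix, using statsmodels (an assumption about the actual momics implementation):

# sketch of Benjamini-Hochberg FDR over a symmetric p-value matrix
from statsmodels.stats.multitest import multipletests

def fdr_pvals_sketch(p_df):
    n = p_df.shape[0]
    iu = np.triu_indices(n, k=1)               # correct each unique pair only once
    _, p_adj, _, _ = multipletests(p_df.to_numpy()[iu], method='fdr_bh')
    out = np.ones((n, n))                      # self-pairs kept at 1 (never significant)
    out[iu] = p_adj
    out[(iu[1], iu[0])] = p_adj                # mirror to the lower triangle
    return pd.DataFrame(out, index=p_df.index, columns=p_df.columns)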

for factor, d in spearman_taxa.items():
    pvals_fdr = fdr_pvals(d['p_vals'])
    spearman_taxa[factor]['p_vals_fdr'] = pvals_fdr

Let’s crosscheck one particular set of DFs for one factor value, which should all have the same shape:

season = 'Summer'
spearman_taxa[season]['correlation'].shape, spearman_taxa[season]['p_vals'].shape, spearman_taxa[season]['p_vals_fdr'].shape
((1312, 1312), (1312, 1312), (1312, 1312))

Display histograms and FDR-corrected p-values

For Bray-Curtis dissimilarity we plot only the histogram. Does it make sense to do FDR here? How?

# histogram of the correlation values for setting graph cutoffs
plt.figure(figsize=(10, 5))
for factor, df in bray_taxa.items():
    plt.hist(df.values.flatten(), bins=50, alpha=0.5, label=factor, log=True)

plt.title("Histogram of Correlation Values")
plt.xlabel("Correlation")
plt.ylabel("Frequency")
plt.legend()
plt.show()

We performed FDR on Spearman’s correlations, so both correlation values and FDR curves are displayed.

# histogram of the correlation values for setting graph cutoffs
plt.figure(figsize=(10, 5))
for factor, dict_df in spearman_taxa.items():
    plt.hist(dict_df['correlation'].values.flatten(), bins=50, alpha=0.5, label=factor, log=True)
plt.title("Histogram of Correlation Values")
plt.xlabel("Correlation")
plt.ylabel("Frequency")
plt.legend()
plt.show()

plt.figure(figsize=(10, 5))
for factor, dict_df in spearman_taxa.items():
    plt.scatter(
        dict_df['p_vals'].values.flatten()[::10],  # downsample for better visibility and speed
        dict_df['p_vals_fdr'].values.flatten()[::10],
        alpha=0.5,
        label=factor,
    )
plt.axvline(0.05, color='gray', linestyle='--', label='Raw p=0.05')
plt.axhline(0.05, color='black', linestyle='--', label='FDR p=0.05')
plt.xlabel('Raw p-value')
plt.ylabel('FDR-corrected p-value')
plt.title('Comparison of Raw and FDR-corrected p-values')
plt.legend()
plt.show()

The FDR curve shows a huge inflation of significant associations if they are taken from the raw p-values. How many? (The counts below use dict_df, which still refers to the last factor from the loop above.)

significant_raw = dict_df['p_vals'].values.flatten() < 0.05
significant_fdr = dict_df['p_vals_fdr'].values.flatten() < 0.05

removed = np.sum(significant_raw & ~significant_fdr)
significant = np.sum(significant_fdr)

print(f"Total associations significant before FDR: {np.sum(significant_raw)}")
print(f"Total associations significant after FDR: {significant}")
print(f"Associations significant before FDR but not after: {removed}")
Total associations significant before FDR: 96562
Total associations significant after FDR: 9442
Associations significant before FDR but not after: 87120

10. Build network for each factor value group

Nodes and edges for positive and negative associations (Example)

Example of the function that generates the list of nodes and the significant positive/negative edges from which the graph is constructed; a hedged sketch of what such a function could look like internally follows the output below.

for factor, dict_df in spearman_taxa.items():
    print(f"factor: {factor}")
    nodes, edges_pos, edges_neg = interaction_to_graph_with_pvals(
        dict_df['correlation'],
        dict_df['p_vals_fdr'],
        pos_cutoff=0.7,
        neg_cutoff=-0.5,
        p_val_cutoff=0.05
    )
    break
factor: Summer
Number of positive edges: 4391
Number of negative edges: 739
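A sketch of the thresholding such a helper performs: keep a pair as a positive edge if the correlation is at least pos_cutoff, as a negative edge if it is at most neg_cutoff, and in either case only if the FDR-corrected p-value is below p_val_cutoff (illustration only, not the momics source):

# sketch: nodes plus positive/negative edge lists from correlations and FDR p-values
def interaction_to_graph_sketch(corr, p_fdr, pos_cutoff=0.7, neg_cutoff=-0.5, p_val_cutoff=0.05):
    nodes = list(corr.index)
    c, p = corr.to_numpy(), p_fdr.to_numpy()
    edges_pos, edges_neg = [], []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if p[i, j] >= p_val_cutoff:
                continue
            if c[i, j] >= pos_cutoff:
                edges_pos.append((nodes[i], nodes[j]))
            elif c[i, j] <= neg_cutoff:
                edges_neg.append((nodes[i], nodes[j]))
    return nodes, edges_pos, edges_neg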

Network generation and metrics calculation

At the same time as we generate each graph, we calculate several descriptive metrics of the network (a sketch of saving the networks to disk follows the output below, if you need the graphs for later). We are going to look at degree centrality, betweenness centrality, and the total number of nodes and edges:

network_results = {}
for factor, dict_df in spearman_taxa.items():
    print(f"Factor: {factor}")
    nodes, edges_pos, edges_neg = interaction_to_graph_with_pvals(dict_df['correlation'], dict_df['p_vals_fdr'], pos_cutoff=0.7, neg_cutoff=-0.5, p_val_cutoff=0.05)

    G = nx.Graph(
        mode = factor,
    )

    G.add_nodes_from(nodes)
    G.add_edges_from(edges_pos, color='green')
    G.add_edges_from(edges_neg, color='red')

    network_results[factor] = {
        "graph": G,
        "nodes": nodes,
        "edges_pos": edges_pos,
        "edges_neg": edges_neg
    }

    colors = nx.get_edge_attributes(G, 'color')

    degree_centrality = nx.degree_centrality(G)

    network_results[factor]['degree_centrality'] = sorted(degree_centrality.items(),
                                                          key=lambda x: x[1],
                                                          reverse=True)[:10]
    
    betweenness = nx.betweenness_centrality(G)

    network_results[factor]['top_betweenness'] = sorted(betweenness.items(),
                                                    key=lambda x: x[1],
                                                    reverse=True)[:10]
    network_results[factor]['bottom_betweenness'] = sorted(betweenness.items(),
                                                           key=lambda x: x[1])[:10]
    network_results[factor]['total_nodes'] = G.number_of_nodes()
    network_results[factor]['total_edges'] = G.number_of_edges()
Factor: Summer
Number of positive edges: 4391
Number of negative edges: 739
Factor: Autumn
Number of positive edges: 1753
Number of negative edges: 1063
Factor: Winter
Number of positive edges: 9265
Number of negative edges: 177
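If you need the graphs later, a sketch of one way to persist them (GraphML keeps the node identifiers and the edge color attribute; the output folder name is arbitrary):

# sketch: persist each seasonal graph to disk for later reuse
from pathlib import Path

out_dir = Path('networks')
out_dir.mkdir(exist_ok=True)
for factor, res in network_results.items():
    nx.write_graphml(res['graph'], out_dir / f'co_occurrence_{factor}.graphml')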

Compile dataframes from the dictionary. The dataframe structure is not ideal yet, feel free to open a PR here.

DF = pd.DataFrame(columns=[FACTOR, 'centrality', 'top_betweenness', 'bottom_betweenness', 'total_nodes', 'total_edges'])
for factor, dict_results in network_results.items():
    DF = pd.concat([DF, pd.DataFrame([{
        FACTOR: factor,
        'centrality': dict_results['degree_centrality'],
        'top_betweenness': dict_results['top_betweenness'],
        'bottom_betweenness': dict_results['bottom_betweenness'],
        'total_nodes': dict_results['total_nodes'],
        'total_edges': dict_results['total_edges']
    }])], ignore_index=True)
DF.head()


df_centrality = pd.DataFrame(columns=[FACTOR, 'node', 'centrality'])

for factor, dict_results in network_results.items():
    for node, centrality in dict_results['degree_centrality']:
        df_centrality = pd.concat([df_centrality, pd.DataFrame({
            FACTOR: [factor],
            'node': [node],
            'centrality': [centrality]
        })], ignore_index=True)

# Set hierarchical (MultiIndex) with 'factor' above 'node'
# df_centrality.set_index([FACTOR, 'node'], inplace=True)
df_centrality

The node identifiers correspond to the NCBI tax_id. Is any node shared in the high centrality table?

sum(df_centrality['node'].value_counts() > 1)
1

The whole table looks like this:

DF.head()

Jaccard similarity of edge sets

This allows us to compare the networks pair-wise. Since we implemented coloring of the positively and negatively associated edges, we can perform this comparison three times (for the total, positive, and negative edge sets). Three seasons compared pair-wise give three numbers for each set. The Jaccard similarity of two edge sets A and B is |A ∩ B| / |A ∪ B|.

First we calculate all the necessary sets of edges:

# this is a pair-wise comparison of the networks
edges_summer_pos = set(network_results['Summer']['edges_pos'])
edges_summer_neg = set(network_results['Summer']['edges_neg'])
edges_summer_total = set(network_results['Summer']['graph'].edges())


edges_autumn_pos = set(network_results['Autumn']['edges_pos'])
edges_autumn_neg = set(network_results['Autumn']['edges_neg'])
edges_autumn_total = set(network_results['Autumn']['graph'].edges())

edges_winter_pos = set(network_results['Winter']['edges_pos'])
edges_winter_neg = set(network_results['Winter']['edges_neg'])
edges_winter_total = set(network_results['Winter']['graph'].edges())

And now the Jaccard similarities:

# pair-wise Jaccard similarity for the positive, total, and negative edge sets
edge_sets = {
    'edges identified with POSITIVE correlation': {
        'Summer': edges_summer_pos, 'Autumn': edges_autumn_pos, 'Winter': edges_winter_pos},
    'all edges': {
        'Summer': edges_summer_total, 'Autumn': edges_autumn_total, 'Winter': edges_winter_total},
    'edges identified with NEGATIVE correlation': {
        'Summer': edges_summer_neg, 'Autumn': edges_autumn_neg, 'Winter': edges_winter_neg},
}
season_pairs = [('Summer', 'Autumn'), ('Summer', 'Winter'), ('Autumn', 'Winter')]

for label, sets in edge_sets.items():
    print(f'Analyse {label}:')
    print('###########################################')
    for season_a, season_b in season_pairs:
        intersection = sets[season_a] & sets[season_b]
        union = sets[season_a] | sets[season_b]
        jaccard_similarity = len(intersection) / len(union)
        print(f"Jaccard similarity between {season_a} and {season_b}: {jaccard_similarity:.4f}")
    print()
Analyse edges identified with POSITIVE correlation:
###########################################
Jaccard similarity between Summer and Autumn: 0.0641
Jaccard similarity between Summer and Winter: 0.0171
Jaccard similarity between Autumn and Winter: 0.0131

Analyse all edges:
###########################################
Jaccard similarity between Summer and Autumn: 0.0660
Jaccard similarity between Summer and Winter: 0.0172
Jaccard similarity between Autumn and Winter: 0.0136

Analyse edges identified with NEGATIVE correlation:
###########################################
Jaccard similarity between Summer and Autumn: 0.0726
Jaccard similarity between Summer and Winter: 0.0189
Jaccard similarity between Autumn and Winter: 0.0189

Graph example visualization

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for ax, season in zip(axes, ['Summer', 'Autumn', 'Winter']):
    G = network_results[season]['graph']
    colors = nx.get_edge_attributes(G, 'color')
    pos = nx.spring_layout(G, k=0.2, iterations=50, seed=42)
    nx.draw_networkx_nodes(G, pos, ax=ax, alpha=0.2, node_color='grey', node_size=15)
    nx.draw_networkx_edges(G, pos, ax=ax, alpha=0.2, edge_color=list(colors.values()))
    ax.set_title(season)
    ax.axis('off')

plt.tight_layout()
plt.show()

If you do not apply the FDR correction and do not remove high taxa, the networks will be hard to distinguish visually, mostly because of all the false positive association hits. Quite surprisingly, out of 1000 * 1000 association tables, we get a clear visual separation of the community interaction patterns.

Summary

None of this answers any particular question you may want to ask, but with the growing literature applying complex-network methods to metagenomics data, in combination with the unique, unified sampling and sequencing approach of EMO-BON, this dataset is ideal for pushing the boundaries of monitoring with the tool set described above.

Factors can be any categorical variable from the metadata table. An alternative approach to coloring and evaluating the associations is to color the graph by some knowledge about the taxa, such as an r- or K-strategist classification of the community species, etc.