
Parametrized co-occurrence networks

CCMAR-Algarve

Complex networks provide metrics to identify keystone species and bridge/bottleneck species within a community. EMO-BON data are unified in methods but highly diverse with respect to the metadata (sampling station, temperature, season, etc.), which can be interpreted as control/treatment groups. Phrased as a particular question with respect to seasonality: is there a structural difference in the taxonomic buildup of the communities between spring, summer, autumn and winter?

We will construct association matrices between taxa over the series of sampling events grouped by the factor (season in this case).

Recipe

  1. Load data
  2. Remove high taxa (non-identified sequences)
  3. Pivot table from sampling events to taxa.
  4. Remove low abundance taxa
  5. Rarefy or normalize
  6. Remove replicates
  7. Split to groups on chosen factor
  8. Calculate associations (Bray-Curtis dissimilarity, Spearman’s correlation, etc.)
  9. False discovery rate correction
  10. Build and analyse network per group

1. Loading data and metadata

All the methods which are not part of the marine-omics package are defined below. Documentation and context for the marine-omics (momics) methods can be found on readthedocs.io.

valid_samples = get_valid_samples()

# load, filter and enhance metadata
full_metadata, mgf_parquet_dfs = load_and_clean(valid_samples=valid_samples)

ssu = mgf_parquet_dfs['ssu']
ssu.head()
assert full_metadata.shape[0] == 181

print(full_metadata.shape)
(181, 98)

Batches 1 and 2 contain 181 valid samples, which should be the first value printed above. Now we convert object columns to categorical variables.

# Identify object columns
categorical_columns = sorted(full_metadata.select_dtypes(include=['object', "boolean"]).columns)

# Convert them all at once to category
full_metadata = full_metadata.astype({col: 'category' for col in categorical_columns})

if not isinstance(full_metadata['season'].dtype, pd.CategoricalDtype):
    raise ValueError("Column 'season' is not categorical (object dtype).")

2. Remove high taxa

First, infer missing higher taxa with an unclassified_ placeholder whenever some lower taxon was identified. A filled taxonomy row can look like this:

| superkingdom | kingdom | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Archaea | unclassified_Thaumarchaeota | Thaumarchaeota | unclassified_Cenarchaeales | Cenarchaeales | Cenarchaeaceae | None | None |
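The placeholder filling itself is done by fill_taxonomy_placeholders from momics. As a rough, hedged sketch of the idea (not the momics implementation), the illustrative fill_missing_ranks below walks the ranks from most to least specific and fills each missing higher rank with unclassified_ plus the nearest identified lower-rank name:

import pandas as pd

def fill_missing_ranks(df, ranks):
    # Illustrative only: for every row, walk species -> superkingdom and fill
    # missing ranks with "unclassified_" + the nearest identified lower rank.
    df = df.copy()
    for idx, row in df.iterrows():
        last_known = None
        for rank in reversed(ranks):
            value = row[rank]
            if pd.isna(value):
                if last_known is not None:
                    df.at[idx, rank] = f"unclassified_{last_known}"
            else:
                last_known = value
    return df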

taxonomy_cols = ['superkingdom', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
ssu_filled = fill_taxonomy_placeholders(ssu, taxonomy_cols)
ssu_filled.head(10)

Because many sequences are not well classified, knowing only that a sequence belongs to the superkingdom Bacteria adds no information value to our analysis. Here we remove all taxa which are identified only at phylum level or higher.
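A hedged sketch of the filtering rule (remove_high_taxa from momics does the actual work and also logs unmapped taxa): keep only the rows that carry information below the chosen rank. The helper drop_taxa_above is illustrative only.

def drop_taxa_above(df, ranks, tax_level='phylum'):
    # Illustrative only: keep rows with at least one rank below `tax_level`
    # that is identified and is not an "unclassified_" placeholder.
    lower_ranks = ranks[ranks.index(tax_level) + 1:]
    identified = df[lower_ranks].notna() & ~df[lower_ranks].astype(str).apply(
        lambda col: col.str.startswith('unclassified_'))
    return df[identified.any(axis=1)]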

print("Original DataFrame shape:", ssu.shape)
ssu_filled = remove_high_taxa(ssu_filled, taxonomy_ranks=taxonomy_cols, tax_level='phylum')
print("Filtered DataFrame shape:", ssu.shape)
Original DataFrame shape: (111320, 9)
100%|██████████| 218/218 [00:52<00:00,  4.17it/s]

INFO | momics.taxonomy | Number of bad taxa at phylum: 9
INFO | momics.taxonomy | Unmapped taxa at phylum: ['Candidatus_Altiarchaeota', 'Hemichordata', 'Aquificae', 'Coprothermobacterota', 'Candidatus_Verstraetearchaeota', 'Orthonectida', 'Chrysiogenetes', 'Parabasalia', 'Candidatus_Rokubacteria']
Filtered DataFrame shape: (111320, 9)

3. Pivot tables

Pivoting converts the taxonomy table, whose rows list each taxon identified in every sample, into a table with one row per unique taxon, one column per sample, and abundances as values.
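Under the hood this is essentially a pandas pivot. A minimal sketch, where the column names taxon, ref_code and abundance are placeholders and not necessarily those expected by pivot_taxonomic_data:

# Hypothetical column names, for illustration only.
wide = long_table.pivot_table(
    index='taxon', columns='ref_code', values='abundance',
    aggfunc='sum', fill_value=0
)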

ssu_pivot = pivot_taxonomic_data(ssu_filled)
ssu_pivot.head()

4. Remove low prevalence taxa

The cutoff selected here is 10 %, which means that all taxa which do not appear in at least 10 % of all samples in the taxonomy table will be removed. This does not refer to low-abundance taxa within single samples.
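A minimal sketch of such a prevalence filter on the pivoted table (taxa in rows, samples in columns); prevalence_cutoff from momics may be implemented differently:

def prevalence_filter(pivot_df, percent=10):
    # Keep taxa present (abundance > 0) in at least `percent` % of the samples.
    min_samples = pivot_df.shape[1] * percent / 100.0
    prevalent = (pivot_df > 0).sum(axis=1) >= min_samples
    return pivot_df[prevalent]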

ssu_filt = prevalence_cutoff(ssu_pivot, percent=10, skip_columns=0)
ssu_filt.head()

5. Rarefy or normalize

Many normalization techniques are readily available to normalize taxonomy data. For the purposes of co-occurrence networks, rarefying is recommended [refs]. Please correct if wrong.
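Rarefying subsamples each sample's reads without replacement down to a common depth, so that differences in sequencing effort do not create spurious co-occurrences. A minimal numpy sketch for a single sample column; rarefy_table from momics handles the whole table, and the helper below only illustrates the subsampling idea:

import numpy as np

def rarefy_column(counts, depth, seed=0):
    # Draw `depth` reads without replacement from one sample's count vector.
    rng = np.random.default_rng(seed)
    reads = np.repeat(np.arange(len(counts)), counts.astype(int))
    picked = rng.choice(reads, size=depth, replace=False)
    return np.bincount(picked, minlength=len(counts))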

ssu_rarefied = rarefy_table(ssu_filt, depth=None, axis=1)
ssu_rarefied.head()

6. Remove replicates

Anything which has a replicate info count > 1 is a replicate of some sort. Note that this information comes from the enhanced metadata, because replicate info is not in the original metadata. It is a combination of observatory, sample type, date, and filter information.
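A hedged sketch of how such a composite key could be assembled; the column names observatory, sample_type, collection_date and filter_fraction are placeholders, not necessarily the actual enhanced-metadata columns:

# Hypothetical column names, for illustration only.
replicate_key = (
    full_metadata['observatory'].astype(str) + '_'
    + full_metadata['sample_type'].astype(str) + '_'
    + full_metadata['collection_date'].astype(str) + '_'
    + full_metadata['filter_fraction'].astype(str)
)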

Consider filtering replicates before steps 4 and 5. Identify replicates:

full_metadata['replicate info'].value_counts()
replicate info
VB_water_column_2021-08-23_NA           4
VB_water_column_2021-06-21_NA           4
VB_water_column_2021-12-17_NA           4
VB_water_column_2021-10-18_NA           4
AAOT_water_column_2021-10-15_3-200      2
                                       ..
OSD74_water_column_2021-08-31_3-200     1
HCMR-1_water_column_2021-06-28_3-200    1
OOB_soft_sediment_2021-06-08_NA         1
ESC68N_water_column_2021-11-04_0.2-3    1
ESC68N_water_column_2021-11-04_3-200    1
Name: count, Length: 90, dtype: int64

Remove the replicates:

filtered_metadata = full_metadata.drop_duplicates(subset='replicate info', keep='first')
filtered_metadata.shape
(90, 98)

7. Split to groups on chosen factor

Split metadata to groups based on unique values in your factor.

FACTOR = 'season'
groups = split_metadata(
    filtered_metadata,
    FACTOR
)

# remove groups which have fewer than 3 members (bad for your statistics :)
for groups_key in list(groups.keys()):
    print(f"{groups_key}: {len(groups[groups_key])} samples")
    if len(groups[groups_key]) < 3:
        del groups[groups_key]
        print(f"Warning: {groups_key} has less than 3 samples, therefore removed.")
Spring: 2 samples
Warning: Spring has less than 3 samples, therefore removed.
Summer: 35 samples
Autumn: 44 samples
Winter: 9 samples

Split pivoted taxonomy data according to groups of ref_codes defined above:

# pivot and filter the ssu
split_taxonomy = split_taxonomic_data_pivoted(
    ssu_rarefied,
    groups
)

for key, value in split_taxonomy.items():
    print(f"{key}: {value.shape[0]} rows, {value.shape[1]} columns")
# split_taxonomy.keys(),

print(ssu_rarefied.shape)
split_taxonomy['Autumn'].iloc[:, 2:].head()
Loading...

8. Calculate associations (Bray-Curtis and Spearman’s)

Bray-Curtis dissimilarity:

bray_taxa = {}
for factor, df in split_taxonomy.items():
    bray_taxa[factor] = compute_bray_curtis(df, skip_cols=0, direction='taxa')
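For reference, pairwise Bray-Curtis dissimilarities between taxa can be computed directly with scipy; a minimal sketch assuming taxa in rows and samples in columns (compute_bray_curtis may differ in orientation and output format):

import pandas as pd
from scipy.spatial.distance import pdist, squareform

def bray_curtis_between_taxa(df):
    # Rows are taxa, columns are samples; returns a square taxa x taxa matrix.
    dist = squareform(pdist(df.values, metric='braycurtis'))
    return pd.DataFrame(dist, index=df.index, columns=df.index)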

Spearman correlation:

spearman_taxa = spearman_from_taxonomy(split_taxonomy)

spearman_taxa['Summer']['correlation'].head()
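spearman_from_taxonomy returns, per factor value, a dictionary holding at least a 'correlation' and a 'p_vals' matrix. A minimal scipy sketch of that computation, assuming taxa in rows (illustrative, not the momics code):

import pandas as pd
from scipy.stats import spearmanr

def spearman_matrices(df):
    # Rows are taxa, columns are samples; correlate taxa against each other.
    corr, pvals = spearmanr(df.values, axis=1)
    return {
        'correlation': pd.DataFrame(corr, index=df.index, columns=df.index),
        'p_vals': pd.DataFrame(pvals, index=df.index, columns=df.index),
    }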

9. False discovery rate (FDR) correction

Calculate FDR

We have hundreds of taxa in each table, which means tens of thousands of pairwise association values. That is too many to trust that any individual low p-value is not just a statistical coincidence. Therefore a False Discovery Rate correction is absolutely essential here.
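A sketch of such a correction using the Benjamini-Hochberg procedure from statsmodels, applied to the unique pairs (upper triangle) of a symmetric p-value matrix; fdr_pvals from momics may handle the matrix differently:

import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

def bh_correct_matrix(p_vals, alpha=0.05):
    # Correct only the unique pairs (upper triangle, excluding the diagonal).
    mat = p_vals.values
    iu = np.triu_indices_from(mat, k=1)
    _, p_adj, _, _ = multipletests(mat[iu], alpha=alpha, method='fdr_bh')
    out = np.full_like(mat, np.nan, dtype=float)
    out[iu] = p_adj
    out[(iu[1], iu[0])] = p_adj  # mirror to keep the matrix symmetric
    return pd.DataFrame(out, index=p_vals.index, columns=p_vals.columns)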

PVAL = 0.05  # Define a p-value threshold for FDR correction
for factor, d in spearman_taxa.items():
    pvals_fdr = fdr_pvals(d['p_vals'], pval_cutoff=PVAL)
    spearman_taxa[factor]['p_vals_fdr'] = pvals_fdr

Let’s cross-check one particular set of DataFrames for one factor value; they should all have the same shape:

factor = 'Summer'
spearman_taxa[factor]['correlation'].shape, spearman_taxa[factor]['p_vals'].shape, spearman_taxa[factor]['p_vals_fdr'].shape
((256, 256), (256, 256), (256, 256))

Display histograms of FDR-corrected p-values

For Bray-Curtis dissimilarity we plot only the histogram. Does it make sense to do FDR? How?

plot_association_histogram(bray_taxa)
values in the triangle: 32640
values in the triangle: 35245
values in the triangle: 23436
<Figure size 1000x500 with 1 Axes>

We performed FDR on Spearman’s correlations, so both correlation values and FDR curves are displayed.

plot_association_histogram(spearman_taxa)
plot_fdr(spearman_taxa, PVAL)
values in the triangle: 32640
values in the triangle: 35245
values in the triangle: 23436
<Figure size 1000x500 with 1 Axes>
### FACTOR: Summer ###
Significant associations before 11526 and after 2074
Associations significant before FDR but not after: 9452
### FACTOR: Autumn ###
Significant associations before 14862 and after 2872
Associations significant before FDR but not after: 11990
### FACTOR: Winter ###
Significant associations before 5395 and after 508
Associations significant before FDR but not after: 4887
<Figure size 1000x500 with 1 Axes>

The FDR curves show a huge inflation of significant associations when they are taken from the raw p-values. The prints above show how many associations were discarded.

10. Build network for each factor value group

Nodes and edges for positive and negative associations (Example)

Code example for generating a list of nodes and significant positive/negative edges to construct the graph from.

for factor, dict_df in spearman_taxa.items():
    print(f"factor: {factor}")
    nodes, edges_pos, edges_neg = interaction_to_graph_with_pvals(
        dict_df['correlation'],
        dict_df['p_vals_fdr'],
        pos_cutoff=0.7,
        neg_cutoff=-0.5,
        p_val_cutoff=0.05
    )
    break
factor: Summer
Number of positive edges: 312
Number of negative edges: 90

Network generation and metrics calculation

At the same time as we generate the graph, we calculate several descriptive metrics of the network (implement saving of the networks if you need the graphs for later). We are going to look at degree centrality, the highest- and lowest-betweenness nodes, and the total number of nodes and edges.
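A hedged sketch of the kind of computation build_interaction_graphs performs for each group, using networkx (the actual momics implementation and its return values may differ slightly):

import networkx as nx

def graph_metrics(nodes, edges_pos, edges_neg, top_n=10):
    # Build an undirected graph from the significant edges and compute metrics.
    G = nx.Graph()
    G.add_nodes_from(nodes)
    G.add_edges_from(edges_pos, sign='positive')
    G.add_edges_from(edges_neg, sign='negative')

    centrality = sorted(nx.degree_centrality(G).items(),
                        key=lambda kv: kv[1], reverse=True)
    betweenness = sorted(nx.betweenness_centrality(G).items(),
                         key=lambda kv: kv[1], reverse=True)
    return {
        'graph': G,
        'degree_centrality': centrality[:top_n],
        'top_betweenness': betweenness[:top_n],
        'bottom_betweenness': betweenness[-top_n:],
        'total_nodes': G.number_of_nodes(),
        'total_edges': G.number_of_edges(),
    }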

network_results = build_interaction_graphs(spearman_taxa)
Factor: Summer
Number of positive edges: 1770
Number of negative edges: 90
Factor: Autumn
Number of positive edges: 1382
Number of negative edges: 144
Factor: Winter
Number of positive edges: 503
Number of negative edges: 5

Jaccard similarity of edge sets

Jaccard similarity evaluates pairwise similarity of constructed networks.
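Jaccard similarity of two edge sets is the size of their intersection divided by the size of their union; a minimal sketch of what pairwise_jaccard_lower_triangle computes for every pair of groups:

def jaccard_similarity(edges_a, edges_b):
    # Treat edges as unordered pairs so that (u, v) and (v, u) match.
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0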

For positive edges:

df_jaccard = pairwise_jaccard_lower_triangle(network_results, edge_type='edges_pos')
df_jaccard

For negative edges:

df_jaccard = pairwise_jaccard_lower_triangle(network_results, edge_type='edges_neg')
df_jaccard

For all edges altogether:

df_jaccard = pairwise_jaccard_lower_triangle(network_results, edge_type='all')
df_jaccard

Compile dataframes from the results dictionary. The dataframe structure is not ideal yet; feel free to open a PR here.

Here are two functions to generate quite silly dataframes of results:

def network_results_df(network_results, factor_name):
    out = pd.DataFrame(columns=[factor_name, 'centrality', 'top_betweenness', 'bottom_betweenness', 'total_nodes', 'total_edges'])
    for factor, dict_results in network_results.items():
        out = pd.concat([out, pd.DataFrame([{
            factor_name: factor,
            'centrality': dict_results['degree_centrality'],
            'top_betweenness': dict_results['top_betweenness'],
            'bottom_betweenness': dict_results['bottom_betweenness'],
            'total_nodes': dict_results['total_nodes'],
            'total_edges': dict_results['total_edges']
        }])], ignore_index=True)

    return out


def create_centrality_dataframe(network_results, factor_name):
    df_centrality = pd.DataFrame(columns=[factor_name, 'node', 'centrality'])

    for factor, dict_results in network_results.items():
        for node, centrality in dict_results['degree_centrality']:
            df_centrality = pd.concat([df_centrality, pd.DataFrame({
                factor_name: [factor],
                'node': [node],
                'centrality': [centrality]
            })], ignore_index=True)

    return df_centrality

Generate the dataframes

df_results = network_results_df(network_results, FACTOR)
df_centrality = create_centrality_dataframe(network_results, FACTOR)
/tmp/ipykernel_2583/3260657370.py:21: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df_centrality = pd.concat([df_centrality, pd.DataFrame({

The node identifiers correspond to the NCBI tax_id. Is any node shared between the high-centrality tables of the different groups?

sum(df_centrality['node'].value_counts() > 1)
6

The whole results table looks like this:

df_results.head()

Graph example visualization

plot_network(network_results, spearman_taxa)
<Figure size 1800x600 with 3 Axes>

If you do not perform the FDR correction and do not remove the high taxa, the networks are hard to distinguish visually, mostly because of all the false-positive association hits. Quite surprisingly, despite starting from such large association tables, we get a clear visual separation of the community interaction patterns.

Streamline the whole notebook

It is now simple to define a high-level method which takes a dictionary of all the parameters as input and runs all of the above functions.

def network_analysis(params):
    # prepare metadata
    if params['drop_metadata_duplicates']:
        full_metadata = params['metadata'].drop_duplicates(subset='replicate info', keep='first')
    else:
        full_metadata = params['metadata']

    ssu_filled = fill_taxonomy_placeholders(params["data"], params["taxonomy_cols"])

    print("Original DataFrame shape:", ssu_filled.shape)
    ssu_filled = remove_high_taxa(ssu_filled,
                                  taxonomy_ranks=params["taxonomy_cols"],
                                  tax_level=params["remove_high_taxa_level"])
    print("Filtered DataFrame shape:", ssu_filled.shape)

    ssu_pivot = pivot_taxonomic_data(ssu_filled)

    del ssu_filled
    ssu_filt = prevalence_cutoff(ssu_pivot, percent=params['prevalence_cutoff'], skip_columns=0)
    del ssu_pivot
    ssu_rarefied = rarefy_table(ssu_filt, depth=params['rarefy_depth'], axis=1)
    del ssu_filt

    groups = split_metadata(full_metadata, params['factor'])

    for groups_key in list(groups.keys()):
        print(f"{groups_key}: {len(groups[groups_key])} samples")
        if len(groups[groups_key]) < 3:
            del groups[groups_key]
            print(f"Warning: {groups_key} has less than 3 samples, therefore removed.")

    split_taxonomy = split_taxonomic_data_pivoted(ssu_rarefied, groups)

    spearman_taxa = spearman_from_taxonomy(split_taxonomy)

    for factor, d in spearman_taxa.items():
        pvals_fdr = fdr_pvals(d['p_vals'], pval_cutoff=params['fdr_cutoff'])
        spearman_taxa[factor]['p_vals_fdr'] = pvals_fdr

    plot_fdr(spearman_taxa, params['fdr_cutoff'])
    plot_association_histogram(spearman_taxa)

    network_results = build_interaction_graphs(
        spearman_taxa,
        pos_cutoff=params["network_pos_cutoff"],
        neg_cutoff=params["network_neg_cutoff"],
        p_val_cutoff=params["fdr_cutoff"],
    )

    print('Jaccard Similarity (Positive Edges):')
    display(pairwise_jaccard_lower_triangle(network_results, edge_type='edges_pos'))
    print('Jaccard Similarity (Negative Edges):')
    display(pairwise_jaccard_lower_triangle(network_results, edge_type='edges_neg'))
    print('Jaccard Similarity (All Edges):')
    display(pairwise_jaccard_lower_triangle(network_results, edge_type='all'))

    res_df = network_results_df(network_results, params['factor'])
    display(res_df)

    df_centrality = create_centrality_dataframe(network_results, params['factor'])
    display(df_centrality)

    plot_network(network_results, spearman_taxa)

Define your inputs and run the whole pipeline

params = {
    "data": ssu,
    "metadata": full_metadata,
    "taxonomy_cols": ['superkingdom', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'],
    "remove_high_taxa_level": 'phylum',
    'prevalence_cutoff': 10,
    'rarefy_depth': None,
    'drop_metadata_duplicates': True,
    'factor': 'environment (material)',
    'fdr_cutoff': 0.05,
    "network_pos_cutoff": 0.5,
    "network_neg_cutoff": -0.3,
}

# run everything
# network_analysis(params)

Summary

None of this answers any particular question you might want to ask, but with the growing literature applying complex-network methods to metagenomics data, combined with the unique, unified sampling and sequencing approach of EMO-BON, this dataset is ideal for pushing the boundaries of monitoring with the tool set described above.

Factors can be any categorical variable from the metadata table. However, an alternative approach to evaluating the associations is to color the graph by some prior knowledge about the taxa, such as an r- or K-strategist classification of the community species.