Skip to article frontmatterSkip to article content

GECCO analysis over time (EXAMPLE)

CCMAR-Algarve

This is far from perfect example, but the point being that that treating the EMO-BON samplings as complex networks, ie. graphs provides advantages on how to compare the results over time and space.

Every sequencing event is represented as a graph. Graphs can be constructed on both taxonomy 1 and functional tables. Depending on the question at hand, different graphs represent the sequencing result rK communities.

Here I create graphs from the secondary analysis (GECCO) runs to identify Biosynthetic Gene Clusters (BGCs).

Setup

I ran GECCO locally on the assembled contigs from the metaGOflow for one single station (Ria Formosa). 16 total sequencing analyses in batch 1 and batch 2 are distributed into replicates, soft sediment and water column samplings and 3 collection dates, which results in overall limited time series of maximum 3 time-points, but sufices for demonstration purposes.

gecco -v run --genome XXX.fa

GECCO version 0.9.10, default parameters.

Loading data and metadata

Batch 1+2 contain 181 valid samples, which should be the first value. The second value is number of metadata columns available for each samplling. Note that not all are obligatory, therefore the table contains many missing values.

assert df_metadata.shape[0] == 181

print(df_metadata.shape)
(181, 122)

Here I hardcode list of names for which GECCO was run, plus a color hash for the BGC coloring. Note that BGCs are colored, domains will be in grey, which are inherently different nodes. Each cluster is composed from 1 to many nodes, which can be shared between the BGCs and therefore I use them as nodes to connect the BGCs with edges.

data_list = [
    "HCFCYDSX5.UDI129", 
    "HVWGWDSX5.UDI102",
    "HCFCYDSX5.UDI130",
    "HVWGWDSX5.UDI103",
    "HCFCYDSX5.UDI131",
    "HVWGWDSX5.UDI115",
    "HCFCYDSX5.UDI132",
    "HVWGWDSX5.UDI127",
    "HMNJKDSX3.UDI234",
    "HVWGWDSX5.UDI150",
    "HMNJKDSX3.UDI236",
    "HVWGWDSX5.UDI162",
    "HMNJKDSX3.UDI246",
    "HVWGWDSX5.UDI174",
    "HMNJKDSX3.UDI250",
    "HVWGWDSX5.UDI186",
]

color_hash = {
    'NRP': 'red',
    'PKS': 'blue',
    'RiPPs': 'darkgreen',
    'Terpene': 'brown',
    'Unknown': 'black',
    'Polyketide': 'purple',
    'NRP;Polyketide': 'orange',
    'RiPP': 'darkblue',
    'Saccharide': 'darkred',
    'domain': 'lightgrey',
}

tables = bgc_tables_dict(data_list)

Graph methods

Below are methods I use to construct the network using networkx package. They include adding two types of nodes, color BGCs and grey domain nodes, create the graph edges (edges). generate_graph() puts all this into a single graph which has its own identifier constructed from replicate, observation ID, collection date, environmental material and filter used.

Then we have functions to plot the graph (plot_graph()) and domain count histogram (plot_domain_edges_counts()), respectively.

Example BGC graph

Example of a resulting graph can look like this one

# over all
name = data_list[0]

df = tables[name]
nodes, nodes_domains, edges = graph_nodes_edges(df, color_hash)

# contitutes date nad replicate number to make it unique
graph_id = create_graph_id(df_metadata, name)
G = generate_graph(nodes, nodes_domains, edges, graph_id)
plot_graph(G, color_hash)
Number of BGC nodes: 65
Number of nodes from domains: 137
Number of edges: 481
<Figure size 640x480 with 1 Axes>

Of course one cannot see anything useful, the same way as looking at the .csv tables underlaying this graph. However many descriptor metrics exist for such graphs, which describe connectivite, relatedness etc.

Graph analysis

Domain counter

Not related yet to the graph, we can plot the distribution of domains identified over all the BGCs

# count domain appearence
cnt_full, cnt_multiple = count_multiple_edges(edges)

# plot the histogram
plot_domain_edges_counts(cnt_multiple, f"Sample info: {G.graph['date']}")
<Figure size 640x480 with 1 Axes>

Graph metrics

Check out exhaustive list of networkx algorithms in the docs. Few random examples here

percentage of nodes which are bridges A bridge in a graph is an edge whose removal causes the number of connected components of the graph to increase. Equivalently, a bridge is an edge that does not belong to any cycle. Bridges are also known as cut-edges, isthmuses, or cut arcs.

print(f"{round(len(list(nx.bridges(G, root=None)))/G.number_of_nodes() * 100, 2)} %")
40.1 %

Maximum independent set An independent set is a set of vertices in a graph where no two vertices in the set are adjacent. The maximum independent set is the independent set of largest possible size for a given graph.

from networkx.algorithms.approximation import maximum_independent_set

ind_set = maximum_independent_set(G)
print(f"Size of Maximum independent set: {len(ind_set)}")
Size of Maximum independent set: 96

Average degree connectivity The average degree connectivity is the average nearest neighbor degree of nodes with degree k.

nx.average_degree_connectivity(G)
{5: 10.854545454545455, 3: 10.80952380952381, 21: 7.571428571428571, 6: 12.983333333333333, 11: 12.763636363636364, 7: 10.648351648351648, 16: 6.21875, 9: 11.177777777777777, 2: 9.637931034482758, 4: 18.416666666666668, 18: 11.777777777777779, 12: 8.116666666666667, 8: 11.075, 10: 12.85, 13: 11.948717948717949, 14: 13.285714285714286, 15: 8.222222222222221, 23: 9.91304347826087, 1: 9.779220779220779, 30: 9.466666666666667, 27: 8.555555555555555, 25: 10.52}

Average node connectivity The average connectivity bar{kappa} of a graph G is the average of local node connectivity over all pairs of nodes of G.

nx.average_node_connectivity(G)
1.1365942564405693

Time evolution

The above metrics are related to individual sequencing results, therefore they can be compared based on metadata (in the permanova style) or simple plotted over time or any other conditions.

Let’s try to display nx.average_node_connectivity(G) over time for our series of samples

First, construct all the graphs, plot graph and the histogram, calculte the connectivity and add those values to the DF (out of laziness).

## all graphs
df_metadata['connectivity'] = 0.0
df_metadata['bgc_nodes'] = 0
df_metadata['total_edges'] = 0
df_metadata['domain_nodes'] = 0
for name in data_list:
    df = tables[name]
    collection_date = create_graph_id(df_metadata, name)

    nodes, nodes_domains, edges = graph_nodes_edges(tables[name], color_hash)
    G = generate_graph(nodes, nodes_domains, edges, collection_date)
    cnt_full, cnt_multiple = count_multiple_edges(edges)

    plt.subplots(2, 1, figsize=(10, 8))
    plt.subplot(2, 1, 1)
    plot_domain_edges_counts(cnt_multiple, f"Saomple info: {G.graph['date']}")
    plt.subplot(2, 1, 2)
    plot_graph(G, color_hash)
    plt.tight_layout()
    plt.show()

    # calculate the average degree connectivity of graph.
    avg_conn = nx.average_node_connectivity(G)
    print(f"Average node connectivity for {name}: {avg_conn}")
    df_metadata.loc[df_metadata['reads_name_short'] == name, 'connectivity'] = avg_conn
    df_metadata.loc[df_metadata['reads_name_short'] == name, 'bgc_nodes'] = len(nodes)
    df_metadata.loc[df_metadata['reads_name_short'] == name, 'total_edges'] = len(edges)
    df_metadata.loc[df_metadata['reads_name_short'] == name, 'domain_nodes'] = len(nodes_domains)
Number of BGC nodes: 65
Number of nodes from domains: 137
Number of edges: 481
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HCFCYDSX5.UDI129: 1.1365942564405693
Number of BGC nodes: 41
Number of nodes from domains: 110
Number of edges: 267
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI102: 0.9924944812362031
Number of BGC nodes: 72
Number of nodes from domains: 183
Number of edges: 634
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HCFCYDSX5.UDI130: 1.756739231125521
Number of BGC nodes: 58
Number of nodes from domains: 142
Number of edges: 355
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI103: 1.2028643216080401
Number of BGC nodes: 96
Number of nodes from domains: 264
Number of edges: 753
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HCFCYDSX5.UDI131: 1.392200557103064
Number of BGC nodes: 90
Number of nodes from domains: 186
Number of edges: 592
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI115: 1.785612648221344
Number of BGC nodes: 116
Number of nodes from domains: 275
Number of edges: 835
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HCFCYDSX5.UDI132: 1.5790019017640502
Number of BGC nodes: 103
Number of nodes from domains: 200
Number of edges: 675
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI127: 1.7417874237754902
Number of BGC nodes: 83
Number of nodes from domains: 206
Number of edges: 520
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HMNJKDSX3.UDI234: 1.6891820453671664
Number of BGC nodes: 35
Number of nodes from domains: 91
Number of edges: 240
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI150: 1.391873015873016
Number of BGC nodes: 8
Number of nodes from domains: 44
Number of edges: 52
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HMNJKDSX3.UDI236: 0.36048265460030166
Number of BGC nodes: 92
Number of nodes from domains: 222
Number of edges: 626
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI162: 1.7973179218982114
Number of BGC nodes: 75
Number of nodes from domains: 194
Number of edges: 411
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HMNJKDSX3.UDI246: 1.4036786328580149
Number of BGC nodes: 74
Number of nodes from domains: 189
Number of edges: 438
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI174: 1.195251502046266
Number of BGC nodes: 38
Number of nodes from domains: 139
Number of edges: 339
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HMNJKDSX3.UDI250: 1.2358757062146892
Number of BGC nodes: 37
Number of nodes from domains: 107
Number of edges: 248
<Figure size 1000x800 with 2 Axes>
Average node connectivity for HVWGWDSX5.UDI186: 1.0968337218337219

Now take different subsets of the 16 Ria Formosa samples and plot the evolution of the connectivity over 3 sampling dates

df_to_plot = df_metadata[df_metadata['connectivity'] > 0.0]

# filter tables
water = df_to_plot[df_to_plot['env_package'] == 'water_column']
sediment = df_to_plot[df_to_plot['env_package'] == 'soft_sediment']

water_eukaryotes = water[water['size_frac'] == '3-200']
water_prokaryotes = water[water['size_frac'] == '0.2-3']
water_nan = water[water['size_frac'].isnull()]

sediment_eukaryotes = sediment[sediment['size_frac'] == '3-200']
sediment_prokaryotes = sediment[sediment['size_frac'] == '0.2-3']
sediment_nan = sediment[sediment['size_frac'].isnull()]

plt.figure(figsize=(10, 5))
plt.plot(water_eukaryotes['collection_date'], water_eukaryotes['connectivity'], 'o', ms=8, linewidth=0.3, label='water Eukaryotes')
plt.plot(water_prokaryotes['collection_date'], water_prokaryotes['connectivity'], 'o', ms=8,linewidth=0.3, label='water Prokaryotes')
plt.plot(sediment_eukaryotes['collection_date'], sediment_eukaryotes['connectivity'], 'o', ms=8,linewidth=0.3, label='sediment Eukaryotes')
plt.plot(sediment_prokaryotes['collection_date'], sediment_prokaryotes['connectivity'], 'o', ms=8,linewidth=0.3, label='sediment Prokaryotes')
plt.plot(sediment_nan['collection_date'], sediment_nan['connectivity'], 'o', ms=8,linewidth=0.3, label='sediment Nan filter')
plt.xlabel('Collection date')
plt.ylabel('Average node connectivity')
plt.title('Average node connectivity over time')
plt.xticks(rotation=30)
plt.show()
<Figure size 1000x500 with 1 Axes>

Additionally, we plot total amount of BGCs, domains and edges for each type of sample over time.

plt.figure(figsize=(10, 5))
plt.plot(water_eukaryotes['collection_date'], water_eukaryotes['bgc_nodes'], 'o', ms=8, label='water Eukaryotes')
plt.plot(water_prokaryotes['collection_date'], water_prokaryotes['bgc_nodes'], 'o', ms=8, label='water Prokaryotes')
plt.plot(sediment_eukaryotes['collection_date'], sediment_eukaryotes['bgc_nodes'], 'o', ms=8, label='sediment Eukaryotes')
plt.plot(sediment_prokaryotes['collection_date'], sediment_prokaryotes['bgc_nodes'], 'o', ms=8, label='sediment Prokaryotes')
plt.plot(sediment_nan['collection_date'], sediment_nan['bgc_nodes'], 'o', ms=8, label='sediment Nan filter')
plt.xlabel('Collection date')
plt.ylabel('Number of BGCs')
plt.title('Number of BGCs over time')
plt.xticks(rotation=30)
plt.tight_layout()
plt.legend()
plt.show()

plt.figure(figsize=(10, 5))
plt.plot(water_eukaryotes['collection_date'], water_eukaryotes['total_edges'], 'o', ms=8, label='water Eukaryotes')
plt.plot(water_prokaryotes['collection_date'], water_prokaryotes['total_edges'], 'o', ms=8, label='water Prokaryotes')
plt.plot(sediment_eukaryotes['collection_date'], sediment_eukaryotes['total_edges'], 'o', ms=8, label='sediment Eukaryotes')
plt.plot(sediment_prokaryotes['collection_date'], sediment_prokaryotes['total_edges'], 'o', ms=8, label='sediment Prokaryotes')
plt.plot(sediment_nan['collection_date'], sediment_nan['total_edges'], 'o', ms=8, label='sediment Nan filter')
plt.xlabel('Collection date')
plt.ylabel('Number of edges')
plt.title('Number of edges over time')
plt.xticks(rotation=30)
plt.tight_layout()
plt.legend()
plt.show()

plt.figure(figsize=(10, 5))
plt.plot(water_eukaryotes['collection_date'], water_eukaryotes['domain_nodes'], 'o', ms=8, label='water Eukaryotes')
plt.plot(water_prokaryotes['collection_date'], water_prokaryotes['domain_nodes'], 'o', ms=8, label='water Prokaryotes')
plt.plot(sediment_eukaryotes['collection_date'], sediment_eukaryotes['domain_nodes'], 'o', ms=8, label='sediment Eukaryotes')
plt.plot(sediment_prokaryotes['collection_date'], sediment_prokaryotes['domain_nodes'], 'o', ms=8, label='sediment Prokaryotes')
plt.plot(sediment_nan['collection_date'], sediment_nan['domain_nodes'], 'o', ms=8, label='sediment Nan filter')
plt.xlabel('Collection date')
plt.ylabel('Number of domain nodes')
plt.title('Number of domain nodes over time')
plt.xticks(rotation=30)
plt.tight_layout()
plt.legend()
plt.show()
<Figure size 1000x500 with 1 Axes><Figure size 1000x500 with 1 Axes><Figure size 1000x500 with 1 Axes>

Summary

None of this answers any of the possible question you want to ask, but with the growing literature applying complex network methods to metagenomics data in combination of unique unified sampling and sequencing approach by EMO_BON, this dataset is ideal to push the boundaries of monitoring using such tool set.

References
  1. Liu, Z., Ma, A., Mathé, E., Merling, M., Ma, Q., & Liu, B. (2020). Network analyses in microbiome based on high-throughput multi-omics data. Briefings in Bioinformatics, 22(2), 1639–1655. 10.1093/bib/bbaa005