Tumor microenvironment interplay amid microbial community, host gene expression and pathological features elucidates cancer heterogeneity and prognosis risk

microenvironment interplay amid microbial community, host gene expression and pathological features elucidates


PUBLIC SUMMARY
■ There is a link between intratumoral microbiota, host, and histopathology images in cancer patients.
■ Patterns of association between intratumoral microbiota, host, and histopathology images exhibit tumor heterogeneity.
■ Associations between intratumoral microbiota, host, and histopathology images underlie the prognostic risk of cancers.

INTRODUCTION
5][6][7] Studies have observed correlations between microbes and host gene expression in the TME of several types of cancers. 8,91][12] Additionally, several studies have applied computational image analysis of tissue samples to classify stroma, immune, and tumor epithelial cells. 13,14However, at the pancancer level, the association among the microbial community, host gene expression, and pathological images remains unknown.
In this work, we took advantage of the comprehensiveness of The Cancer Genome Atlas (TCGA) database, which covers large-scale tumor samples of numerous cancer types, and multi-omics data, including numerous genes, to perform a comprehensive pan-cancer level study of the microbe-expressionimage interplay in human tumors.We classified each cancer based on the clustering of color and morphological features of the hematoxylin and eosin (H&E)-stained slides extracted in a fully automated process.Next, we compared the differences and biological associations between subtypes in the tumor microbiome, TME, and histological features.Finally, the results of these analyses were compared across 13 cancer types to reveal the tumor heterogeneity from the perspective of the interrelationship between the microbial community, host gene expression, and pathological images.

Tumor microbiome profiles of 13 cancer types
Our cohort consisted of 6,095 tumor samples covering 13 types of cancer (Figure 1A).A matching H&E-stained slide sample, tissue microbiome sample, and host transcriptome sample were available for each patient.Pseudomonas was the genus with the highest relative abundance in 10 tumor tissues, while Mycobacterium was the dominant bacterium of CRC, STAD, and UCEC (Figure 1B).Notably, Xanthomonas was overabundant in liver tumor tissue.Additionally, we found that STAD and ESCA tumors had the highest microbial richness and the lowest Shannon index (Figure S1a).Tumors with high microbial richness included UCEC and BRCA.Consistent with previous findings, 15 we observed significant differences in the tissue microbial communities of the 13 cancer types (Figure S1b; R2 = 21%, p = 0.001).

Two patterns of tissue microbiome-host interactions of 13 cancer types
Our results revealed two patterns of tissue microbiome-host interactions in the 13 cancer types (Figure 1C).Procrustes analysis showed a significant correlation between microbiome and host expression in 11 of the 13 cancer types (p < 0.05).For ESCA and PAAD, we found no significant microbe-host interplay pattern (p > 0.05).The Mantel test was used to verify the robustness of the results.Consistent with the results of the Procrustes analysis, ESCA and PAAD showed weak correlation between tissue microbe and host (Figure 1D).Consequently, from the perspective of the microbe-host interaction, the 13 cancer types could be divided into two categories, representing strong and weak microbe-host association patterns.
Next, two non-western cohorts were collected to validate the tissue microbiome-host interaction patterns.First, the AC-ICAM cohort, consisting of 246 Qatari patients with colon adenocarcinoma (COAD), was used to validate the microbe-host interaction model of COAD.Procrustes analysis showed a significant correlation between tissue microbes and host gene expression (Figure S2a).Another independent cohort of 32 Chinese patients with liver cancer also showed the same results with TCGA (Figure S2b).Consequently, two tissue microbiome-host interaction patterns presented by TCGA with independent cohorts demonstrated generalization ability.

Stratification of cancer types based on the interaction among the microbial community, host gene expression, and pathological image in the TME
Next, we aimed to explore whether tissue microbe-host interactions manifest themselves in histology slides.Here, a fully automatic process, involving section standardization, tissue contour recognition, and feature extraction, was employed to extract color and texture features from the histology slides (Figure 2).Additionally, we calculated the stromal and immune score, which quantify the TME by the ESTIMATE algorithm based on the mRNA expression data. 16First, for each cancer, we performed partitioning around medoids (PAM) clustering for pathological image samples.The optimal clustering number k was obtained based on the silhouette width.Then, we compared the differences in tumor microbes and the TME between the k groups.
Using this methodology, these 13 cancers were further categorized into the following three subtypes: MEI1, MEI2, and MEI3.MEI1, including PAAD and ESCA, was characterized by no significant microbe-host interaction; MEI2, including BRCA, LUAD, LIHC, KIRC, PRAD, UCEC, BLCA, and LGG, was characterized by the fact that the tumor microbes significantly affected the TME, but the changes in the TME were not reflected in the histology slides; and MEI3, including STAD, CRC, and LUSC, was characterized by the fact that the tumor microbes significantly influenced the TME, which was further reflected in the histology slides.

Association patterns in stomach adenocarcinoma (STAD)
STAD was the first cancer found to belong to the MEI3 pattern.PAM clustering divided patients with STAD into two clusters (STAD-T1 and STAD-T2) (Figure 3A-C).STAD-T1 and STAD-T2 showed significant differences in microbial community in tumor (Figure 3D, p = 0.001).The microbial richness of STAD-T2 was significantly higher than that of STAD-T1 (Figure 3E, p < 0.001).Moreover, the abundance of Bacteroidetes in STAD-T2 was significantly higher than that in STAD-T1 (Figure 3F, p < 0.05), while that of Firmicutes was the opposite (p < 0.001).In addition, we found that the Firmicutes/Bacteroidetes (F/B) ratio of STAD-T1 was significantly lower than that of STAD-T2 (p < 0.01).Next, we identified the differentially enriched bacteria between STAD-T1 and STAD-T2 (Figure 3G).The relative abun-  Host mRNA Bacteria (Genus)

REPORT
The Innovation Life 1(2): 100028, September 18, 2023 3 dance of Bordetella and Staphylococcus was increased in STAD-T1, while the genus Streptococcus was more abundant in STAD-T2.At the TME level, no significant difference was found between STAD-T1 and STAD-T2 in terms of the immune score (Figure S3, p = 0.41), but STAD-T1 was significantly higher than STAD-T2 in terms of the stromal score (Figure 3H, p < 0.01).The abundance of endothelial cells, a type of stromal cell, was significantly higher in STAD-T1 than that in STAD-T2 (p < 0.05).Considering the interactions between host cells and tumor microbes, 17,18 Spearman correlation analysis was performed for stromal score and differential bacteria (Figure 3J).The genera Bordetella, Staphylococcus, and Acidovorax were significantly positively correlated with stromal score (p < 0.01), while these three genera were enriched in STAD-T1.Additionally, Pseudothermotoga and Denitrovibrio enriched in STAD-T2 had a significantly negative correlation with stromal score (p < 0.001).Thus, the finding that the stromal score of STAD-T1 was significantly higher than that of STAD-T2 confirmed the role of tumor microbes on non-cancerous cells in the TME.
At the TME level, the stromal score of LUSC-T1 was significantly higher than that of LUSC-T2 (Figure 4H, p < 0.05).The relative abundances of Bordetella, Pseudomonas, and Acidovorax, all of which were highly enriched in LUSC-T2, were significantly positively correlated with stromal score (Figure 4I, p < 0.01).Additionally, bacteria with a significantly negative correlation with stromal score were overabundant in LUSC-T2, such as the genera Sanguibacteroides and Waddlia (p < 0.001).Consequently, for LUSC, variation in the TME affected by tumor microbes was further reflected in the pathological images and was related to patient prognosis.

Association patterns in colorectal cancer (CRC)
As with STAD and LUSC, PAM clustering divided patients with CRC (the final cancer belonging to MEI3) into two clusters (Figure 5A-C).The tumor microbial community was significantly different between the two clusters (Figure 5D, p < 0.001).Specifically, the microbial richness of CRC-T1 was significantly higher than that of CRC-T2 (Figure 5E, p < 0.001), while the Shannon index was higher in CRC-T2 (Figure 5F, p < 0.0001).A total of 117 differential bacteria at the genus level were identified (Figure 5G, Supplementary Table S1), among which, 43 were enriched in CRC-T1 and 74 were overabundant in CRC-T2.Additionally, the stromal score of CRC-T2 was significantly higher than that of CRC-T1 (Figure 5H, p < 0.05).Specifically, the abundances of endothelial cells (p < 0.01) and fibroblasts (p < 0.05) were significantly higher in CRC-T2 than in CRC-T1 (Figure 5I).Spearman correlation analysis showed that the bacteria that were significantly positively associated with stromal score were all enriched in CRC-T2 (Figure 5G, p < 0.01).Finally, there was no statistical significance (Figure 5J, p = 0.42), but the  survival of patients in CRC-T2 was worse than that of patients in CRC-T1.

Independent validation of MEI3 association pattern with matching images and host expression samples for COAD
To independently verify the MEI3 association pattern, we obtained 104 patients with COAD with matching images and host expression samples.Similarly, PAM clustering based on image features divided the patients into two groups (Figure S4a-c).Then, we compared the stromal score of these two groups of patients and found a significant difference (Figure S4d, p < 0.05).Taken together, our findings were not inherent in TCGA dataset, but

Prognostic potentials of MEI patterns
To determine whether MEI patterns were associated with differences in prognosis, we next examined the overall survival of patients with each of these 13 cancer types.First, we plotted survival curves, which showed significant differences in patient survival (Figure 6A, p < 0.0001).Specifically, compared to other cancers, patients with PAAD and ESCA had the worst survival, with a median survival of 607 and 801 days, respectively (Figure 6B, Supplementary Table S2).We then plotted survival curves for the three MEI patterns, which confirmed the significant differences in survival for the three patterns of cancer (Figure 6C, p < 0.0001).MEI1 had the worst prognosis, followed by LDA SCORE (log 10)  MEI3, while MEI2 had the best prognosis, with a median survival of 666, 1,849, and 3,409 days, respectively (Figure 6D).These associations with prognosis indicated that the tumor MEI classification had clinical relevance.
We next explored the non-tumor components of the MEIs in more detail.We estimated the stromal fraction to be 1 minus tumor purity.To estimate how much of the stromal fraction was due to immune cell infiltration, we plotted the immune score versus the stromal fraction (Figure 6E).In all of the MEIs, the stromal fraction was defined largely by immune cell infiltration.Additionally, MEI2 had the highest median tumor purity (Figure 6F), while MEI1 had the highest stromal score (Figure 6G).

DISCUSSION
Our investigation into the associations between tumor microbes, host gene expression, and pathological images of patients with pan-cancer in multiple independent cohorts revealed a microbe-expression-image-based classification system for pan-cancer that allowed the categorization of tumors into one of three distinct MEI patterns, each with marked differences in microbeexpression-image interplay and outcome associations.Moreover, these three patterns could offer actionable knowledge for clinical decision-making and pave the way for precision medicine.
Our results emphasized the importance of understanding the associations among the microbial community, host gene expression, and pathological image in the TME for a deep understanding of inter-tumor heterogeneity, and highlighted the potential of quantifying tumor MEI as a prognostic marker for multiple human cancers.Hoadley et al. found that only one-third of tumors exhibited homogeneity, while the remaining two-thirds exhibited varying degrees of heterogeneity. 19It is worth noting that while both cancers in MEI1 belonged to the digestive system, a mixture of the digestive, respiratory, and reproductive system was present in MEI2, while the digestive and respiratory systems were also included in MEI3.Our results suggested that tumors originating from different systems could have many commonalities at the multimodal association level, whereas those originating from the same system could have distinct association maps.We found associations between MEI patterns and survival in 13 cancers, and the tumor MEI could be further applied to the classification of other solid tumors to verify its prognostic association.Despite in-depth research and new treatment strategies, most cancers remain incurable, largely due to the high heterogeneity of the tumor. 20espite recent advances in the development of sophisticated techniques to assess tumor heterogeneity, the clinical application of tumor heterogeneity remains limited.Our study improved the understanding of the mechanisms underlying tumor heterogeneity, although the underlying prognostic mechanisms require further investigation.Future development of effective treatment strategies for heterogeneous tumors will be required, although this will not be possible without improved preclinical models that truly reproduce the heterogeneity of human disease, both between individuals and between tumors.Our work elucidated tumor heterogeneity at multiple levels from the microbial community, host gene expression, and pathological images, all of which underlie differences in cancer risk in humans.Multimodal data fusion is probably the only way to advance precision oncology. 21The limited view of disease heterogeneity from single modality data may not offer sufficient information to stratify patients and capture all of the events that occur in response to treatments. 22Our work provided new insights into inter-and intratumor heterogeneity from a multimodal perspective of tumor microbes, host genes, and pathological images.Phenotypes derived from medical images can be used as substitutes for molecular phenotypes. 23,24Furthermore, by analyzing the biological links between the three modalities data, we believe that image phenotypes can define tumor tissue microbial phenotypes and TME phenotypes for a broad spectrum of cancers.
In conclusion, our findings provided a framework for tumor subtyping across multiple types of cancers based on the association among the microbial community, host gene expression, and pathological image in the TME.Our findings indicated that MEI classification demarcated human cancers with specific tumor microbe-expression-image interplay patterns and prognoses.Our data provided biologically and clinically relevant information on the association of the microbial community, host gene expression, and pathological images in human tumors, which could assist with providing more precise diagnosis and prognosis assessment in the clinic.

MATERIALS AND METHODS Study population and data collection
A total of 6,095 patients with cancer from TCGA database, covering 13 cancer types, were included in this study.The microbial abundance data of the tumor tissues for various cancer types were downloaded from Rob Knight' s article. 25Rob Knight's team developed a Kraken TCGA microbial-detection pipeline that used a ultrafast Kraken algorithm to map sequence readings not aligning to the human reference genome to known bacterial, viral, and archaea microbial genomes. 26The cancer microbiome data used in this study and the metadata were available at ftp://ftp.microbio.me/pub/cancer_micro-biome_analysis/.The gene expression quantification by RNA-Seq, as well as the histopathological slides for each cancer patient were downloaded from https://portal.gdc.cancer.gov/.Data were analyzed with the R (version 3.4.0)and R Bioconductor packages.

Validation dataset and analysis
Tissue samples from 32 Chinese patients with liver cancer were collected to independently verify tumor microbiome and host interactions.The raw RNA-Seq sequencing data (PRJNA576155) were downloaded from the Sequence Read Archive (SRA).For taxonomic assignment, Kraken2 was used to obtain the microbial abundance data with the standard Kraken2 database.To verify the results of LIHC in TCGA, species matching the results of Poore et al was used for subsequent analysis. 25For gene expression quantification, HISAT2 was used for RNA-Seq alignment, 27 with hg19 as the reference genome.SAM-tools was used for data conversion and sorting of the alignment results. 28Finally, the gene expression was quantified by StringTie, 29 with hg19 as the reference transcriptome, and the FPKM values were extracted as the expression levels.In addition, the AC-ICAM cohort, 30 comprising 246 patients with COAD, was used to validate the tissue microbiome-host interaction patterns in COAD.Clinical metadata and gene expression data were obtained from cBioportal.The tissue microbiome abundances were downloaded from FigShare.
To independently verify the difference in the TME between the classifications based on image feature clustering, the pathological images and matched gene expression data of 104 patients with COAD were downloaded from https://www.cancerimagingarchive.net/.

Feature extraction of pathological images
For each whole slide image (WSI), our pipeline started with an automatic partition of the area of tissues.WSI was read into memory at a downsampled resolution (e.g., 32×), converted from RGB to HSV color space.A binary mask of the tissue area (foreground) was calculated based on the threshold of the saturated channel of the image after median blurring to smooth the edges, followed by additional morphological closure to fill in small gaps and holes.Subsequently, the detected approximate contour of the foreground object was filtered and stored for downstream processing based on the region threshold, while making the segmentation mask of each slide available for optional visual inspection (see the article by Lu et al. for more details. 31).Next, the color features on four components R, G, B, and grayscale for each WSI were calculated, including the mean, variance, and skewness, to describe the color characteristics.Additionally, Hu's invariant moments on the four components R, G, B, and grayscale of each image were extracted by cv2.HuMoments provided by OpenCV. 32Hu introduced a set of seven orthogonal image moments that are insensitive to translation, scaling, and rotation (see the article by Hu for more details on the characteristics of Hu moments. 32).

Clustering analysis of pathological image features
All of the pathological images were clustered based on the pathological image features using partitioning around medoids (PAM) clustering with the "cluster" package in R. We calculated the silhouette width when the clustering number was 1 to 10.The silhouette width ranges from −1 to +1, where higher values indicate that a point matches well with its own cluster and poorly with its neighboring clusters.If most of the points have high silhouette width (high average silhouette width), the current clustering method is appropriate, whereas if many points have low or negative values (low average silhouette width), the current clustering method is considered unsuitable.The optimal PAM clustering number based on pathological image features was the clustering number when the highest average silhouette width calculated with package "fpc" in R was obtained.

Procrustes analysis
The correlation between tumor microbes and host gene expression was verified by Procrustes analysis.Jaccard dissimilarities among samples were calculated using the R function "vegdist" in the "vegan" package based on the onco-microbial community at the genus level and host gene expression, respectively.The R function "procrustes" in the "vegan" package was used to conduct the Procrustes analysis.The correlation, significance, and Procrustes sum of squares were calculated by the R function "protest" in the "vegan" package, with 999 permutations.Procrustes correlations between oncomicrobial community and host were visualized using the "ggplot2" package.

Mantel test
The Mantel test was used to examine the correlation between the tumor microbial community and host gene expression with the "vegan" package in R. Jaccard dissimilarities were calculated using the R function "vegdist" based on the tumor microbial community and host gene expression, respectively.Then, the Spearman correlation coefficient and p-value were calculated by the R function "Mantel", with 999 permutations.

Microbial diversity analysis
The richness and Shannon indices were used to characterize the α-diversity of the tumor microbial community.The Shannon index was calculated using the R function "diversity" in the "vegan" package.Wilcoxon rank-sum test was used to test the differences in α-diversity between groups, while βdiversity was represented by measurements of the Bray-Curtis distance.Permutational multivariate analysis of variance (PERMANOVA) was used to examine differences in beta-diversity with the "vegan" package in R.

TME characterized by stromal score and immune score
The stromal score and immune score of all the samples in TCGA were available from https://bioinformatics.mdanderson.org/estimate/.ESTIMATE is a method that uses gene expression characteristics to infer the ratio of stromal cells and immune cells in a tumor sample. 16In addition, the abundance of eight immune cell types, namely, CD3+ T cells, CD8+ T cells, cytotoxic lymphocytes, NK cells, B lymphocytes, cells originating from monocytes (monocytic lineage), myeloid dendritic cells, neutrophils, and two stromal cells (endothelial cells and fibroblasts), were calculated using the R function "MCPcounter.estimate" in the "MCPcounter" package. 33f

REPORT
The Innovation Life 1(2): 100028, September 18, 2023 9 Linear discriminant analysis Effect Size (LEfSe) analysis, correlation analysis, and survival analysis LEfSe analysis was performed to identify the significantly different bacteria in terms of abundance between groups.The threshold value of the linear discriminant analysis (LDA) score was 2 when identifying the differential bacteria of the two groups obtained by clustering the image features of a single cancer.Correlation analysis was performed for the differential microbes and stromal scores using the R function "corr.test" in the "psych" package, with p < 0.05 as the significance threshold.Kaplan-Meier (KM) analysis was performed to plot the survival curves with the "survminer" and "survival" packages in R, and the significance was determined by log-rank test.

Figure 1 .
Figure 1.Tumor microbial profiles of the 13 cancer types and their host interactions (a) Matched tumor microbiota and host transcriptome sample sizes for the 13 cancer types.(b) Pie chart showing the top five bacteria at the genus level in relative abundance for each cancer.(c) Procrustes analysis of the association between the microbial community and that of the host gene expression in each cancer.The correlation coefficient and p-values were calculated by the "protest" function, where n represents the sample size.(d) Histogram showing the correlation coefficients of the tumor microbiota and host gene expression calculated by the Mantel test for 13 the cancer types.The red column indicates a p-value < 0.01, and the gray column indicates a p-value > 0.01.

Figure 2 .
Figure 2. Schematic overview of the workflow employed to generate the microbe-expression-image triplet (MEI-triplet) interplay pattern classification platform Microbial abundance data were obtained by Kraken's annotation of TCGA RNA-seq sequencing data.The tumor microenvironment characterized by the stromal score and immune score, was calculated using the ESTIMATE algorithm based on TCGA mRNA expression data.The tissue regions of pathological images were identified and the features were extracted.Then, PAM clustering was conducted based on the extracted features.The 13 cancers were subdivided into three MEI patterns by comparing the characteristics of the TME and the microbial characteristics among the clusters obtained by PAM clustering (please see the Results for further details).

Figure 3 .
Figure 3. Differences and associations of STAD subtypes at the level of tumor microbes and tumor microenvironment (TME) based on pathological image features (a) Two subtypes of STAD obtained by PAM clustering based on image features.(b) Silhouette width with clustering numbers of 1 to 10. (c) The silhouette width of each sample in the two STAD subtypes and the average silhouette width of the two subtypes.(d) PCoA showing the difference in the tumor microbiota between STAD-T1 and STAD-T2.The p-value was generated by a PERMANOVA test.(e) Boxplot showing the difference in tumor microbial richness between the two subtypes.***p-value < 0.001.(f) Boxplots showing the differences in the relative abundance of Firmicutes, Bacteroidetes, and the Firmicutes/Bacteroidetes ratio between the two subtypes.(g) Significantly different bacteria identified by LEfSe between the two clusters; an LDA score of 2 was the threshold.(h) Boxplot showing the differences in stromal score between the two subtypes.(i) Boxplot showing the differences in the abundance of endothelial cells between the two subtypes.(j) Spearman correlation between significantly different bacteria in terms of abundance and stromal score.**p-value < 0.01.The number represents the correlation coefficient, and the red and blue lines indicate positive and negative correlations, respectively.

Figure 4 .
Figure 4. Differences in, and associations of LUSC subtypes at the tumor microbe and TME levels based on pathological image features (a) Two subtypes of LUSC obtained by PAM clustering based on image features.(b) Silhouette width with clustering numbers of 1 to 10. (c) The silhouette width of each sample in the two LUSC subtypes and the average silhouette width of the two subtypes.(d) PCoA showing the difference in tumor microbiota between LUSC-T1 and LUSC-T2.The p-value was generated by a PERMANOVA test.(e) Boxplot showing the difference in tumor microbial richness between LUSC-T1 and LUSC-T2.****p-value < 0.0001.(f) Significantly different bacteria identified by LEfSe between the two clusters; an LDA score of 2 was the threshold.(g) Overall survival of patients in LUSC-T1 and LUSC-T2.The p-value was generated by a log-rank test.(h) Boxplot showing the differences in stromal score between the two subtypes.(i) Spearman correlation between significantly different bacteria in terms of abundance and stromal score; the red and blue lines indicate positive and negative correlations, respectively.

JFigure 5 .
Figure 5. Differences in, and associations of CRC subtypes at the level of the tumor microbes and TME based on pathological image features (a) Two subtypes of CRC obtained by PAM clustering based on image features.(b) Silhouette width with clustering numbers of 1 to 10. (c) The silhouette width of each sample in the two CRC subtypes and the average silhouette width of the two subtypes.(d) PCoA plot showing the difference in tumor microbiota between CRC-T1 and CRC-T2.The p-value was generated by a PERMANOVA test.Boxplot showing the differences in (e) tumor microbial richness and (f) Shannon index between CRC-T1 and CRC-T2.(g) Heatmap of genera with significant differences between the two clusters identified by LEfSe; the red lines indicate a significant positive correlation between these bacteria and stromal scores.*p-value < 0.05, **p value < 0.01, ***p-value < 0.001.(h) Boxplot showing the differences in stromal score between CRC-T1 and CRC-T2.(i) Boxplot showing the differences in the abundances of endothelial cells and fibroblasts between CRC-T1 and CRC-T2.(j) Overall survival of patients in CRC-T1 and CRC-T2.The p-value was generated by a log-rank test.

6 GFigure 6 .
Figure 6.Prognostic potential of the MEI-triplet interplay pattern (a) Overall survival of patients with 13 cancer types.(b) Bar plot showing the median survival of the 13 cancer types.(c) Overall survival of patients with the three MEI-triplet interplay patterns.The p-value was generated by a log-rank test.(d) Bar plot showing the median survival of the three MEI-triplet interplay patterns.(e) Linear regression analysis showing the correlation between immune score and stromal fraction among the three MEI patterns.Stromal fraction = 1-tumor purity.Boxplot showing the differences in the (f) tumor purity and (g) stromal score among the three MEI patterns.Wilcoxon test, ****p value < 0.0001, ns: No significant difference.