File Type Support and Input Data Preparation Guide

SGS is designed to support a wide range of data formats, including genome-mapped data formats and single-cell data formats:

SGS data format

For single-cell data, we primarily support two data formats: AnnData and MuData. MuData is specifically designed for complex multi-modal data storage. Detailed format descriptions are provided below.

Format | File description
anndata | AnnData is a Python library for storing single-cell data. It offers a flexible data structure for storing and manipulating single-cell transcriptomic data, with slots such as .X, .obs, .obsm, and .uns.
anndata.zarr | AnnData.zarr is built on the Zarr storage format, which offers efficient compression and scalable storage, making it well suited to large datasets.
mudata | MuData is a format for annotated multimodal datasets. It is native to Python but provides cross-language functionality via HDF5-based .h5mu files.
mudata.zarr | MuData.zarr is built on the Zarr storage format, a compressed and scalable format for efficient storage and retrieval of large datasets.

Single-cell data format conversion

SGS has developed two solutions to meet the needs for single-cell data format conversion:

Command-line Tools

SGS provides an R package and standalone scripts for converting single-cell data formats from the command line. To ensure compatibility with widely used single-cell analysis tools such as Seurat, Signac, ArchR, and Giotto, we have developed the R package SgsAnndata. This package converts analysis results from these tools into a format that SGS can load and visualize. For detailed instructions and data conversion guidelines, please refer to the GitHub page of SgsAnndata. We also provide standalone format conversion scripts that users can download and run.

For example, you can load the Giotto conversion script and run it with the following commands:

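# load the standalone conversion script
source("./GiottoToAnndata_2024.R")
# convert an existing Giotto object (here named `giotto`) and write the result to outpath;
# markerDF = NULL means no marker table is passed
giottoToAnnData(object = giotto,
                outpath = "/test_adata",
                markerDF = NULL)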

Attention!

For efficient marker table storage, we recommend storing the marker table in the .uns slot of the AnnData object under the key "markers", as in the following script:

#!/usr/bin/env python3
# adata is an existing AnnData object; marker_df is the marker table (a pandas DataFrame)
adata.uns["markers"] = marker_df

Marker table format requirements: the marker table must contain a header line and three required columns (names are case-insensitive):

  • Feature: the marker feature, such as a gene, peak, or DMR region.
  • Celltype: the cell type annotation (aliases: clusters, seurat clusters).
  • P-value: the P value of the marker.

Optional columns include avg_logFC, pct.1, pct.2, p_val_adj, etc. The three required columns must come first, in the order listed above; the remaining columns may appear in any order.
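The header rules above can be checked before a marker table is stored. The following is a minimal sketch, not part of SGS: the function name check_marker_columns and the constant REQUIRED_COLUMNS are hypothetical, and the check assumes the required columns are spelled exactly as in the list above. It does not handle the Celltype aliases (clusters, seurat clusters), because the exact spellings SGS accepts are not stated here.

```python
# Hypothetical helper, not part of SGS: checks a marker table header
# against the three required columns described above.
REQUIRED_COLUMNS = ("feature", "celltype", "p-value")  # assumed spellings


def check_marker_columns(columns):
    """Return a list of problems found in the header; an empty list means OK."""
    # Column names are compared case-insensitively, ignoring outer whitespace.
    names = [str(c).strip().lower() for c in columns]
    problems = []
    if len(names) < len(REQUIRED_COLUMNS):
        problems.append(
            f"expected at least {len(REQUIRED_COLUMNS)} columns, found {len(names)}"
        )
    # The required columns must occupy the first three positions, in order.
    for position, required in enumerate(REQUIRED_COLUMNS):
        actual = names[position] if position < len(names) else None
        if actual != required:
            problems.append(
                f"column {position + 1} should be '{required}', found {actual!r}"
            )
    return problems


ok_header = ["Feature", "Celltype", "P-value", "avg_logFC", "pct.1"]
swapped_header = ["Celltype", "Feature", "P-value"]
print(check_marker_columns(ok_header))
print(check_marker_columns(swapped_header))
```

With a pandas marker table, the same check could be run as check_marker_columns(list(marker_df.columns)) before assigning the table to the "markers" key of .uns.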

GEF format conversion

SGS also provides Python functions to convert GEF format to h5ad format, making it more convenient for uploading to SGS.

  • Converting GEF output to h5ad format files: Users can set the bin_size parameter to specify the size of each bin, which determines the spatial resolution of the data. Please ensure that you have installed the stereopy package:

import stereo as st
import warnings
warnings.filterwarnings('ignore')

# read the GEF file
data_path = './SS200000135TL_D1.tissue.gef'
data = st.io.read_gef(file_path=data_path, bin_size=50)
data.tl.raw_checkpoint()

# convert to AnnData and write the h5ad file; remember to set flavor to 'scanpy'
adata = st.io.stereo_to_anndata(data, flavor='scanpy', output='scanpy_out.h5ad')
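
To illustrate why bin_size affects spatial resolution, the sketch below merges spots on a coordinate grid into square bins and sums their counts. It is a simplified illustration only, not stereopy's implementation: the function bin_spots and the toy spot list are hypothetical, and each spot is reduced to a single count rather than per-gene expression.

```python
from collections import defaultdict


def bin_spots(spots, bin_size):
    """Sum the counts of spots that fall into the same bin_size x bin_size square.

    spots: iterable of (x, y, count) tuples with non-negative integer coordinates.
    Returns a dict mapping each bin's lower-left corner (x, y) to its summed count.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    binned = defaultdict(int)
    for x, y, count in spots:
        # Integer division snaps each coordinate to the corner of its bin.
        corner = ((x // bin_size) * bin_size, (y // bin_size) * bin_size)
        binned[corner] += count
    return dict(binned)


# Toy data: four spots given as (x, y, count).
spots = [(0, 0, 1), (10, 49, 2), (50, 0, 3), (99, 99, 4)]
coarse = bin_spots(spots, bin_size=50)  # larger bins: fewer, denser units
fine = bin_spots(spots, bin_size=1)     # bin_size=1 keeps every spot separate
print(len(coarse), len(fine))
```

A larger bin_size yields fewer bins with more counts each, at the cost of spatial detail; a smaller bin_size preserves detail but leaves each bin sparser.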
Contributors: xiatingting, sunjiahe0502, leeo