# Tutorial 1: A Detailed Quick Start Guide (DLPFC dataset) 

This tutorial demonstrates the core functionalities of SEPAR (Spatial gene Expression PAttern Recognition) using the DLPFC (Dorsolateral Prefrontal Cortex) dataset. SEPAR is designed to identify spatial patterns in spatial transcriptomics data through graph-regularized matrix factorization.  

## Key Features  
- Spatial pattern identification  
- Pattern-specific gene detection   
- Spatially variable gene (SVG) identification  
- Gene expression refinement
- Spatial domain clustering

## Data Availability  

The DLPFC dataset used in this tutorial can be accessed through multiple sources:  

- **Dataset**: The processed DLPFC data is hosted in the [spatialLIBD repository](https://github.com/LieberInstitute/spatialLIBD)  
- **Manual Annotations**: Expert-curated layer annotations are available in [supplementary data folder from STAGATE[1]](https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing)  


```python
import pandas as pd
import numpy as np
import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import normalized_mutual_info_score as nmi
from SEPAR_model import SEPAR
```

## Loading and Preparing Data

We'll work with the DLPFC dataset. Here's how to load and prepare the data:


```python
# Define parameters for the specific slice
slice_idx = 151507
algorithm_params = {
    "151507": {"nslt": 3000, "alpha": 0.7, "l1": 0.01, "lam": 0.3}
}
params = algorithm_params.get(str(slice_idx), {})

# Load annotation data
anno_df = pd.read_csv('dataset/DLPFC/barcode_level_layer_map.tsv', sep='\t', header=None)

# Read Visium data
adata = sc.read_visium(path="dataset/DLPFC//%d" % slice_idx,
                      count_file="%d_filtered_feature_bc_matrix.h5" % slice_idx)
adata.var_names_make_unique()

# Process annotations
anno_df1 = anno_df.iloc[anno_df[1].values.astype(str) == str(slice_idx)]
anno_df1.columns = ["barcode", "slice_id", "layer"]
anno_df1.index = anno_df1['barcode']
adata.obs = adata.obs.join(anno_df1, how="left")
adata = adata[adata.obs['layer'].notna()]

# Initialize SEPAR
n_cluster = len(adata.obs['layer'].unique())
separ = SEPAR(adata.copy(), n_cluster=n_cluster)
separ.adata.obs['Region'] = separ.adata.obs['layer']
```

## Data Preprocessing

SEPAR includes several preprocessing steps:


```python
# Preprocess the data
separ.preprocess(min_cells=40, n_top_genes=6000)

# Compute spatial graph
separ.compute_graph()

# Select features using Moran's I
separ.select_morani(nslt=params['nslt'])

# Compute weights
separ.compute_weight(n_cluster=n_cluster)
```

    After filtering:  (4221, 12597)


![png](Tutorial1_files/Tutorial1_5_1.png)


    Counting moran's i ...
    
    Finish selecting


#### Parameters Explanation:

- `min_cells`: Minimum number of cells required for a gene to be considered
- `n_top_genes`: Number of highly variable genes to select
- `nslt`: Number of spatially variable genes to select using Moran's I statistic
- `n_cluster`: Number of expected spatial domains/clusters

## 1. Running SEPAR Algorithm


```python
# Run SEPAR algorithm
separ.separ_algorithm(
    r=30,              # Number of spatial patterns
    alpha=0.7,         # Graph regularization weight
    beta=0.01,         # Sparsity penalty weight (previously l1)
    gamma=0.3          # Pattern orthogonality weight (previously lam)
)
```

    Processing iterations: 100%|██████████| 100/100 [00:24<00:00,  4.03it/s]


#### Parameters Explanation:
- `r`: Integer
  - Number of spatial patterns to identify

- `alpha`: Float
  - Weight for graph regularization term
  - Controls spatial smoothness of patterns
  - Typical range: 0.1-1.0

- `beta`: Float (previously l1)
  - Weight for sparsity penalty
  - Controls pattern distinctness and localization
  - Typical range: 0.01-0.1

- `gamma`: Float (previously lam)
  - Weight for pattern orthogonality term
  - Controls redundancy between patterns
  - Typical range: 0.1-1.0

#### Results:
- `separ.Wpn`: Spatial patterns matrix (W in the paper)
- `separ.Hpn`: Gene loading matrix (H in the paper)
- `separ.err_list`: Reconstruction errors during optimization

## 2. Identifying Pattern-Specific Genes


```python
# Identify pattern-specific genes  
pattern_genes = separ.identify_pattern_specific_genes(  
    n_patterns=30,    # Number of patterns to consider  
    threshold=0.3     # Threshold for gene-pattern association  
)  
```

#### Parameters Explanation:  
- `n_patterns`: Integer (default=30)  
  - Number of top patterns to analyze  
  - Should be less than or equal to total patterns from decomposition  
- `threshold`: Float (default=0.3)  
  - Threshold for determining pattern-specific genes  
  - Higher values yield more stringent gene selection  

#### Results Storage:  
- `self.pattern_specific_mask`: List of boolean arrays  
  - Length equals to `n_patterns`  
  - Each array indicates which genes are specific to that pattern  
- `self.genes_per_pattern`: Number of specific genes per pattern  
- `self.Wpnn`, `self.Hpnn`: Normalized pattern and gene loading matrices 

#### Return:  
- `pattern_genes`: List of lists  
  - Length equals to `n_patterns`  
  - Each sublist contains genes specific to that pattern  
  - Genes are sorted by specificity score   

### Pattern Visualization and Pattern-specific Genes

First, let's visualize the identified spatial patterns and analyze their characteristics:


```python
# Visualize all spatial patterns
sim_slt = separ.sim_res(separ.Wpn, separ.Hpn, separ.Xt.T)
sim_argsort = np.argsort(-sim_slt)

num_patterns = 30
plt.figure(dpi=200, figsize=(24, 7))
for i in range(30):
    ii = sim_argsort[i]
    plt.subplot(3, np.int32(num_patterns/3), i + 1)
    plt.scatter(separ.loc[:, 0], -separ.loc[:, 1], 
                c=separ.Wpn[:, ii].reshape(-1, 1), 
                s=1.2, cmap='Reds')
    plt.axis('off')
    plt.title(f'Pattern {i + 1}: {separ.genes_per_pattern[ii]} genes', 
              fontsize=12)
plt.tight_layout()
plt.show()
```


![png](Tutorial1_files/Tutorial1_12_0.png)


```python
# Create and display sorted pattern-specific genes table  
print("\nTop Pattern-Specific Genes (Sorted by Pattern Significance):")  
print("-" * 100)  
print(f"{'Pattern':11} | {'Significance':11} | {'#Genes':8} | {'Top Genes'}")  
print("-" * 100)  

for rank, pattern_idx in enumerate(sim_argsort[:30]):  
    genes = pattern_genes[pattern_idx]  
    if len(genes) == 0:  
        gene_str = "None"  
    else:  
        gene_str = ", ".join(genes[:10])  
        if len(genes) > 10:  
            gene_str += "..."  
    
    print(f"Pattern {rank+1:<3} | {sim_slt[pattern_idx]:.4f}      | {len(genes):<8} | {gene_str}")  
print("-" * 100)  
```

    
    Top Pattern-Specific Genes (Sorted by Pattern Significance):
    ----------------------------------------------------------------------------------------------------
    Pattern     | Significance | #Genes   | Top Genes
    ----------------------------------------------------------------------------------------------------
    Pattern 1   | 0.9418      | 942      | ATP1A3, HINT1, UCHL1, CISD1, VGF, CHCHD10, TUBB2A, NDUFA4, COX4I1, SNAP25...
    Pattern 2   | 0.8872      | 24       | GFAP, GALNT15, CST3, APLNR, MT-ND2, MT-ND1, MT3, MT-ATP6, MT-ND4, MT-CYB...
    Pattern 3   | 0.7692      | 0        | None
    Pattern 4   | 0.7480      | 6        | LTB, CCL19, TACSTD2, PIGB, RLBP1, IRF7
    Pattern 5   | 0.7298      | 110      | LDB3, GJB1, ERMN, GLDN, NKX6-2, PLEKHH1, ANLN, AQP1, CARNS1, FAM222A...
    Pattern 6   | 0.6737      | 215      | LINC00507, NEUROD1, NEK2, CALB1, EFCAB1, FFAR4, LRRTM4, SHISA8, PENK, KCNA4...
    Pattern 7   | 0.6705      | 128      | TRABD2A, LRMP, FEZF2, TSHZ2, PCP4, GAL, SMYD2, MKX, HS3ST2, CLSTN2...
    Pattern 8   | 0.6345      | 7        | NR4A2, KRT17, NTNG2, THEMIS, SEMA3E, CTGF, TMEM233
    Pattern 9   | 0.6290      | 41       | RELN, MT1H, RGR, F3, MSX1, MT1F, GSTM5, HEPN1, IQCK, FABP7...
    Pattern 10  | 0.6217      | 4        | ZMYND10, AC004233.3, LAIR1, BIK
    Pattern 11  | 0.6013      | 2        | SSTR3, TSHZ2
    Pattern 12  | 0.5910      | 30       | SHD, TPBG, NEFH, NEFM, VAMP1, CPB1, KCNS1, NEUROD6, SYT2, GRM7...
    Pattern 13  | 0.5670      | 4        | ADAMTS1, NRIP2, TNFRSF12A, TMEM233
    Pattern 14  | 0.5600      | 10       | MYH11, AC073896.2, AC105942.1, ANXA3, ACTA2, APLNR, MYL9, ARRDC4, TUT7, AL031056.1
    Pattern 15  | 0.5556      | 3        | CEP76, MN1, C1QL2
    Pattern 16  | 0.5473      | 1        | PTER
    Pattern 17  | 0.5029      | 1        | DPY19L3
    Pattern 18  | 0.4962      | 2        | TMC7, NAGA
    Pattern 19  | 0.4912      | 48       | CEACAM6, CYP2A6, SCGB2A1, TFF1, CLEC3A, CYP4Z1, SCGB1D2, TPRG1, AZGP1, AGR2...
    Pattern 20  | 0.4788      | 25       | CIDEC, G0S2, PLIN1, FABP4, THRSP, ADH1B, PLIN4, ADIPOQ, SAA1, SAA2...
    Pattern 21  | 0.4665      | 0        | None
    Pattern 22  | 0.4591      | 6        | S100G, ALB, MGP, AC007952.4, GPC3, LTBP2
    Pattern 23  | 0.4514      | 32       | IER3, CPB1, BAMBI, CGA, PSMB9, CLDN7, CTPS2, H2AFJ, RAB11FIP1, HK2...
    Pattern 24  | 0.4493      | 4        | NOSTRIN, LYPD6B, ERFE, ACP6
    Pattern 25  | 0.4292      | 1        | PEBP4
    Pattern 26  | 0.4059      | 4        | PRR15L, ALB, PRR15, TRPV3
    Pattern 27  | 0.3753      | 14       | JCHAIN, IGHA1, IGLC2, IGHM, IGLC3, IGHA2, IGKC, KRT15, IGHG3, IGHG1...
    Pattern 28  | 0.3605      | 0        | None
    Pattern 29  | 0.3521      | 3        | HBB, HBA2, HBA1
    Pattern 30  | 0.3486      | 3        | ZNF446, TRARG1, OTUB2
    ----------------------------------------------------------------------------------------------------


## 3. Spatially Variable Genes (SVGs) Analysis


```python
# Get SVG results
svg_results = separ.recognize_svgs(err_tol=0.7)
svg_results

```


    {'svgs': Index(['SCGB2A2', 'SCGB1D2', 'GFAP', 'MBP', 'MT-CO2', 'MT-CO1', 'MT-ND1',
            'SAA1', 'MT-ATP6', 'MT-ND2',
            ...
            'SLC25A46', 'MAP7D1', 'LIMCH1', 'TCF25', 'DCTN2', 'CAMSAP2', 'ARL2',
            'SPAG9', 'CFL2', 'TCAF1'],
           dtype='object', length=1831),
     'gene_ranking': array([   4,    5,   19, ..., 2149,  935, 2008]),
     'error_rates': array([0.13682987, 0.20417668, 0.22127199, ..., 0.88352493, 0.6206503 ,
            0.90467464])}


#### Parameters Explanation:  
- `err_tol`: Float, default=0.7  
  - Error tolerance threshold for SVG identification  
  - Controls which genes are classified as spatially variable based on reconstruction error  

#### Results Storage:  
- `svg_results['svgs']`: Names of identified spatially variable genes  
- `svg_results['gene_ranking']`: Indices of all genes sorted by spatial variability (ascending error rates)  
- `svg_results['error_rates']`: Reconstruction error rates for each gene

Let's visualize some top and bottom SVGs:


```python
gene_ranking = svg_results['gene_ranking']
# Create a single figure with subplots for top SVGs and bottom non-SVGs  
plt.figure(figsize=(8.5, 5), dpi=100)  
plt.rcParams['font.size'] = 8  
# Plot top 3 SVGs  
for i in range(3):  
    plt.subplot(2, 3, i+1)  
    gene_idx = gene_ranking[i]  
    gene_name = adata.var_names[gene_idx]  
    gene_exp = separ.adata[:, gene_idx].X.toarray().flatten()  
    
    scatter = plt.scatter(separ.adata.obsm['spatial'][:, 0],   
                        -separ.adata.obsm['spatial'][:, 1],  
                        c=gene_exp, s=5, cmap='Reds')  
    plt.title(f'Top SVG {i+1}: {gene_name}')  
    plt.axis('off')  
    plt.colorbar(scatter)  

# Plot bottom 3 non-SVGs  
for i in range(3):  
    plt.subplot(2, 3, i+4)  
    gene_idx = gene_ranking[-(i+1)]  
    gene_name = adata.var_names[gene_idx]  
    gene_exp = separ.adata[:, gene_idx].X.toarray().flatten()  
    
    scatter = plt.scatter(separ.adata.obsm['spatial'][:, 0],   
                        -separ.adata.obsm['spatial'][:, 1],  
                        c=gene_exp, s=5, cmap='Reds')  
    plt.title(f'Bottom non-SVG {i+1}: {gene_name}')  
    plt.axis('off')  
    plt.colorbar(scatter)  

plt.tight_layout()  
plt.show()
```


![png](Tutorial1_files/Tutorial1_17_0.png)


## 4. Gene Expression Refinement


```python
adata_refined = separ.get_refined_expression()
```

#### Returns:  
- `adata_refined`: AnnData object  
  - Contains denoised gene expression matrix 
  - Preserves original gene names and spot indices from input AnnData  
  - `.X`: Stores the refined expression matrix  

### Evaluating Refinement Performance  
 
First, we use Moran's I statistic to quantitatively evaluate the improvement in spatial patterns. Moran's I measures spatial autocorrelation, with higher values indicating stronger spatial patterns. The violin plot below compares the distribution of Moran's I values between raw and refined expression data across all genes: 


```python
# Calculate Moran's I for both raw and refined data  
morani = sc.metrics.morans_i(separ.adata)  
morani_refine = sc.metrics.morans_i(adata_refined)  

# Create DataFrame for violin plot  
data_violin = pd.DataFrame({  
    'Moran\'s I': np.concatenate([morani, morani_refine]),  
    'Condition': ['Raw'] * len(morani) + ['Refined'] * len(morani_refine)  
})  

import seaborn as sns
# Create violin plot  
plt.rcParams['font.size'] = 16  
plt.figure(figsize=(3, 5.5))  
sns.violinplot(x='Condition', y='Moran\'s I', data=data_violin, palette='Set2')  
plt.ylabel('Moran\'s I')  
plt.xlabel('')  
plt.show()  
```


![png](Tutorial1_files/Tutorial1_21_0.png)


### Single Gene Visualization

To demonstrate the refinement effect at individual gene level, we visualize the spatial expression pattern of a single gene. The comparison shows how SEPAR preserves and enhances genuine spatial patterns while reducing technical noise:


```python
def plot_gene_refinement(gene_name, separ, adata_refined):  
    """  
    Visualize raw and refined expression patterns for a specific gene.  
    
    Parameters  
    ----------  
    gene_name : str  
        Name of the gene to visualize  
    separ : SEPAR object  
        SEPAR object containing raw data  
    adata_refined : AnnData  
        Refined expression data from get_refined_expression()  
    """  
    # Get gene index and expression values  
    gene_idx = separ.adata.var_names.get_loc(gene_name)  
    raw_exp = separ.adata[:, gene_idx].X.toarray().flatten()  
    refined_exp = adata_refined.X[:, gene_idx]  

    # Create comparison plot  
    plt.rcParams['font.size'] = 16  
    plt.figure(dpi=100, figsize=(7/1.2, 4/1.2))  
    
    # Set main title  
    plt.suptitle(gene_name, fontsize=18)  
    
    # Plot raw expression  
    plt.subplot(1, 2, 1)  
    plt.scatter(separ.loc[:, 0], -separ.loc[:, 1],  
                c=raw_exp, s=5, cmap='Reds', rasterized=True)  
    plt.axis('off')  
    plt.title('Raw', fontsize=16)  
    
    # Plot refined expression  
    plt.subplot(1, 2, 2)  
    plt.scatter(separ.loc[:, 0], -separ.loc[:, 1],  
                c=refined_exp, s=5, cmap='Reds', rasterized=True)  
    plt.axis('off')  
    plt.title('Refined', fontsize=16)  
    
    plt.tight_layout(pad=0)  
    plt.show()  

plot_gene_refinement("NEUROD2", separ, adata_refined)
```


![png](Tutorial1_files/Tutorial1_23_0.png)


```python
plot_gene_refinement("MBP", separ, adata_refined)  
```


![png](Tutorial1_files/Tutorial1_24_0.png)


```python
plot_gene_refinement("GFAP", separ, adata_refined)
```


![png](Tutorial1_files/Tutorial1_25_0.png)


## 5. Performing Clustering


```python
# Perform clustering
cluster_res = separ.clustering(n_cluster=n_cluster, N1=30-n_cluster-7, N2=5)
```

### Parameters Explanation:
- `n_cluster`: Integer
  - Number of spatial domains to identify

- `N1`: Integer, optional
  - Number of patterns to retain after l2-norm filtering
  - Default: r - 5 - n_cluster (where r is the total number of patterns)

- `N2`: Integer, default=3
  - Number of patterns to retain after Pattern Specificity Score (PSS) filtering
  - Default value is 3 as specified in the paper

### Results Storage:
- `self.labelres`: Array of cluster labels for each spot
- `self.adata.obs['clustering']`: Cluster assignments stored in AnnData object

### Clustering Results Visualization

Compare the manual annotation with SEPAR clustering results:


```python
# Calculate clustering metrics
Y_list = separ.adata.obs['Region']
Ari = adjusted_rand_score(Y_list, separ.labelres)
Nmi = nmi(Y_list, separ.labelres)

# Create visualization
fig, axs = plt.subplots(1, 2, figsize=(12, 6), dpi = 50)
plt.rcParams['font.size'] = 12  
# Manual annotation plot
groups = separ.adata.obs['Region'].astype('str')
unique_regions = groups.unique()
region_to_num = {region: num for num, region in enumerate(unique_regions)}
groups = separ.adata.obs['Region'].map(region_to_num)

scatter = axs[0].scatter(separ.loc[:, 0], -separ.loc[:, 1], 
                        c=groups, s=10, cmap='tab10')
axs[0].set_title("Manual annotation", fontsize=15)
axs[0].axis('off')
legend = axs[0].legend(*scatter.legend_elements(), 
                      title="Layers",
                      bbox_to_anchor=(0.95, 0.5),
                      loc='center left')

# SEPAR clustering plot
scatter = axs[1].scatter(separ.loc[:, 0], -separ.loc[:, 1], 
                        c=separ.labelres, s=10, cmap='Set1')
axs[1].set_title(f"SEPAR clustering (ARI = {Ari:.3f}, NMI = {Nmi:.3f})", 
                 fontsize=15)
axs[1].axis('off')

plt.tight_layout()
plt.show()
```


![png](Tutorial1_files/Tutorial1_30_0.png)

## References  

[1] Dong, Kangning, and Shihua Zhang. “Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder.” Nature Communications 13.1 (2022): 1-12.