Commit 0f1f71d

add guidelines
1 parent e78fda6 commit 0f1f71d

1 file changed

+86
-0
lines changed

docs/guidelines.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
1+
# Data Storage, Pipelines, and Analysis Environment Guidelines
2+
3+
**Audience:** Internal team (Shannan, Lorena, Alex, Noor, Elizabeth, Ruitong, Zhu, James, Upen, Will, Emma, others)
4+
**Purpose:** Establish consistent practices for data storage, pipeline execution, and downstream analysis across FAS, O2, MIX, AWS, and local environments.
5+
6+
---
7+
8+
## 1. Data Storage
9+
10+
### 1.1 Storage Locations
11+
12+
* **FAS**
13+
14+
* For FAS/HSPH-affiliated projects.
15+
* Intended use: *final* and *downstream objects*.
16+
* Scratch storage: temporary data only; may be auto-synced to O2 for non-FAS projects.
17+
18+
* **O2**
19+
20+
* For HMS-affiliated and other non-FAS/HSPH projects.
21+
22+
* **Globus**
23+
24+
* Preferred for FAS–O2 syncing (benchmark: 1 TB in \~15 minutes).
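
For planning transfers, the benchmark above can be turned into a rough rate. This is a back-of-the-envelope sketch, not a measurement: it assumes decimal terabytes (1 TB = 10^12 bytes) and that transfer time scales linearly with size, which real transfers with many small files may not do. The function names are illustrative only.

```python
# Rough estimate derived from the quoted benchmark (1 TB in ~15 minutes).
# Assumptions: 1 TB = 10**12 bytes, and time scales linearly with data size.

BENCHMARK_BYTES = 1e12        # 1 TB
BENCHMARK_SECONDS = 15 * 60   # ~15 minutes


def implied_throughput_gb_per_s() -> float:
    """Average rate implied by the benchmark, in GB/s."""
    return BENCHMARK_BYTES / BENCHMARK_SECONDS / 1e9


def estimated_minutes(size_tb: float) -> float:
    """Estimated transfer time in minutes for `size_tb` terabytes."""
    bytes_per_second = BENCHMARK_BYTES / BENCHMARK_SECONDS
    return size_tb * 1e12 / bytes_per_second / 60


if __name__ == "__main__":
    print(f"Implied rate: {implied_throughput_gb_per_s():.2f} GB/s")
    print(f"Estimated time for 4 TB: {estimated_minutes(4):.0f} min")
```

Under these assumptions the benchmark corresponds to roughly 1.1 GB/s, so a 4 TB project would take about an hour.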
25+
26+
### 1.2 Data Lifecycle Policy
27+
28+
* **Raw Data**:
29+
30+
* Keep raw data in scratch only.
31+
* Use it only as pipeline input.
32+
* Return it to clients after processing.
33+
* Explore “cold storage” with retrieval fees as a retention option.
34+
35+
* **Pipeline Output Data**:
36+
37+
* Long-term retention on FAS.
38+
* Accessible to clients/collaborators.
39+
40+
* **Downstream Objects**:
41+
42+
* Retained on FAS for reproducibility and reanalysis.
43+
44+
### 1.3 Data Management and Data Flow Practices
45+
46+
* Maintain a **strict folder structure** for consistency. Use the project names from the **Trello cards**.
47+
* Ensure all PI folders have `group r+w` permissions.
48+
* Every analyst works in **their own workspace**:
49+
- This can be FAS scratch or FAS user space.
50+
- **The first step is to clone the repository** (it will be ready for analysts to clone).
51+
- Data (primary: pipeline outputs; secondary: files and objects) **always** lives in the project folder.
52+
- Basic option: use full paths to the project folder.
53+
- Advanced option: use symlinks if you find them easier, and document that step in the README.
54+
55+
This setup lets multiple people work on the same project, avoids GitHub issues, and promotes good data management practices.
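
For analysts who choose the symlink route, the workspace setup could look something like the sketch below. It is only an illustration: the function name `link_project_data`, the default link name `data`, and any paths passed in are hypothetical and should be replaced with the real project folder and the analyst's own clone.

```python
from pathlib import Path


def link_project_data(project_dir: Path, workspace_repo: Path, link_name: str = "data") -> Path:
    """Create `workspace_repo/link_name` as a symlink to `project_dir`.

    All names here are placeholders; substitute your project folder and clone.
    """
    project_dir = project_dir.resolve()
    if not project_dir.is_dir():
        raise FileNotFoundError(f"project folder not found: {project_dir}")

    link = workspace_repo / link_name
    if link.is_symlink():
        if link.resolve() == project_dir:
            return link  # already set up, nothing to do
        raise FileExistsError(f"{link} already points somewhere else")
    if link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")

    # The data stays in the project folder; only the link lives in the workspace.
    link.symlink_to(project_dir, target_is_directory=True)
    return link
```

Scripts in the clone can then read and write through `data/` while the files themselves remain in the shared project folder. If you use this approach, record the link in the README as noted above.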
56+
57+
## 2. Computational Environments
58+
59+
### 2.1 Pipelines (Primary Analysis)
60+
61+
* **FAS**: Default environment for most projects.
62+
* Monitor with Seqera (Alex to assist).
63+
* **Seqera + AWS**: For small workloads (< 20 samples) or projects requiring cloud scalability.
64+
65+
### 2.2 Downstream Analysis (Secondary / Exploratory)
66+
67+
* **Local**: Preferred by many researchers for lightweight analysis.
68+
* Downstream objects must be copied back to the project directory on FAS or O2.
69+
* **FAS**: Standard for large datasets, reproducibility, and shared work.
70+
* **O2**: Used for HMS projects and where performance is sufficient.
71+
72+
## 3. Open Questions / Action Items
73+
74+
1. **Storage Cost Policy**
75+
76+
* Should clients pay for raw data retention?
77+
* Should we implement cold storage + retrieval fees?
78+
79+
2. **Permissions and Automation**
80+
81+
* Confirm with O2 team about cron jobs for PI folders (`group r+w`).
82+
* Define rules for excluding specific directories in sync.
83+
84+
3. **Team Coordination**
85+
86+
* Shannan & Lorena: schedule meeting with FASRC to discuss quotas, automation, and permissions.
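
As input for the permissions discussion, here is a minimal sketch of what an automated `group r+w` check on a PI folder might do. It is an assumption-laden illustration, not an agreed procedure: the function name `ensure_group_rw` is hypothetical, adding group execute on directories (so members can traverse them) goes beyond the literal `r+w` and is our assumption, and group ownership is not touched. Whether something like this can run from cron still needs to be confirmed with the cluster teams.

```python
import os
import stat
from pathlib import Path

GROUP_RW = stat.S_IRGRP | stat.S_IWGRP


def ensure_group_rw(root: Path) -> list[Path]:
    """Add group read/write to `root` and everything beneath it.

    Directories also get group execute so group members can enter them
    (an assumption beyond plain r+w). Symlinks are skipped.
    Returns the paths whose mode was changed.
    """
    # Collect paths first so permission changes do not affect the walk.
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(Path(dirpath) / name for name in dirnames + filenames)

    changed = []
    for path in paths:
        if path.is_symlink():
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        wanted = mode | GROUP_RW
        if path.is_dir():
            wanted |= stat.S_IXGRP
        if wanted != mode:
            path.chmod(wanted)
            changed.append(path)
    return changed
```

The function only adds bits and never removes any, so running it repeatedly is harmless; a second pass over the same folder returns an empty list.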
