Commit 6c26e7d

Content rev Nov20-2 (#18)
1 parent 1748e8f commit 6c26e7d

18 files changed

+588
-522
lines changed

docs/component-development/creating-components.mdx

Lines changed: 109 additions & 97 deletions
@@ -13,8 +13,11 @@ Learn more about the Oasis CLI tool in the [Oasis CLI Manual](/docs/component-de
1313

1414
## Lightweight Python Components
1515

16-
As you learn in the [Creating Components](/docs/component-development/creating-components-generic) guide, you can create a component by writing a code and containerizing it. But this approach may be time-consuming.
17-
For Python function, you can use the Oasis CLI tool to generate a component specification from your function. Instead of rebuilding containers for every code change, the Python code goes **in the command line**, outside the container:
16+
Instead of following [the conventional approach](/docs/component-development/creating-components-generic), where you write code, containerize it, publish it to a registry, and manually create YAML configuration files, the Oasis CLI tool automates the entire process.
17+
You write your Python code and run a single command. The system then generates a YAML specification that runs your code as a command, so you do not need to manage Docker images or registries.
18+
19+
This approach is called "Lightweight Python Components". It shortens the path from code to component, so you can iterate faster and focus on the code logic rather than on infrastructure.
20+
1821

1922
```yaml
2023
implementation:
@@ -29,97 +32,6 @@ implementation:
2932
# Generated wrapper code handles I/O
3033
```
3134
32-
### The `InputPath` and `OutputPath` annotations
33-
34-
The `text_path: InputPath()` annotation tells the system that the input data for the text input should be placed into some file and the path of that file should be given to the function as a value for the text_path function parameter.
35-
36-
The `filtered_text_path: OutputPath()` annotation tells the system that it should generate and give the function a path (via the filtered_text_path parameter) where the function should write the output data. After the function finishes the execution, the system will take the output data written by the function, put it into storage and make it available for passing to other components.
37-
38-
### Why do we need the `InputPath` parameter annotation?
39-
40-
Not all data can be passed/received as a simple string. Examples: binary data, large data, directories. In all these cases, the code should read data from a file or directory pointed to by a path. This is why we have a text_path: InputPath() parameter and not text: str parameter (although the latter could still work for short texts). Another reason why the InputPath annotation is needed is that the component function code is executed inside a hermetic container. The text file needs to somehow be placed inside the container. Only the system can do that. The text_path: InputPath() annotation tells the system that the input data for the text input should be placed into some file and the path of that file should be given to the function as a value for the text_path function parameter.
41-
42-
Similarly the filtered_text_path: OutputPath() parameter annotation is needed so that the system knows that it needs to get the output data out of the container when the function finishes its execution.
43-
44-
### Default parameter values
45-
46-
The `create_component_from_func` function supports functions with default parameter values. This results in the generated component inputs becoming optional.
47-
48-
Path parameters annotated with `InputPath()` can have a default value of `None` which makes those file inputs optional.
49-
50-
The default parameter values can use any Python built-in type. (Only the built-in types can be used because the function needs to remain self-contained).
51-
52-
```python
53-
def some_func(
54-
some_int: int = 3,
55-
some_path: InputPath() = None,
56-
):
57-
from pathlib import Path
58-
if some_path:
59-
Path(some_path).read_text()
60-
...
61-
```
62-
63-
### Input and Output Types
64-
65-
While low-level TangleML does not enforce any types, the Oasis CLI generator (`components.create_component_from_func`) provides support for six basic Python types:
66-
67-
| Python Type | TangleML Type | Serialization |
68-
| ----------- | ------------- | ---------------------------- |
69-
| `str` | `String` | Direct passing |
70-
| `int` | `Integer` | String to int conversion |
71-
| `float` | `Float` | String to float conversion |
72-
| `bool` | `Boolean` | String to boolean conversion |
73-
| `list` | `JsonArray` | JSON serialization |
74-
| `dict` | `JsonObject` | JSON serialization |
75-
76-
:::tip
77-
**Beyond the Basics**: You can use any type annotation (like `XGBoostModel`), but unsupported types will be passed as strings. The generator only adds serialization/deserialization for the six basic types.
78-
:::
79-
80-
The function parameters (the parameter names and type annotations) are mapped to component inputs and outputs in a certain way. This example demonstrates all aspects of the mapping
81-
82-
```python
83-
def my_func(
84-
# Directly supported types:
85-
# Mapped to input with name "some_string" and type "String"
86-
some_string: str,
87-
# Mapped to input with name "some_string" and type "Integer"
88-
some_integer: int,
89-
# Mapped to input with name "some_float" and type "Float"
90-
some_float: float,
91-
# Mapped to input with name "some_boolean" and type "Boolean"
92-
some_boolean: bool,
93-
# Mapped to input with name "some_list" and type "JsonArray"
94-
some_list: list,
95-
# Mapped to input with name "some_dict" and type "JsonObject"
96-
some_dict: dict,
97-
98-
# Mapped to input with name "any_thing" and no type (compatible with any type. Will receive a string value at runtime!)
99-
any_thing,
100-
101-
# Other types
102-
# Mapped to input with name "some_uri" and type "Uri" (Will receive a string value at runtime!)
103-
some_uri: "Uri",
104-
# Mapped to input with name "some_uri" and type "BigInt" (Will receive a string value at runtime!)
105-
some_uri: BigInt,
106-
107-
# Paths:
108-
# Mapped to input with name "input1" (the "_path" suffix is removed)
109-
input1_path: InputPath(""),
110-
# Mapped to output with name "output1" and type "CSV" (the "_path" suffix is removed)
111-
output1_path: OutputPath("CSV"),
112-
) -> typing.NamedTuple("Outputs", [
113-
# Mapped to output with name "output_string" and type "String"
114-
("output_string", str),
115-
# Mapped to output with name "output_uri" and type "Uri" (function needs to return a string)
116-
("output_uri", "Uri"),
117-
]):
118-
...
119-
return ("Some string", "some-uri://...")
120-
```
121-
122-
12335
## Tutorial: Creating a Lightweight Python Component
12436
12537
This guide walks you through creating a TangleML component that performs regex-based text replacement. The component reads an input text file, replaces all substrings matching a given regex pattern, and writes the result to an output file.
@@ -489,9 +401,10 @@ To use the component, drop it into your pipeline and configure the inputs. Click
489401

490402

491403

492-
## Appendix
404+
### Appendix
493405

494-
### Sample Data
406+
<details>
407+
<summary>Sample Data</summary>
495408

496409
```txt
497410
Employee Records - Confidential
@@ -534,8 +447,10 @@ Notes:
534447
- Server maintenance scheduled for 192.168.1.200
535448
- Update payment info for CC: 5432 1098 7654 3210
536449
```
450+
</details>
537451

538-
### Common Regex Patterns
452+
<details>
453+
<summary>Common Regex Patterns</summary>
539454

540455
| Use Case | Pattern | Example Replacement |
541456
| --------------------------- | ------------------------------------------------------ | ------------------- |
@@ -550,7 +465,10 @@ Notes:
550465
| **HTML Tags** | `<[^>]+>` | ` ` |
551466
| **Code Comments (Python)** | `#.*$` | ` ` |
552467

553-
### Regex Flags Reference
468+
</details>
469+
470+
<details>
471+
<summary>Regex Flags Reference</summary>
554472

555473
| Flag | Value | Description | Use Case |
556474
| --------------- | ----- | --------------------------------- | ------------------------------- |
@@ -564,3 +482,97 @@ To combine flags, add their values:
564482
```python
565483
flags = 2 + 8 # IGNORECASE + MULTILINE = 10
566484
```
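
The numeric values used above match the constants in Python's `re` module (`re.IGNORECASE` is 2 and `re.MULTILINE` is 8), so the same combination can be written with named flags. The following is a minimal sketch; the sample text and pattern are illustrative only and are not taken from the tutorial data:

```python
import re

# Named constants carry the same integer values as the numeric flags.
flags = re.IGNORECASE | re.MULTILINE
assert int(flags) == 2 + 8

# Illustrative input: strip trailing Python-style comments on every line.
text = "x = 1  # Set x\ny = 2  # set y"
cleaned = re.sub(r"\s*#.*$", "", text, flags=flags)
print(cleaned)
```

Combining with `|` instead of `+` gives the same result for distinct flags and avoids double counting if a flag is accidentally listed twice.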
485+
486+
</details>
487+
488+
## Afterthoughts
489+
490+
### The `InputPath` and `OutputPath` annotations
491+
492+
The `text_path: InputPath()` annotation tells the system to place the input data for the text input into a file and pass that file's path to the function as the value of the `text_path` parameter.
493+
494+
The `filtered_text_path: OutputPath()` annotation tells the system to generate a path and pass it to the function (via the `filtered_text_path` parameter); the function writes its output data there. After the function finishes, the system takes the output data, puts it into storage, and makes it available to other components.
495+
496+
### Why do we need the `InputPath` parameter annotation?
497+
498+
Not all data can be passed or received as a simple string: binary data, large data, and directories are examples. In these cases, the code should read the data from a file or directory pointed to by a path. This is why the function has a `text_path: InputPath()` parameter rather than a `text: str` parameter (although the latter could still work for short texts). The `InputPath` annotation is also needed because the component function runs inside a hermetic container. The text file has to be placed inside the container, and only the system can do that; the annotation is how the function asks the system to do so.
499+
500+
Similarly, the `filtered_text_path: OutputPath()` annotation tells the system that it needs to retrieve the output data from the container when the function finishes executing.
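
To make the file-based contract concrete, the sketch below imitates what the system does around the function: it places the input in a file, hands the function two paths, and collects the output afterwards. The annotations are omitted so the snippet runs without the Oasis tooling. The `pattern` parameter, the `run_like_the_system` helper, and the temporary directory layout are assumptions made for illustration; they are not the real runtime.

```python
import re
import tempfile
from pathlib import Path


def filter_text(text_path: str, filtered_text_path: str, pattern: str = r"\d") -> None:
    # The function only sees paths; it never receives the data itself.
    text = Path(text_path).read_text()
    Path(filtered_text_path).write_text(re.sub(pattern, "", text))


def run_like_the_system(input_data: str) -> str:
    # Hypothetical stand-in for the runtime that stages files around the call.
    with tempfile.TemporaryDirectory() as workdir:
        text_path = Path(workdir) / "inputs" / "text" / "data"
        filtered_text_path = Path(workdir) / "outputs" / "filtered_text" / "data"
        text_path.parent.mkdir(parents=True)
        filtered_text_path.parent.mkdir(parents=True)
        # Input data is placed into a file before the function runs.
        text_path.write_text(input_data)
        filter_text(str(text_path), str(filtered_text_path))
        # Output data is read back after the function returns.
        return filtered_text_path.read_text()


result = run_like_the_system("room 101, floor 3")
```

Because the function touches only the two paths it is given, the same body works whether the data is a few bytes or too large to pass as a string.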
501+
502+
### Default parameter values
503+
504+
The `create_component_from_func` function supports functions with default parameter values. This results in the generated component inputs becoming optional.
505+
506+
Path parameters annotated with `InputPath()` can have a default value of `None` which makes those file inputs optional.
507+
508+
Default parameter values can use any Python built-in type. Only built-in types are allowed because the function needs to remain self-contained.
509+
510+
```python
511+
def some_func(
512+
    some_int: int = 3,
513+
    some_path: InputPath() = None,
514+
):
515+
    from pathlib import Path
516+
    if some_path:
517+
        Path(some_path).read_text()
518+
    ...
519+
```
520+
521+
### Input and Output Types
522+
523+
While low-level TangleML does not enforce any types, the Oasis CLI generator (`components.create_component_from_func`) provides support for six basic Python types:
524+
525+
| Python Type | TangleML Type | Serialization |
526+
| ----------- | ------------- | ---------------------------- |
527+
| `str` | `String` | Direct passing |
528+
| `int` | `Integer` | String to int conversion |
529+
| `float` | `Float` | String to float conversion |
530+
| `bool` | `Boolean` | String to boolean conversion |
531+
| `list` | `JsonArray` | JSON serialization |
532+
| `dict` | `JsonObject` | JSON serialization |
533+
534+
:::tip
535+
**Beyond the Basics**: You can use any type annotation (like `XGBoostModel`), but unsupported types will be passed as strings. The generator only adds serialization/deserialization for the six basic types.
536+
:::
537+
538+
The function's parameter names and type annotations determine how its parameters and return values map to component inputs and outputs. This example demonstrates every aspect of the mapping:
539+
540+
```python
541+
def my_func(
542+
    # Directly supported types:
543+
    # Mapped to input with name "some_string" and type "String"
544+
    some_string: str,
545+
    # Mapped to input with name "some_integer" and type "Integer"
546+
    some_integer: int,
547+
    # Mapped to input with name "some_float" and type "Float"
548+
    some_float: float,
549+
    # Mapped to input with name "some_boolean" and type "Boolean"
550+
    some_boolean: bool,
551+
    # Mapped to input with name "some_list" and type "JsonArray"
552+
    some_list: list,
553+
    # Mapped to input with name "some_dict" and type "JsonObject"
554+
    some_dict: dict,
555+
556+
    # Mapped to input with name "any_thing" and no type (compatible with any type. Will receive a string value at runtime!)
557+
    any_thing,
558+
559+
    # Other types
560+
    # Mapped to input with name "some_uri" and type "Uri" (Will receive a string value at runtime!)
561+
    some_uri: "Uri",
562+
    # Mapped to input with name "some_big_int" and type "BigInt" (Will receive a string value at runtime!)
563+
    some_big_int: BigInt,
564+
565+
    # Paths:
566+
    # Mapped to input with name "input1" (the "_path" suffix is removed)
567+
    input1_path: InputPath(""),
568+
    # Mapped to output with name "output1" and type "CSV" (the "_path" suffix is removed)
569+
    output1_path: OutputPath("CSV"),
570+
) -> typing.NamedTuple("Outputs", [
571+
    # Mapped to output with name "output_string" and type "String"
572+
    ("output_string", str),
573+
    # Mapped to output with name "output_uri" and type "Uri" (function needs to return a string)
574+
    ("output_uri", "Uri"),
575+
]):
576+
    ...
577+
    return ("Some string", "some-uri://...")
578+
```
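
The generator's actual wrapper code is not shown in this guide. As a rough sketch of what string-based deserialization for the six basic types could look like, consider the following. The names `DESERIALIZERS`, `parse_bool`, and `deserialize` are hypothetical, and the accepted boolean spellings are an assumption:

```python
import json


def parse_bool(value: str) -> bool:
    # Assumed spellings; the real generator may accept a different set.
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise ValueError(f"Not a boolean: {value!r}")


# TangleML type name -> function that turns the incoming string into a Python value.
DESERIALIZERS = {
    "String": str,
    "Integer": int,
    "Float": float,
    "Boolean": parse_bool,
    "JsonArray": json.loads,
    "JsonObject": json.loads,
}


def deserialize(type_name, value: str):
    # Unsupported or missing types fall through unchanged, as plain strings.
    return DESERIALIZERS.get(type_name, str)(value)
```

A dedicated boolean parser is used because Python's `bool("False")` evaluates to `True`: any non-empty string is truthy.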

docs/core-concepts/artifacts.mdx

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
---
2+
title: Understanding Artifacts in TangleML
3+
sidebar_label: Artifacts
4+
description: Learn how TangleML's artifact system works
5+
---
6+
7+
import {ImageAnnotation} from "@site/src/components/ImageAnnotation";
8+
9+
Artifacts are the data produced by components (that is, any component output) and stored in TangleML's artifact storage system. There are two kinds:
10+
11+
- **Blobs**: Nameless files (just data)
12+
- **Directories**: Nameless containers with named files inside
13+
14+
<ImageAnnotation src={require('./assets/Artifacts.png').default} alt="Artifacts" >
15+
16+
Artifacts can be accessed in the Pipeline Run page, in [the Artifacts tab](/docs/user-guide/studio-app-ui-overview#1-artifacts-tab).
17+
18+
</ImageAnnotation>
19+
20+
:::tip
21+
Small values may be stored directly in the TangleML database with no TTL (time to live) applied to them.
22+
:::
23+
24+
### Blob vs Directory Artifacts
25+
26+
#### Blob Artifacts
27+
28+
Blobs are nameless data files. Components always write to and read from a file named `data`:
29+
30+
```python
31+
# Component writes blob
32+
with open("/tmp/outputs/model/data", "wb") as f:
33+
    pickle.dump(model, f)
34+
35+
# Downstream component reads blob
36+
with open("/tmp/inputs/model/data", "rb") as f:
37+
    model = pickle.load(f)
38+
```
39+
40+
This naming convention ensures compatibility: no component has to expect a specific filename.
41+
42+
#### Directory Artifacts
43+
44+
Directories are nameless containers, but files inside retain their names:
45+
46+
```python
47+
# Component writes directory
48+
output_dir = "/tmp/outputs/dataset/data/"
49+
os.makedirs(output_dir, exist_ok=True)
50+
pd.DataFrame(...).to_parquet(f"{output_dir}/train.parquet")
51+
pd.DataFrame(...).to_parquet(f"{output_dir}/test.parquet")
52+
53+
# Downstream component reads directory
54+
input_dir = "/tmp/inputs/dataset/data/"
55+
train = pd.read_parquet(f"{input_dir}/train.parquet")
56+
test = pd.read_parquet(f"{input_dir}/test.parquet")
57+
```
58+
59+
### Artifact Attributes
60+
61+
Every artifact has:
62+
63+
- **Size**: Total bytes (for directories, cumulative size)
64+
- **Hash**: MD5 (Google Cloud) or SHA-256 (local) for content-based caching
65+
- **Is Directory**: Boolean flag
66+
- **URL**: Storage location (hidden from components, managed by system)
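
As an illustration of the size, hash, and directory attributes, the sketch below computes them for a local blob or directory using SHA-256. How TangleML actually hashes directories is not described here; the `describe_artifact` helper and the choice to hash relative file names together with file contents in sorted order are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_artifact(path: Path) -> dict:
    is_directory = path.is_dir()
    if is_directory:
        # Directories contribute every file beneath them, in a stable order.
        files = sorted(p for p in path.rglob("*") if p.is_file())
    else:
        files = [path]

    digest = hashlib.sha256()
    size = 0
    for file in files:
        if is_directory:
            # Assumption: include relative names so a rename changes the hash.
            digest.update(file.relative_to(path).as_posix().encode())
        data = file.read_bytes()
        digest.update(data)
        size += len(data)
    return {"size": size, "hash": digest.hexdigest(), "is_directory": is_directory}


with tempfile.TemporaryDirectory() as tmp:
    blob = Path(tmp) / "data"
    blob.write_bytes(b"hello")
    blob_info = describe_artifact(blob)
```

For a blob, the hash here is simply the SHA-256 of the file contents, which is what makes content-based caching possible: identical bytes produce an identical hash regardless of where they are stored.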
67+
68+
### Storage and Retention
69+
70+
| Artifact Type | Storage Duration | What's Retained After TTL |
71+
| ----------------------------- | ----------------- | -------------------------- |
72+
| **Large artifacts** | 30 days (Shopify) | Metadata only (size, hash) |
73+
| **Small values** | Permanent | Full value in database |
74+
75+
:::warning Data Retention
76+
At Shopify, artifacts containing merchant or PII data are automatically deleted after 30 days due to compliance requirements. After deletion, you'll see metadata but get 404 errors when accessing the actual data.
77+
:::

docs/core-concepts/caching.mdx

Lines changed: 2 additions & 12 deletions
@@ -106,18 +106,8 @@ Cache staleness uses RFC3339 duration format:
106106
- `PT1H` - 1 hour
107107
- `P0D` - Disable caching
108108

109-
## Data Retention and Purging
110-
111-
### Artifact Lifecycle
112-
113-
| Stage | Duration | What Remains |
114-
| ----------- | --------- | ---------------------------------------- |
115-
| **Active** | 0-30 days | Full artifact data and metadata |
116-
| **Purged** | >30 days | Metadata only (size, hash, small values) |
117-
| **Deleted** | Never | Execution records persist indefinitely |
118-
119-
:::warning
120-
Artifacts are automatically purged after 30 days due to data retention policies. URLs to purged artifacts will return 404 errors, but metadata remains visible.
109+
:::info
110+
Large artifacts are purged after 30 days due to [data retention policies](/docs/core-concepts/artifacts#storage-and-retention).
121111
:::
122112

123113
### What Survives Purging
