
Commit 6361b5e (1 parent: e7f3059)

moved section on limiting factors of data transfer speed
File tree: 2 files changed, +70 −68 lines

docs/hpc/13_tutorial_intro_hpc/07_transferring_files_remote.mdx

Lines changed: 64 additions & 1 deletion
@@ -242,10 +242,73 @@ To connect to the cluster, we’ll just need to enter our credentials at the top
Hit “Quickconnect” to connect. You should see your remote files appear on the right-hand side of the screen. You can drag-and-drop files between the left (local) and right (remote) sides of the screen to transfer files.

:::tip
Finally, if you need to move large files (typically larger than a gigabyte) from one remote computer to another, SSH in to the computer hosting the files and use `scp` or `rsync` to transfer them to the other. This is more efficient than using FileZilla (or a related application), which would copy from the source to your local machine, then to the destination machine.
:::

## Limiting Factors of Data Transfer Speed

Data transfer speed may be limited by many different factors, so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going. The components between your data’s source and destination have varying levels of performance and, in particular, may have different capabilities with respect to *bandwidth* and *latency*.

**Bandwidth** is generally the raw amount of data per unit time a device is capable of transmitting or receiving. It’s a common and generally well-understood metric.

**Latency** is a bit more subtle. For data transfers, it may be thought of as the amount of time it takes to get data out of storage and into a transmittable form. Latency issues are the reason it’s advisable to execute data transfers by moving a small number of large files, rather than the converse.
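To see why latency pushes you toward fewer, larger files, here is a back-of-envelope sketch. The link speed and per-file overhead below are assumed round numbers for illustration, not measurements of any real system:

```python
def transfer_time(n_files, total_bytes, bandwidth_bytes_per_s, per_file_latency_s):
    """Crude model: fixed per-file overhead plus raw bytes over bandwidth."""
    return n_files * per_file_latency_s + total_bytes / bandwidth_bytes_per_s

GB = 10**9
bandwidth = 125 * 10**6   # assume a 1 Gbit/s link, i.e. ~125 MB/s
latency = 0.010           # assume 10 ms of per-file (metadata) overhead

many_small = transfer_time(10_000, 10 * GB, bandwidth, latency)  # 10,000 small files
one_archive = transfer_time(1, 10 * GB, bandwidth, latency)      # one 10 GB archive
print(f"{many_small:.0f} s vs {one_archive:.0f} s")  # 180 s vs 80 s
```

Same bytes moved, but the per-file overhead more than doubles the total time for the many-files case.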

Some of the key components and their associated issues are:

- **Disk speed**: File systems on HPC systems are often highly parallel, consisting of a very large number of high-performance disk drives. This allows them to support a very high data bandwidth. Unless the remote system has a similar parallel file system, you may find your transfer speed limited by disk performance at that end.
- **Metadata performance**: Metadata operations, such as opening and closing files or listing the owner or size of a file, are much less parallel than read/write operations. If your data consists of a very large number of small files, you may find your transfer speed is limited by metadata operations. Metadata operations performed by other users of the system can also interact strongly with yours, so reducing the number of such operations you perform (by combining multiple files into a single file) may reduce variability in your transfer rates and increase transfer speeds.
- **Network speed**: Data transfer performance can be limited by network speed. More importantly, it is limited by the slowest section of the network between source and destination. If you are transferring to your laptop/workstation, this is likely to be its connection (either via LAN or WiFi).
- **Firewall speed**: Most modern networks are protected by some form of firewall that filters out malicious traffic. This filtering has some overhead and can reduce data transfer performance. The needs of a general-purpose network that hosts email/web servers and desktop machines are quite different from those of a research network that needs to support high-volume data transfers. If you are trying to transfer data to or from a host on a general-purpose network, you may find that its firewall limits the transfer rate you can achieve.

As mentioned above, if you have related data that consists of a large number of small files, it is strongly recommended to pack the files into a larger *archive* file for long-term storage and transfer. A single large file makes more efficient use of the file system and is easier to move, copy, and transfer because significantly fewer metadata operations are required. Archive files can be created using tools like `tar` and `zip`. We have already met `tar` when we talked about data transfer earlier.
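As a quick local illustration of creating and inspecting such an archive (the directory and file names here are invented for the demo):

```shell
# Create a throwaway directory of small files for the demo.
mkdir -p demo_data
for i in 1 2 3; do echo "sample $i" > "demo_data/file_$i.txt"; done

tar -czf demo_data.tar.gz demo_data   # c = create, z = gzip-compress, f = archive name
tar -tzf demo_data.tar.gz             # t = list contents to verify the archive
```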

<div style={{textAlign: 'center'}}>
![Schematic of Network Bandwidth](./static/schematic_network_bandwidth.svg)

*Schematic diagram of bandwidth and latency for disk and network I/O. Each of the components on the figure is connected by a blue line of width proportional to the interface bandwidth.<br />The small mazes at the link points illustrate the latency of the link, with more tortuous mazes indicating higher latency.*
</div>

<details>
<summary>
:::info[Consider the Best Way to Transfer Data]
If you are transferring large amounts of data, you will need to think about what may affect your transfer performance. It is always useful to run some tests that you can use to extrapolate how long it will take to transfer your data.

Say you have a “data” folder containing 10,000 or so files, a healthy mix of small and large ASCII and binary data. Which of the following would be the best way to transfer them to Greene?
```bash
[user@laptop ~]$ scp -r data NYUNetID@greene.hpc.nyu.edu:~/
```
```bash
[user@laptop ~]$ rsync -ra data NYUNetID@greene.hpc.nyu.edu:~/
```
```bash
[user@laptop ~]$ rsync -raz data NYUNetID@greene.hpc.nyu.edu:~/
```
```bash
[user@laptop ~]$ tar -cvf data.tar data
[user@laptop ~]$ rsync -raz data.tar NYUNetID@greene.hpc.nyu.edu:~/
```
```bash
[user@laptop ~]$ tar -cvzf data.tar.gz data
[user@laptop ~]$ rsync -ra data.tar.gz NYUNetID@greene.hpc.nyu.edu:~/
```
**[Click for Solution]**
:::
</summary>
:::tip[Solution]
1. `scp` will recursively copy the directory. This works, but without compression.
2. `rsync -ra` works like `scp -r`, but preserves file information like creation times. This is marginally better.
3. `rsync -raz` adds compression, which will save some bandwidth. If you have a strong CPU at both ends of the line and you’re on a slow network, this is a good choice.
4. This pair of commands first uses `tar` to merge everything into a single file, then `rsync -z` to transfer it with compression. If you have a large number of files, metadata overhead can hamper your transfer, so this is a good idea.
5. This pair uses `tar -z` to compress the archive, then `rsync` to transfer it. This may perform similarly to the option directly above, but in most cases (for large datasets), it’s the best combination of high throughput and low latency (making the most of your time and network connection).
:::
</details>

:::tip[Key Points]
- `wget` and `curl -O` download a file from the internet.
- `scp` and `rsync` transfer files to and from your computer.
- You can use an SFTP client like FileZilla to transfer files through a GUI.
:::

docs/hpc/13_tutorial_intro_hpc/10_using_resources_responsibly.mdx

Lines changed: 6 additions & 67 deletions
@@ -20,7 +20,7 @@ The widespread usage of scheduling systems where users submit jobs on HPC resour
## Be Kind to the Login Nodes

The login node is often busy managing all of the logged-in users, creating and editing files, and compiling software. If the machine runs out of memory or processing capacity, it will become very slow and unusable for everyone. While the machine is meant to be used, be sure to do so responsibly – in ways that will not adversely impact other users’ experience.

Login nodes are always the right place to launch jobs, but data transfers should be done on the Greene Data Transfer Nodes (gDTNs). Please see more about gDTNs at [Data Transfers](../03_storage/03_data_transfers.md). Similarly, computationally intensive tasks should all be done on compute nodes. This refers not just to computational analysis/research tasks, but also to processor-intensive software installations and similar tasks.

:::warning[Login Nodes Are a Shared Resource]
Remember, the login node is shared with all other users and your actions could cause issues for other people. Think carefully about the potential implications of issuing commands that may use large amounts of resources.
@@ -60,9 +60,7 @@ Remember that you are generally charged for usage on shared systems. A simple mi

:::tip[Test Job Submission Scripts That Use Large Amounts of Resources]
We suggest that you test a subset of your data or analysis on an interactive node prior to running full batch jobs. This way you can request a smaller set of resources and less time, which should decrease your wait time in the queue, and you'll be able to iterate quickly on code refactoring in interactive mode. When you've got everything working well on smaller problems, you can submit batch jobs for larger ones. Even after doing the above, it's often wise to start with a small batch job first to make sure that you don't have any errors in your batch script.
:::

## Have a Backup Plan
@@ -75,76 +73,17 @@ It is also important to remember that your access to the shared HPC system will
In all these cases, please contact [hpc@nyu.edu](mailto:hpc@nyu.edu) if you have questions about data transfer and storage for the volumes of data you will be using.

:::warning[Your Data Is Your Responsibility]
Make sure you understand what the backup policy is on the system you are using and what implications this has for your work if you lose your data on the system. Plan your own personal backups of critical data and how you will transfer data off the system throughout the project.
:::

## Transferring Data

The most important point about transferring data responsibly on Greene is to be sure to use Greene Data Transfer Nodes (gDTNs) or other options like [Globus](../03_storage/04_globus.md). Please see [Data Transfers](../03_storage/03_data_transfers.md) for details. By doing this you'll help to keep the login nodes responsive for all users.

Being efficient in *how* you transfer data on the gDTNs is also important. It will not only reduce the load on the gDTNs, but also save you time. Be sure to archive and compress your files if possible with `tar` and `gzip`. This will remove the overhead of transferring many individual files and shrink the size of the transfer. Please see [Transferring Files with Remote Computers](./07_transferring_files_remote.mdx) for details.
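As a minimal local sketch of the archive-then-compress step (the file names are invented for the demo, and compression ratios will vary with your data):

```shell
# Generate some highly compressible text data.
mkdir -p tx_demo
seq 1 100000 > tx_demo/numbers.txt

tar -cf tx_demo.tar tx_demo               # plain (uncompressed) archive
gzip -c tx_demo.tar > tx_demo.tar.gz      # -c writes to stdout, keeping the original
wc -c tx_demo.tar tx_demo.tar.gz          # compare sizes: the .gz is far smaller
```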

:::tip[Key Points]
- Be careful how you use the login node.
- Your data on the system is your responsibility.
- Always use Greene Data Transfer Nodes (gDTNs) for large data transfers.
- Plan and test large data transfers.
- It is often best to convert many files to a single archive file before transferring.
:::
