docs/hpc/13_tutorial_intro_hpc/07_transferring_files_remote.mdx
+64 -1 (64 additions, 1 deletion)
@@ -242,10 +242,73 @@ To connect to the cluster, we’ll just need to enter our credentials at the top
242
242
243
243
Hit “Quickconnect” to connect. You should see your remote files appear on the right hand side of the screen. You can drag-and-drop files between the left (local) and right (remote) sides of the screen to transfer files.
244
244
245
+
:::tip
245
246
Finally, if you need to move large files (typically larger than a gigabyte) from one remote computer to another remote computer, SSH in to the computer hosting the files and use `scp` or `rsync` to transfer over to the other. This will be more efficient than using FileZilla (or related applications) that would copy from the source to your local machine, then to the destination machine.
247
+
:::
248
+
249
+
## Limiting Factors of Data Transfer Speed
250
+
Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going. The components between your data’s source and destination have varying levels of performance, and in particular, may have different capabilities with respect to *bandwidth* and *latency*.
251
+
252
+
**Bandwidth** is generally the raw amount of data per unit time a device is capable of transmitting or receiving. It’s a common and generally well-understood metric.
253
+
254
+
**Latency** is a bit more subtle. For data transfers, it may be thought of as the amount of time it takes to get data out of storage and into a transmittable form. Latency issues are the reason it’s advisable to execute data transfers by moving a small number of large files, rather than the converse.
255
+
256
+
Some of the key components and their associated issues are:
257
+
258
+
- **Disk speed**: File systems on HPC systems are often highly parallel, consisting of a very large number of high performance disk drives. This allows them to support a very high data bandwidth. Unless the remote system has a similar parallel file system you may find your transfer speed limited by disk performance at that end.
259
+
- **Meta-data performance**: Meta-data operations such as opening and closing files or listing the owner or size of a file are much less parallel than read/write operations. If your data consists of a very large number of small files you may find your transfer speed is limited by meta-data operations. Meta-data operations performed by other users of the system can also interact strongly with those you perform so reducing the number of such operations you use (by combining multiple files into a single file) may reduce variability in your transfer rates and increase transfer speeds.
260
+
- **Network speed**: Data transfer performance can be limited by network speed. More importantly it is limited by the slowest section of the network between source and destination. If you are transferring to your laptop/workstation, this is likely to be its connection (either via LAN or WiFi).
261
+
- **Firewall speed**: Most modern networks are protected by some form of firewall that filters out malicious traffic. This filtering has some overhead and can result in a reduction in data transfer performance. The needs of a general purpose network that hosts email/web-servers and desktop machines are quite different from a research network that needs to support high volume data transfers. If you are trying to transfer data to or from a host on a general purpose network you may find the firewall for that network will limit the transfer rate you can achieve.
262
+
263
+
As mentioned above, if you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger *archive* file for long term storage and transfer. A single large file makes more efficient use of the file system and is easier to move, copy and transfer because significantly fewer metadata operations are required. Archive files can be created using tools like `tar` and `zip`. We have already met `tar` when we talked about data transfer earlier.
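As a minimal sketch of the packing step described above, the following creates a small throwaway directory, bundles it into a single gzip-compressed archive with `tar`, and lists the archive to confirm that every file made it in. The directory name `demo_data` and the file names are hypothetical placeholders, not part of the tutorial's dataset.

```bash
# Create a throwaway directory with a few small files (hypothetical names).
workdir=$(mktemp -d)
mkdir "$workdir/demo_data"
for i in 1 2 3; do
    echo "sample $i" > "$workdir/demo_data/file_$i.txt"
done

# Pack the whole directory into one compressed archive:
#   -c create, -z gzip-compress, -f archive file name,
#   -C change into the given directory before adding files
tar -czf "$workdir/demo_data.tar.gz" -C "$workdir" demo_data

# List the archive contents and count the .txt files it holds.
file_count=$(tar -tzf "$workdir/demo_data.tar.gz" | grep -c '\.txt$')
echo "archive holds $file_count files"
```

The resulting single file can then be handed to `scp` or `rsync` in place of the directory, so the transfer tool deals with one file rather than many.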
264
+
265
+
<div style={{textAlign: 'center'}}>
266
+

267
+
268
+
*Schematic diagram of bandwidth and latency for disk and network I/O. Each of the components on the figure is connected by a blue line of width proportional to the interface bandwidth.<br />The small mazes at the link points illustrate the latency of the link, with more tortuous mazes indicating higher latency.*
269
+
</div>
270
+
271
+
<details>
272
+
<summary>
273
+
:::info[Consider the Best Way to Transfer Data]
274
+
If you are transferring large amounts of data you will need to think about what may affect your transfer performance. It is always useful to run some tests that you can use to extrapolate how long it will take to transfer your data.
275
+
276
+
Say you have a “data” folder containing 10,000 or so files, a healthy mix of small and large ASCII and binary data. Which of the following would be the best way to transfer them to Greene?
277
+
```bash
278
+
[user@laptop ~]$ scp -r data NYUNetID@greene.hpc.nyu.edu:~/
279
+
```
280
+
```bash
281
+
[user@laptop ~]$ rsync -ra data NYUNetID@greene.hpc.nyu.edu:~/
282
+
```
283
+
```bash
284
+
[user@laptop ~]$ rsync -raz data NYUNetID@greene.hpc.nyu.edu:~/
```

1. `scp` will recursively copy the directory. This works, but without compression.
303
+
1. `rsync -ra` works like `scp -r`, but preserves file information like creation times. This is marginally better.
304
+
1. `rsync -raz` adds compression, which will save some bandwidth. If you have a strong CPU at both ends of the line, and you’re on a slow network, this is a good choice.
305
+
1. This command first uses `tar` to merge everything into a single file, then `rsync -z` to transfer it with compression. If you have a large number of files, metadata overhead can hamper your transfer, so this is a good idea.
306
+
1. This command uses `tar -z` to compress the archive, then `rsync` to transfer it. This may perform similarly to the command directly above, but in most cases (for large datasets), it’s the best combination of high throughput and low latency (making the most of your time and network connection).
307
+
:::
308
+
</details>
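The extrapolation suggested above can be sketched with simple shell arithmetic: time the transfer of a small sample, then scale the observed rate to the full dataset. All numbers below are made-up placeholders, not measurements from Greene, and the linear scaling is an assumption that ignores per-file metadata overhead.

```bash
# Hypothetical measurements from a small test transfer.
sample_mb=500        # size of the test sample, in MB
sample_seconds=40    # wall-clock time the test transfer took
total_mb=200000      # size of the full dataset, in MB (about 200 GB)

# Estimate the total time, assuming the transfer rate stays constant.
# Multiply before dividing so integer truncation loses less precision.
estimated_seconds=$(( total_mb * sample_seconds / sample_mb ))
estimated_hours=$(( estimated_seconds / 3600 ))
estimated_minutes=$(( (estimated_seconds % 3600) / 60 ))

echo "Estimated transfer time: ${estimated_hours}h ${estimated_minutes}m"
```

An estimate like this is only a rough guide; a dataset made of many small files will typically take longer than the sample rate suggests.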
246
309
247
310
:::tip[Key Points]
248
311
- `wget` and `curl -O` download a file from the internet.
249
312
- `scp` and `rsync` transfer files to and from your computer.
250
313
- You can use an SFTP client like FileZilla to transfer files through a GUI.
docs/hpc/13_tutorial_intro_hpc/10_using_resources_responsibly.mdx
+6 -67 (6 additions, 67 deletions)
@@ -20,7 +20,7 @@ The widespread usage of scheduling systems where users submit jobs on HPC resour
20
20
## Be Kind to the Login Nodes
21
21
The login node is often busy managing all of the logged in users, creating and editing files and compiling software. If the machine runs out of memory or processing capacity, it will become very slow and unusable for everyone. While the machine is meant to be used, be sure to do so responsibly – in ways that will not adversely impact other users’ experience.
22
22
23
-
Login nodes are always the right place to launch jobs, but data transfers should be done on the Greene Data Transfer Nodes (gDTNs). Please see more about gDTNs at [Data Transfers](../03_storage/03_data_transfers.md). Similarly, computationally intensive tasks should all be done on compute nodes. This refers to not just computational analysis/research tasks, but also to computationally intensive software installations and similar tasks.
23
+
Login nodes are always the right place to launch jobs, but data transfers should be done on the Greene Data Transfer Nodes (gDTNs). Please see more about gDTNs at [Data Transfers](../03_storage/03_data_transfers.md). Similarly, computationally intensive tasks should all be done on compute nodes. This refers to not just computational analysis/research tasks, but also to processor intensive software installations and similar tasks.
24
24
25
25
:::warning[Login Nodes Are a Shared Resource]
26
26
Remember, the login node is shared with all other users and your actions could cause issues for other people. Think carefully about the potential implications of issuing commands that may use large amounts of resource.
@@ -60,9 +60,7 @@ Remember that you are generally charged for usage on shared systems. A simple mi
60
60
61
61
62
62
:::tip[Test Job Submission Scripts That Use Large Amounts of Resources]
63
-
We suggest that you test a subset of you data or analysis on an interactive node prior to running full batch jobs. This way you can request a smaller set of resources and time which should decrease your time waiting in the queue, and you'll be able to quickly iterate code refactoring in interactive mode. When you've got everything working well on smaller problems you can submit batch jobs for larger ones.
64
-
65
-
Even after doing the above it's often wise to start with a small batch job first to make sure that you don't have any errors in your batch script.
63
+
We suggest that you test a subset of your data or analysis on an interactive node prior to running full batch jobs. This way you can request a smaller set of resources and time, which should decrease your wait time in the queue, and you'll be able to quickly iterate on code refactoring in interactive mode. When you've got everything working well on smaller problems you can submit batch jobs for larger ones. Even after doing the above it's often wise to start with a small batch job first to make sure that you don't have any errors in your batch script.
66
64
:::
67
65
68
66
## Have a Backup Plan
@@ -75,76 +73,17 @@ It is also important to remember that your access to the shared HPC system will
75
73
In all these cases, please contact [hpc@nyu.edu](mailto:hpc@nyu.edu) if you have questions about data transfer and storage for the volumes of data you will be using.
76
74
77
75
:::warning[Your Data Is Your Responsibility]
78
-
Make sure you understand what the backup policy is on the file systems on the system you are using and what implications this has for your work if you lose your data on the system. Plan your backups of critical data and how you will transfer data off the system throughout the project.
76
+
Make sure you understand what the backup policy is on the system you are using and what implications this has for your work if you lose your data on the system. Plan your own personal backups of critical data and how you will transfer data off the system throughout the project.
79
77
:::
80
78
81
79
## Transferring Data
82
-
The most important point about transferring data responsibly on Green is to be sure to use Greene Date Transfer Nodes (gDTNs) or other options like [Globus](../03_storage/04_globus.md). Please see [Data Transfers](../03_storage/03_data_transfers.md) for details. By doing this you'll help to keep the login nodes responsive for all users.
83
-
84
-
Being efficient in *how* you transfer data on the gDTNs is also important. It will not only reduce the load on the gDTNs, but also save your time. Data transfer speed may be limited by many different factors so the best data transfer mechanism to use depends on the type of data being transferred and where the data is going.
85
-
86
-
The components between your data’s source and destination have varying levels of performance, and in particular, may have different capabilities with respect to bandwidth and latency.
87
-
88
-
**Bandwidth** is generally the raw amount of data per unit time a device is capable of transmitting or receiving. It’s a common and generally well-understood metric.
89
-
90
-
**Latency** is a bit more subtle. For data transfers, it may be thought of as the amount of time it takes to get data out of storage and into a transmittable form. Latency issues are the reason it’s advisable to execute data transfers by moving a small number of large files, rather than the converse.
91
-
92
-
Some of the key components and their associated issues are:
93
-
94
-
- **Disk speed**: File systems on HPC systems are often highly parallel, consisting of a very large number of high performance disk drives. This allows them to support a very high data bandwidth. Unless the remote system has a similar parallel file system you may find your transfer speed limited by disk performance at that end.
95
-
- **Meta-data performance**: Meta-data operations such as opening and closing files or listing the owner or size of a file are much less parallel than read/write operations. If your data consists of a very large number of small files you may find your transfer speed is limited by meta-data operations. Meta-data operations performed by other users of the system can also interact strongly with those you perform so reducing the number of such operations you use (by combining multiple files into a single file) may reduce variability in your transfer rates and increase transfer speeds.
96
-
- **Network speed**: Data transfer performance can be limited by network speed. More importantly it is limited by the slowest section of the network between source and destination. If you are transferring to your laptop/workstation, this is likely to be its connection (either via LAN or WiFi).
97
-
- **Firewall speed**: Most modern networks are protected by some form of firewall that filters out malicious traffic. This filtering has some overhead and can result in a reduction in data transfer performance. The needs of a general purpose network that hosts email/web-servers and desktop machines are quite different from a research network that needs to support high volume data transfers. If you are trying to transfer data to or from a host on a general purpose network you may find the firewall for that network will limit the transfer rate you can achieve.
98
-
99
-
As mentioned above, if you have related data that consists of a large number of small files it is strongly recommended to pack the files into a larger *archive* file for long term storage and transfer. A single large file makes more efficient use of the file system and is easier to move, copy and transfer because significantly fewer metadata operations are required. Archive files can be created using tools like `tar` and `zip`. We have already met `tar` when we talked about data transfer earlier.
100
-
101
-
<div style={{textAlign: 'center'}}>
102
-

80
+
The most important point about transferring data responsibly on Greene is to be sure to use Greene Data Transfer Nodes (gDTNs) or other options like [Globus](../03_storage/04_globus.md). Please see [Data Transfers](../03_storage/03_data_transfers.md) for details. By doing this you'll help to keep the login nodes responsive for all users.
103
81
104
-
*Schematic diagram of bandwidth and latency for disk and network I/O. Each of the components on the figure is connected by a blue line of width proportional to the interface bandwidth.<br />The small mazes at the link points illustrate the latency of the link, with more tortuous mazes indicating higher latency.*
105
-
</div>
106
-
107
-
<details>
108
-
<summary>
109
-
:::info[Consider the Best Way to Transfer Data]
110
-
If you are transferring large amounts of data you will need to think about what may affect your transfer performance. It is always useful to run some tests that you can use to extrapolate how long it will take to transfer your data.
111
-
112
-
Say you have a “data” folder containing 10,000 or so files, a healthy mix of small and large ASCII and binary data. Which of the following would be the best way to transfer them to Greene?
113
-
```bash
114
-
[user@laptop ~]$ scp -r data NYUNetID@greene.hpc.nyu.edu:~/
115
-
```
116
-
```bash
117
-
[user@laptop ~]$ rsync -ra data NYUNetID@greene.hpc.nyu.edu:~/
118
-
```
119
-
```bash
120
-
[user@laptop ~]$ rsync -raz data NYUNetID@greene.hpc.nyu.edu:~/
```

1. `scp` will recursively copy the directory. This works, but without compression.
139
-
1. `rsync -ra` works like `scp -r`, but preserves file information like creation times. This is marginally better.
140
-
1. `rsync -raz` adds compression, which will save some bandwidth. If you have a strong CPU at both ends of the line, and you’re on a slow network, this is a good choice.
141
-
1. This command first uses `tar` to merge everything into a single file, then `rsync -z` to transfer it with compression. If you have a large number of files, metadata overhead can hamper your transfer, so this is a good idea.
142
-
1. This command uses `tar -z` to compress the archive, then `rsync` to transfer it. This may perform similarly to the command directly above, but in most cases (for large datasets), it’s the best combination of high throughput and low latency (making the most of your time and network connection).
143
-
:::
144
-
</details>
82
+
Being efficient in *how* you transfer data on the gDTNs is also important. It will not only reduce the load on the gDTNs, but also save you time. Where possible, archive and compress your files with `tar` and `gzip`. This removes the overhead of transferring many separate files and shrinks the size of the transfer. Please see [Transferring Files with Remote Computers](./07_transferring_files_remote.mdx) for details.
145
83
146
84
:::tip[Key Points]
147
85
- Be careful how you use the login node.
148
86
- Your data on the system is your responsibility.
87
+
- Always use Greene Data Transfer Nodes (gDTNs) for large data transfers.
149
88
- Plan and test large data transfers.
150
89
- It is often best to convert many files to a single archive file before transferring.