Conversation

@Sazzach
Member

@Sazzach Sazzach commented Feb 15, 2025

The first commit in this PR adds a way for an applet to override the default USB transfer size / the number of transfers queued on a DirectDemultiplexerInterface at a time.

The motivation for this change is that while the default values _packets_per_xfer and _xfers_per_queue work well for most applets, they aren't ideal for applets where backpressure results in the loss of data, such as the analyzer applet. In this case, I found that requesting larger transfers sized to hold ~5-10 ms of data allowed for a higher throughput without in_fifo overflows.
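For a rough sense of scale, the buffering a given transfer configuration provides can be estimated from the packet size and the data rate. A sketch (the names are illustrative, not taken from the Glasgow codebase; 512 bytes is the USB 2.0 high-speed bulk maximum packet size):

```python
# Rough sizing helper: how many milliseconds of data do the queued read
# transfers hold at a given rate? 512 bytes is the USB 2.0 high-speed
# bulk maximum packet size.
PACKET_SIZE = 512

def buffered_ms(packets_per_xfer, xfers_per_queue, rate_mbps):
    """Milliseconds of buffering provided by the transfer queue at rate_mbps (megabits/s)."""
    buffer_bytes = PACKET_SIZE * packets_per_xfer * xfers_per_queue
    bytes_per_ms = rate_mbps * 1e6 / 8 / 1e3
    return buffer_bytes / bytes_per_ms
```

With the --rpkts 512 --rxfers 8 setting used in the measurements below, this works out to a 2 MiB queue, roughly 56 ms of headroom at 300 Mbps.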

The second commit in this PR contains an applet called fixed-throughput. This applet outputs data from the device at a fixed rate, and checks for in_fifo overflows. I used it to test the effect of different combinations of transfer size and transfers queued.

The code block below shows the results of some measurements I took. With the default settings, and a rate of 100 Mbps, the fixed-throughput applet overflowed after ~6 seconds. By changing the settings, the fixed-throughput applet was able to run for ~40 minutes at a rate of 300 Mbps without overflows before I terminated it.

> glasgow run fixed-throughput -h
usage: glasgow run fixed-throughput [-h] [--rpkts READ-PACKETS] [--rxfers READ-TRANSFERS]
                                    rate

Evaluate fixed read throughput performance and check for in FIFO overflows

positional arguments:
  rate                     data rate in Mbps

options:
  -h, --help               show this help message and exit

run arguments:
  --rpkts READ-PACKETS     How many packets per read transfer
  --rxfers READ-TRANSFERS  How many read transfers to have active at a time

> glasgow run fixed-throughput 50
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
^Coverflow=False
Elapsed: 68.7474992275238
Mbps: 49.50030727281442
Expected Mbps: 49.5

> glasgow run fixed-throughput 100
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
overflow=True
Elapsed: 5.651003122329712
Mbps: 100.1768988169691
Expected Mbps: 100.5

> glasgow run fixed-throughput --rpkts 512 --rxfers 8 100
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
^Coverflow=False
Elapsed: 64.04957938194275
Mbps: 100.48714683385604
Expected Mbps: 100.5

> glasgow run fixed-throughput --rpkts 512 --rxfers 8 300
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
^Coverflow=False
Elapsed: 2534.1230137348175
Mbps: 300.01638171759214
Expected Mbps: 300.0

…ult USB transfer settings.

Add four keyword arguments to DirectDemultiplexerInterface's constructor:
    read_packets_per_xfer, read_xfers_per_queue,
    write_packets_per_xfer, write_xfers_per_queue

An applet can set these values through device.demultiplexer.claim_interface's **kwargs.
…B settings on applets that require fixed throughput without overflows.
@Sazzach Sazzach requested a review from whitequark as a code owner February 15, 2025 23:11
Member

@whitequark whitequark left a comment

Thank you for the investigation. It's been a while since I looked at this code in depth, and I'm unsure if this is really the best way to solve it, especially given the analyzer's many other deficiencies.

At this moment other commitments do not leave me enough time to discuss a better approach in depth. Are you in the Matrix channel? If so we should talk about it at some point when I do have more time and work out a way forward.

@whitequark
Member

In particular, it's not clear to me that we can't just change the defaults so that any applet gets a buffer holding 5-10 ms of data. This does go against the DMA memory quotas in the Linux kernel, but:

  1. I think the quotas may have been changed at some time between when I first wrote that and today, and
  2. 16 MB is a silly amount of memory and we could also insist (by way of a warning) on changing the quota to be higher if it is too low.

@whitequark
Member

(Also, I think the fixed-throughput applet should perhaps be a part of the benchmark applet instead? It does perform a benchmark, and I don't like adding more top-level names unless there's good justification.)

@Sazzach
Member Author

Sazzach commented Feb 16, 2025

Thanks for the fast response!

Thank you for the investigation. It's been a while since I looked at this code in depth, and I'm unsure if this is really the best way to solve it, especially given the analyzer's many other deficiencies.

No problem! It was time well spent, I learned a number of new things about USB in the process.

At this moment other commitments do not leave me enough time to discuss a better approach in depth. Are you in the Matrix channel? If so we should talk about it at some point when I do have more time and work out a way forward.

No worries about not discussing more now. I'm not on Matrix, but I am on the 1BitSquared Discord as "SamIAm", so I see Matrix messages via the bridge.

In particular, it's not clear to me that we can't just change the defaults so that any applet gets a buffer holding 5-10 ms of data. This does go against the DMA memory quotas in the Linux kernel, but:

A potential default change sounds good to me. I think one tradeoff could be latency for applets that don't use auto-flush or explicit flushing.

1. I think the quotas may have been changed at some time between when I first wrote that and today, and

2. 16 MB is a silly amount of memory and we could also insist (by way of a warning) on changing the quota to be higher if it is too low.

I'm sad about the 16 MB limit. It's the default on my system (Ubuntu 24.04, kernel 6.8). It's easy enough to increase by running sudo modprobe usbcore usbfs_memory_mb=, but it seems ridiculous for it to be so low, especially with USB cameras etc.
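For reference, on a running kernel the quota can also be inspected and changed through sysfs without reloading the module, assuming the parameter is writable on your kernel (the value to use is up to the user; 0 is documented to disable the limit):

```shell
# Inspect the current usbfs DMA memory quota, in MiB (16 is the common default):
cat /sys/module/usbcore/parameters/usbfs_memory_mb

# Raise it for the running kernel; 0 disables the limit entirely:
echo 0 | sudo tee /sys/module/usbcore/parameters/usbfs_memory_mb
```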

(Also, I think the fixed-throughput applet should perhaps be a part of the benchmark applet instead? It does perform a benchmark, and I don't like adding more top-level names unless there's good justification.)

Moving the fixed-throughput applet into the benchmark applet makes sense and I'm happy to make that change.

@whitequark
Member

whitequark commented Feb 16, 2025

Thank you for the responsiveness! I really value this kind of contribution.

I think one tradeoff could be latency for applets that don't use auto-flush or explicit flushing.

My view is that any properly designed applet should be using either auto-flush or explicit flushing. Doing neither and just relying on data 'falling off the FIFO' happens to work, but it is expected to introduce undesirable performance characteristics.

I'm sad about the 16 MB limit. Its the default on my system (Ubuntu 24.04, Kernel 6.8).

Ah, that's upsetting. I really don't want to get involved with LKML (especially not after the latest events) so I guess the default gets to stay what it is. Do you want to explore adding a warning asking for this limit to be raised? I think once you exceed a limit you get a certain error from libusb and we could show the warning once we hit that. I recall this error to be annoyingly ambiguous but we could do what we can, now that we collectively know that it is worth bumping against the limit.
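For what it's worth, on Linux exceeding the quota surfaces as ENOMEM from usbfs, which libusb reports as LIBUSB_ERROR_NO_MEM. A minimal sketch of what the warning logic might look like (names hypothetical, not from the Glasgow codebase):

```python
import errno
from typing import Optional

# Assumed wording; the real warning text would be written for the project.
USBFS_QUOTA_HINT = (
    "USB transfer submission failed with ENOMEM; on Linux this often means "
    "the usbfs DMA memory quota (usbcore.usbfs_memory_mb, 16 MiB by default) "
    "is too low for the requested transfer queue. Consider raising it."
)

def usb_error_hint(error_number: int) -> Optional[str]:
    """Return a user-facing hint for errors worth special-casing, else None."""
    if error_number == errno.ENOMEM:
        return USBFS_QUOTA_HINT
    return None
```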

@Sazzach
Member Author

Sazzach commented Feb 16, 2025

Ah, that's upsetting. I really don't want to get involved with LKML (especially not after the latest events) so I guess the default gets to stay what it is. Do you want to explore adding a warning asking for this limit to be raised? I think once you exceed a limit you get a certain error from libusb and we could show the warning once we hit that. I recall this error to be annoyingly ambiguous but we could do what we can, now that we collectively know that it is worth bumping against the limit.

Sure, I'll look into detecting the error and emitting a warning when the limit is exceeded. I don't remember what the error was offhand, but I definitely ran into it a few times when I made a mistake calculating the request size.

@whitequark
Member

whitequark commented Feb 16, 2025

Thanks, perfect! If you slice your changes into individual PRs I will likely be able to review and merge them faster (the new benchmark for example could be merged right away as far as I can tell from a very quick glance).

@whitequark
Member

@Sazzach Have you had the chance to take another look at this PR?

@Sazzach
Member Author

Sazzach commented Apr 27, 2025

I have, but I also have been distracted/sidetracked and let this slide. I'm sorry.

With respect to adding a fixed source mode to the benchmark applet, the existence of a fixed source mode implies potential fixed sink/loopback modes. Instead of adding three additional modes, I thought it'd make sense to add an optional --rate RATE argument, where specifying --rate 10 would set source/sink/loopback to a fixed rate of 10 MiB/s and check for over/underflow. I also have --rate in a group with --rate-search. --rate-search does a binary search on the rate and reports the highest successful rate.
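The search itself is simple; a sketch of the idea, where run_at is a stand-in for one benchmark pass that returns True when no over/underflow occurred (assuming failure is monotonic in rate):

```python
def rate_search(run_at, lo=1, hi=1024):
    """Binary-search the highest integer rate (MiB/s) for which run_at(rate)
    succeeds. Assumes monotonicity: once a rate fails, all higher rates fail."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if run_at(mid):
            best = mid       # mid works; try higher
            lo = mid + 1
        else:
            hi = mid - 1     # mid overflowed; try lower
    return best

# e.g. a hypothetical device that keeps up only below 42 MiB/s:
# rate_search(lambda r: r < 42)  # -> 41
```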

I'm still working on this. My status is --rate / --rate-search for source mode works, but not with sink/loopback.

There are also some minor arguments I'd like to add. These are --duration DURATION and --until-error which would be in a group with --count.

To summarize, arguments before:

usage: glasgow run benchmark [-h] [--port SPEC] [-c COUNT] [MODE ...]

Evaluate performance of the host communication link.

Benchmark modes:
  * source: device emits an endless stream of data via one FIFO, host validates (simulates a
    logic analyzer subtarget)
  * sink: host emits an endless stream of data via one FIFO, device validates (simulates an
    I2S protocol subtarget)
  * loopback: host emits an endless stream of data via one FIFOs, device mirrors it all back,
    host validates (simulates an SPI protocol subtarget)
  * latency: host sends one packet, device sends it back, time until the packet is received
    back on the host is measured (simulates cases where a transaction with the DUT relies on
    feedback from the host; also useful for comparing different usb stacks or usb data paths
    like hubs or network bridges)

options:
  -h, --help               show this help message and exit

build arguments:
  --port SPEC              bind the applet to port SPEC (default: AB)

run arguments:
  -c COUNT, --count COUNT  transfer COUNT bytes (default: 8388608)
  MODE                     run benchmark mode MODE (default: source sink loopback latency)

Proposed arguments after:

usage: glasgow run benchmark [-h] [--port SPEC] [-c COUNT | -d DURATION | --until-error]
                             [--rate RATE | --rate-search]
                             [MODE ...]

Evaluate performance of the host communication link.

Benchmark modes:
  * source: device emits an endless stream of data via one FIFO, host validates (simulates a
    logic analyzer subtarget)
  * sink: host emits an endless stream of data via one FIFO, device validates (simulates an
    I2S protocol subtarget)
  * loopback: host emits an endless stream of data via one FIFOs, device mirrors it all back,
    host validates (simulates an SPI protocol subtarget)
  * latency: host sends one packet, device sends it back, time until the packet is received
    back on the host is measured (simulates cases where a transaction with the DUT relies on
    feedback from the host; also useful for comparing different usb stacks or usb data paths
    like hubs or network bridges)

options:
  -h, --help                show this help message and exit

build arguments:
  --port SPEC               bind the applet to port SPEC (default: AB)

run arguments:
  -c COUNT, --count COUNT   transfer COUNT bytes (default: 8388608)
  -d DURATION, --duration DURATION
                            transfer for DURATION
  --until-error             run mode(s) until an error occurs (or CTRL-C)
  --rate RATE               set device to source/sink data at constant RATE MiB/s and report
                            FIFO over/underflows
  --rate-search             search for the fastest constant rate that data can be
                            sourced/sunk without a FIFO over/underflow
  MODE                      run benchmark mode MODE (default: source sink loopback latency)

Let me know if this sounds good to you or if you would like me to do things differently. I'll prioritize finishing up my additions this week (hopefully today/tomorrow, but I don't want to overpromise).

I haven't looked into the libusb error for insufficient usbfs_memory_mb yet. I should be able to once I finish up my benchmark PR.

I have thoughts about how to determine potential new defaults for _packets_per_xfer and _xfers_per_queue. I'd like to talk more once I finish my benchmark changes because it'd be nice to have measurements to inform any decisions.

@whitequark
Member

Right, that sounds fine to me. The details seem a little complicated but it seems like the problem we have on our hands is a little complicated too, so that's fine. Happy to do a review once you're done.

@whitequark
Member

Have you had a chance to revisit this? In the meantime I completely redesigned the demultiplexer (it's now part of HardwareAssembly) and raised buffer limits. I'm not sure if the premise of this issue is still important given these higher limits.

@Sazzach
Member Author

Sazzach commented May 31, 2025

I've had a chance; I'm close to finishing up my benchmark applet changes. I'm sorry for being slow on this. My current goal is to finish up tomorrow.

As an aside, I'm excited by the changes to the interface. There's stuff I've been interested in doing that will be more straightforward now. I'm also looking forward to using your new BSDL parser with the JTAG applet.

Thanks!

@whitequark
Member

As an aside, I'm excited by the changes to the interface. There's stuff I've been interested in doing that will be more straightforward now. I'm also looking forward to using your new BSDL parser with the JTAG applet.

I'm very happy to hear that! I would also love to hear about your experiences using JTAG & BSDL parser.

@whitequark
Member

whitequark commented Jul 6, 2025

@Sazzach Have you been able to give this PR another look?

@whitequark whitequark added the waiting-on-author Status: Waiting on issue reporter or PR author label Jul 6, 2025
@Sazzach
Member Author

Sazzach commented Jul 6, 2025

I have! I'm close to being done with adding a fixed rate modifier to the benchmark applet + some other changes.

A summary of the changes:

  • Add a BenchmarkInterface class / move the code for the various modes into it
  • Add a fixed rate modifier that can apply to source/sink/loopback
  • Add a duration option

What I (roughly) have left to do is:

  • Polish various pieces / fix a few bugs
  • Rebase into a few sensible commits

One question I have is: how does the add_inout_pipe argument out_buffer_size work? I tried setting it to limit the buffer size for sink when running a long benchmark, but setting it to ~100 MB resulted in much more RAM being used before I killed the process. (I'm not sure if this is a bug or me misunderstanding how to use it.)

I just pushed up my WIP changes here. If you have a moment, could you take a look and let me know if I've gone in a direction you don't like?

I'm sorry for taking so long on this.

@whitequark
Member

I'll take a look shortly, but to answer this:

One question I have is: how does the add_inout_pipe argument out_buffer_size work? I tried setting it to limit the buffer size for sink when running a long benchmark, but setting it to ~100 MB resulted in much more RAM being used before I killed the process. (I'm not sure if this is a bug or me misunderstanding how to use it.)

This is quite possibly just a bug. I'm not sure it's been tested at all.

@whitequark whitequark added waiting-on-review Status: Waiting for review and removed waiting-on-author Status: Waiting on issue reporter or PR author labels Jul 6, 2025
@whitequark
Member

whitequark commented Jul 6, 2025

@Sazzach I'll give it another look once I'm more awake, but my main comment so far is that the code is quite hard to follow. There are a lot of different modes that combine with each other, which results in a lot of conditionals sometimes updating similarly named variables, duplicated code (rate, for example, is configured three times in exactly the same way), and in the end it's difficult to understand how the whole thing works. Adding three more modes might well be easier to review and maintain.

I think we need to decide on which figures of merit we measure and structure the code towards that, without so much of a combinational explosion. Right now this is:

  • IN/OUT continuous throughput (measuring mostly host USB stack throughput)
  • IN-to-OUT continuous throughput (measuring mostly arbiter fairness)
  • IN-to-OUT latency of 1-packet transfers (measuring mostly host USB stack latency)

What are the additional figures of merit we're measuring? I think it's something like p99 latency, but expressed and measured in an indirect way. Can we do this better?
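One direct alternative would be timestamping individual transfers and reporting tail latency from the samples; a nearest-rank percentile is enough for that. A sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples
    at or below it. Enough for reporting p50/p99 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```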

@whitequark
Member

Some examples of what I think is difficult to review and maintain:

  1. Right now, the existing stimuli consist of a 3-stream chain: source (pipe or LFSR), optional rate limiter, sink (pipe or LFSR). This could be modelled on the gateware side by instantiating five components with streams at their boundaries and adding a few conditionals switching what gets connected to what. Instead, right now we have no meaningful stream use at all, even lines 119-123, which could've been a single connect() call.
  2. On the host side, something similar is going on: rather than the pipeline being configured as three interconnected components, the rate limiter configuration code is copied three times and the time monitoring code twice.

There are also relatively minor things like overuse of tuples (there's even a nested tuple in one place; asyncio.TaskGroup is probably better than asyncio.gather here if only for that reason), and a very confusing calculation in _mibps_to_rate that hardcodes the reference clock rate (it's different on different revisions).
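To illustrate point 2, the rate configuration could live in a single component shared by source/sink/loopback. A host-side sketch (names hypothetical; a real implementation would pace asyncio writes rather than use an injected clock):

```python
class RateLimiter:
    """Token-bucket pacing shared by all benchmark modes, so the rate
    configuration exists in exactly one place. The clock is injected
    (e.g. time.monotonic in production) to keep the logic testable."""

    def __init__(self, rate_bytes_per_s, now):
        self._rate = rate_bytes_per_s
        self._now = now
        self._last = now()
        self._budget = 0.0

    def take(self, limit):
        """Return how many bytes (at most `limit`) may be transferred now."""
        current = self._now()
        self._budget += (current - self._last) * self._rate
        self._last = current
        grant = min(limit, int(self._budget))
        self._budget -= grant
        return grant
```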

@Sazzach
Member Author

Sazzach commented Jul 13, 2025

Thanks for the fast review! I'm sorry for not responding sooner.

my main comment so far is that the code is quite hard to follow. There are a lot of different modes that combine with each other
...
Adding three more modes might well be easier to review and maintain.

That's a good point. My original goal was to have a simple modifier that applied to the existing modes. In retrospect, this didn't work out well.

I think we need to decide on which figures of merit we measure and structure the code towards that, without so much of a combinational explosion.
...
What are the additional figures of merit we're measuring? I think it's something like p99 latency, but expressed and measured in an indirect way. Can we do this better?

Ignoring adding additional figures for now, I think that adding an interface to the benchmark applet would enable easier testing of how multiple applets running at the same time affect the overall throughput/latency.

Some example uses could be:

  • Test the fairness of the arbiter with multiple applets
  • Do more in-depth analysis of latency, both in general and with other applets running

I think that this would be fairly straightforward and wouldn't require any gateware changes, but would require thought as to what the various modes return (a dataclass?). (I agree that my current use of tuples here is bad.)

Thinking about adding additional figures now:

I've been interested in fixed throughput because it gives a rough idea of how much data an applet can produce without any data being dropped. I think this isn't a concern for most applets, so I understand if it doesn't make sense to get it added into the main project. Also, from my experiments it seems like the main factors for fixed throughput are the number of transfers queued and their size, along with how long garbage collection takes. This is trivial to measure and is reflected in latency measurements with enough samples to cross the GC threshold.

I could see a rate setting being used for experiments such as testing whether an assembly of applets that each need X bandwidth can be run at the same time without sometimes dropping data (such as multiple UARTs at a fast baud rate).

I'm down to:

    1. make a PR for adding an interface class to the benchmark applet
    2. in a later PR, add modes with a limited rate (in a better way this time)

I'm also down to drop this if it's not something you want for the project.

Thanks again!

@whitequark
Member

I'm down to:

    1. make a PR for adding an interface class to the benchmark applet
    2. in a later PR, add modes with a limited rate (in a better way this time)

Let's do this, sounds good to me!

@whitequark whitequark added waiting-on-author Status: Waiting on issue reporter or PR author and removed waiting-on-review Status: Waiting for review labels Jul 13, 2025