Conversation

@Sazzach
Member

@Sazzach Sazzach commented Feb 15, 2025

The first commit in this PR adds a way for an applet to override the default USB transfer size / the number of transfers queued on a DirectDemultiplexerInterface at a time.

The motivation for this change is that while the default values _packets_per_xfer and _xfers_per_queue work well for most applets, they aren't ideal for applets where backpressure results in the loss of data, such as the analyzer applet. In this case, I found that requesting larger transfers sized to hold ~5-10 ms of data allowed for a higher throughput without in_fifo overflows.
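For a rough sense of scale, the buffering a given transfer configuration provides can be estimated from the packet size and the data rate. A sketch (the names are illustrative, not taken from the Glasgow codebase; 512 bytes is the USB 2.0 high-speed bulk maximum packet size):

```python
# Rough sizing helper: how many milliseconds of data do the queued read
# transfers hold at a given rate? 512 bytes is the USB 2.0 high-speed
# bulk maximum packet size.
PACKET_SIZE = 512

def buffered_ms(packets_per_xfer, xfers_per_queue, rate_mbps):
    """Milliseconds of buffering provided by the transfer queue at rate_mbps (megabits/s)."""
    buffer_bytes = PACKET_SIZE * packets_per_xfer * xfers_per_queue
    bytes_per_ms = rate_mbps * 1e6 / 8 / 1e3
    return buffer_bytes / bytes_per_ms
```

With the --rpkts 512 --rxfers 8 setting used in the measurements below, this works out to a 2 MiB queue, roughly 56 ms of headroom at 300 Mbps.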

The second commit in this PR contains an applet called fixed-throughput. This applet outputs data from the device at a fixed rate, and checks for in_fifo overflows. I used it to test the effect of different combinations of transfer size and transfers queued.

The code block below shows the results of some measurements I took. With the default settings, and a rate of 100 Mbps, the fixed-throughput applet overflowed after ~6 seconds. By changing the settings, the fixed-throughput applet was able to run for ~40 minutes at a rate of 300 Mbps without overflows before I terminated it.

> glasgow run fixed-throughput -h
usage: glasgow run fixed-throughput [-h] [--rpkts READ-PACKETS] [--rxfers READ-TRANSFERS]
                                    rate

Evaluate fixed read throughput performance and check for in FIFO overflows

positional arguments:
  rate                     data rate in Mbps

options:
  -h, --help               show this help message and exit

run arguments:
  --rpkts READ-PACKETS     How many packets per read transfer
  --rxfers READ-TRANSFERS  How many read transfers to have active at a time

> glasgow run fixed-throughput 50
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
^Coverflow=False
Elapsed: 68.7474992275238
Mbps: 49.50030727281442
Expected Mbps: 49.5

> glasgow run fixed-throughput 100
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
overflow=True
Elapsed: 5.651003122329712
Mbps: 100.1768988169691
Expected Mbps: 100.5

> glasgow run fixed-throughput --rpkts 512 --rxfers 8 100
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
^Coverflow=False
Elapsed: 64.04957938194275
Mbps: 100.48714683385604
Expected Mbps: 100.5

> glasgow run fixed-throughput --rpkts 512 --rxfers 8 300
I: g.device.hardware: device already has bitstream ID a341a413bc07ad859c6a3d118bf4a2a4
I: g.cli: running handler for applet 'fixed-throughput'
^Coverflow=False
Elapsed: 2534.1230137348175
Mbps: 300.01638171759214
Expected Mbps: 300.0

…ult USB transfer settings.

Add four keyword arguments to DirectDemultiplexerInterface's constructor:
    read_packets_per_xfer, read_xfers_per_queue,
    write_packets_per_xfer, write_xfers_per_queue

An applet can set these values through device.demultiplexer.claim_interface's **kwargs.
…B settings on applets that require fixed throughput without overflows.
@Sazzach Sazzach requested a review from whitequark as a code owner February 15, 2025 23:11
Member

@whitequark whitequark left a comment

Thank you for the investigation. It's been a while since I looked at this code in depth, and I'm unsure if this is really the best way to solve it, especially given the analyzer's many other deficiencies.

At this moment other commitments do not leave me enough time to discuss a better approach in depth. Are you in the Matrix channel? If so we should talk about it at some point when I do have more time and work out a way forward.

@whitequark
Member

In particular, it's not clear to me that we can't just change the defaults so that any applet gets a buffer holding 5-10 ms of data. This does go against the DMA memory quotas in the Linux kernel, but:

  1. I think the quotas may have been changed at some time between when I first wrote that and today, and
  2. 16 MB is a silly amount of memory and we could also insist (by way of a warning) on changing the quota to be higher if it is too low.

@whitequark
Member

(Also, I think the fixed-throughput applet should perhaps be a part of the benchmark applet instead? It does perform a benchmark, and I don't like adding more top-level names unless there's good justification.)

@Sazzach
Member Author

Sazzach commented Feb 16, 2025

Thanks for the fast response!

Thank you for the investigation. It's been a while since I looked at this code in depth, and I'm unsure if this is really the best way to solve it, especially given the analyzer's many other deficiencies.

No problem! It was time well spent, I learned a number of new things about USB in the process.

At this moment other commitments do not leave me enough time to discuss a better approach in depth. Are you in the Matrix channel? If so we should talk about it at some point when I do have more time and work out a way forward.

No worries about not discussing more now. I'm not on Matrix, but I am on the 1BitSquared Discord as "SamIAm", so I see Matrix messages via the bridge.

In particular, it's not clear to me that we can't just change the defaults so that any applet gets a buffer holding 5-10 ms of data. This does go against the DMA memory quotas in the Linux kernel, but:

A potential default change sounds good to me. I think one tradeoff could be latency for applets that don't use auto-flush or explicit flushing.

1. I think the quotas may have been changed at some time between when I first wrote that and today, and

2. 16 MB is a silly amount of memory and we could also insist (by way of a warning) on changing the quota to be higher if it is too low.

I'm sad about the 16 MB limit. It's the default on my system (Ubuntu 24.04, kernel 6.8). It's easy enough to increase by running sudo modprobe usbcore usbfs_memory_mb=, but it seems ridiculous for it to be so low, especially with USB cameras etc.
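For reference, on a running kernel the quota can also be inspected and changed through sysfs without reloading the module, assuming the parameter is writable on your kernel (the value to use is up to the user; 0 is documented to disable the limit):

```shell
# Inspect the current usbfs DMA memory quota, in MiB (16 is the common default):
cat /sys/module/usbcore/parameters/usbfs_memory_mb

# Raise it for the running kernel; 0 disables the limit entirely:
echo 0 | sudo tee /sys/module/usbcore/parameters/usbfs_memory_mb
```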

(Also, I think the fixed-throughput applet should perhaps be a part of the benchmark applet instead? It does perform a benchmark, and I don't like adding more top-level names unless there's good justification.)

Moving the fixed-throughput applet into the benchmark applet makes sense and I'm happy to make that change.

@whitequark
Member

whitequark commented Feb 16, 2025

Thank you for the responsiveness! I really value this kind of contribution.

I think one tradeoff could be latency for applets that don't use auto-flush or explicit flushing.

My view is that any properly designed applet should be using either auto-flush or explicit flushing. Doing neither and just relying on data 'falling off the FIFO' happens to work, but it is expected to introduce undesirable performance characteristics.

I'm sad about the 16 MB limit. Its the default on my system (Ubuntu 24.04, Kernel 6.8).

Ah, that's upsetting. I really don't want to get involved with LKML (especially not after the latest events) so I guess the default gets to stay what it is. Do you want to explore adding a warning asking for this limit to be raised? I think once you exceed a limit you get a certain error from libusb and we could show the warning once we hit that. I recall this error to be annoyingly ambiguous but we could do what we can, now that we collectively know that it is worth bumping against the limit.
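For what it's worth, on Linux exceeding the quota surfaces as ENOMEM from usbfs, which libusb reports as LIBUSB_ERROR_NO_MEM. A minimal sketch of what the warning logic might look like (names hypothetical, not from the Glasgow codebase):

```python
import errno
from typing import Optional

# Assumed wording; the real warning text would be written for the project.
USBFS_QUOTA_HINT = (
    "USB transfer submission failed with ENOMEM; on Linux this often means "
    "the usbfs DMA memory quota (usbcore.usbfs_memory_mb, 16 MiB by default) "
    "is too low for the requested transfer queue. Consider raising it."
)

def usb_error_hint(error_number: int) -> Optional[str]:
    """Return a user-facing hint for errors worth special-casing, else None."""
    if error_number == errno.ENOMEM:
        return USBFS_QUOTA_HINT
    return None
```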

@Sazzach
Member Author

Sazzach commented Feb 16, 2025

Ah, that's upsetting. I really don't want to get involved with LKML (especially not after the latest events) so I guess the default gets to stay what it is. Do you want to explore adding a warning asking for this limit to be raised? I think once you exceed a limit you get a certain error from libusb and we could show the warning once we hit that. I recall this error to be annoyingly ambiguous but we could do what we can, now that we collectively know that it is worth bumping against the limit.

Sure, I'll look into detecting the error and emitting a warning when the limit is exceeded. I don't remember what the error was offhand, but I definitely ran into it a few times when I made a mistake calculating the request size.

@whitequark
Member

whitequark commented Feb 16, 2025

Thanks, perfect! If you slice your changes into individual PRs I will likely be able to review and merge them faster (the new benchmark for example could be merged right away as far as I can tell from a very quick glance).

@whitequark
Member

@Sazzach Have you had the chance to take another look at this PR?

@Sazzach
Member Author

Sazzach commented Apr 27, 2025

I have, but I also have been distracted/sidetracked and let this slide. I'm sorry.

With respect to adding a fixed source mode to the benchmark applet, the existence of a fixed source mode implies potential fixed sink/loopback modes. Instead of adding three additional modes, I thought it'd make sense to add an optional --rate RATE argument, where specifying --rate 10 would set source/sink/loopback to a fixed rate of 10 MiB/s and check for over/underflow. I also have --rate in a group with --rate-search. --rate-search does a binary search on the rate and reports the highest successful rate.
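The search itself is simple; a sketch of the idea, where run_at is a stand-in for one benchmark pass that returns True when no over/underflow occurred (assuming failure is monotonic in rate):

```python
def rate_search(run_at, lo=1, hi=1024):
    """Binary-search the highest integer rate (MiB/s) for which run_at(rate)
    succeeds. Assumes monotonicity: once a rate fails, all higher rates fail."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if run_at(mid):
            best = mid       # mid works; try higher
            lo = mid + 1
        else:
            hi = mid - 1     # mid overflowed; try lower
    return best

# e.g. a hypothetical device that keeps up only below 42 MiB/s:
# rate_search(lambda r: r < 42)  # -> 41
```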

I'm still working on this. My status is --rate / --rate-search for source mode works, but not with sink/loopback.

There are also some minor arguments I'd like to add. These are --duration DURATION and --until-error which would be in a group with --count.

To summarize, arguments before:

usage: glasgow run benchmark [-h] [--port SPEC] [-c COUNT] [MODE ...]

Evaluate performance of the host communication link.

Benchmark modes:
  * source: device emits an endless stream of data via one FIFO, host validates (simulates a
    logic analyzer subtarget)
  * sink: host emits an endless stream of data via one FIFO, device validates (simulates an
    I2S protocol subtarget)
  * loopback: host emits an endless stream of data via one FIFOs, device mirrors it all back,
    host validates (simulates an SPI protocol subtarget)
  * latency: host sends one packet, device sends it back, time until the packet is received
    back on the host is measured (simulates cases where a transaction with the DUT relies on
    feedback from the host; also useful for comparing different usb stacks or usb data paths
    like hubs or network bridges)

options:
  -h, --help               show this help message and exit

build arguments:
  --port SPEC              bind the applet to port SPEC (default: AB)

run arguments:
  -c COUNT, --count COUNT  transfer COUNT bytes (default: 8388608)
  MODE                     run benchmark mode MODE (default: source sink loopback latency)

Proposed arguments after:

usage: glasgow run benchmark [-h] [--port SPEC] [-c COUNT | -d DURATION | --until-error]
                             [--rate RATE | --rate-search]
                             [MODE ...]

Evaluate performance of the host communication link.

Benchmark modes:
  * source: device emits an endless stream of data via one FIFO, host validates (simulates a
    logic analyzer subtarget)
  * sink: host emits an endless stream of data via one FIFO, device validates (simulates an
    I2S protocol subtarget)
  * loopback: host emits an endless stream of data via one FIFOs, device mirrors it all back,
    host validates (simulates an SPI protocol subtarget)
  * latency: host sends one packet, device sends it back, time until the packet is received
    back on the host is measured (simulates cases where a transaction with the DUT relies on
    feedback from the host; also useful for comparing different usb stacks or usb data paths
    like hubs or network bridges)

options:
  -h, --help                show this help message and exit

build arguments:
  --port SPEC               bind the applet to port SPEC (default: AB)

run arguments:
  -c COUNT, --count COUNT   transfer COUNT bytes (default: 8388608)
  -d DURATION, --duration DURATION
                            transfer for DURATION
  --until-error             run mode(s) until an error occurs (or CTRL-C)
  --rate RATE               set device to source/sink data at constant RATE MiB/s and report
                            FIFO over/underflows
  --rate-search             search for the fastest constant rate that data can be
                            sourced/sunk without a FIFO over/underflow
  MODE                      run benchmark mode MODE (default: source sink loopback latency)

Let me know if this sounds good to you or if you would like me to do things differently. I'll prioritize finishing up my additions this week (hopefully today/tomorrow, but I don't want to overpromise).

I haven't looked into the libusb error for insufficient usbfs_memory_mb yet. I should be able to once I finish up my benchmark PR.

I have thoughts about how to determine potential new defaults for _packets_per_xfer and _xfers_per_queue. I'd like to talk more once I finish my benchmark changes because it'd be nice to have measurements to inform any decisions.

@whitequark
Member

Right, that sounds fine to me. The details seem a little complicated but it seems like the problem we have on our hands is a little complicated too, so that's fine. Happy to do a review once you're done.

@whitequark
Member

Have you had a chance to revisit this? In the meantime I completely redesigned the demultiplexer (it's now part of HardwareAssembly) and raised buffer limits. I'm not sure if the premise of this issue is still important given these higher limits.

@Sazzach
Member Author

Sazzach commented May 31, 2025

I've had a chance; I'm close to finishing up my benchmark applet changes. I'm sorry for being slow on this. My current goal is to finish up tomorrow.

As an aside, I'm excited by the changes to the interface. There's stuff I've been interested in doing that will be more straightforward now. I'm also looking forward to using your new BSDL parser with the JTAG applet.

Thanks!

@whitequark
Member

As an aside, I'm excited by the changes to the interface. There's stuff I've been interested in doing that will be more straightforward now. I'm also looking forward to using your new BSDL parser with the JTAG applet.

I'm very happy to hear that! I would also love to hear about your experiences using JTAG & BSDL parser.

@whitequark
Member

whitequark commented Jul 6, 2025

@Sazzach Have you been able to give this PR another look?

@whitequark whitequark added the waiting-on-author Status: Waiting on issue reporter or PR author label Jul 6, 2025
@Sazzach
Member Author

Sazzach commented Jul 6, 2025

I have! I'm close to being done with adding a fixed rate modifier to the benchmark applet + some other changes.

A summary of the changes:

  • Add a BenchmarkInterface class / move the code for the various modes into it
  • Add a fixed rate modifier that can apply to source/sink/loopback
  • Add a duration option

What I (roughly) have left to do is:

  • Polish various pieces / fix a few bugs
  • Rebase into a few sensible commits

One question I have is: how does the add_inout_pipe argument out_buffer_size work? I tried setting it to limit the buffer size for sink when running a long benchmark, but setting it to ~100 MB resulted in much more RAM being used before I killed the process. (I'm not sure if this is a bug or me misunderstanding how to use it.)

I just pushed up my WIP changes here. If you have a moment, could you take a look and let me know if I've gone in a direction you don't like?

I'm sorry for taking so long on this.

@whitequark
Member

I'll take a look shortly, but to answer this:

One question I have is: how does the add_inout_pipe argument out_buffer_size work? I tried setting it to limit the buffer size for sink when running a long benchmark, but setting it to ~100 MB resulted in much more RAM being used before I killed the process. (I'm not sure if this is a bug or me misunderstanding how to use it.)

This is quite possibly just a bug. I'm not sure it's been tested at all.

@whitequark whitequark added waiting-on-review Status: Waiting for review and removed waiting-on-author Status: Waiting on issue reporter or PR author labels Jul 6, 2025
@whitequark
Member

whitequark commented Jul 6, 2025

@Sazzach I'll give it another look once I'm more awake, but my main comment so far is that the code is quite hard to follow. There are a lot of different modes that combine with each other, which results in a lot of conditionals sometimes updating similarly named variables, duplicated code (rate, for example, is configured three times in exactly the same way), and in the end it's difficult to understand how the whole thing works. Adding three more modes might well be easier to review and maintain.

I think we need to decide on which figures of merit we measure and structure the code towards that, without so much of a combinational explosion. Right now this is:

  • IN/OUT continuous throughput (measuring mostly host USB stack throughput)
  • IN-to-OUT continuous throughput (measuring mostly arbiter fairness)
  • IN-to-OUT latency of 1-packet transfers (measuring mostly host USB stack latency)

What are the additional figures of merit we're measuring? I think it's something like p99 latency, but expressed and measured in an indirect way. Can we do this better?
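One direct alternative would be timestamping individual transfers and reporting tail latency from the samples; a nearest-rank percentile is enough for that. A sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples
    at or below it. Enough for reporting p50/p99 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```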

@whitequark
Member

Some examples of what I think is difficult to review and maintain:

  1. Right now, the existing stimuli consist of a 3-stream chain: source (pipe or LFSR), optional rate limiter, sink (pipe or LFSR). This could be modelled on the gateware side by instantiating five components with streams at their boundaries and adding a few conditionals switching what gets connected to what. Instead, right now we have no meaningful stream use at all, even lines 119-123, which could've been a single connect() call.
  2. On the host side, something similar is going on: rather than the pipeline being configured as three interconnected components, the rate limiter configuration code is copied three times and the time monitoring code twice.

There are also relatively minor things like overuse of tuples (there's even a nested tuple in one place; asyncio.TaskGroup is probably better than asyncio.gather here if only for that reason), and a very confusing calculation in _mibps_to_rate that hardcodes the reference clock rate (it's different on different revisions).
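To illustrate point 2, the rate configuration could live in a single component shared by source/sink/loopback. A host-side sketch (names hypothetical; a real implementation would pace asyncio writes rather than use an injected clock):

```python
class RateLimiter:
    """Token-bucket pacing shared by all benchmark modes, so the rate
    configuration exists in exactly one place. The clock is injected
    (e.g. time.monotonic in production) to keep the logic testable."""

    def __init__(self, rate_bytes_per_s, now):
        self._rate = rate_bytes_per_s
        self._now = now
        self._last = now()
        self._budget = 0.0

    def take(self, limit):
        """Return how many bytes (at most `limit`) may be transferred now."""
        current = self._now()
        self._budget += (current - self._last) * self._rate
        self._last = current
        grant = min(limit, int(self._budget))
        self._budget -= grant
        return grant
```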

@Sazzach
Member Author

Sazzach commented Jul 13, 2025

Thanks for the fast review! I'm sorry for not responding sooner.

my main comment so far is that the code is quite hard to follow. There are a lot of different modes that combine with each other
...
Adding three more modes might well be easier to review and maintain.

That's a good point. My original goal was to have a simple modifier that applied to the existing modes. In retrospect, this didn't work out well.

I think we need to decide on which figures of merit we measure and structure the code towards that, without so much of a combinational explosion.
...
What are the additional figures of merit we're measuring? I think it's something like p99 latency, but expressed and measured in an indirect way. Can we do this better?

Ignoring adding additional figures for now, I think that adding an interface to the benchmark applet would enable easier testing of how multiple applets running at the same time affect the overall throughput/latency.

Some example uses could be:

  • Test the fairness of the arbiter with multiple applets
  • Do more in-depth analysis of latency, both in general and with other applets running

I think that this would be fairly straightforward and wouldn't require any gateware changes, but would require thought as to what the various modes return (a dataclass?). (I agree that my current use of tuples here is bad.)

Thinking about adding additional figures now:

I've been interested in fixed throughput because it gives a rough idea of how much data an applet can produce without any data being dropped. I think this isn't a concern for most applets, so I understand if it doesn't make sense to get it added into the main project. Also, from my experiments it seems like the main factors for fixed throughput are the number of transfers queued and their size, along with how long garbage collection takes. This is trivial to measure and is reflected in latency measurements with enough samples to cross the GC threshold.

I could see a rate setting being used for experiments such as testing whether an assembly of applets that each need X bandwidth can be run at the same time without sometimes dropping data (such as multiple UARTs at a fast baud rate).

I'm down to:

    1. make a PR for adding an interface class to the benchmark applet
    2. in a later PR, add modes with a limited rate (in a better way this time)

I'm also down to drop this if it's not something you want for the project.

Thanks again!

@whitequark
Member

I'm down to:

    1. make a PR for adding an interface class to the benchmark applet
    2. in a later PR, add modes with a limited rate (in a better way this time)

Let's do this, sounds good to me!

@whitequark whitequark added waiting-on-author Status: Waiting on issue reporter or PR author and removed waiting-on-review Status: Waiting for review labels Jul 13, 2025