  1. #1

    Userspace IRQ Latency

    In our whitepaper on meeting real-time requirements we address "User Space Drivers with Real-time Priority". As we have many userspace drivers we have done some testing to see what kind of IRQ dispatch latency you can expect from userspace. The result is not far from what you would receive from a kernel driver. In this case I'm using a TS-4200 with no modifications from our stock image aside from the code below.

    To measure the IRQ latency I'm feeding a square wave to an interrupt (PC14_IRQ2) and in code responding to the IRQ from userspace and setting an output (PC0) low. It then busy waits by reading the input value of PC14_IRQ2 and sets PC0 high as soon as the square wave goes low again. Essentially this just inverts the square wave. The code I'm using is below which can be used to replicate the test.

    Code:
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/select.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <unistd.h>
    
    int main(int argc, char **argv)
    {
    	int ret, irqfd = 0, buf, mem, x, i;
    	fd_set fds;
    	volatile uint16_t *syscon;
    	volatile uint32_t *pioc;
    
    	irqfd = open("/proc/irq/31/irq", O_RDONLY| O_NONBLOCK, S_IREAD);
    	mem = open("/dev/mem", O_RDWR|O_SYNC);
    	syscon = mmap(0,
    				  getpagesize(),
    				  PROT_READ|PROT_WRITE,
    				  MAP_SHARED,
    				  mem,
    				  0x30000000);
    
    	pioc = mmap(0,
    				  getpagesize(),
    				  PROT_READ|PROT_WRITE,
    				  MAP_SHARED,
    				  mem,
    				  0xfffff000);
    
    	// Turn off DIO
    	pioc[0x810/4] = 0x1;
    	pioc[0x834/4] = 0x1;
    
    	while(1) {
    		// Block until the IRQ triggers. select() modifies the set,
    		// so rebuild it from scratch on every pass.
    		FD_ZERO(&fds);
    		FD_SET(irqfd, &fds);
    		ret = select(irqfd + 1, &fds, NULL, NULL, NULL);
    
    		if(ret > 0 && FD_ISSET(irqfd, &fds))
    		{
    			// Enable output immediately after the interrupt
    			pioc[0x830/4] = 0x1;
    
    			// Busy wait while reading the input PC14_IRQ2 (bit 14)
    			while(!(pioc[0x83c/4] & (1 << 14))) {};
    
    			// Clear PC0 output
    			pioc[0x834/4] = 0x1;
    
    			read(irqfd, &buf, sizeof(buf));
    		}
    	}
    
    	return 0;
    }
    This is compiled using gcc and run at a realtime priority. You can set the priority in code, but it is also simple to use 'chrt -f -p 99 <yourpid>', which gives the process the highest SCHED_FIFO priority.
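For anyone who would rather set the priority in code than with chrt, here is a minimal sketch using the standard sched_setscheduler() call. The helper names (clamp_fifo_priority, set_fifo_priority) are made up for illustration and are not part of the test program above.

```c
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Clamp a requested priority to the range the kernel accepts for SCHED_FIFO. */
int clamp_fifo_priority(int prio)
{
	int lo = sched_get_priority_min(SCHED_FIFO);
	int hi = sched_get_priority_max(SCHED_FIFO);

	if (prio < lo)
		return lo;
	if (prio > hi)
		return hi;
	return prio;
}

/*
 * Switch the calling process to SCHED_FIFO at the given priority.
 * Returns 0 on success, -1 on failure with errno set (EPERM usually
 * means the process is not root and lacks CAP_SYS_NICE).
 */
int set_fifo_priority(int prio)
{
	struct sched_param sp;

	memset(&sp, 0, sizeof(sp));
	sp.sched_priority = clamp_fifo_priority(prio);

	return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Calling set_fifo_priority(99) at the top of main(), before the select() loop, should have the same effect as the chrt command above.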

    On the IRQ latency test I receive about 11µs best case, about 18µs typical, and about 50µs worst case. In the below image green is the PC0 output userspace is controlling, and orange is the square wave. Each vertical division in the grid is 20µs.
    irq.jpg

    Now for a much better typical/best case we can busy wait. This, however, ties up the CPU at 100% while it is busy waiting. The worst case in this picture is actually about the same as the IRQ test, at about 50µs. This is likely the same kernel function/driver affecting the worst case for the IRQ latency test. For this test green is still PC0, and orange is still the square wave. The vertical divisions are still 20µs.
    busywait.jpg

    The best case with busy waiting is much better. In this image each vertical division is 500ns. The best case is about 200ns for switching the output based on polling from an input. The typical was about 250ns.
    busywait-closeup.jpg

  2. #2
    Junior Member
    Join Date
    May 2012
    Posts
    12
    That test was done on a TS-4200 that probably was not running any other applications. It would be interesting to see the same test on a TS-7800, especially while it is running the new version of ts7800ctl with -r or -S set to sample the ADC with 2000 samples per second being read through TWSI communication.

  3. #3
    Administrator joff's Avatar
    Join Date
    Apr 2012
    Location
    Fountain Hills, Arizona
    Posts
    65
    Quote Originally Posted by HLDewar View Post
    That test was done on a TS-4200 that probably was not running any other applications. It would be interesting to see the same test on a TS-7800, especially while it is running the new version of ts7800ctl with -r or -S set to sample the ADC with 2000 samples per second being read through TWSI communication.
    The TS-4200's processor is 100 MHz slower than the TS-7800's, but I would predict the results to be similar.

    IIRC, the ADC sampling in the newer versions of ts7800ctl uses realtime priority itself, so the results would be very dependent on which process is given the higher priority.

    As you add extra applications running in the background at lower priority, the effectiveness of the L1 cache and branch prediction is compromised. This acts as a multiplier to whatever amount of preamble code there is to the actual IRQ business which is very OS and BSP specific. The best background application to run to exacerbate this effect would be one that clobbers the contents of the L1 cache as quickly and completely as possible once it gets a hold of the CPU. A good RTOS (which Linux is not) measures IRQ dispatch latency in number of instructions. If (e.g.) you only have 100 contiguous instructions from IRQ/FIQ delivery to IRQ handler the effect of having the L1 cache not primed is not as pronounced as compared to one that takes 10,000 instructions spread across multiple functions and conditional branches where several data structures must be consulted. Also, when you have to reload the L1 cache from SDRAM, you may have to contend with other bus masters or DMA access going on which can make those cache-line fills even slower yet.
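A rough sketch of the kind of cache-clobbering background load described above, for anyone who wants to try it. The line size and buffer size are assumptions (pick a buffer several times larger than the L1 data cache of your CPU), and this only disturbs the data cache, not the instruction cache or branch predictor.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 32            /* assumed cache line size in bytes */
#define BUF_SIZE   (256 * 1024)  /* assumed to be well above the L1 data cache size */

/*
 * Write one byte in every cache line of buf, for the given number of passes.
 * Each pass walks the whole buffer, so lines left in the L1 data cache by
 * other tasks get evicted. Returns the number of lines written.
 */
unsigned long clobber_cache(volatile uint8_t *buf, size_t size, unsigned passes)
{
	unsigned long touched = 0;
	unsigned p;
	size_t off;

	for (p = 0; p < passes; p++) {
		for (off = 0; off < size; off += CACHE_LINE) {
			buf[off]++;
			touched++;
		}
	}
	return touched;
}
```

Allocate the buffer once with malloc(BUF_SIZE), call clobber_cache() in an endless loop at normal (non-realtime) priority, and repeat the latency measurement while it runs.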

    Another characteristic of a good RTOS is minimal sections of code executing with IRQs disabled and IRQ handlers themselves spending very little CPU to service. This is a bit of an unknown but supposedly Linux has made several improvements over the years in this regard. In such a large monolithic kernel such as Linux it is hard for a comprehensive analysis to be done -- for instance, we recently found some of our realtime processes once per second would be stalled for several milliseconds(!) at a time. Upon further investigation, we found somebody had written a bad ethernet driver that had a callback executed every 1 second when the ethernet device was up, but the link was down. This callback was intended to tweak PHY parameters and did some very silly busy-wait delays preventing any IRQ from running. This realtime "bug" had gone unnoticed for a very long time because we always had the ethernet plugged in with link active. All in all this was not confidence inspiring and makes me wonder what other random bugs may be lurking in this huge kernel codebase.

  4. #4
    Junior Member
    Join Date
    May 2012
    Posts
    12
    Quote Originally Posted by joff View Post
    IIRC, the ADC sampling in the newer versions of ts7800ctl uses realtime priority itself, so the results would be very dependent on which process is given the higher priority.
    So if you want to use the ADC on the TS-7800, you will either have long userspace latencies, or else you will risk interfering with the TWSI communications, depending on which process is given the higher priority.

    Quote Originally Posted by joff View Post
    As you add extra applications running in the background at lower priority, the effectiveness of the L1 cache and branch prediction is compromised. This acts as a multiplier to whatever amount of preamble code there is to the actual IRQ business which is very OS and BSP specific. The best background application to run to exacerbate this effect would be one that clobbers the contents of the L1 cache as quickly and completely as possible once it gets a hold of the CPU. A good RTOS (which Linux is not) measures IRQ dispatch latency in number of instructions. If (e.g.) you only have 100 contiguous instructions from IRQ/FIQ delivery to IRQ handler the effect of having the L1 cache not primed is not as pronounced as compared to one that takes 10,000 instructions spread across multiple functions and conditional branches where several data structures must be consulted. Also, when you have to reload the L1 cache from SDRAM, you may have to contend with other bus masters or DMA access going on which can make those cache-line fills even slower yet.

    Another characteristic of a good RTOS is minimal sections of code executing with IRQs disabled and IRQ handlers themselves spending very little CPU to service. This is a bit of an unknown but supposedly Linux has made several improvements over the years in this regard. In such a large monolithic kernel such as Linux it is hard for a comprehensive analysis to be done -- for instance, we recently found some of our realtime processes once per second would be stalled for several milliseconds(!) at a time. Upon further investigation, we found somebody had written a bad ethernet driver that had a callback executed every 1 second when the ethernet device was up, but the link was down. This callback was intended to tweak PHY parameters and did some very silly busy-wait delays preventing any IRQ from running. This realtime "bug" had gone unnoticed for a very long time because we always had the ethernet plugged in with link active. All in all this was not confidence inspiring and makes me wonder what other random bugs may be lurking in this huge kernel codebase.
    So if you want latencies less than several milliseconds, you should not use userspace IRQs, and maybe not even use the Linux kernel. Which brings back the original question: what is a good RTOS for the TS-7800?

  5. #5
    Administrator joff's Avatar
    Join Date
    Apr 2012
    Location
    Fountain Hills, Arizona
    Posts
    65
    Quote Originally Posted by HLDewar View Post
    So if you want to use the ADC on the TS-7800, you will either have long userspace latencies, or else you will risk interfering with the TWSI communications, depending on which process is given the higher priority.
    I just looked it up. The realtime requirement for the 7800 ADC sampling is that it must be serviced within 63 milliseconds of the last servicing, otherwise it drops samples. It tries to wake up every 10ms, so it can tolerate a higher priority realtime task not yielding the CPU for up to approx 52.9ms and still get its work done in time. This is not a difficult timescale for Linux. It is put into the domain of the realtime process scheduler simply so that it can be reliably given a small percentage of CPU without having to worry about competing with transient CPU hogs such as a random gzip or filesystem copy/move. If you have a process that wants the 50us userspace latency, just bump its priority above that of the ts7800ctl ADC sampling process (IIRC it's 50).

    Quote Originally Posted by HLDewar View Post
    So if you want latencies less than several milliseconds, you should not use userspace IRQs, and maybe not even use the Linux kernel. Which brings back the original question: what is a good RTOS for the TS-7800?
    No.

    If you want userspace IRQ latencies less than 3 milliseconds, don't spend any more than approx 2.9ms in another thread/process executing at a realtime priority higher than the userspace IRQ handler. The same rule would apply to a RTOS.

    Use an RTOS if you need sub 200us IRQ latencies and need a guarantee beyond these simple test results. Linux really isn't that bad as-is for a lot of realtime-ish tasks despite it not really being a priority of its development. If you absolutely need an RTOS, there are lots of options such as ThreadX, FreeRTOS, RTEMS, QNX, etc but most of these will only support the CPU core (ARM9) and not have support for the peripherals in the rest of the SoC or board.

  6. #6
    Junior Member
    Join Date
    May 2012
    Posts
    12
    Quote Originally Posted by joff View Post
    I just looked it up. The realtime requirement for the 7800 ADC sampling is that it must be serviced within 63 milliseconds of the last servicing, otherwise it drops samples. It tries to wake up every 10ms, so it can tolerate a higher priority realtime task not yielding the CPU for up to approx 52.9ms and still get its work done in time. This is not a difficult timescale for Linux. It is put into the domain of the realtime process scheduler simply so that it can be reliably given a small percentage of CPU without having to worry about competing with transient CPU hogs such as a random gzip or filesystem copy/move. If you have a process that wants the 50us userspace latency, just bump its priority above that of the ts7800ctl ADC sampling process (IIRC it's 50).
    I have found that the ts7800ctl ADC sampling takes around 30% of CPU. I would not call that "a small percentage of CPU".

    Quote Originally Posted by joff View Post
    No.

    If you want userspace IRQ latencies less than 3 milliseconds, don't spend any more than approx 2.9ms in another thread/process executing at a realtime priority higher than the userspace IRQ handler. The same rule would apply to a RTOS.

    Use an RTOS if you need sub 200us IRQ latencies and need a guarantee beyond these simple test results. Linux really isn't that bad as-is for a lot of realtime-ish tasks despite it not really being a priority of its development. If you absolutely need an RTOS, there are lots of options such as ThreadX, FreeRTOS, RTEMS, QNX, etc but most of these will only support the CPU core (ARM9) and not have support for the peripherals in the rest of the SoC or board.
    I do not think I need an RTOS, but with the new ts7800ctl ADC code I have started to run into timing problems. So now it seems that I need to take any realtime-ish code and put it in a separate task and give it a realtime priority higher than the ADC sampling task. This was not necessary previously. And I am worried that the TWSI communication will not be reliable when there is a higher priority task preempting it, since it has not been reliable in the past. Although I have been told that the new version is reliable, I doubt that it has been tested under these conditions.

  7. #7
    Administrator joff's Avatar
    Join Date
    Apr 2012
    Location
    Fountain Hills, Arizona
    Posts
    65
    Quote Originally Posted by HLDewar View Post
    I have found that the ts7800ctl ADC sampling takes around 30% of CPU. I would not call that "a small percentage of CPU".
    The bulk of that time is spent busy-waiting for command completion in the crappy Marvell CPU I2C/TWI controller implementation. If you need to reclaim some of that CPU% for your app to run, simply increase the priority of your app; it's really not that difficult. Since the process is not performing a computation, if you steal the CPU away while it is simply busy-waiting for the TWI controller, the chances are good that when it gets the CPU back the busy-wait loop will kick out immediately and you've just saved the CPU from tens of thousands of idle iterations through a busy-wait loop. You just need to be aware in your higher priority app that you can't steal the CPU indefinitely, otherwise you may lose some acquired samples due to buffer overflow. The lack of immediate CPU response by the process scheduler to the wake-up event also acts as negative feedback that causes the process to move more data at a time. Moving more data at a time in this app has the side effect of increasing CPU efficiency and lowering CPU utilization, since the number of samples processed per second is fixed.

    If you don't care about the continuity of samples taken and can tolerate gaps in the sampling, the realtime requirements of the sampling loop go away completely and a wholly different strategy in ts7800ctl.c could be used that could tolerate any amount of CPU starvation. Keep in mind the whole of this ADC sampling loop is approximately 75 lines of C code in ts7800ctl.c so it shouldn't be difficult to tweak. Perhaps what would be more appropriate for your app is to keep the most recent sample in a shared memory segment and forego the sample pipeline output to stdout altogether. This is the kind of modification TS engineering services dept. can do if necessary.
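To make the shared-memory idea concrete, here is a minimal sketch. It is not taken from ts7800ctl.c: the struct layout, the function names and the backing file path are all hypothetical. It relies on an aligned 16-bit store being atomic on the CPU, so the reader never waits on the writer regardless of their priorities.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical layout of the shared segment. */
struct latest_sample {
	volatile uint32_t count;  /* number of samples published so far */
	volatile uint16_t value;  /* most recent ADC reading */
};

/* Map the backing file, creating it if needed. Returns NULL on failure. */
struct latest_sample *sample_map(const char *path)
{
	void *p;
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd == -1)
		return NULL;
	if (ftruncate(fd, sizeof(struct latest_sample)) == -1) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, sizeof(struct latest_sample), PROT_READ | PROT_WRITE,
	         MAP_SHARED, fd, 0);
	close(fd);  /* the mapping stays valid after the descriptor is closed */
	return p == MAP_FAILED ? NULL : p;
}

/* Writer side (a single writer is assumed): overwrite the latest sample. */
void sample_publish(struct latest_sample *s, uint16_t value)
{
	s->value = value;
	s->count++;
}

/* Reader side: fetch whatever sample was published most recently. */
uint16_t sample_read(const struct latest_sample *s)
{
	return s->value;
}
```

The sampling process would call sample_publish() instead of writing to stdout, and the application would call sample_map() on the same path (for example a file under /dev/shm) and then sample_read() whenever it wants the current value. A reader can compare count between reads to tell whether a new sample has arrived.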

    Quote Originally Posted by HLDewar View Post
    I do not think I need an RTOS, but with the new ts7800ctl ADC code I have started to run into timing problems. So now it seems that I need to take any realtime-ish code and put it in a separate task and give it a realtime priority higher than the ADC sampling task. This was not necessary previously. And I am worried that the TWSI communication will not be reliable when there is a higher priority task preempting it, since it has not been reliable in the past. Although I have been told that the new version is reliable, I doubt that it has been tested under these conditions.
    The previous TWSI sampling code, both in the AVR doing the ADC sampling and in the Marvell CPU code in ts7800ctl.c, was fragile and broken in many ways. It didn't help that both the AVR and the Marvell had undocumented hardware bugs to work around. The new code was tested in as many configurations as we could think of, but obviously we can't test against your application. Also, IIRC, the previous ts7800ctl used 100% of the CPU if you let it have it.

    You don't necessarily need a separate task. You can use threads or even use a single task and change its operating priority with the sched_setscheduler() syscall as needed in different sections of code.
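As a sketch of the single-task approach, the helper below raises the caller to SCHED_FIFO for one section of code and then restores whatever policy and priority it had before. The names run_at_fifo_priority and critical_section are invented for this example, and critical_section is only a placeholder for the real timing-sensitive work.

```c
#include <sched.h>
#include <string.h>

/* Placeholder for the timing-sensitive work; here it just counts calls. */
void critical_section(void *arg)
{
	(*(int *)arg)++;
}

/*
 * Run fn(arg) at SCHED_FIFO priority prio, then restore the caller's
 * original policy and priority. Returns 0 on success. Returns -1 if the
 * priority could not be raised (fn is not called in that case) or could
 * not be restored afterwards.
 */
int run_at_fifo_priority(int prio, void (*fn)(void *), void *arg)
{
	struct sched_param saved, raised;
	int saved_policy = sched_getscheduler(0);

	if (saved_policy == -1 || sched_getparam(0, &saved) == -1)
		return -1;

	memset(&raised, 0, sizeof(raised));
	raised.sched_priority = prio;
	if (sched_setscheduler(0, SCHED_FIFO, &raised) == -1)
		return -1;

	fn(arg);

	/* Drop back to the original policy and priority. */
	return sched_setscheduler(0, saved_policy, &saved);
}
```

For example, run_at_fifo_priority(60, critical_section, &counter) would run the section above a priority-50 ADC sampling process and then fall back to normal scheduling. Raising the priority needs root or CAP_SYS_NICE.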

    To bring this thread back (sort of) in line with the original topic of userspace IRQs, what we're talking about here is a realtime userspace driver for the TS-7800 ADC called ts7800ctl. This driver doesn't use IRQs, kernel or userspace, but it could have. Regardless, the realtime hardware requirements of the TS-7800 ADC sampling loop could be isolated from other higher priority realtime requirements using userspace realtime scheduler priorities. If an IRQ-driven kernel driver were chosen instead for the TS-7800 ADC:


    1. The CPU requirements of servicing the ADC would be very difficult to see (since IRQ handler CPU time is difficult to account for properly).
    2. It could not be prioritized below any process/thread doing something more important (IRQs are always on and cannot auto rate-limit or cooperate with CPU scheduling policies).
    3. The worst-case realtime latency would be worsened in a non-configurable way by an amount equal to the longest path through the kernel IRQ handler.
    4. A driver bug could affect the entire system's realtime response, not just that of lower-priority processes.

  8. #8
    Junior Member
    Join Date
    May 2012
    Posts
    12
    Quote Originally Posted by joff View Post
    Also heads up so its not a surprise: If posts on this thread start departing from the original topic, a moderator may end up deleting or moving them to a new thread. Any directed questions asked will be answered, but they may be answered via private message and the original message deleted from this thread if they don't mesh well and/or confuse the topic.
    My first reply was on a different thread, which was concerned with TS-7800 realtime capability. It is a little unfair for you to first move my comments to this thread, and then complain that they are off the topic of this thread. I have remained on the topic that I originally replied to.

    FWIW, I agree with the usage of userspace IRQs and userspace drivers such as the ADC sampling in ts7800ctl. I would hate to see that ts7800ctl.c code in a kernel module; that would be disastrous. I do not even like seeing it in a userspace process that gives itself high priority.

    I merely intended to let TS-7800 users know that if they use the ADC, they will have a process that gives itself high priority and uses 30% of CPU, which may affect realtime performance, and may force them to take one of the remedies which you suggested. I do not think many users are aware of this.

    As I wrote before, it would be interesting to see the userspace IRQ test run on a TS-7800 while it is running the new version of ts7800ctl with -r or -S set to sample the ADC. It would be interesting to see the userspace IRQ latency, even with the IRQ process at a higher priority, and it would be interesting to see if the TWSI communication remained reliable with a higher priority process preempting it.

  9. #9
    Administrator joff's Avatar
    Join Date
    Apr 2012
    Location
    Fountain Hills, Arizona
    Posts
    65

    TS-7800 userspace latency

    Another datapoint:

    The TS-7800 pictured below has a 500 MHz processor running a 2.6.21 non-RT Linux kernel. It is running basically the same code and test as Mark's post above, the only difference being the IO registers and IRQ numbers. This TS-7800 IRQ dispatch test is also a little more complicated/slower than the 4200 test above, in that the IRQ signal generator goes into an FPGA pin, which causes a PCI message signaled interrupt to the CPU, and the GPIO pin update then goes back to the FPGA via the PCI bus.
    photo.jpg

    With nothing going on, the typical latency is 18 µs and worst case is around 38 µs. Scope trace below:

    print_02.jpg

    Now, with another realtime process competing and consuming approximately 30-40% of the CPU, but at a lower priority, the latency does go up as we'd expect due to the L1 cache ineffectiveness, TLB flushes, branch prediction misses, etc. Now, the typical is somewhere around 80-90 µs and the worst case is around 170 µs. Note that it is NOT even close to 1ms, let alone several milliseconds. Scope trace below:

    print_04.jpg

    The realtime prio process consuming 30-40% of the CPU was the ts7800ctl process capturing raw ADC samples to a ramdisk. Its realtime requirement is to be woken up at least every 60 ms or so to avoid buffer overflow in the ADC stream. The latency test program consumes precisely 50% of the CPU because once it gets the interrupt, it busy-waits at realtime priority for the de-assertion of the high level from the signal generator, and our signal generator is outputting a 50% duty cycle square wave to the PC104 IRQ3 input. We are sending the TS-7800 a 160Hz square wave, so we are hijacking the CPU away randomly from the ADC process for approximately 3.1 ms at a time and causing absolutely NO problems for the ts7800 TWSI interface or AVR ADC sampling.
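The 3.1 ms figure follows directly from the signal generator settings: a 160Hz square wave has a 6.25 ms period, and at 50% duty cycle it is high for half of that. A small sketch of the arithmetic, with function names chosen only for this example:

```c
#include <stdint.h>

/* Time per period that a square wave spends high, in microseconds. */
uint32_t high_time_us(uint32_t freq_hz, uint32_t duty_percent)
{
	return (uint32_t)((1000000ULL * duty_percent) / ((uint64_t)freq_hz * 100));
}

/*
 * Returns 1 if a higher priority task that holds the CPU for hold_us at a
 * time still leaves a lower priority task inside its wake-up budget.
 */
int fits_wakeup_budget(uint32_t hold_us, uint32_t budget_us)
{
	return hold_us < budget_us;
}
```

high_time_us(160, 50) gives 3125, i.e. about 3.1 ms, which is far inside the roughly 60 ms (60000 µs) wake-up budget of the ADC sampling process.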

    I don't have a picture of the de-asserting edge (done by polling in realtime priority), but suffice it to say that the typical is about 1 µs. The worst case is very interesting as it is the same as via userspace IRQ at around 38 µs. How can this be? The realtime process is guaranteed to have the CPU approximately 170 µs or so after the IRQ line goes high and then just camps on polling the bit as the highest priority thread in the system. The only thing higher priority than the highest priority thread in Linux is Linux's kernel IRQ handlers or paths within the kernel with the IRQs off. The 38 µs is evidence that this is the longest path in Linux. Strangely enough, it's about every 5 seconds too, so it's probably something to do with timer interrupt driven periodic housekeeping.

    Since the de-asserting edge/realtime polling portion of this test is such an effective way to measure worst case kernel IRQ latencies, I ran another one while the board was being ping-flooded via its on-chip ethernet MAC at 1Gb/s. The results of this test are below:

    print_05.jpg

    Note that the worst case is approximately 150 µs now! Typical is still around 1 µs. This delay would not be helped at all by moving code into kernel space as it represents the ethernet driver's IRQ handler max runtime and while the kernel is processing one IRQ handler or code section with IRQs off it cannot process another IRQ handler or a realtime thread.
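If you don't have a scope and signal generator handy, a software-only approximation of the polling test is to spin on a clock at realtime priority and record the largest gap between two consecutive readings. This is a swapped-in technique, not the method used for the traces above, and it is only a sketch: the function names are made up, and the result is only as good as the kernel's clock resolution (on older kernels without high-resolution timers it may be too coarse to be useful).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from reading a to reading b. */
static int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
	return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
	     + (b->tv_nsec - a->tv_nsec);
}

/*
 * Spin reading CLOCK_MONOTONIC for duration_ms and return the largest gap
 * (in ns) seen between two consecutive readings. When run as the highest
 * priority realtime task, a large gap means something else (kernel IRQ
 * handlers or IRQs-off sections) held the CPU for that long.
 */
int64_t max_poll_gap_ns(unsigned duration_ms)
{
	struct timespec start, prev, now;
	int64_t gap, worst = 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	prev = start;
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		gap = ts_diff_ns(&prev, &now);
		if (gap > worst)
			worst = gap;
		prev = now;
	} while (ts_diff_ns(&start, &now) < (int64_t)duration_ms * 1000000LL);

	return worst;
}
```

Run it under 'chrt -f 99' for several seconds, with and without the ping flood, and compare the worst-case gaps. Older glibc versions need -lrt at link time for clock_gettime().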
