Discussion:
[Rtai] RTAI 5.0-test2 released
Pierangelo Masarati
2016-05-04 08:07:24 UTC
Permalink
Dear RTAI enthusiasts,

5.0-test2 has been released.

<https://www.rtai.org/userfiles/downloads/RTAI/rtai-5.0-test2.tar.bz2>

It features modified latency calibrations, which should also work in
the case of cross development, and adds a user-space SMI supervisor.

Its parent, VULCANO, has been thoroughly checked, and this test release
aims at verifying that its make-install stuff works as expected.

Please test and report through this mailing list, as usual.

Enjoy!

Sincerely, p.
--
Pierangelo Masarati
Associate Professor
Dipartimento di Scienze e Tecnologie Aerospaziali
Politecnico di Milano
Shahbaz Youssefi
2016-05-04 13:45:03 UTC
Permalink
The user-space SMI supervisor is awesome, thank you!

Sebastian Kuzminsky
2016-06-23 16:52:54 UTC
Permalink
Post by Pierangelo Masarati
Dear RTAI enthusiasts,
5.0-test2 has been released.
<https://www.rtai.org/userfiles/downloads/RTAI/rtai-5.0-test2.tar.bz2>
It features modified latency calibrations, which should also work in
the case of cross development, and adds a user-space SMI supervisor.
Its parent, VULCANO, has been thoroughly checked, and this test release
aims at verifying that its make-install stuff works as expected.
Please test and report through this mailing list, as usual.
Thanks, Pierangelo! The new cross-calibration stuff is great; it really
helps my workflow.

However, I'm having trouble running tests in a virtual machine. After
several load/test/unload cycles, the machine locks up, with no helpful
kernel messages on the console or in dmesg.

I'm *not* trying to use virtual machines for actual realtime workloads;
I'm just using them for correctness tests of my realtime application. For
actual work I run the application on a bare-metal RTAI machine.


I'm using a Vulcano CVS snapshot from 2016 May 12; there have been no
checkins since then.

I'm using Linux 3.18.20 and hal-linux-3.18.20-x86-6.patch.

My kernel config and my RTAI configure command are here, along with a
dmesg showing boot & RTAI load:
http://highlab.com/~seb/rtai/vm-problem.2016-06-23/

A debian package archive (for Debian Jessie) with the kernel debs and
rtai-modules debs is here:

deb http://highlab.com/~seb/linuxcnc jessie main

The packages to install to see this issue are
linux-image-3.18.0-1-rtai-686-pae (or -amd64) and rtai-modules-3.18.0-1.


I'd appreciate any help debugging this. I'm perfectly willing to run
any tests you suggest.
--
Sebastian Kuzminsky
Paolo Mantegazza
2016-06-24 08:19:26 UTC
Permalink
If the problem is a lockup caused by removing modules many times, it is difficult for me to say anything, especially given the use of a virtual environment.
I have just one question: does the same happen outside of virtualization?

Paolo

Sebastian Kuzminsky
2016-06-25 15:11:48 UTC
Permalink
Post by Paolo Mantegazza
If the problem is a lockup caused by removing modules many times, it is
difficult for me to say anything, especially given the use of a
virtual environment. I have just one question: does the same happen
outside of virtualization?
Paolo
I just ran an overnight test on bare metal. It ran much longer than in
my virtual machines, but this morning it too had crashed.

The console was blank and unresponsive, so I don't have any info to help
debug this, sorry :-(

I'd be happy to run any tests anyone would like.
--
Sebastian Kuzminsky
Paolo Mantegazza
2016-06-27 09:23:19 UTC
Permalink
Post by Sebastian Kuzminsky
Post by Paolo Mantegazza
If the problem is related to a lock due to many modules removal it is
difficult for me to say anything, especially in view of the use of a
virtual env. I've just a question: does the same happens out of a
virtualization?
Paolo
I just ran an over-night test on bare metal. It ran much longer than
in my virtual machines, but this morning it too had crashed.
The console was blank and unresponsive, so i don't have any info to
help debug this, sorry :-(
I'd be happy to run any tests anyone would like.
I did not understand whether you have a problem loading and unloading
modules under a virtual environment, or just running a test for a long time.
Paolo
Sebastian Kuzminsky
2016-06-27 15:27:42 UTC
Permalink
Post by Paolo Mantegazza
I did not understand whether you have a problem loading and unloading
modules under a virtual environment, or just running a test for a long time.
Paolo
I spoke unclearly, apologies.

My test consists of repeatedly building LinuxCNC and running the
LinuxCNC test suite.

The LinuxCNC test suite consists of ~180 tests. Many of the tests run a
sequence where they load the realtime modules (from RTAI and from
LinuxCNC), exercise the code and check the results, then unload all the
realtime modules.

It's unknown to me where in this loop the bare-metal test failed.

I've observed failures during unload in virtual machines. Here's a
kernel log of two consecutive test cases from our test suite, where the
[ 8780.404223] I-pipe: head domain RTAI registered.
[ 8780.405262] RTAI[hal]: mounted. ISOL_CPUS_MASK: 1.
[ 8780.406019] SYSINFO - # CPUs: 2, TIMER NAME: 'lapic', TIMER IRQ: 2305, TIMER FREQ: 62502000, CLOCK NAME: 'tsc', CLOCK FREQ: 2000081000, CPU FREQ: 2000081000, LINUX TIMER IRQ: 2305.
[ 8780.414520] RTAI[malloc]: global heap size = 2097152 bytes, <BSD>.
[ 8780.415838] , kstacks pool size = 524288 bytes.
[ 8780.416828] RTAI[sched]: hard timer type/freq = lapic/62502000(Hz); timing: oneshot; linear timed lists.
[ 8780.418232] RTAI[sched]: Linux timer freq = 250 (Hz), TimeBase freq = 2000081000 hz.
[ 8780.419350] RTAI[sched]: timer setup = 1504 ns, resched latency = 0 ns.
[ 8780.449429] USERMODE CHECK: OK.
[ 8780.449941] USERMODE CHECK PROVIDED (ns): KernelLatency 8897, UserLatency 9087.
[ 8780.451010] FINAL CALIBRATION SUMMARY (ns): KernelLatency 8897, UserLatency 9087.
[ 8780.455566] RTAI[math]: loaded, using NEWLIB.
[ 8781.136562] RTAI[math]: unloaded.
[ 8781.148269] SCHED releases registered named ALIEN PEDV$D
[ 8781.169297] RTAI[malloc]: unloaded.
[ 8781.268150] RTAI[sched]: unloaded (forced hard/soft/hard transitions: traps 0, syscalls 0).
[ 8781.273428] I-pipe: head domain RTAI unregistered.
[ 8781.275513] RTAI[hal]: unmounted.
[ 8781.356263] I-pipe: head domain RTAI registered.
[ 8781.357292] RTAI[hal]: mounted. ISOL_CPUS_MASK: 1.
[ 8781.358043] SYSINFO - # CPUs: 2, TIMER NAME: 'lapic', TIMER IRQ: 2305, TIMER FREQ: 62502000, CLOCK NAME: 'tsc', CLOCK FREQ: 2000081000, CPU FREQ: 2000081000, LINUX TIMER IRQ: 2305.
[ 8781.366385] RTAI[malloc]: global heap size = 2097152 bytes, <BSD>.
[ 8781.367682] , kstacks pool size = 524288 bytes.
[ 8781.368639] RTAI[sched]: hard timer type/freq = lapic/62502000(Hz); timing: oneshot; linear timed lists.
[ 8781.370053] RTAI[sched]: Linux timer freq = 250 (Hz), TimeBase freq = 2000081000 hz.
[ 8781.371301] RTAI[sched]: timer setup = 1499 ns, resched latency = 0 ns.
[ 8781.401245] USERMODE CHECK: OK.
[ 8781.401761] USERMODE CHECK PROVIDED (ns): KernelLatency 8897, UserLatency 9087.
[ 8781.402845] FINAL CALIBRATION SUMMARY (ns): KernelLatency 8897, UserLatency 9087.
[ 8781.407514] RTAI[math]: loaded, using NEWLIB.
[ 8781.488021] RTAI[math]: unloaded.
*** lockup here, while waiting for "SCHED releases registered named ALIEN PEDV$D"
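A captured kernel log can be scanned for this pattern mechanically. The following is only a minimal sketch in C, assuming the log has already been split into lines; it relies on nothing but the "RTAI[hal]: mounted." and "RTAI[hal]: unmounted." strings visible in the log above. The function name count_rtai_cycles is made up for illustration and is not part of RTAI or LinuxCNC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count complete RTAI mount/unmount cycles in an array of log lines.
 * On return, *open_cycle is 1 if the log ends while RTAI is still
 * mounted (i.e. the last unload never finished), 0 otherwise.
 * Hypothetical helper: it only matches the message strings shown
 * in the kernel log quoted in this thread.
 */
static int count_rtai_cycles(const char *const *lines, int n, int *open_cycle)
{
    int complete = 0;
    int mounted = 0;

    assert(lines != NULL && open_cycle != NULL);

    for (int i = 0; i < n; i++) {
        if (strstr(lines[i], "RTAI[hal]: mounted.")) {
            mounted = 1;
        } else if (strstr(lines[i], "RTAI[hal]: unmounted.")) {
            if (mounted)
                complete++;
            mounted = 0;
        }
    }
    *open_cycle = mounted;
    return complete;
}
```

Fed the log above, a checker like this would report one complete cycle followed by an open one, which matches the point where the machine locked up.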
And of course, all these tests run fine with our earlier (3.9-era and
older) versions of RTAI.
--
Sebastian Kuzminsky
Paolo Mantegazza
2016-06-28 07:30:03 UTC
Permalink
Post by Sebastian Kuzminsky
*** lockup here, while waiting for "SCHED releases registered named ALIEN PEDV$D"
OK, I got it. From what you say, I see no reason for a running task not
to work. So, as a first step and sticking to what you show above, I
dare ask: what kind of object is "ALIEN PEDV$D"? Can you try to have
it closed by the code that uses it?
If that is not enough we'll see what else to do.
Paolo
Post by Sebastian Kuzminsky
And of course, all these tests run fine with our earlier (3.9-era and
older) versions of RTAI.
Paolo Mantegazza
2016-06-28 08:11:37 UTC
Permalink
Sorry, I forgot to ask whether the tested tasks are in user or kernel space.
The use of newlib suggests the latter, but I would like to be sure.
Paolo
Sebastian Kuzminsky
2016-06-28 13:38:48 UTC
Permalink
Post by Paolo Mantegazza
Sorry, I forgot to ask whether the tested tasks are in user or kernel space.
The use of newlib suggests the latter, but I would like to be sure.
Paolo
Yes, it's in kernel space.
--
Sebastian Kuzminsky
Sebastian Kuzminsky
2016-06-28 21:34:45 UTC
Permalink
Post by Paolo Mantegazza
OK, I got it. From what you say, I see no reason for a running task not
to work. So, as a first step and sticking to what you show above, I
dare ask: what kind of object is "ALIEN PEDV$D"? Can you try to have
it closed by the code that uses it?
If that is not enough we'll see what else to do.
The "SCHED releases registered named ALIEN PEDV$D" message comes from
base/sched/api.c:krtai_objects_release(), called by rtai_sched.ko on
module exit.

It's not part of our code; it's printed (for example) when you ^C any of
the testsuite/kern/* programs.

While running the testsuite/kern/latency test, I see this in
RTAI LXRT Information.
MAX_SLOTS = 150
Linux_Owner Parent PID
Slot Name ID Type RT_Handle Pointer Tsk_PID MEM_Sz USG Cnt
-------------------------------------------------------------------------------
158 THRSRV 0xb3b3a159 TASK 0xf871b220 0x (null) 0 2096 1
192 PEDV$D 0x9ac6d9e7 SHMEM 0xf871a000 0x (null) 0 2097152 1
I changed krtai_objects_release() to not call num2nam() on ALIEN
objects, and instead print out the rt_registry_entry fields, and what's
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.151609] I-pipe: head domain RTAI registered.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.152660] RTAI[hal]: mounted. ISOL_CPUS_MASK: 1.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.153407] SYSINFO - # CPUs: 2, TIMER NAME: 'lapic', TIMER IRQ: 2305, TIMER FREQ: 62500999, CLOCK NAME: 'tsc', CLOCK FREQ: 2000053000, CPU FREQ: 2000053000, LINUX TIMER IRQ: 2305.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.164134] RTAI[malloc]: global heap size = 2097152 bytes, <BSD>.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.165450] , kstacks pool size = 524288 bytes.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.166155] RTAI[sched]: hard timer type/freq = lapic/62500999(Hz); timing: oneshot; linear timed lists.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.167629] RTAI[sched]: Linux timer freq = 250 (Hz), TimeBase freq = 2000053000 hz.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.169726] RTAI[sched]: timer setup = 1505 ns, resched latency = 0 ns.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.200643] USERMODE CHECK: OK.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.201146] USERMODE CHECK PROVIDED (ns): KernelLatency 8896, UserLatency 9086.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.202212] FINAL CALIBRATION SUMMARY (ns): KernelLatency 8896, UserLatency 9086.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.207419] RTAI[math]: loaded, using NEWLIB.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.466685] RTAI[math]: unloaded.
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.480661] entry.name=-1698244121
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.481374] entry.adr=f86d6000
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.481945] entry.tsk= (null)
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.482512] entry.type=2097152
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.483073] entry.count=1
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.483581] entry.alink=176
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.484733] entry.nlink=0
Jun 28 10:18:30 jessie-rtai-i386 kernel: [ 3133.505282] RTAI[malloc]: unloaded.
Jun 28 10:18:31 jessie-rtai-i386 kernel: [ 3133.604147] RTAI[sched]: unloaded (forced hard/soft/hard transitions: traps 0, syscalls 0).
Jun 28 10:18:31 jessie-rtai-i386 kernel: [ 3133.607569] I-pipe: head domain RTAI unregistered.
Jun 28 10:18:31 jessie-rtai-i386 kernel: [ 3133.609699] RTAI[hal]: unmounted.
This led me to look closer at the hashed registry code, and it looks to
me like there's a bug there (though probably not the bug causing the
lockups i'm seeing):

1. Register an entry that hashes to slot 100.

2. Register another entry that also hashes to slot 100. The insert
code will notice the collision and instead use the next free slot, let's
say 101.

3. Remove the entry from step 1.

4. Now try to look up the entry from step 2. It will hash to slot 100,
but 100 is empty, so the lookup fails.
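
The four steps above can be reproduced with a toy model. This is only a sketch of the failure mode, assuming an open-addressing table whose remove simply empties the slot; it is not RTAI's actual registry code, and `ProbingRegistry`, `SIZE`, and the constant hash function are hypothetical names made up for the illustration.

```python
SIZE = 256      # hypothetical table size, not taken from RTAI
EMPTY = None


class ProbingRegistry:
    """Toy open-addressing table with a naive delete (illustration only)."""

    def __init__(self, hash_fn):
        self.slots = [EMPTY] * SIZE
        self.hash_fn = hash_fn

    def insert(self, name, value):
        # On collision, take the next free slot (linear probing).
        i = self.hash_fn(name) % SIZE
        for _ in range(SIZE):
            if self.slots[i] is EMPTY:
                self.slots[i] = (name, value)
                return i
            i = (i + 1) % SIZE
        raise RuntimeError("registry full")

    def remove(self, name):
        i = self.hash_fn(name) % SIZE
        for _ in range(SIZE):
            if self.slots[i] is EMPTY:
                return False
            if self.slots[i][0] == name:
                self.slots[i] = EMPTY   # naive delete: leaves a hole in the probe chain
                return True
            i = (i + 1) % SIZE
        return False

    def lookup(self, name):
        i = self.hash_fn(name) % SIZE
        for _ in range(SIZE):
            if self.slots[i] is EMPTY:
                return None             # an empty slot ends the search
            if self.slots[i][0] == name:
                return self.slots[i][1]
            i = (i + 1) % SIZE
        return None


def demo_collision_bug():
    # Force every name to hash to slot 100, as in the steps above.
    reg = ProbingRegistry(lambda name: 100)
    first = reg.insert("A", 1)    # step 1: lands in slot 100
    second = reg.insert("B", 2)   # step 2: collides, lands in slot 101
    reg.remove("A")               # step 3: slot 100 becomes empty
    return first, second, reg.lookup("B")   # step 4: lookup misses "B"
```

In this model "B" is still stored in slot 101 after the remove, but the lookup gives up at the hole in slot 100 and never reaches it.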


The normal way to handle hash collisions is to put a linked list in the
bucket, and add all colliding entries to that list. This way there's
never any question where to find an entry.
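
A minimal sketch of that chained layout, under the same caveat that it is a toy model and not RTAI code: `ChainedRegistry` and the constant hash function are hypothetical, and a Python list stands in for the linked list in each bucket.

```python
class ChainedRegistry:
    """Toy hash table with one chain per bucket (illustration only)."""

    def __init__(self, hash_fn, size=256):
        self.buckets = [[] for _ in range(size)]
        self.hash_fn = hash_fn

    def _bucket(self, name):
        return self.buckets[self.hash_fn(name) % len(self.buckets)]

    def insert(self, name, value):
        # Colliding entries simply share the bucket's chain.
        self._bucket(name).append((name, value))

    def remove(self, name):
        bucket = self._bucket(name)
        for i, (entry_name, _) in enumerate(bucket):
            if entry_name == name:
                del bucket[i]       # other entries in the chain are untouched
                return True
        return False

    def lookup(self, name):
        for entry_name, value in self._bucket(name):
            if entry_name == name:
                return value
        return None


def demo_chaining():
    # Same scenario: both names hash to bucket 100, then the first is removed.
    reg = ChainedRegistry(lambda name: 100)
    reg.insert("A", 1)
    reg.insert("B", 2)
    reg.remove("A")
    return reg.lookup("B")
```

Here removing the first entry does not affect the second, because a lookup always walks the whole chain of the bucket the name hashes to.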


I'll keep looking for the cause of my lockups...
--
Sebastian Kuzminsky
Sebastian Kuzminsky
2016-06-29 20:23:33 UTC
Permalink
Post by Sebastian Kuzminsky
This led me to look closer at the hashed registry code, and it looks to
me like there's a bug there (though probably not the bug causing the
Switching to the old registry makes the lockups i'm seeing go away, or
at least become much more rare. See the attached patch.

With that patch, I have not yet had a lockup after ~15 minutes so far of
continuous testing. Without that patch my tests lock up within the
first minute.
--
Sebastian Kuzminsky
Sebastian Kuzminsky
2016-06-30 02:36:06 UTC
Permalink
Post by Sebastian Kuzminsky
Post by Sebastian Kuzminsky
This led me to look closer at the hashed registry code, and it looks to
me like there's a bug there (though probably not the bug causing the
Switching to the old registry makes the lockups i'm seeing go away, or
at least become much more rare. See the attached patch.
With that patch, I have not yet had a lockup after ~15 minutes so far of
continuous testing. Without that patch my tests lock up within the
first minute.
Nope, never mind. Even with the old-registry patch it still locks up
after a while.
--
Sebastian Kuzminsky
Paolo Mantegazza
2016-06-30 07:36:55 UTC
Permalink
Post by Sebastian Kuzminsky
Post by Sebastian Kuzminsky
Post by Sebastian Kuzminsky
This led me to look closer at the hashed registry code, and it looks to
me like there's a bug there (though probably not the bug causing the
Switching to the old registry makes the lockups i'm seeing go away, or
at least become much more rare. See the attached patch.
With that patch, I have not yet had a lockup after ~15 minutes so far of
continuous testing. Without that patch my tests lock up within the
first minute.
Nope, never mind. Even with the old-registry patch it still locks up
after a while.
Have you got a somewhat firm idea about the locking being related either
to starting-stopping a test or to its execution?

Paolo
Sebastian Kuzminsky
2016-06-30 15:06:50 UTC
Permalink
Post by Paolo Mantegazza
Post by Sebastian Kuzminsky
Post by Sebastian Kuzminsky
Post by Sebastian Kuzminsky
This led me to look closer at the hashed registry code, and it looks to
me like there's a bug there (though probably not the bug causing the
Switching to the old registry makes the lockups i'm seeing go away, or
at least become much more rare. See the attached patch.
With that patch, I have not yet had a lockup after ~15 minutes so far of
continuous testing. Without that patch my tests lock up within the
first minute.
Nope, never mind. Even with the old-registry patch it still locks up
after a while.
Have you got a somewhat firm idea about the locking being related either
to starting-stopping a test or to its execution?
I think i've only ever seen the hang while unloading the RTAI modules.
Loading and running all seems to work reliably.

I'm running on a 2-CPU virtual machine with isolcpus=0, 32-bit x86.
--
Sebastian Kuzminsky
Paolo Mantegazza
2016-06-30 07:26:53 UTC
Permalink
Post by Sebastian Kuzminsky
Post by Sebastian Kuzminsky
This led me to look closer at the hashed registry code, and it looks to
me like there's a bug there (though probably not the bug causing the
Switching to the old registry makes the lockups i'm seeing go away, or
at least become much more rare. See the attached patch.
With that patch, I have not yet had a lockup after ~15 minutes so far
of continuous testing. Without that patch my tests lock up within the
first minute.
That should prove that the problem is related to
loading(beginning)-unloading(end) an application.
The patch seems to point to something I overlooked in the transition to
the no legacy use of patches.
I'll have a look at it ASAP.

Thanks, Paolo.
Alec Ari
2016-06-25 19:28:15 UTC
Permalink
Hi Seb,

Upstream RTAI is pretty brutal for LinuxCNC development. I know you don't want to use my tree, but it works beautifully for all the platforms I've tested it on, both 32-bit and 64. Jeff Epler really helped push along the math fixes and sorted out those mind-boggling SSE fixes, and I couldn't have done it without him. I honestly don't know why you're so concerned about it being from rtai.org if the tree is generally broken for LinuxCNC. If you really want to insist on using this tree, you're going to have to be the one who does the work to get it going, which most likely means a full re-write of RTAPI and whatever else, just to make it work using the stuff from here. Nobody else here cares about LinuxCNC except us, so you're going to have to be the expert who does the work yourself if you plan on using the mainline RTAI tree. If you ever get fed up (I know I did) you know where it's at. I've never experienced a single hang or crash from there after the majority of the work was complete. Good luck and hope you figure this stuff out.

Alec Ari
Alec Ari
2016-06-27 20:18:27 UTC
Permalink
And of course, all these tests run fine with our earlier (3.9-era and older) versions of RTAI.
I'm not surprised either.

Alec