
extreme long wu's

Dario666
Joined: 12 May 15
Posts: 1
Credit: 3,818,369
RAC: 0
Message 1872 - Posted: 5 Jan 2017, 9:40:47 UTC

What are these errors???

I've canceled 8 tasks that had been running for about 130-145 hours. They were suspended but still occupied 100% CPU, and over the last 24 hours their progress was zero.

adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 1886 - Posted: 14 Jan 2017, 15:40:22 UTC

I've just aborted another one. No New Tasks set. There is a serious flaw in the application here; I wonder how many crunchers are having their systems' time wasted.

Jim1348
Joined: 28 Feb 15
Posts: 186
Credit: 52,822,747
RAC: 81,828
Message 1887 - Posted: 14 Jan 2017, 15:58:09 UTC - in response to Message 1886.  
Last modified: 14 Jan 2017, 16:36:21 UTC

It must be specific to some types of systems. I haven't seen it in my current logs at all, and I think only one earlier, as posted above.
http://universeathome.pl/universe/results.php?hostid=57772&offset=0&show_names=0&state=4&appid=

I am on Ubuntu 16.10, running on an i7-4790, and this is a dedicated crunching machine that I leave on 24/7. Maybe if people report more about their systems, some sort of pattern will emerge.

Krzysztof Piszczek - wspieram Polski projekt BOINC
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 666
Credit: 89,789,965
RAC: 178
Message 1889 - Posted: 14 Jan 2017, 17:51:17 UTC - in response to Message 1886.  

I've just aborted another one. No New Tasks set. There is a serious flaw in the application here; I wonder how many crunchers are having their systems' time wasted.

The error rate is 0.6106%, counting all errors, not only this one...
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home project team
My Patreon profile

morgan
Joined: 21 Feb 15
Posts: 8
Credit: 1,632,008
RAC: 7
Message 1896 - Posted: 23 Jan 2017, 9:52:41 UTC - in response to Message 1889.  
Last modified: 23 Jan 2017, 9:55:43 UTC

I have aborted 6 WUs in the last few days!
At some point the checkpointing stops working, and they go on "forever" (or so it seems :))

The last one had run for 20 hours (25.xxx%) when checkpointing stopped, after running for approx. 1 hour!
I restarted BOINC, and the WU started from the last checkpoint (20.xxx%) and came to a halt again on reaching the same 25.xxx% as before (after a short time).
I let it run for 3-4 hours more. No progress, so I aborted it...
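
The pattern described above (progress stuck at the same percentage, even after a restart from an earlier checkpoint) could be detected automatically. The following is only a rough sketch: the function name and the sample numbers are hypothetical, loosely modelled on the figures in this post, and nothing here is part of BOINC or the project application.

```python
from typing import List, Tuple


def hours_without_progress(samples: List[Tuple[float, float]]) -> float:
    """Return the hours elapsed since 'fraction done' last reached a new high.

    samples: (elapsed_hours, fraction_done_percent) pairs in time order.
    A rollback to an earlier checkpoint followed by a return to the same
    percentage does not count as progress.
    """
    if not samples:
        return 0.0
    last_time = samples[-1][0]
    progress_time, best = samples[0]
    for t, frac in samples[1:]:
        if frac > best:
            best = frac
            progress_time = t
    return last_time - progress_time


# Hypothetical observations: stuck at 25%, restarted at 20%, stuck at 25% again.
observed = [(19.0, 24.9), (20.0, 25.0), (21.0, 25.0),
            (22.0, 20.0), (22.5, 25.0), (26.0, 25.0)]
print(hours_without_progress(observed))
```

A task could then be treated as suspect once this value exceeds some threshold chosen by the user, for example a few hours.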

Krzysztof Piszczek - wspieram Polski projekt BOINC
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 666
Credit: 89,789,965
RAC: 178
Message 1898 - Posted: 23 Jan 2017, 14:04:50 UTC - in response to Message 1896.  

The only thing I can suggest is to abort work units that compute longer than the usual tasks do on a particular computer. It's a very rare situation, and I suspect it is due more to a particular configuration than to the source code and/or algorithm.
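
As an illustration of the "longer than usual" rule suggested here, a check against a machine's own history might look like the sketch below. The function name, the factor of 3 and the sample runtimes are assumptions made for this example; the project does not specify any threshold.

```python
from statistics import median
from typing import Sequence


def runs_too_long(elapsed_hours: float,
                  past_runtimes_hours: Sequence[float],
                  factor: float = 3.0) -> bool:
    """Return True if a task has run longer than `factor` times the
    median runtime of tasks previously completed on the same computer."""
    if not past_runtimes_hours:
        # No history on this machine, so there is nothing to compare against.
        return False
    return elapsed_hours > factor * median(past_runtimes_hours)


# Hypothetical history of completed tasks on one machine, in hours.
history = [2.0, 2.5, 3.0]
print(runs_too_long(14.5, history))  # well beyond 3 x the median of 2.5 h
print(runs_too_long(4.0, history))   # within the normal range
```

Using the median rather than the mean keeps one earlier runaway task from inflating the baseline.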

adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 1900 - Posted: 24 Jan 2017, 9:51:15 UTC - in response to Message 1898.  

That is fine if people are sitting in front of their machines watching their tasks progress. That is not a realistic scenario: people who leave a machine running BOINC may not look at it for hours or days. I leave my machines running BOINC unattended for weeks sometimes if I am away. There is an issue with the project that is not an issue with others, simple as that. You find and fix it, or the crunchers leave.

fruehwf
Joined: 5 Jul 16
Posts: 31
Credit: 18,447,833
RAC: 0
Message 1921 - Posted: 31 Jan 2017, 20:10:36 UTC - in response to Message 1900.  

I must agree. It is an issue.

Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,383,221
RAC: 0
Message 1922 - Posted: 1 Feb 2017, 6:12:27 UTC
Last modified: 1 Feb 2017, 6:22:19 UTC

Do the tasks get into a "running forever with no chance to complete" state?

If so, then this seriously needs to get fixed. I have some computers also running unattended, and if they're wasting their resources, that is unacceptable.

Admins: Has anything been done to determine the cause of these problems? I've tried to give relevant info in my other thread ... but we need input and advice from you too, on how these tasks are supposed to behave, and what you can do to fix the bad behaviors! Can you please help us?

I'm going to set No New Tasks for this project, until the admins can fix this. Please admins!

==========================
For instance, the problematic one I'm going to abort right now, had this:

Application: Universe BHspin v2 0.01
Task Name: universe_bh2_160803_59_3_20000_1-999999_360000_2
URL: http://universeathome.pl/universe/result.php?resultid=19571218

"CPU time at last checkpoint" is 02:07:46 (2 hrs)
"CPU time" is 14:32:06 (14.5 hrs)
Estimated time remaining: 03:19:50 (AND INCREASING)
Fraction done: 77.415%

log.txt has several "making checkpoint" entries, but the last one was at:
02:08:41 00:08:21 making checkpoint: j: 15000; iidd: 4305399

checkpoint.dat:
15000 4305399 0 1 2

error.dat:
error: in Renv_con() unknown Ka type: 1, iidd_old: 4436666error: in Menv_con() unknown Ka type: 1, iidd_old: 4436666

error.dat2 and error.dat3:
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 4203947
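
For reference, the gap between the two CPU-time figures quoted in this post can be worked out as follows. This is only a sketch: the helper names are made up for illustration and are not part of BOINC or the project application.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to 'HH:MM:SS'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def checkpoint_gap(cpu_time: str, cpu_time_at_checkpoint: str) -> int:
    """Seconds of CPU time spent since the last checkpoint was written."""
    return to_seconds(cpu_time) - to_seconds(cpu_time_at_checkpoint)


# Values taken from the task properties quoted above.
gap = checkpoint_gap("14:32:06", "02:07:46")
print(format_hms(gap))  # 12:24:20
```

In other words, more than 12 hours of CPU time had passed with no new checkpoint, so all of that work would be lost on a restart.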

Jim1348
Joined: 28 Feb 15
Posts: 186
Credit: 52,822,747
RAC: 81,828
Message 1923 - Posted: 1 Feb 2017, 7:55:44 UTC
Last modified: 1 Feb 2017, 8:25:33 UTC

I just terminated one also. I only noticed it because it was running "high priority" after almost five days.
http://universeathome.pl/universe/result.php?resultid=19652511

But it was apparently completed (though not validated yet) by one other user.
http://universeathome.pl/universe/workunit.php?wuid=8550767

I get the long ones only about once every 200 work units (estimate - the logs don't go back to the last one). Whether that is a problem for the project I can't say.

Krzysztof Piszczek - wspieram Polski projekt BOINC
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 666
Credit: 89,789,965
RAC: 178
Message 1927 - Posted: 2 Feb 2017, 14:21:04 UTC - in response to Message 1922.  


error.dat2 and error.dat3:
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 4203947

I have an idea, let me check it...

Krzysztof Piszczek - wspieram Polski projekt BOINC
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 666
Credit: 89,789,965
RAC: 178
Message 1928 - Posted: 3 Feb 2017, 0:08:41 UTC - in response to Message 1927.  

OK, I have identified the functions where the problem exists, and... it is really interesting!
We will check it REALLY deeply, as it is... something which can give us more data.

Thank you for the investigation and the deep checking (it would have been impossible for us without your feedback).

Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,383,221
RAC: 0
Message 1929 - Posted: 3 Feb 2017, 0:21:54 UTC - in response to Message 1928.  

Sounds great, honestly!
Once you've resolved the issue, please let us know when it is safe to disable "No New Tasks", with no fear of getting a task that wastes the CPU. I'll be monitoring this thread.

ritterm
Joined: 6 Mar 15
Posts: 28
Credit: 16,721,329
RAC: 0
Message 1930 - Posted: 3 Feb 2017, 4:19:26 UTC - in response to Message 1928.  

Thank you for the investigation and the deep checking (it would have been impossible for us without your feedback).

Well done, Jacob Klein.

hartacus
Joined: 7 Feb 17
Posts: 6
Credit: 1,407,757
RAC: 0
Message 1946 - Posted: 12 Feb 2017, 5:27:51 UTC

I think I've had this issue too, but only on my Raspberry Pi model B. Tasks seemingly hanging on ~1% completion, with time remaining stretching months ahead. My Raspberry Pi 2 seems to crunch away just fine though, as does my desktop.

Btw thanks for supporting the Raspberry Pi!

Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,383,221
RAC: 0
Message 1947 - Posted: 12 Feb 2017, 5:48:36 UTC

krzyszp:

Any response yet for:
- Is this an application issue, or is it a bad batch of tasks?
- Is it fixed?

Krzysztof Piszczek - wspieram Polski projekt BOINC
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 666
Credit: 89,789,965
RAC: 178
Message 1953 - Posted: 15 Feb 2017, 12:40:48 UTC - in response to Message 1947.  
Last modified: 15 Feb 2017, 17:48:37 UTC

A new version of all apps will be available next week; I hope this resolves the problem.

I ran a few of those problematic tasks locally and found that the same code compiled without BOINC support doesn't cause problems, but compiled with BOINC support it does. So it suggests that some BOINC client versions cause the problem...

adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 1980 - Posted: 23 Feb 2017, 9:40:27 UTC

Are the new apps released now? It has been a week.

Chooka
Joined: 30 Jun 16
Posts: 17
Credit: 110,988,695
RAC: 1,315
Message 1981 - Posted: 23 Feb 2017, 18:58:56 UTC

Yep, I just aborted all my WUs.
A whole stack of errors. My last valid WU was 19/02/17.


adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 1982 - Posted: 23 Feb 2017, 19:22:10 UTC

Ah, err, that doesn't sound exactly optimal. Is the failure the same as the one this thread was about, or is this something new the "fix" has introduced?