Fixing Unexpected Ubuntu Freezes (Linux Crash/Hang)

zhangrelay

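Taking Intel Bay Trail (J1900/N2940) as an example, these freezes are usually caused by a compatibility problem between the Linux kernel and the hardware:

Look up the issue at: https://bugzilla.kernel.org/

Open the matching bug report to see the problem description and some proposed workarounds.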

Bug 109051 - intel_idle.max_cstate=1 required on baytrail to prevent crashes

Status: NEEDINFO
Alias: None
Product: Power Management
Component: intel_idle
Hardware: All Linux
Importance: P1 blocking
Assignee: Len Brown
Blocks: 113151
Reported: 2015-12-08 09:50 UTC by Daniel Vetter
Modified: 2017-05-11 12:31 UTC
CC List: 209 users
Kernel Version: 3.16-4.2
Tree: Mainline
Regression: No

Attachments
LG MP500 w/o fan (16.36 KB, text/plain) 2015-12-17 16:11 UTC, Chris Eineke
Advantech DS-370 (23.51 KB, text/plain) 2015-12-17 16:12 UTC, Chris Eineke
drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes (2.01 KB, patch) 2015-12-18 13:05 UTC, Mika Kuoppala
lspci -v Hostbridge and vga adapter output (1001 bytes, text/plain) 2016-01-08 10:32 UTC, julio.borreguero@gmail.com
drm/i915/vlv: Always enable internal pm interrupts (1.64 KB, patch) 2016-01-18 11:09 UTC, Mika Kuoppala
Kernel bisection between v4.2 v4.1 for sudden freezes (2.53 KB, text/plain) 2016-02-01 22:06 UTC, BzukTuk
attachment-24616-0.html (1.45 KB, text/html) 2016-03-16 22:10 UTC, Vincent Frentzel
attachment-28440-0.html (1.06 KB, text/html) 2016-03-17 00:27 UTC, Vincent Frentzel
Arch Linux 4.1.18 LTS panic #1 (photo 1 of 3) (2.69 MB, image/jpeg) 2016-03-17 02:53 UTC, John A.
Arch Linux 4.1.18 LTS panic #2 (photo 2 of 3) (2.72 MB, image/jpeg) 2016-03-17 02:55 UTC, John A.
Arch Linux 4.4.3 panic (photo 3 of 3) (2.65 MB, image/jpeg) 2016-03-17 02:57 UTC, John A.
attachment-21257-0.html (1.77 KB, text/html) 2016-03-22 06:47 UTC, jds
drm/i915: Prevent machine death on Ivybridge context switching for kernel 4.5.0 from kernel archive (1.54 KB, patch) 2016-03-26 21:56 UTC, julio.borreguero@gmail.com
Reverted commit 8fb55197e64... for 4.5.0 (4.88 KB, patch) 2016-04-04 12:25 UTC, Martin
attachment-24742-0.html (1.27 KB, text/html) 2016-05-18 05:44 UTC, jds
attachment-7936-0.html (1.63 KB, text/html) 2016-05-18 16:13 UTC, jds
attachment-22682-0.html (1.81 KB, text/html) 2016-06-07 04:48 UTC, Koen Roggemans
Disable all C6 states enable all C7 core states for Baytrail CPUs (1.33 KB, text/x-sh) 2016-07-14 18:09 UTC, Wolfgang M. Reimer
Shows all core states (C-states) + some related info as a formatted table (1.41 KB, text/x-sh) 2016-07-14 18:12 UTC, Wolfgang M. Reimer
attachment-21109-0.html (4.15 KB, text/html) 2016-09-19 21:22 UTC, Konstantin Koslowski
attachment-3924-0.html (1.42 KB, text/html) 2016-10-05 16:20 UTC, Koen Roggemans
attachment-14281-0.html (2.85 KB, text/html) 2016-10-13 17:37 UTC, Javier Antonio Nisa Avila
Patch to disable c-states at boot (1.77 KB, patch) 2016-10-16 05:38 UTC, Jochen Hein
Patch for Bay trail for 4.8 (1.95 KB, patch) 2016-12-14 17:56 UTC, Vincent Gerris
attachment-26085-0.html (2.01 KB, text/html) 2016-12-25 22:44 UTC, Vincent Gerris
Debug patch to enable BYT C6 auto-demotion (1.73 KB, patch) 2016-12-27 21:57 UTC, Len Brown
nanosleep.c (723 bytes, text/plain) 2016-12-28 11:43 UTC, Len Brown
tubostat --debug -o ts.out sleep 10 (2.01 KB, text/plain) 2017-01-02 09:52 UTC, Dmitry
Test script to freeze your baytrail quickly (1.01 KB, application/octet-stream) 2017-01-10 09:58 UTC, Len Brown
T100CHI turbostat kernel 4.9 patched (1.98 KB, text/plain) 2017-01-10 19:16 UTC, jbMacAZ
CHI_freeze_4.9.2_no_demotion_disable_patch (3.61 KB, text/plain) 2017-01-11 22:59 UTC, jbMacAZ
pstate.set script (1.97 KB, text/plain) 2017-01-26 04:46 UTC, Len Brown
Mika v3: drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3 (3.98 KB, patch) 2017-02-28 03:13 UTC, Len Brown
latest turbostat (17.03.04) utility for baytrail (80.20 KB, application/octet-stream) 2017-03-05 23:46 UTC, Len Brown
attachment-16106-0.html (1.60 KB, text/html) 2017-03-16 04:38 UTC, Alejandro Morales Lepe

Description Daniel Vetter 2015-12-08 09:50:50 UTC

This originally started as a gpu regression report against a change to the turbo logic. After much random walking reporter consensus seems to have settled on max_cstate=1 as the one true workaround. See

https://bugs.freedesktop.org/show_bug.cgi?id=88012

For all the glorious details.

Comment 1 Vladimir Jicha 2015-12-08 10:13:29 UTC

For me setting max_cstate=1 didn't solve the bug. It improved the time to freeze from a couple of minutes to a couple of hours. But it is not a fully and universally working workaround.

Comment 2 Anael O. 2015-12-08 10:42:29 UTC

Experienced on an Intel Celeron CPU J1900 (platform GB-BXBT-1900) on Arch Linux x64. I cannot upgrade to a kernel higher than 3.14, otherwise I get very frequent crashes when playing videos or browsing the web. By contrast, kernel 3.14 is extremely stable and the machine can stay up for weeks.

Comment 3 raidyne 2015-12-08 11:09:49 UTC

same here on an Asrock Q1900-ITX (Intel Celeron J1900): random freezes in X session.

Comment 4 Wolfgang M. Reimer 2015-12-09 14:56:37 UTC

Same here on 50+ ASRock IMB-150 mini-ITX (Intel Celeron J1900) boards: Random freezes, time to freeze ranging from some ten minutes to some hours, only when using X with conky + own QT based App (no freezes when not using X!), so it seems very likely that this problem is GPU related.

I will test with kernel parameter intel_idle.max_cstate=1 to see if it is a working workaround for my case and report here later.

Comment 5 raidyne 2015-12-09 15:03:34 UTC

but intel_idle.max_cstate=1 would result in seriously increased power consumption?!

Comment 6 Michal Feix 2015-12-09 15:19:13 UTC

(In reply to raidyne from comment #5)
> but intel_idle.max_cstate=1 would result in seriously increased power
> consumption?!

Correct. And that is the reason, why this bug needs to be fixed soon :-) intel_idle.max_cstate=1 is just a quick workaround, so your baytrail machine can live longer than just a few minutes.

I do confirm same random machine freezes on Acer notebook with Celeron N2940. Random freezes are really random, but usually more frequent when using CPU or GFX heavily. Freezes occur on 4.2.X kernels I've tested so far. I've been able to fix this by using either intel_idle.max_cstate=1 or intel_pstate=disable. Using one of these kernel parameters makes my machine usable again.
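
To make the workaround survive a reboot, the parameter has to be added to the kernel command line. The sketch below shows one way to do the text edit in Python. It assumes the usual Ubuntu layout, where the options live in the `GRUB_CMDLINE_LINUX_DEFAULT` line of `/etc/default/grub`. Neither that file nor the helper name `add_kernel_param` comes from the bug report; both are illustrative assumptions.

```python
import re


def add_kernel_param(grub_text: str, param: str) -> str:
    """Return grub_text with `param` present in GRUB_CMDLINE_LINUX_DEFAULT.

    An existing setting for the same key (for example an older
    intel_idle.max_cstate=2) is replaced rather than duplicated.
    """
    key = param.split("=", 1)[0]
    out = []
    for line in grub_text.splitlines():
        m = re.match(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"(.*)"\s*$', line)
        if m:
            # Drop any previous value for the same key, then append the new one.
            kept = [p for p in m.group(2).split() if p.split("=", 1)[0] != key]
            line = '{}"{}"'.format(m.group(1), " ".join(kept + [param]))
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical file content, used only to demonstrate the edit.
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "intel_idle.max_cstate=1"), end="")
```

The function only transforms text, so it can be tried on a copy of the file first. On Ubuntu the edited file is normally activated with `sudo update-grub` followed by a reboot.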

Comment 7 Michal Feix 2015-12-09 16:04:31 UTC

I've successfully tested longterm kernel 4.1.13. This one seems to work without a single freeze for the last 8 hours of uptime. I didn't need to use any of the intel_idle.max_cstate or intel_pstate kernel parameters with this kernel.

Comment 8 raidyne 2015-12-10 12:59:00 UTC

I'm happy to provide you with any logs. Unfortunately though, my system does not seem to be particularly verbose concerning this bug: I could not find any hints in dmesg, kern.log, syslog, xorg.log.

Comment 9 Steven Ellis 2015-12-10 19:49:06 UTC

I'm seeing this freeze on an Acer Notebook with a Celeron N2940 when running Fedora.

https://bugzilla.redhat.com/show_bug.cgi?id=1285895

No issues with older fedora 22 kernel
 - kernel-4.1.6-200.fc22.x86_64

Still have issues with latest fedora kernel build
 - kernel-4.2.6-301.fc23.x86_64

Is there an easy way to show the current cstate of the system?

Comment 10 Chris Rainey 2015-12-11 15:12:00 UTC

(In reply to Steven Ellis from comment #9)
> I'm seeing this freeze on an Acer Notebook with a Celeron N2940 when running
> Fedora.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1285895
> 
> No issues with older fedora 22 kernel
>  - kernel-4.1.6-200.fc22.x86_64
> 
> Still have issues with latest fedora kernel build
>  - kernel-4.2.6-301.fc23.x86_64
> 
> Is there an easy way to show the current cstate of the system?

Yes: PowerTOP (http://01.org/powertop/) and i7z (https://code.google.com/p/i7z/) can tell you this.


Example output from PowerTop:



PowerTOP 2.8      Overview   Idle stats   Frequency stats   Device stats   Tunables                                     


          Package   |             Core    |            CPU 0
                    |                     | C0 active   0.7%
                    |                     | POLL        0.1%    0.3 ms
                    | C1 (cc1)   99.3%    | C1-BYT     99.4%    4.2 ms
C2 (pc2)    0.0%    |                     |
C3 (pc3)    0.0%    |                     |
C6 (pc6)    0.0%    | C6 (cc6)    0.0%    |

                    |             Core    |            CPU 1
                    |                     | C0 active   0.5%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   99.5%    | C1-BYT     99.4%   22.8 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    0.0%    |

                    |             Core    |            CPU 2
                    |                     | C0 active   0.9%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   98.9%    | C1-BYT     99.0%    8.5 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    0.0%    |

                    |             Core    |            CPU 3
                    |                     | C0 active   1.3%
                    |                     | POLL        0.0%    0.0 ms
                    | C1 (cc1)   98.0%    | C1-BYT     98.0%   11.1 ms
                    |                     |
                    |                     |
                    | C6 (cc6)    0.0%    |

                    |             GPU     |
                    |                     |
                    | Powered On  0.0%    |
                    | RC6       100.0%    |
                    | RC6p        0.0%    |
                    | RC6pp       0.0%    |
                    |                     |
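
The same per-CPU idle statistics can also be read without PowerTOP, straight from the kernel's cpuidle sysfs directory. The following is a minimal sketch, assuming the standard layout `/sys/devices/system/cpu/cpuN/cpuidle/stateM/` with the files `name`, `usage` and `time`. The function name `read_cstates` and the `sysfs_root` parameter are made up for this example (the latter only makes the function testable against a fake tree).

```python
from pathlib import Path


def read_cstates(cpu: int = 0, sysfs_root: str = "/sys") -> list:
    """Return one dict per idle state of `cpu`, as exposed by cpuidle in sysfs."""
    base = Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}" / "cpuidle"
    states = []
    # Directories are named state0, state1, ...; sort numerically so that
    # state10 would come after state9.
    for d in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        states.append({
            "index": int(d.name[5:]),
            "name": (d / "name").read_text().strip(),
            "usage": int((d / "usage").read_text()),    # times the state was entered
            "time_us": int((d / "time").read_text()),   # total residency, microseconds
        })
    return states


if __name__ == "__main__":
    for s in read_cstates():
        print(f'{s["index"]}: {s["name"]:<10} entered {s["usage"]} times, {s["time_us"]} us')
```

With `intel_idle.max_cstate=1` in effect, one would expect only the shallow states (such as `POLL` and `C1-BYT` in the PowerTOP output above) to accumulate residency, but that is something to verify on the machine in question.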

Comment 11 Juha Sievi-Korte 2015-12-12 10:34:36 UTC

I can also confirm that this workaround works for me, running 4.3.0-2 now for about two weeks with intel_idle.max_cstate=1 and no freezes. Cheers for this, I was getting desperate with the constant hangs.

Downgrading kernel to 3.16.7-29 makes this run fine without any boot parameters, but anything newer than that means frequent freezes.

Using intel_pstate=disable does not work for this hardware / kernel combination either, it still hangs. Only limiting cstate seems to cure this.

Acer B-115M Laptop with Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz

If there is anything that I can do to help to trace this, let me know.

Comment 12 Chris Rainey 2015-12-12 21:27:40 UTC

Good reading for better understanding of this issue:


1. C-states and P-states are very different(https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)


2. Power Management States: P-States, C-States, and Package C-States(https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)


3. (update) C-states, C-states and even more C-states(https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)


Hope this helps!

Comment 13 Wolfgang M. Reimer 2015-12-17 09:14:58 UTC

(In reply to Wolfgang M. Reimer from comment #4)
> Same here on 50+ ASRock IMB-150 mini-ITX (Intel Celeron J1900) boards:
> Random freezes, time to freeze ranging from some ten minutes to some hours,
> only when using X with conky + own QT based App (no freezes when not using
> X!), so it seems very likely that this problem is GPU related.
> 
> I will test with kernel parameter intel_idle.max_cstate=1 to see if it is a
> working workaround for my case and report here later.

I can confirm that kernel parameter intel_idle.max_cstate=1 is a working workaround for my case (50+ ASRock IMB-150 mini-ITX Intel Celeron J1900 boards running a 3.18.21-rt19 kernel)

Comment 14 Pascal VITOUX 2015-12-17 14:37:57 UTC

I can confirm too, parameter "intel_idle.max_cstate=2" is required on two laptops (Medion Akoya E6239T and S6217T) with these CPUs:
 - Intel Celeron CPU N2930 1.83GHz
 - Intel Celeron CPU N2940 1.83GHz
The random freezes come back when setting max_cstate to 3.

Also, I don't need it on two other similar laptops (Medion Akoya E6239) with these CPUs:
 - Intel Celeron CPU N2830 2.16GHz
 - Intel Celeron CPU N2840 2.16GHz

Comment 15 Chris Eineke 2015-12-17 16:11:34 UTC

I, too, can confirm this issue on systems that use an Intel Celeron N2930@1.83GHz or an Intel Celeron J1900@1.99GHz. While adding "intel_idle.max_cstate=1" to kernel command-line fixed the issue, the regression in GPU performance wasn't acceptable. Bumping it to "intel_idle.max_cstate=2" seems to make it run with adequate GPU performance while presenting no more hard lock-ups. I attached the output of lshw of both systems.

Comment 16 Chris Eineke 2015-12-17 16:11:57 UTC

Created attachment 197611 [details]
LG MP500 w/o fan

Comment 17 Chris Eineke 2015-12-17 16:12:23 UTC

Created attachment 197621 [details]
Advantech DS-370

Comment 18 Mika Kuoppala 2015-12-17 16:37:45 UTC

Created attachment 197631 [details]
drm/i915/vlv: Take forcewake on media engine writes

Long shot, but could someone give this a spin.

Comment 19 G. Bremer 2015-12-17 20:49:17 UTC

Can anyone confirm that this problem is limited to Bay Trail and does not affect Braswell such as N3150 or N3700?  Ran into this intermittent freeze-up problem after upgrading several J1900 and N2930 based boards to 3.19 kernel.  [Had used  3.13 previously.]  intel_idle.max_cstate=1 seems to solve the problem...all units up for 48hrs anyway.  I much appreciated finding that this is a known/reported problem.  We are moving to the Braswell based boards and wondering if there are any known stability problems.  Thank you.

Comment 20 Chris Rainey 2015-12-17 21:35:19 UTC

(In reply to G. Bremer from comment #19)
> Can anyone confirm that this problem is limited to Bay Trail and does not
> affect Braswell such as N3150 or N3700?  Ran into this intermittent
> freeze-up problem after upgrading several J1900 and N2930 based boards to
> 3.19 kernel.  [Had used  3.13 previously.]  intel_idle.max_cstate=1 seems to
> solve the problem...all units up for 48hrs anyway.  I much appreciated
> finding that this is a known/reported problem.  We are moving to the
> Braswell based boards and wondering if there are any known stability
> problems.  Thank you.

Confirming same issue on N3050(Braswell/Cherry Trail/Airmont).

Comment 21 fritsch 2015-12-17 21:37:13 UTC

I can fully confirm that this issue is _not_ happening on Braswell N3150 and N3700 - both chips are perfectly fine without any patching.

3.19 is not even working on braswell.

Comment 22 Wolfgang M. Reimer 2015-12-18 09:18:22 UTC

(In reply to fritsch from comment #21)
> I can fully confirm that this issue is _not_ happening on Braswell N3150 and
> N3700 - both chips are perfectly fine without any patching.

If you make such a statement like the one above then please specify for which kernel revision(s) this is true. Older kernel revisions (like e.g. 3.13.x) do not exhibit the issue for BayTrail processors. This thread is about (more or less) random freezes of BayTrail (and possibly newer) processors running NEWER kernel revisions (e.g. 3.18.x and newer) when used without kernel parameter intel_idle.max_cstate=1 (please do not confuse this with kernel patching).

> 
> 3.19 is not even working on braswell.

What does that mean? Does the 3.19 kernel freeze on Braswell at start-up immediately? What happens when the kernel boot parameter intel_idle.max_cstate=1 is specified for this 3.19 kernel? How does that correlate to the above message that "this issue is _not_ happening on Braswell N3150 and N3700"? What is the exact kernel revision of the 3.19 kernel you tried (or did you test it on ALL 3.19.* kernels)?

Comment 23 Peter Fr 2015-12-18 09:29:27 UTC

I am the original submitter of the bugreport. At the time of filing it, Braswell did only exist on paper.

To get the GPU up and running on a Braswell system you need at least kernel 4.1 or later, or special parameters for older kernels to force GPU acceleration. Whatever kernel you run with 3.13 / 3.19 has no mainline GPU support. It won't work at all. Is this something Ubuntu patched?

My Braswell 3150 systems (minix / asrock) currently run with kernel 4.3 and 4.4-rc5 without issues.

Here are the kernel images if you want to verify:
http://fritsch.fruehberger.net/kernel/linux-image-4.3.0-pt-bt1+_4.3.0-pt-bt1+-10.00.Custom_amd64.deb
http://fritsch.fruehberger.net/kernel/linux-headers-4.3.0-pt-bt1+_4.3.0-pt-bt1+-10.00.Custom_amd64.deb

Comment 24 fritsch 2015-12-18 09:31:27 UTC

To avoid confusion: the last post was done by me, but with the wrong account. Now happy testing.

Comment 25 Mika Kuoppala 2015-12-18 13:05:11 UTC

Created attachment 197671 [details]
drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes

Comment 26 Pascal VITOUX 2015-12-18 13:11:41 UTC

(In reply to Mika Kuoppala from comment #18)
> Created attachment 197631 [details]
> drm/i915/vlv: Take forcewake on media engine writes
> 
> Long shot, but could someone give this a spin.

Tested without success with kernel 4.3.2 on my two laptops (CPU N2930 and N2940).
They froze in less than two hours.

Comment 27 lewexeki 2015-12-20 04:31:10 UTC

Hi,

I had the same problem with "Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz". With kernel 4.2.0-16.19 there were ~5-8 freezes/day. After upgrading to 4.3.3-040303-generic (ubuntu version) it was much better: 1/2 freezes/day. With cstate=1 there has not been one yet.

Comment 28 Nicolas Porcel 2015-12-21 20:10:19 UTC

(In reply to Mika Kuoppala from comment #25)
> Created attachment 197671 [details]
> drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes

Also tested on kernel 4.3.3 on Arch Linux and it didn't work. I have an Asrock Q1900M (with intel J1900). It froze after less than 1 hour of video playback, so no improvement compared to the base Arch Linux default kernel without the patch (v4.2.5).

Comment 29 Wolfgang M. Reimer 2015-12-22 14:36:07 UTC

(In reply to Peter Fr from comment #23)

> To get the GPU up and running on a braswell system you need at least kernel
> 4.1 or later or special parameters for older kernels to force gpu
> acceleration. Whatever kernel you run with 3.13 / 3.19 has no mainline gpu
> support. It won't work at all. Is this something Ubuntu patched?
> 


Thanks for the info.

Yes, Ubuntu 15.04 (Vivid) made some patches to the 3.19 kernel line for Braswell systems (Ubuntu kernel 3.19.0-20 till 3.19.0.42, see http://forum.kodi.tv/showthread.php?tid=227771&pid=2026016#pid2026016 and http://www.phoronix.com/scan.php?page=news_item&px=Intel-Braswell-Fedora-Ubuntu)
It's got support for Braswell systems however I don't know how complete this support is. The Ubuntu 15.10 (Wily) kernel 4.2.0-22 should also run on Braswell systems. The Vivid and the Wily kernel are both available for Ubuntu 14.04 LTS (Trusty, the Ubuntu release I use), too.

Comment 30 fritsch 2015-12-22 14:38:07 UTC

Then, please: Reproduce with mainline kernels. We cannot let the kernel devs debug ubuntu's picked together kernel ...

Comment 31 Wolfgang M. Reimer 2015-12-22 16:52:14 UTC

(In reply to fritsch from comment #30)
> Then, please: Reproduce with mainline kernels. We cannot let the kernel devs
> debug ubuntu's picked together kernel ...

My report does _NOT_ relate to the Ubuntu kernels _NOR_ does it relate to a Braswell system. See my Comments https://bugzilla.kernel.org/show_bug.cgi?id=109051#c4 and https://bugzilla.kernel.org/show_bug.cgi?id=109051#c13 above.

Comment 32 Markus Rehbach 2015-12-22 21:03:02 UTC

No freeze on a Acer E11 (N2940) after "echo acpi_pm > /sys/bus/clocksource/devices/clocksource0/current_clocksource" but it hit me not so often. Most of the time after reboot and not after standby/resume.

Comment 33 lewexeki 2015-12-22 22:16:16 UTC

I will compile a mainline kernel and test it.

I feel there is some connection with browsing. I get freezes while I watch online videos with Firefox or open a new site with multimedia content. I have disabled hardware acceleration to see what will happen. I will report back.

Comment 34 lewexeki 2015-12-23 02:12:14 UTC

There is no change. Freeze again and again. The only solution is "intel_idle.max_cstate=1".

Does anybody know when this will be fixed? With the kernel parameter my CPU is noticeably warmer, which I don't think is good. I bought a notebook with an Intel Atom (N) CPU because it is energy efficient.

Comment 35 jbMacAZ 2015-12-23 07:47:58 UTC

Freeze occurs on ASUS T100-CHI running Cinnamon Desktop on Mint17.3 or Manjaro15.12 with 64bit kernels after 3.16.7 including 4.3.x and 4.4-rcx.  Until 4.2.6, capping GPU frequency greatly reduced the freeze rate for me.  After 4.2.5 GPU frequency did not affect freeze rate (GPU hang fixed?)

Freeze rate seems to depend on particulars of the distro, kernel and device it runs on.  My setup freezes within a few minutes without a max_cstate below 2.  I notice warmer system temperatures with cstate=0.  YMMV.

Comment 36 Nicolas Porcel 2015-12-23 19:30:07 UTC

(In reply to Markus Rehbach from comment #32)
> No freeze on a Acer E11 (N2940) after "echo acpi_pm >
> /sys/bus/clocksource/devices/clocksource0/current_clocksource" but it hit me
> not so often. Most of the time after reboot and not after standby/resume.

This seems to work on my Intel J1900. Can more people confirm that this works? To make the change permanent, you can add the option "clocksource=acpi_pm" to your kernel command line.

What is the drawback of using the acpi_pm clock? From what I have read (https://access.redhat.com/solutions/18627) it has a lower frequency, 3.58 MHz compared to the 2 GHz of my CPU clock. We could just force the kernel to switch to the acpi_pm clock when available if the CPU is a Bay Trail / Braswell.
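
To put the frequency difference into numbers and to check which clocksource is actually active, here is a small sketch. It reads the sysfs directory quoted in comment #32 and assumes the `available_clocksource` file sits next to `current_clocksource`, which is the usual layout. The ACPI PM timer rate of 3.579545 MHz is the nominal value behind the "3.58 MHz" mentioned above. The function names are illustrative.

```python
from pathlib import Path

# Directory taken from comment #32 of the bug report.
CLOCKSOURCE_DIR = "/sys/bus/clocksource/devices/clocksource0"
ACPI_PM_HZ = 3_579_545  # nominal ACPI PM timer frequency


def clocksource_info(base: str = CLOCKSOURCE_DIR) -> dict:
    """Return the active clocksource and the ones the kernel offers."""
    d = Path(base)
    return {
        "current": (d / "current_clocksource").read_text().strip(),
        "available": (d / "available_clocksource").read_text().split(),
    }


def tick_ns(freq_hz: float) -> float:
    """Length of one counter tick in nanoseconds."""
    return 1e9 / freq_hz
```

Calling `tick_ns(ACPI_PM_HZ)` gives roughly 279 ns per tick, against 0.5 ns for a 2 GHz counter, so the cost of acpi_pm is coarser timestamps (and, as the next comment shows, it did not cure the freezes anyway).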

Comment 37 Nicolas Porcel 2015-12-23 23:30:50 UTC

I was wrong, turns out it takes more time to freeze but it eventually does. The best option so far is the cstate option.

Comment 38 mazout360 2015-12-24 01:10:59 UTC

There's something strange with this bug... On my Q1900DC-ITX I tried every single version of the mainline kernel from 3.16 to 4.3. It still hangs on 4.0, it hangs on 4.2, but the whole 4.1 kernel series from 4.1.0 to 4.1.15 is very stable. No need for the cstate configuration or any patch published here or on the other thread.
For some reason, it "seemed" to get fixed in 4.1-rc something and the bug came back in 4.2.0. Now, I don't know much about how the whole i915 driver works, but it seems like a lot of the changes in 4.2 concern the Cherryview chips, except these:

drm/i915: Use spinlocks for checking when to waitboost
drm/i915: Don't downclock whilst we have clients waiting for GPU results 
drm/i915: Agressive downclocking on Baytrail/drm/i915: Fix computation of last_adjustment for RPS autotuning 

Looks like they directly affect baytrail chips and they alter code changes introduced right before the 4.1 series. I also remember trying to revert the drm/i915: Agressive downclocking on Baytrail commit without success on 4.2.

Comment 39 jbMacAZ 2015-12-24 06:33:18 UTC

I tried the clocksource parameter without cstate.  It froze within a few minutes (4.3.3/T100-CHI).  So far my freeze is independent of GPU frequency and system clock source!  

4.1 was more stable for me than 4.2.x, 4.3.x.  But the rest of my hardware works better with newer kernels.  Otherwise I could avoid the bleeding edge kernels.

Comment 40 Dmitry 2015-12-24 09:08:39 UTC

Additional info:
I have a Dell Venue 11 Pro with an Atom Z3770. I have been seeing these freezes, like everybody else, since 3.17. After 4.1 the behaviour of the freezes changed slightly, but they still happen.
Neither intel_idle.max_cstate=0 nor switching to the acpi_idle driver solves this bug on the latest kernel, 4.4-rc6. So it's not the idle driver's fault.
intel_idle.max_cstate=2 (cstate=1 also) completely solves the freezes.

The only difference between acpi_idle (freezes) and intel_idle with max_cstate=2 (don't freeze) is in this state: ACPI FFH INTEL MWAIT 0x64.
I'll try with max_cstate=3, but I think it'll freeze too.
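
Instead of capping everything with `max_cstate`, individual idle states can also be switched off at runtime through the `disable` file that cpuidle exposes for each state. One of the attachments listed above ("Disable all C6 states enable all C7 core states for Baytrail CPUs") takes that approach as a shell script. The sketch below is not that script, just a Python illustration of the same idea. The function name and the `"C6"` name prefix are assumptions to check against the state names on the actual machine.

```python
from pathlib import Path


def disable_states(prefix: str, sysfs_root: str = "/sys", dry_run: bool = True) -> list:
    """Disable every cpuidle state whose name starts with `prefix`, on all CPUs.

    Returns the `disable` files that were written (or, with dry_run=True,
    the ones that would be written).
    """
    touched = []
    cpu_dir = Path(sysfs_root) / "devices/system/cpu"
    for state in sorted(cpu_dir.glob("cpu[0-9]*/cpuidle/state[0-9]*")):
        name = (state / "name").read_text().strip()
        if name.startswith(prefix):
            target = state / "disable"
            if not dry_run:
                target.write_text("1")  # requires root on a real system
            touched.append(str(target))
    return touched
```

`disable_states("C6")` only lists the affected files. Passing `dry_run=False` as root performs the write. The setting is lost on reboot, which makes it convenient for experiments such as finding which state triggers the freeze.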

I can reproduce freezes with html5 video in firefox. For 3.17-4.0 it happens within 10 minutes. After 4.1 it happens within 1 hour.

Comment 41 Dmitry 2015-12-24 14:36:29 UTC

Tried every cstate till 6 and cannot reproduce this bug anymore... Even without parameter huge films over wifi and html5 video from firefox works without freezes. I'll continue testing.


cat /proc/cmdline 
root=/dev/mmcblk0p6 ro init=/usr/lib/systemd/systemd rootfstype=ext4 tsc=reliable force_tsc_stable=1 clocksource=tsc clocksource_failover=tsc swap_zram zram.num_devices=4 

uname -a
Linux venue11pro 4.4.0-rc6-dirty #200 SMP PREEMPT Thu Dec 24 15:23:06 MSK 2015 i686 Intel(R) Atom(TM) CPU Z3770 @ 1.46GHz GenuineIntel GNU/Linux

mesa 11.1
xf86-video-intel 2.99.917-r2 (gentoo version)
libdrm 2.4.65

P.S. Linux "dirty" because of ath6kl patch, soc_button_array patch and gcc native optimization patches.
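
When comparing reports like this one, it helps to check mechanically whether a given boot actually had the workaround set. A minimal sketch follows, assuming the command line is a plain whitespace-separated list of `key=value` tokens (quoted values containing spaces are not handled). Both function names are illustrative.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def max_cstate_limit(cmdline: str):
    """Return the intel_idle.max_cstate limit as an int, or None if it is not set."""
    value = parse_cmdline(cmdline).get("intel_idle.max_cstate")
    return int(value) if value is not None else None
```

On a running system the input would come from `open("/proc/cmdline").read()`. For the command line posted above, `max_cstate_limit` returns `None`, which matches the statement that the test ran without the parameter.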

Comment 42 jbMacAZ 2015-12-24 22:03:31 UTC

Replacing cstate=1 with "clocksource=acpi_pm" my setup froze within a few minutes.  Replacing cstate=1 with "tsc=reliable force_tsc_stable=1 clocksource_failover=tsc" gave me significantly more run time before freezing.  I was able to run almost 2 hours (~20x) streaming a bald eagle cam. T100-CHI (intel 3775) with hardware specific patches, 4.3.3 (Manjaro)

Comment 43 Nicolas Porcel 2015-12-24 23:40:33 UTC

(In reply to mazout360 from comment #38)
> There's something strange with this bug...on my Q1900DC-ITX I tried every
> single version of the mainline kernel from 3.16 to 4.3. It still hangs on
> 4.0, it hangs on 4.2, but the whole 4.1 kernel version from 4.1.0 to 4.1.15
> is very stable. No need for the cstate configuration or any patch published
> here or on the other thread.

I also run kernel 4.1.5 (from ArchLinux, which doesn't include any patch) without any freeze on Q1900-ITX. Current uptime is 10 hours, with Netflix video streaming, although it is stopped from time to time. I will need more time to be sure, but it seems to work so far.

It is for now the best option without any major drawback like the video driver not working or power saving being disabled. I will try to bisect between 4.0 and 4.2 to see exactly which commit fixed the regression and which one reintroduced it.

Comment 44 Michaël 2015-12-25 05:38:10 UTC

I confirm the random freezes on Acer TravelMate 115 (same as Juha Sievi-Korte).  Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz, with Arch's linux-4.1.15-1-lts.  The freezes occur mostly while watching videos, and are way sparser than the ones reported here (on specific days, I'd have 5 freezes, then it would be fine for a few weeks, and resume).

Comment 45 Ariel 2015-12-26 19:52:28 UTC

Random freezes happening on Fedora 23 - Kernel 4.2.8-300.fc23.x86_64. And before with Fedora 22.

On Asrock Q1900-ITX, BIOS P1.40 (latest available).

This has been happening for about a year now, at different freeze frequencies going from 2 minutes after boot up to a few weeks. It only happens intermittently when playing back video content with Kodi (this is an HTPC). It doesn't happen when compiling, playing music, or when the home server stays idle. 

I noticed that certain videos (but not a specific codec) are much more prone than others to trigger the bug. Disabling hardware acceleration does NOT solve the problem. It has been a very frustrating experience.

Comment 46 FL 2015-12-28 11:09:54 UTC

Same problem with ASUS ET2325IUK with J2900  @ 2.41GHz + Arch Linux 4.1.15-1 and 4.2.5-1 (videos, system upgrades,html5...)
Freezing also systematically appears when closing gnome or cinnamon session (gdm).

Comment 47 Dmitry 2015-12-29 15:20:28 UTC

No, not fixed. It froze while just scrolling in Firefox, without any video. max_cstate is still needed.

Comment 48 Elmar Melcher 2016-01-02 14:02:10 UTC

Same problem on Positivo ZX3040 http://lad.dsc.ufcg.edu.br/lad/pmwiki.php?n=Lad.Tablet, but occasional hard lock-ups even with intel_idle.max_cstate=1.
Are the patches in https://github.com/hadess/rtl8723bs related in any way to this problem?

Comment 49 gpdemedici 2016-01-03 14:26:19 UTC

Last not having issue: 4.1.3 
First to show issue: 4.2.0

I am on UBUNTU and have the issue. I tested the mainline kernels. From my testing UBUNTU 4.1.0-3.3 is the last kernel known to me not having the issue, successive kernel UBUNTU 4.2.0-7.7 has the issue. To my knowledge these map to 4.1.3 and 4.2.0 mainline kernels respectively. I am sharing this hoping somebody can find this information useful to make progress towards fixing the issue.

MAINLINE KERNELS

vivid linux 
3.19.0-32.37	Ubuntu-3.19.0-32.37	3.19.8-ckt7 kernel used before I upgraded to wily, does not have issue
3.19.0-33.38	Ubuntu-3.19.0-33.38	3.19.8-ckt7
3.19.0-37.42	Ubuntu-3.19.0-37.42	3.19.8-ckt9
3.19.0-39.44	Ubuntu-3.19.0-39.44	3.19.8-ckt9
3.19.0-41.46	Ubuntu-3.19.0-41.46	3.19.8-ckt10
3.19.0-42.48	Ubuntu-3.19.0-42.48	3.19.8-ckt10 (last Vivid kernel, not tested for issue)

Wily linux

3.19.0-20.20	Ubuntu-3.19.0-20.20	3.19.8
4.0.0-4.6	Ubuntu-4.0.0-4.6	4.0.7
4.0.0-4.7	Ubuntu-4.0.0-4.7	4.0.7 works fine, issue not found here
4.1.0-1.1	Ubuntu-4.1.0-1.1	4.1.0 works fine, issue not found here
4.1.0-2.2	Ubuntu-4.1.0-2.2	4.1.3
4.1.0-3.3	Ubuntu-4.1.0-3.3	4.1.3 last known to me not having issue
4.2.0-7.7	Ubuntu-4.2.0-7.7	4.2.0 has issue
4.2.0-10.11	Ubuntu-4.2.0-10.11	4.2.0
4.2.0-10.12	Ubuntu-4.2.0-10.12	4.2.0 has issue
4.2.0-11.13	Ubuntu-4.2.0-11.13	4.2.1 has issue, also at log-in reporting an error with /usr/bin/Xorg
4.2.0-12.14	Ubuntu-4.2.0-12.14	4.2.1 
4.2.0-14.16	Ubuntu-4.2.0-14.16	4.2.2 has issue
4.2.0-15.18	Ubuntu-4.2.0-15.18	4.2.3
4.2.0-16.19	Ubuntu-4.2.0-16.19	4.2.3
4.2.0-17.21	Ubuntu-4.2.0-17.21	4.2.3
4.2.0-18.22	Ubuntu-4.2.0-18.22	4.2.3 has issue
4.2.0-19.23	Ubuntu-4.2.0-19.23	4.2.6
4.2.0-21.25	Ubuntu-4.2.0-21.25	4.2.6
4.2.0-22.27	Ubuntu-4.2.0-22.27	4.2.6

upstream kernel v4.3.0 has issue
upstream kernel v4.4.3 has issue
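
The table above amounts to a coarse manual bisection. As a sketch of how the "last good / first bad" pair falls out of such results, the snippet below encodes only the Wily rows that carry an explicit verdict and sorts them numerically (a plain string sort would put 4.2.0-10.12 before 4.2.0-7.7). The helper names are made up for this illustration.

```python
def version_key(v: str) -> tuple:
    """Turn '4.2.0-10.12' into (4, 2, 0, 10, 12) so versions sort numerically."""
    return tuple(int(x) for x in v.replace("-", ".").split("."))


def regression_window(results: dict) -> tuple:
    """Return (last good, first bad) from {version: True if stable, False if it froze}."""
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    last_good = good[-1] if good else None
    # First failing version that is newer than the last good one.
    first_bad = next(
        (v for v in ordered
         if not results[v]
         and (last_good is None or version_key(v) > version_key(last_good))),
        None,
    )
    return last_good, first_bad


# Ubuntu Wily kernels from the comment above that carry an explicit verdict.
TESTED = {
    "4.0.0-4.7": True, "4.1.0-1.1": True, "4.1.0-3.3": True,
    "4.2.0-7.7": False, "4.2.0-10.12": False, "4.2.0-11.13": False,
    "4.2.0-14.16": False, "4.2.0-18.22": False,
}

if __name__ == "__main__":
    print(regression_window(TESTED))
```

For this data the window is 4.1.0-3.3 (good) to 4.2.0-7.7 (bad), the same pair the commenter names, and it would be the natural starting range for a `git bisect` between the corresponding mainline tags.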

Comment 50 julio.borreguero@gmail.com 2016-01-08 10:26:59 UTC

i have an acer aspire es1-711


i am on gentoo linux self compiled kernels.
same problem on my linux mint partition, it certainly is a kernel bug.

latest kernel to work fine is 4.1.12 (4.1.13 is reported to work as well, haven't tested it), absolutely stable.

any 4.2 or 4.4 kernels freeze the system, no traces, no reproduction scenarios.

i can't confirm the " intel_idle.max_cstate=1" workaround to be a solution.

tested it with kernel 4.4.0-rc6 and it froze after 3 days.

Comment 51 julio.borreguero@gmail.com 2016-01-08 10:32:39 UTC

Created attachment 198961 [details]
lspci -v Hostbridge and vga adapter output

Comment 52 Christian Wansart 2016-01-08 11:12:11 UTC

I have the same problem with an Acer Aspire ES1-311 on Ubuntu. I am currently running 4.1.13 with the intel_idle.max_cstate=1 workaround. A fix would be much better!

Comment 53 Mika Kuoppala 2016-01-08 16:05:45 UTC

Another long shot to try is to see if:

'intel_reg write 0xa168 0x0'

has any effect on occurrence.

Comment 54 kernelorg 2016-01-09 01:09:03 UTC

(In reply to Mika Kuoppala from comment #53)
> Another long shot to try is to see if:
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.

I've had an issue with a Lenovo Yoga 2 where restarting GDM or switching to another vty would hang the system. This command fixed it and I haven't had a crash yet.

Comment 55 Bob George 2016-01-09 02:25:43 UTC

FYI. Here is another hang issue on Baytrail that is also fixed by limiting C states.

https://lkml.org/lkml/2015/3/24/271

As far as I can tell these issues have not made it in to the kernel at all.

Comment 56 Bob George 2016-01-09 02:39:23 UTC

These patches have not made it in to the kernel, I meant.

Comment 57 hartrumpf 2016-01-17 14:12:27 UTC

(In reply to Mika Kuoppala from comment #53)
> Another long shot to try is to see if:
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.

The command seems to be a correct work-around for GB-BXBT-1900. Thanks a lot!
Mika, can you explain what this command does? Any problematic consequences (for power management ...)?

Comment 58 julio.borreguero@gmail.com 2016-01-17 14:33:30 UTC

i will try the 
intel_reg write 0xa168 0x0
on an acer aspire ES1-711 now and will give feedback as soon as the system freezes or in a few days otherwise.

Comment 59 julio.borreguero@gmail.com 2016-01-17 18:07:44 UTC

btw i just tried kernel 4.4.0 (latest stable git) without any parameters and without intel_reg write 0xa168 0x0
The system froze after ~1h.
Now running the same kernel with intel_reg.
will report shortly....

Comment 60 Vincent Frentzel 2016-01-17 23:14:22 UTC

Affected by this bug as well on a Jetway JBC311U93 Celeron N2930 (Bay Trail). The system was running perfectly fine for 6 months as a router until repurposed as an HTPC. 

Hard freezes always occur when playing back video (h264 with vaapi) under Kodi. I am running kernel 4.3.3.

Will happily test any patch/solution.

Comment 61 Juha Sievi-Korte 2016-01-18 07:12:59 UTC

Tried intel_reg write 0xa168 0x0 on Acer B-115M (Pentium N3540) with kernel 4.3.3, hang happened within 20 mins after reboot, so I guess no change, occurrence is random.

Question: Should I be able to read 0x0 out from that same register? I mean:
cardhu:~ # intel_reg read 0xa168 
                                    (0x0000a168): 0x0000007a
cardhu:~ # intel_reg write 0xa168 0x0
cardhu:~ # intel_reg read 0xa168 
                                    (0x0000a168): 0x0000007a
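
The readback check in this comment can be automated. A minimal Python sketch, assuming the tool prints "(offset): value" pairs exactly as pasted above (other intel_reg versions may format their output differently); the function names are hypothetical.

```python
import re

# Matches lines such as "(0x0000a168): 0x0000007a" as pasted in the comment.
_READ_LINE = re.compile(r"\((0x[0-9a-fA-F]+)\):\s*(0x[0-9a-fA-F]+)")


def parse_intel_reg_read(output: str) -> dict:
    """Map register offset -> value for every "(offset): value" pair found."""
    return {int(reg, 16): int(val, 16) for reg, val in _READ_LINE.findall(output)}


def write_took_effect(read_after: str, register: int, expected: int) -> bool:
    """True if the value read back after the write equals what was written."""
    return parse_intel_reg_read(read_after).get(register) == expected


after = "                                    (0x0000a168): 0x0000007a"
print(write_took_effect(after, 0xA168, 0x0))
```

For the readback pasted in this comment the check reports False: the register still holds 0x7a after the write of 0x0.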

Comment 62 Alberto Salvia Novella 2016-01-18 08:17:26 UTC

Also reported in (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1511002).

Comment 63 Mika Kuoppala 2016-01-18 11:04:22 UTC

(In reply to Juha Sievi-Korte from comment #61)
> Tried intel_reg write 0xa168 0x0 on Acer B-115M (Pentium N3540) with kernel
> 4.3.3, hang happened within 20 mins after reboot, so I guess no change,
> occurrence is random.
> 
> Question: Should I be able to read 0x0 out from that same register? I mean:
> cardhu:~ # intel_reg read 0xa168 
>                                     (0x0000a168): 0x0000007a
> cardhu:~ # intel_reg write 0xa168 0x0
> cardhu:~ # intel_reg read 0xa168 
>                                     (0x0000a168): 0x0000007a

Yes, we should forget this crude hack as the register gets overwritten on boot
and also on normal operation when frequencies are changed.

I will submit a patch to try.

Comment 64 Mika Kuoppala 2016-01-18 11:09:40 UTC

Created attachment 200381 [details]
drm/i915/vlv: Always enable internal pm interrupts

Comment 65 julio.borreguero@gmail.com 2016-01-18 11:55:50 UTC

(In reply to Mika Kuoppala from comment #64)
> Created attachment 200381 [details]
> drm/i915/vlv: Always enable internal pm interrupts

concerning this:

> cardhu:~ # intel_reg write 0xa168 0x0
> cardhu:~ # intel_reg read 0xa168 

i can read and write that register, but it is constantly overwritten as mika says.
From my logic that means that that "workaround" can't work, although my system didn't freeze yet.
So i am now compiling kernel 4.4.0 with mikas patch applied (manually).
i will post that result soon.

Comment 66 julio.borreguero@gmail.com 2016-01-18 23:48:29 UTC

mika, i tried your patches on 4.4.0 kernel
The system hard-froze the same :(
back to kernel 4.1.12....

Comment 67 jbMacAZ 2016-01-19 07:50:21 UTC

 I tried the "drm/i915/vlv: Always enable internal pm interrupts" and it froze within 3 minutes on my T100CHI...

BUT, this did fix a bug of the CHI not remembering the backlight setting from the last session.  With this patch, the T100CHI powered up and dimmed to the last session level before launching the desktop.  Without this patch, the brightness slider would show reduced backlight, but it would not go into effect until the brightness was adjusted manually.

This patch is worth keeping at least for the CHI, even if it has no effect on the freeze problem, which it doesn't.

Comment 68 Pascal VITOUX 2016-01-19 17:58:24 UTC

After a bisect between 4.1 and 4.2-rc1, and running the kernel on a laptop with a N2930 CPU : 

Last commit without freeze (after running for 6 hours, I will retry for 24h or more to be sure) : 

 commit af2d94fddcf41e879908b35a8a5308fb94e989c5
 Author: Ingo Molnar <mingo@kernel.org>
 Date:   Thu Apr 23 17:34:20 2015 +0200

    x86/fpu: Use 'struct fpu' in fpu_reset_state()
    
    Migrate this function to pure 'struct fpu' usage.
    

The freezes happen (in less than one hour for each test) with the next commit  :

 commit cb8818b6acb45a4e0acc2308df216f36cc5b950c
 Author: Ingo Molnar <mingo@kernel.org>
 Date:   Thu Apr 23 17:39:04 2015 +0200

    x86/fpu: Use 'struct fpu' in switch_fpu_prepare()
    
    Migrate this function to pure 'struct fpu' usage.

Comment 69 Pascal VITOUX 2016-01-20 09:43:33 UTC

Sorry for the misinformation in my previous comment, but after retesting af2d94f I got a freeze after 15 hours.
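
A rough way to see why a 6-hour clean run can mislabel a commit as good. This Python sketch assumes, purely for illustration, that freezes arrive as a memoryless (exponential) process with a fixed mean time to freeze. Nothing in the thread establishes that model, and the 15-hour figure is a single observation used here as a stand-in for the mean.

```python
import math


def p_no_freeze(test_hours: float, mean_hours_to_freeze: float) -> float:
    """Chance of a clean run under an assumed exponential time-to-freeze."""
    return math.exp(-test_hours / mean_hours_to_freeze)


def hours_for_confidence(mean_hours_to_freeze: float, max_false_good: float) -> float:
    """Test length needed so a bad commit passes with probability <= max_false_good."""
    return -mean_hours_to_freeze * math.log(max_false_good)


# Illustrative numbers only: a 6 hour test against an assumed 15 hour mean.
print(round(p_no_freeze(6, 15), 2))
print(round(hours_for_confidence(15, 0.05), 1))
```

Under these assumptions a bad commit would survive a 6-hour test roughly two times out of three, so a bisect step needs a much longer run before a commit can be called good.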

Comment 70 jbMacAZ 2016-01-21 02:16:36 UTC

To clarify patch "drm/i915/vlv: Always enable internal pm interrupts" results.

Tested with 4.3.3 and 4.4.0 with hardware specific patches.  Backlight control is not yet available in the standard kernel for the ASUS T100 family.

Without this patch, my ASUS T100CHI always boots to full screen brightness.  With the patch, the backlight usually starts at the indicated setting.  

This patch does fix something for baytrail systems.  Thanks for the patch.

Comment 71 Jayant Sharma 2016-01-23 12:34:30 UTC

 (In reply to mazout360 from comment #38)
> There's something strange with this bug...on my Q1900DC-ITX I tried every
> single version of the mainline kernel from 3.16 to 4.3. It still hangs on
> 4.0, it hangs on 4.2, but the whole 4.1 kernel version from 4.1.0 to 4.1.15
> is very stable. No need for the cstate configuration or any patch published
> here or on the other thread.

Recently re-installed arch on my X205TA.

Haven't come across any freezes on kernel 4.3.3-3 with the cstate param. But linux-lts 4.1.15-1 doesn't let me stay beyond 2 minutes, with just a couple of tabs open in the browser and otherwise doing nothing.

Comment 72 Johannes 2016-01-26 16:33:32 UTC

Hi everybody. 

Since December 2015 I have been following this bug, because I had system freezes (mostly while streaming), too. I use an ACER ES1-311 (intel GPU inside;-) with an up to date 4.3.3-3-ARCH. Unfortunately the intel_idle.max_cstate=1 did not do the trick for me. 
In the Arch Wiki, I found an interesting hint that improved my situation tremendously. Before, I regularly had system freezes after five minutes of streaming. Sometimes the freezes occurred after one hour at most. With this hint, I have not had a freeze for a couple of days, streaming for hours! Possibly my system is even fixed completely with this?! I want to share this with you guys - perhaps it helps you find a solution or improvement too. 

If interested, you find information here in the arch wiki: 

        https://wiki.archlinux.org/index.php/Intel_graphics

Scroll down to the chapter: "X freeze/crash with intel driver". (Funnily, this bug is linked there at the bottom of the chapter with the intel_idle.max_cstate=1 workaround.)

Here is described how the GPU acceleration can be disabled. I also disabled the DRI option, because I do not play games on my machine. 

That did it - or improved a lot.

Probably on systems other than ARCH, there is a similar way to access and disable GPU acceleration.

Comment 73 Ayush Agrawal 2016-01-29 15:02:06 UTC

(In reply to Johannes from comment #72)
> Hi everybody. 
> 
> Since December 2015 I have been following this bug, because I had system
> freezes (mostly while streaming), too. I use an ACER ES1-311 (intel GPU
> inside;-) with an up to date 4.3.3-3-ARCH. Unfortunately the
> intel_idle.max_cstate=1 did not do the trick for me. 
> In Arch-Wiki, I found an interesting hint, that improved my situation
> tremendously. Before, I regularly had system freezes after five minutes
> streaming. Sometimes the freezes occurred after maximum one hour. With this
> hint, I have not had a freeze for a couple of days streaming for hours!
> Possibly, my system is even fixed completely with this?! I want to share
> this with you guys - probably this helps finding a solution or improvement
> for you too. 
> 
> If interested, you find information here in the arch wiki: 
> 
>         https://wiki.archlinux.org/index.php/Intel_graphics
> 
> Scroll down to the chapter: "X freeze/crash with intel driver". (Funnily,
> this bug is linked there at the bottom of the chapter with the
> intel_idle.max_cstate=1 workaround.)
> 
> Here is described how the GPU acceleration can be disabled. I also disabled
> the DRI option, because I do not play games on my machine. 
> 
> That did it - or improved a lot.
> 
> Probably on systems other than ARCH, there is a similar way to access and
> disable GPU acceleration.

Thank you so much for this.

I have a Dell Inspiron 3551 which has an Intel N3540 processor. I have been facing laptop freezing issues. This just fixed it.

I have Ubuntu Gnome 15.10 on it.

The steps I followed are these:

1. To boot into recovery mode, https://wiki.ubuntu.com/RecoveryMode (make sure to run the two mount commands)
2. To generate the config file for X (while in recovery mode), http://askubuntu.com/questions/4662/where-is-the-x-org-config-file-how-do-i-configure-x-there
3. Change the following lines in /etc/X11/xorg.conf (you can use nano):
3a. #Option "NoAccel" -> Option "NoAccel" "true"
3b. #Option "DRI" -> Option "DRI" "false"
4. Reboot and it's done :).
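
Step 3 above can be sketched as a small text transformation. This is a minimal Python sketch, assuming the generated xorg.conf contains commented `#Option "NoAccel"` and `#Option "DRI"` lines as the steps describe; the helper name disable_intel_accel and the sample Device section are hypothetical, and the function works on a string rather than on /etc/X11/xorg.conf itself.

```python
import re


def disable_intel_accel(xorg_conf: str) -> str:
    """Turn commented NoAccel/DRI option lines into explicit settings.

    Hypothetical helper: NoAccel is set to "true" and DRI to "false",
    matching steps 3a and 3b. Indentation is kept, trailing comments on
    the rewritten lines are dropped.
    """
    replacements = {"NoAccel": "true", "DRI": "false"}
    for option, value in replacements.items():
        pattern = re.compile(
            r'^([ \t]*)#?[ \t]*Option[ \t]+"' + option + r'".*$',
            re.MULTILINE,
        )
        xorg_conf = pattern.sub(
            lambda m, o=option, v=value: f'{m.group(1)}Option "{o}" "{v}"',
            xorg_conf,
        )
    return xorg_conf


# Hypothetical excerpt of a generated config, for illustration only.
sample = '''Section "Device"
        #Option     "NoAccel"               # [<bool>]
        #Option     "DRI"                   # [<bool>]
        Identifier  "Card0"
        Driver      "intel"
EndSection
'''
print(disable_intel_accel(sample))
```

As the following comments point out, this trades away hardware acceleration, so it is a workaround rather than a fix.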

Comment 74 julio.borreguero@gmail.com 2016-01-29 16:29:28 UTC

first of all, the intel_idle.max_cstate=1 solution didn't work for me either, but i said that earlier already.

to you guys disabling hardware acceleration with the info from arch-wiki.
why do you disable hardware acceleration if you can just install any 3.1 kernel (i use 3.1.12) and at the same time use hardware acceleration and dri, without any system freezes ?
it seems to me the much better solution.

Comment 75 julio.borreguero@gmail.com 2016-01-29 20:18:28 UTC

sorry of course i meant 4.1 kernel and not 3.1. i use 4.1.12

Comment 76 Travis Hall 2016-01-30 03:35:01 UTC

I have also been having this issue on my Lenovo 11e laptop with an Intel N2940 baytrail-m.  I am running Manjaro and have been having full system hangs (mouse stops moving, everything freezes, it doesn't even seem to dump any errors out in time) and application freezes (mostly vlc).  It seems to happen on battery or plugged in when running a video.  The only kernel that seems stable without limiting max_cstate to 1 seems to be 4.1.16-1.

Kernels that have given me issues:
4.4.0-4, 4.3.4-1, 4.2.8.2-1, 3.18.25-1

As a side note, hibernate seems to not work on most kernels, works on 4.4.0-4.  Not sure if it's related.

Comment 77 BzukTuk 2016-01-30 11:21:15 UTC

Hi, I'm also trying to bisect this issue.
The best way I have found to trigger the freeze almost instantly (on my Acer Switch 10 with Intel Atom Z3735F, on Ubuntu Gnome 15.10) is to run glxgears (from package mesa-utils) on one half of the screen, and a 42-minute, 415MB *x264*.mp4 video in VLC on the other half of the screen. The freeze usually occurred within 2-5 minutes. On a few occasions I had to wait 15-20 minutes.

When I was running only VLC, I had to wait many hours until the freeze occurred. Sometimes the freeze did not occur even after 8 hours, whereas with glxgears it was a matter of minutes.

Can someone confirm that this method works for you?

I can confirm that kernel 4.1 and 4.1.15 work without problem (I did not test 4.1.16 yet). Kernel 4.2-rc1 first introduced the issue. I'm currently bisecting between 4.1 and 4.2-rc1, but I'm not sure if I tested merges right. When I executed "git bisect [good|bad]" the same output was written on the terminal as "git bisect [good|bad]" one step back.

Note that I'm not testing pure vanilla kernels - I'm applying patches from Adrian Hunter of Intel from here https://github.com/hadess/rtl8723bs/tree/master/patches and small patches for keyboard and sound.

Comment 78 BzukTuk 2016-01-31 11:22:01 UTC

git bisect bad
cf5d8a46a001c9421c7397699db55f962e0410fc is the first bad commit

However reverting this small commit in v4.2-rc1 did not solve the issue. Bisected kernel from previous step (was git bisect good) is running glxgears and VLC without problem around 11 hours now

Comment 79 Travis Hall 2016-01-31 11:40:38 UTC

(In reply to BzukTuk from comment #78)
> git bisect bad
> cf5d8a46a001c9421c7397699db55f962e0410fc is the first bad commit
> 
> However reverting this small commit in v4.2-rc1 did not solve the issue.
> Bisected kernel from previous step (was git bisect good) is running glxgears
> and VLC without problem around 11 hours now

That could very well be connected to the problem.  My suspect was 099bfbfc7fbbe22356c02f0caf709ac32e1126ea given the amount of i915 changes that were merged into 4.2-rc1.

Comment 80 julio.borreguero@gmail.com 2016-01-31 12:51:56 UTC

is it confirmed that i915 is the problem ?
although it is the most obvious, i am just asking.

it is a kernel problem:
it is not xf86-video-intel, i tried all possible bridges there (sna, xaa and uxa)
i also tried the gallium driver with ilo-dri.
i tried different accel methods, buffers, module parameters at boot time for the i915 module like framebuffer, enable_rc6 power saving options, semaphores and pretty much all the options there is for that module.
i also used different versions of xf86-video-intel, compiled them all by myself.
the freezes still occurred.

to be able to do a meaningful bisect between kernel versions it is necessary to know which one is the last working kernel without the bug.
is it confirmed that all 4.1 kernel work and that 4.2-rc1 is the first faulty version ?

i will get the latest 4.1 stable kernel and test it over the next days.
if someone wants me to do further testing i am also available.
i am using gentoo linux and therefore compile everything on the machine.

Comment 81 jbMacAZ 2016-01-31 19:56:28 UTC

My ASUS T100-CHI has freeze problems with all version 4 kernels.  The history suggests that 3.16.7 was the last version freeze free (see also Freedesktop bug 88012.)  That said, freezes do occur more often on the CHI (a few minutes to an hour(s) vs. day(s)) starting with 4.2.  There definitely is an issue there (a new freeze or making the first one(s) worse)!

BTW the new DMA fix for 4.5 did not solve the CHI freeze problem when I attempted to back-port it to 4.4.  It froze within 2 minutes w/o cstate limit.  But 4.4.0 has numerous other hardware regressions relative to the CHI (stock kernel - no wifi, no touchscreen, flaky BT) so...

Comment 82 dertobi 2016-02-01 07:00:50 UTC

I think I can provide some insight into this bug, although not really a solution.

I have an Acer V3-111P featuring an N3530 processor. I got this machine in July 2014 when it was just released on the German market, because it was the first fanless laptop available.

First thing I did on it was to install Fedora and those random freezes started to appear. It drove me nuts, as my system ran no more than a couple of minutes at a time and never longer than 20 minutes. I searched for "linux random freezes" on google and found this phoronix thread where a guy had random freezes similar to mine, but nobody else in the linux kernel mailing list could reproduce it at the time. In the thread Linus Torvalds himself provided a patch he made on a hunch for the guy to test. So I applied the same patch to my 3.18 kernel and to my own surprise the crashes/freezes became a lot more infrequent. Since then my laptop runs usually at least a couple of hours and occasionally can run even a couple of weeks depending on the usage pattern I suppose. I can't find the patch in the linux kernel mailing list thread anymore. Fortunately I saved a copy locally.

Here's the phoronix thread: https://www.phoronix.com/scan.php?page=news_item&px=MTg1MDc

Unfortunately Linus's patch can't be applied to newer kernels as the particular code was changed quite a bit or even rewritten. But I think it still might give a hint how the problem could be solved or mitigated. If I understand Linus's patch correctly (and I've only a superficial understanding of it) it's a hack (Linus's own words) that corrects goofy jumps that can happen between "timekeeping" cycles.

Here's Linus's patch that I applied to the 3.18 kernel.

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 95640dc..7b14fd3 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
 	cycle_t			(*read)(struct clocksource *cs);
 	cycle_t			mask;
 	cycle_t			cycle_last;
+	cycle_t			cycle_error;
 	u32			mult;
 	u32			shift;
 	u64			xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ec1791f..1e2722f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr.read = clock->read;
 	tk->tkr.mask = clock->mask;
 	tk->tkr.cycle_last = tk->tkr.read(clock);
+	tk->tkr.cycle_error = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -197,11 +198,17 @@ static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
 	s64 nsec;
 
 	/* read clocksource: */
-	cycle_now = tkr->read(tkr->clock);
+	cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;
 
 	/* calculate the delta since the last update_wall_time: */
 	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
 
+	/* Hmm? This is really not good, we're too close to overflowing */
+	if (unlikely(delta > (tkr->mask >> 3))) {
+		tkr->cycle_error = delta;
+		delta = 0;
+	}
+
 	nsec = delta * tkr->mult + tkr->xtime_nsec;
 	nsec >>= tkr->shift;
 
@@ -455,6 +462,16 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 	update_fast_timekeeper(tk);
 }
 
+static void check_cycle_error(struct tk_read_base *tkr)
+{
+	cycle_t error = tkr->cycle_error;
+
+	if (unlikely(error)) {
+		tkr->cycle_error = 0;
+		pr_err("Clocksource %s had cycles off by %llu\n", tkr->clock->name, error);
+	}
+}
+
 /**
  * timekeeping_forward_now - update clock to the current time
  *
@@ -471,6 +488,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	cycle_now = tk->tkr.read(clock);
 	delta = clocksource_delta(cycle_now, tk->tkr.cycle_last, tk->tkr.mask);
 	tk->tkr.cycle_last = cycle_now;
+	check_cycle_error(&tk->tkr);
 
 	tk->tkr.xtime_nsec += delta * tk->tkr.mult;
 
@@ -1181,6 +1199,7 @@ static void timekeeping_resume(void)
 
 	/* Re-base the last cycle value */
 	tk->tkr.cycle_last = cycle_now;
+	tk->tkr.cycle_error = 0;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
@@ -1528,11 +1547,15 @@ void update_wall_time(void)
 	if (unlikely(timekeeping_suspended))
 		goto out;
 
+	check_cycle_error(&real_tk->tkr);
+
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
 	offset = real_tk->cycle_interval;
 #else
 	offset = clocksource_delta(tk->tkr.read(tk->tkr.clock),
 				   tk->tkr.cycle_last, tk->tkr.mask);
+	if (unlikely(offset > (tk->tkr.mask >> 3)))
+		pr_err("Cutting it too close for %s in in update_wall_time (offset = %llu)\n", tk->tkr.clock->name, offset);
 #endif
 
 	/* Check if there's really nothing to do */
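
The core of the patch above is the `delta > (mask >> 3)` test. A minimal Python model of that check, assuming `clocksource_delta` is the plain masked subtraction `(now - last) & mask`. This is a simplification: the patch also feeds the recorded `cycle_error` back into later reads, which the sketch does not model, and the name guarded_delta is hypothetical.

```python
def clocksource_delta(now: int, last: int, mask: int) -> int:
    """Masked subtraction, so a counter that wrapped still gives a small delta."""
    return (now - last) & mask


def guarded_delta(now: int, last: int, mask: int):
    """Model of the check the patch adds to timekeeping_get_ns().

    Returns (delta, cycle_error): a delta above mask >> 3 is treated as a
    bogus jump, recorded as the error, and replaced by 0.
    """
    delta = clocksource_delta(now, last, mask)
    if delta > (mask >> 3):
        return 0, delta
    return delta, 0


MASK32 = 0xFFFFFFFF
print(guarded_delta(5, 0xFFFFFFFB, MASK32))   # counter wrapped, delta stays small
print(guarded_delta(0x30000000, 0, MASK32))   # jump larger than mask >> 3
```

The threshold of one eighth of the counter range is what the patch's own comment calls being "too close to overflowing": an ordinary wrap still yields a small delta and passes, while a sudden large jump is zeroed and logged.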

Comment 83 Johannes 2016-02-01 11:35:20 UTC

(In reply to julio.borreguero@gmail.com from comment #75)
> sorry of course i meant 4.1 kernel and not 3.1. i use 4.1.12

You are right, Julio. Downgrading the kernel works without disabling hardware acceleration. I managed to downgrade to 4.1.6-1 and have not had a freeze yet. Before, I was not able to downgrade the kernel - I did it wrong, because I am new to this. Anyway, a lot of people have reported both kernel versions that freeze and kernel versions that work fine. I can add that my ACER Aspire ES1-311 seems to work with kernel 4.1.6-1.

Comment 84 Dmitry 2016-02-01 20:44:46 UTC

Tested with latest 4.5-rc2 kernel. Got hard lockup after one hour. Neither in browser nor in video player. I was emerging linux-firmware while was looking through linux kernel nconfig.
But this time I added console=/dev/ttyUSB0,115200 and got some useful (maybe) information.

1) Right after boot I ended up with refined-jiffies:

clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource:'refined-jiffies' wd_now: fffb77c9 wd_last: fffb75d5 mask: ffffffff
clocksource:'tsc' cs_now: 2d666de2c cs_last: 29a343f3d mask: ffffffffffffffff
clocksource: Switched to clocksource refined-jiffies

And got these lockups:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Modules linked in: aesni_intel xts aes_i586 lrw ablk_helper cryptd pcspkr mac_hid snd_intel_sst_acpi crc32c_intel ath6kl_sdio
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.5.0-rc2-dirty #322
Hardware name: Dell Inc. Venue 11 Pro 5130/05FF9P, BIOS A15 01/20/2016
 00000000 c12a9b9f 00000000 c1118b3d c1be49d4 00000000 edc0b400 c1118a00
 c112e367 00000003 96ac4d66 fffffffc 00000000 c1ce1dc0 00000000 f77afae0
 00000001 c1ce1efc c112ec02 c1ce1f5c c10622b6 00000000 00000000 22f3ec2c
Call Trace:
 [<c12a9b9f>] ? dump_stack+0x48/0x79
 [<c1118b3d>] ? watchdog_overflow_callback+0x13d/0x150
 [<c1118a00>] ? watchdog_enable_all_cpus+0xb0/0xb0
 [<c112e367>] ? __perf_event_overflow+0xb7/0x280
 [<c112ec02>] ? perf_event_overflow+0x12/0x20
 [<c10622b6>] ? intel_pmu_handle_irq+0x1e6/0x3e0
 [<c10b864f>] ? enqueue_entity+0x2ff/0xe80
 [<c10b9214>] ? enqueue_task_fair+0x44/0xd40
 [<c10b361e>] ? select_task_rq_fair+0x44e/0x850
 [<c1097399>] ? __send_signal+0x189/0x310
 [<c10a5c97>] ? raw_notifier_call_chain+0x17/0x20
 [<c10ec7bb>] ? timekeeping_update+0x11b/0x1b0
 [<c1915c5f>] ? _raw_write_unlock_irqrestore+0xf/0x30
 [<c10eead3>] ? update_wall_time+0x303/0xb70
 [<c1915c5f>] ? _raw_write_unlock_irqrestore+0xf/0x30
 [<c112a89e>] ? perf_event_task_tick+0x4e/0x2a0
 [<c1059696>] ? perf_event_nmi_handler+0x26/0x40
 [<c1049ec4>] ? nmi_handle+0x44/0xa0
 [<c15c9252>] ? poll_idle+0x32/0x70
 [<c104a443>] ? default_do_nmi+0x53/0x230
 [<c104a6bf>] ? do_nmi+0x9f/0xd0
 [<c1916ea7>] ? nmi_stack_correct+0x2f/0x34
 [<c10e00d8>] ? rcu_sync_func+0x38/0x90
 [<c15c9252>] ? poll_idle+0x32/0x70
 [<c15c8ce4>] ? cpuidle_enter_state+0x134/0x270
 [<c10c474c>] ? cpu_startup_entry+0x1ac/0x250
 [<c15626cd>] ? usb_find_interface+0x2d/0x50
 [<c1d57a92>] ? start_kernel+0x39d/0x3a4
perf interrupt took too long (3896 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
clocksource: Switched to clocksource tsc


I had to switch to tsc manually in order to use the tablet at all.
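
Checking and switching the clocksource by hand goes through sysfs. A minimal Python sketch, assuming the usual Linux location /sys/devices/system/clocksource/clocksource0; the function names are hypothetical, and the directory is a parameter so the helpers can be exercised against a scratch directory instead of the live system.

```python
from pathlib import Path

# Usual sysfs location on Linux; kept as a default argument so the helpers
# can also be pointed at a test directory.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names from sysfs-style files."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def select_clocksource(name: str, base: Path = CLOCKSOURCE_DIR) -> None:
    """Request a clocksource by writing its name (needs root on a real system)."""
    _, available = read_clocksources(base)
    if name not in available:
        raise ValueError(f"{name!r} is not in {available}")
    (base / "current_clocksource").write_text(name)
```

On a real machine, calling read_clocksources() with no arguments would show whether the kernel has fallen back to refined-jiffies, as in the log above. A manual switch does not persist across reboots.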

2)Another bug:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 3158 at drivers/base/power/common.c:150 dev_pm_domain_set+0x54/0x60()
PM domains can only be changed for unbound devices
Modules linked in: aesni_intel xts aes_i586 lrw ablk_helper cryptd pcspkr mac_hid snd_intel_sst_acpi crc32c_intel ath6kl_sdio(-)
CPU: 2 PID: 3158 Comm: rmmod Tainted: G        W       4.5.0-rc2-dirty #322
Hardware name: Dell Inc. Venue 11 Pro 5130/05FF9P, BIOS A15 01/20/2016
 00000009 c12a9b9f ecd8dec8 c108d662 c1c4adf4 ecd8dee0 00000c56 c1c17ffd
 00000096 c14b1b54 c14b1b54 d0008604 00000000 00000000 bfc7cd88 c108d6c3
 00000009 ecd8dec8 c1c4adf4 ecd8dee0 c14b1b54 c1c17ffd 00000096 c1c4adf4
Call Trace:
 [<c12a9b9f>] ? dump_stack+0x48/0x79
 [<c108d662>] ? warn_slowpath_common+0x82/0xb0
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c108d6c3>] ? warn_slowpath_fmt+0x33/0x40
 [<c14b1b54>] ? dev_pm_domain_set+0x54/0x60
 [<c132b612>] ? acpi_dev_pm_detach+0x2d/0x6b
 [<c14b1a86>] ? dev_pm_domain_detach+0x16/0x20
 [<c15d6523>] ? sdio_bus_remove+0x83/0xf0
 [<c14a9ef8>] ? __device_release_driver+0x78/0x120
 [<c14aa67f>] ? driver_detach+0x8f/0xa0
 [<c14a9a68>] ? bus_remove_driver+0x38/0x90
 [<c10fdd78>] ? SyS_delete_module+0x158/0x220
 [<c11b163d>] ? mntput_no_expire+0xd/0x180
 [<c10a3a74>] ? task_work_run+0x74/0x90
 [<c100100b>] ? exit_to_usermode_loop+0x8b/0xc0
 [<c1001500>] ? do_fast_syscall_32+0x80/0x130
 [<c19161f8>] ? sysenter_past_esp+0x3d/0x5d
---[ end trace 969e6d42685aab80 ]---

I believe it's connected with ath6kl. I added my sdio card id to ath6kl_sdio.c. I'll try without any custom patches.

3)Last line before complete hard lockup was:


perf interrupt took too long (5007 > 5000), lowering kernel.perf_event_max_sample_rate to 25000


Before it there were only cfg80211's regulatory domain changes, IPv6 link not ready and ath6kl's stuff.
Also I wasn't able to reboot using sysrq at all.


Full log is here: http://pastebin.ca/3363040

Comment 85 BzukTuk 2016-02-01 22:06:32 UTC

Created attachment 202701 [details]
Kernel bisection between v4.2 v4.1 for sudden freezes

Hi, small update.

My first bisect was from 4.1 to 4.2-rc1, and commit [cf5d8a46a001c9421c7397699db55f962e0410fc] was flagged as the first bad commit. But I was not so sure that I did the bisection properly.

So today I did a second bisection - git bisect start v4.2 v4.1. The bisection process went without problem/confusion/doubt (unlike my first attempt). The last git bisect result was good on commit cf5d8a46a001c9421c7397699db55f962e0410fc (after 90 minutes of glxgears and vlc). Git reported that the first bad commit was:

[8fb55197e64d5988ec57b54e973daeea72c3f2ff] drm/i915: Agressive downclocking on Baytrail

then from commit cf5d8a46.. I cherry-picked 8fb55197 and this kernel froze after 3 minutes.

More cherry-picking/testing tomorrow. Sorry if my previous post caused confusion/unnecessary work.
Today's 'git bisect log' is in the attachment
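
The search that git bisect performs can be sketched in a few lines. This Python sketch assumes a linear history in which the first commit is known good, the last is known bad, and every commit after the first bad one is also bad. The commit names and the is_bad test are hypothetical stand-ins for "build, boot, run glxgears and VLC, wait for a freeze".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad (a linear history).
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]


# Hypothetical linear history; "c5" stands in for the offending commit.
history = [f"c{i}" for i in range(10)]
print(first_bad(history, lambda c: int(c[1:]) >= 5))
```

The search is only as reliable as the test: one run that is stopped before an intermittent freeze shows up marks a bad commit as good and sends the remaining steps into the wrong half of the history.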

Comment 86 ceric 2016-02-02 19:28:14 UTC

Hello everybody, I use the 15.10 (x64) version, and the only way I have found to use my laptop, an Asus X751MJ-TY005H powered by an N3540, is to pass a kernel boot parameter.
https://wiki.ubuntu.com/Kernel/KernelBootParameters
I have been using my laptop with the stock kernel 4.2.0.27-generic for two days without a freeze. I mostly listen to music, browse the web and read my mail.

Comment 87 Dmitry 2016-02-07 18:58:49 UTC

Latest git kernel works for me without freezes. Tested for about a week and very hard workflows (glxgears,youtube in firefox, mpv with 1080p and kernel compiling in 4 threads at the same time). There is only one flaw: I had to add my wifi(ath6kl_sdio with custom patch adding new ID) to blocklist. Modprobing it leads to freeze in minutes.

Comment 88 Travis Hall 2016-02-08 03:17:35 UTC

(In reply to Dmitry from comment #87)
> Latest git kernel works for me without freezes. Tested for about a week and
> very hard workflows (glxgears,youtube in firefox, mpv with 1080p and kernel
> compiling in 4 threads at the same time). There is only one flaw: I had to
> add my wifi(ath6kl_sdio with custom patch adding new ID) to blocklist.
> Modprobing it leads to freeze in minutes.

What commit is working fine for you?  I'm very curious because 4.5-rc2 exhibited the issue for me and it would help in bisecting.  

Also I just compiled 4.5-rc3 and I'm testing the stability on the Celeron N2940.

Comment 89 julio.borreguero@gmail.com 2016-02-08 16:12:31 UTC

$ uname -a
Linux shiva 4.5.0-rc1 #18 SMP Mon Feb 8 10:09:09 ART 2016 x86_64 Intel(R) Celeron(R) CPU N2940 @ 1.83GHz GenuineIntel GNU/Linux

i am running latest stable kernel 4.5.0-rc1 on N2940 for a few hours.
I did some stress-testing running parallel vlc and glxgears plus did loads of other stuff at the same time.
May it really be that the bug is finally fixed ?
i will give feedback as soon as my system freezes or in a few days otherwise.

Comment 90 julio.borreguero@gmail.com 2016-02-08 16:27:44 UTC

short fun, it froze :(

Comment 91 BzukTuk 2016-02-08 18:46:24 UTC

:-)
Today I tested linux-v4.5-rc[1-3] on Acer Aspire Switch 10 - freezes occurred on all of them.

From my bisect last good commit seems to be [cf5d8a46a001c9421c7397699db55f962e0410fc] - glxgears and VLC was running for 18hours 20minutes without problem (then I got bored and powered it off). 

Commit [8fb55197e64d5988ec57b54e973daeea72c3f2ff] introduced the problem - vlc&glxgears froze laptop in 3 minutes.

Unfortunately my git&c skills are not good enough to revert this [8fb5519..] commit in whole releases (like in 4.2-rc1 or 4.2) because of additional changes. The biggest problem with "git revert 8fb5519.." is in the file "drivers/gpu/drm/i915/intel_pm.c" - there are over 30 commits (some of them merges) changing this file between 8fb5519 and the 4.2-rc1 or 4.2 kernel. Could someone look into that? 
Thanks

Comment 92 Dmitry 2016-02-09 17:14:18 UTC

4.5.0-rc3: 4 hours of films, glxgears and browsing till batteries are dead. Without any hint of freeze. For me 4.5.0-rc2 and higher are much more stable than any other and even 4.1.y branch. Recently I got several freezes on 4.1.17 kernel and then switched to latest git.
In cmdline I have only this: tsc=reliable clocksource=tsc. And as for patches I have fix for asoc channels, ath6kl enable patch and soc_button_array patch. Nothing special related to i915 or cpu(cstate). Also, as I mentioned before, I blacklisted my wifi (I use usb wifi stick and usb ethernet).
I have an idea that these freezes might be connected either with clock or power instability. For the baytrail platform we do not have a reliable hpet, and tsc also seems unstable. As for power, I observe freezes when there are changes in gpu or cpu states. When the tablet works on a task it works perfectly, but when this task ends there is a non-zero possibility of a freeze. The same happens when we decide to do anything after a pause on the tablet. To me it looks like there is not enough voltage during frequency changes, like what we can see during undervoltage. It is possible, because we see hard lockups, but it is just a guess. I do not know whether the baytrail platform offers any way to tune voltage through a software api. Windows, after all, works stably regardless of workload.
P.S. Or these freezes might also be connected with mmc. Wifi is connected through it and bluetooth does not work for me at all. Only internal storage and external mmc card.

Comment 93 jbMacAZ 2016-02-10 02:33:44 UTC

(In reply to BzukTuk from comment #91)
> :-)
> Today I tested on Acer Aspire Switch 10 linux-v4.5-rc[1-3] - freezes occurred
> on all of them.
> 
> From my bisect last good commit seems to be
> [cf5d8a46a001c9421c7397699db55f962e0410fc] - glxgears and VLC was running
> for 18hours 20minutes without problem (then I got bored and powered it off). 
> 
> Commit [8fb55197e64d5988ec57b54e973daeea72c3f2ff] introduced the problem -
> vlc&glxgears froze laptop in 3 minutes.
> 
> Unfortunately my git&c skills are not good enough to revert this [8fb5519..]
> commit in whole releases (like in 4.2-rc1 or 4.2) because of additional
> changes. Biggest problem with "git revert 8fb5519.." is in file
> "drivers/gpu/drm/i915/intel_pm.c" - there are over 30 commits (some of them
> merges) changing this file between 8fb5519 and 4.2-rc1 or 4.2 kernel. Could
> someone look into that? 
> Thanks

The legacy-turbo patch does a fine job of disabling this commit. (https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/patches/4.0/linux-999-i915-use-legacy-turbo.patch) - edit as needed

Since 4.2.6, my ASUS T100-CHI usually freezes within 5 minutes without max_cstate=1.  Because of your bisect, I tried the legacy-turbo patch on 4.5-rc3.  Before patching, my CHI ran 4.5-rc3 29 minutes before freezing (no cstate.), better than 5, but...  After patching, I haven't had a freeze in over 5 hours so far (no cstate argument).  So, there must be at least 2 freeze bugs!  LPSS, aggressive down-clocking and another one still lurking somewhere around wifi/mmc.  Not to mention the GPU hang previously fixed in 4.2.6.

Usual disclaimers, YMMV.  My kernels have a few T100 hardware specific patches.  Still a bit early to declare success, but this is promising.

Comment 94 jbMacAZ 2016-02-10 05:07:18 UTC

Eating crow already.  My second 4.5 test without cstate froze after 5 hours.  5 minutes to 29 minutes to 5 hours is a huge improvement, but it is not the whole solution.  There is still another one out there.

(ASUS T100-CHI, kernel-4.5-rc3 + Legacy-turbo patch + T100 specific patches, 
intel_idle.max_cstate=1 does not freeze)

Comment 95 julio.borreguero@gmail.com 2016-02-10 09:04:17 UTC

I tried 4.5.0-rc3 on an N2940 (Acer ES1-711).
With tsc=reliable clocksource=tsc on the command line, it froze after a few hours.
I could try the turbo patch.
intel_idle.max_cstate=1 never worked for me in any kernel that freezes.

Comment 96 Henry Groover 2016-02-11 01:39:01 UTC

I've run several kernel versions on a Jetway JBC311U93 Celeron N2930 (Bay Trail). In all cases I've had intermittent lockups after anything from 1 hour of runtime up to 2 weeks. I mostly run with both HDMI ports connected but little or no video acceleration in use.

Kernels I've used include:
3.19 (built by Yocto poky)
4.1.13
4.2.6
4.3.3
4.4.0

Currently I'm running 4.4.1.

I've observed intermittent lockups on the abovementioned hardware on ALL of these kernels. I see no activity on USB other than the clock pulse, which I interpret (perhaps incorrectly) as no signs of life from the SoC.

I've never gotten any useful core dumps or kernel panics when the lockup occurs - the system becomes completely unresponsive.

Building mplayer2 and playing h.264 high profile videos continuously, I seem to get lockups far more consistently, usually within no more than 24 hours. Previous tests I've run to try to induce failure have all been fruitless. One of the symptoms of the lockup has included very high junction temperatures (up to 98C; the rated maximum junction temp is 110C) when it is in the hung state, and in some cases reboot (via a Fintek chipset watchdog) does not clear the hung state.  My previous efforts focused on stressing CPU load and the SSD disk device. However, exercising the GPU seems to yield higher failure rates.
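
The junction temperatures mentioned above can be logged while the machine is still responsive. The following is a minimal Python sketch, assuming the kernel exposes the usual `/sys/class/thermal/thermal_zone*/` entries with `type` and `temp` files (temperatures in millidegrees Celsius). The function name `read_thermal_zones` is made up for illustration, and once the system is in a hard lockup no script like this can run.

```python
from pathlib import Path


def read_thermal_zones(root: Path = Path("/sys/class/thermal")) -> dict:
    """Return {"zone name (type)": temperature in degrees Celsius}."""
    readings = {}
    for zone in sorted(root.glob("thermal_zone[0-9]*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            # The kernel reports millidegrees Celsius as a plain integer.
            millidegrees = int((zone / "temp").read_text())
        except (OSError, ValueError):
            # Skip zones that cannot be read on this machine.
            continue
        readings[f"{zone.name} ({zone_type})"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for label, celsius in read_thermal_zones().items():
        print(f"{label}: {celsius:.1f} C")
```

Running this periodically (for example from cron) and appending the output to a file would leave a temperature trail leading up to a hang.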

Running with intel_idle.max_cstate=1, I have gotten no lockups so far. It's too early to declare a victory but this is definitely promising.
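
For readers who want to make this parameter persistent on Ubuntu, here is a hedged Python sketch. It assumes a GRUB 2 setup where kernel arguments live in the `GRUB_CMDLINE_LINUX_DEFAULT` line of `/etc/default/grub`; the helper name `add_kernel_param` is hypothetical. The function only transforms text and does not touch the real file.

```python
import re

# Matches a line such as: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
_CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")[ \t]*$', re.MULTILINE
)


def add_kernel_param(grub_text: str, param: str) -> str:
    """Return grub_text with `param` set in GRUB_CMDLINE_LINUX_DEFAULT."""
    key = param.split("=", 1)[0]

    def replace_line(match):
        current = match.group(2).split()
        # Drop any older value of the same parameter, then append the new one.
        kept = [p for p in current if p.split("=", 1)[0] != key]
        return match.group(1) + " ".join(kept + [param]) + match.group(3)

    new_text, count = _CMDLINE_RE.subn(replace_line, grub_text)
    if count == 0:
        # No such line yet: append one at the end of the file.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f'GRUB_CMDLINE_LINUX_DEFAULT="{param}"\n'
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "intel_idle.max_cstate=1"))
```

After writing the changed text back to `/etc/default/grub` (as root, keeping a backup), `sudo update-grub` and a reboot are needed before the new argument takes effect.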

Comment 97 jbMacAZ 2016-02-12 19:45:02 UTC

I've added Dnitry's tsc arguments to my custom kernel (4.5-rc3(w/LPSS) + legacyturbo & t100 patches). Best test run yet without cstate: > 21 hours and counting.  May it keep running after this post.

They may be interrelated, but there are still more freeze bugs.  Since max_cstate=n doesn't avoid all freezes, at least one is outside of power-saving.  Not all Atom platforms are affected by each possible freeze.  The T100-CHI is sensitive to several, but cstate has been a reliable workaround.

I can try other patches or kernel arguments, if they are posted here.  4.3.5 runs quite well, but freezes readily if I omit cstate.  Now that the LPSS updates have been included in 4.5, typical freezing takes several times longer, making it less suitable for rapid testing.  4.4.x does not work well for me, hardware regressed - no wifi (w/o patching) or touchscreen.

Comment 98 BzukTuk 2016-02-13 16:09:15 UTC

(In reply to jbMacAZ from comment #93)
> (In reply to BzukTuk from comment #91)
> The legacy-turbo patch does a fine job of disabling this commit.
> (https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/
> patches/4.0/linux-999-i915-use-legacy-turbo.patch) - edit as needed

Thank you jbMacAZ, with this patch I had no freeze during 24+ hours of running glxgears, VLC, and YouTube in Firefox. Kernel 4.4.1 with the mmc and pm-qos patches from (https://github.com/hadess/rtl8723bs/tree/master/patches), and linux-999-i915-use-legacy-turbo.patch plus a small change in the snd drivers. No kernel parameters. During those 24 hours the tablet went into hibernation a few times (low battery), and after resume glxgears and VLC still worked. The wifi module needs a reload after resuming from hibernation; YouTube started playing after F5 :)

Comment 99 jbMacAZ 2016-02-17 04:41:22 UTC

No freeze observed (47hrs) with tsc arguments, but my bluetooth inactivity timeouts became erratic.  On the T100-CHI, the keyboard is linked via bluetooth, so unreliable timeouts affect usability.  I won't be using the tsc arguments as an alternate workaround to max_cstate.  YMMV.

Comment 100 John A. 2016-02-17 14:37:48 UTC

I may be running into this bug as well using a Celeron N3150 (Braswell).

I've tried:
* Ubuntu Server 15.10 with generic 4.2.0-* kernels
* Arch with 4.4.1-* kernels (console only, no X)

Both setups caused similar halts and spontaneous reboots, almost always without any logs generated except to the screen. I saw watchdog errors about stalled cores and some other errors that I can't recall offhand (but may have written down at home, will check tonight).

So far, Arch with lts kernel 4.1.(17?) seems to be running better, although not without an occasional issue. I'm trying intel_idle.max_cstate=2 right now and can report back. Will be curious to see if it helps, as C2 isn't explicitly stated as a c-state for the N3150 (only C0, C1, C6, and C7 states). I'll try max_cstate=1 after this trial as well.

My thanks to everyone tracking and reporting on this issue. It's been super informative and helpful as I've been trying to figure out what's happening with this box.

Comment 101 Daniel Glöckner 2016-02-19 13:41:09 UTC

I'm seeing these freezes on a Z3745.

While reading the comments I get the feeling that we are mixing up two problems.
BayTrail-T in current kernels only has one real clocksource - the tsc. By default the kernel compares this clocksource to the refined-jiffies clocksource. But as refined-jiffies is unreliable (at least on non-rt kernels), the kernel often gets the impression that it can't rely on the tsc. When this happens the kernel switches to the refined-jiffies clocksource and starts to become sluggish. After a short time "sleep 1" will take forever, and you are lucky if you have an open root shell where you can set the clocksource back to tsc. The official fix in Intel's Android kernel is to set the tsc as reliable.

It is definitely a bug that refined-jiffies results in this behaviour, but it is not related to the freezes we see on BayTrail.
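
To see which clocksource the kernel is currently using, the standard sysfs files can be read. This is a small Python sketch under the assumption that `/sys/devices/system/clocksource/clocksource0/` exists on the machine; the function names are illustrative only.

```python
from pathlib import Path

CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = CLOCKSOURCE_DIR) -> dict:
    """Return the current clocksource and the list the kernel offers."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return {"current": current, "available": available}


def using_tsc(info: dict) -> bool:
    """True when the kernel is currently running on the tsc clocksource."""
    return info["current"] == "tsc"


if __name__ == "__main__":
    try:
        info = read_clocksource()
        print(info, "tsc in use:", using_tsc(info))
    except OSError as err:
        print("could not read clocksource files:", err)
```

If the kernel has fallen back to refined-jiffies, a root shell can switch back by writing `tsc` into `current_clocksource`, which is the manual recovery described in the comment above.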

Comment 102 jbMacAZ 2016-02-20 07:09:56 UTC

Thank you for the clarification on tsc.  I have seen that sluggishness twice where the screen refreshes once every 20-30 seconds. 4.5rcx or 4.4.x needs to run overnight to get that bad.  

So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}.  Then nothing bad should happen (no excessive latency, no freezes)?
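
Whether these two arguments actually reached the running kernel can be checked from `/proc/cmdline`. The sketch below is a simple illustration in Python; `parse_cmdline` and `check_workarounds` are hypothetical helper names, and the parser splits on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value or None}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def check_workarounds(cmdline: str) -> dict:
    """Report the two workaround arguments discussed in this thread."""
    params = parse_cmdline(cmdline)
    return {
        "max_cstate": params.get("intel_idle.max_cstate"),
        "tsc_reliable": params.get("tsc") == "reliable",
    }


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            print(check_workarounds(handle.read()))
    except OSError as err:
        print("could not read /proc/cmdline:", err)
```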

Comment 103 Daniel Glöckner 2016-02-24 13:42:32 UTC

(In reply to jbMacAZ from comment #102)
> So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. 
> Then nothing bad should happen (no excessive latency, no freezes)?

You should also apply the patches mentioned in comment 55.

Comment 104 Vladimir Jicha 2016-02-24 14:08:14 UTC

Does Intel really completely ignore this issue? It was introduced in 3.16 and is still not fixed in the 4.5 kernel. Yes, there is a workaround. But no real solution.

I doubt it will ever get fixed. Only a few people are trying to identify the issue in their free time. It would be awesome if they could find a permanent fix. But shouldn't Intel have done this a long time ago?

My computer freezes from time to time (about twice per week) even with the 3.13 kernel. So staying with the old kernel isn't the ideal solution either.

Comment 105 Joe Burmeister 2016-02-24 14:28:26 UTC

To be clear, the issue isn't in 3.16. I've apt-pinned to 3.16.0-4 and never had the freeze issue again. 
3.16.7 is meant to be the last freeze free version noted.
Which 3.16 do you mean?
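
Since the exact kernel version matters here, a quick comparison against the 3.16.7 baseline named above can be scripted. This is only a sketch in Python with made-up helper names. Note the assumption that the release string reflects the upstream version: distribution strings such as Debian's `3.16.0-4` are package ABI names and may not match the upstream point release.

```python
import platform
import re


def kernel_version_tuple(release: str) -> tuple:
    """Turn '4.2.0-30-generic' into (4, 2, 0); a missing patch level is 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def newer_than(release: str, baseline: tuple = (3, 16, 7)) -> bool:
    """True when `release` is strictly newer than the baseline version."""
    return kernel_version_tuple(release) > baseline


if __name__ == "__main__":
    running = platform.release()
    try:
        print(running, "newer than 3.16.7:", newer_than(running))
    except ValueError as err:
        print(err)
```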

But yes, it's been very quiet from Intel on this thread. As I understand it, Adrian Hunter is from Intel and has done some patches on this: https://lkml.org/lkml/2015/3/24/271 (as mentioned in comment 55). These don't seem to have been merged for nearly a year now, though.

Comment 106 jbMacAZ 2016-02-24 17:48:16 UTC

(In reply to Daniel Glöckner from comment #103)
> (In reply to jbMacAZ from comment #102)
> > So my kernel args should be tsc=reliable and intel_idle.max_cstate={1,0}. 
> > Then nothing bad should happen (no excessive latency, no freezes)?
> 
> You should also apply the patches mentioned in comment 55.

I have them in 4.3.5 and that is my best running recent kernel[EOL: way too soon].  I thought that the LPSS enhancements in 4.5 meant they were no longer needed there.

Appreciate the guidance.

Comment 107 Michal Feix 2016-02-24 20:21:27 UTC

(In reply to Joe Burmeister from comment #105)
> But yes, it's been very quiet from Intel on this thread, but as I understand
> it,  Adrian Hunter is from Intel and has done some patches on this:
> https://lkml.org/lkml/2015/3/24/271  (as mention in comment 55). Though
> these don't see to have been merged for nearly a year now.

I agree. In general, this seems to be a stability issue relevant to any Baytrail based machine. That is why I believe there have to be thousands of users fighting with this bug on different Linux distros, probably unaware of this bug report. Would it help if somebody competent raised the importance of this bug here in Bugzilla? I don't feel that importance "P1 Normal" is correct, if this bug leads to certain freezes within tens of minutes on Baytrail machines. Also, status "NEW" is misleading, as this bug is obviously CONFIRMED.

Comment 108 Casey 2016-02-24 21:34:41 UTC

Just made an account here to confirm this Baytrail issue. Older kernels work fine, but are not optimal. On a new install of the latest kernel, simply moving the mouse or watching a terminal download from apt-get can cause graphical corruption, reset or freeze.

Windows has zero issues with stability relating to power states or graphics, so it is not my hardware. I am using a Lenovo 11e with an Intel N2940 cpu.

Comment 109 Alejandro Morales Lepe 2016-02-24 22:51:11 UTC

(In reply to Michal Feix from comment #107)
> (In reply to Joe Burmeister from comment #105)
> > But yes, it's been very quiet from Intel on this thread, but as I
> understand
> > it,  Adrian Hunter is from Intel and has done some patches on this:
> > https://lkml.org/lkml/2015/3/24/271  (as mention in comment 55). Though
> > these don't see to have been merged for nearly a year now.
> 
> I agree. In general, this seems to be a stability issue relevant with any
> Baytrail based machine. That is why I believe there has to be thousands of
> users fighting with this bug on different linux distros, probably unaware of
> this bug report. Would it help if somebody competent raised the importance
> of this bug here in Bugzilla? I don't feel that importance "P1 Normal" is
> correct, if this bug leads to certain freezes in tens of minutes on Baytrail
> machines. Also, status "NEW" is also missleading, as this bug is obviously
> CONFIRMED.

I think I am one of those who are struggling with this bug. Any distro other than Debian 8 (kernel 3.16) locks up after some use, which may vary from a few minutes to several hours, but it always crashes. A fix would be very important: machines like the Dell Inspiron 3000 Series Ubuntu Edition are Bay Trail based and very affordable, so many users could be running them (just like myself).

Comment 110 Juliaonly 2016-02-24 23:18:40 UTC

I installed Ubuntu 14.04.4 in a separate partition for experimentation. I am running Kernel 4.2.0-30. The only modification I made was the Cstate setting mentioned in this post and it locked up in fifteen minutes. I'll try something else tonight and post the results.

Comment 111 Sebastian Damsgaard 2016-02-26 07:37:45 UTC

I can also confirm this bug. My HTPC is a shuttle XS35V4 with a J1900. It is unusable on anything higher than kernel 3.16. Exactly as Alejandro Morales Lepe explained it.

Comment 112 László Kara 2016-02-26 20:56:10 UTC

I can also confirm this bug (Acer ES-11, n2940), and like many others I am looking for a solution. Sorry that I cannot add anything useful to the hunt.

Comment 113 radarixxx 2016-02-27 08:22:18 UTC

I can also confirm this bug. ASRock Q1900TM-ITX xubuntu 3.19.0-51-generic x86_64

Comment 114 Hal 2016-02-28 14:52:06 UTC

Greetings:

I have just joined the forum to provide you feedback on my situation which seems to confirm your findings regarding 'intel_idle.max_cstate=2'.

Indeed, I have two mini-PC type low power consumption very recent boxes. The first one is an Intel NUC Model 5CPYH with a Dual Core Celeron N3050. The second is a Zotac Zbox CI320 Nano with a Quad Core Celeron N2930.

Since the beginning I have been running Linux Mint 17.2 and now 17.3 on both boxes as host and guest OS as I run VirtualBox 5.0.14 to virtualize a tiny family web server only accessible from my LAN. I never installed or tried any other OS (notably no Windows) or any other flavors of Linux on these machines.

Several observations noteworthy:

1-Intel NUC couldn't display anything via VGA or HDMI when first installed with stock Linux Mint 17.2 (kernel 3.16 if I recall). I could remotely SSH and replace its kernel to 4.3.0 (picked randomly, and it was the most recent at that time), and everything started to work. Very well! Actually without any crash or anything for days.

2-I installed VirtualBox 5.0 and virtualized a basic server built on Linux Mint 17.2 desktop with wordpress, which has been in use for months on an AMD processor based computer (but needed to be replaced as it was a 200 Watt consuming old hardware). Everything went smoothly, but the virtual machine froze overnight. This has kept happening over and over for several weeks; the virtual machine would freeze within less than a day. Rebooting it would become an ordinary daily thing. But, the host would never freeze or crash on me!

3-Zotac CI320 on the other hand started to freeze the minute I installed Linux Mint 17.2. After each reboot it would work for a few minutes and freeze before my eyes while trying to select a WiFi access point, or changing screen resolution, or browsing with Firefox, or simply moving a window around. I upgraded its kernel to 4.3.0 and many different versions, but at best the frequency of failures changed, the problem never went away for good. Things seemed to get a bit better after upgrading to Linux Mint 17.3 with kernel 3.19 stock version, to the point that I wanted to test VirtualBox on it. I installed VirtualBox 5.0.14 and started to play around.

4-My first guest OS was FreeBSD 10.2 on Zotac's VirtualBox. Amazingly, this combo brought a new found stability to my hardware. So, Linux Mint 17.3 with kernel 3.19, VirtualBox 5.0.14 and FreeBSD 10.2 stock would work trouble free without any failure, for days.

5-Then I decided to move my little server to the Zotac platform as it looked stable as described in #4. Troubles started to show up again! But far worse than on the Intel NUC. It would actually crash the entire machine, host, all guest OS etc. whereas on Intel NUC it would only crash the guest OS.

6-I kept digging for info and eventually came across this posting and thought this might be the root cause of my problems. I have been running my Zotac box with intel_idle.max_cstate=2 for the last couple of days (both on the host and guest OS) and have even been bold to the point of doing some computer intensive things. Everything is holding up for now. Hopefully it will be ok for good.

I just wanted to share my experience in the hope that someone with similar, more, or better experience will comment or make suggestions, which would be helpful for me and also for others. I am still on edge because of these 2 almost brand new computers.

Also, I wanted to ask for advice regarding using 'intel_idle.max_cstate=2' on both the host and guest OS as I am doing right now. Does it make sense? or should I only run it on the host OS?

Maybe one more question, although this might not be the right place to ask; is the FreeBSD 10.2 kernel known to work better with these processors with regards to this random freezing problem?

Thanks for your attention and sorry for the length of the post.
Hal

Comment 115 jbMacAZ 2016-02-29 23:51:46 UTC

FWIW, I had a freeze on 4.3.6 with tsc=reliable and intel_idle.max_cstate=1.  I hadn't had any freezes since 4.2.5 when cstate limit was set.  A new freeze bug perhaps?

Comment 116 Alejandro Morales Lepe 2016-03-02 19:25:02 UTC

I have been distrohopping for a while now, and I can confirm that anything newer than 3.16 freezes. I installed Ubuntu 14.04.2 and it runs nicely, but if I install 14.04.3 the system freezes. intel_idle.max_cstate=1 sometimes seems to work and sometimes doesn't, but I haven't found any pattern. If there is something I can do to help solve this issue, tell me; otherwise I am going to be stuck on Ubuntu 14.04.2 or Debian 8 forever.

Comment 117 podschie 2016-03-03 07:23:37 UTC

Hey everyone, same here! With the new kernels I have several freezes per day. Just writing and doing office stuff triggers the bug only sometimes, but when I watch a DVD (with an external drive) or surf the internet (especially flash, I think) the system freezes a lot. I use Lubuntu 15.10 with 4.2.0-30-generic.
My PC is an Acer ES-1 311 laptop with an Intel N3540 CPU. It would be nice to solve the problem. Can't we just go back to the old working kernel from Ubuntu 14.04 and delete the malicious code in the new one? I don't understand why a kernel with such a heavy bug, one that affects a lot of users, was released.

Comment 118 Michal Feix 2016-03-04 11:11:55 UTC

> Also, I wanted to ask for advice regarding using 'intel_idle.max_cstate=2'
> on both the host and guest OS as I am doing right now. Does it make sense?
> or should I only run it on the host OS?

IMHO, it only makes sense on the host.

Comment 119 Dimitris Roussis 2016-03-05 16:22:26 UTC

I am sorry but this situation is a real comedy!!

Almost all Bay Trail devices have this problem. This means a huge number of modern PCs, tablets and laptops.

This situation has lasted more than 4 months, and the developers don't care to fix it but rather to include new features in the kernel!!

I have been stuck on kernel 3.16 for more than 4 months because of this serious bug, and I know more than 20 people in the same situation, all of them with different devices.

I love Linux and I appreciate the kernel developers, but for sure we need a project manager here to estimate whether a bug is high priority or not.

Comment 120 Molnár Roland 2016-03-07 21:02:05 UTC

Hello everyone.

I have been facing the same issue. Recently I bought an Asrock N3150DC-ITX board, and the onboard N3150 was buggy with Linux kernels from 3.19 to 4.4: it froze, and sometimes X11 randomly crashed on it.

A few days ago I installed the drm-intel-next kernel from the Ubuntu mainline repository: http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/2016-03-01-wily/

After installing the new kernel, the system seems to be stable without the cstate hack.

Sidenote: after installing the Intel open source graphics driver from 01.org, the Hardware based decoding is working perfectly on Ubuntu 15.10 (Mate Edition) (The installer also updates the vaapi packages for the latest version that supports Cherrytrail). Tested with 4K contents and 1920p downscaling. No issues, no lag after 3-4 days uptime, running mostly with Kodi.

Comment 121 jds 2016-03-08 23:01:05 UTC

I just tried this on a n2940 system, a Thinkpad 11e.  The screen flashed a lot, so I went back to 4.2.8 again, with cstate hack.


(In reply to Molnár Roland from comment #120)

Comment 122 Travis Hall 2016-03-09 00:04:05 UTC

(In reply to jds from comment #121)
> I just tried this on a n2940 system, a Thinkpad 11e.  The screen flashed a
> lot, so I went back to 4.2.8 again, with cstate hack.

Interesting, I tried the drm-intel-next kernel linked by Molnár Roland on my Thinkpad 11e for a little while last night, and I didn't see any of those issues. I tried it on a quick fresh Ubuntu install (I usually use Manjaro), playing a Twitch stream in mpv. I'll have to find a way to unpack that deb and install the kernel myself, or get it building myself, to test it longer.

I don't believe any of the drm-intel-next stuff is going to get merged into the upcoming 4.5, as it's already on rc7, so maybe it will come with 4.6

Comment 123 jds 2016-03-09 02:09:55 UTC

(In reply to Travis Hall from comment #122)
> (In reply to jds from comment #121)
> Interesting, I tried the drm-intel-next kernel linked by Molnár Roland on my
> Thinkpad 11e for a little while last night, and I didn't see any of those
> issues.  I tried it on a quick fresh ubuntu install (I usually use Manjaro)
> playing a twitch stream on Mpv, I'll have to find a way to unzip that deb
> and install the kernel myself, or get it building myself to test it longer.
> 
> I don't believe any of the drm-intel-next stuff is going to get merged into
> the upcoming 4.5, as it's already on rc7, so maybe it will come with 4.6

Interesting.  Well, I installed this kernel over a Mint 17 setup (Ubuntu 14.04), so maybe there's some interaction between the new kernel and X?

Comment 124 Travis Hall 2016-03-09 22:09:15 UTC

(In reply to jds from comment #123)
> Interesting.  Well, I installed this kernel over a Mint 17 setup (Ubuntu
> 14.04), so maybe there's some interaction between the new kernel and X?

False alarm, my Ubuntu MATE install hung while running the drm-intel-next kernel  from the Ubuntu repo.  I also compiled a kernel from drm-next using an Arch User Repo package https://aur.archlinux.org/packages/linux-drm-intel-nightly/ on Manjaro, and it also hung within about 2 hours while running some youtube on loop, and a stream in mpv.

Comment 125 Dimitris Roussis 2016-03-10 18:02:19 UTC

I tried the drm-intel-next kernel and also the Intel Linux drivers. Nothing works, so I am back to kernel 3.6.17.

Comment 126 dertobi 2016-03-10 21:07:05 UTC

I'm relatively happy that my system is stable now thanks to the intel_idle.max_cstate=1 flag, however I agree with everything Dimitris Roussis wrote about this situation.

I have had my machine since mid-2014, which means that this bug has plagued users for almost 2 years now. The number of users that have been burned by this issue must be staggering, and I assume most of them didn't file a bug report.

I can't comprehend how this bug is rated "P1 normal", when it's clearly a critical bug preventing a huge number of Intel processors from being stable on Linux.

Intel should really be embarrassed about this bug.

Can we please get a statement from an Intel employee about what is being done?

Comment 127 jds 2016-03-10 21:44:15 UTC

(In reply to dertobi from comment #126)
> I'm relatively happy that my system is stable now thanks to the
> intel_idle.max_cstate=1 flag, however I agree with everything Dimitris
> Roussis wrote about this situation.
> 
> I have my machine since mid 2014, which means that this bug has plagued
> users for almost 2 years now. The number of users that have been burned by
> this issue must be staggering, and I assume most of them didn't file a bug
> report.
> 
> I can't comprehend how this bug is rated "P1 normal", when it's clearly a
> critical bug preventing a huge number of Intel processors from being stable
> on Linux.
> 
> Intel should really be embarrassed about this bug.
> 
> Can we please get a statement from an Intel employee about what is being
> done?

Most non-ARM Chromebooks use Bay Trail chips.  Any sense of what the Chromium project may have done about this bug?

Comment 128 Tal Liron 2016-03-11 00:52:49 UTC

This bug is affecting me on an Acer Aspire E3-111.

So far so good with intel_idle.max_cstate=1.

I'll echo what others have said: it would be reassuring to hear from someone at Linux or Intel about progress towards solving this. Without a doubt, it has been "quietly" affecting a great many people for a long time, who had no way of knowing what the issue was.

I spent quite a bit of money replacing the SSD thinking that it was the culprit. :(

Comment 129 Elmar Melcher 2016-03-11 14:49:01 UTC

For about 2 months I have been using kernel 4.4.0 on a daily basis with the patch mentioned in Comments 48, 55, 77, 98, 103, 105 on an Atom Z3735G, without intel_idle.max_cstate. I experience freezes on average about every 10 hours of use. Occasionally there is a specific operation that always causes a hard LOCKUP; in these cases I reboot using intel_idle.max_cstate=0 and the freeze no longer occurs.

Now I compiled kernel 4.5.0-rc7, but I was not able to apply the mentioned patch. It does not apply cleanly and trying to introduce the failing parts by hand I got an error message during boot.
This kernel freezes within less than a minute after boot, even with intel_idle.max_cstate=0 in the boot command line.
With tsc=reliable clocksource=tsc in the boot command line the freeze does not occur for at least 30 minutes, but comments seem to indicate that the tsc command line is not recommended.

Is there an update of the mentioned patch ?

Comment 130 Hal 2016-03-11 15:46:19 UTC

I just wanted to update my post #114 after 2 weeks of testing as my Zotac system is now much more stable.
First, for the host OS: intel_idle.max_cstate=2 definitely saved my Zotac computer. No more host crash, nor any VirtualBox system freeze.
As for the guest OS freezing situation; I accidentally noticed that VirtualBox might have had a problem with 3 cores assigned to the guest LinuxMint OS. Changing it to 4 cores (the maximum available on my Zotac) seems to have stopped the freezing of the guest OS. In any event in the new configuration it has been running for over a week now with no hint of problems under medium to heavy load.

Comment 131 jds 2016-03-11 17:48:17 UTC

Replying to my own question from earlier,  Chrome OS is on 3.10.18 (!).  This is for version 48.0.2564.116 in the stable channel.

I found this by checking on a Chromebook that uses the n2940.

Note that there is an issue with this system: the wireless module craps out occasionally (logged bug).  Seems to be related to the iwl* subsystem.

jds

Comment 132 jbMacAZ 2016-03-11 18:12:18 UTC

(In reply to Elmar Melcher from comment #129)
...
> With tsc=reliable clocksource=tsc in the boot command line the freeze does
> not occur for at least 30 minutes, but comments seem to indicate that tsc
> command line is not recommended.
> 
> Is there an update of the mentioned patch ?

Bugzilla could benefit from the ability to append comments instead of forcing new ones.

I had problems that I thought were associated with tsc arguments.  But my device really does have issues with timeouts and connectivity using bluetooth with 4.5-rcx.  I just hadn't noticed before trying the tsc arguments.  FWIW, I'm following the guidance in comment #103.

cstate and tsc only minimize one or more long-standing freeze problems.  But for quite a few, they are sufficient.

This link, https://github.com/hadess/rtl8723bs/tree/master/patches might help with your patch problem (last 3 are edits of same patch.)  Also try the --dry-run flag to first test a patch without changing your source set.
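
The `--dry-run` check can also be wrapped in a script when several patches need testing against a source tree. The following Python sketch assumes GNU `patch` is installed; the file and directory names in the usage lines are placeholders, and the helper names are hypothetical.

```python
import subprocess


def build_patch_command(patch_file: str, strip: int = 1, dry_run: bool = True) -> list:
    """Build the argument list for GNU patch."""
    cmd = ["patch", f"-p{strip}", "-i", patch_file]
    if dry_run:
        # --dry-run reports what would happen without modifying any file.
        cmd.insert(1, "--dry-run")
    return cmd


def patch_applies(patch_file: str, source_dir: str, strip: int = 1) -> bool:
    """Return True when the patch would apply cleanly inside source_dir."""
    result = subprocess.run(
        build_patch_command(patch_file, strip),
        cwd=source_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder name; pass an absolute patch path when using patch_applies,
    # because the command runs with source_dir as its working directory.
    print(" ".join(build_patch_command("/tmp/example.patch")))
```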

Comment 133 Mădălin Ionuț Icleanu 2016-03-12 07:06:05 UTC

I've managed to run the 4.4.5 kernel on Archlinux for more than a day on my laptop that has the Bay Trail 2930 cpu without any freezes after adding intel_idle.max_cstate=1 AND commenting out tlp's CPU_SCALING_GOVERNOR_ON_AC and CPU_SCALING_GOVERNOR_ON_BAT options.

Maybe you guys could try setting the cpu governor to the default "powersave"? It worked for me.
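
To check which governor each core is actually using, the cpufreq sysfs entries can be read. This is a minimal Python sketch, assuming the cpufreq driver exposes `scaling_governor` under `/sys/devices/system/cpu/cpuN/cpufreq/`; `read_governors` is an illustrative name.

```python
from pathlib import Path


def read_governors(cpu_root: Path = Path("/sys/devices/system/cpu")) -> dict:
    """Return {"cpu0": "powersave", ...} for every core that exposes cpufreq."""
    governors = {}
    for gov_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        try:
            governors[cpu_name] = gov_file.read_text().strip()
        except OSError:
            # Offline cores may refuse the read; skip them.
            continue
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(cpu, governor)
```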

Comment 134 Hal 2016-03-12 21:40:31 UTC

Hi! One more update to my posts #114 and #130.

Zotac's host OS LinuxMint 17.3 with Kernel 3.19.0 with intel_idle.max_cstate=2 is definitely holding up. It has gone through 2 weeks+ worth of stress testing by now and it works very well. The box is a bit warmer than originally (it has no fan, just passive cooling) but it's by no means within the critical range.

VirtualBox 5.0.16 is also holding up. I have a FreeBSD server which has worked on it for over 2 weeks under heavy load.

But another virtual machine, based on LinuxMint 17.3 running kernel 3.19.0 and xfce, has been a bit more iffy. I thought that the number of processor cores was the issue, and I still believe there is a problem along those lines: when I assign 3 cores the failure rate definitely goes up. But with 4 cores I also had a freeze, although it came after several days of working well!

Anyway, we are not out of the woods yet! But as for the host, everything is now very stable.

My question is about an intel_idle.max_cstate value of 2 vs 1. Can anyone enlighten me about the difference? How much power-saving functionality is allowed with 2 vs 1?
Thanks for any info.
Hal
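
One way to explore this question on a given machine is to list the idle states the kernel actually registered, together with their exit latencies and usage counters. The Python sketch below assumes the cpuidle sysfs layout (`/sys/devices/system/cpu/cpu0/cpuidle/stateN/` with `name`, `latency` and `usage` files); it only reports what exists and makes no claim about which states a particular max_cstate value permits.

```python
from pathlib import Path


def read_idle_states(cpu_dir: Path = Path("/sys/devices/system/cpu/cpu0/cpuidle")) -> list:
    """Return one dict per cpuidle state, ordered by state index."""
    states = []
    state_dirs = sorted(cpu_dir.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in state_dirs:
        states.append({
            "index": int(state_dir.name[5:]),
            "name": (state_dir / "name").read_text().strip(),
            # Exit latency in microseconds, as reported by the kernel.
            "latency_us": int((state_dir / "latency").read_text()),
            # Number of times this state has been entered.
            "usage": int((state_dir / "usage").read_text()),
        })
    return states


if __name__ == "__main__":
    try:
        for state in read_idle_states():
            print(state)
    except (OSError, ValueError) as err:
        print("could not read cpuidle states:", err)
```

Comparing the output after booting with different max_cstate values shows which deeper states disappear and how often the remaining ones are entered.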

Comment 135 Juha Sievi-Korte 2016-03-12 22:27:44 UTC

For the last question by Hal: at least with a Pentium N3540, the difference between max_cstate=2 and max_cstate=1 is occasional freezes (usually a full lock-up within a week or two of use) versus no freezes at all with max_cstate=1. So far max_cstate=1 is the only workaround that works for me, but I'm glad there is one. I'm running kernel 4.4.3 now and my laptop is still usable and stable.

Sorry this doesn't answer question about the power usage. It is only aimed at the stability aspect. There are quite many comments indicating initial success and then updating that it did crash after all. The freezes (for me) have been all the time very inconsistent. Sometimes the hangs come within minutes of boot and sometimes I could get more than a week of uptime without the kernel parameter. But with max_cstate=1 this system is "rock solid", no freezes at all.

I agree with the comments about the bug's priority/severity. I can't really use the 3.x series kernels due to some driver issues, and with this cstate limit I lose a lot of battery life on my laptop.

This must affect quite a huge number of users currently and at least in my case it took months to find out that it's actually a kernel bug and not some other software issue.

Comment 136 vad1m 2016-03-15 09:12:07 UTC

Guys, please try the latest kernel (4.4 or 4.5) with the intel-microcode package installed (latest version only!), for example from here: https://packages.debian.org/sid/intel-microcode . With C6/C7 enabled in the BIOS, kernel 4.5.0 and the intel-microcode package (latest version from sid), I tested my PC for 1.5 hours and everything was fine.

Comment 137 jbMacAZ 2016-03-15 18:15:58 UTC

Sorry, the intel_ucode does not fix freezing.  Manjaro(Arch derivative) already loads the micro-code (same version) each boot.  It took less than 10 minutes to freeze Manjaro15.10-x86_64 linux-4.5.0 without max_cstate limit (Asus T100-CHI).  However, I intend to add this debian package to my Ubuntu install, as this is still a good idea.  Thanks for the link.

Comment 138 vad1m 2016-03-15 18:48:30 UTC

Today I tested my PC with the C7 state enabled in the BIOS, the latest 4.5 kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/ and the latest intel-microcode package 3.20151106.1 from https://packages.debian.org/sid/intel-microcode
Everything is fine with YouTube videos as well as in the idle state (previously I had freezes within 10-30 minutes in 100% of cases), so I think that, even without patching the kernel, we at least have a solution with additional firmware. (By the way, the microcode can also be downloaded from the Intel site as a binary, but then you have to copy it manually into your system's firmware directory; for me the Debian package is much more convenient.)

Comment 139 kossmann 2016-03-16 08:01:29 UTC

Same problem here...

Hardware: Intel NUC6i5SYH (Intel Skylake i5-6260U)
Software: Debian Stretch with Kernel 4.3.0-1-amd64

Linux freezes after a few hours, no KernelCrashDump (crashkernel=256M nmi_watchdog=1) available. Workaround (intel_idle.max_cstate=1) seems to help for the moment.
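
For readers who want to apply this workaround persistently, here is a minimal sketch. It assumes a GRUB-based system such as Ubuntu or Debian, where the default kernel parameters usually live in the double-quoted GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub and take effect after running update-grub and rebooting. None of that comes from the thread itself, and the helper name add_kernel_param is hypothetical. The function only transforms text; writing the file back is left to the reader.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with `param` present in the quoted value of `key`."""
    pattern = re.compile(rf'^({re.escape(key)}=")([^"]*)(")', re.MULTILINE)

    def append(match):
        name = param.split("=", 1)[0]
        # Drop an earlier setting of the same parameter to avoid duplicates.
        kept = [p for p in match.group(2).split() if p.split("=", 1)[0] != name]
        kept.append(param)
        return match.group(1) + " ".join(kept) + match.group(3)

    new_text, count = pattern.subn(append, grub_text)
    if count == 0:
        raise ValueError(f"{key} not found")
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample, "intel_idle.max_cstate=1"))
```

Replacing an existing value rather than appending a second one keeps the command line unambiguous when switching between max_cstate=1 and max_cstate=2 for testing.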

Comment 140 László Kara 2016-03-16 19:35:05 UTC

Has anyone contacted Intel about this issue yet? We may need more help finding this bug.

Comment 141 Xermán 2016-03-16 21:21:17 UTC

Same problems here. I have an Acer Travelmate b115m with a Celeron N2940, and I was going mad until I found this thread.

Crashes every 20 min - 2 hours. No way of getting crashdump info.
cstate=1 seems to mitigate the problem, but the computer gets hot and runs kind of slower.

Comment 142 Chris Daniel 2016-03-16 21:35:06 UTC

Adding myself to the Baytrail freeze party! Lenovo MIIX 3 1030 powered by an Atom Z3735F, running Arch with a vanilla 4.5.0 kernel. 

Tried vad1m's method (though my CPU doesn't seem to have any microcode updates) just to be sure, got a hang.

Hal, I ran a couple of PowerTop draw tests on my machine. Interesting results:

cstate=1 : 3.40W
cstate=2 : 3.11W
normal   : 3.13W

Taken while idle in an Openbox session, single terminal window open.
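
To put these three readings in proportion, the short sketch below expresses each one relative to the unrestricted ("normal") figure. It only does arithmetic on the numbers quoted above; the helper name relative_change is illustrative, and single idle readings from one machine are of course not a benchmark.

```python
# PowerTop idle readings reported in the comment above, in watts.
readings_w = {"cstate=1": 3.40, "cstate=2": 3.11, "normal": 3.13}


def relative_change(watts, baseline):
    """Return the change versus the baseline as a percentage."""
    return (watts - baseline) / baseline * 100.0


baseline = readings_w["normal"]
for label, watts in readings_w.items():
    print(f"{label:<9} {watts:.2f} W  {relative_change(watts, baseline):+.1f}%")
```

On these figures, max_cstate=1 costs roughly 0.27 W at idle, while the gap between max_cstate=2 and the unrestricted setting is 0.02 W, which may well be within measurement noise.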

Comment 143 Michal Feix 2016-03-16 21:58:18 UTC

(In reply to László Kara from comment #140)
> Has anyone contacted Intel about this issue yet? We may need more help
> finding this bug.

This bug is already assigned to Len Brown from Intel, who is also listed as a maintainer of the Intel idle kernel code. The initial reporter of this bug is also an Intel employee. Anyway, I will try to raise this bug on the linux-pm mailing list tomorrow, as there seems to be very little awareness of how severe it is.

Comment 144 Vincent Frentzel 2016-03-16 22:10:51 UTC

Created attachment 209541 [details]
attachment-24616-0.html

Meanwhile I'm still trying to get @intelsupport's attention on Twitter. Feel
free to RT:

https://twitter.com/zcecc22/status/710222385430077440

Comment 145 Vincent Frentzel 2016-03-16 22:11:40 UTC

Meanwhile I'm trying to reach out to @intelsupport on Twitter to see if we can get an official update.

Feel free to retweet https://twitter.com/zcecc22/status/710222385430077440

Comment 146 Hal 2016-03-16 22:37:52 UTC

Thank you Juha (#135) and Chris (#142).

I've been running both my Zotac and Intel boxes with intel_idle.max_cstate=1 for the last few days. With both values, 2 and 1, I got pretty good results; I have not seen any failures on either host.

I've also done some power monitoring on the AC line, and it turns out that there is less than a watt of difference between 2 and 1, although temperature-wise the difference is noticeable (hotter with 1).

So, for all intents and purposes, my boxes are now working trouble-free.

VirtualBox system has also been very stable (as witnessed by my FreeBSD virtual server), but my Linux Guest OS is periodically failing on both systems.

I am now convinced that the failure on the Linux Guest OS is due to some video driver issue as opposed to the processor bug related to intel_idle.max_cstate thing.

But, all in all I am very disappointed both by Intel, and Linux (maybe I should say Ubuntu and LinuxMint) as in their race to release the latest and greatest they put out there half baked products.
This is as if we are back to the beginning of times and we are troubleshooting windows 3.1 systems...

I don't know if the real issue is Intel hardware or Linux software but either way my disappointment is such that I couldn't recommend anyone to switch to Linux as I have advocated for over a decade.

Here, I only voiced my own troubles with my own two machines. My friends and relatives who have bought inexpensive Bay Trail notebooks and got Ubuntu or Mint based on my recommendation and who are pissed because their machines freeze in the middle of a Netflix movie are not sophisticated enough to come to places like this to figure out what they are up to.
They will only say that windows xp worked better than any sh*t we have had in a long time (that certainly includes Linux), and I almost agree with them.

Comment 147 mario439 2016-03-16 22:42:34 UTC

I have the same bug on my HP Pavilion x360 with a Pentium N3520 CPU (Bay Trail architecture), running Ubuntu 15.10 with kernel version 4.2.0-30.
I'm using the proprietary drivers for the processor.
I haven't tried "intel_idle.max_cstate=1", because that uses a lot of battery...
I want to use GNU/Linux again, but I can't work normally with this bug :(

P.S.: English is not my first language.

Comment 148 Xermán 2016-03-16 22:50:30 UTC

(In reply to Hal from comment #146)

I could not agree more. I worked with Linux (Red Hat) more than 10 years ago (compiling kernels, day-to-day work, etc.), and after some years without touching it I had the idea of using it on my new laptop. I'm a professional photographer and wanted to give Darktable and the current GIMP a try.

I'm so, so disappointed.

Linux is still not usable after all these years; it's even less stable now.
Completely buggy for a normal, average user. I can't recommend it to any friend with the same or similar hardware, since none of them will know what to do. I'm back on Windows 10, and the computer runs fast and with no problems at all.

I will keep an eye on this, but for me, a casual user very interested in Linux, this operating system is just a toy to spend time with. Just a toy.

Comment 149 kossmann 2016-03-16 23:06:42 UTC

Running as a desktop is one thing; running as a server (my case) is another :-( The same kernel on an Asus eeeBox B202 (Intel Atom) has no problems.

I don't know if there is a connection, but since I started using the workaround (on Intel Skylake), I no longer see messages like "systemd-sysv-generator overwriting existing symlink" in dmesg.

Comment 150 dertobi 2016-03-16 23:33:42 UTC

I share the frustration as I have been using Linux for over 15 years and this is maybe the most serious bug in all that time (for me), which ironically seems to get extremely limited attention by Intel and Kernel developers. It seems like kernel developers don't usually use low end Intel Atom like hardware and therefore don't have to deal with the problems themselves a lot. (Admittedly that's speculation, but I can imagine that most kernel developers(or even pro users) prefer to use high end hardware (let alone for compile times))

I think it's unfair to say that because of that one (severe!) bug Linux is not recommendable anymore, since not everybody will want to use a Baytrail system, but at this time you could say Linux on baytrail isn't really advisable until this bug is fixed.

I lay the blame on Intel, as they should be stress testing their CPUs against the latest Linux kernel and pro-actively try to fix eventual bugs.

Comment 151 saracenim 2016-03-16 23:52:46 UTC

I've been using Linux since 1997, and this is the first time I've come across a bug so serious. Luckily I have another laptop with a different CPU that I can use, but my Bay Trail machine is collecting dust.
Really stupid idea to release newer and newer kernels just for the sake of adding new shiny numbers when so many people are affected by a massive bug like this, which makes windows 3.1 look like a dream. Linus has totally lost his marbles. 
I'd like to try bsd, but it's a pain in the neck and hardware support lags behind. Sigh.

Comment 152 Xermán 2016-03-16 23:59:38 UTC

Sorry if I gave too negative an impression, but I just can't help being disappointed and frustrated. I really wanted to move to open-source software.

And I know this is made with the collaboration of a lot of volunteer people, thanks to them. But shame on Linux stability and support.

Comment 153 Tal Liron 2016-03-17 00:16:21 UTC

Some of you doomsday complainers need to calm down a bit!

Microsoft and Apple (and even Google in some cases) don't even have a way to open bugs on their operating systems and get transparent feedback with a way to track progress. And the computer gods know how many hours I spent trying to debug system freezes and crashes on Windows... Just right now my employer, which distributed hundreds of MacBook Pros to its users, is experiencing a bug with major battery drain for all of them. We have no idea what is causing it yet.

This particular bug was very tricky to pin down and the community did great in reporting it and patiently trying things out until we found a workaround. We provided a very good direction for Intel to look for the cause and find a fix. As with other major bugs, I'm certain the major distributions will backport the fix to their older kernels that are still under bugfix and security support.

Software generally is terrible, but Linux is better than most. We have a good system in place to fix bugs and keep making the OS better. I don't think there's anything particularly wrong with how Linux handles quality control.

That said, I would very much appreciate it if someone from Intel steps in and comments, even briefly, on this bug report. Hint, hint. :)

Comment 154 Vincent Frentzel 2016-03-17 00:27:03 UTC

Created attachment 209561 [details]
attachment-28440-0.html

I did raise the issue on twitter to @intelsupport, they told me to download
and use the intel graphic driver at http://intel.ly/24TDt9F .

I don't think @intelsupport is that useful after all...

Does anyone have a direct contact there?

Comment 155 dertobi 2016-03-17 01:12:27 UTC

@Vincent Frentzel

Some Intel guys are on IRC, for example on irc.freenode.org #intel-gfx . You might want to bug them about this bug. I know some are aware about the issue on this channel, but some also have been dismissive about it to be honest. I know one guy who has some patches on there and needs people like us to try them.

Comment 156 Hal 2016-03-17 02:28:39 UTC

(In reply to Tal Liron from comment #153)

> Some of you doomsday complainers need to calm down a bit!
???? 

> This particular bug was very tricky to pin down and the community did great
> in reporting it and patiently trying things out until we found a workaround.
> We provided a very good direction for Intel to look for the cause and find a
> fix. As with other major bugs, I'm certain the major distributions will
> backport the fix to their older kernels that are still under bugfix and
> security support.

As this link shows the problem was discovered 15 MONTHS AGO and it is yet to be fixed! https://bugs.freedesktop.org/show_bug.cgi?id=88012

> Software generally is terrible, but Linux is better than most. We have a
> good system in place to fix bugs and keep making the OS better. I don't
> think there's anything particularly wrong with how Linux handles quality
> control.

"Quality control" you say? Since this problem was discovered 15 MONTHS AGO the linux kernel went through tons of iterations. It could have been at least provided with an automatic detection and cstate switching mechanism!

> That said, I would very much appreciate it if someone from Intel steps in and
> comments, even briefly, on this bug report. Hint, hint. :)

Sensible suggestion! But check this out: https://communities.intel.com/thread/60984?start=0&tstart=0
It doesn't sound like Joe_Intel is much of a listener, is he?

I am afraid next year this time we will still be talking about this same bug, because if it didn't get fixed since January 2015 I don't see how and why it will be ever fixed.

Comment 157 John A. 2016-03-17 02:53:14 UTC

Created attachment 209571 [details]
Arch Linux 4.1.18 LTS panic #1 (photo 1 of 3)

Attaching 3 photos of kernel panics I've seen that may be related to this. Two photos are from Arch 4.1.18 LTS with intel_idle.max_cstate=1 (plus other kernel params, mostly borrowed from Clear Linux's boot line), and one is from Arch 4.4.3 using intel_idle.max_cstate=1.

System is a console-only mini-PC running a Celeron N3150 (Braswell) with 8GB RAM and 250GB mSATA SSD. Trying to use the mini-PC as a custom network router.

Attaching photos since these panics don't write to logs, and often don't show anything at all, halting the machine or causing a spontaneous reboot. I'm going to try setting up a netconsole to capture goings-on next.

All three instances seem to choke with some invocation of start_secondary(), if I'm reading the call trace correctly. 

Hoping these instances may help devs track the core issue down. Please let me know if additional info is required or if I can test anything.

Comment 158 John A. 2016-03-17 02:55:46 UTC

Created attachment 209581 [details]
Arch Linux 4.1.18 LTS panic #2 (photo 2 of 3)

Second kernel panic photo with Arch 4.1.18 LTS on Celeron N3150 (Braswell) system using max_cstate=1. Please see the first photo for more info.

Comment 159 John A. 2016-03-17 02:57:24 UTC

Created attachment 209591 [details]
Arch Linux 4.4.3 panic (photo 3 of 3)

Third/last photo of panics, this time with Arch 4.4.3 on Celeron N3150 (Braswell) using max_cstate=1. Please see first photo for more information.

Comment 160 jds 2016-03-17 05:22:15 UTC

I think you're not quite entertaining the level of failure that's involved here.

I totally appreciate your point.  "Software is generally terrible".  True.  But from the perspective of users here, is Linux really better than most?

My Mac at work has an uptime of 172 days.  Every night I sleep it when I go home.  I haven't had to reboot it in six months.  BTW it's a laptop.

This ThinkPad running Linux that I have here crashes after 1-2 hours of sitting idle. I don't have to do anything. Turn it on; let it sit; crash! That's much worse than Windows 3.1, which was sensitive to rogue applications but didn't simply splat into smithereens on its own.

So let's not sentimentalize and pretend this isn't a total fuck-up.  

Where is Intel?




(In reply to Tal Liron from comment #153)

Comment 161 fao66134 2016-03-17 06:05:31 UTC

I use an N3150 (Braswell) too.
I had set "max_cstate=1", but still got a freeze.

Looking at coretemp, I noticed that the temperatures of cpu2 and cpu3 were a little high.
So I came up with the idea of trying "maxcpus=2", and then it did not freeze.

cpu0 and cpu1 are no problem, but with cpu2 or cpu3 online I get a freeze.

I'm happy if this is useful.
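
After booting with maxcpus=2 it is worth confirming which CPUs actually came online. A minimal sketch, assuming the usual sysfs file /sys/devices/system/cpu/online, which holds a range list such as "0-1" or "0,2-3"; the function name parse_cpu_list is illustrative.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU range list such as '0-1' or '0,2-3' into integers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


if __name__ == "__main__":
    try:
        with open("/sys/devices/system/cpu/online") as f:
            print("online CPUs:", parse_cpu_list(f.read()))
    except OSError as exc:
        print("could not read the online CPU list:", exc)
```

With the workaround from this comment in effect, the expected result on a four-core N3150 would be CPUs 0 and 1 only.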

Comment 162 Vladimir Jicha 2016-03-17 08:47:45 UTC

What is really bad about this bug is that things used to work until kernel 3.16. I bought my Bay Trail HTPC because it supported Linux. And now I have unsupported hardware with a non-replaceable motherboard that freezes even with kernel 3.13 (there are more bugs than this one; I believe that freeze is related to WiFi, which also often loses its connection and is very slow).

This chipset is in fact not supported by Linux now. I bought it because I heard everywhere that Intel has the best Linux support. Now I want to show them the same sign of respect Linus Torvalds showed NVIDIA some years ago.

I should have known that since Intel did the same to me with GMA500 graphics. I thought it was just a single mistake and they will not repeat it. But they did. :-(

Comment 163 jbMacAZ 2016-03-17 09:12:12 UTC

I've started getting occasional freezes again with 4.4 and 4.5.  That's even with cstate and tsc and a bunch of good but cast off freeze patches.  So, I'll try fewer CPU cores.  I can't risk having anything important on my system anyway, so who cares if nothing important takes longer.  

Next cycle, I'll just get AMD systems.

Comment 164 Bastien Nocera 2016-03-17 09:41:11 UTC

Well done on turning this into a forum thread. I wouldn't touch this bug with a 10-foot pole and I'm sure the Intel developers feel the same.

Comment 165 Molnár Roland 2016-03-17 11:15:18 UTC

I got the same issues after 4-5 days on Ubuntu 15.10 with the 4.5 kernel and the Intel driver. After that first freeze, it froze again within 5-6 hours; sorry for the false hope :)

Now I'm trying the upcoming Ubuntu LTS release (16.04 nightly) with the following kernel: 4.4.0-13-generic #29-Ubuntu SMP Fri Mar 11 19:31:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It seems stable right now after 2 days and 13 hours of uptime; no cstate fix has been needed so far.

The kernel is not the newest, but the mentioned microcode package and Intel drivers are up to date, as are the VA packages, so hardware decoding works nicely on it.

powertop shows the following idle stats:

          Package   |             Core    |            CPU 0
                    |                     | C0 active   4,1%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,0%    | C1-CHT      0,0%    0,1 ms
                    |                     |
                    |                     |
C6 (pc6)   81,9%    | C6 (cc6)   95,4%    | C6S-CHT     2,0%    2,9 ms
                    |                     | C7S-CHT    31,2%   86,9 ms

                    |             Core    |            CPU 1
                    |                     | C0 active   1,2%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,2%    | C1-CHT      0,2%    0,8 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   97,5%    | C6S-CHT     4,7%    3,2 ms
                    |                     | C7S-CHT    46,4%   37,6 ms

                    |             Core    |            CPU 2
                    |                     | C0 active   0,3%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,0%    | C1-CHT      0,0%    0,4 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   99,0%    | C6S-CHT     2,3%    4,2 ms
                    |                     | C7S-CHT    91,4%   89,3 ms

                    |             Core    |            CPU 3
                    |                     | C0 active   2,1%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)    0,0%    | C1-CHT      0,0%    0,0 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   94,1%    | C6S-CHT     0,3%    2,1 ms
                    |                     | C7S-CHT    85,7%   39,4 ms

                    |             GPU     |
                    |                     |
                    | Powered On  0,2%    |
                    | RC6        99,8%    |
                    | RC6p        0,0%    |
                    | RC6pp       0,0%    |
                    |                     |
                    |                     |

If I understand it correctly, the CPU cores are mostly in the C6/C7 states.
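
That reading can be checked with a little arithmetic on the per-core "C6 (cc6)" column of the powertop output above. The sketch below is only an illustration: the values are copied by hand from that output, and the helper parse_percent (a made-up name) exists because this powertop run printed decimal commas.

```python
def parse_percent(text):
    """Convert a powertop percentage such as '95,4%' or '95.4%' to a float."""
    return float(text.strip().rstrip("%").replace(",", "."))


# Per-core "C6 (cc6)" residency copied from the powertop output above.
core_c6 = ["95,4%", "97,5%", "99,0%", "94,1%"]

values = [parse_percent(v) for v in core_c6]
mean_c6 = sum(values) / len(values)
print(f"mean core C6 residency: {mean_c6:.1f}%")
print(f"lowest core C6 residency: {min(values):.1f}%")
```

Every core spent well over 90% of the sampling interval in core C6, which supports the statement that the cores are mostly in deep idle states.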

One thing I did with this configuration:
sudo powertop --auto-tune

The command above set all items under Tunables to "Good".

I will report back on my experience after a few more days...

Comment 166 Hal 2016-03-17 12:29:35 UTC

I've been scavenging for more information about the intel_idle module and came across this interesting slide presentation from Len Brown (the Intel engineer in charge of the power-saving scheme, if I understood correctly). It's dated October 2015 and was apparently used for his talk at LinuxCon in Dublin.
Many slides refer to troubles with idle, how they track them, measurements, etc. Several slides, such as #31, 32 and 33 under "Things may go wrong", mention Linux kernel versions that are buggy and still unfixed.
http://events.linuxfoundation.org/sites/events/files/slides/Brown-Linux-Suspend-at-Speed-of-Light-LC-EU-2015.pdf

Comment 167 Chen Yu 2016-03-17 14:03:09 UTC

Hi all, I think we have a T100 in the lab; I'll give it a try. BTW, could someone please tell me whether it is easily reproduced by playing videos?

Comment 168 John A. 2016-03-17 14:18:25 UTC

(In reply to fao66134 from comment #161)
> I use N3150(braswell) too.
> I was set to "max_cstate=1". but, got freeze.
> 
> Looking at the coretemp, temperature of cpu2 and cpu3 was noticed that a
> little high.
> So, i come up with to try "maxcpus=2". And then, it did not freeze.
> 
> cpu0 and cpu1 is no problem. but, cpu2 or cpu3 online to got freeze.
> 
> If this thing is useful, I'm happy.

This seems interesting... I think my N3150's issues tend to be with CPU2 most of the time too. (I'll recheck the panics I posted last night.) I wondered about possible CPU heat issues as well as I'm using a fanless aluminum case, but haven't been watching it closely. I'll start doing that.

Thanks for the maxcpus=2 workaround. I may also try that. Though I'd prefer to use all 4 cores :)

Possible related note: I tried using the latest OPNSense FreeBSD 10.2-based router/firewall distro, and ran into a CPU panic there too after a few hours. I didn't get a photo of it, but if I try it again I'll be sure to capture it.

Comment 169 Daniel Glöckner 2016-03-17 14:34:41 UTC

(In reply to Chen Yu from comment #167)
> BTW, could someone please tell me is it reproduced easily by playing videos?

Yes, it is. I'm using Firefox with HTML5 videos on YouTube to test for this bug. I always had at least one freeze within 4 hours when not restricting max_cstate.

Comment 170 John A. 2016-03-17 14:37:30 UTC

(In reply to jbMacAZ from comment #163)
> I've started getting occasional freezes again with 4.4 and 4.5.  That's even
> with cstate and tsc and a bunch of good but cast off freeze patches.  So,
> I'll try fewer CPU cores.  I can't risk having anything important on my
> system anyway, so who cares if nothing important takes longer.  

Have you tried just using max_cstate without the tsc parameters and the patches? When I added tsc params to my boot line it seemed to cause more instability/chances for halts and panics. That makes me wonder if tsc is somehow counterproductive to max_cstate.

Comment 171 Hal 2016-03-17 14:55:30 UTC

(In reply to Chen Yu from comment #167)
> Hi, all, I think we have a T100 in the lab, I'll have a try. BTW, could
> someone please tell me is it reproduced easily by playing videos?

I have an old SSD that I use to move things around and I have LinuxMint 17.3 stock kernel 3.19.0 on it. I can use it through SATA or with USB2.0 or USB 3.0 adaptors.

I just plugged it into my CI320 via SATA for a quick test. Freezing was as quick as moving the firefox window.

There is no special software on this SSD, no Virtualbox, no wine emulator, nothing. Pure stock linuxmint.

After rebooting, I tried plugging in a USB flash disk; before the directory had been read into Thunar, the whole machine froze again.

So, it is indeed quick to repeat the failure.

Comment 172 Michal Feix 2016-03-17 16:58:33 UTC

I just got a reply from the linux-pm kernel mailing list. People there are aware of this bug, but I've been told that it is quite hard to find the root cause.

I've been asked to check whether the kernel parameter idle=nomwait makes the problems go away. Obviously, CPUs might get warmer when trying this. It is just a step towards pinpointing the source.

Can you test this parameter and post results? Especially if you are one of those not lucky with intel_idle.max_cstate=1 parameter as a workaround.
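
When reporting results for a test like this, it helps to confirm that the parameter really reached the running kernel. A minimal sketch, assuming the boot command line is readable from /proc/cmdline; the function name parse_cmdline is illustrative, and the simple whitespace split does not handle quoted values containing spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            params = parse_cmdline(f.read())
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
    else:
        for name in ("idle", "intel_idle.max_cstate", "maxcpus", "tsc"):
            print(f"{name} = {params.get(name, '(not set)')}")
```

Posting this output together with the kernel version makes it clear which combination of workarounds was active when a freeze did or did not occur.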

Comment 173 jbMacAZ 2016-03-17 18:02:19 UTC

(In reply to John A. from comment #170)
> (In reply to jbMacAZ from comment #163)
> > I've started getting occasional freezes again with 4.4 and 4.5.  ...
> 
> Have you tried just using max_cstate without the tsc parameters and the
> patches? When I added tsc params to my boot line it seemed to cause more
> instability/chances for halts and panics. That makes me wonder if tsc is
> somehow counterproductive to max_cstate.

tsc is recent; I ran 4.2.x for months with relatively little trouble using just cstate and the necessary patches. Frankly, 4.3.x seems to run the same with or without tsc as long as cstate is set. My gut feeling is that there is a new instability in 4.4 and 4.5. I can't jettison all my old patches, because my T100 would have sdhci and rpmb issues and other bits of T100 hardware would stop working.


The lack of a crash log could be partially addressed by allocating a second dmesg buffer and alternating between them at boot.  The prior dmesg log would be preserved at next startup.  This should probably be a new .config option.  Alternatively, just save the last few K of the old dmesg before initializing dmesg at boot time.

Comment 174 Hal 2016-03-17 19:38:17 UTC

(In reply to John A. from comment #168)

> Possible related note: I tried using the latest OPNSense FreeBSD 10.2-based
> router/firewall distro, and ran into a CPU panic there too after a few
> hours. I didn't get a photo of it, but if I try it again I'll be sure to
> capture it.

That might be an OPNsense issue, as they seem to have introduced lots of regressions when they tried to rewrite (or clean up) some of the code. When I tried to run it on an Intel mobo a few weeks back, it just kept crashing. On the same hardware pfSense ran without problems. Any particular reason you would favor OPNsense over pfSense?

One little consolation I have about Bay Trail and Braswell is that FreeBSD (and PC-BSD) and pfSense both work flawlessly on the same hardware where I experience the Linux freezing circus.

Comment 175 Nils Asmussen 2016-03-17 20:51:13 UTC

(In reply to Michal Feix from comment #172)
> [...]
> I've been asked to check if kernel parameter idle=nomwait is making the
> problems go away. Obviously, CPU's might get warmer when trying  this. It is
> just a step to pinpoint the source.
> 
> Can you test this parameter and post results? Especially if you are one of
> those not lucky with intel_idle.max_cstate=1 parameter as a workaround.

Using vanilla kernel 4.5.0 I tried to boot with the options
tsc=reliable idle=nomwait
The system crashed after the "usual" amount of time (about an hour surfing the web).
I did not set cstate or anything else.
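When comparing reports like this one, it helps to confirm which of the workaround parameters the running kernel was actually booted with. The following is only a minimal sketch: the helper names are hypothetical, the list of parameter names is simply the set discussed in this thread, and quoted values containing spaces are not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values with spaces are not handled.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workarounds(text, names=("intel_idle.max_cstate", "idle", "tsc", "maxcpus")):
    """Return only the parameters discussed in this thread that are present."""
    params = parse_cmdline(text)
    return {n: params[n] for n in names if n in params}


# Typical use on a running system:
#   print(active_workarounds(open("/proc/cmdline").read()))
```

An empty result means the system was booted without any of these workarounds, which is the configuration described in the comment above only if `tsc` and `idle` do show up.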

Comment 176 fao66134 2016-03-18 09:13:53 UTC

(In reply to John A. from comment #168)
> This seems interesting... I think my N3150's issues tend to be with CPU2
> most of the time too. (I'll recheck the panics I posted last night.) I
> wondered about possible CPU heat issues as well as I'm using a fanless
> aluminum case, but haven't been watching it closely. I'll start doing that.
> 
> Thanks for the maxcpus=2 workaround. I may also try that. Though I'd prefer
> to use all 4 cores :)

I found a new way.

"echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling"

I am using all cores and have not had a freeze yet with this setting.

In my case, I disabled intel_idle, intel_pstate and i915, but still got a freeze.
So I compared the other CPU configuration changes between kernel 3.16 and kernel 4.5, and noticed the change to the TLB flush setting. (The intel_tlb_flushall_shift_set function was removed from "arch/x86/kernel/cpu/intel.c", and tlb_single_page_flush_ceiling was added to "arch/x86/mm/tlb.c".)
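For anyone repeating this experiment, here is a minimal Python sketch for reading and changing the knob while remembering the old value. It assumes debugfs is mounted at /sys/kernel/debug and that the script runs as root. The helper names are hypothetical, and the path is a parameter so the logic can be tried against an ordinary file first.

```python
from pathlib import Path

# Path quoted in the comment above; needs debugfs mounted and root access.
TLB_CEILING = Path("/sys/kernel/debug/x86/tlb_single_page_flush_ceiling")


def read_ceiling(path=TLB_CEILING):
    """Return the current ceiling as an int."""
    return int(Path(path).read_text().strip())


def write_ceiling(value, path=TLB_CEILING):
    """Write a new ceiling and return the previous value so it can be restored."""
    previous = read_ceiling(path)
    Path(path).write_text(f"{value}\n")
    return previous
```

Calling `write_ceiling(0)` corresponds to the `echo 0` above. The setting does not survive a reboot, so it has to be reapplied on every boot while testing.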

Comment 177 BzukTuk 2016-03-18 09:39:59 UTC

(In reply to fao66134 from comment #161)
> I use N3150(braswell) too.
> I was set to "max_cstate=1". but, got freeze.
> 
> Looking at the coretemp, temperature of cpu2 and cpu3 was noticed that a
> little high.
> So, i come up with to try "maxcpus=2". And then, it did not freeze.
> 
> cpu0 and cpu1 is no problem. but, cpu2 or cpu3 online to got freeze.
> 
> If this thing is useful, I'm happy.

Thanks for another workaround :)

Running glxgears and x264 video on processor Intel® Atom™ Z3735F (4 cores) - vanilla kernel v4.5.0:
maxcpus=1, no freeze (running 90 minutes)
maxcpus=2, no freeze (running 90 minutes)
maxcpus=3, no freeze (running over 4 hours)
no command line parameters, freeze occurred after 5 minutes (as usual).

Comment 178 fao66134 2016-03-18 15:12:12 UTC

(In reply to fao66134 from comment #176)
> I found a new way.
> 
> "echo 0 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling"
> 
> I using full core, but have not yet acquired the frozen from this setting.

Sorry, I got a freeze now.

The running time is longer, but it is not a complete fix.

Comment 179 fao66134 2016-03-18 15:39:05 UTC

(In reply to Michal Feix from comment #172)
> I've been asked to check if kernel parameter idle=nomwait is making the
> problems go away. Obviously, CPU's might get warmer when trying  this. It is
> just a step to pinpoint the source.
> 
> Can you test this parameter and post results? Especially if you are one of
> those not lucky with intel_idle.max_cstate=1 parameter as a workaround.

I also tried it, but the system froze within 5 minutes.
This is about the same as with no parameters at all.

Comment 180 Michal Feix 2016-03-18 15:49:27 UTC

(In reply to fao66134 from comment #179)
> (In reply to Michal Feix from comment #172)
> > I've been asked to check if kernel parameter idle=nomwait is making the
> > problems go away. Obviously, CPU's might get warmer when trying  this. It
> is
> > just a step to pinpoint the source.
> >
> > Can you test this parameter and post results? Especially if you are one of
> > those not lucky with intel_idle.max_cstate=1 parameter as a workaround.
> 
> I also tried, but was frozen in 5 minutes.
> This is about the same as when you do not specify anything.

So, setting idle=nomwait is not helping you. Fine. If intel_idle.max_cstate=1 is a working solution for you, could you please try with intel_idle.max_cstate=0 and post back result?

Comment 181 julio.borreguero@gmail.com 2016-03-18 16:42:32 UTC

i want to give my latest feedback on this issue to this forum thread :-D
N2940 Baytrail System running stable on all 4 cores for 2 days now.
Running latest stable kernel 4.5.0 from git repo on gentoo linux.
With latest microcode firmware from intel microcode-20151106.tgz

uname -a
Linux shiva 4.5.0 #20 SMP Tue Mar 15 19:07:39 ART 2016 x86_64 Intel(R) Celeron(R) CPU N2940 @ 1.83GHz GenuineIntel GNU/Linux

kernel parameters:
i915.enable_rc6=1 tsc=reliable clocksource=tsc

i dont know if it is the kernel or the microcode that makes this system run stable, and of course i hope it stays stable.
Playing videos, listening to music, compiling packages no freezes yet.
hope it remains like this.

and please, this is a bug-report thread, not a discussion platform

Comment 182 jbMacAZ 2016-03-18 17:45:00 UTC

"nomwait" may be device dependent.  I ran it overnight (tsc=reliable and idle=nomwait w/o cstate) and it was still running after 10 hours.  I saw other results here, so I restarted without tsc, and my system has already run five times longer than with no arguments.  I'll keep testing.  I'll need to repeat with 4.4+ since the newest kernels are less stable than 4.3 on my system.

Passive cooled Atom baytrail Z3775: cstate=1 runs nearly normal temp, cstate=0 runs slightly warmer.  "nomwait" runs about the same temp as cstate=1.  
Asus T100-CHI - Ubuntu15.10-i386, kernel-4.3.6, microcode, T100 patches, hunter patches, legacy-turbo patch.  Normally freezes well under 10 minutes without kernel arguments.

Comment 183 Michal Feix 2016-03-18 19:21:17 UTC

(In reply to julio.borreguero@gmail.com from comment #181)
> i want to give my latest feedback on this issue to this forum thread :-D
> N2940 Baytrail System running stable on all 4 cores for 2 days now.
> Running latest stable kernel 4.5.0 from git repo on gentoo linux.
> With latest microcode firmware from intel microcode-20151106.tgz
> 
> kernel parameters:
> i915.enable_rc6=1 tsc=reliable clocksource=tsc
> 
> i dont know if it is the kernel or the microcode that makes this system run
> stable, and of course i hope it stays stable.
> Playing videos, listening to music, compiling packages no freezes yet.

Microcode update 20151106 only updates the 2MB cache version of N2940. If you have 1MB cache variant of N2940, the microcode update was not the cure.

If you can test the 4.5 kernel version without any kernel parameters, it would help to understand whether it has been fixed in the meantime.

Comment 184 julio.borreguero@gmail.com 2016-03-18 19:37:56 UTC

> 
> Microcode update 20151106 only updates the 2MB cache version of N2940. If
> you have 1MB cache variant of N2940, the microcode update was not the cure.
> 
> If you can test the 4.5 kernel version without any kernel parameters, it
> would help to understand whether it has been fixed in the meantime.

ok, thank you for that information.
And yes, the cache is only 1MB but i guess you know that anyway from the attachment i posted at some earlier stage with system-specific info.

i just rebooted my machine, this time without extra kernel parameters.
my guess is that the kernel has been fixed for my architecture at least, as i was running those tsc-parameters in my last test (4.5.0-rc3) and that definitely froze.
i will be posting a hardware freeze as soon as it happens, otherwise i will let everyone know in 2-3 days that the system is still running stable. hopefully

Comment 185 dertobi 2016-03-19 00:34:09 UTC

I certainly don't want to destroy anyone's hopes, but I've had instances where my notebook ran stable for up to two weeks and then froze. Doesn't mean it has to happen, I'm just saying the absence of crashes overnight, within 10 hours, or even in 3-4 days is not a sure sign that the issue has been fixed.

Comment 186 Hal 2016-03-19 03:56:42 UTC

Among the posts there are several mentioning that kernel 3.16 is freeze-free without any additional parameter like cstate or tsc. I am curious to know whether those are distro-provided versions or custom-compiled ones.

Today I ran some tests with Linux Mint 17.2 which comes with kernel 3.16.0 as its standard and recommended kernel. On Zotac Nano CI320 N2930 it worked for about 4 hours then froze. I actually used it only for 35 minutes, then the machine was on but simply idling for the remaining 3.5 hours. I know precisely when it froze as the frozen clock at the bottom of the screen was visible.

Is there any consensus on a kernel version that reliably works on Bay Trail?

Comment 187 Chen Yu 2016-03-19 04:00:32 UTC

Tested with 4.5.0 and glxgears on T100, without any boot params. So far we have not reproduced this problem, although BzukTuk told me this method should freeze the system within 1-10 minutes.  Anyway, I'll keep up this stress testing.

Comment 188 jbMacAZ 2016-03-19 06:55:11 UTC

(In reply to dertobi from comment #185)
> I certainly don't want to destroy anyone's hopes, but I've had instances
> where my notebook ran stable for up to two weeks and then froze. <snip>

I'm just assessing the workarounds, while waiting for real fixes.

My nomwait solo test did freeze after about 4 hours - but then resumed by itself about 2 hours later without bluetooth and wifi working.  Rebooting restored communications.

(In reply to Chen Yu from comment #187)

> Tested with 4.5.0 and glxgears on T100, without any boot params, so far we
> have not reproduce this problem yet, as BzukTuk told me this method should
> freeze the system within 1-10 minutes.  Anyway I'll  keep up this stress
> testing.

There are several T100 models, which vary in how fast they freeze.  The T100T* models are more stable than the T100CHI.  Also, the very first freeze often takes longer than subsequent freezes.

Comment 189 julio.borreguero@gmail.com 2016-03-19 14:42:02 UTC

(In reply to Chen Yu from comment #187)
> Tested with 4.5.0 and glxgears on T100, without any boot params, so far we
> have not reproduce this problem yet, as BzukTuk told me this method should
> freeze the system within 1-10 minutes.  Anyway I'll  keep up this stress
> testing.

i think kernel 4.5.0 has a fix.
I am running it for several days now, but on a N2940. No freezing.
Since yesterday without any kernel boot parameters.
Anything prior to this kernel (any 4.4 kernel if you want to have a go) freezes for sure.

Also, there is a difference between N2940 and N2930.
For me, on a N2940 intel_idle.max_cstate never worked as a workaround, but it works on N2930 (deduced from posts in this thread).

i know it is still too early to say that 4.5.0 is fixed, but to me it certainly looks that way. freezes on my system always occurred within 12h.

Comment 190 Hal 2016-03-19 15:10:20 UTC

(In reply to julio.borreguero@gmail.com from comment #189)
> 
> i think kernel 4.5.0 has a fix.
> I am running it for several days now, but on a N2940. No freezing.
> Since yesterday without any kernel boot parameters.
> Anything prior to this kernel (any 4.4 kernel if you want to have a go)
> freezes for sure.
> 

My lucky version for both N2930 and N3050 seems to be 4.4.6.

4.5.0 has brought up unrelated instabilities (mostly with VGA and Wireless) on my systems so I can't even thoroughly test it. 

4.4.6 on the other hand has been pretty good without cstate or tsc up to a point (much, much longer time before freezing). Interestingly, on my Zotac box 4.4.6 spends much more time in C1 state than C6 or C7 according to powertop.

That said, the behavior of these different versions is quite wild. 
I tried to build a chart of hardware (2 separate computers one with N3050, the other with N2930) vs kernels (I have tested 3.16.0, 3.19.0, 4.0.0, 4.3.0, 4.4.0, 4.4.4, 4.4.5, 4.4.6, 4.5.0) and captured the freeze timing and conditions (like with or without video loss at freeze time) and the chart is full of inconsistencies. 

Repeat tests yield contradictory results most of the time. But, all in all 4.4.6 looks the best with the longest longevity. 4.3.0 seems to be the worst.

With cstate=2 freezing is almost nonexistent (it only happened once in more than 40 sessions). With cstate=1 I never got a freeze in any hardware/kernel combination, with some of the tests lasting more than 2 weeks. I never used a patch nor the tsc parameter.
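The C-state residency that powertop reports can also be read directly from sysfs, which is handy for logging it alongside freeze times. This is a rough sketch, not powertop's own method: it assumes the standard cpuidle layout (`stateN/name` and `stateN/time`, the latter in microseconds since boot), and the function name is hypothetical.

```python
from pathlib import Path


def cstate_residency(cpu_dir):
    """Return {state_name: fraction_of_idle_time} for one CPU's cpuidle directory.

    Each stateN directory exposes 'name' and 'time' (microseconds spent in
    that state since boot).
    """
    times = {}
    for state in sorted(Path(cpu_dir).glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        times[name] = int((state / "time").read_text())
    total = sum(times.values())
    return {name: (t / total if total else 0.0) for name, t in times.items()}


# Typical use on a running system:
#   print(cstate_residency("/sys/devices/system/cpu/cpu0/cpuidle"))
```

The counters are cumulative since boot, so to see residency over a test window you would sample twice and subtract.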

Comment 191 jbMacAZ 2016-03-20 08:15:23 UTC

I have a N3540 system that freezes at most a couple times a month without any arguments, kernel version doesn't seem to matter.  .max_cstate {0,1} stabilized it.  Looking at the recent posts, the N-series appears to be the processor benefiting most from the new suggestions.  But the more smoke that gets cleared, the sooner the rest of the problems can be found.

On my Z3775 system (T100CHI), kernel 4.5.0 without arguments didn't last 2 minutes before freezing.  With idle=nomwait it ran 2 hours before the time display froze (frozen seconds), but the mouse cursor still moved.  Keyboard keys or mouse clicks were accepted about once every 90 seconds.

Next, maxcpus=2 and idle=nomwait produced a block of "serial8250: too much work for irq191" errors in dmesg.  Raising maxcpus to 3 got rid of them.  maxcpus= {2,3} yielded no obvious degradation when just browsing, etc, so I'll leave this running...  tsc may be destabilizing for some systems like mine.

Comment 192 cororok 2016-03-20 14:18:32 UTC

My Dell laptop has an N3540. It freezes on both Xubuntu 15.10 and 16.04 (still a beta version) within 30 minutes, especially when I use the Chrome browser,
but it works well with intel_idle.max_cstate=1 on both versions.

Kernel 4.4.6 (linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb) that I downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline does not work without the cstate flag.

So I downloaded the newer 4.5.0-rc7 (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it has been working well without the cstate flag for half a day. I will update the status after one or two days.

Comment 193 julio.borreguero@gmail.com 2016-03-20 16:59:20 UTC

update:
system freeze on 4.5.0 kernel on N2940 no kernel parameters.
it took many hours (~40) but finally it happened.
back to kernel 4.1.12....

Comment 194 Xermán 2016-03-20 19:26:26 UTC

I gave it a try with Ubuntu 15.10 and kernel 4.5
I also installed the Intel microdrivers.

I was able to play a full 50 min video, but then the computer froze on the desktop without any CPU/GPU-intensive operation (that I'm aware of).

Comment 195 cororok 2016-03-20 23:46:21 UTC

(In reply to cororok from comment #192)
> My dell laptop has N3540. It freezes on both xubuntu 15.10 and 16.04(still
> beta version) in 30m especially when I use chrome browser.
> but it works well with intel_idle.max_cstate=1 on both version.
> 
> kernel
> 4.4.6(linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb
> ) that I download from http://kernel.ubuntu.com/~kernel-ppa/mainline does
> not work without cstate flag.
> 
> So I downloaded newer 4.5.0-rc7
> (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it
> is working well without cstate flag for half day. I will update the status
> after one or two days later.

4.5.0-rc7, even though it is better than the others, also froze.

Comment 196 cororok 2016-03-20 23:46:52 UTC

(In reply to cororok from comment #192)
> My dell laptop has N3540. It freezes on both xubuntu 15.10 and 16.04(still
> beta version) in 30m especially when I use chrome browser.
> but it works well with intel_idle.max_cstate=1 on both version.
> 
> kernel
> 4.4.6(linux-headers-4.4.6-040406-generic_4.4.6-040406.201603161231_amd64.deb
> ) that I download from http://kernel.ubuntu.com/~kernel-ppa/mainline does
> not work without cstate flag.
> 
> So I downloaded newer 4.5.0-rc7
> (linux-headers-4.5.0-040500rc7_4.5.0-040500rc7.201603061830_all.deb) and it
> is working well without cstate flag for half day. I will update the status
> after one or two days later.

4.5.0-rc7, even though it is better than the others, also froze.

Comment 197 jbMacAZ 2016-03-21 08:01:34 UTC

A combo bandaid for the Z3775 is idle=nomwait tsc=reliable maxcpus=3.  Test still running at 24 hours.  Better than 2 minutes without any kernel arguments... (Kernel 4.5.0.)

Comment 198 fao66134 2016-03-21 10:55:12 UTC

(In reply to Michal Feix from comment #180)
> So, setting idle=nomwait is not helping you. Fine. If
> intel_idle.max_cstate=1 is a working solution for you, could you please try
> with intel_idle.max_cstate=0 and post back result?

Only maxcpus avoids the freeze. My results follow.

running time(1st, 2nd) #parameters
30m, 1h30m #none
10m, 40m #idle=nomwait
1h, 2h #intel_idle.max_cstate=0
2h, 1h #intel_idle.max_cstate=0 idle=nomwait
30m, 1h #intel_idle.max_cstate=1

N3150 Gentoo drm-intel-nightly_kernel-4.5.0+
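To compare the parameter sets in the table above, the run times can be converted to minutes and averaged. This is a sketch with hypothetical helper names, and with only two runs per configuration the averages are indicative at best.

```python
import re

# The table from the comment above, copied verbatim.
RESULTS = """\
30m, 1h30m #none
10m, 40m #idle=nomwait
1h, 2h #intel_idle.max_cstate=0
2h, 1h #intel_idle.max_cstate=0 idle=nomwait
30m, 1h #intel_idle.max_cstate=1
"""


def to_minutes(text):
    """Convert a duration such as '1h30m', '2h' or '40m' to minutes."""
    text = text.strip()
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not text or not match:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


def mean_uptime(table):
    """Map each parameter set to the mean time-to-freeze in minutes."""
    result = {}
    for line in table.strip().splitlines():
        runs, params = line.split("#", 1)
        minutes = [to_minutes(r) for r in runs.split(",")]
        result[params.strip()] = sum(minutes) / len(minutes)
    return result
```

On these numbers, `mean_uptime(RESULTS)` puts both `intel_idle.max_cstate=0` variants at 90 minutes and `idle=nomwait` alone at 25 minutes, which matches the conclusion that none of them prevents the freeze.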

Comment 199 Hal 2016-03-22 02:23:11 UTC

Interesting findings today:

1) Came across a 2 yr old system with a Bay Trail N2807 processor. Upgraded Ubuntu on it to kernel 4.4.6 with no parameter. It has been running for more than 12 hours without a glitch. What gives?! So, not all Bay Trail processors are afflicted by this problem?

2) I was given an Intel NUC box for testing which turned out to be identical to mine, with the same N3050. I duplicated my drive with dd and removed intel_idle.max_cstate=1. It kept working all day without missing a beat! When I remove cstate from my own machine, it freezes within the hour. So bizarre...

3) As I was digging into Virtualbox log files after my guest OS froze once again on my zotac, I noticed that there is nothing noteworthy until the moment of failure except for the message "28:28:41.009623 VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. Last heartbeat received 4 seconds ago".

Then when I shutdown the guest OS window, Virtualbox adds a very extensive report about the state of the machine at the time it froze or became unresponsive.

So, this might be a good tool to help investigate how the failure is taking place.

My thinking is that this OS freezing problem is occurring the same, whether it is on a host (physical) machine or a guest (virtual) machine.

It has been found that with intel_idle.max_cstate=1 or alternative special kernel parameters we can get the kernel to behave differently and avoid failure.

But that doesn't work for a virtual machine: whatever is causing the failure makes the virtual machine fail regardless. That also suggests that it is the kernel software falling apart, not the microprocessor (or its microcode); otherwise, when the virtual machine fails, the host should fail too.

The dump in the virtualbox log file on closing is very rich in info, unfortunately it's way above my knowledge base. So, if anyone would be interested in analyzing it I could furnish it, although I think it is very easy to make the failure occur in virtualbox same as on the host.
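To find the moment of failure in a long VirtualBox log, one could scan for the heartbeat message quoted above. This is a minimal sketch: the regular expression is modelled only on that single quoted line, the function name is hypothetical, and it assumes the leading timestamp is elapsed time since the VM started.

```python
import re

# Modelled on the line quoted in the comment above:
# 28:28:41.009623 VMMDev: vmmDevHeartbeatFlatlinedTimer: Guest seems to be unresponsive. ...
FLATLINE = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d+)\s+VMMDev: vmmDevHeartbeatFlatlinedTimer:"
)


def flatline_times(log_lines):
    """Return the timestamp, in seconds, of each heartbeat-flatline message."""
    times = []
    for line in log_lines:
        match = FLATLINE.match(line)
        if match:
            h, m, s, frac = match.groups()
            times.append(int(h) * 3600 + int(m) * 60 + int(s) + float(f"0.{frac}"))
    return times
```

Passing an open log file (for example `flatline_times(open(path))`, with `path` pointing at the guest's log) gives the points at which the guest stopped responding, which can then be lined up with the state dump that follows in the same file.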

Comment 200 RussianNeuroMancer 2016-03-22 03:47:48 UTC

Does anybody know why these patches haven't been upstreamed?

https://github.com/hadess/rtl8723bs/tree/master/patches_4.5

Comment 201 jds 2016-03-22 06:46:57 UTC

Created attachment 210171 [details]
attachment-21257-0.html

I get that we shouldn't turn this bug report into a forum discussion, but
what I just don't understand is why this bug isn't considered absolutely
critical.  Personally it doesn't affect me that much -- work gives me a
very nice macbook pro -- but this bug gives the lie to decades of making
fun of Windows BSODs.

A system that can't stay up for 30 minutes?  For millions and millions of
users -- all on the lower-end of the performance spectrum?  For kernels
that go back 2 years?  It's a massive pie in the face.

I've added the cstate kernel parameter.  The machine is more stable but
battery life has gone to hell.  Such is Linux, today.

Comment 202 Mika Kuoppala 2016-03-23 08:50:01 UTC

https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test

3 _tentative_ patches on that tree. Please try.

Comment 203 dertobi 2016-03-23 11:58:28 UTC

What desktop are you all running? For me it's gnome-shell. Maybe there's some connection between software, hardware and that freeze that we've been missing so far.

Comment 204 julio.borreguero@gmail.com 2016-03-23 18:09:49 UTC

(In reply to Mika Kuoppala from comment #202)
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> 
> 3 _tentative_ patches on that tree. Please try.

i am running 4.5.0 with 3 tentative patches from mika ;-)
Already stresstesting for about 5h now. i will post any results here.

Comment 205 Travis Hall 2016-03-24 05:37:53 UTC

(In reply to Mika Kuoppala from comment #202)
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> 
> 3 _tentative_ patches on that tree. Please try.
I got the hang after 7 and a half hours of letting my N2940 run youtube and a twitch stream.

Comment 206 jds 2016-03-24 06:19:24 UTC

(In reply to dertobi from comment #203)
> What desktop are you all running? For me it's gnome-shell. Maybe there's
> some connection between software, hardware and that freeze that we've been
> missing so far.

I don't think so.  I've tried with Cinnamon and Gnome 3.

Comment 207 dertobi 2016-03-24 06:45:46 UTC

(In reply to jds from comment #206)
> (In reply to dertobi from comment #203)
> > What desktop are you all running? For me it's gnome-shell. Maybe there's
> > some connection between software, hardware and that freeze that we've been
> > missing so far.
> 
> I don't think so.  I've tried with Cinnamon and Gnome 3.

Cinnamon is a Gnome 3 fork though.

Comment 208 jds 2016-03-24 18:40:59 UTC

(In reply to dertobi from comment #207)
> (In reply to jds from comment #206)
> > (In reply to dertobi from comment #203)
> > > What desktop are you all running? For me it's gnome-shell. Maybe there's
> > > some connection between software, hardware and that freeze that we've
> been
> > > missing so far.
> > 
> > I don't think so.  I've tried with Cinnamon and Gnome 3.
> 
> Cinnamon is a Gnome 3 fork though.

Ah, you're right.  I did try MATE too briefly, which I think is a Gnome 2 fork, and it crashed -- at the time I suspected Chrome/Chromium as the issue, so I didn't connect it with this bug.

Comment 209 Juha Sievi-Korte 2016-03-25 12:20:46 UTC

Update: Grabbed 4.5.0 for testing on affected system (Acer B-115M, N3540). This is downloaded from opensuse repos this time, exact version:

Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux

Running without a freeze for a week now in my normal use, and stress-testing since this morning with HD videos. I'll report back if it freezes.

Someone asked about the desktop, I use xfce (some gnome-services running though). Have verified the freezes with two distributions, Ubuntu and Opensuse.

Comment 210 julio.borreguero@gmail.com 2016-03-25 12:29:18 UTC

it definitely is a kernel bug. read old posts in this thread.
i have verified this bug on 2 distributions and am running gentoo now, where everything is compiled.

i am running 4.5.0 from github kernel stable repo 
plus mikas 3 patches for the second day under full load and no freeze yet.

Comment 211 cororok 2016-03-25 19:13:27 UTC

I think the problem happens when the C-state changes. If that is right, then reproducing it needs a workload that moves the CPU load up and down, so that the CPU reaches the situation where it gets stuck.

In my case it happened when I used the Chrome browser on Xubuntu, so I guessed it is related to the GPU, but I don't have any knowledge about that.
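A workload of the kind described here, one that repeatedly moves a core between busy and idle, could be sketched as below. Whether this actually triggers the freeze is untested and purely an assumption; the function name and default timings are hypothetical, and a single Python process only exercises one core at a time.

```python
import time


def pulse_load(busy_s=0.05, idle_s=0.05, cycles=100):
    """Alternate busy-looping and sleeping so the CPU keeps entering and
    leaving idle states. Returns the number of completed cycles."""
    done = 0
    for _ in range(cycles):
        deadline = time.monotonic() + busy_s
        while time.monotonic() < deadline:
            pass  # spin: keeps this core out of idle
        time.sleep(idle_s)  # give the core a chance to drop into an idle state
        done += 1
    return done
```

Running one copy per core, with different `busy_s`/`idle_s` values, would give a mix of short and long idle periods.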

Comment 212 jds 2016-03-25 19:15:53 UTC

That's what I thought too at first -- and that sent me scrambling to look at chrome flags etc.  But then I observed two different systems lock up even when no browser at all was running.

(In reply to cororok from comment #211)
> I think the problem happens when C-state is changed. If it is right in order
> to test it needs a condition which changes CPU load up and down so that it
> can reach a certain situation where the CPU can get stuck.
> 
> In my case it happened when I use Chromebrowser on Xubuntu so I guessed it
> is related to GPU but I don't have any knowledge about that.

Comment 213 podschie 2016-03-25 21:58:35 UTC

(In reply to jds from comment #212)
> That's what I thought too at first -- and that sent me scramblingly looking
> at chrome flags etc.  But then I observed two different systems lock up even
> when no browser at all was running.
> 
> (In reply to cororok from comment #211)
> > I think the problem happens when C-state is changed. If it is right in
> order
> > to test it needs a condition which changes CPU load up and down so that it
> > can reach a certain situation where the CPU can get stuck.
> > 
> > In my case it happened when I use Chromebrowser on Xubuntu so I guessed it
> > is related to GPU but I don't have any knowledge about that.

I can confirm that my Acer ES1-311 with its Intel 3540 CPU crashes not only while using the Chromium browser. But I noticed it crashes more often using Chromium than Firefox. Mostly it happens when I play a video on YouTube or scroll the Facebook timeline.

If I'm working with the PC without using any browser, the system seems stable. Writing with LibreOffice, graphic manipulation with GIMP or RawTherapee work pretty well with the 4.2.0-34 Kernel (Lubuntu) and I do not get as many freezes as before. But watching a DVD with an external drive is not possible, the system freezes within minutes. Strangely the freeze occurs pretty often if I'm just reading .pdf Documents with Evince.

Comment 214 Veronica 2016-03-25 23:33:30 UTC

Hi, i own an Asus Chromebox (Haswell Intel Celeron 2955U / 1.4 GHz) and I've always experienced full system freeze in any Linux distros I've tested including Kodibuntu and OpenElec BUT never had such issues with Windows 8/8.1/10 (currently booting off external HDD)
Currently I'm running GalliumOS based on Ubuntu 15.04 with Xfce off internal SSD, came with kernel 4.1.14 by default.

What I've tried:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1 tpm_tis.interrupts=0 i915.enable_ips=0"

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=2 tpm_tis.interrupts=0 i915.enable_ips=0"

NOTES:
* Just added the intel_idle.max_cstate argument after "splash"; the rest is the default in /etc/default/grub.
* Neither worked: intel_idle.max_cstate=1 froze in less than 10m while working in Terminal, and intel_idle.max_cstate=2 froze in less than 15m while watching Netflix in Chrome Browser.
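For those switching between several values while testing, the edit to /etc/default/grub can be scripted. The sketch below only transforms the file text; the function name is hypothetical. It appends the parameter at the end of the list rather than directly after "splash", on the assumption that parameter order does not matter here.

```python
import re


def add_kernel_param(grub_text, param):
    """Append `param` to GRUB_CMDLINE_LINUX_DEFAULT, replacing any existing
    setting of the same name. Returns the modified file text."""
    name = param.split("=", 1)[0]

    def rewrite(match):
        # Drop any earlier setting of the same parameter, then append the new one.
        kept = [p for p in match.group(2).split() if p.split("=", 1)[0] != name]
        joined = " ".join(kept + [param])
        return f'{match.group(1)}"{joined}"'

    pattern = r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"'
    new_text, count = re.subn(pattern, rewrite, grub_text, flags=re.MULTILINE)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text
```

After writing the result back to /etc/default/grub, Ubuntu still needs `sudo update-grub` and a reboot before the new parameter takes effect.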

Currently testing:
Kernel 4.1.12 with no args from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.1.12-wily/ as some users suggested. Will report back if it freezes, otherwise after 2+ days.

Anything else I could try? I can't compile I just have the Chromebox for now plus I'm not that advanced. T.I.A

Comment 215 Veronica 2016-03-25 23:38:11 UTC

Update: System just froze with kernel 4.1.12, this is very frustrating.

Comment 216 Veronica 2016-03-25 23:44:47 UTC

Forgot to mention that Ubuntu Server 14.04.x is the only distro that has worked reliably for me, ran it for several months without issues then needed full OS so uninstalled.

Comment 217 Brent Davis 2016-03-26 08:46:18 UTC

I've been watching the posts on this bug report for several days now and thought I would post my own personal experience. I just bought a laptop with an N3540 chip in it and have also been experiencing random system lockups with the 4.x series kernels. But I wanted to mention that for some reason the stock kernel that comes with Debian "Jessie" gives me no problems whatsoever crash-wise. In fact the only reason I've been trying to use a 4.x kernel is because my graphics performance seems to improve drastically with them, especially in OpenGL applications (I've seen much higher FPS in apps). I did try the patches Mika Kuoppala posted on the stable 4.4.6 kernel from kernel.org but had a lockup after about an hour of use. It seems to me the crashes happen most when browsing websites in Chrome, but I have had lockups doing other things. I'm going to try the stable git of 4.5.0 and see what happens. If 4.5 still locks up with Mika's patches then I don't know what to do other than go back to Debian's 3.16 kernel. I'll take a performance hit, but at least the computer will run without crashing.

Comment 218 julio.borreguero@gmail.com 2016-03-26 09:25:09 UTC

(In reply to Mika Kuoppala from comment #202)
> https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> 
> 3 _tentative_ patches on that tree. Please try.

system freeze after ~2 days

(In reply to Veronica from comment #215)
> Update: System just froze with kernel 4.1.12 , this is very frustating.

i think you are the only one with a freeze on 4.1.[12-15] so far.
but then i haven't seen anyone posting with a 2955U unit in this thread.
please double-check you are running the correct kernel with uname -a
or try 3.16 as suggested by brent davis and others
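The check suggested here (confirming the running kernel and whether it falls in the 4.1.12 to 4.1.15 range) can be scripted as well. This is a minimal sketch with hypothetical function names; `platform.release()` returns the same string as `uname -r`.

```python
import platform
import re


def kernel_version(release=None):
    """Return (major, minor, patch) parsed from a release string such as
    '4.1.12-040112-generic'. Defaults to the running kernel."""
    release = release or platform.release()
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"cannot parse kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_range(version, low, high):
    """True if low <= version <= high, comparing version tuples."""
    return low <= version <= high
```

For example, `in_range(kernel_version(), (4, 1, 12), (4, 1, 15))` tells you whether the machine really booted one of the kernels under discussion rather than the distro default.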

Comment 219 RussianNeuroMancer 2016-03-26 09:28:58 UTC

Veronica and Brent, please check whether the workaround mentioned in the bug report title at least makes the system hang much later (or not hang at all). If that is the case, then it's worth trying the patches from comment #203 instead of the workaround.

Comment 220 Dmitry 2016-03-26 12:48:12 UTC

(In reply to julio.borreguero@gmail.com from comment #218)
> i think you are the only one with a freeze on 4.1.[12-15] so far.
No, not the only one. I use 4.1.* kernels (now 4.1.20) every day on a BayTrail Z3770 tablet and have rare freezes. Of course with the MMC PM QoS patches. Max_cstate=1 helps, but with much more power consumption. I also hit another mysterious bug, where my tablet just turns off. It looks like overheating, but I don't know for sure.
Latest kernel git also has a bug with display blinking and corruption. So I can't use it for long enough to see hang.

P.S. I hit hang once when I was reading book with fbreader. Nothing more, just fbreader.

Comment 221 Veronica 2016-03-26 14:57:30 UTC

(In reply to julio.borreguero@gmail.com from comment #218)
> (In reply to Mika Kuoppala from comment #202)
> > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > 
> > 3 _tentative_ patches on that tree. Please try.
> 
> system freeze after ~2 days
> 
> (In reply to Veronica from comment #215)
> > Update: System just froze with kernel 4.1.12 , this is very frustating.
> 
> i think you are the only one with a freeze on 4.1.[12-15] so far.
> but then i haven't seen anyone posting with a 2955U unit in this thread.
> please double-check you are running the correct kernel with uname -a
> or try 3.16 as suggested by brent davis and others

Yes, I did verify. I'm very cautious when testing. What I did was press the Shift key while booting > advanced options and select kernel 4.1.12 generic.
I know I'm the first with a Haswell to report, but my Chromebox is having the exact same symptoms people in here are having.

Comment 222 Veronica 2016-03-26 14:59:50 UTC

(In reply to RussianNeuroMancer from comment #219)
> Veronica and Brent, please check if workaround mentioned in bugreport title
> at least make system hang much later (or doesn't hang at all). If that the
> case, then it's worth a try patches from comment #203 instead of workaround.

Hi, as I mentioned in post #214 cstate=1 and cstate=2 didn't work for me. The first one froze in less than 10m and the second in less than 15m.

Comment 223 GConst 2016-03-26 16:58:51 UTC

Hello, I have the same freezing with Ubuntu 14.04.3 and kernel 3.19 on my ASRock Q1900DC-ITX. Following the information here, I reinstalled the system and downgraded to 14.04.02 with the 3.16.0.30 kernel, but today it got stuck again :-( The only version of Linux which worked fine was Oracle Linux 7.2 with 3.10.

Comment 224 Ernst Herzberg 2016-03-26 18:29:45 UTC

Maybe related patch?

http://www.spinics.net/lists/intel-gfx/msg90977.html

Comment 225 julio.borreguero@gmail.com 2016-03-26 19:14:44 UTC

(In reply to Ernst Herzberg from comment #224)
> Maybe related patch?
> 
> http://www.spinics.net/lists/intel-gfx/msg90977.html

looks interesting.
it seems to be for a different kernel version than 4.5.0 though, 2 out of 3 hunks fail, but i hopefully managed to adapt the patch and am compiling a new test kernel just now and will post any positive results, if so.

definitely worth a try, looks promising from the description. thanks

Comment 226 dertobi 2016-03-26 21:19:56 UTC

that patch looks indeed promising, I'm compiling the latest drm-intel kernel from git with that patch now (no hunks failing). Will report much later, as I expect this compilation process to take a long time on this hardware. :-)

Comment 227 julio.borreguero@gmail.com 2016-03-26 21:56:53 UTC

Created attachment 210771 [details]
drm/i915: Prevent machine death on Ivybridge context switching for kernel 4.5.0 from kernel archive

this is Chris Wilson's patch for the latest drm-intel kernel, slightly modified for the latest kernel v4.5.0 from the stable kernel archive repo tree
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

Comment 228 julio.borreguero@gmail.com 2016-03-26 23:16:39 UTC

(In reply to julio.borreguero@gmail.com from comment #227)
> Created attachment 210771 [details]
> drm/i915: Prevent machine death on Ivybridge context switching for kernel
> 4.5.0 from kernel archive
> 
> this is Chris Wilsons patch for latest drm-intel kernel slightly modified
> for latest kernel v4.5.0 from stable kernel archive repo tree
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git

it froze within 2h.
probably not worth trying for anyone else. anyway, let's see what dertobi's test with the original patch on the drm-intel kernel leaves us with

Comment 229 Dimitris Roussis 2016-03-27 09:07:57 UTC

The bug is still "P1 normal"!!!

This bug affects 30% of all laptops on the market at this moment. It is one of the most serious bugs ever left unexplored!! There are thousands of disappointed Linux users.

How can we communicate with the kernel developers in the highest positions to tell them how serious this situation is?

Comment 230 dertobi 2016-03-27 09:17:53 UTC

I hate to break it to you, but the Chris Wilson patch is not the fix. My laptop froze within an hour.

Comment 231 cororok 2016-03-27 11:34:41 UTC

(In reply to Dimitris Roussis from comment #229)
> The bug is still "P1 normal"!!!
> 
> This bug affect the 30% of all laptops this moment in the market.It is one
> of the most serious bug never explored!! There are thousands of Linux users
> dissapointed.
> 
> How we communicate with developers of kernel that have most high position 
> to tell them about how serious is this situation?

You're absolutely right. It is a very serious bug because it freezes the computer.
Bay Trail is a low-end platform, and many of its users are probably non-technical people who want a lightweight OS because Windows 10 does not run well with limited memory. But they will be very disappointed.

Comment 232 Brent Davis 2016-03-27 13:13:55 UTC

Just wanted to give a quick update, since the last post I made stated I was gonna try the latest stable git with Mika's 3 tentative patches. So far it's been 24 hours and I have not experienced a crash. Not sure yet if this is just luck or if a real difference has been made. But I can definitely say my stability coming from 4.4.6 has vastly improved. Been doing everything I can to break this thing: YouTube, h264 video, OpenGL games, etc.

Comment 233 dertobi 2016-03-27 14:42:10 UTC

(In reply to Brent Davis from comment #232)
> Just wanted to give a quick update since the last post I made stated I was
> gonna try the latest stable GIT with Mika's 3 tentative patches. So far it's
> been 24 hours and I have not experienced a crash. Not sure yet if this is
> just luck or if a real difference has been made. But I can definitely say my
> stability coming from 4.4.6 has vastly improved. Been doing everything I can
> to break this thing. Youtbe, h264 video, opengl games, etc.

I just also tried Mika's three tentative patches applied to latest drm-intel as well as Chris Wilson's patch, and within an hour my system crashed yet again.

Brent, are you making sure you don't have the usual workaround parameters on the kernel command line while testing the patches? (It has happened to me before; you can check with # cat /proc/cmdline.)
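
The check suggested above can also be scripted, which may help when several test machines are involved. The following is only a minimal sketch: the function name `find_workarounds` and the prefix list are assumptions made for illustration, and the list contains just the parameter named in this bug's title.

```python
# Hypothetical helper: report workaround parameters found on a kernel command line.
# The prefix list is an assumption; extend it with whatever you are testing without.
WORKAROUND_PREFIXES = ("intel_idle.max_cstate=",)


def find_workarounds(cmdline: str) -> list:
    """Return the space-separated kernel parameters that match a known workaround."""
    return [tok for tok in cmdline.split() if tok.startswith(WORKAROUND_PREFIXES)]
```

On a running system the input would come from /proc/cmdline, for example `find_workarounds(open("/proc/cmdline").read())`. An empty list means none of the listed parameters is active for the current boot.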

Comment 234 Allen 2016-03-27 16:23:35 UTC

I have an ASUS motherboard with Celeron J1900 cpu. 
For me, kernel 3.19.0-47 from Ubuntu 14.04.3 is stable with options
  intel_idle.max_cstate=1 nox2apic loglevel=7 debug . 
System is used for web browsing and openvpn client .
Crashes were usually happening while scrolling a large web page with mouse wheel (such as wsj.com or nytimes.com front page).

Comment 235 ladiko 2016-03-27 17:17:02 UTC

We have about 50 mainboards with J1900 and some samples with J1800, N3050 and N3150, and we had to go back to the original Ubuntu 14.04 kernel 3.13, as even the lts-utopic kernel 3.16 froze rarely, but sometimes, on a few mainboards.

Comment 236 Dimitris Roussis 2016-03-27 18:50:04 UTC

The last stable kernel without this horrible bug is 3.16.7.

Canonical provides extended support for this kernel until April 2016!!! I hope this bug will be fixed by then.

Comment 237 Brent Davis 2016-03-27 19:20:02 UTC

(In reply to dertobi from comment #233)
> (In reply to Brent Davis from comment #232)
> > Just wanted to give a quick update since the last post I made stated I was
> > gonna try the latest stable GIT with Mika's 3 tentative patches. So far
> it's
> > been 24 hours and I have not experienced a crash. Not sure yet if this is
> > just luck or if a real difference has been made. But I can definitely say
> my
> > stability coming from 4.4.6 has vastly improved. Been doing everything I
> can
> > to break this thing. Youtbe, h264 video, opengl games, etc.
> 
> I just also tried Mika's three tentative patches applied to latest drm-intel
> as well as Chris Wilson's patch, and within an hour my system crashed yet
> again.
> 
> Brent, are you making sure you don't have the usual workaround parameters in
> the command prompt while testing the patches (happened to me before, you can
> check with #cat /proc/cmdline)?

Haven't touched my bootup command with the cstate flags or anything. Just been testing kernels and patches. Didn't see any reason to, because I'm looking for a permanent solution as opposed to a workaround. But yeah, for all I know mine might crash too. Just wish there was some foolproof way to replicate it instead of just waiting for it to happen.

Comment 238 Hal 2016-03-28 18:18:58 UTC

(In reply to Dimitris Roussis from comment #236)
> The last stable kernel without this horrible bug is 3.16.7.
> 

Is this your own experience on your own computer or some information from an authoritative source?
Because Linux Mint 17.2 comes with 3.16.0 and it is prone to freezing on many N3050 and N2930 machines that I tested.

Also, based on your statement, I installed 3.16.7 from the ubuntu repo http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt19-utopic/linux-headers-3.16.7-031607-generic_3.16.7-031607.201510301030_amd64.deb

It only worked for 1 hr on one machine and 3.5 hrs on the other.

So, if I may suggest, please do not make such authoritative, blanket statements, unless you can cite an authoritative source. Otherwise, simply say that this applies to your own equipment.

Also, on my own two computers (if you look up this long thread you'll see my config info) several 3.16.n versions have been tested. Absolutely all of them eventually froze.
The only good thing is that cstate=2 reduces the failure rate significantly, and cstate=1 literally eliminates freezing on my computers.

Comment 239 Dimitris Roussis 2016-03-28 18:35:10 UTC

(In reply to Hal from comment #238)
> (In reply to Dimitris Roussis from comment #236)
> > The last stable kernel without this horrible bug is 3.16.7.
> > 
> 
> Is this your own experience on your own computer or some information from an
> authoritative source?
> Because, Linux Mint 17.2 comes with 3.16.0 and it is prone to freeze on many
> N3050 and N2930 machines that I tested.
> 
> Also, based on your statement, I installed 3.16.7 from the ubuntu repo
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt19-utopic/linux-
> headers-3.16.7-031607-generic_3.16.7-031607.201510301030_amd64.deb
> 
> It only worked for 1 hr on one machine and 3.5 hrs on the other.
> 
> So, if I may suggest, please do not make such authoritative, blanket
> statements, unless you can cite an authoritative source. Otherwise, simply
> say that this applies to your own equipment.
> 
> Also, on my own two computers (if you lookup this long thread you'll see my
> config info) several 3.16.n versions have been tested. They absolutely all
> eventually froze.
> The only good thing is that cstate=2 reduces the failure rate significantly,
> and cstate=1 literally eliminates freezing on my computers.

Almost all the users, if you read the comments, said that kernel 3.16.7 works without problems (comments 11, 35, 81, 105, etc.).

The same goes for my N3050 machine: everything above this kernel unfortunately does not work.

If there are machines that have the problem even with this kernel or older ones, the bug is more serious than we think.

Comment 240 Markus Rehbach 2016-03-28 19:02:48 UTC

On my netbook 

Acer Aspire ES1-111/R2, BIOS V1.16 10/20/2015 Celeron N2940 4GB

I'm back to CentOS 7 (kernel 3.10 whatever) and the power drain is acceptable for me. No freeze problems so far, in contrast to Ubuntu 15.10.

Tried 4.5 mainline from elrepo and this one is working stable for me, too. 4.4.4 mainline was unstable.

Will try 4.4.6 now and will report MY (!) results.

Comment 241 Hal 2016-03-28 19:46:54 UTC

(In reply to Dimitris Roussis from comment #239)
> (In reply to Hal from comment #238)
> > (In reply to Dimitris Roussis from comment #236)
> > > The last stable kernel without this horrible bug is 3.16.7.
> > > 
> > 
> > Is this your own experience on your own computer or some information from
> an authoritative source?
...
> 
> Almost all the users if you read the comments said that kernel 3.6.17 works
> without problem (comment 11,35,81,105 etc). 
> 

Not quite accurate though. Please note:

1) Vladmir Jicha mentioned in comments #104 and #162 that his computer froze about twice a week even with kernel 3.13

2) Ladiko in #235 indicates that some of their 50 boards freeze with 3.16 (and also 3.13)

I know that many people mentioned that they believed 3.16 worked well, without patching, on their specific hardware. And that's fine. But, that can't be generalized and turned into an authoritative statement that 3.16 was freeze free on the microprocessors that this thread is focusing on.

I can concur with the few unlucky people here that freezing problems are even occurring on version 3.13.

There are so many versions of the 3.x and 4.x kernels out there, and so many compilation sources, that performing a test matrix worthy of drawing conclusions from is almost impossible at this time.
Between standard issue distro kernels and what can be downloaded and compiled from kernel.org, or what's on ubuntu's prolific mainline kernel-ppa, I am quite convinced that when two people are referring to a particular version they are not necessarily talking about the same binary - far from it.
And I am not even talking about the privately patched derivatives...

So, no - unfortunately Linux has taken a very bad turn this time and troubleshooting this issue is going to be a miserable experience. (And that's probably why this bug is still not fixed after 15 months since its first discovery).
And I am not even talking about retrofitting the fix into all these versions.

Although, I will personally be very happy if I had only one version with a fix like 4.5!
Oh, but wait! there is a release candidate version 4.6 already!
WHAT A JOKE!

Comment 242 ladiko 2016-03-28 20:00:10 UTC

regarding 2): we have issues with 3.16, but not with 3.13 which we use right now.

Comment 243 Dimitris Roussis 2016-03-28 20:04:22 UTC

For me, Mr. Len Brown bears a big responsibility for this situation.

How is it possible that such a serious bug, which affects 30% or more of the new laptops on the market, is assigned to you and the Importance is still P1 normal, without the developers above you being informed?

I think Mr. Len didn't understand the effect of this bug on the Linux world!!

Just go to a computer shop. Half of the laptops use these CPUs!! What can we say to all these people? Don't use Linux, wait 2 more years until someone is interested in fixing the bug!!! This is the worst situation I have seen in Linux in the last 10 years.

Comment 244 Dimitris Roussis 2016-03-28 20:11:43 UTC

And also, what do you think? That all people are Linux experts who try different kernels like us?

It works exactly this way: somebody goes to the shop, buys a new laptop and installs the latest Ubuntu. After 20 minutes the system freezes and he says Linux sucks, I will never use it again!!

That's it!!

Comment 245 Hal 2016-03-28 20:34:44 UTC

(In reply to ladiko from comment #242)
> regarding 2): we have issues with 3.16, but not with 3.13 which we use right
> now.

My apologies. The parenthetical was left over from an edit to the sentence after I realized that you were referring to 3.16 as freezing, but not 3.13. Thank you for pointing it out.

In any event, my point to this group is that I personally do not believe that a solution to this problem will be found anytime soon, as we cannot even identify the turning point beyond which this problem started to show up. And kernel version proliferation is certainly one reason for that.

For those for whom, like me, this issue has serious ramifications (beyond the fun of using Linux distros at home, with friends and family, etc.), like losing credibility in front of a prospective customer because you can't even give a presentation with your beautiful lightweight laptop computer without rebooting twice in one hour, I have a word of advice: Go buy yourself an entry-level Mac laptop, because it's time for evasive action.

It was fun riding the Linux wave - a little over 12 years for me. But now it's time to move on.

Comment 246 jds 2016-03-28 21:12:30 UTC

This is on my work macbook, which I sleep at the end of every workday:

$ uptime
17:11  up 154 days,  3:13, 6 users, load averages: 1.34 1.83 1.90

(In reply to Hal from comment #245)
> (In reply to ladiko from comment #242)
> > regarding 2): we have issues with 3.16, but not with 3.13 which we use
> right
> > now.
> 
> My apology. The parentheses was a left over from the edit to the sentence
> after I realized that you were referring to 3.16 as freezing, but not 3.13.
> Thank you for pointing it out.
> 
> In any event my point to this group is that, I personally do not believe
> that a solution to this problem will be found anytime soon, as we cannot
> even identify the turning point beyond which this problem started to show
> up. And kernel version proliferation is certainly one reason for that.
> 
> For those who like me, this issue has serious ramifications (beyond the fun
> of using linux distros at home, with friends and family, etc.), like losing
> credibility in front of a prospective customer because you can't even give a
> presentation with your beautiful lightweight laptop computer without
> rebooting twice in one hour, I have a word of advice: Go buy yourself an
> entry level Mac laptop because it's time for evasive action.
> 
> It was fun riding the Linux wave - a little over 12 years for me. But now
> it's time to move on.

Comment 247 micha 2016-03-28 21:55:06 UTC

Weirdly enough, this thread is giving me back some hope!

I bought an Asus F551 (Intel N2930) laptop in February last year, which came with a pre-installed Windows 8.1 64-bit and which was running flawlessly - until I updated to Windows 10. Right after the update the laptop started to freeze randomly. Since I spend most of the time editing PHP code and watching the result in Firefox, I'm not really pushing the machine to its limits. And maybe that's the reason those freezes didn't happen so often. Sometimes it took 3 days, sometimes I got 3 crashes within half an hour. Absolutely unpredictable. Except for my unsaved program changes: no data loss - not a single hint in the system log.

I filed a detailed report to Asus then, but the only suggestion was restoring the machine to its shipping state. Poor, isn't it? And that's why I decided to give Linux a try instead. To keep it short: My Linux (Ubuntu Studio) 4.2.0-34.lowlatency #39-Ubuntu SMP PREEMPT is freezing, too.

To me this looked very much like a hardware defect, and my next idea was running memtest86. Strangely enough, I got not a single error when running the test with just ONE cpu, but hundreds of errors (all at address 0) with multiple processors involved.

Yeah, and so I was almost giving up hope on this laptop until I came across this thread today.
My first test was running 2 glxgears plus watching a video in firefox: Freeze after about 10 minutes.
After a reboot

Comment 248 micha 2016-03-28 21:59:56 UTC

(In reply to micha from comment #247)
> Weird enough, but this thread is giving me back some hope!
> 
> I bought an Asus F551 (Intel N2930) laptop last year in February which came
> with a pre-installed Windows 8.1 64bit and which was running flawlessly -
> until I updated to Windows 10. Right after the update the laptop started to
> freeze randomly. Since I spend most of the time editing PHP code and
> watching the result in Firefox, I'm not really bringing the machine to its
> limits. And maybe that's the reason those freezes didn't happen so often.
> Sometimes it took 3 days, sometimes I got 3 crashed within half an hour.
> Absolutely unpredictable. Except of my unsaved program changes: no data loss
> - not a single hint in the system log.
> 
> I filed a detailed report to Asus then, but the only suggestion was
> restoring the machine to its shipping state. Poor, isn't it? And that's why
> I decided to give Linux a try instead. To keep it short: My Linux (Ubuntu
> studio) 4.2.0-34.lowlatency #39-Ubuntu SMP PREEMPT is freezing, too.
> 
> To me this looked very much like a hardware defect and my next idea was
> running memtest86. Strange enough I got not a single error when running the
> test with just ONE cpu, but hundreds of errors (all at address 0) with
> multiple processors involved.
> 
> Yeah, and so I was almost giving up hope on this laptop until I came across
> this thread today.
> My first test was running 2 glxgears plus watching a video in firefox:
> Freeze after about 10 minutes.
> After a reboot

sorry, accidentally hit the wrong key ...
after the reboot 4 hours ago I added cstate=1 to the boot parameters and the system is still alive, continuously running 2 glxgears and playing videos.

Comment 249 cororok 2016-03-28 22:02:01 UTC

I guess Intel already knows about the bug, but I wonder why they don't fix it.

User experience is supposed to differ between expensive Core and cheap Bay Trail so that Intel makes a huge profit on Core CPUs. Windows 10 fits this strategy because it is slow with little memory (the Intel Compute Stick with Linux has 1 GB RAM compared to 2 GB for Windows 10).

Is Intel happy with this situation, which confines Bay Trail to Windows 10, the way netbooks were limited to a 10-inch size?

Comment 250 dertobi 2016-03-28 22:18:29 UTC

(In reply to cororok from comment #249)
> I guess Intel already knew the bug but wonder why they don't fix it.
> 
> User experience should be different between expensive Core and cheap Bay
> trail so that Intel make a huge profit in Core cpus. Windows 10 meets this
> strategy because it is slow on low memory (Intel computer stick with linux
> has 1gb ram compared to 2gb for windows 10).
> 
> Is Intel happy with this situation which restrains Bay trail in windows 10?
> like netbook was limited in 10 inch size?

Let's not devolve into conspiracy theories. To me this looks like incompetence paired with negligence. Still bad.

Comment 251 RussianNeuroMancer 2016-03-28 22:27:11 UTC

In case anyone needs them, there are amd64 deb packages with the patches from comment #202:

https://github.com/milikhin/z3735-linux-patches
https://drive.google.com/folderview?id=0BzIRxogf-cVkLWdiMTRoenU5amM 

The Linux 4.6rc1 package also includes a workaround for bug 112571.

Comment 252 cororok 2016-03-28 23:30:38 UTC

An internet post pointing to this bug:

http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail

Comment 253 Hal 2016-03-29 01:10:39 UTC

(In reply to micha from comment #248 & #247)

The Asus F551 is a decent machine (not a great machine for the intended use of Ubuntu Studio though). I reconfigured one with Linux Mint several months ago for a relative of mine. I remember having tried a low latency kernel (I can't recall the exact version) but the performance was terrible. Generally speaking processors in the N2930 class are not good candidates for low latency versions of the kernel. I eventually set that machine with Linux Mint 17.2 kernel 3.16, and a few months later upgraded the OS to 17.3 with kernel 3.19.
Of course the processor freezing problems came along and intel_idle.max_cstate=2 or 1, as discovered by many people by then, became the life saver for the machine.

Especially with cstate=1 you can expect your machine to be very stable and run flawlessly. On my laptop (although not a F551) I sometimes set it to 2 and take the risk of seeing my machine freeze as the battery lasts significantly longer with cstate set to 2 rather than to 1.

If you want to try a non-low-latency kernel, rather than installing a standard kernel on Ubuntu Studio, try a different flavor of Linux (maybe Linux Mint) with a standard kernel. Because Ubuntu Studio is a bit touchy (at least in my experience) and you may start seeing seemingly unrelated problems as soon as you replace its kernel.

Comment 254 Hal 2016-03-29 02:46:06 UTC

(In reply to Dimitris Roussis from comment #243)
> For me mr, Len Brown has a big responsibility of this situation. 
> 
You're making an excellent point. It's quite extraordinary that the gentleman in charge of fixing this bug has not posted a single line here on this thread, sharing his thoughts, or providing insight about his efforts on this matter.
Quite extraordinary...

Comment 255 Mika Kuoppala 2016-03-29 12:33:38 UTC

(In reply to julio.borreguero@gmail.com from comment #218)
> (In reply to Mika Kuoppala from comment #202)
> > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > 
> > 3 _tentative_ patches on that tree. Please try.
> 
> system freeze after ~2 days
> 

Did that set affect the rate/time of hangs?

I am now at 6 days of uptime. Workload is glxgears + vlc with vaapi.

Comment 256 julio.borreguero@gmail.com 2016-03-29 12:48:57 UTC

(In reply to Mika Kuoppala from comment #255)
> (In reply to julio.borreguero@gmail.com from comment #218)
> > (In reply to Mika Kuoppala from comment #202)
> > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > > 
> > > 3 _tentative_ patches on that tree. Please try.
> > 
> > system freeze after ~2 days
> > 
> 
> Did that set affect the rate/time of hangs?
> 
> I am now at 6days of uptime. Workload is glxgears + vlc with vaapi

i stressed the system more than usual. had a big glxgears on one workspace and i was playing nonstop movies from shell with mplayer (-nosound). no vaapi.
Plus listening to music with clementine and compiling a lot of packages (gentoo upgrading packages).
Hard to say if it improved with random freezes that can occur at any time.
what i can say is that Chris Wilson's patch took at most 2h to freeze, although i applied it to the 4.5.0 kernel.
I can try more patches or use vaapi or whatever, just let me know.

Comment 257 Hal 2016-03-29 13:42:26 UTC

(In reply to julio.borreguero@gmail.com from comment #256)
> (In reply to Mika Kuoppala from comment #255)
> > (In reply to julio.borreguero@gmail.com from comment #218)
> > > (In reply to Mika Kuoppala from comment #202)
> > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > > > 
> > > > 3 _tentative_ patches on that tree. Please try.
> > > 
> > > system freeze after ~2 days
> > > 
> > 
> > Did that set affect the rate/time of hangs?
> > 
> > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi
> 
> i stressed the system more than usual. had a big glxgears on one workspace
> and i was playing nonstop movies from shell with mplayer (-nosound). no
> vaapi.
> Plus listening to music with clementine and compiling a lot of packages
> (gentoo upgrading packages).
> Hard to say if it improved with random freezes that can occur at any time.
> what i can say is that chris wilsons patch only took max 2h in freezing,
> although i applied it to 4.5.0 kernel.
> I can try more patches or use vaapi or whatever, just let me know.

Pardon my intrusion. Although I am no longer testing anything related to this issue I thought sharing some of my findings might interest you.
The freezing is more prone to happen when the workload on the processor cores is light to medium, as the power controller then takes a more active role in switching the states. When you heavily load your processor with tasks, it goes into low-power or power-saving states much less often. If there is a failure under heavy load, it's more likely to have another cause than the bug at hand. Of course keep testing everything under heavy load too, but light load will probably cause this problem to show up more quickly and frequently.
(When I was doing serious structured testing I noticed that with no "user" load at all, just the internal system calls were causing more frequent cstate flip-flops than when running videos etc.)
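
One way to check this observation on your own hardware is to compare the kernel's cpuidle counters before and after a test run. The sketch below assumes the standard cpuidle sysfs layout (per-state `name`, `usage` and `time` files under /sys/devices/system/cpu/cpuN/cpuidle/stateM); the function names `read_cpuidle` and `entries_between` are made up for illustration.

```python
from pathlib import Path


def read_cpuidle(cpu_dir) -> dict:
    """Map idle-state name -> (entry count, microseconds spent) for one CPU directory."""
    stats = {}
    for state in sorted(Path(cpu_dir, "cpuidle").glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())   # times the state was requested
        time_us = int((state / "time").read_text())  # total residency in microseconds
        stats[name] = (usage, time_us)
    return stats


def entries_between(before: dict, after: dict) -> dict:
    """Number of times each idle state was entered between two snapshots."""
    return {name: after[name][0] - before[name][0] for name in after if name in before}
```

Taking one snapshot with `read_cpuidle("/sys/devices/system/cpu/cpu0")`, running a workload for a fixed time, and taking a second snapshot gives a rough count of idle-state entries per workload, so light and heavy loads can be compared.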

Comment 258 julio.borreguero@gmail.com 2016-03-29 13:56:06 UTC

(In reply to Hal from comment #257)
> (In reply to julio.borreguero@gmail.com from comment #256)
> > (In reply to Mika Kuoppala from comment #255)
> > > (In reply to julio.borreguero@gmail.com from comment #218)
> > > > (In reply to Mika Kuoppala from comment #202)
> > > > > https://cgit.freedesktop.org/~miku/drm-intel/log/?h=rc6_test
> > > > > 
> > > > > 3 _tentative_ patches on that tree. Please try.
> > > > 
> > > > system freeze after ~2 days
> > > > 
> > > 
> > > Did that set affect the rate/time of hangs?
> > > 
> > > I am now at 6days of uptime. Workload is glxgears + vlc with vaapi
> > 
> > i stressed the system more than usual. had a big glxgears on one workspace
> > and i was playing nonstop movies from shell with mplayer (-nosound). no
> > vaapi.
> > Plus listening to music with clementine and compiling a lot of packages
> > (gentoo upgrading packages).
> > Hard to say if it improved with random freezes that can occur at any time.
> > what i can say is that chris wilsons patch only took max 2h in freezing,
> > although i applied it to 4.5.0 kernel.
> > I can try more patches or use vaapi or whatever, just let me know.
> 
> Pardon my intrusion. Although I am no longer testing anything related to
> this issue I thought sharing some of my findings might interest you.
> The freezing is more prone to happen when the workload on the processor
> cores is light to medium as the power controller takes a more active role in
> switching the states. When you heavily load your processor with tasks it
> goes into low power or power saving states much less. If there is failure,
> more likely it's another cause than this bug at hand. Of course keep testing
> everything under heavy load too, but light load will probably cause this
> problem show up more quickly and frequently.
> (When I was doing serious structured testing I noticed that actually with no
> "user" load, just the internal system calls were causing enough/more
> frequent cstate flip flops than when running videos etc)

Well thank you for your intrusion. That indeed sounds logical, good point.
Interestingly enough, now that you mention it, the system finally froze after i closed those glxgears and ever-looping movies, which fits your theory. at that point i was just watching a movie (low-res) without stressing the machine in any other way.
nonetheless the cstate workaround doesn't work for me, although i haven't tried cstate=2, only cstate=1 (on my N2940), and that seems to be hardware dependent.

Comment 259 dertobi 2016-03-29 14:25:11 UTC

Quick observation:

Once the system is frozen I can still unplug/plug the HDMI cable and the frozen screen will reappear on my external monitor. Maybe that means nothing, but wouldn't that suggest that some level of kernel activity is still occurring? I can also use the FN keys of my laptop to disable/enable the laptop screen, but that could be happening purely on the firmware/BIOS level.

Comment 260 Juha Sievi-Korte 2016-03-29 14:46:50 UTC

(In reply to Hal from comment #257)
> 
> Pardon my intrusion. Although I am no longer testing anything related to
> this issue I thought sharing some of my findings might interest you.
> The freezing is more prone to happen when the workload on the processor
> cores is light to medium as the power controller takes a more active role in
> switching the states. When you heavily load your processor with tasks it
> goes into low power or power saving states much less. If there is failure,
> more likely it's another cause than this bug at hand. Of course keep testing
> everything under heavy load too, but light load will probably cause this
> problem show up more quickly and frequently.
> (When I was doing serious structured testing I noticed that actually with no
> "user" load, just the internal system calls were causing enough/more
> frequent cstate flip flops than when running videos etc)

My findings are similar to yours, Hal: freezes happened almost always when doing "nothing much", i.e. loading and scrolling a web page; sometimes a hang happened just after reboot, when everything was loaded and the system started idling. I think it would almost always be when the load changes from 'high' to 'low' or 'idle'.

When these problems started for me with some kernel version (after distribution upgrade from Ubuntu 14.10 to 15.04 (kernel 3.19)), the hangs first happened always when I tried to put the laptop to sleep by closing the lid. A bit later (perhaps further distribution upgrade when I got sick of the "buggy 15.04") came the full system lock-ups during 'daily use'.

But I was also wondering whether there are now two (or even more) freeze issues in this same report that different users are experiencing, as cstate limiting doesn't help everyone and systems other than Bay Trail are now included too (even if they are likely related from a design perspective).

Btw, still running the same 4.5.0 session, hitting the two-week mark in a couple of days. Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux.

No other patches, no cstate limiting. Stress-tested with videos for one full day, otherwise it's been just my daily usage pattern with web-browsing, streaming, occasional gaming, etc.

Comment 261 julio.borreguero@gmail.com 2016-03-29 15:01:54 UTC

just a simple question/thought:

wouldn't it be quite easy to just write a program to change between those cstates constantly to make a solid test program and finally be able to nail down the bug and make it reproducible ?
or to do this, if that is causing the problem:
>Two concurrent writes into the same register cacheline has the chance of
>killing the machine on Ivybridge and other gen7.
(citation from Chris Wilson's patch description)

a reliable "freeze program" would help tremendously, i think.
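
As a starting point for such a test program, here is a very rough user-space sketch. User space cannot force a particular C-state (the kernel's idle governor and the hardware decide), so all this does is alternate short busy periods with short sleeps, which should make the core enter and leave idle often. The function name `duty_cycle` and the timing values are assumptions, and nothing in this thread confirms that such a loop triggers the freeze.

```python
import time


def duty_cycle(busy_s: float, idle_s: float, cycles: int) -> int:
    """Alternate short busy-spins with sleeps; return the number of completed cycles."""
    done = 0
    for _ in range(cycles):
        end = time.perf_counter() + busy_s
        while time.perf_counter() < end:
            pass  # keep the core busy for busy_s seconds
        time.sleep(idle_s)  # let the core go idle so the kernel may pick a C-state
        done += 1
    return done
```

For example, `duty_cycle(0.002, 0.005, 100000)` would run a light, choppy load for roughly twelve minutes. Varying the two durations changes how long each idle period lasts, which may influence how deep an idle state the governor selects.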

Comment 262 Hal 2016-03-29 15:33:21 UTC

(In reply to julio.borreguero@gmail.com from comment #261)
> just a simple question/thought:
> 
> wouldn't it be quite easy to just write a program to change between those
> cstates constantly to make a solid test program and finally be able to nail
> down the bug and make it reproducible ?
> or to do this, if that is causing the problem:
> >Two concurrent writes into the same register cacheline has the chance of
> >killing the machine on Ivybridge and other gen7.
> (citation from chris wilsons patch description)
> 
> a reliable "freeze program" would help tremendously, i think.

Probably not easy, because even though you could force the cstate with your own procedure, you can't prevent the processor's microcode from interacting with it (unless of course you are an Intel guy, have access to the nitty-gritty of the microcode, and know how to throw that control code in the dustbin and overwrite it with your own).

When I first ran into this freezing problem on my Zotac, I didn't know anything about this thread or the earlier one on the freedesktop site. As quick and dirty troubleshooting, I wrote a little program with a bunch of loops (some in sequence, some in parallel) stressing different parts of the processor and computer hardware, as I thought it would help me isolate the area this problem was originating from. That kept the processor cores quite busy, but it also increased the longevity of the Linux session. Without that dingy program the machine would freeze within 5-10 minutes after booting; with it running, a freeze would only occur an hour or two later. So that gave me more time to look around the system. It also gave me a hint that the power-saving mechanism was probably the culprit, as it kicks in during light loads on the CPU.

So yes, it may be possible to come up with a Mickey Mouse solution to alleviate the negative impact of the problem and save the day, but a real solution by competent people who understand the root cause of the problem is more desirable, especially after 15 months of this saga ...

Comment 263 cororok 2016-03-29 17:49:29 UTC

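Instead of running a full load of tasks all the time, how about doing something like the script below, so that the CPU/GPU load goes up and down?


#!/bin/bash
# Restart Firefox on a video page so the CPU/GPU load goes up again.
function callCpuGpu() {
  killall -w firefox
  xdg-open https://www.youtube.com/xyz
}

# Repeat the restart every 3 minutes.
for i in {1..1000}
do
  callCpuGpu
  sleep 180s
done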

Comment 264 cororok 2016-03-29 20:32:22 UTC

Sorry, the script above was wrong. Corrected version:

#!/bin/bash
# Kill Firefox, let the machine idle, then start video playback again.
function callCpuGpu() {
  killall -w firefox
  sleep 60s # idle time
  xdg-open https://www.youtube.com/xyz
}

for i in {1..1000}
do
  callCpuGpu
  sleep 60s # running time
done
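
A browser-free variant of the same high/low load cycle could look like the sketch below. It only exercises the CPU cores, not the GPU, so it is not equivalent to the video playback above. The function names are made up for illustration, and it assumes the coreutils tools nproc and timeout are available.

```bash
#!/bin/bash
# burn_cpu SECONDS
# Spins one busy loop per online CPU for SECONDS, then returns.
burn_cpu() {
  local seconds="$1" n i
  n="$(nproc)"
  for ((i = 0; i < n; i++)); do
    # timeout kills the endless loop after the given number of seconds
    timeout "$seconds" bash -c 'while :; do :; done' &
  done
  wait   # block until every busy loop has been stopped
}

# cycle_load ROUNDS BUSY_SECONDS IDLE_SECONDS
# Alternates between full CPU load and idle time.
cycle_load() {
  local rounds="$1" busy="$2" idle="$3" i
  for ((i = 1; i <= rounds; i++)); do
    burn_cpu "$busy"
    sleep "$idle"
  done
}
```

For example, cycle_load 1000 60 60 would mirror the 60 seconds running / 60 seconds idle pattern of the script above.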

Comment 265 jbMacAZ 2016-03-30 01:48:55 UTC

Well on a lighter note, 4.6-rc1-next.29 seems to have fixed two new failure modes since 4.4.x.  On my system both occur after about 10 hours (cstate=1.)  One was a semi-freeze, where the clock seconds field stops, but the mouse/touchscreen cursor still moves freely, and the user interface was checked/updated less than once a minute.  The other failure was the screen going black w/o warning, apparently frozen.  The newest patches didn't affect these failures. Without cstate, my system freezes within minutes per usual, the patches had no obvious effect.  (uP=Z3775)

Comment 266 micha 2016-03-30 08:59:03 UTC

(In reply to Hal from comment #253)
> (In reply to micha from comment #248 & #247)
> 
> The Asus F551 is a decent machine (not a great machine for the intended use
> of Ubuntu Studio though). I reconfigured one with Linux Mint several months
> ago for a relative of mine. I remember having tried a low latency kernel (I
> can't recall the exact version) but the performance was terrible. 

Thanks Hal for your hints.

Actually installing Linux on this Laptop wasn't meant to get a powerful multimedia device in the end - it was meant to be a cross check.

Windows 8.1 was running flawlessly for several months - and after updating to Windows10 the machine started to freeze randomly.

Right now, the most surprising and interesting aspect is this coincidence:
Older kernels seem to work correctly, while the newer ones don't.

Thus, to me it looks like both parties have "optimized" their kernels up to a point these cpus/architectures can't cope with any more.

All I can report so far is: after setting cstate=1, my laptop has been running for more than 48 hours without any freeze. The first day with 2 glxgears and endless videos, today back to normal with just one Firefox and a little editing. And I wouldn't be surprised if Windows 10 ran correctly with a similar boot option. Unfortunately I haven't found a switch like that up to now.

Comment 267 Hal 2016-03-30 12:32:21 UTC

(In reply to micha from comment #266)

> Thanks Hal for your hints.
You are welcome.

> Older kernels seem to work correctly, while the newer ones don't.
You are correct. There was a time I kept my system at least a couple of steps behind the "cutting edge", as to me reliability is key. But as I replaced some of my older, power-hungry machines with tiny, low-power ones with Bay Trail or Braswell family CPUs, I also had to step up the kernel versions, because, for instance, on the entry-level Intel NUC the integrated video circuitry (HDMI part) is not properly handled by kernel 3.16.0. Most DisplayPort interfaces are prone to random problems with older kernels even when they are supported. Frankly, if I could, I would stick to Linux Mint 17.2 and not even upgrade to 17.3, as I had Mint 17.2 running for over a year (without powering it down) on a home-built AMD machine without a hiccup.

> ... if Windows10 would run correctly with a similar booting
> option. Unfortunately I haven't found a switch like that up to now.
I doubt that in the Windows case it's a kernel issue. It's probably a device driver issue on the integrated video hardware that needs to be fixed.
Also check Intel's website for any newer microcode versions for your microprocessor.

But for professional use I am now switching (back) to Mac. In the 80's and 90's people used to say "nobody got fired for buying an IBM computer". I think that applies to Apple nowadays ...

Comment 268 Dimitris Roussis 2016-03-31 18:58:18 UTC

http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail

Comment 269 Juha Sievi-Korte 2016-04-01 03:48:12 UTC

(In reply to Dimitris Roussis from comment #268)
> http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail

There was an interesting point in the article's comments section: upgrading to xorg 1.18 had solved some freezes that had happened with Chromium (but no specifics on hardware, other than a mention of Atom).

I checked back through my installation log, and I've definitely verified a freeze with 1.18.0, but not with 1.18.1, which I am running now with 4.5.0. Perhaps unrelated noise, but it caught my eye.

Comment 270 Tal Liron 2016-04-01 04:16:59 UTC

Interesting info: I had similar freezes running Android x86 (64bit version, UEFI) on the same machine. So it might really be Linux-specific and unrelated to the graphics stack.

Comment 271 kossmann 2016-04-01 06:40:58 UTC

I have no X-Server running, just a plain/headless Debian without monitor, keyboard, etc.

Comment 272 julio.borreguero@gmail.com 2016-04-01 14:40:36 UTC

(In reply to Juha Sievi-Korte from comment #269)
> (In reply to Dimitris Roussis from comment #268)
> >
> http://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail
> 
> There was interesting point in the article comments section, that upgrading
> to xorg 1.18 had solved some freezes that had happened with chromium (but no
> specifics on hardware, other than mention of atom).
> 
> I checked my installation log back, and I've definitely verified a freeze
> with 1.18.0, but not with 1.18.1 - which I am running now with the 4.5.0.
> Perhaps unrelated noise, but caught my eye.

i just upgraded xorg-server to 1.18.2 from 1.17.4.
kernel 4.5.0 no patches no boot parameters.
it froze within minutes

[N2940]

Comment 273 dertobi 2016-04-01 16:22:38 UTC

(In reply to kossmann from comment #271)
> I have no X-Server running, just a plain/headless Debian without monitor,
> keyboard, etc.

It would be interesting to see a hypothesis as to how that bug can occur in a headless setup. Can it still be the fault of the i915 driver in that case? Maybe the actual x86-64 CPU architecture code in Linux has some unexpected side effects with Bay Trail CPUs?

Comment 274 dertobi 2016-04-02 13:06:23 UTC

I had two freezes today with kernel 4.6 that could be a different bug, but there's no way to know this for sure (yet).

This occurred with intel_idle.max_cstate=1.

The good news is that this time it's at least partially reproducible. I say "partially" because I don't know if others will be able to reproduce it, too.


1) I have my smartphone connected to one of the USB ports to keep it charged.

2) I try to reboot the phone.

3) Instead of rebooting the phone shuts off.  (Probably not enough juice)

4) Then I try to force a boot by holding the power button of the phone.

(The USB cable stays connected to my laptop while I'm doing all of this)

5) Just at the moment when the phone starts to boot, my desktop freezes in apparently exactly the same way people on this bug report already know all too well.

My conclusions from this:

1) The phone is connected for charging, so it's not unlikely it's messing with the power management of the laptop by draining power and sudden shifts in that power drain. (Although it shouldn't)

2) There could be a bug in the USB subsystem.

3) My particular laptop might have a serious hardware defect.

4) ???

Anyone else, please feel free to speculate what that means.

Comment 275 Andy Furniss 2016-04-02 14:21:43 UTC

(In reply to dertobi from comment #273)
> (In reply to kossmann from comment #271)
> > I have no X-Server running, just a plain/headless Debian without monitor,
> > keyboard, etc.
> 
> It would be interesting to see a hypothesis as to how that bug can occur
> in a headless setup. Can it still be the fault of the i915 driver in that
> case? Maybe the actual x86-64 cpu architecture linux code has some
> unexpected side effects with baytrail cpus?

J1900 Asrock Q1900DC-itx.

I could easily lock it up with Kodi and tested lots in the early days of the FDO bug.

On some kernels a patch in that bug seemed to prevent the lockups for me, and it is probably still in OpenELEC (but I never ran Kodi for more than 15 hours).

The patch - just option 2.

https://bugs.freedesktop.org/show_bug.cgi?id=88012#c33

In testing, newer kernels did seem to gain a new issue.

As I bought the Q1900DC-itx to be a headless router/nas/pvr that's what I did with it.

Vanilla 4.1.1 no patch or workarounds (being headless there are no i915 IRQs so the patch would be pointless).

Ran 100 days OK; updated the kernel to 4.1.10, which locked up after 7 days, then again the next day.

Booted 4.1.1 again, which ran OK for 127 days; updated to 4.1.18, no lock-up so far (up 37 days).

Don't know what was wrong with 4.1.10 or if it's just luck (but seems unlikely).

Looking at stable commits there is a baytrail one just before 4.1.13 which fixes GPIO register access - maybe that is helping me now in 4.1.18.

One other change I made initially, though not because of lock-ups: I disabled USB3 in the BIOS, as I have 2 USB DVB-T2 tuners and I was getting low-level packet loss on the links. It seemed to be power related, as spinning a CPU would fix it, but so did (well, 99%) avoiding xhci by turning off USB3.

Comment 276 Martin 2016-04-03 16:22:39 UTC

Two years of problems and three lengthy and painful bisects later, I finally arrived at commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff (drm/i915: Agressive downclocking on Baytrail). A simple search for these terms brought me to this bug and I now know I'm not alone! Will try to read up on all comments on this bug later. Meanwhile I manually reverted the changes of the mentioned commit in 4.5 and have yet to see a freeze. Will try max_cstate=1 later.

HW: ASRock Q1900-ITX.

Comment 277 kossmann 2016-04-04 05:41:59 UTC

(In reply to dertobi from comment #273)
> It would be interesting to see a hypothesis as to how that bug can occur
> in a headless setup. Can it still be the fault of the i915 driver in that
> case? Maybe the actual x86-64 cpu architecture linux code has some
> unexpected side effects with baytrail cpus?

I updated to kernel 4.4.0-1-amd64 and (this could be what did the trick for me) installed a newly released BIOS update, which includes a microcode update (NUC6i5SYH). The uptime of my NUC is 2 days and 14 hours so far without max_cstate.

Comment 278 kossmann 2016-04-04 05:44:12 UTC

Sorry... forget my post

root@nuc:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.4.0-1-amd64 root=UUID=b4b6a796-e6c4-44b5-8d3b-0e34a2cae5c6 ro quiet crashkernel=256M nmi_watchdog=1 intel_idle.max_cstate=1

max_cstate is still set.
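
To avoid this kind of mix-up before reporting results, the check can be scripted. The following is only a sketch: get_cmdline_param is a made-up helper name, and it simply parses the name=value words of /proc/cmdline (or of another file given as the second argument).

```bash
#!/bin/bash
# get_cmdline_param NAME [FILE]
# Prints the value of NAME=value from a kernel command line and returns 0,
# or returns 1 if the parameter is not present.
get_cmdline_param() {
  local name="$1" file="${2:-/proc/cmdline}" word
  local -a words
  # Split the single command-line string into words without glob expansion.
  read -r -a words < "$file" || true
  for word in "${words[@]}"; do
    case "$word" in
      "$name"=*)
        echo "${word#*=}"   # everything after the first "="
        return 0
        ;;
    esac
  done
  return 1
}
```

With the command line shown above, get_cmdline_param intel_idle.max_cstate would print 1. When the intel_idle driver is loaded, the value in effect should also be readable from /sys/module/intel_idle/parameters/max_cstate, though that path is an assumption about the running kernel rather than something shown in this thread.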

Comment 279 Martin 2016-04-04 12:25:16 UTC

Created attachment 211641 [details]
Reverted commit 8fb55197e64... for 4.5.0

Ok, have done homework and read the whole thread.
My experiences with the BayTrail issue:

HW: ASRock Q1900-ITX with J1900 onboard, like I said before.
Load: HTPC with MythTV recording/showing both SD and HD DVB-C material.
Noteworthy: I use an out-of-kernel compiled ddbridge module that comes with own dvb-core code.

In my experience problems began when I started compiling 4.2.*, and I immediately blamed the non-standard ddbridge module. There may have been problems with 4.1.* that I don't remember, but I'm VERY confident the latest iterations in the 4.1.* series are rock-solid. The last stable kernel I used before venturing on my latest bisect was 4.1.20 without patches or work-arounds.
Since my problems started with 4.2.0 and the 4.1 series seemed stable, I bisected between 4.1 and 4.2. This led me without a shadow of a doubt to Chris Wilson's commit 8fb55197e... Freezes tended to occur much faster as I approached this commit. On 4.2.0 and above it can take hours if not days; on 8fb55197e... it's a matter of minutes. I was surprised to end up on a commit that was not related to dvb/device code, but relieved that it precisely matched the other hardware I use, whose stability I never doubted.

My HTPC is now playing HD DVB-C content as we speak on 4.5.0 using the attached patch, which is a manual reversal of 8fb55197e... to the best of my knowledge. It's been up since yesterday and hasn't crashed since, but I'm sceptical since freezes took longer on later kernels anyway. So far, so good.

Comment 280 Martin 2016-04-04 12:48:06 UTC

I lied! I've found an old mail conversation about the problem and indeed I started seeing the freezes on 3.17 like many others. I tried bisecting between 3.16 and 3.17 back then and never convincingly arrived at a commit I could blame due to the unpredictable nature. So it does seem we are looking at different bugs that (partially) got fixed somewhere in the 4.1 branch. Patched 4.5 still going strong btw.

Comment 281 Juha Sievi-Korte 2016-04-07 18:36:40 UTC

(In reply to Juha Sievi-Korte from comment #209)
> Update: Grabbed 4.5.0 for testing on affected system (Acer B-115M, N3540).
> This is downloaded from opensuse repos this time, exact version:
> 
> Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21
> UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux
> 
> Running withtout a freeze for a week now in my normal use and stress-testing
> since this morning with HD videos. I'll report back if it freezes.
> 
> Someone asked about the desktop, I use xfce (some gnome-services running
> though). Have verified the freezes with two distributions, Ubuntu and
> Opensuse.

juhas@cardhu:~> uptime
 21:32pm  up 19 days 13:22,  5 users,  load average: 1.44, 1.59, 1.51
juhas@cardhu:~> uname -a
Linux cardhu 4.5.0-58.gb2c9ae5-default #1 SMP PREEMPT Wed Mar 16 17:30:21 UTC 2016 (b2c9ae5) x86_64 x86_64 x86_64 GNU/Linux
juhas@cardhu:~> cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.5.0-58.gb2c9ae5-default root=UUID=4e634188-9fb6-40f9-87ae-487fd31414f3 resume=/dev/disk/by-uuid/5daad161-5400-48d2-a6e5-8cc5e0f08c20 splash=silent quiet showopts

Never before had this long uptime without boot parameters. Seems I'm unable to make this crash now. Anyone else this lucky with N3540 and 4.5.0? Am I forever stuck with this particular kernel version? :)

Comment 282 Martin 2016-04-11 12:01:01 UTC

Just rolled back to vanilla 4.5 to see if I could get a stable system out of 4.5 without my patch, as Juha reports. I can't. After 2.5 hours of watching it froze like it always did after 4.2.

So it seems for me at least I need to roll-back 8fb55197e... which btw is so stable I haven't seen a freeze in a week of regularly watching television.

Comment 283 aicjofs 2016-04-13 18:04:23 UTC

I have known about this bug for well over a year, mostly ignored it and was content on 3.16 and 3.13. Popped in a few days ago to see the state of things and read up. I can't believe a bug that locks up the system within a few minutes or few hours has got no love.

I have a J1900 as a HTPC running Ubuntu 14.04, around a year or more ago the transition from 3.16 to something higher introduced me to this issue. I was able to work around it on higher kernel versions through the BIOS settings for C-state. Never used the kernel flag. Anyway I didn't like that option so I went back to 3.16. In the interest of upgrading the system to Ubuntu 16.04 in the near future and the higher kernel version used I thought I should look in to this again.

I grabbed the 4.6-RC3 source and reverted the "Aggressive Downclocking of Baytrail" patch by hand. I was kind of depressed to look at all the new .config changes; they sure have added a lot of stuff to a non-working kernel... Anyway, I have seen really positive results over the past 36 hours. No lockups. I will be the first to admit I haven't tried anything past 4.1, and that was when it was in RC status a long time ago. Back then I was usually able to lock the system up within 30 minutes by bouncing between browsing and scrolling busy web pages in Firefox and starting and stopping videos in Kodi (anything I could think of to make the GPU up/down threshold shift). I can't seem to make it lock up at all now.

I know from reading many people have thought they had it licked, only to post back a few days later that it didn't. Probably the case here too, but something has changed over the last year because like I said previously I was able to lock it up within minutes from anything over 3.16 to 4.1-RC when I messed with this bug last time, and it's lasting days+? so far.

Comment 284 Koen L 2016-04-14 08:08:42 UTC

I've read all/most related threads and to me this appears to be the status quo.

'quick' fixes:

# intel_idle.max_cstate

Adding intel_idle.max_cstate=1 OR intel_idle.max_cstate=0 to the kernel parameters seems to work for most people, but leaves the processor running even when it should be idle (not energy efficient and causes more heat).

# Kernel 4.5+ with commit reversal

Using kernel 4.5+ without commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff (drm/i915: Agressive downclocking on Baytrail). Some people mention positive results when reverting this commit on earlier kernel versions as well.

# intel_pstate=disable

Some have mentioned that setting the intel_pstate=disable kernel parameter helps, but others confirmed it did not help in their case.
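
For the kernel-parameter workarounds listed above, a sketch of how the parameter might be added on an Ubuntu-style GRUB setup follows. It assumes the boot options live in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub; other distributions and boot loaders differ. add_kernel_param is a hypothetical helper name, not an existing tool.

```bash
#!/bin/bash
# add_kernel_param PARAM [FILE]
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line of FILE unless
# that line already contains it. A backup is kept as FILE.bak.
# If FILE has no such line, nothing is changed.
add_kernel_param() {
  local param="$1" file="${2:-/etc/default/grub}"
  if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
    return 0   # already present, nothing to do
  fi
  sed -i.bak \
    "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" \
    "$file"
}
```

Run as root, add_kernel_param intel_idle.max_cstate=1 would edit /etc/default/grub; on Ubuntu the change normally takes effect after sudo update-grub and a reboot, and can then be confirmed with cat /proc/cmdline.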

Problem background:

# Irregular

The issue does not appear on a regular basis, some have reported a working system for over a day (+1 for me) and then it crashes twice in an hour.

# Confirming

There are no/limited logs and as such it is difficult to tell whether everyone in these threads is actually experiencing the same issue.

# cstate & pstate information from Intel (posted by Chris Rainey)

1. C-states and P-states are very different (https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)

2. Power Management States: P-States, C-States, and Package C-States (https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)

3. (update) C-states, C-states and even more C-states (https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)

Real fix?

A real fix has yet to be found... In the commit which some people have reverted (https://patchwork.freedesktop.org/patch/45755/) Wilson and Deepak (from Intel) are named and in a later message Wilson states "Why those vlv_punit_read() result in a machine hang was never understood." (https://lists.freedesktop.org/archives/intel-gfx/2016-January/084206.html).

I'll CC both of them to this thread.

To-do:

This issue affects kernels up to 4.5 (as far as I can tell from the discussion). 4.4 for sure (experiencing the issue on latest 4.4 myself now).
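
Since a hard freeze leaves little or nothing in the logs, one low-tech way to at least record when a machine died is a heartbeat file. The sketch below is an illustration only, with a made-up function name and record format; it appends a UTC timestamp, the kernel release and the load average at a fixed interval and flushes to disk each time.

```bash
#!/bin/bash
# heartbeat FILE INTERVAL_SECONDS [COUNT]
# Appends one record to FILE every INTERVAL_SECONDS. COUNT limits the number
# of records; 0 (the default) means run until interrupted.
heartbeat() {
  local file="$1" interval="$2" count="${3:-0}" n=0
  while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
    printf '%s %s load=%s\n' \
      "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      "$(uname -r)" \
      "$(cut -d' ' -f1-3 /proc/loadavg)" >> "$file"
    sync   # push the record to disk so it survives a hard freeze
    n=$((n + 1))
    sleep "$interval"
  done
}
```

Started in the background as heartbeat /var/tmp/heartbeat.log 10, the last line in the file after a forced reboot gives the freeze time to within about ten seconds, which makes reports from different users easier to compare.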

Comment 285 dertobi 2016-04-14 08:16:55 UTC

(In reply to Koen L from comment #284)
> I've read all/most related threads and to me this appears to be the status
> quo.
> 
> 'quick' fixes:
> 
> # intel_idle.max_cstate
> 
> Adding intel_idle.max_cstate=1 OR intel_idle.max_cstate=0 to the kernel
> parameters seems to work for most people, but leaves the processor running
> even when it should be idle (not energy efficient and causes more heat).
> 
> # Kernel 4.5+ with commit reversal
> 
> Using kernel 4.5+ without commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff
> (drm/i915: Agressive downclocking on Baytrail). Some people mention positive
> results when reverting this commit on earlier kernel versions as well.
> 
> # intel_pstate=disable
> 
> Some have mentioned that setting the intel_pstate=disable kernel parameter
> helps, but others confirmed it did not help in their case.
> 
> Problem background:
> 
> # Irregular
> 
> The issue does not appear on a regular basis, some have reported a working
> system for over a day (+1 for me) and then it crashes twice in an hour.
> 
> # Confirming
> 
> There are no/limited logs and as such it is difficult to tell whether
> everyone in these threads is actually experiencing the same issue.
> 
> # cstate & pstate information from Intel (posted by Chris Rainey)
> 
> 1. C-states and P-states are very different
> (https://software.intel.com/en-us/blogs/2008/03/12/c-states-and-p-states-are-very-different)
> 
> 2. Power Management States: P-States, C-States, and Package C-States
> (https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states)
> 
> 3. (update) C-states, C-states and even more C-states
> (https://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states)
> 
> Real fix?
> 
> A real fix has yet to be found... In the commit which some people have
> reverted (https://patchwork.freedesktop.org/patch/45755/) Wilson and Deepak
> (from Intel) are named and in a later message Wilson states "Why those
> vlv_punit_read() result in a machine hang was never understood."
> (https://lists.freedesktop.org/archives/intel-gfx/2016-January/084206.html).
> 
> I'll CC both of them to this thread.
> 
> To-do:
> 
> This issue affects kernels up to 4.5 (as far as I can tell from the
> discussion). 4.4 for sure (experiencing the issue on latest 4.4 myself now).

Thanks for that comprehensive summary!

The only thing that I want to add is:

# The bug is still occurring on the latest kernel 4.6rc3 and git.

Comment 286 Koen L 2016-04-14 08:24:26 UTC

No problem, we all want to get this fixed!

I actually ended up CC-ing every person mentioned in the 'signed-off-by' of this patch.

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Deepak S <deepak.s@linux.intel.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>

Fairly certain they should be able to give us some more pointers as to how to properly fix this issue.

Comment 287 Mort Yao 2016-04-14 11:47:25 UTC

I have to add that the regression is present on all kernel versions after 3.16, so commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff

    drm/i915: Agressive downclocking on Baytrail

was not the true cause, at least not for me. It was merged only after 4.2RC, but I experienced the freeze on 4.1 as well, easily within hours to days (someone above mentioned it already happens on 3.17 as well). Otherwise it could be two different issues we're talking about in this thread.

For kernel freezes starting from 3.16, "the commit" was

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=31685c258e0b0ad6aa486c5ec001382cf8a64212

    drm/i915/vlv: WA for Turbo and RC6 to work together.

As far as I can tell, reverting it or simply applying this patch https://github.com/OpenBricks/openbricks/blob/master/packages/system/linux/patches/4.0/linux-999-i915-use-legacy-turbo.patch seems to do the trick.

Comment 288 Libor Chmelik 2016-04-14 13:39:28 UTC

I'm on an Acer Aspire E 15 E5-511-P7AT laptop with an Intel N3540. I had to wait for kernel 4.5 to solve the hanging issue on shutdown/reboot. But the random freezing issue still isn't fixed. I'm using Mint 17.3. I could downgrade the kernel, but that would bring back the hanging issue on shutdown/reboot.

The freezing issue only happens randomly. Sometimes after a few hours. Sometimes after a few days or even a week. But always when the graphics are in use.

I tried the latest Intel graphics drivers, but the issue still isn't solved.
The laptop is 16 months old right now, and it's the first time I have run into a bug that wasn't solved in that time. It seems to happen only when the graphics are in extended use, like HD viewing in VLC or streaming HD videos on YouTube at higher than 720p.

Comment 289 jbMacAZ 2016-04-14 18:32:40 UTC

I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx).  I am also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes when I plug or unplug a flash drive from a USB hub.  This USB problem has never occurred with older kernels I've used.  Same symptom, cstate ineffective... (Asus T100-CHI, Z3775)

Comment 290 Koen L 2016-04-14 18:34:10 UTC

@Mort Yao

Thanks for pointing this out!

I'm trying out the intel_idle.max_cstate=1 option first now (seems to work) because we've got some of these systems in production.

Afterwards I will use our test-system to check whether a kernel patch completely fixes the issue.

Comment 291 dertobi 2016-04-14 20:17:52 UTC

(In reply to jbMacAZ from comment #289)
> I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx).  I am
> also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes when
> I plug or unplug a flash drive from a USB hub.  This USB problem has never
> occurred with older kernels I've used.  Same symptom, cstate ineffective...
> (Asus T100-CHI, Z3775)

I already wrote about how my phone rebooting while connected to the usb port causes my system to freeze, while that's not exactly the same thing you're reporting I feel it could be related.

Question for you and experts in USB:
- Is there a sudden drop/surge in power when plugging/unplugging a flash drive?

Because that's what I think is happening with my phone: it normally gets a charge from the port, then no charge, and then, when the phone starts booting, there is an unusual increase in the power the phone draws from the USB port, which somehow or other influences the CPU or other components and causes that dreadful freeze.

Recently I had a freeze (caused by my phone) with the usual symptoms but the audio that was currently running was in a weird 1 second loop going on seemingly forever.

Comment 292 jbMacAZ 2016-04-14 22:36:14 UTC

(In reply to dertobi from comment #291)
> (In reply to jbMacAZ from comment #289)
> > I'm still seeing rapid freeze if I remove cstate=1 (4.3.6 - 4.6.-rx).  I am
> > also seeing a new unrelated freeze with 4.6-rc3->next20160413 sometimes
> when
> > I plug or unplug a flash drive from a USB hub.  This USB problem has never
> > occurred with older kernels I've used.  Same symptom, cstate ineffective...
> > (Asus T100-CHI, Z3775)
> 
> I already wrote about how my phone rebooting while connected to the usb port
> causes my system to freeze, while that's not exactly the same thing you're
> reporting I feel it could be related.
> 
> Question for you and experts in USB:
> - Is there a sudden drop/surge in power when plugging/unplugging a flash
> drive?

A rebooting phone could certainly provoke lots of otherwise latent bugs in a USB handler.  It would be a worthy test case for both hardware and firmware Q/A.

My hub is (externally) powered, so USB power draw shouldn't be affecting my device. Since I build my kernels elsewhere, this problem is unmistakably recent. And it is easily avoided by using an older kernel.

Ultimately, these two freezes probably merit their own bug reports.

Comment 293 Martin 2016-04-17 14:06:20 UTC

I'm evaluating Mort Yao's idea of reverting 31685c258e0b0a..... and so far have not seen freezes on watching DVB-C HD content. I have however experienced two crashes while watching flash content (in Chrome). The problem is, I don't trust the flashplayer anymore, so I'm reluctant to say the patch isn't valid (for me).

Comment 294 Mort Yao 2016-04-17 15:02:05 UTC

Unfortunately I experienced another hang yesterday (after one week's stable use), so the patch I mentioned in the last comment isn't valid for me anymore.

On the other hand, reverting the complete commit 31685c2 isn't really easy -- the old revision of the module won't compile together with the current 4.x kernel codebase. I'd like to hear if anyone had any success doing that. However, a proper fix is yet to be found.

Comment 295 Martin 2016-04-17 15:31:18 UTC

My patch in comment 279 applies cleanly against 4.5.0 and 4.5.1 and resolves the problems for me (at least for a very, very long time). You could give that a try?

I agree it's not as clean as your second patch, but like others I suspect we're looking at different problems that originated in the 3.17 and 4.2 branches. For me, the 3.17 problem seems solved in 4.1.x (for recent x) and my patch reverts whatever causes problems in the 4.2 and up series.

Comment 296 Mort Yao 2016-04-17 16:12:47 UTC

@Martin thanks for bringing this to my attention.

Yesterday's freeze was on a 4.5 kernel (with only the legacy-turbo patch applied). It seems I should try to revert 8fb55197e64 as well, since there seem to be two very different causes of freezes!

I'm currently on vanilla 4.1.6 (for me it's known to freeze once or twice a month; that's already much better than all kernels 4.2+), but I'm planning to try both:
1. Apply both the 8fb55197e revert patch and the legacy-turbo patch on 4.5.
2. Apply the legacy-turbo patch on 4.1.x.

will see how it goes.

Comment 297 Gabriel7340 2016-04-21 20:29:28 UTC

Same problem here.. I can't do nothing when the system freezes. With the  intel_idle.max_cstate=1 flag it's ok but consumes more power :/

Comment 298 Gabriel7340 2016-04-21 20:32:17 UTC

*I can't do anything

Comment 299 jjmeijer88 2016-04-21 22:41:16 UTC

Hi,

I'm using a ASRock Q1900 (Intel Celeron J1900 Baytrail) board with an Nvidia GT720 GPU and I don't get any hangs at all with Arch Linux (kernel 4.4.1-2-ARCH) + Kodi 16.1.
I'm wondering if you guys are using the onboard GPU? I guess I could switch GPU to try.

On my Intel Atom Z3770 Baytrail tablet (HP Omni10) the only way to get it even booted to Android-x86 is with intel_idle.max_cstate=1.

Comment 300 Hal 2016-04-23 14:59:33 UTC

(In reply to Hal from comment #199)
> Interesting findings today:
> 
> 2) I was given an Intel Nuc box for testing which turned out to be identical
> to mine, with same N3050. Duplicated my drive with DD and removed
> intel_idle.max_cstate=1. It kept working all day without missing a beat! I
> remove cstate from my own machine it freezes within the hour. So bizarre...
> 

This is something I posted several weeks ago. Since then I have been using both boxes in parallel, for same type of daily tasks (some web browsing, occasional video playback, some Netflix, lots of background music playing).

One of these boxes has a processor (N3050) with a stepping older than the other. That one doesn't show any freezing symptoms.

The newer box (with the same processor but more recent stepping) needs intel_idle.max_cstate=1 to run without freezing, otherwise it fails quite regularly, within a couple of hours after booting.

Hal
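Since the comment above ties the freezes to CPU stepping, reports are easier to compare when each one states the model and stepping. A minimal sketch of reading both from the text of /proc/cpuinfo follows; the function name `cpu_identity` is made up for this illustration, and it assumes the usual x86 "key : value" layout with a blank line between processors.

```python
def cpu_identity(cpuinfo_text):
    """Return (model name, stepping) for the first CPU block in /proc/cpuinfo text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Both values are returned as strings; a missing key yields None.
    return fields.get("model name"), fields.get("stepping")


# Hypothetical usage on a live system:
#   with open("/proc/cpuinfo") as f:
#       print(cpu_identity(f.read()))
```

Only the first processor block is read, on the assumption that all cores of one package report the same model and stepping.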

Comment 301 ladiko 2016-04-23 15:59:51 UTC

We have 50 ASRock Q1900-ITX boards: some work without issues, some work with cstate=1, some freeze anyway and need kernel 3.16. Only kernel 3.16 made all of them work without this issue. For another issue on another mainboard type we went back to 3.13, but some of them don't support a resolution of 1280x1024 via VGA this way. So we had to differentiate: this CPU = this kernel, that CPU = that kernel. Now we run Baytrail with 3.16, but I plan to compile a custom 4.4 or 4.5 kernel as explained before.

Comment 302 dan.g.tob 2016-04-23 16:02:53 UTC

Did this patchset ever get merged? sounds suspiciously similar.

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/00271.html

Comment 303 jjmeijer88 2016-04-23 17:43:59 UTC

@Dan, I tested these patches. There is a slight improvement but the system still hangs at some point. At least the mmc bus does, that is.

Comment 305 dertobi 2016-04-23 19:35:01 UTC

I was told to try the patches from here:

(for 4.4)
https://github.com/fritsch/linux/commit/8b48465bd197e2f4891a3f9c5737bb13981d1c94 

and here:

(for 4.5)
https://bugs.freedesktop.org/show_bug.cgi?id=88012#c33

Which I will try later, but I want to encourage others to do the same.

Comment 307 dkrall 2016-04-25 17:19:46 UTC

I can confirm this issue to a certain degree.

Using a Dell Inspiron, with an Intel N3050 (stepping 3), I can boot and run kernel 4.2.0-35-generic (from Ubuntu 14.04), and also the 4.2 kernel shipped with Fedora 23, but no version higher than 4.2.

I get a black screen immediately after booting any kernel >4.2 using any distro available (fedora, opensuse, ubuntu, etc). Disabling pstates did not help in this case (running kernel 4.5).

If there's anything I can do to help debug this issue further, please let me know.

Comment 308 Ghry 2016-04-25 19:08:09 UTC

(In reply to lewexeki from comment #27)
> Hi,
> 
> I had the same problem with "Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz".
> With kernel 4.2.0-16.19 there were ~5-8 freezes/day. After upgrading to
> 4.3.3-040303-generic (ubuntu version) it was much better: 1/2 freezes/day.
> With cstate=1 there has not been one yet.

I have an N3520 BayTrail and am currently running kernel 4.0.6 with cstate=1 as well. Since I set cstate=1 my Asus notebook hasn't frozen (it's been about 10 days already).

Comment 309 Ghry 2016-04-25 19:14:46 UTC

I have an Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz BayTrail in my Asus notebook. I tested cstate=1 with kernel 4.0.9 and it hasn't frozen for about 10 days now. Can somebody tell me from which kernel version the bug will be fixed completely?

Comment 310 ladiko 2016-04-26 04:26:47 UTC

My crystal ball says kernel 6.6.6 will be usable.

Comment 311 Gabriel7340 2016-04-26 14:24:41 UTC

When every computer on the market has an Intel Bay Trail processor.
For now it (only) affects 40%!! of all PCs on the market.

Comment 312 GConst 2016-04-27 08:40:55 UTC

Hi all, for the ASRock q1900itx-dc I have found a workaround: I turned off C-states in the BIOS (UEFI). Uptime is now more than a week. I can't say that I'm happy with this solution, but it lets me wait until this bug is fixed.

Comment 313 dertobi 2016-04-27 09:23:40 UTC

(In reply to GConst from comment #312)
> Hi all, for asrock q1900itx-dc I have found workaround: i turned off cstate
> in BIOS (UEFI). uptime more than week, i cannot say that im happy with this
> solution, but it allowed me wait untill this bug will be fixed.

That's pretty much the same workaround we already have, except you're doing it in the BIOS instead of the kernel boot command line. And that makes total sense.

Comment 314 w2q 2016-04-27 20:43:52 UTC

Not sure if this has anything to do with our problem:

https://www.dragonflydigest.com/2016/04/04/17888.html

It says: 
If you remember this Baytrail problem, Daniel Bilik has gone and found a fix, as this appears to be a cross-platform bug, and he has patches for DragonFly.

http://lists.dragonflybsd.org/pipermail/users/2016-April/228682.html
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/5d8e0f49ad2ab6201288c8b4f5ebb966f27e5779
http://lists.dragonflybsd.org/pipermail/users/2016-March/228645.html

Perhaps it helps. Good luck.

Comment 315 w2q 2016-04-27 20:55:53 UTC

Sorry, it seems to be just the fix from Mika (https://bugzilla.kernel.org/show_bug.cgi?id=109051#c203). So nothing new, I guess....

Comment 316 MarkB 2016-05-01 08:48:50 UTC

I believe that I have run into this problem running Ubuntu on an NUC clone with Intel's J1900 processor. Does anyone know if the freezing issue is confined to machines based on Bay Trail chips or is it more widespread than this?

Comment 317 Martin 2016-05-05 07:55:41 UTC

I recently experienced unprecedented hangs while idle (HTPC unreachable in the morning, so without GPU involvement) and while watching Flash video, although watching DVB-C content was very stable with 8fb55197e... reverted. I am now back to vanilla 4.5.2 (soon .3) using intel_idle.max_cstate=1. That does seem to be the magic bullet after all.

Apparently I don't understand the *states very well, because with this option set I still see the GPU enter rc6 in powertop. Is that less efficient than c*? I do see that the package is not going into pc2-7.
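For anyone wanting to check the core C-states directly rather than through powertop, the kernel's cpuidle sysfs directory exposes per-state counters. The sketch below reads the `name`, `usage` and `time` files for one CPU; `read_cpuidle_states` is a name invented here, the default path assumes cpu0, and it says nothing about GPU rc6 or package C-states, which are tracked elsewhere.

```python
import os


def _read(path):
    """Return the stripped contents of a small sysfs file."""
    with open(path) as f:
        return f.read().strip()


def read_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(state name, entry count, time spent in microseconds), ...]."""
    if not os.path.isdir(cpu_dir):
        return []  # cpuidle not available (or not a Linux system)
    entries = [e for e in os.listdir(cpu_dir)
               if e.startswith("state") and e[5:].isdigit()]
    states = []
    # Sort numerically so that state10 comes after state9.
    for entry in sorted(entries, key=lambda e: int(e[5:])):
        base = os.path.join(cpu_dir, entry)
        states.append((_read(os.path.join(base, "name")),
                       int(_read(os.path.join(base, "usage"))),
                       int(_read(os.path.join(base, "time")))))
    return states
```

With intel_idle.max_cstate=1 in effect, one would expect only the shallow states to keep accumulating time between two calls, though that expectation is an assumption and worth verifying on the machine in question.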

Comment 318 Len Brown 2016-05-06 17:11:50 UTC

re: comment316
MarkB, this sighting is specific to baytrail.

Comment 319 Phil 2016-05-06 18:46:44 UTC

I think I have run into this issue on my NUC with an Intel(R) Celeron(R) CPU  N3150  @ 1.60GHz (Braswell). 

Basically under any significant load (a gcc compile of the Linux kernel, for example) the system reboots after a random amount of time. I've tried several 4.x series kernels and have been able to reproduce the bug on all of them so far. Adding intel_idle.max_cstate=1 as suggested in this thread seems to mitigate the bug, albeit it locks the CPU at 2167 MHz.

I am not using X. I'm running Arch Linux with only a couple of services enabled (dhcpd and hostapd), as I'm using it mainly as a firewall/AP.

Comment 320 yuriy 2016-05-07 23:07:48 UTC

I have a problem with random freezing on kernel 4.2.
intel_idle.max_cstate=1 didn't help me.

Comment 321 Gabriel7340 2016-05-07 23:22:44 UTC

Is your processor an Intel BayTrail? Did you run update-grub after the changes? Can you try kernel 3.13.0-85-generic instead?
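For readers unsure what "the changes" involve on a GRUB-based Debian or Ubuntu system: the parameter goes into the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub, and update-grub then regenerates the boot menu. Here is a rough sketch of that text edit; `add_kernel_param` is a hypothetical helper that only transforms a string, and it assumes the value is double-quoted as in the stock Ubuntu file.

```python
import re


def add_kernel_param(grub_text, param="intel_idle.max_cstate=1"):
    """Set `param` in GRUB_CMDLINE_LINUX_DEFAULT, replacing any old value of the same key."""
    key = param.split("=", 1)[0]
    pattern = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)

    def patch(match):
        args = match.group(2).split()
        # Drop an existing setting of the same key so only the new value remains.
        args = [a for a in args if a.split("=", 1)[0] != key]
        args.append(param)
        return match.group(1) + " ".join(args) + match.group(3)

    # Only the first matching line is rewritten; text without one is returned unchanged.
    return pattern.sub(patch, grub_text, count=1)
```

The result would still have to be written back as root and followed by `sudo update-grub` and a reboot. Systems booting through something other than GRUB need the parameter added in their own boot loader configuration instead.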

Comment 322 Phil 2016-05-08 01:38:25 UTC

Gabriel7340: I think the N3150 is Braswell which is a refresh of Baytrail? [1,2] Should I file a separate bug? I am not using grub I'm using systemd-boot. 

[1] http://www.extremetech.com/extreme/202389-intel-quietly-launches-14nm-braswell-bay-trails-successor
[2] http://www.cnx-software.com/2015/04/01/intel-introduces-celeron-n3000-n-3050-n3150-and-pentium-n3700-low-power-braswell-processors/

Comment 323 yuriy 2016-05-08 10:38:23 UTC

(In reply to Gabriel7340 from comment #321)
> Your processor is an Intel BayTrail? Did you update-grub after the changes?
> Can you try kernel 3.13.0-85-generic instead?

Sorry, comment #320 is not valid. I have used intel_idle.max_cstate=2. Right now it's at 6 hours of uptime without freezes.

Are there any patches that fix this issue?
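Mix-ups like the one in this comment are easy to avoid by checking which value the running kernel actually received before reporting results. A small sketch, assuming the command line is taken from /proc/cmdline; `max_cstate_from_cmdline` is an illustrative name, and treating the last occurrence as the effective one is an assumption about how repeated parameters are handled.

```python
def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value found in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        name, sep, raw = token.partition("=")
        if name == "intel_idle.max_cstate" and sep:
            try:
                value = int(raw)  # assumption: a later occurrence overrides an earlier one
            except ValueError:
                pass  # ignore malformed values instead of guessing
    return value


# Hypothetical usage on a live system:
#   with open("/proc/cmdline") as f:
#       print(max_cstate_from_cmdline(f.read()))
```

If the intel_idle driver is in use, the same value should also be readable from /sys/module/intel_idle/parameters/max_cstate, which makes a useful cross-check.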

Comment 324 Mort Yao 2016-05-08 17:44:23 UTC

The freeze recurred for me today on 4.1 with the legacy-turbo patch, for no apparent reason (no GPU- or CPU-intensive processes were even running). So I would say no, there is no valid patch that completely fixes the issue at this point. (4.1 indeed performs better than 4.2+, so that would be another lockup issue on 4.2+)

Comment 325 Gabriel7340 2016-05-08 18:36:25 UTC

(In reply to yuriy from comment #323)
> (In reply to Gabriel7340 from comment #321)
> > Your processor is an Intel BayTrail? Did you update-grub after the changes?
> > Can you try kernel 3.13.0-85-generic instead?
> 
> Sorry comment #320 is no valid. I have used intel_idle.max_cstate=2. Right
> now its 6 hours up-time without freezes.
> 
> Is there some patches that fix this issue?

http://www.hardwaresecrets.com/celeron-n3150-cpu-review/

"They come to replace the Bay Trail-D CPUs, actually using the same microarchitecture, ..." 

I think you are right.

For now the best solution is the intel_idle.max_cstate workaround. You can find some patches but I'm not sure whether they work. Another solution could be to compile the kernel without some commits like "Agressive downclocking on Baytrail/drm/i915".

Comment 326 alvararo 2016-05-08 23:41:08 UTC

I had freezes on an Acer laptop with a Pentium N-3540 (Bay Trail). Now I'm using 3.13.0.85, and I can confirm no freezes at all.
From 3.16 and above, 4.1.12 is the one that works best for me, freezing only after many hours.
Using 4.2 and above, the problem gets worse, with the system freezing a few minutes after switching on.

Comment 327 Austin 2016-05-10 15:50:20 UTC

I figured I'd post my experience with this bug and how I avoid it.

I have run into this problem ever since I bought my Inspiron 3000 with an N3540 CPU. I'm running openSUSE Tumbleweed, always with the latest kernel.

The system would freeze up, usually within 15 minutes of booting. I originally thought it was my SSD, as it would often happen when accessing the disk, but it got worse and would eventually happen even when sitting idle under little load. Also, the fan would run at full speed from boot to crash.

If I suspend to ram as soon as the system has booted to the desktop, then bring the system out of suspend, the problem nearly goes away. The fan will work normally (it rarely kicks on unless I'm doing something crazy) and I can go long stretches without a crash. The problem is not completely gone...I'll crash maybe once a week. But it is much better than every 15 minutes. And if I forget to do the suspend trick after a reboot, I'm reminded quickly as it will crash within minutes EVERY time. 

I have never modified the idle_cstate as others have suggested. Perhaps my experience can help someone.

Comment 328 jbMacAZ 2016-05-11 06:34:53 UTC

(In reply to Austin from comment #327)
> I figured I'd post my experience with this bug and how I avoid it.
> 
> Run into this problem ever since I bought my Inspiron 3000 with N3540 cpu.
> Running opensuse tumbleweed, always with latest Kernel.
> 
> System would freeze up, usually within 15 minutes of booting. I originally
> thought it was my SSD as it would often happen when accessing the disk, but
> it got worse and would eventually happen even when sitting idle under little
> load. Also, fan would also run at full speed from boot to crash.
<snip>

I also have your model of Dell laptop. intel_idle.max_cstate=1 does work on my Dell.  I've run various kernels from 3.19 - 4.5 with Mint, Manjaro and Cubuntu.  Without ..cstate, I experience the screen freeze and runaway fan speed.

Comment 329 Daniel Bilik 2016-05-12 15:46:08 UTC

Like others, I've been also fighting this for several months. But it seems that _combination_ of "tentative" patches from Mika Kuoppala (see comment #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it with deeper C-states enabled. See this post...

http://lists.dragonflybsd.org/pipermail/users/2016-May/249603.html

... so that I don't repeat myself. :)

HTH.

Comment 330 Xermán 2016-05-12 20:53:56 UTC

(In reply to Daniel Bilik from comment #329)
> Like others, I've been also fighting this for several months. But it seems
> that _combination_ of "tentative" patches from Mika Kuoppala (see comment
> #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has
> finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it
> with deeper C-states enabled. See this post...
> 
> http://lists.dragonflybsd.org/pipermail/users/2016-May/249603.html
> 
> ... so that I don't repeat myself. :)
> 
> HTH.

Thanks for your research Daniel, that looks promising.

Comment 331 jbMacAZ 2016-05-12 23:38:59 UTC

(In reply to Daniel Bilik from comment #329)
> Like others, I've been also fighting this for several months. But it seems
> that _combination_ of "tentative" patches from Mika Kuoppala (see comment
> #c202) _and_ "legacy turbo" patch (comments #c93, #c98 and #c287) has
> finally stabilized i915 driver on my system (Asrock Q1900-ITX) to run it
> with deeper C-states enabled. See this post...

Can I ask which kernel and processor family you are running?  I can't seem to replicate your success on my setup (various patched kernels 4.2 - 4.6rc, Atom Z3775).

While I can't definitively rule out a hardware platform issue, I am freeze free with ..cstate=1.  Newer kernels do take longer before freezing than older ones.

Comment 332 Daniel Bilik 2016-05-13 23:15:34 UTC

(In reply to jbMacAZ from comment #331)

> Can I ask which kernel and processor family you are running?

I run Dragonfly BSD on an Asrock Q1900-ITX with an Intel Celeron J1900. Dragonfly has its drm infrastructure imported from the linux kernel, with both intel and amd drivers being regularly updated. I started to experience machine freezes when the i915 driver in Dragonfly was synced to what's in linux 4.0, and I had to limit the CPU to C1. When Dragonfly synced i915 to linux 4.1, it made my system stable again, even with deeper C-states. But with the update to a version from linux 4.2, the freezes were back again. I was struggling with this for months, but with the two patches I've mentioned in my previous post my system has been running stable for several weeks now, with deeper C-states enabled. In the meantime, the i915 driver in Dragonfly was synced with linux 4.3, so I've updated my system this week, keeping the patches, and it still runs stable (it's been just a few days, but without the patches I was experiencing a freeze practically each day).

> While I can't definitively rule out a hardware platform issue, I am freeze
> free with ..cstate=1.

Well, because I use i915 driver in Dragonfly, I can't really confirm that the patches solve freezes on linux completely. But so far, it seems to be sufficient to make my system stable with deeper C-states, so I can definitely say the patches positively influence stability of i915 driver on Baytrail.

> Newer kernels do take longer before freezing than older ones.

In my experience, the system uptime and/or load doesn't seem to matter. Sometimes the system was running stable for two days, sometimes it froze after two hours. In fact, it always froze when the system was "doing nothing" and I just moved the mouse pointer or scrolled an already loaded page in firefox.

Comment 333 jbMacAZ 2016-05-14 00:48:21 UTC

(In reply to Daniel Bilik from comment #332)
> (In reply to jbMacAZ from comment #331)
> 
I appreciate the information and insights.  Perhaps there are additional factors affecting freezes from outside the drm code that aren't present in dragonfly.

---- 

On a different subject, is anyone getting a blank screen lockup starting with 4.6-rc7 and 4.5.4?  System runs for a while, seems fine and then suddenly the screen goes black, locked up.  I think maybe some of the bug fixes for this freeze bug may be almost right but now the symptom has changed from a static display to a black screen.  Just a feeling so far, but it needs the same hard reset to recover, so no dmesg to inspect.  Less recent kernels are still stable with cstate, so I don't think it's a hardware fault.  

Are the hunter patches now obsolete in 4.6-rc7/4.5.4?  My tests still use 2 of them that I had to use in earlier kernels.  If they aren't needed anymore, using them could explain this new issue.

Comment 334 Dimitris Roussis 2016-05-14 10:51:13 UTC

I installed the CloudReady distro (http://www.neverware.com/) and there are no freezes anymore.

It is very strange, because this version uses Linux kernel 4.0.5!!

In all the Linux distros I have used, I get freezes with any kernel above 3.16.7.

What is the difference in the Chromium OS Linux kernel??

Comment 335 Hal 2016-05-14 14:46:21 UTC

Anyone tried to install a linux-libre kernel and see if it would work better? 

I'm planning on trying the one from here: http://linux-libre.fsfla.org/pub/linux-libre/freesh/pool/main/l/linux-4.5.4-gnu/

But before doing so I would like any feedback you might provide, as I have no experience with linux-libre kernels and do not know what I will be missing (that is, what will break on my system) once I install it.

Regarding my systems' freeze status since I applied intel_idle.max_cstate=1; well no more freezing but both machines run noticeably warmer. Those boxes are small and cramped. They only have passive cooling. 

One other thing I noticed and which has alarmed me is that on both machines one of the cores runs at 100% for long periods (tens of minutes), then falls to normal levels for a minute or so and then goes up again creating a cycle.
I don't remember having seen that when I first started to use cstate=1, so I am not sure if the two are connected. But, I am certain that something is wrong with this behaviour.

Hal
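The "one core at 100%" pattern described above can be confirmed without a GUI monitor by sampling /proc/stat twice and comparing the per-core counters. The following is only a sketch: `per_core_busy` is a name invented here, it counts idle plus iowait as idle time, and it sums the first eight fields so that guest time (already included in user time) is not counted twice.

```python
def per_core_busy(before, after):
    """Return {core: busy fraction} between two snapshots of /proc/stat text."""
    def parse(text):
        cores = {}
        for line in text.splitlines():
            parts = line.split()
            # Keep per-core lines (cpu0, cpu1, ...) and skip the aggregate "cpu" line.
            if parts and parts[0].startswith("cpu") and parts[0][3:].isdigit():
                values = [int(v) for v in parts[1:]]
                idle = values[3] + (values[4] if len(values) > 4 else 0)
                cores[parts[0]] = (sum(values[:8]), idle)
        return cores

    first, second = parse(before), parse(after)
    result = {}
    for core, (total_b, idle_b) in second.items():
        if core in first:
            total = total_b - first[core][0]
            idle = idle_b - first[core][1]
            result[core] = 0.0 if total <= 0 else (total - idle) / total
    return result
```

Taking the two snapshots a few seconds apart and repeating over several minutes should show whether one core really stays near 1.0 while the others idle.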

Comment 336 heimdall_cuba 2016-05-14 15:46:52 UTC

I spent more than 4 months trying to solve the freezing problem on my laptop with an Intel Pentium N3540 Bay Trail. Reading this thread I found the solution to my problem: setting intel_idle.max_cstate=1. Since then I have had no more problems. However, I do not quite understand what I have modified, or what long-term problems it may cause on my laptop.

Comment 337 Martin 2016-05-15 09:59:22 UTC

I applied Daniel's patches (comment 329) to 4.5.4 but alas, it froze in the end. Back to max_cstate=1 again.

Comment 338 Maurizio 2016-05-15 11:30:07 UTC

Hope I'm not too optimistic, but I'm trying 4.6.0-rc7-g44549e8 (I'm using Arch Linux and this is what was available in the AUR repositories 2 days ago) and so far I've experienced no crashes (almost 2 days of continuous uptime with normal use of the system).

Previously the RC3 version crashed as well.

Comment 339 Justin 2016-05-15 13:50:34 UTC

On my Dell 3531 with the Baytrail processor, I have linux-image-4.6.0-rc7-amd64 installed from Debian experimental (https://packages.debian.org/experimental/kernel/linux-image-4.6.0-rc7-amd64). I still get the crashes without intel_idle.max_cstate=1.

With intel_idle.max_cstate=1 there are no crashes.

Comment 340 Hal 2016-05-15 15:21:23 UTC

(In reply to Hal from comment #335)
> Anyone tried to install a linux-libre kernel and see if it would work
> better? 
> 
> I'm planning on trying the one from here:
> http://linux-libre.fsfla.org/pub/linux-libre/freesh/pool/main/l/linux-4.5.4-
> gnu/
> 
linux-libre v4.5.4 from above repo installed well and worked on Linux Mint 17.3 but froze eventually. So the binary free version of the kernel is not any better than the regular kernel.
Hal

Comment 341 Maurizio 2016-05-16 10:42:09 UTC

Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems to work pretty well (my CPU is a celeron N2930). 
 
I'm not really an expert so I would assume that g44etc is the commit. Worth checking what is the one used for the debian RC7 build... and what has been done in between (assuming it's not a later one and they broke it again :) )
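The guess about "g44etc" matches the `git describe` convention, where a `g` prefix marks an abbreviated commit hash. To compare builds between reports, the pieces of such a release string can be pulled apart as sketched below; `split_kernel_release` and its regular expression are illustrative and only cover the string shapes seen in this thread.

```python
import re

# base version, optional "-rcN", optional "-g<abbreviated commit hash>"
_RELEASE = re.compile(r"^(\d+\.\d+(?:\.\d+)?)(?:-rc(\d+))?(?:.*?-g([0-9a-f]{7,40}))?")


def split_kernel_release(release):
    """Split e.g. '4.6.0-rc7-g44549e8' into (base version, rc number, commit)."""
    match = _RELEASE.match(release)
    if not match:
        return None
    base, rc, commit = match.groups()
    return base, (int(rc) if rc else None), commit


print(split_kernel_release("4.6.0-rc7-g44549e8"))  # ('4.6.0', 7, '44549e8')
```

Distribution builds such as 4.6.0-040600rc7-generic carry no commit hash, so the last element comes back as None and the exact source has to be looked up in the distribution's own packaging.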

Comment 342 Maurizio 2016-05-16 10:44:31 UTC

(In reply to Maurizio from comment #341)
> Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems
> to work pretty well (my CPU is a celeron N2930). 
>  
> I'm not really an expert so I would assume that g44etc is the commit. Worth
> checking what is the one used for the debian RC7 build... and what has been
> done in between (assuming it's not a later one and they broke it again :) )

Also wanted to add that sensors now report a cpu core temperature 10 degree lower than with a 4.5 kernel with max_cstate=1 ...

Comment 343 MarkB 2016-05-17 08:00:08 UTC

Digging around a little, I am seeing many people use the word 'latency' and suggest that one cause of the problem may be that the interrupts issued to wake up the CPU from a deeper idle state are somehow causing the freezing issue. Coupled with the talk above about alternative kernels, I wanted to ask whether anyone has tried any of the alternative Ubuntu kernels: low-latency, real-time, etc.?

Comment 344 Daniel Bilik 2016-05-17 08:42:53 UTC

(In reply to Maurizio from comment #341)
> Well I have now 3 solid days of uptime... so for me 4.6.0-rc7-g44549e8 seems
> to work pretty well (my CPU is a celeron N2930). 

Indeed, looking through changes merged into 4.6-rc7 in past weeks, there are several commits to i915 driver claiming to solve hangs. And specifically for Baytrail, these two are interesting the most:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=4ea3959018d09edfa36a9e7b5ccdbd4ec4b99e49
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=1b3e885a05d4f0a35dde035724e7c6453d2cbe71

If I read it correctly, the first one fixes the same problem with rps thresholds that one of Mika's "tentative" patches was trying to. And second one fixes "timing cruical" ringbuffer issue that IMHO could be causing hangs at random places. BTW, timing may be that additional factor mentioned by jbMacAZ in comment #c333, and it would explain why "legacy turbo" + "tentative" patches work for me and not for others - timing on Linux vs. Dragonfly definitely is different.

Anyway, I've swapped my "combo" patch for those commits mentioned above, and I'm currently testing it. Because, to be honest, the patches I've been using so far, despite making my system stable with deep C-states, smell a little "hackish". And those commits, besides being "official", look more like "the proper solution".

Comment 345 dan.g.tob 2016-05-17 10:27:29 UTC

I'm up to 22 hours uptime now with 4.6 vanilla without intel_idle.max_cstate=1. I'm using the ubuntu built packages on debian 8 (afaik there are no external patches). I have a lenovo ideapad 100s with an atom Z3735F


http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-yakkety/

Comment 346 Chris Daniel 2016-05-17 11:12:34 UTC

4.6.0-1 from the Arch testing repo just brought down my Z3735F.

Comment 347 Daniel Bilik 2016-05-17 13:01:12 UTC

(In reply to Daniel Bilik from comment #344)
> (In reply to Maurizio from comment #341)
> ... and I'm currently testing it.

No luck, I got a freeze in less than a day. Commits 4ea3959+ and 1b3e885+ alone do not seem to be enough to prevent hangs. Back to my (somewhat dirty but working) patchset.

Comment 348 Martin 2016-05-17 13:23:46 UTC

Vanilla 4.6.0+ without max_cstate=1 still freezes up for me too, sticking to max_cstate=1 for now.

Comment 349 Maurizio 2016-05-17 15:04:57 UTC

Well, it was too good to be true :) After 4 full days of uptime I got a crash a few minutes ago. I just rebooted the machine and will let it run again to see whether it was luck or whether I can now at least get some days of uptime.

I will also update the kernel to the latest next time it crashes.

Comment 350 jbMacAZ 2016-05-17 18:36:07 UTC

I am having issues with 4.6.  ..cstate=1 no longer prevents the ordinary display freeze (GUI locked, CPU activity = 0%).  intel_idle.max_cstate=1 had been a reliable workaround (for me) since 4.2.6.  Asus T100CHI (Z3775), Ubuntu 16.04.  The kernel is minimally patched for Bluetooth device ID and other hardware bits not yet supported by mainstream.  The patches were proven in earlier kernels and pruned as necessary with each new kernel release.  4.6-rc5 was still freeze free with cstate; I'm unsure about rc6; rc7 also froze with a black screen or a soft freeze (mouse cursor moved freely, display updating only once every 1-2 minutes).

I choose to be optimistic, that the freeze bugs are being worked on now and another edit or two will finish fixing them.  It sounds like the current changes work, as is, for other systems.

Comment 351 Tal Liron 2016-05-17 18:41:58 UTC

Do the people committing the fixes on Linux now know about the testing we are doing here? Could someone here with some authority notify them?

Comment 352 Daniel Glöckner 2016-05-17 22:19:50 UTC

Please everyone, keep this on topic.

If your cursor updates when you move your mouse, this is not your bug.
If your screen turns black, this is not your bug.
If you can still SSH into or ping your device, this is not your bug.
If it's just some application dying, this is not your bug.
If you still see freezes after max_cstate=1, this is not your bug.

There may be other problems with Bay Trail that might show some of these symptoms, but this is not the correct Bugzilla entry to discuss them.
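The exclusion checklist above can be written down as a tiny triage helper, which may be handy when sorting a batch of reports from many machines. This is just a direct transcription of the five criteria as listed; the function and argument names are invented for illustration.

```python
def matches_bug_109051(cursor_moves, screen_black, reachable_over_network,
                       only_an_application_died, freezes_with_max_cstate_1):
    """Return True only if none of the exclusion criteria from comment 352 apply."""
    exclusions = (
        cursor_moves,               # cursor still updates when the mouse is moved
        screen_black,               # screen turned black instead of freezing
        reachable_over_network,     # device still answers SSH or ping
        only_an_application_died,   # just some application died
        freezes_with_max_cstate_1,  # still freezes with max_cstate=1
    )
    return not any(exclusions)
```

A report for which this returns False may still describe a real Bay Trail problem; by the comment's own wording it just belongs in a different Bugzilla entry.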

Comment 353 Gabriel7340 2016-05-18 01:11:04 UTC

I'm using now version:
4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux 

Kernel without any crash! :-)

Comment 354 jds 2016-05-18 05:44:11 UTC

Tried 4.6 RC7 from ubuntu kernel release page.

4.6.0-040600rc7-generic #201605081830 SMP Sun May 8 22:32:57 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Full lock-up after a few minutes.  Reverted to max_cstate=2.



Comment 355 Maurizio 2016-05-18 10:35:55 UTC

Updated to 4.6.0-g2dcd0af, no luck: it froze after half a day. Green screen, completely hung.

Either they broke it again or the 4 days of uptime were just a lucky shot.

Anyway, is the maintainer of this component actually aware of the bug? This is still in "NEW" state with no official updates (it also lists kernels only up to 4.2, while 4.6 is also affected).

The max_cstate is not really a proper workaround, the power consumption as well as temperature goes up dramatically.

Comment 356 Andrew Clayton 2016-05-18 12:19:01 UTC

> Anyway, is actually the maintainer of this component aware of the bug? This

This bug is assigned to Len Brown and he has commented here, so *he* at least is aware of this.

However, I fear (as has already been mentioned in earlier comments) that this bug report has long since lost any usefulness it might once have had and has just turned into a dumping ground for random comments and updates, and now reads like some web forum thread.

Comment 357 jds 2016-05-18 16:12:57 UTC

Not so.  The bug discourse may have become a bit ragged due to the age of the bug and the near-total non-response by the owner or by kernel people. But there's a perfectly clear thread: every kernel from at least 3.19 through 4.6 locks up hard on BayTrail and Broadwell systems after minutes or hours.

jds

Comment 358 Maurizio 2016-05-18 17:26:42 UTC

(In reply to Andrew Clayton from comment #356)
> > Anyway, is actually the maintainer of this component aware of the bug? This
> 
> This bug is assigned to Len Brown and he has commented here, so *he* at
> least is aware of this.
> 
> However, I fear (and has already been mentioned in earlier comments) this
> bug report has long since lost any usefulness it might have once had and has
> just turned into a dumping ground for random comments and updates and now
> reads like some web forum thread,

Well, most of the people experiencing the problem refer to this thread. It should be the best place to look for information, so why not ask for some cooperation?

The bug has been open since December with no status change whatsoever, but reports of Baytrail hanging date back to Oct 2014, when 3.17 was released. This is a pretty serious problem, as it is preventing Linux from running properly on a very large number of systems... it just doesn't look like it is getting the right attention.

Comment 359 julio.borreguero@gmail.com 2016-05-18 17:48:17 UTC

I tried the drm-intel kernel 4.6.0-rc7 on an N2940.
The system froze after 2 days.

I understand people's frustration too: a serious bug with a clearly defined thread and only some frustrated users commenting.
Still, the vast majority here are helping with testing and reporting for their platform.
Nonexistent feedback, no attention at all; we don't even know if the maintainer is still alive ;)
How is it not going to read like a forum thread after more than 2 years, with many major kernel versions since its appearance?

Comment 360 jbMacAZ 2016-05-18 18:18:20 UTC

(In reply to Daniel Glöckner from comment #352)
> Please everyone, keep this on topic.
> 
> If your cursor updates when you move your mouse, this is not your bug.
> If your screen turns black, this is not your bug.
<snip>> 
> There may be other problems with Bay Trail that might show some of these
> symptoms, but this is not the correct Bugzilla entry to discuss them.

Since removing the Hunter patches(see comments #55, #103) from my 4.6 build, I have not had a recurrence of those alternate freezes.

Comment 361 nw9165-3201 2016-05-18 23:05:31 UTC

@ Len Brown:

Any chance you could give us an update on this issue?

It would be much appreciated.

Regards

Comment 362 Gabriel7340 2016-05-21 17:30:42 UTC

(In reply to Gabriel7340 from comment #353)
> I'm using now version:
> 4.6.0-040600rc1-generic #201603261930 SMP Sat Mar 26 23:32:43 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux 
> 
> Kernel without any crash! :-)

Sorry, the bug came back :/ I went back to kernel 3.13.0-85-generic again :/

Comment 363 Libor Chmelik 2016-05-21 20:05:26 UTC

I'm now using kernel 4.6.0-040600-generic from Ubuntu, since my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) runs on Mint 17.3

On kernel 4.5 with cstate=1 it worked flawlessly. 

After approx. 1 hour on 4.6 without cstate=1 it froze again during playback of an HD movie on VLC.

Trying 4.6 now with cstate=1

Also I can't downgrade the kernel lower than 4.5 because then the shutdown/reboot hanging/freezing issue on this machine would be back again. :-(

Comment 364 jbMacAZ 2016-05-22 01:32:30 UTC

4.5.5 has the same broken cstate bandaid as 4.6.  In other words, both kernels freeze and ..cstate=1 no longer stops it.  4.6-rc7 did the same thing.  

Anyone have a new workaround?  I can always use 4.4.11 which is actually running pretty well now.  It just doesn't support all of my hardware (eg sound.)

Comment 365 Justin 2016-05-22 01:49:59 UTC

intel_idle.max_cstate=4 appears to work with rc7, so we can get more of the power savings.

Comment 366 tim 2016-05-24 00:23:27 UTC

Just wondering if this is not the droid we're looking for...

On an unrelated development - saw a lot of jitter across different BYT platforms, excessively so, and not just on J1900, but also on the Z3537G.

Digging into things, and pulling an older IVB based 1037U box, saw the same thing.

Setting intel_idle.max_cstate=1 sort of solved the problem for the most part (this is with ubuntu-server 4.4.0-22). By "sort of" I mean it worked around it.

Reverting the valleyview change out that was in 3.16 kind of fixed it - e.g. no more freezes on the BYT devices - but the IVB never had the freezes in the first place with 4.4.0.

Hmmm... something tells me there's more to this problem than just the graphics driver. I just don't have handy the gear needed to get deeper into the HW - e.g. JTAG and Protocol Analyzer these days - but I'm suspecting that there is something going on with the timing on both BYT and IVB, and I suspect Haswell, Braswell, and later..

Comment 367 tim 2016-05-24 04:22:55 UTC

Hokay - did some more debug/testing - pthread crashes are inconsistent, when looking at stack dumps.

Munged the UEFI to keep the BYT running at a constant speed, and things are fine. Same with IVB - weird... with constant clocks, it's all good. Let the cores sleep a bit too much - boom - worrisome as this could lead to data corruption that folks wouldn't see immediately.

While I can't dig into the HW - long time back on ARM with an RTOS, we found that dynamic clocks could lead to issues with cpu clocks and mem states with mem reads specifically - cpu would read before memory was ready.

Since BYT and IVB use the uncore/system agent for both CPU and GPU, this is the area of interest, as the uncore controls timing for everything.

Probably need someone from intel systems to sort out this, as this is all their stuff inside.

Comment 368 Gabriel7340 2016-05-25 22:58:18 UTC

Thanks for the debug! Good work. I'm still analysing the code, but, as you suggest, someone from intel can see more accurately and quickly what is really going wrong.

Comment 369 joev.mi 2016-05-26 12:44:28 UTC

With kernel 4.4.9 I am now seeing lockups fairly frequently. I had started to see some with FC22 and thought it was related to the nouveau driver, so I updated to FC23 on May 1; it was due anyway. The lockups were apparently resolved. That was with kernel 4.4.6. With kernel 4.4.9 I am now seeing lockups on a daily basis. Just prior to one such lockup I noticed a log entry:
"May 24 19:52:50 xps8700.durand8450.info kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
-- Reboot -- "
So I found this thread. Are we now talking about something to do with the kernel and cstate, or a pci/msi interaction? I have a Dell XPS8700 desktop; I looked in my BIOS setup and see nothing related to cstate. I tried pci=nomsi in the GRUB entry and got no relief. I thought perhaps my BIOS is too old (I'm at A07, Dell's latest is A11). I tried flashing it from FreeDOS but it fails to burn. Anyhow, I am now running kernel 4.4.6-201, which seems stable.

Comment 370 joev.mi 2016-05-26 13:35:01 UTC

... finally read carefully enough to figure out the other method to set cstate is to qualify the kernel invocation in GRUB.  I am now running kernel 4.4.9 with cstate set to 1.  It hasn't locked up in half an hour.

Comment 371 joev.mi 2016-05-26 17:40:25 UTC

... but not really much longer. The log reports that cstate = 1 was reached shortly after reboot, then records 'NMI watchdog: Watchdog detected hard LOCKUP on cpu 0' about two hours after boot. Hmm. I think I've tried both workarounds. I am apparently able to run with kernel 4.4.6.

Comment 372 joev.mi 2016-05-26 17:49:23 UTC

... and I could have mentioned that about 14 minutes after the watchdog notice above I see the following in the log:
kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
kernel:         0-...: (1 GPs behind) idle=643/1/0 softirq=79517/79517 fqs=320010
kernel:         (detected by 3, t=960034 jiffies, g=113164, c=113163, q=0)
kernel: Task dump for CPU 0:
kernel: swapper/0       R  running task        0     0      0 0x00000008
kernel:  ffffffff8163d5af 000000001ec13d60 0000066d9d796809 ffffffff81d3b0c0
kernel:  ffffffff81c04000 ffff88021ec1fb00 ffffffff81cc1040 ffffffff81c00000
kernel:  ffffffff81c03ec0 ffffffff8163d797 ffffffff81c03ed8 ffffffff810e6752
kernel: Call Trace:
kernel:  [<ffffffff8163d5af>] ? cpuidle_enter_state+0xff/0x2b0
kernel:  [<ffffffff8163d797>] ? cpuidle_enter+0x17/0x20
kernel:  [<ffffffff810e6752>] ? call_cpuidle+0x32/0x60
kernel:  [<ffffffff8163d773>] ? cpuidle_select+0x13/0x20
kernel:  [<ffffffff810e6a10>] ? cpu_startup_entry+0x290/0x350
kernel:  [<ffffffff8179513c>] ? rest_init+0x7c/0x80
kernel:  [<ffffffff81d6201e>] ? start_kernel+0x498/0x4b9
kernel:  [<ffffffff81d61120>] ? early_idt_handler_array+0x120/0x120
kernel:  [<ffffffff81d61339>] ? x86_64_start_reservations+0x2a/0x2c
kernel:  [<ffffffff81d61485>] ? x86_64_start_kernel+0x14a/0x16d

at which point the log stops.
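
For anyone trying to confirm whether their logs contain the same two signatures quoted in this comment and in comment 369, the search can be sketched as a small filter. This is only an illustration: the function name `find_lockup_lines` is made up here, and the `journalctl` usage in the comment assumes a systemd journal with persistent storage so that the previous boot is still available.

```shell
#!/bin/sh
# Sketch: keep only the kernel log lines that match the two signatures
# quoted in this thread. Reads log text on stdin, prints matches on stdout.
# (find_lockup_lines is a hypothetical helper name, not an existing tool.)
find_lockup_lines() {
    grep -E 'NMI watchdog: Watchdog detected hard LOCKUP|rcu_sched detected stalls'
}

# Possible use against the previous boot's kernel messages
# (assumes persistent journald storage):
#   journalctl -k -b -1 | find_lockup_lines
```

A match only shows that the watchdog or RCU noticed a stuck CPU; as later comments point out, that alone does not prove the cause is the Bay Trail issue tracked here.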

and I'm still working on why my BIOS update won't actually update.

Comment 373 Javier Antonio Nisa Avila 2016-05-27 06:20:24 UTC

Hi Guys!!!

I have the same bug on a thinkpad e11 N2930.

This bug is similar: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467

People at Canonical have built a test kernel with a possible solution.

Can you test it?

http://kernel.ubuntu.com/~jsalisbury/lp1575467

Thanks.

Comment 374 joev.mi 2016-05-27 12:44:45 UTC

just for the record my current state is running on kernel 4.4.6 out of the box, no tweaks in GRUB.  I was too hasty when I inferred I was running well with kernel 4.4.6 with limit on cstate.

Comment 375 Len Brown 2016-05-27 16:57:10 UTC

The only Linux-based commercial product that used BYT
was based on the Android snapshot/fork of the Linux kernel,
not the upstream Linux kernel.

Nobody knows why the Android version of Linux
is stable on this hardware, while upstream Linux is not.
There have been several de-bunked theories.

No, it isn't a bug in the intel_idle driver -- you'll
have the same results with "intel_idle.max_cstate=0",
which will run the acpi_idle driver.

The cause is likely due to an SOC device other than the CPU.

Comment 376 Maurizio 2016-05-27 17:27:24 UTC

(In reply to Javier Antonio Nisa Avila from comment #373)
> Hi Guys!!!
> 
> I Have the same bud in thinkpad e11 N2930
> 
> This bug is similar
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467
> 
> And people of canonical build a test kernel with possible solution.
> 
> Can you probe??
> 
> http://kernel.ubuntu.com/~jsalisbury/lp1575467
> 
> Thanks.

If I understand the various comments correctly, they are doing a bisection to find out which commit caused the issue, but it's not (yet) a possible solution.
Of course the problem is so widespread that a lot of duplicated effort is being made.

Comment 377 joev.mi 2016-05-30 18:20:22 UTC

After more watching, perhaps I'm on the wrong thread. I can now see that I get the system freezes regardless of the cstate workaround or the pci=nomsi workaround, and regardless of which of the installed kernels I select from 4.4.6, 4.4.8 and 4.4.9. I've tried disabling the watchdog, not because I think there is a causal relationship but because the watchdog is often logging an alert just before a lockup event. I was momentarily optimistic that disabling the watchdog might change my system event from a system freeze to a crash, which I would have preferred.
    I have been running with intel_idle.max_cstate=0 for the past several days. I am noticing a lot of "(tracker-miner-fs:1951): Tracker-CRITICAL" events in the log. I plan to go back to the beginning of my searching to see if I can make a better match to what I'm seeing.

Comment 378 BzukTuk 2016-05-30 20:14:43 UTC

Hi again,
Kernel 4.6 + Mika Kuoppala's 3 _tentative_ patches + linux-999-i915-use-legacy-turbo.patch = over 120h in one single session (without reboot/sleep) and roughly another 20 hours in a few 3-4 hour sessions without a single freeze. Still counting...

(some of Adrian Hunter's patches for pm/mmc were also applied, but I don't think (hope) this matters)

Kernel 4.5.x + Mika's 3 _tentative_ patches + linux-999-i915-use-legacy-turbo.patch = first freeze after about 8 hours, second freeze a few minutes after boot

(all of the above was without any intel_idle.max_cstate parameter)

Thanks to Daniel Bilik for this "combo".

Maybe something new in v4.6 fixed the last hole... Or maybe I'm just lucky.

Comment 379 Mina Demian 2016-05-31 10:46:54 UTC

I can confirm that this fix worked for me on Ubuntu 14.06 (kernel: 4.4.0-22-generic) on an Acer Aspire E15 laptop, dual-booting with Windows 8. There have been no irrecoverable freezes since applying the fix yesterday, but there have been a few times where it slowed down to almost freeze. Thankfully, it saved itself.

Comment 380 joev.mi 2016-05-31 13:45:57 UTC

I believe I have confirmed my issue is not related to the subject of this thread.  I re-initialized tracker and all seems to have cleared up.  I thought the freezes I was seeing were right in line with descriptions above but unlike other reports I was getting no relief from the work-arounds that reportedly helped others.  Changing cstate to zero and disabling watchdog did help me focus on the real problem.  Thank you for your patience.

Comment 381 Dmitry 2016-05-31 14:16:10 UTC

From Documentation/kernel-parameters.txt:
>>  intel_idle.max_cstate=  [KNL,HW,ACPI,X86]
>>       0       disables intel_idle and fall back on acpi_idle.
>>       1 to 6  specify maximum depth of C-state.

acpi_idle is a different idle driver and could disable all C-states as well, depending on the ACPI tables.

So, you need to try this and check which states are enabled:
$ uname -a
$ cat /sys/devices/system/cpu/cpuidle/current_driver
$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name

For me it's like this:
>>  Linux venue11pro 4.1.25-dirty #337 SMP PREEMPT Wed May 25 01:53:43 MSK 2016
>>  i686 Intel(R) Atom(TM) CPU Z3770 @ 1.46GHz GenuineIntel GNU/Linux
>>  intel_idle
>>  POLL
>>  C1-BYT
>>  C6N-BYT
>>  C6S-BYT
>>  C7-BYT
>>  C7S-BYT

P.S. I should also mention that even with kernel 4.1.25, the mmc PM QOS patches and the legacy turbo patch I still have freezes. Disabling the SDIO wifi with the ath6kl driver prevents any lockup at all.
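
The check suggested in this comment (which idle driver is active, and which states it exposes) can be wrapped in a small function. This is a minimal sketch, not part of the original report: the name `show_cpuidle` and the `CPUIDLE_ROOT` override are invented for illustration, and the per-state `disable` file is only printed when the running kernel provides it.

```shell
#!/bin/sh
# Sketch: print the active cpuidle driver and every idle state of one CPU.
# CPUIDLE_ROOT is a hypothetical override so the function can be pointed at
# a copy of the sysfs tree; by default it reads the real location.
show_cpuidle() {
    root="${CPUIDLE_ROOT:-/sys/devices/system/cpu}"
    cpu="${1:-cpu0}"

    if [ -r "$root/cpuidle/current_driver" ]; then
        echo "driver: $(cat "$root/cpuidle/current_driver")"
    else
        echo "driver: unknown (no cpuidle sysfs entry)"
    fi

    for state in "$root/$cpu"/cpuidle/state*; do
        # Skips the unexpanded glob pattern when no state directory exists.
        [ -r "$state/name" ] || continue
        name=$(cat "$state/name")
        # "disable" reads 1 when the state has been switched off at runtime.
        if [ -r "$state/disable" ]; then
            disabled=$(cat "$state/disable")
        else
            disabled="?"
        fi
        echo "${state##*/}: $name (disable=$disabled)"
    done
}

show_cpuidle cpu0
```

With intel_idle.max_cstate=0 the driver line would be expected to show acpi_idle rather than intel_idle, which is the distinction this comment is drawing.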

Comment 382 joev.mi 2016-06-06 22:14:04 UTC

I'm baaack.  I've had additional events.  The best solution I could get to with the work arounds was to set cstate=0 and switch from gnome to xfce desktop.  And make sure I close firefox and thunderbird.  That combination really stretched out the events but still at least one a day.  I also tried disabling power management for the monitor, I had it set to turn the monitor off after 45 minutes of inactivity.  My next approach is to flash the computer with the latest release posted by Dell of AMI BIOS, A11.  If I haven't mentioned I'm running workstation fedora on a Dell XPS 8700 which I bought new two years ago.  It was still running BIOS A07 which it came with.  I completed the re-flash this afternoon.

I've been having some difficulty sorting out what pieces are really in play.  The symptoms I see sound like what is described above but I have not seen the relief others have reported from the work arounds.  Also, kernel seems to be implicated in the discussion but my events did not seem to be associated with a kernel upgrade (I didn't realize that until more recently.  I was running 4.4.8 for ten days until my events started.)

Comment 383 D. Hugh Redelmeier 2016-06-07 03:08:37 UTC

@joev.mi  This bugzilla entry is about Baytrail processors.  Your computer does not have one of those -- it uses "4th generation Intel Core processors".  Please start a separate bugzilla.  This one is confusing enough already.

I think that reports of Braswell/Cherrytrail problems are likely relevant.

Examples of Baytrail reported (above) as having the bug: Atom Z3735G, Atom Z3770, Celeron J1900, Celeron N2930, Celeron N2940, Pentium J2900, Pentium N3520, Pentium N3540

Examples of Baytrail reported (above) without seeming to have the bug: Celeron CPU N2830, Celeron CPU N2840

Examples of Braswell/Cherrytrail reported (above) as having the bug: Celeron N3050

Examples of Braswell/Cherrytrail reported (above) without seeming to have the bug: N3150, N3700

(I stopped reading at about comment 150)

Comment 384 Koen Roggemans 2016-06-07 04:48:16 UTC

Created attachment 219241 [details]
attachment-22682-0.html

Correction to the list: I have an N3150 that has the bug: with the
workaround with cstate=1 I have seen it freezing only once.


Comment 385 Libor Chmelik 2016-06-07 05:30:25 UTC

Since my last post and cstate=1 on kernel 4.6.0-040600-generic from Ubuntu, my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) running Mint 17.3 didn't freeze once.
I tried every usual cause possible (Fullscreen HD videos on youtube or in VLC. Browsing content loaded websites in chrome and firefox. Batch HD conversion in Handbrake, etc.).

No freezing or hanging so far.

The cstate=1 workaround seems to work for me so far.

Comment 386 Maurizio 2016-06-07 12:23:49 UTC

I'm really confused now. I switched from Arch to Debian stable (3.16.0 kernel), which didn't freeze once, as I was expecting (I've read the bug started with 3.17, not 3.16). Then I reinstalled Arch (standard) with kernel 4.5.4 and so far I haven't experienced a single freeze (I now have 48 hours of uptime).

I don't really have the expertise to tell whether the Arch team has applied any patch to the 4.5.4 kernel in the last couple of months. The only thing I did differently this time is disabling in the BIOS all components I really don't need (like the serial port, for example)... I will let it run for a couple more days to check if it keeps running, then I'll start playing with the BIOS, turning devices on or off again. Does this make any sense?
If I understand correctly what Len said, it's a problem with a device driver rather than with intel_idle?

Comment 387 Gabriel7340 2016-06-07 13:02:02 UTC

Did you disable cstates in bios?

Comment 388 Maurizio 2016-06-07 14:28:19 UTC

(In reply to Gabriel7340 from comment #387)
> Did you disable cstates in bios?

No... I've just disabled some devices I don't use, but I didn't do it with the bug in mind.  I will check it and take some notes to see if the BIOS settings (with the exception of cstates) can actually change something.

Comment 389 marco_silva85 2016-06-07 14:39:32 UTC

Does the freeze only happen when using X11 or a Desktop Environment? Am I safe if I only use my hardware without any intel driver or X11?

I want to use my Q1900 just as a server, in console mode.

Comment 390 ladiko 2016-06-07 18:54:14 UTC

I use an asrock q1900-itx without X and kernel 3.19, 4.2 and now 4.4 and no special settings. Running for a year without issues.

Comment 391 ladiko 2016-06-07 19:00:17 UTC

Ahh and i forgot to say that it was sorted out because it had all other known issues when running as a kiosk system.

By the way, when running as a kiosk system we had the problem that the USB ports stopped working after a random time. The devices don't even disappear when unplugged.  The exact same imaged installation has no issues on AMD Kabini or older Intel Core 2 Duo or Celeron 847. Is there anything known regarding this issue?

Because of all this trouble we moved to AMDs Kabini which works without any issues.

Comment 392 jbMacAZ 2016-06-08 18:03:18 UTC

The latest 2 maintenance releases of 4.5 & 4.6 seem to have restored the cstate work-around.  My T100CHI is running again without the classic freeze described here.  Many thanks to whoever restored the cstate work-around.

Comment 393 Maurizio 2016-06-09 07:53:56 UTC

(In reply to jbMacAZ from comment #392)
> The lastest 2 maintenance releases of 4.5 & 4.6 seem to have restored the
> cstate work-around.  My T100CHI is running again without the classic freeze
> described here.  Many thanks to whoever restored the cstate work-around.

This is my impression too ... I've upgraded yesterday to 4.6 kernel and no crashes for 15 hours so far. Before I had 4 days of up-time with 4.5.4.

Would be nice to have a confirmation, also to avoid that one of the next patches bring everything back.

Comment 394 Zhang Rui 2016-06-20 02:17:38 UTC

so what's the status now?

Comment 395 Michal Feix 2016-06-20 04:59:14 UTC

On Acer TravelMate B115-M (N2940 @ 1.83 GHz), with the latest BIOS, it still hangs occasionally with kernel 4.6.2. But it's definitely way better than with previous kernels. I removed the max_cstate=1 workaround about a week ago and the machine crashed only once or twice during the past 7 days. So to sum it up: still not 100% perfect, but definitely a huge improvement.

BTW - I've enabled the HW watchdog in the systemd configuration. When the machine hangs (display hangs, network hangs, mouse and keyboard not reacting, etc.), it is still automatically rebooted by the HW watchdog. If I understand correctly, this watchdog is independent of the kernel and should always be able to reboot a machine with a hung kernel. As these crashes became less frequent, I started to use the HW watchdog as a new temporary workaround to keep my machine up when being used remotely.
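
The comment does not say which setting was used, so the following is only a guess at what "HW watchdog in systemd configuration" might look like: a drop-in that sets systemd's `RuntimeWatchdogSec=` option in the `[Manager]` section. The helper name `write_watchdog_dropin`, the file name `10-watchdog.conf` and the 30s default are assumptions made for this sketch, and it presumes the board exposes a working watchdog device (typically /dev/watchdog).

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that asks PID 1 to arm the hardware
# watchdog. If systemd stops petting the watchdog (for example because the
# kernel is hung), the hardware resets the machine after the timeout.
# write_watchdog_dropin, 10-watchdog.conf and the 30s default are
# illustrative choices, not something taken from this thread.
write_watchdog_dropin() {
    dir="${1:-/etc/systemd/system.conf.d}"
    timeout="${2:-30s}"
    mkdir -p "$dir" || return 1
    printf '[Manager]\nRuntimeWatchdogSec=%s\n' "$timeout" > "$dir/10-watchdog.conf"
}

# Possible use (as root), followed by a reboot so the setting is picked up:
#   write_watchdog_dropin
```

This only shortens the outage after a hang; it does nothing to prevent the freeze itself.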

Comment 396 Libor Chmelik 2016-06-20 05:42:30 UTC

With the workaround cstate=1 on kernel 4.6.0-040600-generic from Ubuntu, my laptop (Acer Aspire E5-511-P7AT with Pentium N3540) is still running so far.

No freezing or hanging.

The only difference from kernels prior to 4.4 is that I disabled the onboard Broadcom Wifi/Bluetooth card/chip.
I'm using a Ralink USB dongle instead.

Comment 397 Maurizio 2016-06-20 08:15:36 UTC

With Arch linux kernel 4.6.2 - and clearly no max_cstate - I've experienced some occasional crashes but it was always during some heavy load of the machine (streaming hd videos from the network), while previously it crashed randomly after few hours even with the machine completely idle. So big improvement, I will try to stress the machine with max_cstate=1 to check if the crashes are due to the same problem or something else.

Comment 398 Sebastian Parschauer 2016-06-25 06:18:18 UTC

I have the same bug with an Intel Xeon E5-1620 v3 CPU, NVIDIA Quadro K620 and 256 GB NVMe SSD. I was wondering why both PCIe cards were affected. On the NVMe card I've seen an XFS file system corruption from time to time.
"intel_idle.max_cstate=1" fixed the problem with openSUSE Leap 42.1 (4.1 kernel). C-States are broken with Haswell CPUs affecting the PCIe cards!

See:
http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-update.pdf
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v3-spec-update.html
http://www.intel.com/content/www/us/en/processors/core/4th-gen-core-family-desktop-specification-update.html

HSX54: "A P-State or C-State Transition May Lead to a System Hang"
HSD38: "TSC May be Incorrect After a Deep C-State Exit"
HSD44: "Display May Flicker When Package C-States Are Enabled"
HSD50: "Throttling and Refresh Rate Maybe be Incorrect After Exiting Package C-State"
HSD60: "Processor May Not Enter Package C6 or Deeper C-states When PCIe* Links Are Disabled"
HSD77: "Graphics Processor Ratio And C-State Transitions May Cause a System Hang"
HSD104: "PCIe* Device’s SVID is Not Preserved Across The Package C7 C-State"

Comment 399 D. Hugh Redelmeier 2016-06-25 14:39:56 UTC

@Sebastian Parschauer #398: This is NOT the same bug.  Your system's processor is neither Baytrail nor Cherrytrail.  Please start a different bugzilla entry.

Certainly it would interest many people if c-states are broken in Haswell.

If you think that there is something relevant to bug 109051, add a comment here pointing to your new bugzilla bug.

Comment 400 Hong Zhang 2016-06-25 23:28:42 UTC

Hello everyone.
I just want to give some feedback.
I'm using a thin ITX N3150 board by SOYO and my OS is Arch Linux.
I ran into the same bug several days ago.
I changed my kernel to 4.4.13-1-lts (it's 4.4.14 now, but that should also work) and did nothing with the kernel parameters or the X configuration file; I have not encountered a screen freeze since (for more than 1 hour).
I then changed my kernel to 4.7-rc4, and the computer also works properly.

I tried adding "intel_idle.max_cstate=1" using efibootmgr:
efibootmgr -d /dev/sdb -p 1 -c -L "Arch Linux FallBack" -l /vmlinuz-linux -u "root=/dev/sdb2 rw initrd=/initramfs-linux.img i915.semaphores=1 intel_idle.max_cstate=1"
but it did not work; I ran into a screen freeze after just about 5 minutes. Maybe I did not add the parameter the right way?

Sorry for my poor English.

Comment 401 Paul Mansfield 2016-06-25 23:31:24 UTC

I would install rEFInd and then make that the primary boot target; it's much easier to configure rEFInd to boot linux with the desired parameters.

Comment 402 ladiko 2016-06-26 07:01:45 UTC

Use cat /proc/cmdline to see the parameters the running kernel was actually booted with (the boot image name in it usually includes the kernel version).
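
To check for one specific parameter rather than reading the whole line by eye, the same file can be parsed. This is a minimal sketch: `get_kernel_param` is a made-up helper name, and it splits on whitespace only, so quoted values containing spaces are not handled.

```shell
#!/bin/sh
# Sketch: print the value of one name=value kernel parameter, or return a
# non-zero status when it is absent. The second argument is an optional
# file to read instead of /proc/cmdline (useful for trying it out).
# get_kernel_param is a hypothetical helper, not an existing command.
get_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    awk -v p="$param" '
        {
            for (i = 1; i <= NF; i++) {
                n = index($i, "=")
                # Compare the part before the first "=" exactly.
                if (n > 0 && substr($i, 1, n - 1) == p) {
                    print substr($i, n + 1)
                    found = 1
                    exit
                }
            }
        }
        END { exit !found }
    ' "$file"
}

if value=$(get_kernel_param intel_idle.max_cstate); then
    echo "intel_idle.max_cstate=$value"
else
    echo "intel_idle.max_cstate is not on the kernel command line"
fi
```

If the parameter was added through efibootmgr or a boot manager and does not show up here, the kernel never received it, which would explain a workaround that appears not to work.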

Comment 403 cirrusuk 2016-06-29 23:25:52 UTC

System:    Host: hawker64 Kernel: 4.6.3-1-ARCH x86_64 (64 bit) Desktop: dwm 6.1 Distro: Arch Linux
Machine:   Mobo: ASUSTeK model: P6T SE v: Rev 1.xx Bios: American Megatrends v: 0908 date: 09/21/2010
CPU:       Quad core Intel Core i7 920 (-HT-MCP-) cache: 8192 KB 
           clock speeds: max: 2672 MHz 1: 2672 MHz 2: 2672 MHz 3: 2672 MHz 4: 2672 MHz 5: 2672 MHz 6: 2672 MHz
           7: 2672 MHz 8: 2672 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] RV770 [Radeon HD 4870]
           Display Server: X.Org 1.18.3 driver: radeon Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz
           GLX Renderer: Gallium 0.4 on AMD RV770 (DRM 2.43.0, LLVM 3.8.0) GLX Version: 3.0 Mesa 11.2.2
Audio:     Card-1 Advanced Micro Devices [AMD/ATI] RV770 HDMI Audio [Radeon HD 4850/4870] driver: snd_hda_intel
           Card-2 Intel 82801JI (ICH10 Family) HD Audio Controller driver: snd_hda_intel
           Card-3 Hewlett-Packard driver: USB Audio
           Card-4 Logitech QuickCam Pro 9000 driver: USB Audio
           Sound: Advanced Linux Sound Architecture v: k4.6.3-1-ARCH
Network:   Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller driver: r8169
           IF: enp5s0 state: up speed: 100 Mbps duplex: full mac: 00:26:18:97:7b:40
Drives:    HDD Total Size: 1388.6GB (0.1% used) ID-1: /dev/sdc model: SAMSUNG_HM250HI size: 250.1GB
           ID-2: /dev/sdb model: Hitachi_HTS54164 size: 40.0GB ID-3: /dev/sda model: HDS728080PLA380 size: 82.3GB
           ID-4: USB /dev/sdd model: Cruzer_Blade size: 16.0GB ID-5: /dev/sde model: WDC_WD2500AAKS size: 250.1GB
           ID-6: /dev/sdf model: Hitachi_HDS72107 size: 750.2GB
Partition: ID-1: swap-1 size: 2.05GB used: 0.00GB (0%) fs: swap dev: /dev/sdf2
Sensors:   System Temperatures: cpu: 54.5C mobo: 48.0C gpu: 78.0
           Fan Speeds (in rpm): cpu: 2500 psu: 0 sys-1: 0 sys-2: 0
Info:      Processes: 207 Uptime: 44 min Memory: 1104.5/5962.6MB Client: Shell (zsh) inxi: 2.3.0 

I too have been experiencing these hard lockups since 4.1 on Arch Linux x86_64. I can go for 12 hours w/o a lockup, though sometimes they happen quicker. I don't want to use the kernel parameter, as I'm a tight-assed Scot whose electric bill is high enough ;)
The log is hard to get, but the output looks very similar to https://bugzilla.kernel.org/attachment.cgi?id=209581
I plan to revert to an older stable kernel or maybe LTS.
I hope to report back with relevant logs.
Thanks to all who are working on this.

Comment 404 cirrusuk 2016-06-29 23:35:34 UTC

OK, like others I do not run Bay Trail; however, I'm confident this is a similar kernel regression regardless of CPU codename. I will look around for this bug specific to my hardware.

Comment 406 carlos.valin 2016-07-01 01:49:50 UTC

Same problem in Acer aspire switch 10

Comment 407 Vladimir Jicha 2016-07-02 09:55:52 UTC

Unfortunately it seems to be clear now that as I expected this bug will never get fixed. I can only see more and more people posting here that they are affected too. But nobody even cared to change the bug state to critical from normal, confirmed from new or update the affected kernel versions up to 4.6 (and most likely any future).

Comment 408 muhaar 2016-07-02 10:14:38 UTC

Very disappointed by Intel's nonexistent product support :S

Comment 409 Paul Mansfield 2016-07-02 13:59:32 UTC

I am fairly sure Intel never sold the Baytrail processor for the Linux platform except in a very limited capacity (the computer stick is the only one as far as I know); the Z37xx series was only sold commercially for Windows. I don't think any Chromebooks used it at all, they used the N28xx variants.
So, really, we can't expect much from Intel.

Comment 410 Dmytro Kyrychuk 2016-07-02 14:40:39 UTC

> Intel never sold the Baytrail process for the Linux platform

Despite that, Acer did (and still does) sell their Aspire E5-511 with Linpus (a Linux distribution), which I took as fair evidence that those laptops would be fine with Ubuntu as well. Apparently, I was wrong.

Comment 411 Maurizio 2016-07-03 16:48:19 UTC

Anyway officially or unofficially the problem has been extremely reduced in the latest kernel version. 

I'm running 4.6.3 on a Celeron N2930 and I can get several days of uptime without any problems. Every now and then I experience a crash when streaming video, but that is way better than before, when the machine crashed after a few hours of idling.

Comment 412 Paul Mansfield 2016-07-03 16:59:23 UTC

I have found that 4.5.7 with the patch set from John Brodie on the Asus T100 Ubuntu Google+ group is very stable with cstate=1 and sdio wifi - I can achieve days of uptime.
See the files section linked from here: https://plus.google.com/communities/117853703024346186936

Comment 413 Vladimir Jicha 2016-07-04 19:48:24 UTC

My Shuttle XS35V4 was offered as Linux compatible. And that is not the only bay-trail computer with declared Linux support.

Comment 414 amjafuso 2016-07-07 08:12:32 UTC

I tested kernel 4.6.3 (CPU: J1900); it took about 3 h to crash (Chromium browser, no video). Setting cstate to 1 or 2 still fixes the problem.



@vladimir jicha

I also have a Shuttle XS35V4. Please set the kernel parameter intel_idle.max_cstate.

For grub:
- vi /etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=2"
- update-grub
- reboot

Absolutely no crashes anymore, power consumption is only 8W.
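
The GRUB steps in this comment can be sketched as a small shell function. This is a minimal sketch, not something posted in the thread: the function name `add_kernel_param` is hypothetical, and it assumes the GRUB defaults file holds a single double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line, as in the default Ubuntu layout.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM   (hypothetical helper, not from the thread)
# Appends PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless the same
# key already appears on that line.
add_kernel_param() {
    file=$1
    param=$2
    key=${param%%=*}   # "intel_idle.max_cstate=2" -> "intel_idle.max_cstate"

    # Leave the file alone if the key is already on the default command line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${key}" "$file"; then
        return 0
    fi

    # Insert the parameter just before the closing double quote, writing
    # through a temporary file so no sed-specific -i option is needed.
    tmpfile="${file}.new"
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file" > "$tmpfile" \
        && cat "$tmpfile" > "$file" \
        && rm -f "$tmpfile"
}
```

On a real system this would be run as root, for example `add_kernel_param /etc/default/grub intel_idle.max_cstate=2`, followed by `update-grub` and a reboot as described in the comment. Trying it on a copy of the file first is the safer route.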

Comment 415 Rush Hour Rambo 2016-07-08 02:00:22 UTC

I'm running an Inspiron 3551 laptop with a Pentium N3540. I've been having this same bug since Ubuntu 15.10. I recently installed Linux Mint 18 MATE and no longer had the freezing caused by this bug. Yesterday I installed some updates and the freezing has returned ever since. I am not sure whether the updates included kernel updates, but I still have 2 kernels on my PC: the older one, 4.4.0-21-generic, did not cause any freezing for over a week after installing Mint 18. The newer one, 4.4.0-28-generic, does cause freezing.
I have rolled back to 4.4.0-21-generic and tested it: I can watch YouTube videos at 720p 60fps with no freezing, while with 4.4.0-28-generic YouTube videos cause the freezing.
4.4.0-21-generic seems to fix this issue for me. Has anyone else tried it?

Comment 416 Martin 2016-07-08 17:03:23 UTC

On 4.6.0, max_cstate=2 is not an option here. Will try 4.6.3 later.

Comment 417 Alejandro Morales Lepe 2016-07-08 17:36:43 UTC

Is there any way to check in the code what changed for Baytrail CPUs between kernel 3.16 and 3.17? Was there a patch specific to Baytrail that is causing the issue, or some patch for C-states? There should be an active effort to fix this bug because it affects multiple machines with Ubuntu preinstalled, and Ubuntu is retiring support for kernel 3.16, so people will either be stuck with a very old kernel or experience freezes with 16.04, especially on machines with Ubuntu preinstalled like the Dell Inspiron 3551 Ubuntu Edition. This will really hurt the public image of Linux distros on consumer computers.

Comment 418 Joe Burmeister 2016-07-08 18:21:45 UTC

I'm fairly sure I have had this in 3.16 on my media machine. Or at least some other complete system freeze. I think it's just very rare under 3.16. So I'm not convinced the answer will fall out of bisection. :-(

Seems graphics related from what has been said. Unless it also happens on headless machines, in which case that is clearly not true. I'm guessing at an interaction between the power states of the CPU and the GPU: one changed in just the wrong place for the other. But that is a load of speculation I don't have time to dig into. I'm getting tempted to just bin the board when being stuck on 3.16 becomes an issue. Or use it headless with another Pi as the media machine.

Comment 419 Alejandro Morales Lepe 2016-07-08 18:35:05 UTC

I have yet to experience a complete lock-up in 3.16. The lock-ups I have had happen when I fill my Inspiron 3551's RAM by running a lot of stuff; however, I am able to reboot the computer with some SysRq magic, while on newer kernels the lock-up prevents me from doing this... could it be a different thing?

At least you can run headless :( I am suffering from this problem on my daily driver and I pretty much need it for everything. I got this machine for the price and the idea that I would be getting a Linux-ready computer, oh boy...

I am not an expert, but if there is some way I can help to debug this, anybody, please let me know.

(In reply to Joe Burmeister from comment #418)
> I'm fairly sure I have had this in 3.16 on my media machine. Or at least
> some other complete system freeze. I think it's just very rare under 3.16.
> So I'm not convinced the answer will fall out of bisection. :-(
> 
> Seams graphics related from what has been said. Unless it does happen on
> headless machines, in which case, that is clearly not true. Guessing
> interaction of power state of CPU vs GPU. One changed in just the wrong
> place for the other. But loads a speculation there I don't have time to dig
> into. I'm getting temped to just bin the board when being stuck 3.16 becomes
> an issue. Or use it as headless with another Pi media machine.

Comment 420 Paul Mansfield 2016-07-08 19:58:26 UTC

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/?id=refs/tags/v3.17

Comment 421 Alejandro Morales Lepe 2016-07-08 20:13:37 UTC

From what I am seeing, the option itself to set intel_idle.max_cstate=1 was added in kernel 3.17. Does it have any relation to the problem, or am I getting lost in my ignorance? Is there any problem with the default value? Where is that value used? Excuse me if this is not of much help; I am trying to make some sense of this, but since I am not familiar with kernel development maybe I am just running in circles.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2e92c7ad8f269c2b5b7f2a4763675f55f00b75f5

Comment 422 Joe Burmeister 2016-07-08 20:25:51 UTC

It could be a different complete freeze. Without pouring time into it, I can't know.
There is no "the fix" as far as I know. The only workaround is to limit the C-state in the BIOS or with the kernel argument. Both do the same thing, and that sucks for power usage.

I'd love some good news too.

Comment 423 Andrew Clayton 2016-07-08 20:33:49 UTC

(In reply to Alejandro Morales Lepe from comment #421)

> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=2e92c7ad8f269c2b5b7f2a4763675f55f00b75f5

That's just adding documentation. The intel_idle driver was added back in 2010 and Bay Trail support was added in 2015 by 718987d695adc991eb94501209fe5353136c8c16 ("intel_idle: support Bay Trail")

And possibly last touched by

d7ef76717322c8e2df7d4360b33faa9466cb1a0d ("intel_idle: Update support for Silvermont Core in Baytrail SOC")


IIRC J1900 is a Silvermont.

Comment 424 Paul Mansfield 2016-07-08 20:37:27 UTC

Yes, a J1900 is an Atom and has Silvermont cores. But so does a Baytrail

https://en.wikipedia.org/wiki/Silvermont

Comment 425 André Hoogendoorn 2016-07-12 23:34:47 UTC

It seems some Intel CPUs have design bugs:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf

Comment 426 André Hoogendoorn 2016-07-13 03:01:06 UTC

I have read the pdf and according to Intel, there is a C6 state hardware bug in the CPU numbers as listed on page 9 and 10
----
VLP52 EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine.

Problem:
If core C6 is entered after the start of an interrupt service routine but before a write
to the APIC EOI (End of Interrupt) register, and the core is woken up by an event
other than a fixed interrupt source the core may drop the EOI transaction the next
time APIC EOI register is written and further interrupts from the same or lower
priority level will be blocked.

Implication:
EOI transactions may be lost and interrupts may be blocked when core C6 is used
during interrupt service routines.

Workaround:
It is possible for the firmware to contain a workaround for this erratum.
----

Comment 427 Andrew Clayton 2016-07-14 01:10:06 UTC

(In reply to André Hoogendoorn from comment #426)
> I have read the pdf and according to Intel, there is a C6 state hardware bug
> in the CPU numbers as listed on page 9 and 10

Interesting. I have been running with intel_idle.max_cstate=5 (changed from 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+ hours now.

IIRC I would have had a lockup by now...

Comment 428 Wolfgang M. Reimer 2016-07-14 09:31:02 UTC

(In reply to Andrew Clayton from comment #427)
> (In reply to André Hoogendoorn from comment #426)
> > I have read the pdf and according to Intel, there is a C6 state hardware bug
> > in the CPU numbers as listed on page 9 and 10
> 
> Interesting. I have been running with intel_idle.max_cstate=5 (changed from
> 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+
> hours now.
> 
> IIRC I would have had a lockup by now...

According to "cpupower idle-info" a J1900 CPU has

Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT

So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME AS running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2, intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with either of the latter settings it will also run stably with intel_idle.max_cstate=5.

Comment 429 Martin 2016-07-14 09:36:31 UTC

My J1900 reliably freezes with any intel_idle.max_cstate > 1 (kernel 4.6.0) but I see why you would expect otherwise.

Comment 430 Max Stegmeyer 2016-07-14 09:37:02 UTC

(In reply to Wolfgang M. Reimer from comment #428)
> (In reply to Andrew Clayton from comment #427)
> > (In reply to André Hoogendoorn from comment #426)
> > > I have read the pdf and according to Intel, there is a C6 state hardware bug
> > > in the CPU numbers as listed on page 9 and 10
> > 
> > Interesting. I have been running with intel_idle.max_cstate=5 (changed from
> > 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU for 14+
> > hours now.
> > 
> > IIRC I would have had a lockup by now...
> 
> According to "cpupower idle-info" a J1900 CPU has
> 
> Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT
> 
> So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME AS
> running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2,
> intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with
> either of the latter settings it will also run stably with
> intel_idle.max_cstate=5.

But something must be different. I also use a J1900 mainboard and there is a difference in power consumption between running with max_cstate=1 and max_cstate=2.
For me, that's
max_cstate=1: 17.2W
max_cstate=2: 16.5W
no max_cstate: 15.9W

Comment 431 Vaidas Jablonskis 2016-07-14 14:08:21 UTC

Just found this bug report. I used to get freezes on my Dell XPS 15 9550 with Skylake Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz up until the 4.7.0-0.rc7.git1.2.fc25.x86_64 kernel.

I don't have max_cstate set at the moment. Something has changed since rc7.git0 kernel build.

I am running fedora 24 with https://fedoraproject.org/wiki/RawhideKernelNodebug repo enabled.

Comment 432 D. Hugh Redelmeier 2016-07-14 14:46:44 UTC

@Vaidas Jablonskis #431: this bug report is about Baytrail CPUs.  Skylake is quite different.

Comment 433 Vaidas Jablonskis 2016-07-14 15:25:34 UTC

(In reply to D. Hugh Redelmeier from comment #432)
> @Vaidas Jablonskis #431: this bug report is about Baytrail CPUs.  Skylake is
> quite different.

Oops. My apologies for not reading the title.

Comment 434 Wolfgang M. Reimer 2016-07-14 18:09:11 UTC

Created attachment 223851 [details]
Disable all C6 states enable all C7 core states for Baytrail CPUs

Disables all C6 states and enables all C7 core states on Baytrail CPUs, to verify whether erratum VLP52 is the root cause of this bug. Must be run as root.

Comment 435 Wolfgang M. Reimer 2016-07-14 18:12:53 UTC

Created attachment 223861 [details]
Shows all core states (C-states) + some related info as a formatted table

The intel_idle.max_cstate boot parameter refers to the enumeration done by the Linux kernel (the number in column State) and not to the Intel notation of core states C0, C1, C2, C3, C6, C7, etc. Latency, Residency, and Time units are microseconds.

Comment 436 Wolfgang M. Reimer 2016-07-14 18:14:31 UTC

(In reply to Martin from comment #429)
> My J1900 reliably freezes with any intel_idle.max_cstate > 1 (kernel 4.6.0)
> but I see why you would expect otherwise.

(In reply to Max Stegmeyer from comment #430)
> (In reply to Wolfgang M. Reimer from comment #428)
> > (In reply to Andrew Clayton from comment #427)
> > > (In reply to André Hoogendoorn from comment #426)
> > > > I have read the pdf and according to Intel, there is a C6 state
> > > > hardware bug in the CPU numbers as listed on page 9 and 10
> > > 
> > > Interesting. I have been running with intel_idle.max_cstate=5 (changed
> > > from 2, which was fine) under the Fedora 24 4.6.3 kernel on a J1900 CPU
> > > for 14+ hours now.
> > > 
> > > IIRC I would have had a lockup by now...
> > 
> > According to "cpupower idle-info" a J1900 CPU has
> > 
> > Available idle states: POLL C1-BYT C6N-BYT C6S-BYT C7-BYT C7S-BYT
> > 
> > So running a J1900 CPU with intel_idle.max_cstate=5 is basically THE SAME
> > AS running it with intel_idle.max_cstate=1, intel_idle.max_cstate=2,
> > intel_idle.max_cstate=3, or intel_idle.max_cstate=4. If it ran stably with
> > either of the latter settings it will also run stably with
> > intel_idle.max_cstate=5.
> 
> But something must be different. I also use a J1900 mainboard and there is a
> difference in power consumption between running with max_cstate=1 and
> max_cstate=2.
> For me, that's
> max_cstate=1: 17.2W
> max_cstate=2: 16.5W
> no max_cstate: 15.9W

OK, you are right, and I found out what the problem is. The Linux kernel enumerates the states for the J1900 as follows:

0 POLL
1 C1-BYT
2 C6N-BYT
3 C6S-BYT
4 C7-BYT
5 C7S-BYT

The parameter intel_idle.max_cstate refers to that enumeration and does _NOT_ conform to the Intel notation of the C-states (which confused me).

So "intel_idle.max_cstate=2" means POLL, C1-BYT, and C6N-BYT (the first of the intel C6 states) are enabled and all other states (C6S-BYT, C7-BYT, C7S-BYT) are disabled and _CANNOT_ be enabled after boot time.

Fortunately, the /sys interface of the kernel allows fine-grained tweaking at run time, and one can turn the states off and on individually (if they were not disabled at boot time via intel_idle.max_cstate=<number>).
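
The run-time switch described here can be sketched as follows. This is a minimal sketch, assuming the standard cpuidle sysfs layout (`/sys/devices/system/cpu/cpuN/cpuidle/stateM/` with `name` and `disable` files). The function name `set_state_disabled` and its optional root argument are hypothetical and exist so the logic can be tried against a scratch directory; it is not one of the scripts attached to this bug.

```shell
#!/bin/sh
# set_state_disabled NAME VALUE [ROOT]   (hypothetical helper)
# Writes VALUE (1 = disabled, 0 = enabled) to the "disable" file of every
# idle state called NAME, on every CPU found below ROOT.
set_state_disabled() {
    name=$1
    value=$2
    root=${3:-/sys/devices/system/cpu}

    for state in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -r "$state/name" ] || continue
        if [ "$(cat "$state/name")" = "$name" ]; then
            echo "$value" > "$state/disable"
        fi
    done
}
```

Run as root, `set_state_disabled C6N-BYT 1` followed by `set_state_disabled C6S-BYT 1` would turn off both C6 states while leaving the C7 states available. The setting does not survive a reboot.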

In order to investigate whether erratum VLP52 is the root cause for this kernel bug (109051) I attached two shell scripts to this bug.

The first (c6off+c7on.sh) will disable all intel C6 core states for Baytrail processors (C6N-BYT and C6S-BYT) + enable all C7 core states (C7-BYT and C7S-BYT).

The second script can be used to verify that the C6 states are disabled (column "Disabled" should show a "1" for the disabled states and the count for the columns "Time" and "Usage" should not change any longer for the disabled C6*-BYT states).
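
A table of this kind can be produced from the same sysfs files. The following is a minimal sketch under the assumption that each `stateM` directory exposes `name`, `disable`, `latency`, `residency`, `time` and `usage`; the function name `cstate_table` and the column widths are hypothetical, and this is not the attached script itself.

```shell
#!/bin/sh
# cstate_table [ROOT]   (hypothetical helper)
# Prints one block per CPU with the kernel's state number, the state name,
# the disable flag and the latency/residency/time/usage values.
cstate_table() {
    root=${1:-/sys/devices/system/cpu}
    fmt='%5s  %-8s %8s %8s %10s %14s %10s\n'

    for cpu in "$root"/cpu[0-9]*; do
        [ -d "$cpu/cpuidle" ] || continue
        printf '%s\n' "${cpu##*/}"
        printf "$fmt" State Name Disabled Latency Residency Time Usage
        for state in "$cpu"/cpuidle/state[0-9]*; do
            [ -r "$state/name" ] || continue
            printf "$fmt" \
                "${state##*state}" \
                "$(cat "$state/name")" \
                "$(cat "$state/disable")" \
                "$(cat "$state/latency")" \
                "$(cat "$state/residency")" \
                "$(cat "$state/time")" \
                "$(cat "$state/usage")"
        done
    done
}
```

Reading these files does not need root. The shell glob sorts names as text, so on machines with ten or more CPUs `cpu10` is listed before `cpu2`.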

The "c6off+c7on.sh" script should be started at system boot and if erratum VLP52 is the root cause of this bug then Baytrail systems with the processors mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=109051#c425 (J2850, J1850, J1750, N3510, N2810, N2805, N2910, N3520, N2920, N2820, N2806, N2815, J2900, J1900, J1800, N3530, N2930, N2830, N2807, N3540, N2940, N2840, N2808) should run stably again. Especially Baytrail based systems with low average load (e.g. tablets and notebooks) should consume considerably less power with enabled C7*-BYT states.

Please give feedback (stability, power consumption, etc.)!

Comment 437 Wolfgang M. Reimer 2016-07-14 19:28:11 UTC

Running my submitted scripts

https://bugzilla.kernel.org/attachment.cgi?id=223851
https://bugzilla.kernel.org/attachment.cgi?id=223861

on a J1900 system should produce output similar to this:

$ sudo $HOME/bin/c6off+c7on.sh
DISABLED state C6N-BYT for cpu0.
DISABLED state C6S-BYT for cpu0.
DISABLED state C6N-BYT for cpu1.
DISABLED state C6S-BYT for cpu1.
DISABLED state C6N-BYT for cpu2.
DISABLED state C6S-BYT for cpu2.
DISABLED state C6N-BYT for cpu3.
DISABLED state C6S-BYT for cpu3.

$ $HOME/bin/cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0      77432    267
         1  C1-BYT          0        1          1   13849382  21986
         2  C6N-BYT         1      300        275     891290   1491
         3  C6S-BYT         1      500        560    1340774   1078
         4  C7-BYT          0     1200       4000    3190476    380
         5  C7S-BYT         0    10000      20000  255687727   1025
cpu1 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0      10256    160
         1  C1-BYT          0        1          1   12134067  10470
         2  C6N-BYT         1      300        275     897517    514
         3  C6S-BYT         1      500        560    2742364    688
         4  C7-BYT          0     1200       4000    3223395    312
         5  C7S-BYT         0    10000      20000  256625325    886
cpu2 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0      58350    205
         1  C1-BYT          0        1          1   14738863  26297
         2  C6N-BYT         1      300        275     974127   1195
         3  C6S-BYT         1      500        560    2688385    879
         4  C7-BYT          0     1200       4000   25533926   1768
         5  C7S-BYT         0    10000      20000  231166600   1894
cpu3 State  Name     Disabled  Latency  Residency       Time  Usage
         0  POLL            0        0          0       9249    232
         1  C1-BYT          0        1          1   14294725  24977
         2  C6N-BYT         1      300        275    1678518   2863
         3  C6S-BYT         1      500        560    2531238   1394
         4  C7-BYT          0     1200       4000    7240420    693
         5  C7S-BYT         0    10000      20000  250630919   2281

Running cstateInfo.sh again should show no changes in the lines for the disabled C6 states (C6N-BYT and C6S-BYT):

$ $HOME/bin/cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0       77497    277
         1  C1-BYT          0        1          1    17466806  23676
         2  C6N-BYT         1      300        275      891290   1491
         3  C6S-BYT         1      500        560     1340774   1078
         4  C7-BYT          0     1200       4000     4231024    429
         5  C7S-BYT         0    10000      20000  1113610759   3134
cpu1 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0       10292    168
         1  C1-BYT          0        1          1    20242967  12191
         2  C6N-BYT         1      300        275      897517    514
         3  C6S-BYT         1      500        560     2742364    688
         4  C7-BYT          0     1200       4000     4398584    346
         5  C7S-BYT         0    10000      20000  1109869872   2675
cpu2 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0       58662    277
         1  C1-BYT          0        1          1    24698671  33431
         2  C6N-BYT         1      300        275      974127   1195
         3  C6S-BYT         1      500        560     2688385    879
         4  C7-BYT          0     1200       4000    94027530   3711
         5  C7S-BYT         0    10000      20000  1014763708   6407
cpu3 State  Name     Disabled  Latency  Residency        Time  Usage
         0  POLL            0        0          0        9448    277
         1  C1-BYT          0        1          1    29230274  30522
         2  C6N-BYT         1      300        275     1678518   2863
         3  C6S-BYT         1      500        560     2531238   1394
         4  C7-BYT          0     1200       4000    14492087   1315
         5  C7S-BYT         0    10000      20000  1090072439   7878

As one can see, in my case most of the cores' idle time is now spent in state C7S-BYT.
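
The "no changes" check between the two runs can also be done mechanically by snapshotting only the disabled states and comparing the snapshots. A minimal sketch, again assuming the standard cpuidle sysfs files; `disabled_counters` is a hypothetical name, not part of the attached scripts.

```shell
#!/bin/sh
# disabled_counters [ROOT]   (hypothetical helper)
# Prints "cpuN NAME TIME USAGE" for every idle state whose disable flag is 1.
disabled_counters() {
    root=${1:-/sys/devices/system/cpu}

    for state in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$state/disable" ] || continue
        [ "$(cat "$state/disable")" = "1" ] || continue
        cpu=${state%/cpuidle/*}
        echo "${cpu##*/} $(cat "$state/name") $(cat "$state/time") $(cat "$state/usage")"
    done
}

# Example use (commented out so that sourcing this file has no side effects):
#   before=$(disabled_counters); sleep 60; after=$(disabled_counters)
#   [ "$before" = "$after" ] && echo "disabled states were not entered"
```

If the two snapshots differ, a state marked as disabled was still entered in between, and any stability result from that run would say little about the erratum.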

Comment 438 Andy Furniss 2016-07-14 23:38:58 UTC

Nice script. FWIW, and maybe by luck, it seems that being headless on a J1900 helps a lot. I would surely lock up with an unpatched kernel + graphics.

Note the lack of i915 interrupts (those shown were there at boot).

So 99 days (it would be longer, but I had to power off) on vanilla 4.1.18.

asr[~]$ sh cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0      352565556      538707
         1  C1-BYT          0        1          1   130181110251   755499147
         2  C6N-BYT         0      300        275   168721715688   321645308
         3  C6S-BYT         0      500        560  2679566473195  1081712423
         4  C7-BYT          0     1200       4000  5201809523872   677055949
         5  C7S-BYT         0    10000      20000   232672548010     6063953
cpu1 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0       66553174      100721
         1  C1-BYT          0        1          1    21321194555    95022167
         2  C6N-BYT         0      300        275    59708872499    80912844
         3  C6S-BYT         0      500        560  1699545542740   616884568
         4  C7-BYT          0     1200       4000  6157806454862   674802503
         5  C7S-BYT         0    10000      20000   528940757115    15010441
cpu2 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0       52182600       54992
         1  C1-BYT          0        1          1    11577684031    44781333
         2  C6N-BYT         0      300        275    30691974207    38857448
         3  C6S-BYT         0      500        560   926619750261   332837818
         4  C7-BYT          0     1200       4000  5605371769375   533458885
         5  C7S-BYT         0    10000      20000  1938187552261    60665722
cpu3 State  Name     Disabled  Latency  Residency           Time       Usage
         0  POLL            0        0          0       87403241       51053
         1  C1-BYT          0        1          1    10016724016    38039691
         2  C6N-BYT         0      300        275    28416307863    35148851
         3  C6S-BYT         0      500        560   827491037749   293247527
         4  C7-BYT          0     1200       4000  5475692994922   503244246
         5  C7S-BYT         0    10000      20000  2176064852237    65760515
asr[~]$ uptime 
 00:27:42 up 99 days, 11:27,  1 user,  load average: 0.01, 0.02, 0.05
asr[~]$ uname -a
Linux asr 4.1.18 #1 SMP Mon Feb 22 23:38:21 GMT 2016 x86_64 GNU/Linux
asr[~]$ cat /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       
   0:         40          0          0          0   IO-APIC-edge      timer
   1:          3          0          0          0   IO-APIC-edge      i8042
   7:          1          0          0          0   IO-APIC-edge    
   8:          2          0          0          0   IO-APIC-fasteoi   rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          4          0          0          0   IO-APIC-edge      i8042
  23:   97932462          0          0          0   IO-APIC   23-fasteoi   ehci_hcd:usb1
  87:         38          0          0          0   PCI-MSI-edge      i915
  88:   12913831          0          0          0   PCI-MSI-edge      0000:04:00.0
  89: 1013520070          0          0          0   PCI-MSI-edge      eth0
 NMI:          0          0          0          0   Non-maskable interrupts
 LOC: 1613399563 1431968881  931951991  935213869   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          0          0          0          0   Performance monitoring interrupts
 IWI:          1          0          0          0   IRQ work interrupts
 RTR:          0          0          0          0   APIC ICR read retries
 RES:   26307754   50297144   12568066   12293478   Rescheduling interrupts
 CAL:       2124       1472    1663888    1463719   Function call interrupts
 TLB:      25130      19257      11500      11783   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:      28651      28651      28651      28651   Machine check polls
 ERR:          1
 MIS:          0

Comment 439 ladiko 2016-07-15 06:02:43 UTC

I have a machine that was taken out of service because of the freezes and another issue where USB devices randomly disappear on this platform. Later on I used it as a headless Asterisk server and never had a single freeze with Ubuntu 14.04 and kernel 3.16, 3.19, 4.1 or 4.4. So without a running X server, it seems to work without freezes.

Comment 440 Maurizio 2016-07-18 08:12:37 UTC

Guys, I'm posting this again. I'm not really sure if it helps, but my system has been up for 7 days without a crash and, of course, with NO max_cstate parameter set. I'm running Arch Linux with kernel 4.6.3.

My CPU is a Celeron N2930... I have X constantly running, as the PC runs Kodi by default.

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  N2930  @ 1.83GHz
stepping        : 8
microcode       : 0x829


Linux zotac 4.6.3-1-ARCH #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016 x86_64 GNU/Linux

10:06:41 up 7 days, 13:34,  2 users,  load average: 0,08, 0,06, 0,01

[    0.000000] Linux version 4.6.3-1-ARCH (builduser@tobias) (gcc version 6.1.1 20160602 (GCC) ) #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016
[    0.000000] Command line: initrd=\initramfs-linux.img root=/dev/sda2 rw
[    0.000000] x86/fpu: Legacy x87 FPU detected.
[    0.000000] x86/fpu: Using 'eager' FPU context switches.

Comment 441 Kemal Ilgar Eroğlu 2016-07-18 20:30:41 UTC

Hi all,

I've been following this bug for a long time as my Bay Trail tablet HP Pavilion X2 with an Atom Z3736F kept freezing within 1-2 hours after booting. I've tried all major kernel versions since 4.1. I must mention that they all included the following mmc patches:

https://github.com/hadess/rtl8723bs

They also included the intel patch suggested at Debian's wiki:

https://wiki.debian.org/InstallingDebianOn/HP/Pavilion%20x2%2010%20%282015%20model%29/Jessie?action=AttachFile&do=view&target=intel_display.patch

Other than those, I tried various patches I found around hoping to cure the freezes. Even max_cstate=0 did not help. Finally, with 4.6.3 + Mika Kuoppala's 3 patches the situation got somewhat better but I never exceeded 4 hours without a freeze. Then I came across Daniel Bilik's patches elsewhere (for some reason I had overlooked his posts on this page!).

With his patches applied, I made several reboots, also playing around with Wolfgang's scripts and so far I never had a regular freeze[1]. Now my tablet's uptime has reached 24 hours for the first time ever, being booted without any max_cstate arguments and the C6 states being active. It's perhaps too soon to declare this a success but apparently Daniel's addition to Mika's patches has made a huge difference here.

[1] What I mean is this: Without max_cstate=0 (or 1), the tablet can freeze during boot. Most of the freezes are when the disks are mounted/fsck'ed and others when the drm framebuffer is initialized. And occasionally the screen goes blank when drm takes over, but I can recover with magic SysRq. I don't know if any of these problems might be due to other factors than the Intel Bay Trail bug. But once it gets past the booting stage, (with these latest patches) it seems to survive the rest.

Comment 442 bzq7xy5gpj 2016-07-21 07:46:49 UTC

I'm using a Bay Trail NUC (DN2820FYKH) but I don't remember encountering this bug [109051] any time. I just post it here to let you know that for whatever reason this NUC seems not to be affected.

Using latest Arch (through Antergos) here are some infos:

I boot into Desktop and didn't do anything else (or how should I reproduce this bug?).

$ uname -a
Linux *** 4.6.4-1-ARCH #1 SMP PREEMPT Mon Jul 11 19:12:32 CEST 2016 x86_64 GNU/Linux

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Celeron(R) CPU  N2820  @ 2.13GHz
stepping	: 3
microcode	: 0x324
cpu MHz		: 533.116
cache size	: 1024 KB
...

$ ./cstateInfo.sh 
cpu0 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0    12418075     717
         1  C1-BYT          0        1          1    93279269   79577
         2  C6N-BYT         0      300        275    20566295   34104
         3  C6S-BYT         0      500        560   949015145  355284
         4  C7-BYT          0     1200       4000  8810179482  634227
         5  C7S-BYT         0    10000      20000  8042123989   99746
cpu1 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0     6130914     716
         1  C1-BYT          0        1          1   391648635   73352
         2  C6N-BYT         0      300        275    17716302   30140
         3  C6S-BYT         0      500        560   846101843  326140
         4  C7-BYT          0     1200       4000  8388706626  596330
         5  C7S-BYT         0    10000      20000  8278801097   97725
$ uptime
 14:29:40 up  5:04,  2 users,  load average: 0,17, 0,07, 0,01
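
For readers without the cstateInfo.sh attachment, the counters it reports can also be read straight from sysfs. The following is only a minimal sketch, not the attached script: the function name cstate_info and its output format are made up for illustration, and it assumes the kernel exposes the usual cpuidle files (name, disable, usage) under /sys/devices/system/cpu.

```shell
#!/bin/sh
# Print name, disable flag and usage count of every cpuidle state.
# The optional argument overrides the sysfs base directory (useful for testing).
cstate_info() {
    base=${1:-/sys/devices/system/cpu}
    for cpu in "$base"/cpu[0-9]*; do
        # Skip CPUs without cpuidle information.
        [ -d "$cpu/cpuidle" ] || continue
        for state in "$cpu"/cpuidle/state[0-9]*; do
            [ -r "$state/name" ] || continue
            printf '%s %s %s disabled=%s usage=%s\n' \
                "${cpu##*/}" "${state##*/}" \
                "$(cat "$state/name")" \
                "$(cat "$state/disable")" \
                "$(cat "$state/usage")"
        done
    done
}
```

Calling cstate_info without arguments reads the live system; a state whose disabled value is 1 should stop accumulating usage.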

Comment 443 ichudov 2016-08-04 16:02:41 UTC

I have a Shuttle XS35V4 with Intel Celeron and  Z36xxx/Z37xxx Series Graphics. It would rather quickly hang when graphical capabilities were used. Watching youtube or a video would hang it in minutes.

Ever since I set intel_idle.max_cstate=1, I could not reproduce any hangs, despite continuously playing Idiocracy in a while loop, and playing a stream of youtube videos at the same time, the computer stays up for days.
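
On Ubuntu, a kernel parameter such as intel_idle.max_cstate=1 is usually made permanent through GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The sketch below shows one way to append it; the helper name add_kernel_param is hypothetical, and it assumes GNU sed and a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# $1 = grub defaults file (normally /etc/default/grub), $2 = parameter to add.
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${param}" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
}
```

Typical use as root would be add_kernel_param /etc/default/grub intel_idle.max_cstate=1, followed by update-grub and a reboot; cat /proc/cmdline then shows whether the parameter is active.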

Comment 444 Rick Lee 2016-08-07 17:18:21 UTC

Fascinating 2 hours reading 443 comments above ^^^

Dell Inspiron 17R SE 7720. Intel i7 3630QM. Nvidia GT650M. Observations based on Youtube streaming and 10 chrome tabs.

Had been running 3.13.092 / Ubuntu 14.04.1 for a month. Conky monitor heat 50C. CPU frequency bounces between 1200 and 2400 Mhz. 8 CPU's run around 15%. No real problems other than Turbo Boost appears inactive.

Yesterday I upgraded to 4.4.0 / Ubuntu 16.04; closing the lid doesn't suspend.
Then I updated to 4.6.3 and patched the systemd config to make suspend work.
Conky now shows 70C and there is an occasional 5-10 second keyboard lag. CPU scaling now goes over the 3200 MHz turbo boost limit, but is usually around 2000 MHz. The 8 CPUs now run at 7% with more even balancing.

When wifi gets patchy CPU really races.
Also have smartphone plugged into USB powered hub.
Also have external TV via HDMI.
Will try Yakety Yak (Ubuntu 16.10) soon because sound switches to laptop from TV during suspend and PulseAudio 9 (in Yakety Yak) fixes this PulseAudio 8 "undocumented feature" (in Xenial X-thingy).

No system lock ups but running 20C hotter and occasional keyboard freezes from 5-10 seconds are concerning.

HTH. Please don't flame me for not saying BayTrail.

Comment 445 Rick Lee 2016-08-07 17:50:09 UTC

Like a few other posters, I spoke too soon. As I was writing my last post, Youtube was auto-running at 144p. Under 4.6.3, at Youtube 1080p the numbers are: heat 80C, average 3000 MHz, 8 CPUs at 18% utilization (an average estimated by eye). It took many months studying UDEV and now I fear it will be the same with systemd.

Comment 446 amjafuso 2016-08-08 13:57:29 UTC

Tried new Kernel 4.7.0 and removed intel_idle.max_cstate=1 (cpu J1900).

Crashed after a few hours. Still not solved.

Comment 447 Maciej Hrebien 2016-08-08 16:15:19 UTC

Caught today:

Aug  8 06:07:44 HP-Mini kernel: [   10.104246] ------------[ cut here ]------------
Aug  8 06:07:44 HP-Mini kernel: [   10.104266] WARNING: CPU: 0 PID: 21 at /build/linux-7z1rSb/linux-3.16.7-ckt25/include/linux/kref.h:47 kobject_get+0x3a/0x50()
Aug  8 06:07:44 HP-Mini kernel: [   10.104270] Modules linked in: acpi_cpufreq(+) processor fuse autofs4 ext4 crc16 mbcache jbd2 ums_realtek sg sd_mod crc_t10dif crct10dif_generic crct10dif_common usb_storage ahci libahci ehci_pci uhci_hcd ehci_hcd libata psmouse scsi_mod usbcore usb_common r8169 mii fan thermal thermal_sys
Aug  8 06:07:44 HP-Mini kernel: [   10.104319] CPU: 0 PID: 21 Comm: kworker/0:1 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
Aug  8 06:07:44 HP-Mini kernel: [   10.104323] Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011
Aug  8 06:07:44 HP-Mini kernel: [   10.104332] Workqueue: kacpi_notify acpi_os_execute_deferred
Aug  8 06:07:44 HP-Mini kernel: [   10.104336]  0000000000000000 ffffffff8150e08f 0000000000000000 0000000000000009
Aug  8 06:07:44 HP-Mini kernel: [   10.104343]  ffffffff81067777 ffff880036961c00 0000000000000202 0000000000000003
Aug  8 06:07:44 HP-Mini kernel: [   10.104349]  0000000000000003 ffff880036e422f0 ffffffff812acbfa ffff880036961d28
Aug  8 06:07:44 HP-Mini kernel: [   10.104355] Call Trace:
Aug  8 06:07:44 HP-Mini kernel: [   10.104367]  [<ffffffff8150e08f>] ? dump_stack+0x5d/0x78
Aug  8 06:07:44 HP-Mini kernel: [   10.104376]  [<ffffffff81067777>] ? warn_slowpath_common+0x77/0x90
Aug  8 06:07:44 HP-Mini kernel: [   10.104383]  [<ffffffff812acbfa>] ? kobject_get+0x3a/0x50
Aug  8 06:07:44 HP-Mini kernel: [   10.104391]  [<ffffffff813d94f0>] ? cpufreq_cpu_get+0x70/0xc0
Aug  8 06:07:44 HP-Mini kernel: [   10.104398]  [<ffffffff813d9f2a>] ? cpufreq_update_policy+0x1a/0x1d0
Aug  8 06:07:44 HP-Mini kernel: [   10.104406]  [<ffffffff813da0e0>] ? cpufreq_update_policy+0x1d0/0x1d0
Aug  8 06:07:44 HP-Mini kernel: [   10.104421]  [<ffffffffa018b566>] ? cpufreq_set_cur_state.part.3+0x83/0x8a [processor]
Aug  8 06:07:44 HP-Mini kernel: [   10.104430]  [<ffffffffa018b666>] ? processor_set_cur_state+0x97/0xd1 [processor]
Aug  8 06:07:44 HP-Mini kernel: [   10.104444]  [<ffffffffa0000e05>] ? thermal_cdev_update+0xa5/0x110 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104453]  [<ffffffffa0003729>] ? step_wise_throttle+0x49/0x80 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104462]  [<ffffffffa000161c>] ? handle_thermal_trip+0x4c/0x150 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104471]  [<ffffffffa000179d>] ? thermal_zone_device_update+0x7d/0xd0 [thermal_sys]
Aug  8 06:07:44 HP-Mini kernel: [   10.104479]  [<ffffffff813319a1>] ? acpi_ev_notify_dispatch+0x3c/0x51
Aug  8 06:07:44 HP-Mini kernel: [   10.104485]  [<ffffffff8131e457>] ? acpi_os_execute_deferred+0x10/0x1a
Aug  8 06:07:44 HP-Mini kernel: [   10.104492]  [<ffffffff81081742>] ? process_one_work+0x172/0x420
Aug  8 06:07:44 HP-Mini kernel: [   10.104499]  [<ffffffff81081dd3>] ? worker_thread+0x113/0x4f0
Aug  8 06:07:44 HP-Mini kernel: [   10.104505]  [<ffffffff815105c1>] ? __schedule+0x2b1/0x700
Aug  8 06:07:44 HP-Mini kernel: [   10.104511]  [<ffffffff81081cc0>] ? rescuer_thread+0x2d0/0x2d0
Aug  8 06:07:44 HP-Mini kernel: [   10.104519]  [<ffffffff8108800d>] ? kthread+0xbd/0xe0
Aug  8 06:07:44 HP-Mini kernel: [   10.104526]  [<ffffffff81087f50>] ? kthread_create_on_node+0x180/0x180
Aug  8 06:07:44 HP-Mini kernel: [   10.104533]  [<ffffffff81514158>] ? ret_from_fork+0x58/0x90
Aug  8 06:07:44 HP-Mini kernel: [   10.104540]  [<ffffffff81087f50>] ? kthread_create_on_node+0x180/0x180
Aug  8 06:07:44 HP-Mini kernel: [   10.104544] ---[ end trace 6a04776659b650d3 ]---

and:

Aug  8 06:10:14 HP-Mini kernel: [  169.942572] general protection fault: 0000 [#1] SMP 
Aug  8 06:10:14 HP-Mini kernel: [  169.942761] Modules linked in: bnep ctr ccm nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc arc4 snd_hda_codec_idt ath9k ath9k_common snd_hda_codec_generic ath9k_hw uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common coretemp videodev snd_hda_intel media ath3k btusb snd_hda_controller kvm ath hp_wmi snd_hda_codec bluetooth mac80211 i915 iTCO_wdt drm_kms_helper cfg80211 drm 6lowpan_iphc iTCO_vendor_support sparse_keymap snd_hwdep snd_pcm rfkill ac shpchp wmi i2c_i801 i2c_algo_bit joydev evdev snd_timer serio_raw pcspkr lpc_ich mfd_core i2c_core snd video battery soundcore button acpi_cpufreq processor fuse autofs4 ext4 crc16 mbcache jbd2 ums_realtek sg sd_mod crc_t10dif crct10dif_generic crct10dif_common usb_storage ahci libahci ehci_pci uhci_hcd ehci_hcd libata psmouse scsi_mod usbcore usb_common r8169 mii fan thermal thermal_sys
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] CPU: 2 PID: 836 Comm: Xorg Tainted: G        W     3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] task: ffff88007b236d20 ti: ffff880079294000 task.ti: ffff880079294000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RIP: 0010:[<ffffffff812bada0>]  [<ffffffff812bada0>] sg_next+0x0/0x30
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RSP: 0018:ffff880079297b80  EFLAGS: 00010202
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RAX: ea00011d51829182 RBX: 0000000000000001 RCX: ffff880036b38880
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RDX: ffff880036b38700 RSI: 0000000000000000 RDI: ea00011d51829182
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] RBP: 000000000000ffff R08: 0000000007637000 R09: 0000000000000000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] R10: 0000000007800000 R11: 0000000000000000 R12: ea00011d51829182
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] R13: ffff88007c7ee098 R14: ffffffff8181f660 R15: ffff8800628d1900
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] FS:  00007f4281a3c980(0000) GS:ffff88007f300000(0000) knlGS:0000000000000000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] CR2: 00007f428157e000 CR3: 000000007b2ff000 CR4: 00000000000007e0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Stack:
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  ffffffffa03b7b7b ffff88007c1b7a08 0000000000001000 0000000000001000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  ffff880079e0b800 0000000000000000 ffffffffa03bdf8f 0000000020000000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  0000000000000000 ffff880000000000 0000000000000000 ffff88007c1b0000
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Call Trace:
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03b7b7b>] ? i915_gem_gtt_prepare_object+0x6b/0xb0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03bdf8f>] ? i915_gem_object_pin+0x57f/0x780 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa0421559>] ? i915_gem_execbuffer_reserve_vma.isra.16+0x95/0x11a [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa042182a>] ? i915_gem_execbuffer_reserve+0x24c/0x2dc [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03b358d>] ? i915_gem_do_execbuffer.isra.24+0x89d/0x13f0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03bcc8b>] ? i915_gem_object_put_fence+0x1b/0xc0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa03b459f>] ? i915_gem_execbuffer2+0xaf/0x2b0 [i915]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffffa02db8a7>] ? drm_ioctl+0x1c7/0x5b0 [drm]
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff811be12e>] ? dput+0x9e/0x170
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff811ba9af>] ? do_vfs_ioctl+0x2cf/0x4b0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff81085261>] ? task_work_run+0x91/0xb0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff811bac11>] ? SyS_ioctl+0x81/0xa0
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff815144ca>] ? int_signal+0x12/0x17
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  [<ffffffff8151420d>] ? system_call_fast_compare_end+0x10/0x15
Aug  8 06:10:14 HP-Mini kernel: [  169.944026] Code: 27 fa ff ff 0f 1f 80 00 00 00 00 c7 47 10 00 00 00 00 89 57 0c 48 89 37 89 4f 08 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <f6> 07 02 75 13 48 8b 57 20 48 8d 47 20 f6 c2 01 75 09 f3 c3 0f 
Aug  8 06:10:14 HP-Mini kernel: [  169.944026]  RSP <ffff880079297b80>
Aug  8 06:10:14 HP-Mini kernel: [  170.030137] ---[ end trace 6a04776659b650d4 ]---

and also:

Aug  8 06:07:44 HP-Mini kernel: [   11.789183] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug  8 06:07:44 HP-Mini kernel: [   11.789203] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789228] ACPI Error: Method parse/execution failed [\_SB_.WMID.WMAD] (Node ffff88007e852d10), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789420] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug  8 06:07:44 HP-Mini kernel: [   11.789437] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789461] ACPI Error: Method parse/execution failed [\_SB_.WMID.WMAD] (Node ffff88007e852d10), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
Aug  8 06:07:44 HP-Mini kernel: [   11.789643] ACPI Error: Field [D128] at 1040 exceeds Buffer [NULL] size 160 (bits) (20140424/dsopcode-236)
Aug  8 06:07:44 HP-Mini kernel: [   11.789659] ACPI Error: Method parse/execution failed [\_SB_.WMID.HWMC] (Node ffff88007e852f40), AE_AML_BUFFER_LIMIT (20140424/psparse-536)

All of this ended in a freeze (hard reboot required). Aren't the dumps related?

Comment 448 D. Hugh Redelmeier 2016-08-08 18:17:02 UTC

@Maciej Hrebien #447
You don't explain much about your system.  Buried in the log is "Hardware name: Hewlett-Packard HP Mini 210-3000/3594, BIOS F.13 11/10/2011".  This seems to be a notebook with a pineview or earlier Atom processor.  Not the subject of this bugzilla entry.

Comment 449 Maciej Hrebien 2016-08-09 04:47:00 UTC

Yes, it's N570 chip and I can share more details if needed. I thought the dumps are related as setting cstate to 1 makes the device usable (running ~12h now without the freeze). The 3.8.2 kernel seems to be working fine for me that is without any freezes and workarounds.

Comment 450 Kevin 2016-08-09 19:41:54 UTC

Hello, my computer is an Acer E5-511P. It always crashed randomly under Linux (the problem above); I was able to resolve the issue with "intel_idle.max_cstate=1" and by blacklisting dw_dmac and dw_dmac_core.

A fun fact: now I'm using Windows 10 with the "Windows Subsystem for Linux" (Ubuntu 14.04, Linux kernel version 3.4), and my computer has crashed two times since then (the UI froze and the CPU fan was spinning at top speed).
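
The module blacklisting mentioned above is normally done with a file under /etc/modprobe.d. A minimal sketch, assuming a Debian/Ubuntu-style layout; the helper name write_blacklist and the file name used in the usage note are hypothetical:

```shell
#!/bin/sh
# Write a modprobe blacklist file.
# $1 = target .conf file, remaining arguments = module names to blacklist.
write_blacklist() {
    conf=$1
    shift
    # Start from an empty file so repeated runs do not duplicate entries.
    : > "$conf"
    for module in "$@"; do
        printf 'blacklist %s\n' "$module" >> "$conf"
    done
}
```

As root this could be called as write_blacklist /etc/modprobe.d/blacklist-dw_dmac.conf dw_dmac dw_dmac_core, followed by update-initramfs -u and a reboot on Debian or Ubuntu.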

Comment 451 Andy Shevchenko 2016-08-10 14:54:40 UTC

(In reply to Kevin from comment #450)

> and blacklisting dw_dmac and dw_dmac_core.

That should be solved in v4.5. So if you have kernel v4.5+, please try again without blacklisting dw_dmac. You may refer to bug #101271 for the details.

Comment 452 Juha Sievi-Korte 2016-08-22 16:43:36 UTC

(In reply to Wolfgang M. Reimer from comment #437)
> Running my submitted scripts
> 
> https://bugzilla.kernel.org/attachment.cgi?id=223851
> https://bugzilla.kernel.org/attachment.cgi?id=223861
> 
> on a J1900 system should produce a similar output:
> 
> As one can see in my case most of the core's idle time is now spent in state
> C7S-BYT.

Thanks a lot!

I've had my system running since then with the C6 states disabled and no freezes. I've done one reboot to update the kernel, and the current session has 21 days of uptime now. This is on an N3540 laptop on which I've had quite random but steady occurrences of freezes over the last year.

So might be too early to declare success, but it seems promising for now.

Comment 453 Srdjan Todorovic 2016-08-22 21:49:18 UTC

I tried disabling the C6 states with Wolfgang M. Reimer's scripts (C6 events didn't seem to be increasing afterwards), but had a lockup within 2 hours.

Linux htpc 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:06:39 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 8
microcode       : 0x831

Board is Asrock Mod Q1900ITX

My use case: launch Kodi, start playing a DVD for 20 minutes, pause the DVD for 30 minutes, resume playing for perhaps 10 minutes, then pause again for 20 minutes. When I tried to resume playback, the system was unresponsive. Even the reset button didn't respond.

Just booted with intel_idle.max_cstate=1, will report if this has same issue.

Comment 454 smaj 2016-08-25 12:44:51 UTC

I disabled the C6 states as described by Wolfgang M. Reimer. No crashes for two days now, whereas before this J1900-based system crashed reliably within less than an hour of uptime.

Board is ASRock Q1900-ITX 


processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping	: 3
microcode	: 0x320
cpu MHz		: 1521.891
cache size	: 1024 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
bugs		:
bogomips	: 3993.60
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

Comment 455 cscs 2016-08-26 20:39:59 UTC

Thanks to Wolfgang M. Reimer, I am now running 4.7.2 with no problems.
As per https://bugzilla.kernel.org/show_bug.cgi?id=109051#c437

Here is my relevant information;

lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 55
Model name:            Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz
Stepping:              8
CPU MHz:               499.677
CPU max MHz:           2665.6001
CPU min MHz:           499.8000
BogoMIPS:              4328.66
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat

Base Board Information
	Manufacturer: Dell Inc.
	Product Name: 0H4MK6
	Version: A00
	Serial Number: .HS1LK52.CN7620657U0002.

Comment 456 RussianNeuroMancer 2016-08-27 05:23:27 UTC

How does disabling C6 while leaving C7 enabled affect battery life?

Comment 457 fdservices 2016-08-27 17:32:56 UTC

(In reply to cscs from comment #455)
> Thanks to Wolfgang M. Reimer, I am now running 4.7.2 with no problems.
> As per https://bugzilla.kernel.org/show_bug.cgi?id=109051#c437

As far as I can tell, I too am running 4.7.2 with no problem - Arch Linux, Acer Travelmate B115M

Comment 458 mohican 2016-08-29 13:26:03 UTC

Hello, on a Lenovo E50-00 with CPU Intel Pentium J2900
I had these random freezes.

In addition I also have a freeze when RESUMING AFTER SUSPEND.
Therefore when I tested more distributions (kernel versions) I just tested the resume after suspend.
(Hibernate does work fine.)

I had the bug with :
- Linux Mint 18.0 (based on Ubuntu 16.04) with kernel 4.4
- Ubuntu 14.04
- Ubuntu 12.04 with kernel 3.13

I also had the bug when I changed CPU's BIOS setting to C1 only.

Comment 459 jbMacAZ 2016-08-30 17:33:50 UTC

The c6-off/c7-on script is also effective on Z3775 baytrail processor in my ASUS T100CHI (and is reported effective for other ASUS T100T* models.)  Listed cstates for the Z3775 are (POLL,C1-BYT,C6N-BYT,C6S-BYT,C7-BYT,C7S-BYT)

Now my only kernel arguments are "tsc=reliable clocksource=tsc".  I no longer need intel_idle.max_cstate={0,1}.  Even with recent kernels, the T100CHI would rarely go more than 30 minutes without a freeze unless cstate was limited.  I also had freezes when trying max_cstate=3.

Many thanks for tracking this one down.

Comment 460 ladiko 2016-08-30 18:12:13 UTC

Tried the c6-disabling script on asrock q1900itx-m. Ran for a whole night while otherwise it would freeze within 1 or 2 hours and had to use the max_cstate=1-fix. Will roll it out to 50 more machines until December. Not a fix but an ok workaround for this issue.

Comment 461 Paul Nijenhuis 2016-09-01 16:01:56 UTC

I've also tried the c6-disabling script; I made a startup service for it
on openSUSE Tumbleweed with the latest updates. It's now running smoothly,
whereas at first I had disabled the C6 and C7 states in the UEFI BIOS.
Now I've re-enabled the states and everything seems to run OK.
No freezes for 4 hours now.
output from cstateInfo.sh :
cpu0 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    29315450    16010
         1  C1-BYT          0        1          1  2075922404  2720073
         2  C6N-BYT         1      300        275     1298175      459
         3  C6S-BYT         1      500        560     6214377     1612
         4  C7-BYT          0     1200       4000  1228124502   139642
         5  C7S-BYT         0    10000      20000   474724588    20423
cpu1 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    34446485    17262
         1  C1-BYT          0        1          1  2026994454  2604049
         2  C6N-BYT         1      300        275     1339377      414
         3  C6S-BYT         1      500        560     5097170      895
         4  C7-BYT          0     1200       4000  1215493749   130038
         5  C7S-BYT         0    10000      20000   554333717    22140
cpu2 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    32861042    15934
         1  C1-BYT          0        1          1  2994741581  5739338
         2  C6N-BYT         1      300        275      958074      269
         3  C6S-BYT         1      500        560     5137562     1061
         4  C7-BYT          0     1200       4000   533053353    77172
         5  C7S-BYT         0    10000      20000   111720079     5108
cpu3 State  Name     Disabled  Latency  Residency        Time    Usage
         0  POLL            0        0          0    26658663    12680
         1  C1-BYT          0        1          1  2232165052  3062867
         2  C6N-BYT         1      300        275      900500      238
         3  C6S-BYT         1      500        560     4047949      844
         4  C7-BYT          0     1200       4000  1198658992   148698
         5  C7S-BYT         0    10000      20000   307666394    14599

And lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 55
Model name:            Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
Stepping:              8
CPU MHz:               1332.718
CPU max MHz:           2415.7000
CPU min MHz:           1332.8000
BogoMIPS:              3993.60
Virtualization:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat

Comment 462 Paul Nijenhuis 2016-09-01 16:21:01 UTC

(In reply to Paul Nijenhuis from comment #461)
> I've also tried the c6-disabling script, i made a startup service for it
> on OpenSuse Tumbleweed with the latest updates. It's now running smoothly
> while first i did disable the c6 and c7 state in the UEFI BIOS.
> Now i've re-enabled the states and everything seems to run ok.
> No freezes for 4 hours now.

OK, it froze again after 4.5 hours :-(
Back to enabling only C1 in the BIOS.....

Comment 463 amjafuso 2016-09-02 09:31:44 UTC

Newest Kernel 4.7.2 crashed after 3 hours. Going back to intel_idle.max_cstate=1.

Who cares? More than 400 comments and the status is still NEW? I might give up reporting to this thread...

Comment 464 Hal 2016-09-02 16:50:25 UTC

On my Celeron N2930 system I switched from intel_idle.max_cstate=1 to using c6off+c7on.sh a couple of days ago. Everything works quite well so far.
cstateInfo.sh confirms that there is no C6N-BYT or C6S-BYT being used. As this is a very active machine running 3 virtual machines (light load) the CPU is mostly in the C1E-BYT state. C7-BYT and C7S-BYT also get a good hit.
So, my experience so far is very positive. The only surprising thing is that my box is not running much cooler than before this change. I guess the active CPU load explains that.
Hal
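
One way to double-check that the C6 states are really no longer entered is to sum their usage counters from sysfs and compare two samples. The following is only a sketch under that assumption; the function name c6_usage_total is hypothetical, and it treats every state whose name starts with C6 as a C6 state.

```shell
#!/bin/sh
# Sum the usage counters of all cpuidle states whose name starts with "C6".
# The optional argument overrides the sysfs base directory (useful for testing).
c6_usage_total() {
    base=${1:-/sys/devices/system/cpu}
    total=0
    for state in "$base"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$state/name" ] || continue
        case "$(cat "$state/name")" in
            C6*) total=$((total + $(cat "$state/usage"))) ;;
        esac
    done
    echo "$total"
}
```

Taking one sample, waiting a minute and taking a second one should give the same number if the C6 states are disabled on every core.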

Comment 465 Hal 2016-09-02 16:56:08 UTC

(In reply to Hal from comment #464)
> On my Celeron N2930 system I switched from intel_idle.max_cstate=1 to using
> c6off+c7on.sh a couple of days ago. Everything works quite well so far.
> cstateInfo.sh confirms that there is no C6N-BYT or C6S-BYT being used. As
> this is a very active machine running 3 virtual machines (light load) the
> CPU is mostly in the C1E-BYT state. C7-BYT and C7S-BYT also get a good hit.
> So, my experience so far is very positive. The only surprising thing is that
> my box is not running much cooler than before this change. I guess the
> active CPU load explains that.
> Hal

I also wanted to add that I've been using kernel version 4.5.7.
Finally a question too: What would be the best way to launch c6off+c7on.sh. I have it currently in a "session and startup" entry in xfce4. But it would probably make sense to start it before xfce or even xorg is launched.
Hal

Comment 466 ladiko 2016-09-03 05:18:32 UTC

The following adds the function as a system service on Ubuntu 14.04. Later versions use systemd services, so it is different there, but I don't have the files here.




cat > /etc/init.d/c6off+c7on.sh <<'EOF'
#!/bin/sh
# Disable the C6 idle states and enable the C7 idle states on every CPU.
for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
	case "$(cat "${state}/name")" in
		C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
		C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
	esac
done
EOF
chown root:root /etc/init.d/c6off+c7on.sh
chmod 755 /etc/init.d/c6off+c7on.sh
# update-rc.d expects the script name, not its path
update-rc.d -f c6off+c7on.sh start 90 2 .



I think the bug also exists on Cherry Trail, so I added |C6*-CHT and |C7*-CHT. If the bug doesn't affect it, just remove them.
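
For the systemd-based releases mentioned in this comment, a oneshot unit is one possible equivalent. This is only a sketch under assumptions: the script path /usr/local/sbin/c6off-c7on.sh and the function name print_cstate_unit are hypothetical and do not come from the thread.

```shell
#!/bin/sh
# Print a oneshot unit that runs the C-state script once at boot.
# The default script path is a hypothetical install location.
print_cstate_unit() {
    script="${1:-/usr/local/sbin/c6off-c7on.sh}"
    cat <<EOF
[Unit]
Description=Disable C6 and enable C7 idle states (Bay Trail workaround)

[Service]
Type=oneshot
ExecStart=$script

[Install]
WantedBy=multi-user.target
EOF
}

print_cstate_unit
```

As root, the output could be redirected into a file such as /etc/systemd/system/cstate-workaround.service (again a made-up name), followed by systemctl daemon-reload and systemctl enable cstate-workaround.service.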

Comment 467 Martin 2016-09-03 08:09:08 UTC

I've been running the previously unstable kernel 4.5.4 without max_cstate=1 using the c6off+c7on script for more than 3 days now and have yet to see the dreaded freeze.

$ uptime 
 10:02:05 up 3 days, 13:13,  2 users,  load average: 0,16, 0,13, 0,14

$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 3
...

Seems to be the holy grail for me! Thx a lot!
Too bad intel never chipped in with the addendum before. Shame!

I put the c6off+c7on script in /etc/rc.local.

Comment 468 Juha Sievi-Korte 2016-09-04 19:11:15 UTC

(In reply to Juha Sievi-Korte from comment #452)
> (In reply to Wolfgang M. Reimer from comment #437)
> > Running my submitted scripts
> > 
> > https://bugzilla.kernel.org/attachment.cgi?id=223851
> > https://bugzilla.kernel.org/attachment.cgi?id=223861
> > 
> > on a J1900 system should produce a similar output:
> > 
> > As one can see in my case most of the core's idle time is now spent in
> state
> > C7S-BYT.
> 
> Thanks a lot!
> 
> I've had my system running since then with disabled C6 state and no freezes.
> I've done one reboot to update kernel, 21 days uptime on current session
> now. This is on N3540 laptop with which I've had quite random but steady
> occurences of freezes over last year.
> 
> So might be too early to declare success, but it seems promising for now.

And today two crashes with c6 disabled by this script, so this wasn't the root cause either. Phew. Back to two hour battery life it is...

Comment 469 ladiko 2016-09-04 19:14:36 UTC

Crashed or freezes?

Comment 470 Juha Sievi-Korte 2016-09-04 19:38:56 UTC

(In reply to ladiko from comment #469)
> Crashed or freezes?

Sorry about that, it froze. The first freeze happened within the 'long' uptime session and the next within an hour of a reboot, so it seems just as random to me as before. The C-state script was run early in boot-up.

Comment 471 ladiko 2016-09-04 19:47:38 UTC

It's fine. I just wanted to be sure what we are talking about. I haven't pushed it to the other 50 machines yet. Only the one I tested it on ran stable. So you checked the C6 state after boot? I will run a long-term test of several days before I roll it out to the other machines.

Comment 472 Juha Sievi-Korte 2016-09-04 20:26:17 UTC

(In reply to ladiko from comment #471)
> It's fine. Just wanted to be sure what we talk about. Didn't yet pushed it
> to the other 50 machines. Just the one I tested it on ran stable. So you
> checked the c6 state after boot? I will run a long term test. Like several
> days before I roll it out to the other machines.

Yep, checked that the script was run ok and c6 wasn't active after the last reboot.

As some folks still seem to have promising results with this script, I think I'll let it still run for a while to see the effects in longer run. Perhaps it makes some events that cause this issue to happen less frequently.

I checked that the last log entry was ~40 minutes after the reboot on the last attempt, when it froze. The system was just sitting idle by itself when it happened, with one ssh session open on the desktop plus a few browser tabs.

Comment 473 ichudov 2016-09-04 20:30:43 UTC

I have a computer where I had constant crashes. I set "intel_idle.max_cstate=1" and now it stays up for weeks and never crashes.
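
On Ubuntu this parameter is usually set through /etc/default/grub rather than by editing grub.cfg by hand. The following is a rough sketch that assumes the stock double-quoted GRUB_CMDLINE_LINUX_DEFAULT line and GNU sed; the helper name add_kernel_param is made up for this example.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless the
# line already contains it. Assumes the value is in double quotes.
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0    # already present, nothing to do
    fi
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
}

# Example (as root): add_kernel_param intel_idle.max_cstate=1 && update-grub
```

After a reboot, cat /proc/cmdline shows whether the parameter actually reached the kernel.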

Comment 474 jbMacAZ 2016-09-06 08:06:05 UTC

Ran 44 hours with c6off+c7on.sh before freezing hard.  Usually would freeze within 30 minutes without any cstate limits.  Z3775 might have issue with C7 states. The system was idle when it locked up.  

Will there be newer uCode that would help my Bay Trail?

Comment 475 rkrambovitis 2016-09-07 06:14:07 UTC

I have had 0 lockups (3 days) using 4.8-rc5 from ubuntu mainline archives.
No patches, max_cstate settings or anything.
Before that I was using 3.16 kernel which was the last stable one for me (baytrail).

Unfortunately my hdmi is not working with this kernel.

Anyone else wanna try and report ?

Comment 476 Martin 2016-09-07 06:50:46 UTC

(In reply to Martin from comment #467)
> I've been running the previously unstable kernel 4.5.4 without max_cstate=1
> using the c6off+c7on script for more than 3 days now and have yet see the
> dreaded freeze.
> 
> $ uptime 
>  10:02:05 up 3 days, 13:13,  2 users,  load average: 0,16, 0,13, 0,14
> 
> $ cat /proc/cpuinfo 
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 55
> model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
> stepping        : 3
> ...
> 
> Seems to be the holy grail for me! Thx a lot!
> Too bad intel never chipped in with the addendum before. Shame!
> 
> I put the c6off+c7on script in /etc/rc.local.

I regret having to backpedal on my statement above. After many days of stable TV watching, our HTPC was unresponsive and I had to power-cycle it to get back in business.

Comment 477 Hal 2016-09-08 19:22:04 UTC

Follow-up on my week-old post: c6off+c7on still works well for my Zotac system. My stats are below:

Thu Sep  8 15:18:37 EDT 2016
 15:18:37 up 7 days,  6:46,  2 users,  load average: 6.95, 7.15, 7.28

  *-cpu
       description: CPU
       product: Intel(R) Celeron(R) CPU  N2930  @ 1.83GHz
       vendor: Intel Corp.
       physical id: 34
       bus info: cpu@0
       version: Intel(R) Celeron(R) CPU N2930 @ 1.83GHz
       slot: SOCKET 0
       size: 2165MHz
       capacity: 2400MHz
       width: 64 bits
       clock: 83MHz
       capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch ida arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms cpufreq
       configuration: cores=4 enabledcores=4 threads=4

cpu0 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0      22302321    2554713
         1  C1-BYT          0        1          1   15757596319  262145340
         2  C1E-BYT         0       15         30  119156258736  521385788
         3  C6N-BYT         1       40        275        456080        865
         4  C6S-BYT         1      140        560        855986        870
         5  C7-BYT          0     1200       1500    3704252148    5761453
         6  C7S-BYT         0    10000      20000      61455590       3197
cpu1 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0     208494519    3428799
         1  C1-BYT          0        1          1   17097828007  280365124
         2  C1E-BYT         0       15         30  117773052639  522674956
         3  C6N-BYT         1       40        275        376329        506
         4  C6S-BYT         1      140        560        784219        596
         5  C7-BYT          0     1200       1500    3331593582    4844889
         6  C7S-BYT         0    10000      20000      59530341       2921
cpu2 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0      21634146    2606774
         1  C1-BYT          0        1          1   16503086447  284253332
         2  C1E-BYT         0       15         30  122405265787  542914915
         3  C6N-BYT         1       40        275        537565        835
         4  C6S-BYT         1      140        560        626723        544
         5  C7-BYT          0     1200       1500    3845541641    5968789
         6  C7S-BYT         0    10000      20000      43528414       2486
cpu3 State  Name     Disabled  Latency  Residency          Time      Usage
         0  POLL            0        0          0      22154717    2630336
         1  C1-BYT          0        1          1   16894582028  282524229
         2  C1E-BYT         0       15         30  123929531803  549088123
         3  C6N-BYT         1       40        275        313412        440
         4  C6S-BYT         1      140        560        722191        486
         5  C7-BYT          0     1200       1500    4109133756    6219843
         6  C7S-BYT         0    10000      20000      52241428       2709

Hal

Comment 478 Dmitry 2016-09-09 08:39:36 UTC

I haven't written here for some time. The latest kernels (~4.8-rc5) on a Z3770 aren't stable without the max_cstate parameter. I hit this bug even during the initramfs root switch stage. See bug 150881. The kernel freezes within tens of seconds of entering the idle state.

Comment 479 sfumato1977 2016-09-09 12:55:00 UTC

On my tablet (Z3735F CPU) with kernels 4.4 to 4.8,
"stability" is guaranteed for less than 15 minutes.
With intel_idle.max_cstate=1 and clocksource=tsc added to grub.cfg, however,
I can use it for days.

P.S.
I suspect the clocksource "refined-jiffies".
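
Which clocksource the kernel is actually using can be read from sysfs. A small sketch, with the function name show_clocksource and its directory argument introduced here purely for illustration:

```shell
#!/bin/sh
# Print the clocksource in use and the ones the kernel considers usable.
show_clocksource() {
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    if [ ! -r "$dir/current_clocksource" ]; then
        echo "no clocksource information under $dir" >&2
        return 0
    fi
    printf 'current: %s\n' "$(cat "$dir/current_clocksource")"
    printf 'available: %s\n' "$(cat "$dir/available_clocksource")"
}

show_clocksource
```

With clocksource=tsc on the kernel command line, the first line would be expected to read "current: tsc" if the kernel accepted the request.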

Comment 480 Dmitry 2016-09-09 13:04:45 UTC

If kernel is unstable when cpu is in idle, then jiffies are no good.

For me, a zero response from Intel is evidence that there is a huge mistake in hardware. But I don't believe that it can't be overcome by any software changes.

Comment 481 László Kara 2016-09-09 18:34:00 UTC

(In reply to Dmitry from comment #480)
> If kernel is unstable when cpu is in idle, then jiffies are no good.
> 
> For me, a zero response from Intel is evidence that there is a huge mistake
> in hardware. But I don't believe that it can't be overcome by any software
> changes.

It does not crash on Windows. Some software change must work.

Comment 482 Hal 2016-09-10 19:29:40 UTC

(In reply to Dmitry from comment #480)
 > For me, a zero response from Intel is evidence that there is a huge mistake
 > in hardware. But I don't believe that it can't be overcome by any software
 > changes.
> 
> It does not crash on Windows. Some software change must work.

Not only Windows XP, NT, Vista, 7, 8, 10 work without problem but also BSD distributions work well (including pfsense). Linux kernels 2.x and 3.0 kernels (probably all the way up to 3.16 - although on one of my systems 3.16 freezes) seem to work too.

There is no doubt there is a bug related to the cstates in some of those Intel processors but evidently software workarounds or fixes are possible. I think the kernel team should be equally accountable for a definitive fix as is Intel (with no disrespect intended to either group of engineers - actually eternal gratitude to all Linux/GNU people for an outstanding platform).

Now, is the CPU bug a design problem or due to 22nm manufacturing process issues? That's an intriguing question in my mind, because I happen to own 2 identical boxes (Intel NUC) manufactured (by Intel) just a few months apart, one with the freezing problem and the other without. The steppings on the CPU and BIOS versions are unfortunately not identical.
So whether the 'fix' is part of a newer CPU stepping or part of a remediation microcode loaded through the BIOS at start up, it looks like the CPU is able to run all cstates. That tells me that a post-manufacturing CPU microcode fix is always possible. 

Hal

Comment 483 Paul Mansfield 2016-09-10 21:50:33 UTC

I am inclined to agree with Hal. There's a group of three of us with Z3735F-based Toshiba Click Minis. Despite all running the same UEFI firmware versions and dock firmware versions, mine seems to be less stable. It could also be down to the other choices we've made in terms of root partitions on memory cards, but we have no solid proof. If we could attach a simple serial console we might have some hope.

I'm not sure if everybody's seen the Shark's Cove reference design
http://www.cnx-software.com/2014/07/30/sharks-cove-intel-atom-bay-trail-t-development-board-for-windows-8-1-is-now-available-for-299/

the links from that are broken, but I managed to track down the
technical docs for it:

https://firmware.intel.com/sites/default/files/Sharks_Cove_Schematic.pdf

http://composter.com.ua/documents/Sharks-Cove-Technical-Specifications.pdf


maybe it will help someone who understands drivers to fix the situation.

Comment 484 Dmitry 2016-09-11 09:07:57 UTC

If this bug depends on firmware then I have bad news. Atom Bay Trail doesn't support microcode loading. First I tried three different ways of loading microcode and none of them succeeded. Then I found a list of CPUs which support microcode loading, and the Z3770 is not on it.

My microcode: sig=0x30673, pf=0x2, revision=0x324

P.S. There is no microcode for 06-37-03 (Family-6,Model-55,Stepping-3).
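
The 06-37-03 name above is simply family, model and stepping written as two-digit hexadecimal numbers, which is how the Intel microcode files are conventionally named. A tiny sketch of that conversion, with ucode_name as a made-up helper:

```shell
#!/bin/sh
# Format family, model and stepping (decimal, as shown in /proc/cpuinfo)
# as the family-model-stepping name used for Intel microcode files.
ucode_name() {
    printf '%02x-%02x-%02x\n' "$1" "$2" "$3"
}

ucode_name 6 55 3    # prints 06-37-03
```

On an x86 system, grep -m1 microcode /proc/cpuinfo shows the revision that is currently loaded.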

Comment 485 julio.borreguero@gmail.com 2016-09-11 10:21:03 UTC

the only one able to fix this would be intel, at least in a proper way.
obviously they don't give a shit. baytrail is old, on top low-cost. they want to sell new stuff.
let's face it, if you have a workaround for this bug (like me running kernel 4.1.12) then you are lucky.
i assume we are left alone with this and nothing is going to happen from intels side.
if you are clever never buy intel or nvidia again.
i probably won't.
that is all i can do.

Comment 486 Martin Brand 2016-09-12 17:53:57 UTC

Hi, I have just found the following link where an Intel employee seems to have found a problem and has already developed some kind of a solution.
However I have not been able to find out if this has been added to the kernel.
Here is the link:
https://lkml.org/lkml/2015/3/24/271
Perhaps somebody with more experience than myself could have a look at this?

Comment 487 Daniel Glöckner 2016-09-12 17:58:35 UTC

(In reply to Martin Brand from comment #486)
> Hi, I have just found the following link where an Intel employee seems to
> have found a problem and has already developed some kind of a solution.

See Comment 55. This patch is not enough to fix the problem.

Comment 488 BzukTuk 2016-09-12 19:01:41 UTC

(In reply to BzukTuk from comment #378)
> Hi again,
> Kernel 4.6 + Mika Kuoppalas 3 _tentative_ patches +
> linux-999-i915-use-legacy-turbo.patch = over 120h in one single session
> (without reboot/sleep..) and another 20+/- hours in few 3-4hour long
> sessions without single freeze. Still counting...
> 
> (some of Adrian Hunters patches for pm/mmc were also applied, but I dont
> think (hope) this matters)
> ...

Hi there again,
just so you know, I have not experienced a !single! freeze on fresh kernels (>=4.6) since I started using Mika's patches + the legacy turbo patch together (as mentioned above). Could anyone with freeze issues and a non-laptop machine give it a long stress test? I have just a tablet/laptop and I don't want to wreck the battery/LCD (I also can't turn the display off, but that is another story). I did not count exactly, but I think I have >500 hours of uptime without any freeze on this device.

Mika`s tentative patches
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=e564271291fa70265b53fa34c01cbb0ae6282e81
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=7e6c3f36563d133cff5b700d9c36b12ac2a0c643
https://cgit.freedesktop.org/~miku/drm-intel/commit/?h=rc6_test&id=b2f08adb19fcb18fea7cda9908fa52e2b9db5e7f

Legacy turbo:
https://github.com/OpenELEC/OpenELEC.tv/blob/master/packages/linux/patches/4.7.3/linux-999-i915-use-legacy-turbo.patch

Comment 489 Rick Lee 2016-09-13 02:31:01 UTC

(In reply to Martin Brand from comment #486)
> Hi, I have just found the following link where an Intel employee seems to
> have found a problem and has already developed some kind of a solution.
> However I have not been able to find out if this has been added to the
> kernel.
> Here is the link:
> https://lkml.org/lkml/2015/3/24/271
> Perhaps somebody with more experience than myself could have a look at this?

That message is 18 months old... Hardly qualifies as "hot off the press" by the standards of Intel's Moore's law.

Comment 490 Paul Nijenhuis 2016-09-13 05:56:40 UTC

(In reply to BzukTuk from comment #488)
> (In reply to BzukTuk from comment #378)
> > Hi again,
> > Kernel 4.6 + Mika Kuoppalas 3 _tentative_ patches +
> > linux-999-i915-use-legacy-turbo.patch = over 120h in one single session
> > (without reboot/sleep..) and another 20+/- hours in few 3-4hour long
> > sessions without single freeze. Still counting...
> > 
> > (some of Adrian Hunters patches for pm/mmc were also applied, but I dont
> > think (hope) this matters)
> > ...
> 
> Hi there again,
> just so you know, I did not experienced !single! freeze on fresh kernels
> (>=4.6) since I started using Mika`s patches + legacy turbo patch together
> (as mentioned above). Could anyone with freeze issues and non-laptop machine
> give it a long stress test? I have just tablet/laptop and I dont want to
> wreck battery/LCD (also can`t turn display off - another story). I did not
> count exactly, but I think I have >500 hour long uptime without any freeze
> on this device.
> 
> Mika`s tentative patches
> https://cgit.freedesktop.org/~miku/drm-intel/commit/
> ?h=rc6_test&id=e564271291fa70265b53fa34c01cbb0ae6282e81
> https://cgit.freedesktop.org/~miku/drm-intel/commit/
> ?h=rc6_test&id=7e6c3f36563d133cff5b700d9c36b12ac2a0c643
> https://cgit.freedesktop.org/~miku/drm-intel/commit/
> ?h=rc6_test&id=b2f08adb19fcb18fea7cda9908fa52e2b9db5e7f
> 
> Legacy turbo:
> https://github.com/OpenELEC/OpenELEC.tv/blob/master/packages/linux/patches/4.
> 7.3/linux-999-i915-use-legacy-turbo.patch

can you please point out how to apply these patches?
Thanx in advance
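
Patches like these are normally applied from the top of an unpacked kernel source tree with patch -p1. A hedged sketch, assuming the patches were saved as local files first; the function name apply_patches and the file names in the example are hypothetical.

```shell
#!/bin/sh
# Apply one or more patch files to a source tree, stopping at the first
# one that does not apply. A dry run comes first so that a failing patch
# leaves the tree untouched.
apply_patches() {
    tree="$1"; shift
    for p in "$@"; do
        if ! patch -d "$tree" -p1 --dry-run < "$p" >/dev/null; then
            echo "does not apply: $p" >&2
            return 1
        fi
        patch -d "$tree" -p1 < "$p" >/dev/null || return 1
    done
}

# Example with made-up names:
#   apply_patches ~/src/linux-4.7.3 mika-1.patch mika-2.patch mika-3.patch legacy-turbo.patch
```

The kernel then has to be rebuilt and installed in whatever way the distribution prescribes.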

Comment 491 Christian First 2016-09-13 14:45:10 UTC

Hello, I've got a Biostar J1900 with a Celeron quad core.
The freeze occurs with all Ubuntu 16.04 based distributions. I've tested it with Ubuntu 16.04, Lubuntu 16.04 and Ubuntu Mate 16.04. Even Ubuntu 14.04.5 freezes.
Now I am running Zorin 9 and Ubuntu 14.04.4. The kernel in use is 3.13.0-95-generic. I would volunteer as a tester. Is there perhaps someone from Germany here who could support me? Regards Christian

Comment 492 sfumato1977 2016-09-13 14:56:37 UTC

Could it be because this is missing?

Atom PMC platform clocks:

drivers/clk/x86/clk-byt-plt.c:

https://patchwork.kernel.org/patch/9286345/

Comment 493 jbMacAZ 2016-09-14 17:47:13 UTC

Testing the c6off/c7on script with a Z3775 system at idle.  First freeze took 44 hours.  Subsequent freezes take about 3 hours.  Turning off C7-BYT extends to 4 hours of idling before freezing.  Turning off C7S-BYT on just one core gets me running again w/o freezing.  Effectively, one core is set to intel_idle.max_cstate=1 while the others could allow C7S-BYT.  

Setting intel_idle.max_cstate=2 without the c6off/c7on script yields less than 30 minutes of run time before freezing, often within a few minutes.  

I'm not sure that it's permissible to control power saving on a per core basis. What I did may just be another way to set ..max_cstate=1.
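
The cpuidle sysfs tree does expose a separate disable file for every state of every CPU, so a per-core setting like the one described here can at least be written. A minimal sketch of that, where set_cstate_disable and its arguments are names invented for this example:

```shell
#!/bin/sh
# Set the "disable" flag of one named idle state on one CPU.
# Arguments: CPU number, state name, 0 or 1, optional sysfs root
# (the last one is a hypothetical override for trying it on a copy).
set_cstate_disable() {
    cpu="$1"; name="$2"; value="$3"
    root="${4:-/sys/devices/system/cpu}"
    for state in "$root/cpu$cpu"/cpuidle/state[0-9]*; do
        [ -r "$state/name" ] || continue
        if [ "$(cat "$state/name")" = "$name" ]; then
            echo "$value" > "$state/disable"
        fi
    done
}

# Example (as root): set_cstate_disable 0 C7S-BYT 1
```

Whether restricting a single core is enough to avoid the freeze is exactly the open question in this comment; the sketch only shows the mechanism.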

FWIW - I have had one or two identical freezes in Windows, but this is quite rare in comparison to linux.

Comment 494 Hal 2016-09-15 01:36:07 UTC

(Follow up to own message #477)

12 days into using c6off+c7on I decided to go back to the intel_idle.max_cstate=1 workaround.

The reason is that although I did not experience any freeze or crash on my host OS, I started to see some very awkward SSD access problems. Swapping the SSD drive with a new one did not alleviate the problem, but going back to max_cstate=1 definitely eliminated it.

The awkwardness of the SSD problem was that it would tie up data retrieval from the SSD for many tens of seconds but the host OS wouldn't fail. The mouse, internet access etc. would all work. Many drive access retry messages would pop up but without causing a system crash.

On the other hand, as I ran Virtualbox and several virtual machines, those would partially freeze. For instance their GUIs would not respond to keyboard or mouse actions but I could still SSH into them from the host computer or a remote computer. Eventually I would get serious data corruption in the guest machines.

The problem with the guest machines didn't happen very often but happened on different virtual machines running different GNU flavors and different Linux kernels.

Hal

Comment 495 Paul Nijenhuis 2016-09-16 06:56:59 UTC

(In reply to Paul Nijenhuis from comment #490)
> can you please point out how to apply these patches?
> Thanx in advance

I found out how to apply the patches and i'm building a 4.7.3 kernel on OpenSuse Tumbleweed...
I'll post the results later

Comment 496 Konstantin Koslowski 2016-09-16 10:27:26 UTC

(In reply to Paul Nijenhuis from comment #495)
> I found out how to apply the patches and i'm building a 4.7.3 kernel on
> OpenSuse Tumbleweed...
> I'll post the results later

using an ASROCK Q2900 with a J2900 cpu. Until now i was running an old 3.14-lts kernel because all newer ones froze after some time.

tried a custom 4.7.2 kernel on archlinux with the 4 patches mentioned above, system still froze when idling for around 30 hours, back to 3.14.

see the PKGBUILD here in case anybody wants to try: https://dl.dropboxusercontent.com/u/9188780/linux-baytrail.zip

Comment 497 Paul Nijenhuis 2016-09-18 19:03:42 UTC

(In reply to Paul Nijenhuis from comment #495)
> I found out how to apply the patches and i'm building a 4.7.3 kernel on
> OpenSuse Tumbleweed...
> I'll post the results later

Unfortunately, it froze after 1.5 days.... :-( Back to C1 only in the BIOS.

Comment 498 w2q 2016-09-19 21:02:51 UTC

The script of Wolfgang Reimer seems to be a good workaround so far. A way to install this permanently is described here:

https://forum.manjaro.org/t/intel-baytrail-freezes-the-linux-kernel/1931/10

Works for manjaro and ubuntu

Comment 499 w2q 2016-09-19 21:05:29 UTC

Additional thoughts:



Wolfgang had the idea to write a test routine to verify whether erratum VLP52 was the root cause for this bug.

I found an erratum of another CPU (Z670),

http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/atom-z6xx-specification-update.pdf

that has the same description (its number here is BN38, page 25):
"EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine."

Here a workaround is given by Intel!
"Software should check the ISR register and enter C1 only if any interrupt is in service."

Perhaps this is helpful to find an even more effective method to avoid this error without blocking C6 generally. There even might be already a fix for the Z6xx-cpu in the kernel.

Comment 500 Konstantin Koslowski 2016-09-19 21:22:43 UTC

Created attachment 239231 [details]
attachment-21109-0.html

Thank you!

Spark by Readdle (https://itunes.apple.com/app/id997102246?mt=8&uo=4&at=10l6UJ&ct=Spark_QR_viral)


Comment 501 Michal Feix 2016-09-20 13:28:39 UTC

I also came across a patch that was created for SUSE and that seems to be addressing the mentioned erratum in pre-4.x kernels:

https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7

Comment 502 Travis Hall 2016-09-20 20:01:56 UTC

(In reply to Michal Feix from comment #501)
> I also came across a patch that was created for SUSE and that seems to be
> adressing mentioned erratum in pre 4.X kernels:
> 
> https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.
> patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7

Wow, if this works, that will be absolutely fantastic.  I'll be compiling with this patch as soon as I get home.  Now to get it merged into mainline...

Comment 503 Andrew Clayton 2016-09-20 20:39:00 UTC

(In reply to Travis Hall from comment #502)
> (In reply to Michal Feix from comment #501)
> > I also came across a patch that was created for SUSE and that seems to be
> > adressing mentioned erratum in pre 4.X kernels:
> > 
> > https://build.opensuse.org/package/view_file?file=22160-Intel-C6-EOI.
> > patch&package=xen&project=home%3Acharlesa%3AopenSUSE11.3&rev=7
> 
> Wow, if this works, that will be absolutely fantastic.  I'll be compiling
> with this patch as soon as I get home.  Now to get it merged into mainline...

You might want to check your CPU model number. If I'm reading that patch right, it won't for example have any effect on my J1900 CPU with a model number of 55 (0x37) (assuming boot_cpu_data.x86_model is what is displayed in /proc/cpuinfo as "Model").
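
The model number can be read from /proc/cpuinfo and printed in both notations for comparison with the value in the patch. A short sketch, with cpu_model as a hypothetical helper name:

```shell
#!/bin/sh
# Print the CPU model of the first processor in decimal and hex.
cpu_model() {
    file="${1:-/proc/cpuinfo}"
    # Match the "model" line but not "model name".
    awk -F':' '/^model[ \t]*:/ { printf "%d 0x%x\n", $2, $2; exit }' "$file"
}

cpu_model
```

For the J1900 described here this would be expected to print "55 0x37".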

Comment 504 Dmitry 2016-09-21 19:13:07 UTC

Strange, the latest git kernel works without the cmdline parameter or any scripts. The system has been running for 3 days, with reboots, without freezes.
I recompiled the kernel with PREEMPT_VOLUNTARY, NO_HZ, RCU_FAST_NO_HZ and IRQ_TIME_ACCOUNTING. In the cmdline I have: tsc=reliable clocksource=tsc pcie_aspm=force nmi_watchdog=0.

cpu0 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      345670     157
         1  C1-BYT          0        1          1    68407851  147097
         2  C6N-BYT         0      300        275    48677058   52359
         3  C6S-BYT         0      500        560   680803055  270719
         4  C7-BYT          0     1200       4000  1337235518  180699
         5  C7S-BYT         0    10000      20000   771972999   31738
cpu1 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      344769     200
         1  C1-BYT          0        1          1   365963607  701346
         2  C6N-BYT         0      300        275    88538699   99895
         3  C6S-BYT         0      500        560  1131180391  481825
         4  C7-BYT          0     1200       4000  1097908939  191670
         5  C7S-BYT         0    10000      20000   189777207   20370
cpu2 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      223966     125
         1  C1-BYT          0        1          1    82220646  205674
         2  C6N-BYT         0      300        275    81845726   99118
         3  C6S-BYT         0      500        560  1150791746  496208
         4  C7-BYT          0     1200       4000  1226103249  198451
         5  C7S-BYT         0    10000      20000   313530989   24010
cpu3 State  Name     Disabled  Latency  Residency        Time   Usage
         0  POLL            0        0          0      146758     132
         1  C1-BYT          0        1          1    68183419  163110
         2  C6N-BYT         0      300        275    56932066   64232
         3  C6S-BYT         0      500        560   846271647  338258
         4  C7-BYT          0     1200       4000  1344663248  198891
         5  C7S-BYT         0    10000      20000   564638850   27960
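
The comment does not say which tool produced the table above. As a rough sketch, similar per-state counters can be read from the cpuidle sysfs tree; the helper name show_cstates and its optional root argument are assumptions made for illustration, and the column layout is only approximate.

```shell
#!/bin/sh
# Print per-CPU idle-state statistics from the cpuidle sysfs tree.
# show_cstates is a hypothetical helper; root defaults to the usual sysfs location.
show_cstates() {
    root="${1:-/sys/devices/system/cpu}"
    for cpu in "$root"/cpu[0-9]*; do
        # Skip CPUs without a cpuidle directory.
        [ -d "$cpu/cpuidle" ] || continue
        printf '%s State  Name  Disabled  Latency  Residency  Time  Usage\n' "${cpu##*/}"
        for st in "$cpu"/cpuidle/state[0-9]*; do
            [ -d "$st" ] || continue
            printf '  %s  %s  %s  %s  %s  %s  %s\n' \
                "${st##*state}" "$(cat "$st/name")" "$(cat "$st/disable")" \
                "$(cat "$st/latency")" "$(cat "$st/residency")" \
                "$(cat "$st/time")" "$(cat "$st/usage")"
        done
    done
}
```

Running show_cstates without arguments reads the live values; a state that is actually being entered shows growing Time and Usage counters.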

Comment 505 Dmitry 2016-09-26 20:34:02 UTC

Still no freezes. Could somebody please try kernel 4.8-rc8? Otherwise this could be another workaround that is unrelated to max_cstate.

Comment 506 Travis Hall 2016-09-28 16:38:32 UTC

(In reply to Dmitry from comment #505)
> Still no freezes. Please, try somebody kernel 4.8-rc8 or it could be another
> workaround which doesn't relate to max_cstate.

I'm still seeing the freezes on 4.8-rc8. I ran a YouTube playlist overnight and woke up to a freeze.

Comment 507 Daniel Bilik 2016-10-05 10:33:45 UTC

(In reply to Dmitry from comment #505)
> Please, try somebody kernel 4.8-rc8 or it could be another
> workaround which doesn't relate to max_cstate.

For all of you who still hope that this could (and will) be fixed, let me direct your attention to commit a7b4667+ (drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=a7b4667a00025ac28300737c868bd4818b6d8c4d

I guess that specifically this commit has stabilized i915 driver behaviour on less powerful CPUs (ie. our Baytrail Atoms and Celerons), so that some people have found their systems to run stable with Linux 4.8 (the commit was merged to 4.8-rc1).

I've applied this one-liner to i915 driver from Linux 4.4 (vanilla, no other "stabilization" patches), and got similar experience as Dmitry, ie. desktop system running on J1900 with no C-states limiting, used almost daily several hours per session, with just regular shutdowns, and working stable for weeks now.

Though it may not solve stability issues for everyone completely, the commit does seem to hit the right nail.

Comment 508 jjmeijer88 2016-10-05 11:20:51 UTC

With this commit (included in kernel 4.4.20) my Bay Trail tablet can finally run stable without limiting c-states.

a3043e mmc: sdhci-acpi: Reduce Baytrail eMMC/SD/SDIO hangs
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=a3043ecef71f5b880fe1b1d2aa77b3a896b86a0c

Comment 509 ladiko 2016-10-05 15:54:03 UTC

Which (newer) kernel versions include the patch? Ubuntu 16.04 uses 4.4.0, so it's not included. Do we have to wait for 18.04 or an LTS/HWE kernel? I am not sure I want to go for the mainstream kernel, so I'll stay with the script which disables C6, at least for the moment.

Comment 510 ladiko 2016-10-05 15:55:19 UTC

By the way - the patch is named "Reduce Baytrail eMMC/SD/SDIO hangs" - is this MMC/SD patch really related to CPU/GPU hangs?

Comment 511 Koen Roggemans 2016-10-05 16:19:55 UTC

Created attachment 240861 [details]
attachment-3924-0.html

I searched the Ubuntu kernels for "drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB" and found, in the full text of the fix at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1615620, that this fix is applied in 4.4.0-38.57, which is the currently released kernel version for a standard 16.04 installation.


Comment 512 jjmeijer88 2016-10-05 16:26:51 UTC

(In reply to ladiko from comment #509)
> In which (newer) Kernel versions the patch is included? Ubuntu 16.04 is
> using 4.4.0, so it's not included. Do we have to wait for 18.04 or an
> LTS/HWE kernel? I am not sure if i want to go for the mainstream kernel and
> stay with the script which disables C6 - at least for the moment.

It's included in the longterm vanilla kernel 4.4.20 and up. You can install it manually or via the package manager I guess.

 http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23/

The i915 patch mentioned by Daniel is also in there.


(In reply to ladiko from comment #510)
> By the way - the patch is named "Reduce Baytrail eMMC/SD/SDIO hangs" - is
> this MMC/SD patch really related to CPU/GPU hangs?

It's related to the subject of this thread, not directly to a hanging CPU/GPU. In my case I had a hanging MMC bus related to c-states. No GPU issues for me though :).

Comment 513 Brave Hurts 2016-10-11 17:40:16 UTC

> It's included in the longterm vanilla kernel 4.4.20 and up. You can install
> it manually or via the package manager I guess.
> 
>  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.23/
> 

Installed 4.4.24. This didn't fix the freezing on an N3700 running Ubuntu 14.04.

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.24/

Comment 514 Sebastian Heyn 2016-10-11 18:17:13 UTC

So this is an SD/eMMC-only bug? No issue on the N3510/J1900 motherboards with a SATA HDD?

Is this only included in the >4.4.20 kernel line or also in later ones? (4.7 etc)

Comment 515 vad1m 2016-10-11 19:19:53 UTC

Finally, I don't have any freezes after installing the 4.8.0-997-generic kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/current/
Desktop motherboard with J1900 CPU and SATA HDD.
All C-states, including C7, seem to work well (I haven't checked power consumption, but there were no freezes at all during about one week of continuous tests).

Comment 516 Libor Chmelik 2016-10-11 21:07:08 UTC

Indeed. I haven't experienced any freezes for nearly a week now. But I installed the normal kernel 4.8.0-040800-generic from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
I removed the intel_idle.max_cstate=1 from the grub config. And so far no problems.
Laptop Acer Aspire E15 E5-511-P7AT with Pentium N3540 up to 2,66GHz.
The only thing I noticed during the kernel upgrade was a "nagging" about missing intel-drm-i915 firmware.
HD Playback on Youtube or in VLC works flawlessly. Also no troubles with Steam.

Comment 517 RussianNeuroMancer 2016-10-11 21:43:16 UTC

Is there separate bugreport about SD/EMMC issue?

Comment 518 Martin 2016-10-12 06:46:38 UTC

I patched my current 4.5.4 with the mentioned patch but didn't have any success. So I suspect (hope) there's more to it than only those two lines. I will try to upgrade to 4.8 soon to see if that helps.

Comment 519 julio.borreguero@gmail.com 2016-10-12 07:11:58 UTC

After some months I tried the latest kernel, 4.8.1, from kernel.org.
It is still freezing.

my system:
https://bugzilla.kernel.org/attachment.cgi?id=198961

Comment 520 Sebastian Heyn 2016-10-13 08:56:07 UTC

I just changed from a non-crashing N3510 to a J1900 and I had a freeze after 5 minutes. I disabled the C-states in the BIOS, but an energy-saving solution looks different.

Comment 521 Brave Hurts 2016-10-13 16:24:25 UTC

(In reply to vad1m from comment #515)
> finally, I don't have any freezes after installing 4.8.0-997-generic kernel
> from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-next/current/
> Desktop motherboard with J1900 CPU and SATA HDD.

Same for me on my N3700. I installed this one two days ago. Before that, it usually froze a few minutes after starting Firefox.

Comment 522 Todd Fulton 2016-10-13 17:19:04 UTC

I've been plagued by this bug ever since I got my laptop, like 1 1/2 - 2 years ago. Just saying.

My PC is an Acer E5-511p
The CPU: Intel(R) Pentium(R) CPU  N3530  @ 2.16GHz

That being said, I went months without a hangup/freeze using Ubuntu 16.04 LTS with various kernels from 4.4.0-X-generic, using SELinux, xscreensaver (without gnome-screensaver installed), and intel_idle.max_cstate=1. I only ever rebooted when a new kernel came out. (I never tried without max_cstate; I should have.)

I got tired of trying to get SELinux working on Ubuntu and decided to go back to apparmor and gnome-screensaver, as well as upgrading to 4.4.0-42-generic (previous was 4.4.0-38-generic). Since using apparmor as LSM, gnome-screensaver, and 4.4.0-42 (yesterday), I get freezes with the periodic spinning fans again even with intel_idle.max_cstate=1, but it seems when I am not using the pc for 20+ minutes.

I switched to max_cstate=0 and am using this now:

$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.4.0-42-generic.efi.signed root=UUID=cf4dc10b-511a-4369-ad5c-637833244929 ro apparmor=1 security=apparmor intel_idle.max_cstate=0

Going forward, I will switch back the things I have changed one by one and see if that stops the crashes. I suspect that xscreensaver was keeping the CPU from going idle, and that this was what prevented the hangups/freezes. Maybe no new info there.

I've had UEFI enabled and use the signed kernels, not sure if that matters as I have this issue even in BIOS mode.

Hope that helps.

Comment 523 Javier Antonio Nisa Avila 2016-10-13 17:37:09 UTC

Created attachment 241641 [details]
attachment-14281-0.html

Do you know whether the new Ubuntu version fixes the bug?


Comment 524 kernelbugtracker 2016-10-13 17:40:24 UTC

I haven't experienced any freezes since installing v4.8 from http://kernel.ubuntu.com/~kernel-ppa/mainline/ (generic version) on a Lenovo Yoga 300 (Intel Celeron N2940).

Comment 525 Todd Fulton 2016-10-13 18:06:51 UTC

(In reply to Javier Antonio Nisa Avila from comment #523)
> Created attachment 241641 [details]
> attachment-14281-0.html
> 
> You know if with the New Ubuntu versión solve the bug?
> 
> El 13 oct. 2016 7:19 p. m., <bugzilla-daemon@bugzilla.kernel.org> escribió:
> 

I'll try 16.10 out as well, no problem. I see it's running 4.8, thanks for the heads up on the release ;).

Comment 526 bms 2016-10-15 06:09:25 UTC

All,

  I have experienced the hard lockups on a NDis b324 using kernel version 3.13 (all minor variants from Ubuntu).  The time scale is on the order of several hours to several weeks; the result is always a hard lockup.  With 4.8.1 the hard lockup occurs after around 5 minutes.  With the cstate restriction I have yet to see it crash (in 48 hours of testing with 4.8.1 and the cstate restriction).

  I hate to say it, but I don't think this bug is going to get fixed, and that the workaround is the fix!

  What it will take is engineering time from Intel with fully-instrumented dev boards to analyse this, and the wherewithal to do the root cause analysis.

  We can merely speculate from the sidelines, and so I will speculate that this bug affects all operating systems, and because of code timing variations some OSes get lucky and others do not.  There may be no other fix than to disable the cstate management and suck the power loss.

-bms

Comment 527 mhartzel 2016-10-15 13:51:22 UTC

I suspect there are different bugs in different Intel chipsets / processors, since some fixes work for some people but not all. It is also possible that some of these hardware bugs might be impossible to fix in software. It has happened before in both Intel and AMD hardware. However, I have a stable system now and I wanted to share my findings to maybe help some other users with the same hardware.

My processor is: Intel(R) Celeron(R) CPU N2930 @ 1.83GHz (Baytrail). The machine is Acer_Extensa_EX2508-C66M Laptop.

I use Gentoo on this system so I always build my own kernel. I had freezes right from the beginning (kernel 4.1), but found a workaround that worked for this system. I had a rock solid system with kernels 4.1 - 4.4 when I:

- chose "Intel P State Control" to be built into the kernel
- chose "Default CPU Frequency Governor" to be "Performance"
- booted the system with kernel option: intel_idle.max_cstate=0

This resulted in a constant processor frequency and an idle processor temperature of 48 - 50 degrees Celsius. I use this laptop for recording multitrack audio, so the stable processor frequency was a bonus. Heavy audio processing is more reliable when the processor speed is constant, because a stable frequency leads to predictable latencies, and multitrack audio software likes that. I don't much care about power consumption since my laptop is always plugged in, so I don't know what effect this might have had on battery life.

When I upgraded to kernel 4.7 this changed and I began to get the freezes again. I also noticed that my processor speed had begun to fluctuate even though I used the same kernel options as with kernels 4.1 - 4.4. It seems that the constant processor speed with my settings in kernels 4.1 - 4.4 was a bug and Intel had "fixed" this for 4.7. I have now found new settings that work for me with kernel 4.7.

I did the following:

- chose "Intel P State Control" to be built into the kernel
- chose "Default CPU Frequency Governor" to be "Performance"
- booted the system with kernel option: intel_idle.max_cstate=0
- disabled turbo boost with: echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
- disabled cpuidle state 3 in all processor cores (example for core 0): echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable

The last option leaves the other idle states (0, 1 and 2) on and disables only state 3. These settings result in behaviour similar to that with previous kernels, meaning a stable processor frequency (1.8 GHz) and an idle core temperature of about 48 - 50 degrees Celsius.
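
The two echo commands above can be applied to every core in one go. This is only a sketch, not the script linked in this comment: the helper name disable_idle_state and its optional root argument are assumptions for illustration, and on a real system it has to run as root.

```shell
#!/bin/sh
# Disable one cpuidle state on every core and switch turbo boost off.
# disable_idle_state is a hypothetical helper; root defaults to the usual sysfs location.
disable_idle_state() {
    state="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpuidle/state"$state"/disable; do
        # The glob stays unexpanded if no core exposes this state, so check first.
        [ -e "$f" ] && echo 1 > "$f"
    done
    # no_turbo only exists when the intel_pstate driver is active.
    if [ -e "$root/intel_pstate/no_turbo" ]; then
        echo 1 > "$root/intel_pstate/no_turbo"
    fi
}
```

For the settings described here the call would be disable_idle_state 3. The values do not survive a reboot, so the call has to be repeated at every boot.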

I've used kernel 4.7 now for 8 days with no freezes, if one occurs I will disable power saving state 2 for all processor cores and so on until I have a stable system.

I have sometimes had turbo boost enabled and have not had any freezes so when it becomes evident in a couple of weeks that these settings really do work, I will enable turbo boost again and see if that has any effect on stability.

I use a script to disable state 3 on all processor cores and to disable turbo boost. It is based on another script talked about on this forum. You can download the script here:

https://dl.dropboxusercontent.com/u/2071830/disable_intel_processor_pstates.sh

or here:

http://pastebin.com/egTKmkwX

Comment 528 Sebastian Heyn 2016-10-15 23:03:30 UTC

Thanks for the advice. I applied your settings on Gentoo (vanilla-sources-4.8.1).
However, I additionally disabled SpeedStep in the BIOS.

It seems that my N3150 is much more stable than my J1900. I had a freeze within minutes on the J1900, even though it runs headless, with no X. The case has a 12 cm fan next to the CPU, and it never goes higher than 30°C.

However! 

cpu MHz         : 479.980


cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors 
performance powersave

cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor 
performance

cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq 
479980

cat /sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq 
1600000


Weird. With the performance governor, the CPU should never clock down.


Is there a microcode update for these CPUs? Does it make a difference?

Comment 529 Sebastian Heyn 2016-10-15 23:24:43 UTC

Adding "intel_pstate=disable" seems to disable any frequency variation. The CPU now sits at 1.6 GHz.

Comment 530 Jochen Hein 2016-10-16 05:38:29 UTC

Created attachment 241811 [details]
Patch to disable c-states at boot

Comment 531 Jochen Hein 2016-10-16 05:44:40 UTC

The following errata might be relevant:
VLP52: EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAU36: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAN42: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
BN38: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAK76: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
BA106: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine
AAJ72: EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine

Comment 532 bms 2016-10-16 06:05:44 UTC

As an experiment I've set up a google spreadsheet in the hopes you will enter details about your system(s), configurations you have tried, and the length of time that your test ran prior to failure.

The goal is simply to be able to mine the data.

The spreadsheet is (or rather should be) fully editable, so please don't abuse it; I think we all want this resolved.

https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit?usp=sharing

Here are some suggestions on how you should fill in the entries:

Column A:
Did your system end up in a locked up state?

Column B:
How long did your test run for. For example, if your test ended in a lock up, was it several hours or just a few minutes. If you answered yes for column A, then enter the amount of time your computer ran prior to you rebooting or powering down the system.

Column C:
The name / make of your machine. We want to know who made the motherboard.

Column D:
The model name of your CPU.

Column E:
The code name for your CPU. Naturally non-baytrail cpus that show similar failures will be interesting information to know.

Column F, G, H:
Enter the details from the result of "cat /proc/cpuinfo".

Column I, J:
use dmidecode to obtain the bios information, enter the version and vendor.

Column K:
The linux kernel version: use "uname -a"

Column L:
Did you modify the kernel boot parameters; if so, record them.

Column M:
Additional notes: What other configurations did you do? Did you use the c6 off script, etc...

Do add columns if you think the data relevant.
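
Most of these details can be gathered with one small script. This is a sketch under assumptions: the helper name collect_report is made up for illustration, and since the comment does not say which /proc/cpuinfo fields columns F, G and H expect, printing cpu family, model, model name and stepping is a guess. dmidecode normally needs root; if it is not installed, the BIOS lines are skipped.

```shell
#!/bin/sh
# Gather the system details the spreadsheet asks for.
# collect_report is a hypothetical helper; the cpuinfo path can be overridden.
collect_report() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Identification lines of the first CPU only (field choice is an assumption).
    awk -F': *' '/^(model name|cpu family|model|stepping)[ \t]*:/ {
        sub(/[ \t]+$/, "", $1); print $1 ": " $2
        if (++n == 4) exit
    }' "$cpuinfo"
    # BIOS vendor and version, when dmidecode is available.
    if command -v dmidecode >/dev/null 2>&1; then
        echo "bios vendor: $(dmidecode -s bios-vendor 2>/dev/null)"
        echo "bios version: $(dmidecode -s bios-version 2>/dev/null)"
    fi
    echo "kernel: $(uname -a)"
    # Boot parameters as actually passed to the running kernel.
    [ -r /proc/cmdline ] && echo "cmdline: $(cat /proc/cmdline)"
    return 0
}
```

Running collect_report as root and pasting the output into the notes column keeps entries comparable between reporters.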

... just trying to get to the bottom of this...

Comment 533 mhartzel 2016-10-16 08:12:42 UTC

Kernel 4.8 seems to have some bugs in cpufreq. Intel has recently added a fix for these for kernel 4.9, so I will skip kernel 4.8 completely and use 4.9 when it comes out.

Here is the message telling about the Intel regression fixes for 4.9:

https://lkml.org/lkml/2016/10/14/288

Here is the Phoronix article mentioning it:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.9-Atom-P-State-Algo

Comment 534 mhartzel 2016-10-16 12:55:59 UTC

BMS: Great idea :) There are a couple of other things to consider. On newer kernels (4.7)  you have the option of controlling processor performance either by ACPI P-State driver or Intel P-State driver. I had lockups when using the ACPI driver, Intel version works for me with no lockups.

Also it might be important to know which governor (powersave, ondemand, performance) people use, since it deeply affects how the processor uses power saving states.

I also added a column called "Reporter"; I hope this is all right. It helps when some additional information needs to be requested from the reporter.

Comment 535 ekemate93 2016-10-17 18:05:32 UTC

I have updated my home server to Ubuntu 16.10 which contains 4.8.0-22-generic, now I have 53 hours uptime.

Before that I used Ubuntu 16.04.1 LTS with 4.4 and I had to disable C-states in the bios to get a stable system. 

I hope this kernel solved the bug, I will keep the spreadsheet updated with my longtime results.

My system is ASRock Q1900-ITX with a J1900 CPU.

Comment 536 youling257 2016-10-17 22:42:28 UTC

(In reply to jbMacAZ from comment #191)
> I have a N3540 system that freezes at most a couple times a month without
> any arguments, kernel version doesn't seem to matter.  .max_cstate {0,1}
> stabilized it.  Looking at the recent posts, the N-series appears to be the
> processor benefiting most from the new suggestions.  But the more smoke that
> gets cleared, the sooner the rest of the problems can be found.
> 
> On my Z3775 system (T100CHI), kernel 4.5.0 without arguments didn't last 2
> minutes before freezing.  With idle=nomwait and it ran 2 hours before the
> time display froze (frozen seconds), the mouse cursor still moved.  Keyboard
> keys or mouse clicks were accepted about once every 90 seconds.
> 
> Next, maxcpus=2 and idle=nomwait produced a block of "serial8250: too much
> work for irq191" errors in dmesg.  Raising maxcpus to 3 got rid of them. 
> maxcpus= {2,3} yielded no obvious degradation when just browsing, etc, so
> I'll leave this running...  tsc may be destabilizing for some systems like
> mine.

I compiled the 4.9-rc1 kernel, and dmesg shows: serial8250: too much work for irq191

4.8 does not have this problem.

Comment 537 Sudhanshu 2016-10-18 09:05:17 UTC

I have been suffering from the same issue, but on a Broadwell system (Dell XPS 9343, i5-5200). Restricting max_cstate to 1 helps. c6off+c7on (after modifying it to work on BDW instead of BYT) does not. It works only when I disable all C-states except C1 and C1E (which is equivalent to max_cstate=1).

Though I have been following this thread since long, I never posted. I have lately been wondering if there are any other Broadwell users facing the same, and if there is a separate bug for them. I mean, though the symptoms are exactly same, I am not 100% sure if the bug is the same.

Also, as most other users here, I have no logs anywhere - syslog/kern.log - which would help raising a separate bug request.

Summarising,
Are there any broadwell co-sufferers here?
Am I safe to assume this is the same bug as mine?

Comment 538 Dmitry 2016-10-18 13:24:51 UTC

I've also added my machine into google spreadsheet. 

"serial8250: too much work for irq191" - also see this when I try to turn bluetooth on. I've never managed to get it working though.

Comment 539 Wolfgang M. Reimer 2016-10-18 15:36:34 UTC

(In reply to Jochen Hein from comment #531)
> There might be the following errata affected:
> VLP52 EOI Transactions May Not be Sent if Software Enters Core C6
...
> AAJ72. EOI Transaction May Not be Sent if Software Enters Core C6 Duringan
> Interrupt Service Routine

Thanks Jochen, I started to dig and found out, that a lot of Intel processors suffer from erratum:

EOI Transaction May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine

Here is the list (with links to the docs) I found so far: 

 [1] AAJ72: Intel Core i7-900 Desktop Processor Extreme Edition Series and Intel Core i7-900 Desktop Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-spec-update.pdf
 [2] AAK76: Intel Xeon Processor 5500 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-5500-specification-update.pdf
 [3] AAM73: Intel Xeon Processor 3500 Series
     http://www.intel.com/Assets/en_US/PDF/specupdate/321333.pdf
 [4] AAN42: Intel Core i7-800 and i5-700 Desktop Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-800-i5-700-spec-update.pdf
 [5] AAO42: Intel Xeon Processor 3400 Series 
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3400-specification-update.pdf
 [6] AAP41: Intel Core i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and i7-700 Mobile Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-mobile-ee-and-mobile-processor-series-spec-update.pdf
 [7] AAT32: Intel Core i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-mobile-spec-update.pdf
 [8] AAU36: Intel Core i5-600, i3-500 Desktop Processor Series and Intel Pentium Desktop Processor 6000 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i5-600-i3-500-pentium-6000-spec-update.pdf
 [9] AAY38: Intel Xeon Processor 3600 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[10] BA106: Intel Xeon Processor 7500 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-processor-7500-series-specification-update.pdf
[11] BB38: Intel Atom Processor Z6xx Series 
     http://www.intel.com/content/dam/doc/specification-update/atom-z6xx-specification-update.pdf
[12] BC38: Intel Core i7-900 Desktop Processor Extreme Edition Series and Intel Core i7-900 Desktop Processor Series on 32-nm Process
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/core-i7-900-ee-and-desktop-processor-series-32nm-spec-update.pdf
[13] BD40: Intel Xeon Processor 5600 Series Specification Update
     http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
[14] BF41: Intel Xeon Processor C5500/C3500 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[15] BG31: Intel Pentium P6000 and U5000 Mobile Processor Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-p6000-u5000-mobile-specification-update.pdf
[16] BI46: Intel Atom Processor E6xx Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e6xx-spec-update.pdf
[17] BN38: Intel Atom Processor Z600 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z6xx-specification-update.pdf
[18] BP37: Intel Xeon Processor E7-8800/4800/2800 Product Families
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e7-8800-4800-2800-families-specification-update.pdf
[19] CC5: Intel Atom Processor Z2760
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-z2760-spec-update.pdf
[20] VLI55: Intel Atom Processor E3800
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-e3800-family-spec-update.pdf
[21] VLP52: Intel Celeron and Pentium Processor N- and J-Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
[22] VLT56: Intel Atom Processor Z3600 and Z3700 Series
     http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-Z36xxx-Z37xxx-spec-update.pdf

The erratum is first mentioned in November 2008 [1] and a first patch for it (only for AAJ72-plagued processors reported in [1]) has been added by the Xen developers in September 2010:

https://lists.xen.org/archives/html/xen-devel/2010-09/msg00894.html

Comment 540 AB 2016-10-18 18:17:36 UTC

The image on the screen freezes and the USB ports do not work, ATX power button not responding. Right?

I have hit a bug like this many times on an i5-4460 (Gigabyte P87 motherboard with an NVIDIA video card) after updating from kernel 3.13 (Ubuntu 14.04) to kernel 4.4 (Ubuntu 16.04). It appears more often in hot weather, when a USB Wi-Fi adapter is attached, and when the computer is idle. It is rarer when some programs are active and when the room is cold. I thought the cause was micro-cracks in the motherboard, but now that I have seen this ticket I will delay shopping for a new motherboard :)

I also saw a bug like this on an ASUS R556 with an i5-5200U and NVIDIA graphics on 4.4. The ASUS has now been updated to Ubuntu 16.10 and I am testing it with kernel 4.8.

intel_idle.max_cstate=1 did not help in either case on kernel 4.4.

Comment 541 Sudhanshu 2016-10-20 03:28:10 UTC

Also, on Broadwell, enabling *any* C-state beyond C1E causes the lockup. For Baytrail, as some users have pointed out, just c7 off and others enabled works.

Comment 542 HansPeterIngo 2016-10-20 12:10:57 UTC

I cannot confirm that Ubuntu 16.10 fixes the bug. The freezes still remain with Kernel 4.8.0-22.

Comment 543 FL 2016-10-20 19:39:40 UTC

Freezes after one hour of VLC...
OS: Arch Linux
Kernel: x86_64 Linux 4.8.2-1-ARCH
CPU: Intel Pentium CPU J2900 @ 2.4157GHz
GPU: Mesa DRI Intel(R) Bay Trail

Comment 544 Sebastian Heyn 2016-10-21 14:33:17 UTC

Yes I read that 4.8 has a faulty p-state implementation that should be fixed with 4.9.

Comment 545 Libor Chmelik 2016-10-21 17:55:49 UTC

(In reply to Libor Chmelik from comment #516)
> Indeed. I didn't experience any freezes so far since nearly a week now. But
> i installed the normal kernel 4.8.0-040800-generic from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8/
> I removed the intel_idle.max_cstate=1 from the grub config. And so far no
> problems.
> Laptop Acer Aspire E15 E5-511-P7AT with Pentium N3540 up to 2,66GHz.
> The only thing I noticed during the kernel upgrade was a "nagging" about
> missing intel-drm-i915 firmware.
> HD Playback on Youtube or in VLC works flawlessly. Also no troubles with
> Steam.

I spoke too soon. It froze after all.
Situation 1: YouTube in HD AND CPUs in forced performance mode (turbo boost ??)
Situation 2: HD playback in VLC in automatic powersave mode.
But it took 12 days until the first freeze, and the second came 3 days later.

Trying kernel 4.8.2 from Ubuntu mainline now. c-state still disabled in grub.conf

Comment 546 Justin 2016-10-22 21:58:34 UTC

One week so far no crashes.  4.8.0-rc8-amd64

Options

GRUB_CMDLINE_LINUX_DEFAULT=intel_idle.max_cstate=5

In rc.local this script is run at boot...

 ----- 

#!/bin/bash
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo


thanks

Comment 547 bzq7xy5gpj 2016-10-24 12:53:03 UTC

Let's see whether the Goldmont successor architecture CPUs 
https://en.wikipedia.org/wiki/Goldmont_(microarchitecture)#List_of_Goldmont_processors
are affected. The first motherboards/NUCs are already/soon available, like the ASRock J4205-ITX. Would be nice if someone could report on that. BTW: My experience has been that a Celeron-NUC is extremely slow for desktop usage and my next NUC is going to be a Pentium-NUC (if of course not affected, otherwise maybe something from AMD).

Comment 548 Sebastian Heyn 2016-10-24 20:18:23 UTC

Why wait for another product to spend money on? I have two boards that actually cost money and are not doing what they should. Before doing that I'll switch to ARM CPUs out of frustration.

My J1900 reset twice today while doing nothing (headless, no X, 4.4 kernel). I deactivated C-states in the BIOS. So far it seems stable.

I will give 4.9_rc1 a try in the morning and activate all c-states. I bought those boards to save me some power, not sit around doing nothing ;-)

Comment 549 Sebastian Heyn 2016-10-25 07:38:55 UTC

Has anyone tried this kernel yet?  

https://aur.archlinux.org/packages/linux-baytrail48/

Comment 550 thorsten 2016-10-29 17:03:11 UTC

I have a N2940 under Gentoo and keep running into the same bug.
i already tried:
4.7.5-gentoo
4.8.2-gentoo
and git-sources too:
4.9-rc1

still getting random freezes (depending on workload every 15 min to 2 hours).

Comment 551 mhartzel 2016-10-30 17:30:21 UTC

thorsten: Try the commands below, and report back. These eliminate hang ups on my N2930 with kernel 4.7 (Gentoo).

First start kernel with: intel_idle.max_cstate=0

Then give these commands as root:

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state3/disable
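
The per-CPU commands above could also be written as a loop, so that the number of cores does not have to be hard-coded. This is only a minimal sketch and was not posted in the thread: the function name `disable_idle_state` and its optional root-directory argument are made up for illustration, and state index 3 is simply taken from the commands above. Which C-state a given index maps to can differ between machines and kernels.

```shell
#!/bin/sh
# Sketch: disable one cpuidle state index on every CPU that exposes it.
# disable_idle_state is a hypothetical helper, not part of any tool.
# $1 = state index (e.g. 3), $2 = CPU sysfs root (defaults to the real one).
disable_idle_state() {
	idx=$1
	root=${2:-/sys/devices/system/cpu}
	for f in "$root"/cpu[0-9]*/cpuidle/state"$idx"/disable; do
		# Skip CPUs where the glob matched nothing or the state is missing.
		[ -e "$f" ] || continue
		echo 1 > "$f" || return 1
	done
}
```

Run as root, `disable_idle_state 3` would then have the same effect as the four state3 lines above on a four-core machine.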

Comment 552 Elmar Melcher 2016-11-04 16:56:54 UTC

Tried as indicated
linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70ad.tar.gz
cp config.x86_64 .config
make INSTALL_MOD_STRIP=1 rpm
then installed on Atom Z3735G
running for 1 hour now without kernel parameter, neither cstate nor tsc.

Everything else I ever tried crashed in less than 1 hour, sometimes in
1 minute without kernel parameters.

On 10/25/16, bugzilla-daemon@bugzilla.kernel.org
<bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=109051
>
> --- Comment #549 from Sebastian Heyn <sebastian.heyn@yahoo.de> ---
> Has anyone tried this kernel yet?
>
> https://aur.archlinux.org/packages/linux-baytrail48/
>

Comment 553 Sebastian Heyn 2016-11-07 08:17:32 UTC

My system has been running stably for 2 weeks now with the Arch Linux baytrail kernel (I had to restart it once to swap an HDD, however). Could it be that a headless system is much less likely to crash?

Comment 554 Daniel Glöckner 2016-11-07 09:51:40 UTC

(In reply to Sebastian Heyn from comment #553)
> Could it be that a headless system is much less likely to crash?

Yes, the initial bug report linked in the first comment thought the problem was related to the GPU driver. Even if you are using the unaccelerated efifb + xf86-video-fbdev driver, you are less likely to get the freeze. After all, the best way to trigger the problem so far has been to play videos.

Comment 555 Sebastian Heyn 2016-11-07 10:14:09 UTC

Hi Daniel,

thanks. Playing videos means GPU decoding or high framerate framebuffer access via CPU?

Comment 556 Paul Mansfield 2016-11-07 11:08:35 UTC

I've been using a J1900 board as a router/firewall/fileserver for a couple of years now, it's a gigabyte ga-j1900d3v (chosen for the dual gigabit NICs and the low power consumption). It's pretty stable, runs for weeks and weeks without locking up, but of course there's no video activity - often there's not even a monitor plugged in! However, when it does lock up, it needs a forced reset, as it will have locked up solid.

Comment 557 Paul Mansfield 2016-11-07 11:15:28 UTC

p.s. I've never used the cstate hack, always used stock kernels without any special patching.

Comment 558 thorsten 2016-11-07 20:47:42 UTC

Update on my side: 
4.8.4-gentoo has been working for several days so far without patches or disabling cstate options on my machine.

If anyone is interested, I have provided my kernel as a download with modules and initrd:

http://s000.tinyupload.com/index.php?file_id=06491416522851495522

md5sum:
4c7fbd190b8656899cfe3b35dbd6f185  kernel.tar.bz2

sha1sum:
3218d1a4064b649d64c46fa493c3d364f1f02737  kernel.tar.bz2

I have an Aspire ES1-311.
Would be interested if this kernel works on other machines, too.
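
Anyone downloading the archive may want to compare it against the checksums posted above before unpacking it. The following is a minimal sketch, assuming GNU coreutils' `sha1sum` is installed; the helper name `verify_sha1` is made up for illustration.

```shell
#!/bin/sh
# Sketch: compare a file's SHA-1 digest with an expected value.
# verify_sha1 is a hypothetical helper; sha1sum comes from GNU coreutils.
# $1 = file, $2 = expected hex digest. Returns 0 only on a match.
verify_sha1() {
	# sha1sum prints "<digest>  <file>"; keep only the digest field.
	actual=$(sha1sum "$1" | cut -d ' ' -f 1)
	[ "$actual" = "$2" ]
}
```

With the values from this comment, the call would be `verify_sha1 kernel.tar.bz2 3218d1a4064b649d64c46fa493c3d364f1f02737 && echo OK`.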

Comment 559 Sebastian Heyn 2016-11-08 10:41:38 UTC

@Thorsten,

can you check if your /proc/cpuinfo shows the correct frequency info? Mine seems to hang on less than 500MHZ, using the ondemand governor

Comment 560 thorsten 2016-11-08 19:44:21 UTC

@Sebastian,

I have 499 MHz shown in /proc/cpuinfo too, but I think it's a display error:

~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 500 MHz - 2.25 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 500 MHz and 2.25 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 500 MHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

and:
~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 500 MHz - 2.25 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 500 MHz and 2.25 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: 2.25 GHz (asserted by call to hardware)
  boost state support:
    Supported: yes
    Active: yes

Maybe the difference between my kernel and the 'regular' one is that CONFIG_CPU_FREQ_GOV_SCHEDUTIL is set as the default in my config.
Strange though that cpupower only shows powersave and performance as available governors.
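
To cross-check what /proc/cpuinfo and cpupower report, the same values can be read straight from the cpufreq sysfs files. This is a minimal sketch, assuming the usual `scaling_governor` and `scaling_cur_freq` attributes are present; the helper name `show_cpufreq` and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: print governor and current frequency for each CPU from sysfs.
# show_cpufreq is a hypothetical helper. $1 = CPU sysfs root (optional).
show_cpufreq() {
	root=${1:-/sys/devices/system/cpu}
	for d in "$root"/cpu[0-9]*/cpufreq; do
		[ -r "$d/scaling_governor" ] || continue
		read -r gov < "$d/scaling_governor" || continue
		read -r khz < "$d/scaling_cur_freq" || continue
		cpu=${d%/cpufreq}
		# scaling_cur_freq is reported in kHz; convert to MHz.
		printf '%s %s %d MHz\n' "${cpu##*/}" "$gov" "$((khz / 1000))"
	done
}
```

Calling `show_cpufreq` a few times under load and at idle would show whether the frequency really stays at the minimum or only the display is stale.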

Comment 561 Poumon 2016-11-10 20:44:09 UTC

@Thorsten

Hello! I use an Asus E502 with an Intel Corporation Atom Processor Z36xxx/Z37xxx.
I have been affected by this problem for 1 year.

Kernel 4.8.4 has been working for me for 2 days (without intel_idle.max_cstate=1). Does it still work for you?

Comment 562 luanrafael19 2016-11-11 14:00:27 UTC

I had these freezing problems on my Asus z550ma notebook, which uses the n2940 processor. How I managed to solve the problem in Ubuntu 16.04:

1- I installed the Intel video drivers (https://01.org/linuxgraphics/downloads)
2- I updated the Ubuntu kernel to version 4.8.6
3- After updating the kernel, I went to: Settings, Software & Updates, Additional Drivers and disabled the Processor microcode firmware for Intel CPUs from intel-microcode (I set it to not use this device)
4- I restarted the notebook
5- I created a configuration file named "i915.conf" (type without quotes) and put this line in it: "options i915 modeset=1 enable_execlists=0" (type without quotes)
6- I pasted this configuration file into the /etc/modprobe.d folder
7- I restarted the PC
8- Result: the notebook has not frozen for approximately 1 week (I am using it all day)

NOTE: I did not need to add this option (intel_idle.max_cstate=1) to Grub

This z550ma notebook also has a problem with the wifi card. To solve it, just install the card's drivers (the Diolinux site has a tutorial) and add a file to the /etc/modprobe.d folder

This file is named "rtl8723be.conf" (type without quotes) and should contain the line: "options rtl8723be fwlps=N ips=N" (type without quotes)

The internet now works normally here.

I hope this helps, everyone!!!

NOTE: Brazilians understand Linux too!!!

Comment 563 luanrafael19 2016-11-11 14:04:56 UTC

I had these freezing issues on my Asus z550ma notebook, which uses the n2940 processor. How I solved the problem in Ubuntu 16.04:

1- I installed Intel's video drivers (https://01.org/linuxgraphics/downloads)
2- I updated the Ubuntu kernel to version 4.8.6
3- After updating the kernel, I went to: Settings, Software & Updates, Additional Drivers and deactivated the Processor microcode firmware for Intel CPUs from intel-microcode (I set it to not use this device)
4- I restarted the notebook
5- I created a configuration file with the name "i915.conf" (enter without quotes) and inside it inserted the line: "options i915 modeset=1 enable_execlists=0" (enter without quotes)
6- I placed this configuration file in the /etc/modprobe.d folder
7- I restarted the PC
8- Result: the notebook has not frozen for approximately 1 week (I'm using it all day)

This z550ma notebook also has a problem with the wifi card. To solve it, simply install the card's drivers (the Diolinux website has a tutorial) and add a file to the /etc/modprobe.d folder.

This file has the name "rtl8723be.conf" (enter without quotes) and should contain the line: "options rtl8723be fwlps=N ips=N" (enter without quotes)
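
The two configuration files described in this comment could be created from a shell as follows. This is a minimal sketch under the assumption that the option lines are exactly the ones quoted above; the helper name `write_modprobe_conf` and its optional directory argument are made up for illustration. Whether these module options still exist depends on the kernel version, and the change may only take effect after a reboot (and possibly an initramfs rebuild).

```shell
#!/bin/sh
# Sketch: write the two module option files described above.
# write_modprobe_conf is a hypothetical helper.
# $1 = target directory (defaults to /etc/modprobe.d, which needs root).
write_modprobe_conf() {
	dir=${1:-/etc/modprobe.d}
	# No spaces around "=": modprobe expects option=value pairs.
	printf '%s\n' 'options i915 modeset=1 enable_execlists=0' > "$dir/i915.conf" || return 1
	printf '%s\n' 'options rtl8723be fwlps=N ips=N' > "$dir/rtl8723be.conf"
}
```

Passing a scratch directory first, for example `write_modprobe_conf /tmp/test`, makes it possible to inspect the files before writing them to /etc/modprobe.d.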

The internet now works normally here.

I hope I have helped people !!!

Note: BR also understands Linux

Sorry for my bad English!!!!

Comment 564 Poumon 2016-11-11 18:02:57 UTC

@Thorsten

Ubuntu froze this morning after 3 days of use with 4.8.4. False hope...

Comment 565 mhartzel 2016-11-11 20:16:38 UTC

In my experience it may take a week or two before the first freeze happens. It would be very helpful if people could wait and use their machines for 7 - 14 days before declaring success. This would help us weed out false positives :)

Thanks for filling in your success and failure details into the spreadsheet bms created, it seems to me patterns are emerging, please keep filling in details about your experiments :)

Comment 566 RussianNeuroMancer 2016-11-11 23:21:34 UTC

I would say that if the freeze now takes a few days to reproduce, while before it took a few hours or even minutes, that is already a success to some degree.

By the way, these patches could be interesting for some subscribers: https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8

Patches 0001 and 0006 especially could probably reduce hangs even more.

Comment 567 thorsten 2016-11-12 06:52:49 UTC

@Poumon I had my first two freezes with my kernel yesterday. With my older kernels I had daily freezes without using other options. So sorry for the false positive.

@RussianNeuroMancer I also think it's an improvement if we can now use all the power-saving features on an 'unpatched' kernel for multiple days.

Comment 568 thorsten 2016-11-12 07:26:08 UTC

I have an unused J1800 desktop machine, so I'll try to reproduce the problem and maybe try to get a kernel stacktrace over serial terminal in the next week.
I hope we can pinpoint the actual origin of the problem this way. 

@RussianNeuroMancer if the problem were mmc-related, a regular desktop user without an mmc reader should not be affected, since the mmc modules would not be loaded. But maybe the other patches could change something.

Comment 569 RussianNeuroMancer 2016-11-12 07:37:14 UTC

@thorsten, if the problem is mmc-related, I wonder why it wasn't fixed a long time ago. Patches for the mmc hang have literally been available for years.

Comment 570 Ajay Garg 2016-11-15 14:00:20 UTC

I am facing freezing issues on a SOC running on Intel-Celeron-J1900. The devices are supposed to be deployed in areas with not a single human-being, so freezes are unacceptable. Also, I really don't care about power-consumption.

I was wondering why no one has tried the following kernel options:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll


If I am not being idiotic, above options would surely switch-off all power-management-possibilities?
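
One way to confirm that options like the ones above actually reached the kernel is to look at /proc/cmdline after boot. The following is a minimal sketch; the helper name `has_boot_param` and its optional file argument are made up for illustration, and it only checks for an exact, whole-word match.

```shell
#!/bin/sh
# Sketch: check whether a parameter is present on the kernel command line.
# has_boot_param is a hypothetical helper.
# $1 = exact parameter (e.g. idle=poll), $2 = cmdline file (optional).
has_boot_param() {
	file=${2:-/proc/cmdline}
	line=$(cat "$file") || return 1
	# Pad with spaces so only whole, space-separated words match.
	case " $line " in
		*" $1 "*) return 0 ;;
	esac
	return 1
}
```

For example, `has_boot_param idle=poll && echo "idle=poll is active"` prints the message only if the parameter was passed at boot.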

Comment 571 Sebastian Heyn 2016-11-15 15:42:44 UTC

@Ajay:

Are the machines headless or running an active X session? Some boards allow switching off all power management in the BIOS.

Comment 572 Ajay Garg 2016-11-15 15:52:17 UTC

@Sebastien,

Thanks for the reply.

Nopes, each machine has Ubuntu-14.04.3 installed, with kernel upgraded manually to 3.19(-generic).

I don't have a board with me right now, so cannot confirm if there is an option in the BIOS. But irrespective of that, won't each of the kernel-options (as per my previous post) work?


The important question is, might

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

break anything over a period of time?

Comment 573 Ajay Garg 2016-11-15 15:54:02 UTC

(In reply to Ajay Garg from comment #572)
> @Sebastien,
> 
> Thanks for the reply.
> 
> Nopes, each machine has Ubuntu-14.04.3 installed, with kernel upgraded
> manually to 3.19(-generic).

Pardon me, I meant a full-blown client-image of Ubuntu-14.04.3, with all the fancy GUI.


> 
> I don't have a board with me right now, so cannot confirm if there is an
> option in the BIOS. But irrespective of that, won't each of the
> kernel-options (as per my previous post) work?
> 
> 
> The important question is, might
> 
> intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
> 
> break anything over a period of time?

Comment 574 mhartzel 2016-11-15 17:06:34 UTC

@Ajay Garg 

I don't own a J1900 device, but I guess the main concern with those options is heat. The kernel documentation (https://www.kernel.org/doc/Documentation/kernel-parameters.txt) warns that idle=poll will make the machine run hot. It may be better to drop that option if possible.

I guess it is best to test the machines in advance. You could install software for monitoring CPU / GPU temperature and run workloads that will be typical once the machines are installed on location. This way you will get some hard data about the machines' reliability.
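
For the temperature monitoring suggested here, the kernel's thermal sysfs interface can be read directly, without extra packages. This is a minimal sketch, assuming the machine exposes `thermal_zone*` directories under /sys/class/thermal; the helper name `show_temps` and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: print each thermal zone's type and temperature in degrees Celsius.
# show_temps is a hypothetical helper. $1 = thermal sysfs root (optional).
show_temps() {
	root=${1:-/sys/class/thermal}
	for z in "$root"/thermal_zone*; do
		[ -r "$z/temp" ] || continue
		read -r milli < "$z/temp" || continue
		read -r kind < "$z/type" || kind=unknown
		# The kernel reports millidegrees Celsius; print one decimal place.
		printf '%s %s %d.%d C\n' "${z##*/}" "$kind" \
			"$((milli / 1000))" "$((milli % 1000 / 100))"
	done
}
```

A loop such as `while sleep 10; do show_temps; done` run alongside a typical workload would give a simple temperature log.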

It's really frustrating that Intel hardware is as buggy as it is right now. I can't remember a worse period in Intel's history. I guess they are really afraid of the flood of ARM devices, and in trying to compete with those they are going too far with aggressive power-saving features.

Comment 575 DE 2016-11-15 21:00:49 UTC

(In reply to Wolfgang M. Reimer from comment #539)

Hello, you have missed the lucky 23.

[23] AAZ32: Intel Celeron P4000 and U3000 Mobile Processor Series
     http://www.intel.ie/content/dam/www/public/us/en/documents/specification-updates/celeron-mobile-p4000-u3000-specification-update.pdf

Comment 576 Martin Brand 2016-11-15 21:20:46 UTC

Just tried 4.8.7 on Ubuntu 16.04. This kernel won't even boot. I have a N3700 CPU. So back to 4.8.6.

Comment 577 thorsten 2016-11-16 06:25:09 UTC

I tried the serial console approach and could not get a kernel crash dump this way, despite the machine freezing with 4.8.7. My guess is that, because this seems to be a hardware bug, the CPU is frozen before the kernel can produce a crash dump or reach my serial console.

@Ajay Garg
I would probably disable cpufreq (and cstates) altogether on a non-mobile machine. The downside would be a hotter and possibly louder machine. Also, if your chipset has watchdog functionality, that might be a way to reset the machine automatically after a freeze, if it helps with your application.
Alternatively, in your use case I would probably switch to unaffected hardware, e.g. a Celeron 847, or otherwise do a lot of testing first.

@Martin Brand
I have a Pentium N3700 too and have not been affected by this bug so far. Have you had freezes before?

Comment 578 André Hoogendoorn 2016-11-16 11:12:25 UTC

It IS a hardware bug and Intel should fix it.

Comment 579 Martin Brand 2016-11-16 17:40:46 UTC

Yes I did have freezes before, but never during boot. The c6off+c7on scripts from Wolfgang Reimer made my system usable. So thanks a lot for that!
Without the script it usually freezes within an hour. With c6off about once every one to two weeks. Still very annoying when it happens.

Comment 580 RussianNeuroMancer 2016-11-16 18:14:37 UTC

What are the conditions for entering C7? I ran this script and am looking at powertop output now: cores spend 96-97% of the time in C1. So it looks like it doesn't differ much from the known intel_idle.max_cstate=1 workaround.

Comment 581 Martin Brand 2016-11-16 20:16:21 UTC

My Powertop displays the following
PowerTOP 2.8      Übersicht  Untätigkeits Frequenzstatistik Gerätestatisti Einstellbarkeit                              
          Paket     |             Kern    |            CPU 0
                    |                     | C0 aktiv    7,2%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   23,7%    | C1-CHT     22,3%    1,2 ms
                    |                     |
                    |                     |
C6 (pc6)   18,1%    | C6 (cc6)   62,9%    | C6S-CHT     0,0%    0,0 ms
                    |                     | C7S-CHT    16,3%   20,7 ms

                    |             Kern    |            CPU 1
                    |                     | C0 aktiv   22,4%
                    |                     | POLL        0,0%    0,0 ms
                    | C1 (cc1)   11,3%    | C1-CHT      9,5%    2,1 ms
                    |                     |
                    |                     |
                    | C6 (cc6)   59,3%    | C6S-CHT     0,0%    0,0 ms
                    |                     | C7S-CHT    22,6%   22,9 ms
So it uses C7 state. Battery life is better and CPU temperature is also about 10°C less than with intel_idle.max_cstate=1 workaround

Comment 582 RussianNeuroMancer 2016-11-17 02:41:32 UTC

OK, I need to clarify that in my case it's a BayTrail Z3735G CPU. After running the script I stop getting PC7 and CC6 (which worked before the script ran) and don't get the expected C7S-BYT (constantly 0%, while C6S-BYT was sometimes over 90% before the script ran). So for me the script's outcome is no different from the intel_idle.max_cstate=1 workaround.

Is there anyone who has working PC7/CC6/C7S-BYT on a BayTrail device after disabling C6S-BYT?
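
Besides powertop, the per-state counters can be read directly from the cpuidle sysfs files, which makes it easy to post comparable numbers. This is a minimal sketch, assuming the standard `name`, `disable`, `usage` and `time` attributes are present; the helper name `show_idle_states` and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: list cpuidle states of one CPU with entry count and total residency.
# show_idle_states is a hypothetical helper.
# $1 = CPU name (e.g. cpu0), $2 = CPU sysfs root (optional).
show_idle_states() {
	root=${2:-/sys/devices/system/cpu}
	for s in "$root/$1"/cpuidle/state[0-9]*; do
		[ -r "$s/name" ] || continue
		read -r name < "$s/name"
		read -r off < "$s/disable"
		read -r count < "$s/usage"
		read -r usec < "$s/time"
		# "time" is the total time spent in the state, in microseconds.
		printf '%s %s disabled=%s entries=%s time_us=%s\n' \
			"${s##*/}" "$name" "$off" "$count" "$usec"
	done
}
```

Running `show_idle_states cpu0` before and after the C6/C7 script shows whether C7S-BYT is entered at all once C6S-BYT is disabled.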

Comment 583 Elmar Melcher 2016-11-18 12:52:45 UTC

On CPU Z3735G, I always saw a Call Trace related to hard LOCKUP on the screen when it froze while in console mode. As also reported in other comments (#157, #568), these Call Traces were not logged by the system.

Today I received this message in a xterm:

Message from syslogd@leia at Nov 18 09:50:32 ...
 kernel:NMI watchdog: Watchdog detected hard LOCKUP on cpu 3#001dModules linked in: msr r8723bs(O) intel_...

And found the complete Call Trace in dmesg:

[  261.956671] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3dModules linked in: msr r8723bs(O) intel_rapl intel_soc_dts_thermal nls_iso8859_1 intel_powerclamp coretemp nls_cp437 vfat kvm_intel fat kvm iTCO_wdt snd_soc_sst_bytcr_rt5640 gpio_keys iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul joydev glue_helper input_leds snd_usb_audio snd_usbmidi_lib snd_hwdep mousedev ablk_helper cryptd snd_soc_rt5645 snd_rawmidi mac_hid cfg80211 intel_cstate pcspkr thermal evdev kxcjk_1013 snd_intel_sst_acpi snd_intel_sst_core tpm_crb industrialio_triggered_buffer snd_soc_rt5640 soc_button_array snd_soc_sst_mfld_platform kfifo_buf snd_soc_rl6231 snd_soc_sst_match industrialio dptf_power int3406_thermal int3403_thermal snd_soc_core processor_thermal_device int3402_thermal goodix int3400_thermal battery snd_compress int340x_thermal_zone snd_pcm_dmaengine acpi_thermal_rel intel_soc_dts_iosf ac97_bus hci_uart snd_seq snd_seq_device b
[  261.956681] CPU: 3 PID: 3055 Comm: inkscape Tainted: G           O    4.8.6-BAYTRAIL48 #1
[  261.956684] Hardware name: Positivo Informatica SA WCBT1013/WCBT1013, BIOS 1.7 06/09/2015
[  261.956687]  0000000000000086 000000000bea3d8f ffff880038643bf0 ffffffff812f9d4b
[  261.956690]  0000000000000000 0000000000000000 ffff880038643c08 ffffffff8111d918
[  261.956693]  ffff880038d30800 ffff880038643c40 ffffffff811613ac 0000000000000001
[  261.956695] Call Trace:
[  261.956698]  [<ffffffff812f9d4b>] dump_stack+0x63/0x88
[  261.956700]  [<ffffffff8111d918>] watchdog_overflow_callback+0xc8/0xf0
[  261.956703]  [<ffffffff811613ac>] __perf_event_overflow+0x7c/0x1b0
[  261.956706]  [<ffffffff81169664>] perf_event_overflow+0x14/0x20
[  261.956708]  [<ffffffff8100c147>] intel_pmu_handle_irq+0x1e7/0x4a0
[  261.956711]  [<ffffffff81185606>] ? __pagevec_lru_add_fn+0x186/0x290
[  261.956714]  [<ffffffff811f7395>] ? mem_cgroup_commit_charge+0x85/0x100
[  261.956716]  [<ffffffff81187209>] ? lru_cache_add_active_or_unevictable+0x39/0xc0
[  261.956719]  [<ffffffff811af8da>] ? handle_mm_fault+0x41a/0x1550
[  261.956722]  [<ffffffff810055ed>] perf_event_nmi_handler+0x2d/0x50
[  261.956724]  [<ffffffff810312d1>] nmi_handle+0x61/0x140
[  261.956727]  [<ffffffff81031878>] default_do_nmi+0x48/0x130
[  261.956730]  [<ffffffff81031a4b>] do_nmi+0xeb/0x160
[  261.956732]  [<ffffffff815f1406>] nmi+0x56/0xa5


The system did not freeze but continued to operate normally.

Kernel was 4.8.6 with following patches:

from https://aur.archlinux.org/packages/linux-baytrail48/, from linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70a, baytrailfix[1-5].patch 

from https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8, patch 0001*, 0006*, and 0008*

and config:

from https://aur.archlinux.org/packages/linux-baytrail48, from linux-4.8-3-baytrail-60cacd661dacfd0a7c4aa6f82d11f1c1664e70a, config.x86_64

Clocksource during this event was
cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
refined-jiffies

but often, after reboot in this configuration, clocksource is tsc.
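
Since the clocksource apparently varies between boots, it may help to record both the active clocksource and the ones the kernel offers whenever a freeze or lockup message is reported. This is a minimal sketch; the helper name `show_clocksource` and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: report the current clocksource and the ones the kernel offers.
# show_clocksource is a hypothetical helper. $1 = sysfs directory (optional).
show_clocksource() {
	dir=${1:-/sys/devices/system/clocksource/clocksource0}
	read -r current < "$dir/current_clocksource" || return 1
	read -r available < "$dir/available_clocksource" || return 1
	printf 'current: %s\navailable: %s\n' "$current" "$available"
}
```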

Comment 584 Josep Pujadas-Jubany 2016-11-22 16:59:04 UTC

The scripts in comment #437 solve the problem. However, I had to modify c6off+c7on.sh to make it work for CHT (Cherry Trail) processors.

The latest stable kernel for Ubuntu (4.8.10) also seems to solve the problem.

More details at:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467/comments/142

Comment 585 Martin Brand 2016-11-22 19:47:50 UTC

Same here. I have upgraded to 4.8.9. I was not able to boot 4.8.7. Kernel 4.8.9 with c6off+c7on.sh has definitely improved the situation. So far no crash. The system survived 2 hour HD film and an old windows game played with wine. This has not happened before, so I am very hopeful.

Comment 586 Ajay Garg 2016-11-28 09:06:28 UTC

Len Brown (and Intel in general) should be ashamed of themselves (assuming they still have some self-respect left).

Making mistakes is perfectly acceptable (we all make mistakes). But being shamelessly quiet is a sign of impotency.

Comment 588 Dan0780 2016-11-28 21:13:46 UTC

Hello,

For all of you having issues with this: I used the c6off+c7on script and it did not solve my problem. So I modified the script to turn off both C6 & C7 and have not had a freeze in months. My altered script is below. Hope it helps some.

#!/bin/sh

#title:       c6off+c7off.sh
#description: Disables all C6 and C7 core states for Baytrail CPUs
#author:      Wolfgang Reimer <linuxball (at) gmail.com>
#date:        2016014
#version:     1.0    
#usage:       sudo <path>/c6off+c7off.sh
#notes:       Intended as test script to verify whether erratum VLP52 (see
#             [1]) is the root cause for kernel bug 109051 (see [2]). In order
#             for this to work you must _NOT_ use boot parameter
#             intel_idle.max_cstate=<number>.
#
# [1] http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
# [2] https://bugzilla.kernel.org/show_bug.cgi?id=109051

# Disable ($1 == 1) or enable ($1 == 0) core state, if not yet done.
disable() {
	local action
	read disabled <disable
	test "$disabled" = $1 && return
	echo $1 >disable || return
	action=ENABLED; test "$1" = 0 || action=DISABLED
	printf "%-8s state %7s for %s.\n" $action "$name" $cpu  
}

# Iterate through each core state and for Baytrail (BYT) disable all C6 & C7 states.
cd /sys/devices/system/cpu
for cpu in cpu[0-9]*; do
	for dir in $cpu/cpuidle/state*; do
		cd "$dir"
		read name <name
		case $name in
			C6*-BYT) disable 1;;
			C7*-BYT) disable 1;;
		esac
		cd ../../..
	done
done

Comment 589 mhartzel 2016-11-29 10:18:52 UTC

@Dan0780 Please tell us what your processor is. Without this info we don't know in which cases your solution helps.

Thanks :)

Comment 590 mhartzel 2016-11-29 10:23:52 UTC

Dan0780: Could you please fill in details of your solution in the google spreadsheet BMS created so your solution will be easily found by people having the same hardware.

The spreadsheet is here:

https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit?usp=sharing

Comment 591 Dan0780 2016-11-29 12:21:07 UTC

Sorry, my processor is J1900.  I will try and fill out the spreadsheet

Comment 592 Michaël 2016-11-29 13:45:22 UTC

(In reply to Dan0780 from comment #591)
> Sorry, my processor is J1900.  I will try and fill out the spreadsheet

Dan; as far as I can see (a diff would have been useful), the difference with the original script is that you actually disable C7—this really does the same as max_cstate=1 then.

Comment 593 ladiko 2016-11-29 14:15:16 UTC

The script is unnecessarily complicated ...

# baytrail workaround for https://bugs.freedesktop.org/show_bug.cgi?id=88012
for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
	case "$(< "${state}/name")" in
		C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
		C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
	esac
done

or to disable C6 and C7:

for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
	case "$(< "${state}/name")" in
		C6*-BYT|C6*-CHT|C7*-BYT|C7*-CHT) echo "1" > "${state}/disable" ;;
	esac
done

It just lacks all the feedback of the other script, while the shell will still complain if you don't have the right permissions. Otherwise, if everything is fine, you just get no feedback, which is fine for me.

Comment 594 Dan0780 2016-11-29 16:26:47 UTC

(In reply to Michaël from comment #592)
> (In reply to Dan0780 from comment #591)
> > Sorry, my processor is J1900.  I will try and fill out the spreadsheet
> 
> Dan; as far as I can see (a diff would have been useful), the difference
> with the original script is that you actually disable C7—this really does
> the same as max_cstate=1 then.

I did not have the option to set max_cstate=1, and therefore I modified the original script. Either way, I just wanted to share: the original script, which disables only C6, did not work for me, but disabling both C6 and C7 worked and I have had no issues.

Comment 595 Wolfgang M. Reimer 2016-11-29 16:58:11 UTC

(In reply to ladiko from comment #593)
> The script is unnecessarily complicated ...

... for you. I NEED the feedback (Usually I test hundreds of boxes with different combinations of enabled/disabled cstates and log the output for documentation purposes. For the next test result I need to document what exactly has changed from the previous state).

> 
> # baytrail workaround for https://bugs.freedesktop.org/show_bug.cgi?id=88012
> for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
>       case "$(< "${state}/name")" in
>               C6*-BYT|C6*-CHT) echo "1" > "${state}/disable" ;;
>               C7*-BYT|C7*-CHT) echo "0" > "${state}/disable" ;;
>       esac
> done
> 
> or to disable C6 and C7:
> 
> for state in /sys/devices/system/cpu/cpu*/cpuidle/state* ; do
>       case "$(< "${state}/name")" in
>               C6*-BYT|C6*-CHT|C7*-BYT|C7*-CHT) echo "1" > "${state}/disable"
> ;;
>       esac
> done
> 
> it just lacks all  the feedback of the other script while the shell will
> complain in case you dont have the right permissions. otherwise if
> everything is fine - you just get no feedback - fine for me.

... and your changed script IS NOT POSIX shell (dash, busybox' ash) compatible  any longer (which is REQUIRED in my case).
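
The non-POSIX part of the shorter loop is the `$(< file)` construct. A variant that should also run under dash or busybox ash could use `read` instead. This is only a sketch of that idea, not a script posted in the thread; the function name `set_byt_cstates` and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: POSIX sh variant of the shorter loop quoted above.
# set_byt_cstates is a hypothetical helper. $1 = CPU sysfs root (optional).
# Disables C6 states and enables C7 states on Bay Trail / Cherry Trail.
set_byt_cstates() {
	root=${1:-/sys/devices/system/cpu}
	for state in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
		[ -r "$state/name" ] || continue
		# "read" replaces the non-POSIX "$(< file)" construct.
		read -r name < "$state/name"
		case $name in
			C6*-BYT|C6*-CHT) echo 1 > "$state/disable" ;;
			C7*-BYT|C7*-CHT) echo 0 > "$state/disable" ;;
		esac
	done
}
```

Like the shorter loop, it prints nothing on success, so it does not replace the logging of the original script.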

Comment 596 Wolfgang M. Reimer 2016-11-29 17:04:39 UTC

(In reply to DE from comment #575)
> (In reply to Wolfgang M. Reimer from comment #539)
> 
> Hello, you have missed the lucky 23.
> 
> [23] AAZ32: Intel Celeron P4000 and U3000 Mobile Processor Series
>     
> http://www.intel.ie/content/dam/www/public/us/en/documents/specification-
> updates/celeron-mobile-p4000-u3000-specification-update.pdf

Thanks, added to my list.

Comment 597 Pshem K 2016-11-29 20:14:43 UTC

I have an unnamed board based on a J1900 with a number of GigE ports. The device is used as a router and runs headless. I've experimented with a number of combinations, and simply disabling the C6S-BYT state (using a script) made the biggest improvement for me (forcing max_cstate=1 also works, but the CPU runs hotter). Without C6S-BYT disabled, the uptime was never longer than 24h: sometimes the device would reboot, other times it locked up hard, never leaving a trace on the serial console of what exactly went wrong. This is using the standard 4.4.0-47-generic Ubuntu Xenial kernel. Now I have an uptime of over 7 days with no issues.

Comment 598 Hal 2016-12-01 17:06:14 UTC

After running my Zotac (Intel® Celeron® Processor N2930 Bay Trail family) box without any crashes for over 2 months (thanks to intel_idle.max_cstate=1) a few days ago I installed the latest Linux Mint stock version kernel (4.4.0-51) with intel_idle.max_cstate=1.

To my greatest surprise my machine froze a few hours later while it was in light use. I found it frozen again the following morning while it was idling overnight. Then again this morning completely frozen.

This is a significant regression in this machine's case, as from the very beginning of this saga intel_idle.max_cstate=1 has been a life saver, and until now no kernel version had frozen while using intel_idle.max_cstate=1.

So, right now I have it running with 4.4.35-040435-generic #201611260431 SMP (obtained from http://kernel.ubuntu.com/~kernel-ppa/mainline/)

Hopefully it will work better. I know that I can always go back to 4.4.0 from Ubuntu, but I am concerned that version 4.4.0 might have known security vulnerabilities.

Hal

Comment 599 Martin Brand 2016-12-01 19:57:19 UTC

Hi Hal,
I am using kernel 4.8.9 with C6 states disabled. This has worked for me since November 21. Why don't you try this kernel before you go back to 4.4.0?

Comment 600 mhartzel 2016-12-01 21:19:48 UTC

@ Hal I have a N2930 processor and had freezes again when migrating from kernel 4.4 to 4.7. The commands below stopped the freezes when used with  intel_idle.max_cstate=0. Try these :)

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state3/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state3/disable

Comment 601 john 2016-12-02 08:51:08 UTC

max_cstate=1 does not work for my ASRock Q1900DC-ITX with a J1900 processor.
Using Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-51-generic x86_64)

Average uptime is around 48-72 hours; with the C6 and C7 fix it's even less, about 26 hours. I'm getting really frustrated with this. I'm thinking of buying an i3 barebone PC from Gigabyte: GB-BKi3A-7100 http://www.gigabyte.com/products/product-page.aspx?pid=6079#ov

Does the i3-7100U have similar issues with freezing?
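
When comparing results with and without intel_idle.max_cstate, it can help to confirm that the parameter really reached the running kernel, since a misspelled name on the command line is silently ignored. A minimal sketch, assuming the booted command line is exposed in /proc/cmdline; the helper name cmdline_param is made up for this illustration:

```shell
# Print the value of a kernel command-line parameter; return non-zero if absent.
# Assumption: /proc/cmdline holds the booted command line.
# The helper name and its optional file argument are illustrative only.
cmdline_param() {
    local key="$1"
    local file="${2:-/proc/cmdline}"   # overridable so the helper can be tried on a copy
    local word
    for word in $(cat "$file"); do
        case "$word" in
            "$key"=*)
                echo "${word#*=}"      # strip everything up to the first '='
                return 0
                ;;
        esac
    done
    return 1
}
```

For example, `cmdline_param intel_idle.max_cstate || echo "not set"` prints the configured value, or "not set" when the parameter is missing.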

Comment 602 Justin 2016-12-02 10:56:20 UTC

You also need to run

#!/bin/bash
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable

in order to stop the crashing. I did this with rc.local and a script.

Comment 603 john 2016-12-02 11:07:19 UTC

Okay, right now I have the script running from crontab:
@reboot /home/john/scripts/c6off+c7off.sh

Can I just add the

echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable

lines at the very end?

Comment 604 john 2016-12-02 11:34:36 UTC

(In reply to john from comment #603)
> Okay, right now i have the script running from crontab:
> @reboot /home/john/scripts/c6off+c7off.sh
> 
> can i just add the 
> 
> echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state6/disable
> echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state6/disable
> 
> lines very end?

Also, running ls -la in:

/sys/devices/system/cpu/cpu0/cpuidle

only gives 2 states:

state0  state1
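
The scripts in this thread address idle states by index (state3, state6), but which indices exist depends on the driver and kernel parameters, as the two-state listing here shows. One way around that is to match on each state's name file instead. A minimal sketch, assuming the standard cpuidle sysfs layout (cpuN/cpuidle/stateM/name and .../disable); the function name is made up, and the C6N-BYT and C6S-BYT names are the ones mentioned elsewhere in this thread:

```shell
# Disable every cpuidle state whose name matches a shell pattern, on all CPUs.
# Assumption: standard cpuidle sysfs layout with "name" and "disable" files.
# The function name and the optional root argument are illustrative only.
disable_states_by_name() {
    local pattern="$1"                           # e.g. 'C6*' matches C6N-BYT and C6S-BYT
    local root="${2:-/sys/devices/system/cpu}"   # overridable for a dry run on a copy
    local dir name
    for dir in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$dir/name" ] || continue           # skip when the glob matched nothing
        name=$(cat "$dir/name")
        case "$name" in
            $pattern)
                echo 1 > "$dir/disable" && echo "disabled $name in $dir"
                ;;
        esac
    done
}
```

Run as root, `disable_states_by_name 'C6*'` would touch only the states that actually exist; if none match (for example with max_cstate=1), it does nothing.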

Comment 605 Hal 2016-12-02 17:09:56 UTC

(Follow up to own post #598)
> After running my Zotac (Intel® Celeron® Processor N2930 Bay Trail family)
> box without any crashes for over 2 months (thanks to
> intel_idle.max_cstate=1) a few days ago I installed the latest Linux Mint
> stock version kernel (4.4.0-51) with intel_idle.max_cstate=1.
> 
> To my greatest surprise my machine froze a few hours later while it was in
> light use. I found it frozen again the following morning while it was idling
> overnight. Then again this morning completely frozen.
> 
> This is a significant regression in this machine's case, as from the very
> beginning of this saga intel_idle.max_cstate=1 has been a life saver, and
> until now no kernel version had frozen while using intel_idle.max_cstate=1.
> 
> So, right now I have it running with 4.4.35-040435-generic #201611260431 SMP
> (obtained from http://kernel.ubuntu.com/~kernel-ppa/mainline/)
> 
> Hopefully it will work better. I know that I can always go back to 4.4.0
> from ubuntu, but I am concerned that version 4.4.0 might have known security
> vulnerabiities).
> 
> Hal

Thanks for the suggestions. For now, I am sticking to 4.4.35-040435-generic as it seems to be working fine. No freezing or crash in 25 hours!

One reason I am sticking to the 4.4 line is that it is a long-term support version. In the past I ran 4.5.7, but then it was no longer maintained because of EOL. I am still not sure if the 4.8 line will be long term or not. Also, my (albeit superficial) reading of comments about it gave me the impression that 4.8 started off on the wrong foot (insufficient QA on its initial release).

Anyway, I only update the kernel when I need it to support new devices (like USB 3.1 or 802.11ac), or when I hear about newly found vulnerabilities, and I typically try to stay one step behind rather than at the cutting edge.

I'll keep the thread posted if anything new happens but so far 4.4.35 looks good!

Hal

Comment 606 nw9165-3201 2016-12-07 15:35:27 UTC

(In reply to Hal from comment #605)
> I am still not sure if the 4.8 strain will be long term or
> not.

4.8 will not be LTS. But 4.9 will be, see:

http://kroah.com/log/blog/2016/09/06/4-dot-9-equals-equals-next-lts-kernel/

Comment 607 Elmar Melcher 2016-12-08 14:26:20 UTC

Configuration from #583 stable during more than 2 weeks of daily use,
no kernel parameters, no C-state script.
Updated the spreadsheet.

Observed a 30% chance that tsc is rejected as clocksource during boot and refined-jiffies is used instead. In this case the wall clock is almost 10x slower, the keyboard repetition rate is extremely slow, and occasionally a hard lockup occurs in one processor core, but the system continues working.
For this reason I will use the kernel parameter tsc=reliable from now on.

Does it make sense to reject the tsc of a CPU that has the flags rdtscp, constant_tsc, and nonstop_tsc?
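
Whether a given boot fell back from tsc can be checked after the fact. A minimal sketch, assuming the usual sysfs clocksource directory and the flags line in /proc/cpuinfo; both function names are made up for this illustration:

```shell
# Report the active and available clocksources.
# Assumption: the standard sysfs directory clocksource0; function names are illustrative.
show_clocksource() {
    local dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    echo "current:   $(cat "$dir/current_clocksource")"
    echo "available: $(cat "$dir/available_clocksource")"
}

# Succeed if the first "flags" line of cpuinfo contains the given flag as a whole word.
has_cpu_flag() {
    local flag="$1"
    local file="${2:-/proc/cpuinfo}"
    grep -m1 '^flags' "$file" | grep -qw -- "$flag"
}
```

For example, `show_clocksource` followed by `has_cpu_flag nonstop_tsc && echo "nonstop_tsc present"` shows whether the boot ended up on refined-jiffies despite the CPU advertising a stable TSC.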

Comment 608 RussianNeuroMancer 2016-12-13 20:05:26 UTC

https://www.spinics.net/lists/linux-i2c/msg27520.html

> About this patch vs bug bko109051, yesterday I spent time reading
> that entire bug. It seems to be a combination of at least 3 bugs:
> 2 i915 related, with commits which seem to trigger
> the problem (2 different groups of users with a different problem
> it seems), which cause a hang every few hours. And one other
> bug where the system freezes in minutes; that one sounds like
> what I was seeing without this patch (but may well be yet
> another issue).
> 
> As for the 2 i915 bugs, there have been git bisects for both of
> them, it would be good if someone could take a look at these, just
> search for bisect in that huge bug.

Comment 609 Vincent Gerris 2016-12-14 17:54:45 UTC

Hi,

The patch here:
https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530
seems to have fixed the problem for me on my N3520 Bay Trail (Lenovo Yoga 2 11).

I changed the patch to be applied on 4.8.0-30, the default Ubuntu kernel on an updated Ubuntu 16.10.

Please test if that fixes the issue.
The patch (couldn't find attachment button):

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 67ec58f..2a77317 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1242,6 +1242,34 @@ static void bxt_idle_state_table_update(void)
 
 }
 /*
+ * byt_idle_state_table_update(void)
+ *
+ * On BYT, we have errata VLP52 and disable C6.
+ * https://bugzilla.kernel.org/show_bug.cgi?id=109051
+ * http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
+ * VLP52 EOI Transactions May Not be Sent if Software Enters Core C6 During an Interrupt Service Routine.
+
+Problem:
+If core C6 is entered after the start of an interrupt service routine but before a write
+to the APIC EOI (End of Interrupt) register, and the core is woken up by an event
+other than a fixed interrupt source the core may drop the EOI transaction the next
+time APIC EOI register is written and further interrupts from the same or lower
+priority level will be blocked.
+
+Implication:
+EOI transactions may be lost and interrupts may be blocked when core C6 is used
+during interrupt service routines.
+
+Workaround:
+It is possible for the firmware to contain a workaround for this erratum.
+ */
+static void byt_idle_state_table_update(void)
+{
+	printk(PREFIX "byt_idle_state_table_update reached\n");
+	byt_cstates[1].disabled = 1;	/* C6N-BYT */
+	byt_cstates[2].disabled = 1;	/* C6S-BYT */
+}
+/*
  * sklh_idle_state_table_update(void)
  *
  * On SKL-H (model 0x5e) disable C8 and C9 if:
@@ -1299,6 +1327,11 @@ static void intel_idle_state_table_update(void)
 	case INTEL_FAM6_ATOM_GOLDMONT:
 		bxt_idle_state_table_update();
 		break;
+	case INTEL_FAM6_ATOM_SILVERMONT1: /* BYT */
+                printk(PREFIX "intel_idle_state_table_update BYT 0x37 reached\n");
+                byt_idle_state_table_update();
+                break;
+
 	case INTEL_FAM6_SKYLAKE_DESKTOP:
 		sklh_idle_state_table_update();
 		break;
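
The patch above keys on the Bay Trail CPU model (0x37, i.e. decimal 55, in family 6). Before building a patched kernel, one can check whether a machine reports that model. A minimal sketch, assuming the usual "cpu family" and "model" lines in /proc/cpuinfo; the function name is made up for this illustration:

```shell
# Succeed if cpuinfo reports CPU family 6, model 55 (0x37), the model the patch matches.
# Assumption: /proc/cpuinfo uses "cpu family : N" and "model : N" lines.
# The function name and the optional file argument are illustrative only.
is_baytrail() {
    local file="${1:-/proc/cpuinfo}"
    awk -F': *' '
        /^cpu family/   { fam = $2 }   # e.g. "cpu family : 6"
        /^model[ \t]*:/ { mod = $2 }   # "model : 55", but not "model name : ..."
        END { exit !(fam == 6 && mod == 55) }
    ' "$file"
}
```

Usage: `is_baytrail && echo "model 0x37 detected"`.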

Comment 610 Vincent Gerris 2016-12-14 17:56:41 UTC

Created attachment 247621 [details]
Patch for Bay trail for 4.8

based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530

Comment 611 Pshem K 2016-12-14 19:29:47 UTC

(In reply to Pshem K from comment #597)
> I have an unnamed board based on a J1900 with a number of GigE ports. The
> device is used as a router and runs headless. I've experimented with a
> number of combinations and simply disabling C6S-BYT state (using a script)
> made the biggest improvement for me (forcing max_cstate=1 also works, but
> cpu runs hotter). Without  the C6S-BYT being disabled the uptime would be
> never longer than 24h, sometimes the device would reload some other times -
> locked up hard, never leaving a trace on the serial console on what exactly
> went wrong. This is using standard 4.4.0-47-generic Ubuntu Xenial kernel.
> Now I have an uptime of over 7 days with no issues.

Spoke too soon. It looks like I can only get stability with max_cstate=1. Disabling only C6 helped a lot, but occasionally the box would still lock up. The device acts as a router, and the lockups usually occur after long (a few hours), high-speed (300-600 Mb/s) transfers.
Currently running with the Ubuntu 4.4.0-53 kernel.

Comment 612 Vincent Gerris 2016-12-14 19:37:56 UTC

Please try the patches that were posted and report. Thank you

Comment 613 Vincent Gerris 2016-12-16 23:16:55 UTC

For Ubuntu users, here are some precompiled kernels in deb package format, containing the Bay Trail fixes:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

Please test and report back!

Comment 614 VoobScout 2016-12-17 10:22:34 UTC

(In reply to Vincent Gerris from comment #610)
> Created attachment 247621 [details]
> Patch for Bay trail for 4.8
> 
> based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530

Can you confirm that this is needed for Z3735F CPU?

I'm currently running Arch 4.8.11-1-zen with 2 patches from https://github.com/ferbar/rtl8723bs/tree/master/patches_4.7 and "clocksource=tsc" in cmdline.

The system appears stable, aside from a misrecognized battery and a lack of support for the external physical keys. It's a commodity tablet: "Axdia international GmbH wintab 9 plus 3G/Tablet, BIOS 5.6.5 03/10/2015".

Comment 615 ladiko 2016-12-18 13:05:29 UTC

#611 if it is only a router, just don't start X and it should be stable.

Comment 616 Pshem K 2016-12-18 21:39:52 UTC

(In reply to ladiko from comment #615)
> #611 if it is only a router, just don't start X and it should be stable.

There is no X on that box. Not running it is not sufficient to make the machine stable. Without max_cstate=1 the device eventually locks up.

Comment 617 Vincent Gerris 2016-12-18 22:09:27 UTC

(In reply to VoobScout from comment #614)
> (In reply to Vincent Gerris from comment #610)
> > Created attachment 247621 [details]
> > Patch for Bay trail for 4.8
> > 
> > based on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c530
> 
> Can you confirm that this is needed for Z3735F CPU?
> 
> I'm currently running Arch 4.8.11-1-zen with 2 patches from
> https://github.com/ferbar/rtl8723bs/tree/master/patches_4.7 and
> "clocksource=tsc" in cmdline.
> 
> System appears stable, aside from misrecognized battery and lack of external
> physical keys support, it's a commodity tablet "Axdia international GmbH
> wintab 9 plus 3G/Tablet, BIOS 5.6.5 03/10/2015".

Hi, it seems to be a Bay Trail processor, so I think you need it.
I don't know about that Arch kernel, though I saw they also patched a 4.8 kernel.
The patch from Jochen Hein works for me, as does my modification for 4.8 kernels.
To be sure, just compile your own kernel with the patch that matches your Linux version applied.

Comment 618 Prashant Poonia 2016-12-23 15:36:58 UTC

I am currently using the 4.8.0-32 kernel installed from Linux Mint 18's update manager. The system has been stable without intel_idle.max_cstate=1 so far. Do test this kernel out.

Comment 619 mhartzel 2016-12-24 12:26:33 UTC

Prashant Poonia: Please tell us what your processor is, otherwise your success story is quite useless to others :) Different processors have different bugs and also different workarounds. One solution does not fit all :)

Comment 620 Prashant Poonia 2016-12-25 11:49:26 UTC

(In reply to mhartzel from comment #619)
> Prashant Poonia: Please tell us what your processor is, otherwise your
> success story is quite useless to others :) Different processors have
> different bugs and also different workarounds. One solution does not fit all
> :)

Sorry :D
It's an N3540 Bay Trail.
The laptop is an Asus X553MA.
Linux Mint 18 with kernel 4.8.0-32.
The updated Yakkety Yak Wi-Fi driver for this kernel causes freezes when using Wi-Fi; the rest works flawlessly. Hope this helps someone, and I recommend you check it out.

Comment 621 Gi_44 2016-12-25 15:00:53 UTC

Hello everybody,
I'm new here.

I would like to share my story.

(Also in reply to the last posts by VoobScout and Vincent Gerris, comments #614 and #617 respectively.)


I recently installed different Linux flavors on this Z3735F mini machine:
https://www.aliexpress.com/store/product/2016-QOTOM-Micro-ITX-motherboard-Z3735F-with-2GB-RAM-32GB-SSD-WIFI-Bluetooth-support-Win-8/108231_32694240800.html (wiped of all the MS stuff when received).


The native Jessie multiarch (https://cdimage.debian.org/cdimage/unofficial/non-free/cd-including-firmware/8.6.0+nonfree/multi-arch/iso-cd/) works out of the box and is stable without any adjustments, but with no HDMI, Wi-Fi, sound or Bluetooth.


Trying different Debian kernels (https://github.com/hadess/rtl8723bs/wiki/RTL8723BS-module-building-instruction-for-Debian-GNU-Linux) gave instability and a lot of freezes.

I also tried the "Linuxium - LUBUNTU 16.04 OS", which works fine, is stable, and has working Wi-Fi out of the box (RTL8723bs), but still without sound, HDMI or Bluetooth.

($ inxi -F
System:    Host: gil-lbnt Kernel: 4.4.0-31-linuxium x86_64 (64 bit)
Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
Machine:   Mobo: AMI model: Aptio CRB
Bios: American Megatrends v: 5.6.5 date: 08/01/2015
CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
clock speeds: max: 1832 MHz 1: 705 MHz 2: 1426 MHz 3: 1140 MHz
4: 1374 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
Resolution: 1366x768@59.79hz
GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
Audio:     Card IntelHDMI driver: IntelHDMI Sound: ALSA v: k4.4.0-31-linuxium
Network:   Card: Failed to Detect Network Card!
Drives:    HDD Total Size: 15.5GB (Used Error!)
ID-1: /dev/mmcblk0 model: N/A size: 31.0GB
ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk0p2
ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk0p4
ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk0p3
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 45.0C mobo: N/A
Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 189 Uptime: 0 min Memory: 258.1/1939.2MB
Client: Shell (bash) inxi: 2.2.35 )

Recently, I upgraded to the generic kernel 4.8 (http://sourcedigit.com/21520-upgrade-linux-kernel-4-8-10-install-linux-kernel-4-8-10-ubuntu/)

And after rebooting, I installed this r8723bs module version:
https://github.com/ferbar/rtl8723bs

($ inxi -F
System:    Host: gil-lbnt Kernel: 4.8.10-040810-generic x86_64 (64 bit)
Desktop: LXDE (Openbox 3.6.1) Distro: Ubuntu 16.04 xenial
Machine:   Mobo: AMI model: Aptio CRB
Bios: American Megatrends v: 5.6.5 date: 08/01/2015
CPU:       Quad core Intel Atom Z3735F (-MCP-) cache: 1024 KB
clock speeds: max: 1832 MHz 1: 499 MHz 2: 499 MHz 3: 499 MHz
4: 499 MHz
Graphics:  Card: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Display
Display Server: X.Org 1.18.4 drivers: intel (unloaded: fbdev,vesa)
Resolution: 1366x768@59.79hz
GLX Renderer: Mesa DRI Intel Bay Trail GLX Version: 3.0 Mesa 11.2.0
Audio:     Card-1 bytcr-rt5640 driver: bytcr-rt5640
Card-2 USB Audio DAC driver: USB-Audio
Card-3 Texas Instruments Audio Codec driver: USB Audio
Sound: Advanced Linux Sound Architecture v: k4.8.10-040810-generic
Network:   Card: Failed to Detect Network Card!
Drives:    HDD Total Size: 15.5GB (Used Error!)
ID-1: /dev/mmcblk1 model: N/A size: 31.0GB
ID-2: USB /dev/sda model: USB_DISK_3.0 size: 15.5GB
Partition: ID-1: / size: 10G used: 5.3G (57%) fs: ext4 dev: /dev/mmcblk1p2
ID-2: /home size: 17G used: 3.4G (21%) fs: ext4 dev: /dev/mmcblk1p4
ID-3: swap-1 size: 1.00GB used: 0.00GB (0%) fs: swap dev: /dev/mmcblk1p3
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   System Temperatures: cpu: 43.0C mobo: N/A
Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 190 Uptime: 7 min Memory: 287.5/1938.7MB
Client: Shell (bash) inxi: 2.2.35
)

Now the 4.8 kernel seems quite stable (one week, running 24/7), with Wi-Fi working well (still no sound, HDMI or Bluetooth).

Comment 622 Prashant Poonia 2016-12-25 18:12:00 UTC

(In reply to Gi_44 from comment #621)

Can you post the link you downloaded the Wi-Fi driver from?
I am also running a 4.8 kernel with no issues except lockups when downloading significant amounts of data over Wi-Fi.

Comment 623 Gi_44 2016-12-25 19:52:37 UTC

Hi Prashant Poonia
The address is in the post
Here -> : 
" And after rebooting, I installed this r8723bs module version:
  https://github.com/ferbar/rtl8723bs"

Comment 624 Prashant Poonia 2016-12-25 22:36:53 UTC

(In reply to Gi_44 from comment #623)
> Hi Prashant Poonia
> The address is in the post
> Here -> : 
> " And after rebooting, I installed this r8723bs module version:
>   https://github.com/ferbar/rtl8723bs"

Oh! It's a Realtek driver, my bad luck.

Can anyone else test and confirm 4.8?

Comment 625 Vincent Gerris 2016-12-25 22:44:45 UTC

Created attachment 248541 [details]
attachment-26085-0.html

Hi,

I already tested it and reported that it freezes on my N3520. It may take
longer, but it will.
Please test it very thoroughly and for a long time.
The processor has an erratum, so it does not make sense that it would be
fixed unless the kernel is patched or you had some firmware update somehow.

I really hope you can do some more thorough testing, and please try not to
get excited too soon. It would help if you try the patch Jochen Hein
posted, or the 4.8.x mod I posted, or any of the precompiled kernels, and
test power management.

Even if you do not get a freeze on your setup, it may be a good extra
reason to have the patch applied in mainline with some priority, since
this still affects many users.

Thank you

Comment 626 Jochen Hein 2016-12-25 23:42:21 UTC

I'm running with this in /etc/modprobe.d/rtl8723be.conf:

options rtl8723be fwlps=0 swlps=0

Otherwise Wifi is unstable for me
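
Options placed in /etc/modprobe.d only take effect when the module is next loaded, so it can be worth checking what the loaded module actually uses. A minimal sketch, assuming the module exposes its parameters under /sys/module/<module>/parameters (only readable parameters appear there); the function name is made up for this illustration:

```shell
# Print "name=value" for every readable parameter of a loaded kernel module.
# Assumption: parameters are exported under /sys/module/<module>/parameters.
# The function name and the optional root argument are illustrative only.
show_module_params() {
    local mod="$1"
    local root="${2:-/sys/module}"
    local f
    for f in "$root/$mod"/parameters/*; do
        [ -r "$f" ] || continue        # module not loaded or no parameters exported
        echo "$(basename "$f")=$(cat "$f")"
    done
}
```

For example, `show_module_params rtl8723be | grep -E '^(fwlps|swlps)='` lists the two power-save settings named above, if the driver exports them.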

Comment 627 pilot_6572 2016-12-27 21:55:17 UTC

It is strange.

I have the same Qotom Z3735F board, but only Jessie 3.16 is stable, and only without any Bay Trail or Cherry Trail drivers.

With the 4.7 and 4.8 kernels, freezes happen very quickly, and with the r8723bs drivers (ferbar or hadess) or with the Linuxium OSs, the systems overload and crash badly...

Comment 628 Len Brown 2016-12-27 21:57:22 UTC

Created attachment 248751 [details]
Debug patch to enable BYT C6 auto-demotion

Please test this patch and report if it has any
effect on the stability issue.

You can verify that it is applied and running via dmesg:

dmesg | grep idle
intel_idle: BYT C6 auto-demotion-disable: 0

Under some conditions, it will reduce the amount
of C6 residency, which you can observe with turbostat:

 # turbostat --debug -o ts.out sleep 10

Comment 629 Len Brown 2016-12-28 11:43:07 UTC

Created attachment 248841 [details]
nanosleep.c

As mentioned above, idle-related failures become more rare
when heavy load is added to the system.  So a "stress test"
for idle entry/exit does not add computation.  Instead,
it does almost no work except waking and going back to sleep.

Attached is a little program, nanosleep.c, that can be used
as an idle "stress test".  It has a random element,
so running it for longer duration will provoke
a wider variety of timing.  Also, my intent was that one
copy be run for every  logical CPU in the system,
but you may find it useful running it other ways
that I have not thought of.

nanosleep takes a single parameter, its target for highest wakes per second.
By default, it uses a max of 500 wakes per second, which
would be wakes at a rate up to (1 sec/500) = 2 ms.
Or if you run 4 copies, that becomes 2000/sec, or 500us.

For reference, cpuidle's target residency for C6N-BYT is 275 usec.
So even at that rate of wakeup, the system may still be able
to enter C6.

For those with Baytrail systems that fail without intel_idle.max_cstate=1,
it would be interesting if you can experiment with nanosleep,
alone or in combination with glxgears or video playback or whatever
to see if you can provoke the failure sooner.
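
Starting one copy per logical CPU, as suggested above, can be scripted. A minimal sketch, assuming nproc from coreutils is available; the function name is made up, and ./nanosleep in the usage note stands for a locally built binary from the attachment. The copies are not pinned to cores here; prefixing the command with taskset is one option if pinning is wanted.

```shell
# Start one copy of a command per available logical CPU and wait for all of them.
# Assumption: nproc (coreutils) reports the number of usable CPUs.
# The function name is illustrative only; copies are left to the scheduler, not pinned.
run_per_cpu() {
    local ncpu cpu
    ncpu=$(nproc)
    cpu=0
    while [ "$cpu" -lt "$ncpu" ]; do
        "$@" &                 # launch one background copy
        cpu=$((cpu + 1))
    done
    wait                       # block until every copy has exited
}
```

For example, `run_per_cpu ./nanosleep 500` would start as many copies as there are CPUs, each with the default target of 500 wakes per second.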

For observing what C-states are actually in use, please use turbostat;
which is available in the upstream kernel tree under
tools/power/x86/turbostat/ (yes, you should be able to use the latest
version of turbostat with an older kernel, as long as the kernel
supports the cpu msr driver)

Note that turbostat exposes the underlying C-state residency
hardware counters.  While the software counters in sysfs reflect
what the kernel requested, the hardware residency counters reflect
what states were actually achieved.
For this reason, it is preferable to use turbostat instead of
Wolfgang's script in comment #435.  eg.

 # turbostat --debug -o ts.out sleep 10

forks the "sleep 10" command -- you can use any command --
and outputs the stats to the file ts.out.  If you omit
the command for turbostat to fork, it will run in interval mode
until you kill it.

Comment 630 Prashant Poonia 2016-12-28 16:21:45 UTC

(In reply to pilot_6572 from comment #627)
> It it strange
> 
> I have the same Qotom z3735f board but only jessie 3.16 is well stable with
> and only without any bay and cherry drivers.
> 
> With the 4.7 and 4.8 kernels, freezes append very quickly and with r8723bs
> drivers (ferbar or hadess) or with the linuxium OSs, the systems overload
> and crash insanely...

Do you have the FocalTech touchpad drivers for the 3.16 kernel, or a 3.16 kernel with the drivers built in?

Comment 631 jbMacAZ 2016-12-28 19:10:48 UTC

(In reply to Len Brown from comment #628)
> Created attachment 248751 [details]
> Debug patch to enable BYT C6 auto-demotion
> 
> Please test this patch and report if if it has any
> effect on the stability issue.
> 
> You can verify that it is applied and running via dmesg:
> 
> dmesg | grep idle
> intel_idle: BYT C6 auto-demotion-disable: 0
> 

I stripped down my 4.8.15 setup on an Asus T100CHI (Z3775): no ..cstate argument, no c6offc7on script, only tsc=reliable, and let it idle on the Mint Cinnamon 18.1 desktop (with Wi-Fi and Bluetooth enabled). It took less than half an hour to freeze. Then I added the auto-demotion-disable patch to the kernel. The CHI has been running for over 9 hours. I'll leave it running (same conditions) for a few days to see if it freezes.

Comment 632 Dmitry 2016-12-30 11:35:02 UTC

Turbostat is not working in debug for me:
turbostat: msr 0 offset 0x1aa read failed: Input/output error

I haven't seen freezes since September-October. Nanosleep didn't show anything new; 4.8.15 is stable with the Z3770. I ran four nanosleep tasks with taskset on different cores, plus glxgears, YouTube in Firefox over Wi-Fi (ath6kl), and mpd. All of this was on battery with the powersave governor.
Cmdline:root=UUID=... rootfstype=f2fs ro tsc=reliable clocksource=tsc pcie_aspm=force nmi_watchdog=0 rd.skipfsck fsck.mode=skip quiet splash

cpupower monitor -m Idle_Stats -i 10 -c sleep 300
sleep took 300,00265 seconds and exited with status 0
    |Idle_Stats                               
CPU | POLL | C1-B | C6N- | C6S- | C7-B | C7S- 
   0|  0,00|  0,58|  1,84| 42,82| 26,51|  2,23
   1|  0,00|  1,09|  2,38| 49,07| 19,96|  0,54
   2|  0,00|  0,90|  2,15| 39,27| 28,15|  5,10
   3|  0,06|  0,46|  1,59| 37,10| 29,12|  6,59

Tested for 2 hours, then got bored.
P.S. Gentoo, vanilla stable kernel with bfq, ath6kl and various small patches (shutting up the sst debug output and changing the touch button scancode).

Comment 633 Pilot_6572n 2016-12-30 13:23:37 UTC

During the test with the nanosleep program on Jessie with the 4.8 kernel loaded on the Qotom Z3735F motherboard (intel_idle.max_cstate=1 added), I ran "stellarium" and "cairo-dock", two programs that are heavy on CPU time and video resources, and a freeze occurred immediately. Progressively, no program, kernel or system could be loaded anymore. The BIOS has been altered.

I reinstalled Jessie with only the original 3.16 kernel, after first wiping all GPT partitions (sgdisk --zap-all). The BIOS is altered, but grub.efi is loadable through the Shell (fs0:). Running the same stellarium and cairo-dock programs again, with no GRUB modification and without nanosleep.c running, gave the same crashes, which now progressively affect other video players and browsers.

I am now looking into the AMI afulnx_64 tool to flash the BIOS before reinstalling an OS.

Comment 634 jbMacAZ 2016-12-30 19:31:29 UTC

(In reply to jbMacAZ from comment #631)
> 
> I stripped down my 4.8.15 setup on Asus T100CHI (Z3775).  No ..cstate arg no
> c6offc7on script only tsc=reliable and let it idle on Mint Cinnamon 18.1
> desktop with wifi, and bt enabled).  It took less than 1/2 hour to freeze. 
> Then I added auto-demotion-disable patch to the kernel.  The CHI has .. 

.. run fine for 48 hours - the last 24 hours with 4 copies of nanosleep running.  Next tests ran nanosleep on the unpatched 4.8.15.  I had one freeze before I could get a second copy of nanosleep running.  A second test froze in 38 minutes with 4 copies of nanosleep.  Not sure nanosleep matters, but thanks for the patch.

Comment 635 Len Brown 2017-01-01 18:24:12 UTC

Created attachment 249491 [details]
turbostat-src.tar.gz

Attached is a copy of the latest development version of turbostat.
It has two additions from what is released in the upstream kernel tree:

1. New --show --hide parameters (90% implemented)
2. disable --debug access to MSR_MISC_PWR_MGMT (MSR 0x1aa) on BYT

This tar file includes a binary you can run directly, or you can
first "make clean; make" to build it from scratch.

$ tar xzvf turbostat-src.tar.gz
$ cd turbostat-src

$ sudo ./turbostat --debug -o ts.out sleep 10

Both Wolfgang's script and cpupower are limited because they format
the the software counters in /sys/devices/system/cpu/cpu*/cpuidle/state*/*
The software counters show what the kernel requested.

turbostat shows instead the underlying hardware
residency counters.  The difference is important when the hardware
has the ability
to "demote" a software request into a more "shallow" state; and is
particularly applicable when we are experimenting with a patch that
enables/disables the ability of the hardware to do so.
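
For comparison with the hardware counters, the software counters mentioned above can be dumped straight from sysfs. The following is a minimal sketch, not part of the original comment: the function name `show_cpuidle_counters` is made up for illustration, and it assumes the standard cpuidle sysfs files `name`, `usage` (number of entries) and `time` (total microseconds).

```shell
#!/bin/sh
# Print the kernel's software idle-state counters for every CPU.
# $1: sysfs CPU root (optional; defaults to the real one).
# NOTE: these show what the kernel requested, not what the hardware did.
show_cpuidle_counters() {
    root=${1:-/sys/devices/system/cpu}
    for state in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -d "$state" ] || continue            # the glob matched nothing
        cpu=${state#"$root"/}                  # strip the root prefix
        cpu=${cpu%%/*}                         # keep only "cpuN"
        printf '%s %s usage=%s time_us=%s\n' "$cpu" \
            "$(cat "$state/name")" "$(cat "$state/usage")" "$(cat "$state/time")"
    done
}
```

Calling `show_cpuidle_counters` with no argument reads the live system; passing a directory makes it easy to run against a saved copy of the tree.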

Comment 636 Dmitry 2017-01-01 20:54:25 UTC

Created attachment 249561 [details]
turbostat --debug -o ts.out sleep 10

(In reply to Len Brown from comment #635)
> 
> Attached is a copy of the latest development version of turbostat.
Thank you! Tested. Some new output but also there is an error:
turbostat: msr 0 offset 0x3fe read failed: Input/output error

Comment 637 Len Brown 2017-01-02 03:50:00 UTC

Created attachment 249571 [details]
turbostat-src.tar.gz

thanks for testing the latest turbostat, this update
should fix the issue seen in the last one.

Comment 638 Dmitry 2017-01-02 09:52:04 UTC

Created attachment 249601 [details]
turbostat --debug -o ts.out sleep 10

So my cpu spends 50% of time in C6 state. This is with 4 instances of nanosleep, glxgears and video playback with mpv. Without any activities cpu spends 94% of time in C6.
I forgot to mention that I use 32bit gentoo with UEFI stub capable kernel.

Comment 639 Oemer 2017-01-02 10:22:44 UTC

(In reply to Sudhanshu from comment #537)
> I have been suffering from the same issue, but on a Broadwell system (Dell
...
> Summarising,
> Are there any broadwell co-sufferers here?
> Am I safe to assume this is the same bug as mine?

I am also on a Broadwell system and I suffer from the same occasional freezes. I haven't yet tried changing the max cstate setting though.

Comment 640 Vincent Gerris 2017-01-02 19:06:14 UTC

Hi,

I patched a 4.8.11 kernel with the auto demotion patch:
dmesg shows:
[    1.244957] intel_idle: BYT C6 auto-demotion-disable: 0
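
To check from a script whether a patched kernel is actually running, the boot log line above can be parsed. This is a rough sketch that assumes the message format shown in this comment; the message comes from the test patch, not from a mainline kernel, and the helper name `byt_autodemotion_flag` is hypothetical.

```shell
#!/bin/sh
# Read dmesg text on stdin and print the auto-demotion-disable value that
# the test patch logs at boot; print "absent" when no such line is found.
byt_autodemotion_flag() {
    awk -F': ' '
        /intel_idle: BYT C6 auto-demotion-disable/ { print $NF; found = 1; exit }
        END { if (!found) print "absent" }'
}
```

Typical use would be `dmesg | byt_autodemotion_flag`; "absent" suggests the booted kernel does not carry the patch, or that the boot messages have already rotated out of the log buffer.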

In my usual test setup, it freezes after about 15 minutes.
Since I still see quite a variation in time before that happens, I can't really tell if it made much difference (N3520).

The patch from Jochen Hein still works fine and does not freeze at all after the usual test.

Do we have an issue with the C6 state that is not in the errata?

For others on Ubuntu, you can find the auto demotion enabled deb kernels here:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

@Len Brown: is there anything we can do to pin down the cause as much as possible?
I would really like to see a kernel patch fixing this and I am looking for the best way forward to help achieve that.

Comment 641 Len Brown 2017-01-03 03:05:07 UTC

@ Sudhanshu
@ oemer+kernel@o9z.de

No, this bug report is specific to Baytrail.

Here is the complete list of Baytrail processors:
http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All

If you have a problem with Broadwell, then file a new bug report -- because this one will be closed when "intel_idle.max_cstate=1" is no longer required on baytrail.

Comment 642 Len Brown 2017-01-03 03:24:09 UTC

@ Vincent Gerris

It seems likely there are multiple baytrail c6 issues, and only if we are very lucky will they turn out to have a common root cause.

When this bug was opened, I assumed that this had nothing to do with cpuidle and that Adrian's i2c patches would handle this. That didn't happen. I think there are multiple failures here, and i2c and i915 changes clearly affect some failures, and so they are both high on the list of suspects.

Also worth checking out is the cpuidle auto-demotion-disable=0 patch that you just tested. The problem with that patch is if it works, we don't know if it is because we are taking a better route through the pcode, or if it is just hiding an i2c or i915 bug because the system is in c6 less...

So the interesting comparison with the auto-demotion-disable=0 patch is:
1. Does it change stability?
For jbMacAZ it seems it may help, but for you it seems it may make no difference. There are a lot of submitters here, and I'd like to see more testing.

2. Does it make a measurable difference in C6 residency under the same workload (ie. turbostat output with vs without the patch should show this).
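
For the second comparison, the package-wide C6 figure can be pulled out of a saved turbostat log so that two runs are easy to put side by side. A minimal sketch, assuming the log contains a header line with a `CPU%c6` column followed by a summary row whose first field is `-`, as in the turbostat outputs posted in this thread; `c6_residency` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the summary-row CPU%c6 value from a turbostat log file ($1).
# The column is located by its header name, so column order may vary.
# Only the first measurement interval in the file is reported.
c6_residency() {
    awk '
        /CPU%c6/ { for (i = 1; i <= NF; i++) if ($i == "CPU%c6") col = i; next }
        col && $1 == "-" { print $col; exit }
    ' "$1"
}
```

With logs from an unpatched and a patched kernel under the same workload, `c6_residency unpatched.out` and `c6_residency patched.out` (file names are placeholders) give the two numbers to compare.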

Vincent,

Since the cpuidle patch doesn't make any difference on your system, I would say that testing the i2c patches (or maybe even blacklisting dw i2c, if doing so doesn't hose your system) and perturbing how i915 works to see if any changes affect your failure are the best areas to look.

Also, any efforts to discover how to best cause the failure to happen as soon as possible would be extremely valuable. Eg. experimenting to see if you can provoke the failure sooner by running nanosleep in a certain way with certain parameters might turn out to be extremely valuable. If we can reliably reproduce a failure in under 60 seconds, then we know when it is gone. If it takes a week or so to reproduce a failure, then we'll never know when we are done.

Comment 643 Jochen Hein 2017-01-03 06:39:39 UTC

I'm running right now:
Linux detrius 4.9.0-040900-generic #201612111631 SMP Sun Dec 11 21:33:00 UTC 
which is the ubuntu mainline ppa. Until yesterday that kernel seemed stable,
but yesterday I had a hang as well.

turbostat output:
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001600
6 * 83 = 500 MHz max efficiency frequency
22 * 83 = 1833 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x0016000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883c0100 (45 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x883c0100 (45 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
10.002627 sec
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	CoreTmp	GFX%rc6	GFXMHz	PkgWatt	CorWatt
	-	-	546	28.89	1889	1834	0	0	2.41	68.71	45	0.00	0	0.77	0.58
	0	0	74	9.23	804	1835	0	0	2.76	88.00	43	0.00	0	0.77	0.58
	1	1	388	23.96	1618	1835	0	0	1.58	74.47	44
	2	2	1653	77.27	2137	1835	0	0	0.26	22.48	45
	3	3	69	5.07	1368	1833	0	0	5.03	89.90	45
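
The two frequency lines in this kind of turbostat output follow from MSR_PLATFORM_INFO and the 83.3 MHz bus clock. As a worked example, here is a small sketch that repeats the arithmetic; it assumes the usual layout of that register (maximum non-turbo ratio in bits 15:8, maximum efficiency ratio in bits 47:40), and `decode_platform_info` is a name chosen for illustration.

```shell
#!/bin/sh
# Decode the two ratios held in MSR_PLATFORM_INFO ($1, hex with 0x prefix)
# and convert them to MHz using the 83.3 MHz SLM BCLK reported by turbostat.
decode_platform_info() {
    msr=$1
    base_ratio=$(( ($msr >> 8) & 0xff ))    # bits 15:8, max non-turbo ratio
    eff_ratio=$(( ($msr >> 40) & 0xff ))    # bits 47:40, max efficiency ratio
    awk -v b="$base_ratio" -v e="$eff_ratio" 'BEGIN {
        printf "max efficiency: %d * 83.3 = %.0f MHz\n", e, e * 83.3
        printf "base: %d * 83.3 = %.0f MHz\n", b, b * 83.3
    }'
}
```

For the value 0x60000001600 shown above this gives ratios 6 and 22, i.e. 500 MHz and 1833 MHz, matching the turbostat lines.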

Comment 644 BzukTuk 2017-01-03 12:49:59 UTC

Acer Switch 10 with Intel Atom Z3735F - with Len Brown's one liner patch on vanilla 4.8.15 with ubuntu-ppa config:

up 4 days, 11:24. No freeze. Workload was glxgears and vlc.

Without this patch, the same kernel with the same workload froze in 12 minutes. Could someone with a different baytrail CPU please test this patch, so we can move forward.

Another workaround that makes at least my Z3735F rock stable was described in comment 378 (but only with kernel 4.7 and up, so it could be only luck)

Comment 645 Gilbert Dion 2017-01-04 01:00:09 UTC

Since I upgraded to Ubuntu 16.04 last October, my Acer Aspire V11 Touch with Intel Celeron quad core N2940 + Intel Bay Trail has the same problem.

The boot parameter intel_idle.max_cstate=1 does prevent the crashes. How will I know when a fix to the kernel is done?

Comment 646 A Uday K 2017-01-04 10:19:53 UTC

Setting the max C state as 1 does not fix the problem. That is just a temporary measure to make sure that the freezes don't occur when the stakes are high. For example, if you are working on some important project and there is an unexpected freeze, then all of the unsaved data that you are working on will be lost. Additionally, if you are working on battery, then setting the max C-state to 1 will invariably force your processor to consume a lot more power. In other words, your laptop's battery life will improve drastically once this bug is fixed.
And you will know that it is fixed when the "status", at the top of this page, is marked as "VERIFIED" or "RESOLVED". It is currently marked as "NEEDINFO".

If you would like, then you can contribute. Your contribution can speed things up. 

You do not need to be an expert programmer to contribute. You just need to know how to apply the patches and update your kernel. Additionally, you may also need to know how to run a few commands on the terminal and post the output here. These are actually very simple steps. If you do not know how to do them, then you can just go ahead and Google it. These are relatively simple topics, and there are not that many complex steps involved.

NOTE : MAINTAIN CAUTION WHILE TESTING :)

Comment 647 Martin 2017-01-04 17:45:39 UTC

Patched kernel 4.5.4 using Len's auto demotion patch and had a freeze after a day. CPU: J1900.

root@pandora:/usr/src/turbostat-src# ./turbostat -d
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:3 (6:55:3)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu2: MSR_PLATFORM_INFO: 0x100000001800
16 * 83 = 1333 MHz max efficiency frequency
24 * 83 = 1999 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu2: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu2: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x0018000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88360000 (51 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88360000 (51 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88320000 (55 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88320000 (55 C +/- 1)                                                                                                                         
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  CoreTmp GFX%rc6 PkgWatt CorWatt                                                      
        -       -       425     22.23   1911    2000    9310    0       19.02   58.75   58      **.**   1.97    0.65                                                         
        0       0       316     19.56   1613    2000    5564    0       36.85   43.59   53      **.**   1.97    0.65                                                         
        1       1       288     17.62   1633    2000    2113    0       20.16   62.21   53                                                                                   
        2       2       381     18.92   2015    2000    1010    0       12.29   68.79   58                                                                                   
        3       3       715     32.80   2179    2000    623     0       6.80    60.41   58                                                                                   
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  CoreTmp GFX%rc6 PkgWatt CorWatt                                                      
        -       -       451     24.48   1840    2000    11641   0       40.57   34.95   56      34.17   3.75    0.70                                                         
        0       0       297     18.22   1628    2000    7549    0       81.78   0.00    53      34.17   3.75    0.70                                                         
        1       1       807     38.64   2088    2000    2083    0       38.13   23.23   53
        2       2       356     22.37   1589    2000    998     0       23.38   54.25   56
        3       3       343     18.71   1834    2000    1011    0       18.97   62.32   56

Comment 648 Vincent Gerris 2017-01-04 20:21:47 UTC

Thank you Len, for picking this up and streamlining the hunt for the cause :)!
thanks everyone for the follow ups :).

I am going to try and be as scientific as I can on the matter.
I made 6 attempts to freeze the system: 3 on unpatched 4.8.0 and 3 on 4.8.11 patched with auto demotion enabled.
The reason for the 4.8.0/4.8.11 difference is that the automated scripts I use pull a tarball, and I did not find out how to change which version they fetch. Until anyone finds a way to a quick freeze, this is what I use to freeze up my Lenovo Yoga 2 11 with N3520 processor (assuming that the minor version difference has no big influence, though it might):
 - pick an mkv file from a samba share and copy it to local video folder
 - setup bluetooth audio, with high fidelity playback profile to an external speaker (jambox mini)
 - play the same file that is copying from the network with the ubuntu default video player (totem) 

This way I can usually get 4.8.0 to freeze within 1 to 30 minutes.
One issue is that bluetooth is utter crap: stuttering, connection loss, and a wrong profile or not being able to set it all influence the test.
The bluetooth also seems to block video playback; maybe it is buffering related.

Any way, with the above, I was unable to freeze the 4.8.11 with demotion this time.
One time, the unpatched 4.8.0 played for about 30 minutes; another time it froze instantly.

Further info:
 - not using the laptop plugged in may freeze the 4.8.0 faster, but not sure
 - I tried nanosleep and ran up to 5 times the program, but it didn't seem to make a difference on the freezing speed. It was running while playing video too.

A sample output of turbostat -d on the 4.8.11 with auto demotion, 5 copies of nanosleep running, playing video with audio over bluetooth:

ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads/turbostat-src$ sudo ./turbostat -d
turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:3 (6:55:3)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001a00
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x001a000e (UNlocked: pkg-cstate-limit=14: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x000280fb (UNlocked)
cpu0: PKG Limit #1: ENabled (7.843750 Watts, 0.001953 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x883b0100 (46 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x883b0100 (46 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x883a0100 (47 C +/- 1)
turbostat: msr 0 offset 0x3fe read failed: Input/output error

Based on the above I dare to conclude that:
 - the auto demotion enablement makes it more stable than without; it seems like a good idea to get it into mainline, and a few other reports have said the same
 - it seems there is still an issue that needs further investigation.

Challenges seem:
 - feedback loop : let's all keep looking for a fast freeze on either kernel
 - detailed reports (hard with above challenge) - let's try to be version specific and share test methods or scripts (I for example can dedicate this hardware to it, but I have limited time)

I am very much motivated to find the cause and I hope everyone has read Len's previous requests about using nanosleep, turbostat and that this is ONLY about Bay trail. 
Let us please try to keep it confined to that and nail this bug :). thanks everyone for the help and collaboration!

Comment 649 Len Brown 2017-01-04 21:43:25 UTC

@Vincent Gerris

Thanks for the test report. It is extremely helpful when reports
such as yours include the processor and system model #.

> turbostat: msr 0 offset 0x3fe read failed: Input/output error

Hopefully this means you are running the turbostat in comment #635
and that error goes away when you run the update in comment #637

> I tried nanosleep and ran up to 5 times the program

Note that adding more load may result in less idle c6,
and thus make the failure more rare.  That is to say,
3 copies may be more effective than 5...   turbostat
(the working version:-) will show the % of c6 residency,
and if that goes down, the system may be too busy
to be exercising c6 enough to provoke the failure.
Aside from number of copies of nanosleep, its default
parameter is 500 wakes per second. 
I don't know if making that higher or lower will cause the failure
sooner, and if somebody has a system that fails quickly,
that would be a great thing to discover and share.
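
Experimenting with the number of copies is easier with a small launcher. The sketch below is only an illustration: `run_copies` is a hypothetical helper, and it assumes the nanosleep test program takes the wake rate as its only argument, as the "nanosleep 1000" invocations in this thread suggest.

```shell
#!/bin/sh
# Start N background copies of a command and print their PIDs, one per line.
# $1: number of copies; remaining arguments: the command to run.
run_copies() {
    n=$1
    shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &     # detach output so callers can capture PIDs
        echo "$!"
        i=$((i + 1))
    done
}
```

For example, `pids=$(run_copies 3 ./nanosleep 500)` starts three copies, and `kill $pids` stops them again. Checking turbostat between runs shows whether the extra load has pushed the c6 residency down.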

Comment 650 Josep Pujadas-Jubany 2017-01-04 22:34:21 UTC

I don't know if this is related, but at work we have

Acer TM (TravelMate) B117M N3150 processor (Braswell Processor but kernel code sees as Cherry Trail Processor)

Lubuntu 14.04 LTS + LTSEnablementStack
(https://wiki.ubuntu.com/Kernel/LTSEnablementStack)

About 110 units of this model. Some of them freeze during use and after suspend.

On 2016-11-24 we migrated some machines to latest stable kernel (4.8.10). Computers are more stable.
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1575467/comments/142)

but... they continue to freeze after suspend.

Modifying /etc/default/grub from

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor"

to

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_backlight=vendor acpi_osi='!Windows 2013' acpi_osi='!Windows 2012'"

prevents freezing after suspend.
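
The same kind of edit to /etc/default/grub can be scripted, for instance to roll a parameter such as intel_idle.max_cstate=1 out to many machines. This is a minimal sketch under a few assumptions: the file contains a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, the parameter contains no `|`, `&` or backslash characters, and `add_kernel_param` is a helper name invented here.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already there. $1: grub defaults file, $2: parameter to add.
add_kernel_param() {
    file=$1
    param=$2
    # Already present on the cmdline line? Then leave the file untouched.
    grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param" && return 0
    # Insert the parameter just before the closing double quote.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 $param\"|" "$file" > "$file.new" &&
        cat "$file.new" > "$file" &&
        rm -f "$file.new"
}
```

It would be run as root, e.g. `add_kernel_param /etc/default/grub intel_idle.max_cstate=1`, followed by `sudo update-grub` and a reboot. Keeping a backup copy of the file first is prudent.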

I mention this because I suppose that suspending the computer could be an excellent situation for entering C6 & C7 states.

Just closing & opening the lid with any application open (Firefox or VLC) caused the freeze for us.

Comment 651 A Uday K 2017-01-06 07:31:25 UTC

(In reply to Len Brown from comment #649)
> @Vincent Gerris
> 
> > turbostat: msr 0 offset 0x3fe read failed: Input/output error
> 
> Hopefully this means you are running the turbostat in comment #635
> and that error goes away when you run the update in comment #637

Hello Mr. Brown,

The turbostat binary included in the tar file won't work right out of the box. I had to run 
"make clean; make" 
in order to avoid this error....
---
turbostat: msr 0 offset 0x3fe read failed: Input/output error
---

Here's the output of turbostat, right after running "make clean; make"....
---
http://pastebin.com/raw/GTXDbZRz
---

Here's the output of 'dmesg | grep idle', right after installing the auto-demotion-enabled kernel....
---
http://pastebin.com/raw/JacrPBG9
---

Here's the current info about my system....
---
http://pastebin.com/raw/RtMeVBSG
---

And one more thing about the turbostat output:
after my PC freezes, should I power it down, switch it back on AND THEN post the output of turbostat?
Or should I post the output of turbostat right now?

Comment 652 Len Brown 2017-01-06 19:58:38 UTC

@ Josep Pujadas-Jubany

Please file a new bug for Cherry-Trail/Braswell suspend issues.

This bug report is specific to freezes on the previous generation, Valleyview/Baytrail,
that go away with the cmdline option intel_idle.max_cstate=1.

Bay Trail processor list:

http://ark.intel.com/products/codename/55844/Bay-Trail?q=bay%20trail#@All

Comment 653 Len Brown 2017-01-06 20:12:30 UTC

@ A Uday K 

Yes, your N3530 is a Baytrail, yes, the auto-demotion patch is installed.

Several things we are trying to discover:

1. does the auto-demotion patch in comment #628 help?
   running the same workload, does using this patch change time to hang?
   It seems that it helps dramatically on some, and not at all on others.

2. do you see a different amount of %c6 when running the autodemotion patch
   vs. not running that patch?  (this is what turbostat can tell us)

3. can you help discover how to make the problem occur sooner?
   nanosleep in comment #629 is a tool that is available to help.
   My guess is that 4 copies should run on a 4-processor system,
   and that they should use the default parameter of 500 wakes/sec.
   But if you can make the problem happen by changing from 500,
   or changing the number of copies, that is a valuable discovery.
   Here again, turbostat is available to help track if you are
   making the system too busy to get into c6.

Comment 654 jbMacAZ 2017-01-07 07:18:55 UTC

Created attachment 250691 [details]
Turbostat for Asus T100CHI

auto-demotion is also helpful with 4.10-rc2 on Manjaro Cinnamon on ASUS T100CHI (Z3775) ~80 hours w/o freeze.  Will resume testing with 4.8.16 (Mint)

Comment 655 Josep Pujadas-Jubany 2017-01-07 10:05:55 UTC

(In reply to Len Brown from comment #652)
> @ Josep Pujadas-Jubany
> 
> Please file a new bug for Cherry-Trail/Braswell suspend issues.
> 

My comment came from

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1566302/comments/150

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1566302

And the status for this Linux(Ubuntu) bug is [Won't fix].

I left Windows when Vista appeared. I have been very happy (at home and at work) using Linux (open source, no viruses, speed, stability, ...)

But in the last months (applying the latest kernels for the latest hardware) it seems we lost stability.

It's a pity. I would like to help more with this but I'm just an intermediate/advanced user. I'm not able to do more. I'm sorry!

Comment 656 A Uday K 2017-01-07 19:13:50 UTC

(In reply to Len Brown from comment #653)

>    a different amount of %c6 when running the autodemotion patch
>    vs. not running that patch?  (this is what turbostat can tell us)

Thanks for the clarification :)

>    can you help discover how to make the problem occur sooner?
>    nanosleep in comment #629 is a tool that is available to help.
>    My guess is that 4 copies should run on a 4-processor system,
>    and that they should use the default parameter of 500 wakes/sec.
>    But if you can make the problem happen by changing from 500,
>    or changing the number of copies, that is a valuable discovery.
>    Here again, turbostat is available to help track if you are
>    making the system too busy to get into c6.

I'll continue to experiment and will keep you updated.

Comment 657 jbMacAZ 2017-01-08 00:09:35 UTC

turbostat error: "msr 0 offset 0x3fe read failed: Input/output error"
on Asus T100CHI (Z3775 - SILVERMONT1.)  FWIW, turbostat runs w/o errors on my skylake desktop.  FYI, 4.8.16 with auto-demotion-disable seems stable for me so far.  

Is there any value to testing an older kernel such as 4.2.x, which I found more unstable on my system?

Comment 658 Len Brown 2017-01-08 23:03:14 UTC

@jbMacAZ

For the turbostat error, try "make clean; make" of the latest attachment.
Apparently I sent the latest source but failed to re-build the binary.

FWIW, I expect to upload an updated turbostat this coming week
with some baytrail specific updates.

Re: value in testing older, more unstable, kernels.

My personal bias is to always run the latest upstream kernel,
or at least the latest kernel.org -stable.  That kernel
is what all other kernels follow, eventually.

However, many users are on distro binary kernels, and so
it is useful to know where those are too.

The root cause of this particular failure has been elusive.
It seems there are multiple ways of making the root cause
occur more/less frequently.  There may even be multiple
independent root causes.  If we can use an old kernel to
isolate the difference between bad/good to help find
the effect of a certain patch, that is useful.  But
with possible multiple causes, the benefit of a patch
on an unstable kernel could be lost in the noise.

Comment 659 jbMacAZ 2017-01-09 01:57:14 UTC

(In reply to Len Brown from comment #658)
> @jbMacAZ
> 
> For the turbostat error, try "make clean; make" of the latest attachment.
> Apparently I sent the latest source but failed to re-build the binary.
> 
> FWIW, I expect to upload an updated turbostat this coming week
> with some baytrail specific updates.
> 
I played with the source, but only succeeded in changing which offset provokes the error!  Looking forward to the baytrail turbostat update.  In the mean time, I'll stick to testing recent non EOL kernels.

Comment 660 sikorskydenisua 2017-01-09 16:54:12 UTC

Joining to report this bug on ASUS E502MA - Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz, using any linux distro - Ubuntu, Mint, Manjaro, etc...

Fixed this by https://wiki.archlinux.org/index.php/Intel_graphics#X_freeze.2Fcrash_with_intel_driver - last option, which has a link to this site.

Comment 661 Len Brown 2017-01-10 09:05:59 UTC

Created attachment 251091 [details]
latest turbostat utility for baytrail

Here is my latest turbostat utility, updated for Baytrail.
This is a development version, not yet released to the Linux kernel source tree.

$ tar xzvf turbostat-src.tar.gz
$ cd turbostat-src

$ sudo ./turbostat --debug -o ts.out sleep 10

If you are not comfortable running a utility you download from the internet as root, first build it from source:

$ make clean
$ make
and optionally
$ sudo make install

Updates:
1. uses the Baytrail C1 hardware residency counter instead of software
2. shows the Baytrail module c6 hardware residency counter.
   Yes, this is the same on pairs of cores (that is what a module is)
3. shows Package C6
4. does not access/show un-supported counters c3, pc3, c7, pc7

Here are what states/counters are enabled for the interesting parameters to intel_idle.max_cstate:

intel_idle.max_cstate=1 C1
intel_idle.max_cstate=2 C1, Mod-C6
intel_idle.max_cstate=3 C1, Mod-C6, Pkg-C6
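
The table above can be turned into a quick lookup for the running system. This sketch only encodes the three values listed in this comment; `byt_states` is a hypothetical helper, and reading the current setting from /sys/module/intel_idle/parameters/max_cstate assumes the intel_idle driver is in use.

```shell
#!/bin/sh
# Map an intel_idle.max_cstate value ($1) to the Baytrail states it enables,
# using only the three settings listed in the table above.
byt_states() {
    case $1 in
        1) echo "C1" ;;
        2) echo "C1, Mod-C6" ;;
        3) echo "C1, Mod-C6, Pkg-C6" ;;
        *) echo "not listed" ;;
    esac
}
```

For example, `byt_states "$(cat /sys/module/intel_idle/parameters/max_cstate)"` reports the states for the current boot. A kernel booted without the parameter will most likely print "not listed", because the default limit is higher than 3.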

This release replaces the turbostat versions attached to comment #635 and comment #637.

Comment 662 amjafuso 2017-01-10 09:21:20 UTC

I have now been running kernel 4.9.0 for two weeks without any freeze.

- shuttle xs35v4, j1900
- 4.9.0-sparky-amd64 #1 SMP Tue Dec 20 12:43:44 CET 2016 x86_64 GNU/Linux

@len brown: does it still help you if I run turbostat?

Comment 663 Len Brown 2017-01-10 09:58:15 UTC

Created attachment 251101 [details]
Test script to freeze your baytrail quickly

I have done some testing on two Baytrail systems:

Dell Inspiron 3451 laptop (Atom N3540)
Acer Aspire AXC desktop (Atom J1900)

Currently using Ubuntu 16.10 vmlinuz-4.8.0-32-generic
with no cmdline parameters.

Using the attached script, each box freezes in under 30-minutes, and often much sooner.  I've seen a freeze as quickly as under 60 seconds.

The current script runs 8 copies of "nanosleep 1000" from comment #629 plus one copy of glxgears -fullscreen.  It also displays information about your system that I'd like to see when you report a failure.

I ssh into the test system, and invoke a 1-line shell script that does this:
./byt.test | tee out.`date +%Y%m%d_%H%M%S`

so when the system hangs, there is a record both in the ssh window, and also in the out.* file.  Yes, attaching your out.* file to this bug report is appropriate -- though the turbostat output gets redundant after a while -- so copy/paste of the top of the file also works...  You can simply show the last timestamp, or say how long it took to freeze.
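
The logging one-liner can also be wrapped in a function so the same pattern works for any test command. A small sketch, assuming a POSIX shell; `log_run` is a made-up name, and unlike the one-liner above it also sends stderr to the log.

```shell
#!/bin/sh
# Run a command, showing its output on the terminal while also saving it
# to a log named out.YYYYMMDD_HHMMSS in the current directory.
log_run() {
    log="out.$(date +%Y%m%d_%H%M%S)"
    "$@" 2>&1 | tee "$log"
}
```

Invoked over ssh as `log_run ./byt.test`, it leaves the last lines printed before a hang visible in the ssh window, with the same text in the out.* file.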

Adding more copies of glxgears did not seem to make the failure occur sooner.  When I ran without glxgears, the failure stretched out to 23 hours on the acer, and the dell was still running at 24 hours.  So 1 copy of glxgears seems to be the ticket.

intel_idle.max_cstate=2 still fails, my one attempt took 49 minutes.

Comment 664 Len Brown 2017-01-10 10:22:23 UTC

@ amjafuso

please try the script in comment #663 to see if you can get 4.9 to fail.
I've not tested 4.9 yet.  You've also reported success with intel_idle.max_cstate=2.  If you get 4.9 to fail with no cmdline, please re-test with intel_idle.max_cstate=2 to see if that survives.  My experience is that they will both fail, and that cmdline will simply take a bit longer than the default.

I also acquired an Asus T100 TAS and Asus T100 CHI.  My next step is to wrestle 64-bit Ubuntu onto their 32-bit BIOS in a dual-boot config, and broaden the testing to those boxes, before I start changing the kernel.

Comment 665 amjafuso 2017-01-10 13:41:32 UTC

Ok, script started 2 hours ago, no freezes. No freeze with kernel 4.9. Boot parameter:

# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.9.0-sparky-amd64 root=UUID=612f6bbb-b095-4da7-b823-0658edce9dfc ro quiet splash
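
When many reports are compared, it helps to state explicitly whether a given boot parameter was set. A minimal sketch for pulling one parameter out of the kernel command line; `cmdline_param` is a hypothetical helper, and the optional second argument exists only so a saved cmdline can be checked.

```shell
#!/bin/sh
# Print the value of a kernel boot parameter, or "unset" if it is absent.
# A bare flag without "=value" prints an empty line.
# $1: parameter name, $2: cmdline file (defaults to /proc/cmdline).
cmdline_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | awk -F= -v k="$1" '
        $1 == k { print substr($0, length(k) + 2); found = 1; exit }
        END { if (!found) print "unset" }'
}
```

On the command line shown above, `cmdline_param intel_idle.max_cstate` would print "unset", confirming that this test ran with the default C-state limit.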

I didn't patch the kernel (don't know how to do that, sorry).

Comment 666 GConst 2017-01-10 15:39:04 UTC

Hello,

Has anybody tried 4.9.2 in Ubuntu 16.04 from the following site? 
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9.2/

Comment 667 jbMacAZ 2017-01-10 19:16:18 UTC

Created attachment 251131 [details]
T100CHI turbostat kernel 4.9 patched

Turbostat is working now on CHI. Thank you.

Kernel 4.9.2 with auto-demotion-disable running for ~12 hours so far (just added 4 nanosleeps 1000).  Without nanosleep(x4) CPU%c6 was ~90%. 4.9.2 without auto-demotion-disable patch ran about an hour before freezing at idle. (Z3775)

re: T100 test beds
The T100CHI is a little tedious to get going in linux.  CHI's OEM Bluetooth keyboard is offline at boot time (and unpaired at install time).  It's easier to use a powered hub, USB keyboard/mouse and wifi dongle for linux install.  If you add bootia32.efi to the installer USB /EFI/boot/ and edit the installer /boot/grub/grub.cfg to add intel_idle.max_cstate=1, you can boot most debian derivative installers.  Press <ESC> during power up to get to the boot menu.  Some distros need grub-efi-ia32 & grub-efi-ia32-bin to be installed manually. Wifi needs brcmfmac43241b4-sdio.txt and bluetooth needs BCM4324B3.hcd and works better with blueman device manager...

Comment 668 Juha Sievi-Korte 2017-01-11 18:23:57 UTC

Thanks for all the updates.

I've now been trying to freeze my N3540 laptop with nanosleep and different combinations of other tools, with varying success. I once managed to freeze an idle system running 4xnanosleep 250 processes in a couple of hours, but then the same test again yielded 36+hrs of uptime.

I can confirm that adding glxgears surely helps: with 4xnanosleep 250 + glxgears -fullscreen I've now gotten 4 freezes in a row, with times to freeze being roughly 90 mins, 50 mins, 15 mins and 8.5 hours. Also, just now when trying to update this report, I got a freeze in less than 10 mins from reboot, with no nanosleep running...

Attached is a turbostat output from few seconds before one of the freezes happened.

turbostat version 4.17 1 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu1: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu1: MSR_PLATFORM_INFO: 0x60000001a00
6 * 83 = 500 MHz max efficiency frequency
26 * 83 = 2166 MHz base frequency
cpu1: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu1: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu1: MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x001a000f (UNlocked: pkg-cstate-limit=15: unknown)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880c0 (UNlocked)
cpu0: PKG Limit #1: ENabled (6.000000 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88330100 (54 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88330100 (54 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88340100 (53 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88340100 (53 C +/- 1)
10.006062 sec
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	CoreTmp	GFX%rc6	GFXMHz	PkgWatt	CorWatt
	-	-	80	11.80	675	2167	0	0	2.18	86.02	53	0.00	0	1.48	0.09
	0	0	67	10.77	622	2167	0	0	3.08	86.15	53	0.00	0	1.48	0.09
	1	1	127	18.55	682	2167	0	0	2.90	78.55	53
	2	2	65	9.83	659	2167	0	0	1.47	88.70	53
	3	3	61	8.06	752	2167	0	0	1.26	90.68	53
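
For anyone who wants to check the frequency and temperature lines in the dump above by hand, here is a minimal sketch that redoes the arithmetic from the raw MSR values. The bit positions are assumptions based on the usual Intel SDM field layout, and the function names are made up for illustration; this is not turbostat's own code.

```python
# Sketch: redo the frequency and temperature arithmetic from raw MSR values.
# Bit positions are assumptions (usual Intel SDM layout), not taken from this thread.

SLM_BCLK_MHZ = 83.3  # "SLM BCLK: 83.3 Mhz" from the dump


def platform_info_ratios(msr):
    """Return (max_efficiency_ratio, base_ratio) from MSR_PLATFORM_INFO."""
    max_efficiency = (msr >> 40) & 0xFF  # bits 47:40
    base = (msr >> 8) & 0xFF             # bits 15:8
    return max_efficiency, base


def core_temperature_c(therm_status, temperature_target):
    """Return (temperature in C, resolution in C) from the two thermal MSRs."""
    tj_max = (temperature_target >> 16) & 0xFF  # bits 23:16
    readout = (therm_status >> 16) & 0x7F       # bits 22:16, degrees below TjMax
    resolution = (therm_status >> 27) & 0xF     # bits 30:27
    return tj_max - readout, resolution


low, base = platform_info_ratios(0x60000001A00)
print(f"{low} * 83 = {round(low * SLM_BCLK_MHZ)} MHz max efficiency frequency")
print(f"{base} * 83 = {round(base * SLM_BCLK_MHZ)} MHz base frequency")
print(core_temperature_c(0x88330100, 0x00690000))
```

With the values from the dump, the ratios come out as 6 and 26, and the thermal readout of 51 below a TjMax of 105 gives the reported 54 C.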

Running 4.9.0-1 from the openSUSE repos at the moment; will try the auto-demotion patch next.

Comment 669 A Uday K 2017-01-11 19:18:00 UTC

I did as told on my system, 
here's a link to the output as soon as I ran those commands...
---
http://pastebin.com/ByQVV5gc
---

On my system,
if I try this line,
---
$ ./byt.test | tee out.`date +%Y%m%d_%H%M%S`
---
it gives the "permission denied" error. I get this error even if I'm in superuser mode (sudo su).

However, on my system, the workaround is...
---
$ . byt.test | tee out.`date +%Y%m%d_%H%M%S`
---

Also, there's one more thing...
If you look at the second-to-last line of that link, you'll notice this...
---
The program 'glxgears' is currently not installed.
---
How would you suggest I proceed?
Should I go ahead and type...
---
$ sudo apt install mesa-utils
---
Or should I be doing something else?
What is glxgears?

Comment 670 jbMacAZ 2017-01-11 22:59:18 UTC

Created attachment 251261 [details]
CHI_freeze_4.9.2_no_demotion_disable_patch

Using Len Brown's freeze script with kernel 4.9.2, I had a freeze in about 20 minutes.  Freeze times range from 10 minutes to around an hour.

Comment 671 Len Brown 2017-01-11 23:39:00 UTC

@ A Uday K

. file
will interpret that file in the current shell session.
that isn't what you want if the script has side effects, like changing directory, or calling exit.

try this

$ chmod +x file
$ ./file
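
The "permission denied" from ./file comes from the script lacking an execute bit, which is also why switching to root does not help: root needs at least one execute bit set as well. A minimal sketch of checking and setting that bit, using a throwaway file with a hypothetical name rather than the real byt.test:

```python
import os
import stat
import tempfile


def has_execute_bit(path):
    """True if the owner execute bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


def add_execute_bits(path):
    """Rough equivalent of `chmod +x path` (ignoring the umask)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Demonstrate on a temporary script; "demo.sh" is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "demo.sh")
    with open(script, "w") as f:
        f.write("#!/bin/sh\necho hello\n")
    os.chmod(script, 0o644)      # readable, but not executable
    before = has_execute_bit(script)
    add_execute_bits(script)
    after = has_execute_bit(script)
    print(before, after)
```

This assumes a POSIX filesystem; it only inspects mode bits and does not try to run the script.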

this particular script has a sudo, and so you will be prompted for a password if your session doesn't remember it from a previous sudo

yes, glxgears is a simple graphics demo.
it seems to come installed by default in ubuntu 16.10
go ahead and install it and try it.

the good thing about glxgears is that it does video w/o doing any audio.
I suspect that the folks freezing their system by playing an audio+video
stream may be running into a known audio issue that hopefully will soon be fixed.

@ Juha Sievi-Korte

Thanks for the testing.
Please update to the latest turbostat from comment #661

when i wrote the test script in comment #663
i expected to be varying the number of copies of glxgears.
I too notice a huge benefit (shorter time to freeze)
from running 1 copy, but did not notice a benefit of running more copies.

@ jbMacAZ

thanks for confirming that 4.9.2 is not magic,
and that the test script from comment #663 fails well for
an unpatched kernel.

so did you eventually get a hang with 4.9 with demotion-re-enabled patch?

what kernel is "LapLet 4.9.2.2n" -- is that unpatched or patched?
your %pc6 is remarkably low in comment #670 (under 2%)

BTW. thanks for the T100 CHI install tips, hopefully I'll get that box going tonight. If it is like yours, it will be a great test bed.

Comment 672 jbMacAZ 2017-01-12 01:10:07 UTC

(In reply to Len Brown from comment #671)
> 
> @ jbMacAZ
> 
> so did you eventually get a hang with 4.9 with demotion-re-enabled patch?
> 
> what kernel is "LapLet 4.9.2.2n" -- is that unpatched or patched?
> your %pc6 is remarkably low in comment #670 (under 2%)
> 
> BTW. thanks for the T100 CHI install tips, hopefully I'll get that box going
> tonight.  If it like yours, it will be a great test bed.

4.9.2.2n is 4.9.2 + aufs4.9 + T100 specific patches not yet upstreamed (from T100/Ubuntu G+ group), but no ubuntu patches.  glxgears has a slight stutter while running, so the system may be maxing out??  I'm running mint 18.1 (x86_64) w/cinnamon 3.2.7. I have a system monitor (CPU,mem,net,disk) graphical applet in the system tray, wifi and bluetooth active.
A standard unpatched recent(>4.6) kernel will run acceptably on the CHI, but more of the minor hardware (buttons, backlight, etc.) works with the T100 patches and .config.

I also built 4.9.2.2 which adds your demotion patch.  So far I have not seen 4.9.2.2 freeze - longest test run so far has been about 16 hours.  FWIW, 4.10-rc3 with your patch did freeze, but 4.10 is still too new for me to take seriously.

Comment 673 jechtpurgateur 2017-01-12 11:05:22 UTC

Hello,

How is this bug going? I've still had the freeze issue for a year, and the workaround doesn't work on my system. At least it's better, but come on, the system freezes randomly. Why isn't it a priority?

Comment 674 alvararo 2017-01-12 12:43:29 UTC

If you need a quick way to get freezes: I had a laptop with an N3540 which froze a few seconds after start (the same complete freeze, without logs) when I launched Google Chrome just after the desktop started in Manjaro Gnome 16.08 with the 3.16 Manjaro kernel (only in the Gnome edition). If someone has time and is interested, please confirm this; I'm unable to try it right now.

Comment 675 Prashant Poonia 2017-01-12 15:24:56 UTC

The n3540 also freezes with 4.9.2: it froze once in 12 hours of uptime while connected through the wifi hotspot of my Android device, when a missed call was notified on the netbook through KDE Connect. Firefox and a terminal were running in the foreground.
Even so, 4.8 and later kernels are much better than previous versions, at least on my Asus x553ma netbook.

Comment 676 Len Brown 2017-01-12 18:23:58 UTC

@ jechtpurgateur@gmail.com 

if the workaround (booting with "intel_idle.max_cstate=1") does not
help your system, then you have a different bug.  Please file it.
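
Before concluding that the workaround does not help, it is worth confirming that the parameter actually reached the running kernel. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the function names are made up for illustration, and the sample string is the command line posted in comment 694:

```python
def max_cstate_from_cmdline(cmdline):
    """Return the intel_idle.max_cstate value as an int, or None if it is not set."""
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "intel_idle.max_cstate" and sep:
            value = int(raw)  # in this sketch the last occurrence wins
    return value


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Command line from comment 694, which has no workaround set:
posted = ("BOOT_IMAGE=/boot/vmlinuz-4.8.0-34-generic "
          "root=UUID=d097e0d3-b7a2-4943-95fa-591edd652328 ro quiet splash vt.handoff=7")
print(max_cstate_from_cmdline(posted))
print(max_cstate_from_cmdline(posted + " intel_idle.max_cstate=1"))
```

On a live system, max_cstate_from_cmdline(read_running_cmdline()) gives the value the kernel was booted with, or None if the parameter is absent.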

Comment 677 Len Brown 2017-01-12 18:27:28 UTC

@ alvararo 

It is interesting that you had a failure easily reproducible in 3.16, which was widely cited as the most stable baytrail kernel before things went south in 3.17.  I'm afraid, however, that interest in 3.16 is about zero right now.  There have been a lot of fixes, and it would be more interesting if you could get Linux-3.9 to fail quickly.

Comment 678 Len Brown 2017-01-12 18:29:40 UTC

@ alvararo

oops, typo, we want to go forward to the present, not back in history:-)

< get Linux-3.9 to fail quickly.
> get Linux-4.9 to fail quickly.

Comment 679 julio.borreguero@gmail.com 2017-01-12 18:51:56 UTC

(In reply to Len Brown from comment #676)
> @ jechtpurgateur@gmail.com 
> 
> if the workaround (booting with "intel_idle.max_cstate=1") does not
> help your system, then you have a different bug.  Please file it.

@len

there seem to be various bugs with baytrail architecture, my system a N2940
still freezes with "intel_idle.max_cstate=1" and for all other N2940 as well.
But for us Kernel 4.12 (i am using it) or even 4.16 (i think) work well without any kernel parameters.
All later versions freeze including 4.8.x and pretty sure 4.9.rcx as well.
All that info is in this thread that seems to be more like a public chat-forum by now ;)
The Point i want to make, as there are several bugs that affect baytrail, most likely related somehow, why would you file a different bug report for N2940 ?
This is the best Bug-Thread there is so far regarding baytrail problems on the net as far as i know.
if some day the baytrail problem will really be solved i am pretty sure it will be solved for us as well.
i would include that valid info for N2940 made by many users in this bug-report to try solving the problem(s) with baytrail
kind regards and thank you for your work on this

Comment 680 Prashant Poonia 2017-01-12 21:45:23 UTC

(In reply to Len Brown from comment #676)
> @ jechtpurgateur@gmail.com 
> 
> if the workaround (booting with "intel_idle.max_cstate=1") does not
> help your system, then you have a different bug.  Please file it.

There is something more to this bug, as I have an interesting observation. My n3540 is of baytrail architecture, and I have tested all kernel versions, and intel_idle.max_cstate=1 was the ultimate workaround for all of them; this makes my hardware a perfect fit to be regarded as affected by this specific bug. But recently, when I tested 4.8.0-32, my system froze once even with the cstate parameter in place. It didn't happen again, and the same kernel is also the most stable when the cstate parameter is not set.
I have also noticed a strong relation between freezes and heavy wifi usage. People who are facing freezes even after the max_cstate parameter is set should see if there is a relation between wifi and freezes and report back, with downloading big files over a fast connection as the trigger.

Comment 681 Len Brown 2017-01-12 22:20:25 UTC

@ julio.borreguero@gmail.com

I must insist.

If you have a failure that is anything other than a baytrail hang that goes away with intel_idle.max_cstate=1, then you are best served by a new bug report.

While we always hope there is a magic bullet that fixes multiple similar issues, experience shows that is very rarely the case.  This bug report will be closed when Baytrail systems that used to hang without intel_idle.max_cstate=1 no longer need that parameter.  So if that doesn't describe your system, you are best off with a bug report that does.  Go ahead and reference it here, but please put all necessary information describing that failure in that bug report.  Thanks.

Comment 682 Len Brown 2017-01-12 22:42:37 UTC

@ Prashant Poonia

Yes, the n3540 failing with intel_idle.max_cstate=1 is also interesting.  If you can isolate what kind of workload triggers it, please put that in a new bug report describing the failure.  If you suspect WIFI, then I suggest seeing if you can eliminate sound and graphics from the workload to be sure the known problems there are not the actual root cause.

Comment 683 Mika Kuoppala 2017-01-13 14:43:41 UTC

Created attachment 251471 [details]
drm/i915/byt: Avoid tweaking evaluation thresholds

Comment 684 Justin 2017-01-13 16:34:32 UTC

Kind of disagree with the new bug report sentiment...  This bug is almost 2 years old and multiple kernel updates have come out since then.  Are we asking all those who the intel_idle.max_cstate=1 used to work for 2 years ago to go back and wait another 2 years for those issues to be addressed? Simply because they now need additional commands for the intel_idle.max_cstate=1 to work?

Comment 685 Josep Pujadas-Jubany 2017-01-13 18:31:10 UTC

2 years ago? True. It's explained at https://www.phoronix.com/scan.php?page=news_item&px=Intel-Linux-Bay-Trail-Fail

(Bug started at https://bugs.freedesktop.org/show_bug.cgi?id=88012 and moved to a kernel bug)

The hardware bug that is supposed to be the origin of the problem (VLP52) was reported by Intel in March 2014: https://bugzilla.kernel.org/show_bug.cgi?id=109051#c425

Other OSes are also affected by this Intel hardware bug and others.

Recommended Google searches:

freeze c-state windows

freeze c-state osx

In fact, many Windows gamers recommend disabling c-states.

Comment 686 Josep Pujadas-Jubany 2017-01-13 18:57:26 UTC

Interesting document from Dell, about how to manage c-states in Linux

http://en.community.dell.com/cfs-file/__key/telligent-evolution-components-attachments/13-4491-00-00-20-22-77-64/Controlling_5F00_Processor_5F00_C_2D00_State_5F00_Usage_5F00_in_5F00_Linux_5F00_v1.1_5F00_Nov2013.pdf

Comment 687 Guillermo Molleda 2017-01-14 17:06:03 UTC

I have a Lenovo Yoga 11e 20D9 with an Intel Celeron N2930 (2.17GHz) and 4GB RAM, with the UEFI BIOS updated to the latest version (17-October-2016).
In Windows 8.1 it runs perfectly -> it is not a hardware bug.
But in Linux Mint 18.1 Serena 64bit MATE 1.16.1, kernel 4.4.0-57-generic #78-Ubuntu x86_64, the system freezes within 5 minutes when I watch a video on YouTube with Firefox.
With intel_idle.max_cstate=1 it does not freeze.

Comment 688 jbMacAZ 2017-01-14 17:14:41 UTC

kernel 4.9.3 seems to be more stable.  Unpatched and no workarounds (no T100 patches either, just custom .config) took over eight hours to freeze.  4.9.2 would freeze in less than an hour on my system. YMMV 

I applied Mika Kuoppala's new patch to 4.9.2 and it is still running 12 hours later with Len Brown's byt.test script.  Without any cstate workaround hard freeze averages around 30 minutes on recent kernels.  (T100CHI - Z3775)

Comment 689 Ernestas Kulik 2017-01-14 20:35:28 UTC

I’m going to jump on the bandwagon here.

I only experience freezes with both onboard WNIC and mode setting enabled.
A semi-consistent reproducer is establishing multiple concurrent network connections (e.g. downloading a torrent) and/or downloading big files at high speeds.

The laptop is an ASUS X553MA with a Celeron N2840.

A probably unrelated thing is that the BIOS has a mysterious setting called “OS Selection” with “Windows 7” and “Windows 8.x” options. Older (probably 3.x and early 4.x) kernels used to not boot with “Windows 8.x” selected. I could manage to get it to boot by using a modified DSDT, so I assumed it was an ACPI problem, but it works fine with the current kernel.

Another probably unrelated thing is that this thing freezes when modules dw_dmac and dw_dmac_core are loaded (I tried documenting these things here: https://wiki.archlinux.org/index.php/ASUS_X553MA#Laptop_freezes_on_boot).

Comment 690 Len Brown 2017-01-15 18:24:58 UTC

@ Mika Kuoppala

Your patch in comment #683 made a dramatic improvement,
when applied to Linux-4.8.17.

Without the patch, the Dell-n3540 hung in 13 minutes
and the Acer-J1900 hung in 3 minutes.

With the patch, both machines are still running after 12 hours.

(both fixed at HFM, running 1 copy of glxgears + 8 copies of nanosleep)
(both are using wired ethernet -- wifi is disabled on the Dell
 and it is using a USB/wired-ethernet dongle)
(no audio is being played)

Looking at the patch, it appears to be a revert of

            commit 8fb55197e64d5988ec57b54e973daeea72c3f2ff
            Author: Chris Wilson <chris@chris-wilson.co.uk>
            Date:   Tue Apr 7 16:20:28 2015 +0100
    
            drm/i915: Agressive downclocking on Baytrail

That patch went upstream in Linux-4.2-rc1.  That is interesting
because 4.1 was often cited as a local maximum in baytrail stability
with 4.2 widely cited as less stable.

And so my feedback on that patch is consistent with the favorable result
reported above by jbMacAZ on the T100 TAM z3775.

I tried doing the same comparison using Linux-4.9.3,
but the baseline test of Linux-4.9.3 with no patches
ran for 30 hours on both machines without failure.

Comment 691 AB 2017-01-16 07:50:04 UTC

Can anyone reproduce freezes with Ethernet cable connection and wifi turned off?

On my desktop I see them only when wifi usb dongle is connected (with both RTL chips available for me).

Comment 692 jbMacAZ 2017-01-16 18:31:17 UTC

I tried using both the demotion patch and threshold patch on 4.9.4 only to be stymied by a regression in wifi (also seen in 4.8.17.)  I call it a soft freeze, because the UI only updates about once a minute, but the mouse cursor moves freely.  dmesg fills up with various brcmfmac error -110's.  For purposes of the cstate bug, I'll stick to testing with 4.9.2

FWIW, with 4.9.2 I was able to run Mika's patch for 37 hours without a freeze.  I stopped that test to try other things.  I had done some testing many months ago regarding aggressive down-clocking (comment #93) which showed only slight improvement at that time.

Comment 693 Pshem K 2017-01-16 20:04:24 UTC

(In reply to AB from comment #691)
> Can anyone reproduce freezes with Ethernet cable connection and wifi turned
> off?
> 
> On my desktop I see them only when wifi usb dongle is connected (with both
> RTL chips available for me).

I can easily reproduce this without wifi. I ran a headless router setup and the lockups most frequently occur after (or sometimes during) heavy network activity.

Comment 694 Shev_84 2017-01-18 20:53:53 UTC

Running ./byt.test | tee out.`date +%Y%m%d_%H%M%S` I cannot hang my J1900. After about 36 hours of running, I stopped it, played a movie, and got a hang after about 10 minutes of playback. For me this script doesn't work (or I should rather say 'isn't effective').
Running on Ubuntu 16.10.
BOOT_IMAGE=/boot/vmlinuz-4.8.0-34-generic root=UUID=d097e0d3-b7a2-4943-95fa-591edd652328 ro quiet splash vt.handoff=7
board_vendor:ASRock
board_name:Q1900DC-ITX
board_version:
bios_date:03/31/2016
bios_vendor:American Megatrends Inc.
bios_version:P1.50
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 8
microcode       : 0x831
cpu MHz         : 1509.649
cache size      : 1024 KB
physical id     : 0
siblings        : 4
[    1.860044] intel_idle: MWAIT substates: 0x33000020
[    1.860046] intel_idle: v0.4.1 model 0x37
[    1.860414] intel_idle: lapic_timer_reliable_states 0xffffffff
state0/desc:CPUIDLE CORE POLL IDLE
state1/desc:MWAIT 0x00
state2/desc:MWAIT 0x58
state3/desc:MWAIT 0x52
state4/desc:MWAIT 0x60
state5/desc:MWAIT 0x64
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
glxgears -display :0 -fullscreen
turbostat --debug -i 100
turbostat version 4.17 10 Jan 2017 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 11 CPUID levels; family:model:stepping 0x6:37:8 (6:55:8)
CPUID(1): SSE3 MONITOR - EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)
CPUID(7): No-SGX
SLM BCLK: 83.3 Mhz
cpu2: MSR_CC6_DEMOTION_POLICY_CONFIG: 0x00000000 (DISable-CC6-Demotion)
cpu2: MSR_MC6_DEMOTION_POLICY_CONFIG: 0x00000000 (DISable-MC6-Demotion)
RAPL: 4581 sec. Joule Counter Range, at 30 Watts
cpu2: MSR_PLATFORM_INFO: 0x100000001800
16 * 83 = 1333 MHz max efficiency frequency
24 * 83 = 1999 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x00000000 (C1E auto-promotion: DISabled)
cpu2: MSR_TURBO_RATIO_LIMIT: 0x00000000
cpu2: MSR_PKG_CST_CONFIG_CONTROL: 0x0018000f (UNlocked: pkg-cstate-limit=15: pc7)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x00000505 (0.031250 Watts, 0.000032 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x003880fa (UNlocked)
cpu0: PKG Limit #1: ENabled (7.812500 Watts, 262144.000000 sec, clamp DISabled)
cpu0: PKG Limit #2: DISabled (0.000000 Watts, 0.000977* sec, clamp DISabled)
cpu0: MSR_PP0_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: Cores Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00690000 (105 C)
cpu0: MSR_IA32_THERM_STATUS: 0x88330000 (54 C +/- 1)
cpu1: MSR_IA32_THERM_STATUS: 0x88330000 (54 C +/- 1)
cpu2: MSR_IA32_THERM_STATUS: 0x88310000 (56 C +/- 1)
cpu3: MSR_IA32_THERM_STATUS: 0x88310000 (56 C +/- 1)


        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       56      4.18    1348    2000    239554  0       8.01    87.81   58.56   57      **.**   187     30.99   0.62    0.21
        0       0       55      4.10    1349    2000    39173   0       5.37    90.53   67.47   55      **.**   187     30.99   0.62    0.21
        1       1       54      4.02    1350    2000    39053   0       5.25    90.73   67.47   55
        2       2       63      4.70    1347    2000    88111   0       11.57   83.73   49.65   57
        3       3       53      3.90    1349    2000    73217   0       9.87    86.23   49.65   57
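
The RAPL lines in the dump above can be reproduced from the raw MSR values, which is a handy sanity check when comparing machines. This is a minimal sketch under stated assumptions: the microjoule energy scaling and the treatment of a zero time field are chosen because they match the numbers turbostat printed here, not because the thread documents them, and the function names are made up for illustration.

```python
def rapl_units(power_unit_msr):
    """Return (watts, joules, seconds) per count, matching the printed values."""
    power_w = 1.0 / (1 << (power_unit_msr & 0xF))                 # bits 3:0
    # Assumption: on this Atom generation the energy field is an exponent in microjoules.
    energy_j = (1 << ((power_unit_msr >> 8) & 0x1F)) / 1_000_000  # bits 12:8
    time_field = (power_unit_msr >> 16) & 0xF                     # bits 19:16
    if time_field == 0:
        time_field = 0xA  # assumption: a zero field is treated as 2^-10 s
    time_s = 1.0 / (1 << time_field)
    return power_w, energy_j, time_s


def pkg_limit_1(power_limit_msr, power_w):
    """Return (enabled, watts) for PKG Limit #1."""
    enabled = bool((power_limit_msr >> 15) & 1)   # bit 15
    watts = (power_limit_msr & 0x7FFF) * power_w  # bits 14:0
    return enabled, watts


power_w, energy_j, time_s = rapl_units(0x00000505)
print(f"{power_w:.6f} Watts, {energy_j:.6f} Joules, {time_s:.6f} sec.")
print(pkg_limit_1(0x003880FA, power_w))  # J1900 from this comment
print(pkg_limit_1(0x003880C0, power_w))  # N3540 from comment 668
```

With these inputs the package limits come out as 7.8125 W for the J1900 and 6.0 W for the N3540, matching the two dumps.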

I can run a movie and then grab the turbostat output, but I don't know if it would be helpful to the topic.

Comment 695 João Paulo Rechi Vita 2017-01-21 15:36:57 UTC

Hello Len,

First, thanks for taking the lead on this. I've recently worked on enabling a device based on the Intel Atom Z3735F at Endless. From what I can tell it is pretty similar to one of the recent Intel Compute Sticks. That device also has a RTL8723BS WiFi adapter, which is known to be very problematic on Bay-Trail platforms. I dug the device out of my drawers and did a couple of tests with it:

I've based my tests on our current -next kernel, which is based on Ubuntu's Zesty master branch, in turn based on Linus' v4.9 tag. Additionally, to be able to use the machine for more than a couple of minutes I need https://patchwork.kernel.org/patch/9478087/, or to disable run-time PM for the SDHCI host controller.

Without your C6 auto-demotion patch, I was not able to reproduce the freeze using your stress test script running for ~8-10h, but playing videos from YouTube in a loop froze the machine in ~1h. I've also tried heavy downloads without X running at all, to see if the problem could be isolated to networking / SDHCI, but that didn't reproduce the freeze either. This is the turbostat output when the machine froze playing YouTube:

	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	190	15.18	1253	1333	311669	0	13.62	71.20	37.07	75	1.14	646	5.38	2.77	0.33
	0	0	332	24.79	1337	1333	151194	0	23.39	51.81	14.19	75	1.14	646	5.38	2.77	0.33
	1	1	278	20.44	1360	1333	91562	0	19.57	59.99	14.19	75
	2	2	101	10.28	986	1333	46459	0	7.67	82.05	59.95	75
	3	3	50	5.19	961	1333	22454	0	3.85	90.95	59.95	75

Using your C6 auto-demotion patch the machine survived an overnight youtube play loop, but the time on Mod-C6 or Pkg-C6 dropped down to zero most of the time, except once where I got (still super low):

	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	275	15.74	1749	1333	107463	0	32.25	52.01	0.01	68	6.58	229	0.00	1.24	0.49
	0	0	341	19.39	1760	1333	53679	0	66.65	13.97	0.00	66	6.58	229	0.00	1.24	0.49
	1	1	238	13.61	1744	1333	16186	0	17.04	69.35	0.00	66
	2	2	274	15.68	1744	1333	21012	0	26.33	57.99	0.02	68
	3	3	249	14.29	1744	1333	16586	0	18.97	66.74	0.02	68

I've also tried Mika Kuoppala's patch from comment #683 (without the auto-demotion patch), and the machine survived an overnight youtube play loop while still entering all the C6 states:

	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	278	24.62	1130	1333	185129	0	8.55	66.83	38.72	68	**.**	646	20.25	0.79	0.34
	0	0	301	26.78	1123	1333	101421	0	11.76	61.46	36.63	66	**.**	646	20.25	0.79	0.34
	1	1	269	23.18	1159	1333	20539	0	6.21	70.62	36.63	66
	2	2	285	25.80	1103	1333	42486	0	9.73	64.47	40.81	68
	3	3	259	22.74	1139	1333	20683	0	6.49	70.77	40.81	68
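
To compare the three runs at a glance, the summary ("-") rows can be reduced to their C6 residency columns. A minimal sketch, with the rows copied from the tables above; the labels and function names are made up for illustration:

```python
HEADER = ("Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c6 "
          "Mod%c6 CoreTmp GFX%rc6 GFXMHz Pkg%pc6 PkgWatt CorWatt").split()

# Summary rows from the three turbostat tables in this comment.
SUMMARY_ROWS = {
    "no patch (froze)":
        "- - 190 15.18 1253 1333 311669 0 13.62 71.20 37.07 75 1.14 646 5.38 2.77 0.33",
    "auto-demotion patch":
        "- - 275 15.74 1749 1333 107463 0 32.25 52.01 0.01 68 6.58 229 0.00 1.24 0.49",
    "patch from #683":
        "- - 278 24.62 1130 1333 185129 0 8.55 66.83 38.72 68 **.** 646 20.25 0.79 0.34",
}


def parse_row(line):
    """Map column names to raw strings; fields such as '**.**' stay unparsed."""
    return dict(zip(HEADER, line.split()))


def residency(line, columns=("CPU%c6", "Mod%c6", "Pkg%pc6")):
    """Return the selected residency columns as floats."""
    row = parse_row(line)
    return {name: float(row[name]) for name in columns}


for label, line in SUMMARY_ROWS.items():
    print(label, residency(line))
```

Only the numeric residency columns are converted, so the `**.**` placeholder in GFX%rc6 does not cause a parse error.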

I can give another try on a clean v4.9.Y if you think this would be helpful.

Comment 696 Len Brown 2017-01-22 08:59:39 UTC

@ Shev_84

It is interesting that with 4.8.0-distro your j1900 survives my test script for 36 hours, while my j1900 dies quite quickly. I've not examined it closely, but I should mention that I run with cpufreq set to maximum frequency, since I suspect, but have not rigorously proven, that causes the failure sooner.

Even more interesting that on the same j1900, a movie hangs the system in 10 minutes. Can you share exactly how you are playing the movie? Does the movie still hang the system as fast if you have audio disabled? Yes, turbostat from your movie test is interesting.

Finally, the cutting edge is 4.9.stable plus Mika's patch from comment #683, so if you can find a way to make that fail, please share it.

@ João Paulo Rechi Vita

My experience was that my script could make 4.8 fail in under 30 minutes, but that when I tried 4.9, the 1st failure was at 24 hours, and another machine was still running at 30 hours. So I'm not surprised that my script didn't hang your 4.9 system after 10 hours. It would be interesting to know if your observations are consistent with mine -- does my script hang your system in under 30 minutes when you run 4.8?

Re: youtube

On the configuration that fails quickly with youtube, I'd be interested to know if it survives longer if sound is disabled. There is a known audio bug, with patch on the way, that may be independent of failures that occur when audio is not active.

Re: Len's demotion patch

I think it is obsolete, and not worth further testing. Your youtube turbostat output shows it made a dramatic difference in mc6 residency -- more than I've seen on other workloads. I spoke to the baytrail pcode author, who concurs that correctness and stability should not require enabling demotion. So the difference in stability with that patch is more likely a side-effect because demotion is simply making c6 less common.

Re: Mika's patch from comment #683

Running my nanosleep + glxgears script, I've not seen *any* failure with it.
My last report said it ran for 12 hours -- I let the n3540 and the j1900 continue running for 7 days and nights and they did not fail. That was based on 4.8.17 -- a baseline that without that patch would typically fail in under 20 minutes.

I'm now testing 4.9.3 + the same patch. Based on the fact that 4.9.3 was robust before the patch, I'm expecting it to be stable.

I will start experimenting with sound, movies, and youtube. I'm interested in hearing other's experiences with 4.9.stable + the patch from comment #683

Comment 697 Sebastian Heyn 2017-01-22 16:44:26 UTC

>It is interesting that with 4.8.0-distro your j1900 survives my test script
>for >36 hours, while my j1900 dies quite quickly. 

are you both using the same cpu microcode?

Comment 698 Len Brown 2017-01-23 05:52:55 UTC

@ Sebastian Heyn 

Good observation -- different microcode.
So today I updated the microcode to match, and re-tested.
I found the microcode version made no difference.

details:

Same CPU:

cpu family      : 6
model           : 55
model name      : Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
stepping        : 8

But the microcode was different:

Shev_84's ASRock Q1900DC-ITX w/ BIOS 03/31/2016:

microcode       : 0x831

Len's Acer Aspire XC-703G w/ BIOS 8/28/2014:

microcode       : 0x809

So to see if the older microcode is what makes the Acer less stable
than the ASRock, I updated the Acer XC-703G to BIOS 09/15/2015,
which also brought it up to microcode 0x831.

Re-testing vmlinuz-4.8.0-34-generic three times failed after
14, 8 and 15 minutes -- which is typical of the previous microcode.

I can't explain why my Acer XC-703G fails more easily
than Shev_84's ASRock when running just my nanosleep+glxgears script,
but now we know it has nothing to do with the microcode version.

Comment 699 dizzy 2017-01-23 10:32:19 UTC

Hi all,

I have been experiencing the same problems with my notebook too, since I switched my distro from Ubuntu 16.04LTS to Fedora 25. The hardware is a Toshiba Tecra R840-110 containing:
- Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz (Sandy Bridge, I think..)
- Network controller: Intel Corporation Centrino Advanced-N 6230 [Rainbow Peak] (rev 34) (will be important, see later)

Symptoms (1st iteration ;-)):
- the computer randomly freezes after about 5 minutes to a few (about 10) hours, regardless of whether I am using it or not. The computer stops responding completely (even magic SysRq doesn't work), and the CPU fan runs high a few seconds after the freeze
- kernel 4.8.6-300.fc25.x86_64

I applied the script c6off+c7on from Wolfgang Reimer (modifying it for SNB) and it definitely helped and the system was stable with no freezes during work (few hours).

The problem came after the update to kernel 4.9.X - the script stopped working and the computer was freezing regardless of whether the script had been applied or not, so I reverted my kernel back to version 4.8.6 (nice number, isn't it) and started to examine things a bit, and the following facts came up:
- the computer freezes randomly after a few minutes up to a few hours if the script has not been applied
- after applying the script you can work with the computer for many hours (5-6 hrs without a freeze), but if I leave it turned on and inactive (without working with it) IT WILL FREEZE IN 20-30 MINUTES (same symptoms).

So in addition to the c6 state problem, there is something happening after 20-30 minutes of inactivity which causes the computer to deadlock. Examining the log files, the last log entries before the freeze were (among others) from Network manager (refreshing DHCP leases, WIFI rekeying,...). So I tried shutting down Network manager, disabled wifi (as it was the only active network interface during all freezes), and unloaded its modules from the kernel (including all the wifi-stack-related modules) using rmmod, and voila - the computer survived more than 16 hours of inactivity without a freeze.

So the final workaround for me looks as follows:
- kernel 4.8.6
- Wolfgang Reimer's script (thanks a lot)
- inactive WIFI - maybe wifi without powermanagement, that will be a subject to further investigation

Last note to kernel versions:
- 4.4.0 - actually in Ubuntu 16.04LTS - works fine, with small issues (sometimes screen distorted, wifi malfunction after wakeup from standby) - NO FREEZES (in Ubuntu, not tested with Fedora)
- 4.8.6 - works after applying above workaround
- 4.9.X (4.9.3, 4.9.4 for example) - random freezes, even with Wolfgang's script. After disabling c_states (=0), crashes within 45 minutes (can be caused by WIFI,...)

Hope this helps someone. If you need further information/tests,... let me know.

R.

Comment 700 Len Brown 2017-01-23 17:57:47 UTC

@ dizzy

> Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

https://ark.intel.com/products/52229/Intel-Core-i5-2520M-Processor-3M-Cache-up-to-3_20-GHz

Yes, your system is a "Sandy Bridge".
As that 2011 processor is of the Core Architecture, rather than the Atom Architecture "Bay Trail", please file a bug specifically describing that failure.  I recommend that before you do, you run memtest overnight.

Comment 701 dizzy 2017-01-24 15:57:27 UTC

(In reply to Len Brown from comment #700)
> @ dizzy
> 
> > Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> 
> https://ark.intel.com/products/52229/Intel-Core-i5-2520M-Processor-3M-Cache-
> up-to-3_20-GHz
> 
> Yes, your system is a "Sandy Bridge".
> As that 2011 processor is of the Core Architecture, rather than the Atom
> Architecture "Bay Trail", please file a bug specifically describing that
> failure.  I recommend that before you do, you run memtest overnight.

Hi Len - I'm sorry - I was just thinking that if the HW is similar, the symptoms are similar, and even the workaround is similar, maybe someone around here would find this information helpful (I didn't know the bug is strictly for baytrail). But OK, I created another bug (https://bugzilla.kernel.org/show_bug.cgi?id=193261) as suggested (after a successful pass of memtest, of course ;-)).

Sorry again for spamming your bug...

Comment 702 Vincent Gerris 2017-01-24 22:13:38 UTC

Hi,

I put a patched 4.9.0 kernel (with the patch from Mika, on the latest ubuntu-zesty) up for the Ubuntu users here who want to try it:
https://www.dropbox.com/sh/c39et4hr6tgp60q/AAC35c56aOEOwkmhjdvtG6dsa?dl=0

I am running:
ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads$ ./byt.test | tee out.`date +%Y%m%d_%H%M%S`
Linux ubuntu-Lenovo-Yoga-2-11 4.9.0-mika-no-tweak-eval-th #3 SMP Tue Jan 24 10:10:26 CET 2017 x86_64 x86_64 x86_64 GNU/Linux
tis 24 jan 2017 22:37:53 CET
BOOT_IMAGE=/boot/vmlinuz-4.9.0-mika-no-tweak-eval-th root=UUID=6a53171b-c5f2-44a4-a69f-a08f38312a8c ro quiet splash vt.handoff=7
board_vendor:LENOVO
board_name:AIUU1
board_version:31900042STD
bios_date:08/19/2015
bios_vendor:LENOVO
bios_version:92CN93WW(V1.93)
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz
stepping	: 3
microcode	: 0x320
cpu MHz		: 1305.507
cache size	: 1024 KB
physical id	: 0
siblings	: 4
[    1.274218] intel_idle: MWAIT substates: 0x33000020
[    1.274220] intel_idle: v0.4.1 model 0x37
[    1.274527] intel_idle: lapic_timer_reliable_states 0xffffffff
state0/desc:CPUIDLE CORE POLL IDLE
state1/desc:MWAIT 0x00
state2/desc:MWAIT 0x58
state3/desc:MWAIT 0x52
state4/desc:MWAIT 0x60
state5/desc:MWAIT 0x64
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
glxgears -display :0 -fullscreen
turbostat --debug -i 100
[sudo] password for ubuntu: tis 24 jan 2017 22:38:53 CET
tis 24 jan 2017 22:39:53 CET
tis 24 jan 2017 22:40:53 CET
tis 24 jan 2017 22:41:53 CET
tis 24 jan 2017 22:42:53 CET
tis 24 jan 2017 22:43:53 CET
tis 24 jan 2017 22:44:53 CET
tis 24 jan 2017 22:45:53 CET
tis 24 jan 2017 22:46:53 CET

No problems yet.
I also tried my regular stress test at the same time (movie play over network with bluetooth and copy of same file) just to see if it hung and it did not yet.

Please grab the kernel to test as Len asked. 
Thanks Len for driving this, it's greatly appreciated!

I'll post my uptime later on, it looks good for now.

Comment 703 Len Brown 2017-01-25 07:50:58 UTC

Re: youtube or movie playback as a stress test

Can anybody share, exactly, how you play movies to provoke the failure faster than you can provoke it running my script?  I opened youtube surfed for movie previews and played them, and youtube seemed to move onto new videos, but when I came in the next day it had decided to stop streaming.  I thought it was hung, but wiggling the mouse prompted it to start playing again... anyway, it took me just under 24 hours to get a 4.8 kernel to fail using youtube on the Acer n3540, when it took < 15 minutes using my nanosleep+glxgears script.  So unless somebody has a recipe showing exactly how to play movies that fails quickly, I'm at a loss to reproduce and explain Shev_84's experience of movies failing quickly in comment #694.

Comment 704 Len Brown 2017-01-25 07:55:04 UTC

<typo in previous comment, that was an Acer-J1900, not an n3540, ie matched Shev_84's system as best as I could>

Comment 705 Mika Kuoppala 2017-01-25 09:10:45 UTC

Here is my test script, which has been rather effective. It usually takes less than an hour, and always less than 24h, to hang.

glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
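
Since these hangs leave nothing in the logs, a run like the one above is easier to evaluate if something records when the machine was last alive. A minimal sketch, not part of Mika's script: the `heartbeat` helper and the log path are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): append a timestamp to a log
# file and flush it to disk, so the last entry survives a hard freeze.
heartbeat() {
    log=$1
    date '+%Y-%m-%d %H:%M:%S' >> "$log"
    sync
}

# Usage sketch, commented out so that sourcing this file has no side effects:
#   glxgears >/dev/null 2>/dev/null &
#   mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4 &
#   while sleep 60; do heartbeat "$HOME/hang-test.log"; done
```

After a forced reboot, the last line of the log gives the time of the hang to within one minute.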

Comment 706 Len Brown 2017-01-25 09:13:05 UTC

Re: Asus T100

I finally got the Asus T100-TAM and T100-CHI installed with Ubuntu 16.10.  It seems the secret for the T100-TAM was to not try to dual boot with Windows, but to erase the hard drive.  The T100-CHI, amazingly, installed properly as a Windows dual-boot.  The T100-CHI requires a USB hub and some adapters to give it a network/kbd/mouse and USB thumb drive access.  I'm running both with usb/wired-ethernet.

Running my idle torture test script, I find both of these boxes to be delightfully unstable. Both systems with 4.8.0-34-generic hung on average in under 5 minutes.
And interestingly, they did not seem any more stable with Ubuntu's 4.9.5 ppa.

@ Vincent Gerris

Thanks for packaging the test kernel, I've kicked off a test of it on all 4 boxes.

Comment 707 Juha Sievi-Korte 2017-01-25 18:37:51 UTC

I've been now running 4.9.0-1 with Mika's patch (and also have the autodemotion patch applied at the moment). Just had a freeze in less than 24 hours with the byt.test script. I'll continue running to see if there is as much deviation on the times to freeze as previously.

Turbostat output from near the failure:

Wed Jan 25 19:36:59 EET 2017
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IRQ	SMI	CPU%c1	CPU%c6	Mod%c6	CoreTmp	GFX%rc6	GFXMHz	Pkg%pc6	PkgWatt	CorWatt
	-	-	112	11.02	1012	2167	266841	0	14.60	74.38	30.30	49	**.**	396	11.35	0.41	0.16
	0	0	120	12.15	986	2167	76451	0	16.90	70.95	30.94	49	**.**	396	11.35	0.41	0.16
	1	1	93	8.71	1065	2167	51852	0	12.34	78.95	30.94	49
	2	2	134	12.46	1076	2167	66838	0	13.31	74.23	29.66	49
	3	3	100	10.76	927	2167	71700	0	15.86	73.38	29.66	49

So not a big improvement for me on N3540.

Comment 708 Pilot_6572n 2017-01-25 22:03:21 UTC

A quick report on what I experienced with the Qotom Bay Trail Z3735F motherboard I used.

With Ubuntu (16.04 - linuxium), the BIOS was altered: an 'ubuntu' line was permanently added to the BIOS boot menu option list.

I observed severe crashes and regular freezes that arrived so quickly that I was unable to load any repair program or the test scripts mentioned above.

With Jessie, I observed the same.

Upgrading the BIOS with the microcode obtained from the supplier didn't change much. The 'ubuntu' option in the BIOS disappeared, but the continuous crashes were still present once Ubuntu was reloaded.

As this is for production, I swapped the Bay Trail Qotom motherboard for a Braswell one (Asrock - N3700 - slightly larger, but with a processor consuming about the same power (6 watts vs 2)).

All seems fine so far, running some real-time GPS and video programs (navit, gnuplot and so on) 24 hours a day with a 'regular' Ubuntu 16.04 or with Jessie (4.8).

Now I am testing the latest Fedora and all looks fine.

So I am dropping the Bay/Cherry Trail for the Braswell with, I hope, no return.

Comment 709 Len Brown 2017-01-26 04:46:07 UTC

Created attachment 253151 [details]
pstate.set script

@ Juha Sievi-Korte

How about if you first run the attached script to configure frequency to the max:

./pstate.set max

Does that cause the failure to occur sooner?

If it can shorten the time to failure, what do you see when you then boot with intel_idle.max_cstate=2 ?  (That will enable C6NS, but will not enable C6Shrink -- so you will see core-c6 residency, but no module-c6 or package-c6.)
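
On Ubuntu, a boot parameter such as intel_idle.max_cstate=2 is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by `sudo update-grub` and a reboot. A minimal sketch of that edit: the `add_kernel_param` helper is hypothetical, and it assumes GNU sed and a double-quoted, single-line value.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file.
# Assumption: the value is double-quoted on one line, as in Ubuntu's default file.
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Skip if exactly this parameter is already on the line. A different
    # value for the same option is not detected and must be removed by hand.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" "$file"
}

# Usage sketch (needs root; commented out to avoid side effects):
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param intel_idle.max_cstate=2    # run as root
#   sudo update-grub && sudo reboot
#   cat /proc/cmdline                           # confirm after reboot
```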

Comment 710 Len Brown 2017-01-26 05:49:59 UTC

I've tested Vincent Gerris' 4.9+mika kernel on 4 systems, and 3 of them had no issue after 18 hours:

Dell n3540 laptop (wireless)
Acer j1900 desktop (wired net)
Asus T100-TAM convertible (USB/wired net)

But the 4th system, an Asus T100-CHI, fails reliably in under 10 minutes of testing with this kernel, just like it did for un-patched 4.8 and 4.9.

So I booted the T100-CHI with "intel_idle.max_cstate=2" (enables Core-C6, but disables module/package-C6) and it ran over 18 hours without a problem.
Rebooted with intel_idle.max_cstate=3, and it failed again after 3 minutes.

Right now it is running with intel_idle.max_cstate=0, which boots in ACPI mode. Here the C-states are MWAIT 0 (C1), MWAIT 0x51 (CC6), and MWAIT 0x64 (C7s). While Linux does make requests for C7s, those appear to be demoted all the way to CC6, as there is no mc6 or pc6 residency. ie. this is behaving exactly like the "intel_idle.max_cstate=2" case, and so I expect it to still be running when I come in tomorrow...

It is unclear why the other 3 systems do not see this, especially the T100-TAM, which is extremely similar. Indeed, the list of differences between the TAM and the CHI is very short. They have an identical SOC: Z3775 @ 1.46GHz, ucode 0x832, same INT33FD Crystal Cove PMIC. The systems include different wireless, but I'm not using wireless -- instead I'm using the same USB/wired-ethernet on both.

The T100-TAM has a 1366x768 display, the T100-CHI has a 1920x1200 display. I don't know if the display difference might be related.

Then there is the possibility that my T100-CHI has some unit defect. But I'm going to assume that is not the case, unless other T100-CHI test results differ.
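
The idle states a given boot actually exposes (intel_idle versus ACPI mode, as compared above) can be read from the cpuidle directory in sysfs, which is where lines such as "state1/desc:MWAIT 0x00" in the byt.test output come from. A minimal sketch; the `list_cstates` helper name is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print "stateN name: desc" for each idle state of one CPU.
# The directory argument defaults to cpu0's cpuidle directory in sysfs.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for s in "$dir"/state*; do
        [ -r "$s/name" ] || continue
        printf '%s %s: %s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/desc")"
    done
}

# Usage sketch:
#   cat /sys/devices/system/cpu/cpuidle/current_driver   # intel_idle or acpi_idle
#   list_cstates
```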

Comment 711 jbMacAZ 2017-01-26 07:01:14 UTC

I've been chasing some unrelated problems with my CHI.  I've found a couple bad commits which fixed my touchscreen and might fix the wifi soft freeze.  I'm currently testing 4.10-rc5 with T100 patches/.config and Mika's patch.  So far running 11 hours without freeze.  I can post that kernel if that would be useful, it should also work on the T100TAM.  

Another big difference between the T100T* and the T100CHI is the keyboard connection.  The CHI keyboard/touchpad is bluetooth and the TAM is hardwired. 

The CHI seems to be the most freeze prone in the Asus T100 baytrail family.  A few minutes until freeze is not unusual without some kind of c-state limit, though 30 minutes is what I've seen with the newest kernels.  Len's freeze rate reminds me of what I saw with the 4.2 kernel series.

Comment 712 amjafuso 2017-01-26 08:24:20 UTC

@len

not sure if this helps to narrow down the problem...

Before setting intel_idle.max_cstate=1 to solve freezes on my system, turning off hardware acceleration was the way to go:

http://sparkylinux.org/forum/index.php/topic,3296.msg7132.html#msg7132

Settings done: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1453298/comments/147

I also had to turn off hw acceleration in every browser.

Comment 713 Shev_84 2017-01-26 09:22:45 UTC

I've done some more tests on my machine.

It turns out that playing a movie isn't the main factor in getting a freeze; I must play it in a specific way. E.g., when I start the movie with this command:
cvlc --quiet --x11-display :0 -f -L ~/Temp/cstate-test/test.mkv &

It can run all day long and nothing happens.

But when I play the same movie in Kodi (16.1, the current stable version), the freeze hits in about 10-30 minutes (twice I've reached almost 2 hours).

Here are sample outputs of turbostat just before hangs:
śro, 25 sty 2017, 00:25:02 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       54      4.01    1334    2000    74894   0       13.07   82.92   65.44   58      **.**   354     41.45   0.78    0.22
        0       0       48      3.63    1333    2000    15576   0       14.82   81.55   67.97   56      **.**   354     41.45   0.78    0.22
        1       1       53      4.00    1333    2000    17137   0       9.28    86.72   67.97   56
        2       2       57      4.30    1334    2000    21421   0       11.67   84.02   62.92   58
        3       3       55      4.12    1334    2000    20760   0       16.50   79.38   62.92   58

###

śro, 25 sty 2017, 18:56:35 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       51      3.86    1333    2000    73357   0       12.61   83.53   66.63   58      **.**   208     43.31   0.77    0.22
        0       0       56      4.18    1333    2000    22495   0       15.73   80.09   63.09   56      **.**   208     43.31   0.77    0.22
        1       1       55      4.13    1333    2000    21941   0       12.35   83.53   63.09   56
        2       2       50      3.73    1333    2000    17085   0       10.00   86.27   70.16   58
        3       3       45      3.40    1333    2000    11836   0       12.37   84.23   70.16   58

###

śro, 25 sty 2017, 19:23:37 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       51      3.86    1333    2000    72826   0       13.02   83.12   65.43   57      **.**   396     41.86   0.77    0.22
        0       0       45      3.36    1333    2000    15717   0       15.25   81.38   62.70   56      **.**   396     41.86   0.77    0.22
        1       1       46      3.48    1334    2000    16001   0       14.97   81.55   62.70   56
        2       2       59      4.39    1333    2000    22710   0       8.46    87.15   68.17   57
        3       3       56      4.20    1333    2000    18398   0       13.41   82.40   68.17   57

###

śro, 25 sty 2017, 19:49:19 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       49      3.69    1334    2000    72568   0       13.22   83.09   66.29   57      **.**   396     41.71   0.77    0.22
        0       0       50      3.76    1333    2000    23108   0       16.20   80.04   56.57   56      **.**   396     41.71   0.77    0.22
        1       1       48      3.61    1333    2000    21970   0       21.68   74.72   56.57   56
        2       2       49      3.70    1335    2000    16326   0       7.02    89.28   76.00   57
        3       3       49      3.69    1334    2000    11164   0       7.99    88.32   76.00   57

###

śro, 25 sty 2017, 23:26:31 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       49      3.69    1333    2000    72763   0       14.25   82.06   64.90   58      **.**   854     39.66   0.78    0.22
        0       0       51      3.79    1333    2000    23751   0       17.92   78.28   52.58   56      **.**   854     39.66   0.78    0.22
        1       1       47      3.55    1333    2000    23495   0       25.04   71.41   52.58   56
        2       2       53      3.96    1333    2000    14315   0       6.18    89.86   77.21   58
        3       3       46      3.46    1333    2000    11202   0       7.85    88.69   77.21   58

###

And here is output of turbostat while currently running byt.test script with glxgears and nanosleep for almost 10 hours now:

czw, 26 sty 2017, 10:12:42 CET
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       61      4.47    1354    2000    243983  0       7.93    87.59   56.23   57      5.10    187     22.64   0.63    0.22
        0       0       53      3.93    1341    2000    25244   0       3.42    92.66   73.34   55      5.10    187     22.64   0.63    0.22
        1       1       52      3.74    1401    2000    25351   0       3.60    92.66   73.34   55
        2       2       74      5.55    1337    2000    104319  0       13.20   81.25   39.13   57
        3       3       63      4.68    1345    2000    89069   0       11.52   83.81   39.13   57


Of course, setting intel_idle.max_cstate=1 solves the freezes on my machine. Now I need to find out how to apply the patches attached in this thread to the Ubuntu kernel. Then I can do some more tests.
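
Turbostat tables like the ones above are easier to compare across runs when a single column is pulled out of the summary row (the row whose Core field is "-"). A minimal sketch using awk; the `turbostat_col` helper is hypothetical and assumes the whitespace-separated layout shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper: read turbostat output on stdin and print the value of
# one named column from each summary row (the row starting with "-").
turbostat_col() {
    awk -v col="$1" '
        # Header row: remember the position of the requested column.
        $1 == "Core" { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        # Summary row: print that column.
        idx && $1 == "-" { print $idx }
    '
}

# Usage sketch, with a log captured earlier (the file name is an example):
#   turbostat_col 'Pkg%pc6' < turbostat.log
```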

Comment 714 Vincent Gerris 2017-01-26 20:16:50 UTC

@Len Brown, thank you, happy to help out :).
@Shev_84 check my previous post for a prepatched 4.9 kernel (Mika's patch only, as Len would like to see tested).

This command has been running for about 47 hours:
ubuntu@ubuntu-Lenovo-Yoga-2-11:~/Downloads$ ./byt.test | tee out.`date +%Y%m%d_%H%M%S`
Linux ubuntu-Lenovo-Yoga-2-11 4.9.0-mika-no-tweak-eval-th #3 SMP Tue Jan 24 10:10:26 CET 2017 x86_64 x86_64 x86_64 GNU/Linux
tis 24 jan 2017 22:37:53 CET
BOOT_IMAGE=/boot/vmlinuz-4.9.0-mika-no-tweak-eval-th root=UUID=6a53171b-c5f2-44a4-a69f-a08f38312a8c ro quiet splash vt.handoff=7
board_vendor:LENOVO
board_name:AIUU1
board_version:31900042STD
bios_date:08/19/2015
bios_vendor:LENOVO
bios_version:92CN93WW(V1.93)
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 55
model name	: Intel(R) Pentium(R) CPU  N3520  @ 2.16GHz
stepping	: 3
microcode	: 0x320
cpu MHz		: 1305.507
cache size	: 1024 KB
physical id	: 0
siblings	: 4
[    1.274218] intel_idle: MWAIT substates: 0x33000020
[    1.274220] intel_idle: v0.4.1 model 0x37
[    1.274527] intel_idle: lapic_timer_reliable_states 0xffffffff
state0/desc:CPUIDLE CORE POLL IDLE
state1/desc:MWAIT 0x00
state2/desc:MWAIT 0x58
state3/desc:MWAIT 0x52
state4/desc:MWAIT 0x60
state5/desc:MWAIT 0x64
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
nanosleep 1000
glxgears -display :0 -fullscreen
turbostat --debug -i 100
[sudo] password for ubuntu: tis 24 jan 2017 22:38:53 CET
tis 24 jan 2017 22:39:53 CET -
...
...
until...
tor 26 jan 2017 21:06:27 CET

No freezes yet. So for me this seems stable so far.
I will do further testing and report if anything suspicious or interesting happens.
Thanks to everybody for the serious and committed bug hunting and for not giving up :). cheers

Comment 715 Vladislav 2017-01-27 13:00:40 UTC

I also had this issue; now, with the linux-rt-manjaro kernel (4.9), my N3540-based laptop has been working for more than a week without freezes.

Comment 716 amjafuso 2017-01-27 17:47:59 UTC

I have used kernel 4.9 for almost a month and never had any freezes. The script in comment #663 didn't force a freeze either.

Half an hour ago my system froze for the first time! No heavy load, only Firefox (no video), thunar and some sshfs connections. After a reboot I started Firefox, and half of the window was black (I had never seen that before). I switched from "full size" to "custom size" and back, and it instantly froze again!


After a second boot dmesg shows me red entries I have not seen before:


[    5.663123] intel_soc_dts_thermal: request_threaded_irq ret -22
...
...
...
[    6.766407] [drm:valleyview_pipestat_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun


Useful?

Comment 717 Jochen Hein 2017-01-29 13:28:24 UTC

I'm now running 4.9.5 with Mika Kuoppala's patch from #683.
System is stable and was playing local video files without problems.
4.9.5 without the patch hung within minutes in kodi.
No max_idle or other workarounds applied.

I'd like to see Mika's patch submitted for inclusion, even if
it fixes only part of the problems.

Comment 718 Vincent Gerris 2017-01-29 21:32:40 UTC

Hi,

I ran more tests with Mika's patch.
The previous test run again without problems.

However, when I played video (with audio over bluetooth) I got a freeze again.
At some point I even got a freeze while I didn't play video, but just copied a file over wifi from a smb share.
So no sound or video played, although bluetooth was connected.

Perhaps this is a wifi issue as well? Or something with bluetooth.

I will try to see if I can pin this down further, although that may take some time.

It seems like a good idea to get Mika's patch in if it fixes freezes for many people.

Comment 719 luke 2017-01-30 08:13:19 UTC

> Perhaps this is a wifi issue as well? Or something with bluetooth.

My system is an Asrock Q1900-ITX with no wifi or bluetooth. I was experiencing these freezes under Linux anywhere from every 12 hours to every 2 weeks, depending on the kernel. Since switching to Windows 10, I've had an uptime of over 2 months now in the same role, a home media server.

Comment 720 Gabriel7340 2017-02-01 14:40:49 UTC

@Len Brown Can you please answer a little question? I am confused. Is it possible to have the same issue on Windows 10? My processor is an Intel® Bay Trail-M Quad Core Pentium N3540 Processor.

The notebook ( http://www.asus.com/pt/Notebooks/X552MJ/specifications/ ) freezes constantly when idle or when I watch a movie.

I can't disable C-states in the BIOS (no option) and I can't find an option in Windows 10 to disable them or to prevent the system from going into a deep idle state.

I'm wondering whether a custom BIOS might be the best way to disable C-states.

Do you have any information about problems related to this processor on Windows 10? The symptoms are the same; if I change to Linux I have to use the max_cstate flag in order to have a stable system.

Thank you.

Comment 721 Juha Sievi-Korte 2017-02-01 16:26:56 UTC

(In reply to Len Brown from comment #709)
> Created attachment 253151 [details]
> pstate.set script
> 
> @ Juha Sievi-Korte
> 
> How about if you first run the attached script to configure frequency to the
> max:
> 
> ./pstate.set max
> 
> Does that cause the failure to occur sooner?
> 
> If it can shorten the time to failure, what do you see when you then boot
> with intel_idle.max_cstate=2 ?  (That will enable C6NS, but will not enable
> C6Shrink -- so you will core-c6 residency, but no module-c6 or package-c6.)

Thanks Len,

An update: I continued with the same test set after the latest freeze, and it has now been running for a week without a freeze on the same configuration. So it seems Mika's patch did make a huge difference after all, and I was just very unlucky to get such a quick freeze on the first try (and my bad for not verifying that the result was repeatable before commenting here again :)

I'll continue to experiment, if I find a way to reproduce this with the patch applied.

So my N3540 seems now relatively stable with byt.test.

Comment 722 Prashant Poonia 2017-02-04 14:10:36 UTC

(In reply to Gabriel7340 from comment #720)
> @Len Brown Can you please answer a little question? I am confused. It's
> possible to have the same issue on Windows 10? My processor is an Intel® Bay
> Trail-M Quad Core Pentium N3540 Processor.
> 
> The notebook( http://www.asus.com/pt/Notebooks/X552MJ/specifications/ )
> freezes constantly when idle or when I watch a movie.
> 
> I can't disable C-States on Bios ( no option ) and I can't found some option
> on windows 10 to disable or prevent system going into deep idle state.
> 
> I'm thinking if a custom bios can be the best solution to disable c-states. 
> 
> Do you have some information about problems related to this processor on
> windows 10? The symptons are equal, if I change to linux I have to use the
> max_cstate flag in order to have a stable system. 
> 
> Thank you.

Strange, I have an Asus X553MA and I have been using Win10 for the past few months, and my laptop has frozen no more than 2 times. I have 2 GB of RAM, so I can associate those freezes with high RAM usage. My netbook is very stable with Windows 10; make sure you have the latest BIOS from the Asus website. My BIOS version is v214.
My laptop specs
n3540
2gb ddr3l ram 1333mhz 1.35volts
500gb 5200rpm hdd
windows 10 x64 1607

Comment 723 jbMacAZ 2017-02-04 18:57:16 UTC

Yes this c-state issue does affect windows, IMO.  I have 4 systems with windows that have frozen exactly the same way (display frozen, inputs unresponsive, needs hard reset to recover, no obvious reason).  One has frozen once several months ago(i5-6400 skylake) and the other 3 are N3540's baytrail which freeze weekly to monthly.  When 3 of these systems run linux with a c-state bandaid, they don't freeze (The other only has windows...)  Freeze rates on windows are infrequent for me, but the processor is only 1 part of the problem.  Hardware implementation matters as does OS version.

But this is a linux baytrail cstate bug.  With Mika's patch, I haven't had a hard freeze running 4.9.7 or 4.10-rc5 during the last week and a half.  4.10-rc6 has a wifi regression that halts testing after about 12 hours (unrelated soft freeze).  These results with Z3775 baytrail quad core.

Comment 724 Len Brown 2017-02-06 21:19:37 UTC

Yes, on 25-Jan, Mika submitted the patch in comment #683
to the i915 driver owners:

https://lists.freedesktop.org/archives/intel-gfx/2017-January/117932.html

If all goes well, I would expect it to go into the Linux-4.11 merge window.
Ideally, it will then get back-ported to the .stable kernels.

Comment 725 Elmar Melcher 2017-02-07 11:04:41 UTC

kernel 4.9.5 with patch from comment #683 and from https://github.com/burzumishi/linux-baytrail-flexx10/tree/master/kernel/patches/v4.8 patch 0001, 0006, 0008 on Z3735G, command line tsc=reliable only, daily use during 2 weeks, no freeze.
Updated spreadsheet.

Comment 726 Martin 2017-02-10 19:33:17 UTC

As one of the people who arrived at 8fb55197e64d5988ec57b54e973daeea72c3f2ff while bisecting (comment 276), I can confirm that 4.9.0 with Mika's patch would have been voted good! Uptime 3 days 8 hours and counting. Typical load: HTPC (recording and watching HD TV using MythTV).

Thanks everybody that helped cranking this patch out!

Comment 727 Shev_84 2017-02-11 11:11:03 UTC

I've applied Mika's patch from comment 683 to current Ubuntu kernel, and it seems to work fine without cstate set to 1.
$ uname -a
Linux panda 4.8.0-37-generic #39com683 SMP Thu Feb 9 12:54:37 CET 2017 x86_64 x86_64 x86_64 GNU/Linux

All day I ran movies in Kodi, then I ran some transcoding of a stream grabbed from a DVB-S2 tuner.

Output of Len's turbostat at the end of test:
        Core    CPU     Avg_MHz Busy%   Bzy_MHz TSC_MHz IRQ     SMI     CPU%c1  CPU%c6  Mod%c6  CoreTmp GFX%rc6 GFXMHz  Pkg%pc6 PkgWatt CorWatt
        -       -       220     13.95   1578    2000    38888   0       1.26    84.79   70.26   61      6.39    854     48.10   1.55    0.33
        0       0       110     6.84    1607    2000    6408    0       0.83    92.33   83.21   59      6.39    854     48.10   1.55    0.33
        1       1       103     6.63    1558    2000    9220    0       1.41    91.97   83.21   59
        2       2       330     19.08   1730    2000    13646   0       1.50    79.42   57.31   61
        3       3       337     23.25   1449    2000    9614    0       1.31    75.44   57.31   61

For now it looks stable, hope it is solved for good.

Comment 728 frr 2017-02-13 14:09:45 UTC

Dear gentlemen, thanks for this exquisite snippet of detective work.

I don't want to be specific about HW models, but over the past month or so, I've been struggling with one particular machine (model - a whole batch, systematic problem) exhibiting the exact symptoms: video playback in Ubuntu 16.04 or Debian Jessie 8.7, in Totem, would result in freezes (video frozen on the screen, whole system locked up). Under Unity+Compiz, the looped-around playback of a short x264 HD movie trailer would lock up within half an hour, under xfce4 with compositing it would take 2-3 hours, xfce4 without compositing would run for several hours (but would typically freeze within a day). Actually the video playback was just a reliable/accelerated problem reproduction method - in real life, the machine would freeze with some simple 2D GUI app within a day or two. I even tried improving Vcore blocking on the boards, which didn't help.

Finally I found this bug report, while fumbling for a way to speculatively over-volt the cores a bit using a manual intervention into EIST...
and I haven't tried patching the kernel yet, but the workaround with max_cstate=1 does seem to have the desired effect. Up for 30 hours and cranking away at the video loop.

Actually I had two sibling models on the table, both in several specimen, with an almost identical motherboard, the only difference being the CPU: Celeron N2807 (dual core, working fine no matter what) vs. Celeron N2930 (quad core, freezing no matter what). Over the last two years I also tried some machines with the Celeron J1900 where the problem does not occur... All of these are "industrial"/embedded PC machines from two vendors (plus the odd ITX board from Gigabyte if memory serves).

Comment 729 Hanno Zulla 2017-02-17 08:14:12 UTC

First of all, thanks a lot.

What is the status of the patch by now? Reading https://lists.freedesktop.org/archives/intel-gfx/2017-January/117932.html it appears that it wasn't accepted.

Comment 730 Olivier 2017-02-17 10:01:21 UTC

We are encountering similar total freeze issues on 16.04.1 (kernel 4.4.0-x, some 4.4.0-62) NUC6i5 devices (Skylake), so not exactly low power like most CPUs in this thread.

We have a hundred of these devices deployed in the field, and they are randomly freezing (at least 25 devices have frozen already), we don't have physical access to many of them (requires special technical intervention).

A few very strange things:

- The freezes do not produce any log (no kernel panic/crash).
- The freezes aren't reproducible easily (do not happen every day) but they always happen exactly 4 hours after boot (our devices reboot daily @05h05), freezes always happen around 09h05.
- We have a hundred of NUC6i3 devices out there with the exact same setup that are not having these freezes.

The devices are unattended POS devices that mostly play webapps (chromium/electron) or video (mpv) (auto-logged in without interaction).

We're going to try the cstate boot flag to see if it fixes things.

I would be extremely grateful if anyone would have an idea on how we could debug this a bit more.


Related askubuntu issue with logs/more info: http://askubuntu.com/questions/884099/troobleshoot-16-04-inexplicable-total-system-freeze-4-hours-after-reboot-on-seve

Comment 731 RussianNeuroMancer 2017-02-18 08:29:40 UTC

Hi, Olivier 
Please read comment #700. You need to file a separate bug report about your issue, because this one is about BayTrail, not Skylake.

Comment 732 Vincent Gerris 2017-02-19 17:24:08 UTC

After many hours of trying, I made great progress in getting the Mika patched kernel to hang on my Lenovo Yoga 2 11 with N3520 on latest 1.93 BIOS :)!
Never thought I'd be so happy crashing my computer.

I can now CONSISTENTLY hang my computer as follows:
- (re)boot without AC power
- start file copy from smb share to local folder with nautilus (310 mb)
- wait, or trigger by unloading bluetooth driver with:
sudo modprobe -r btusb
(or when not loaded: sudo modprobe btusb && sudo modprobe -r btusb)

Now for some more detail, I tried to find influencing factors and these are some.
- With AC power connected, it barely happens (1 out of about 10 times it did)
- on battery power, using an external USB (rt2573) and the internal wifi not connected, I could not trigger this
- when max cstate is one, this does NOT happen
- one time after using a USB, I could not trigger this, but after one more I did.
- sometimes the driver fiddling is not needed and the hang occurs without it

To be very sure the cstate parameter is in play, I kept running the same kernel and rebooting and repeating the procedure, it never hung with the parameter enabled (once with 3 reboots, once with one) and consistently hangs without the parameter.

The kernel is the 4.9.0 patched that I put on dropbox.
Another side note is that sometimes the bluetooth driver does not load after boot, this does not seem to have much influence on the procedure.
What I see in dmesg when that is the case is:
[ 8.604187] Bluetooth: hci0 command 0x1001 tx timeout
[ 16.604447] Bluetooth: hci0: BCM: Reading local version info failed (-110)

There are a few great things about this:
1. I can consistently reproduce the error
2. cstate kernel parameter dependent

The non consistent behavior may be caused by the firmware saving different things?
Also seems clearly related to power management/wifi (not sure if chip or power related).

@Len Brown let me know if I can supply any info that may be useful.
I will not update the laptop and it is dedicated to identify this issue, so happy to make some more time to nail it down.

Comment 733 River Zhou 2017-02-19 17:26:35 UTC

Did anybody try CONFIG_PREEMPT=y and CONFIG_HZ=1000?
On my Lenovo Miix 2 8 (BayTrail Z3740), it sometimes makes the system very slow.
I use the 4.4.49 kernel + Mika's patch, with no cstate set.
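
To compare settings like these between machines, the running kernel's build configuration can be read from the config file that Ubuntu installs under /boot. A minimal sketch; the `kconfig_value` helper is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one option from a kernel config file.
# Prints nothing if the option is unset ("# CONFIG_FOO is not set") or absent.
kconfig_value() {
    name=$1
    file=${2:-/boot/config-$(uname -r)}
    sed -n "s/^${name}=//p" "$file"
}

# Usage sketch:
#   kconfig_value CONFIG_PREEMPT
#   kconfig_value CONFIG_HZ
```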

Comment 734 Hans de Goede 2017-02-21 08:07:24 UTC

(In reply to Vincent Gerris from comment #732)
>  - on battery power, using an external USB (rt2573) and the internal wifi
> not connected, I could not trigger this

But you were still using the internal bluetooth, right ?

So this seems to point to a problem with the sdio wifi. I think this means we may still need the patches to force the CPU to not enter C4/C5 when mmc is active which have been used by various baytrail users in the past:

https://github.com/hadess/rtl8723bs/tree/master/patches_4.5

Some patches have been merged to fix this, but IIRC their commit msg mentioned those patches might just make it harder to trigger the problem.

While working on some cherrytrail issues I rebased those patches to a recent upstream kernel (not the latest, but a recent one) I've saved those rebased patches in case we would need them again, I've uploaded them here:

https://fedorapeople.org/~jwrdegoede/trail-mmc/

It would be good to build a kernel with those and see if that fixes your reproducible bug.

Comment 735 Michaël 2017-02-21 10:06:10 UTC

I compiled Mika's i915 modification into a module. This helped stability, but my system still froze after 5 days, with no log as usual.  During these 5 days, the machine was sitting mostly idle, with activity coming only from background tasks (ownCloud, mostly).

On Acer TravelMate 115, Intel(R) Pentium(R) CPU  N3540  @ 2.16GHz, no crash with max_cstate=1.

Comment 736 Nicolas Porcel 2017-02-22 23:17:50 UTC

Has anyone tested the new DRM_I915_CAPTURE_ERROR option in 4.10?

I don't know the effect, but I just saw it after upgrading to 4.10 and it might be interesting to try it.

Comment 737 Len Brown 2017-02-24 15:31:45 UTC

@ Michaël

Are you saying that your N3540 hangs, even with intel_idle.max_cstate=1 ?
If yes, does it pass an overnight memtest, or a thorough soaking
of stressapptest?

Comment 738 Michaël 2017-02-24 16:07:13 UTC

@Len Brown: Sorry if this wasn't clear: this was *without* max_cstate=1.  With max_cstate=1, it is, and always has been, perfectly stable.  I have a ~2.5day average uptime with Mika's patch, which is pretty much the same as without.  `modinfo` does report that i915 is using the patched module.

Comment 739 Len Brown 2017-02-24 16:19:53 UTC

@ Michaël 

Thanks for the clarification.

Please test with Mika's patch plus intel_idle.max_cstate=2.  Per comment #710, the difference from intel_idle.max_cstate=1 is that now core C6 will be ENABLED.  In common with intel_idle.max_cstate=1, Module and Package C6 will continue to be disabled.

On the only machine I have where Mika's patch is not sufficient for 100% stability (the T100-CHI) this works for me.
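
Before comparing results for intel_idle.max_cstate=1 and =2, it can help to confirm which value the running kernel was actually booted with. The following is a minimal sketch, not something posted in this thread: the function name get_max_cstate is made up for illustration, and it only reads the kernel command line from /proc/cmdline. (On Ubuntu the parameter is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot.)

```shell
# Hypothetical helper: print the intel_idle.max_cstate value found in a
# kernel command line file (default: /proc/cmdline), or "unset" if the
# parameter is absent. If it is given more than once, the last one is shown.
get_max_cstate() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split the single command line into one word per line, then keep only
    # the text after "intel_idle.max_cstate=".
    value=$(tr ' ' '\n' < "$cmdline_file" \
        | sed -n 's/^intel_idle\.max_cstate=//p' \
        | tail -n 1)
    echo "${value:-unset}"
}
```

Calling get_max_cstate with no argument inspects the running system; passing a file name makes it easy to try on a saved copy of a command line.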

Comment 740 jbMacAZ 2017-02-25 01:23:50 UTC

(In reply to Len Brown from comment #739)
> 
> On the only machine I have where Mika's patch is not sufficient for 100%
> stability (the T100-CHI) this works for me.

I haven't had a freeze on my CHI since using only Mika's patch and "tsc=reliable clocksource=tsc" for kernel args.  That said, I do have some non-default .config settings and a handful of ASUS device specific patches.

The most serious outstanding bug (with a .config workaround) is bugzilla#150881 and it affects other T100's.  Wifi issues also hamper stable operation although that appears to be fixed as of 4.9.11+ (but not 4.10.0)  

Let me know if you want patches, .config or built kernels to evaluate on your CHI.

Comment 741 Vincent Gerris 2017-02-27 07:32:08 UTC

I have some interesting observations.
@Hans de Goede : if you mean with using the internal bluetooth that the driver was loaded, yes. It was not doing anything like streaming audio or anything else.

I tried your patches on the same 4.9.0 kernel with the Mika patches but I still get the freeze.

Some interesting observations:
 - sometimes it takes a while for a freeze: when this happens, the file copy speed showing in nautilus goes down gradually from around 3,5 mb/s to a few hundred kb/s and then it hangs.

 - to test driver influence I blacklisted btusb and then tried: once when not on AC power I got a hang after a few seconds of file copy(without loading or unloading the driver), once on AC the copy speed reduced a lot but I got no freeze

 - once with the btusb driver blacklisted again, I got no issue, until I loaded and unloaded the driver

Being off AC power seems to be a big factor. Isn't there an ACPI event triggered when the power is unplugged? Maybe that has an influence.

When I look at top when file copy gets slow, there is no significant use of resources. It looks like the kernel slowly hangs itself up.

I hope this helps to get an idea of where to find the issue.
Let me know if I can test anything more.
thank you!
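
When repeating the btusb experiments above, it is easy to lose track of whether the driver is currently loaded. Here is a small sketch for checking that; the function name module_loaded is hypothetical, and it simply looks for the module name in the first column of /proc/modules. How the blacklisting was done is not stated in the comment; one common way is a line reading "blacklist btusb" in a file under /etc/modprobe.d/.

```shell
# Hypothetical helper: succeed (exit status 0) if the named module is
# listed in /proc/modules, fail (exit status 1) otherwise.
# Arguments: module name, optional alternative modules file.
module_loaded() {
    modules_file="${2:-/proc/modules}"
    # The first field of each line in /proc/modules is the module name.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

For example: module_loaded btusb && echo "btusb is loaded" || echo "btusb is not loaded"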

Comment 742 Len Brown 2017-02-28 03:13:58 UTC

Created attachment 254971 [details]
Mika v3: drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3

Please test this patch.

drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3

This patch is expected to have the same function as the previous version, which was attached to comment #683.

It should apply, with offset, from Linux-4.2 through Linux-4.10, and now 4.11-merge/rc.

If we can supply some "Tested-by: " tags, that may help Mika get permission to ship this patch with the i915 tree.

Comment 743 Laszlo Fiat 2017-02-28 18:40:31 UTC

(In reply to Hans de Goede from comment #734)
> (In reply to Vincent Gerris from comment #732)

[snip]

> So this seems to point to a problem with the sdio wifi. I think this means
> we may still need the patches to force the CPU to not enter C4/C5 when mmc
> is active which have been used by various baytrail users in the past:
> 
> https://github.com/hadess/rtl8723bs/tree/master/patches_4.5
> 
> Some patches have been merged to fix this, but IIRC their commit msg
> mentioned those patches might just make it harder to trigger the problem.
> 
> While working on some cherrytrail issues I rebased those patches to a recent
> upstream kernel (not the latest, but a recent one) I've saved those rebased
> patches in case we would need them again, I've uploaded them here:
> 
> https://fedorapeople.org/~jwrdegoede/trail-mmc/
> 
> It would be good to build a kernel with those and see if that fixes your
> reproducible bug.

Hello,

I wrote at [1], that I think that most of the old MMC patches are not needed for kernels 4.7 and above as Adrian Hunter mainlined a patch [2].

We do need a new version of [3], because [4] removed the basis of that patch. If this is not applied, we get IRQ 187 Nobody cared [5]. A new version of [3] with the partly reversed [4] (plus a few other Baytrail related patches) is at [1]. But a proper mainlined solution would be great.

[1]: https://github.com/hadess/rtl8723bs/issues/76#issuecomment-234706390
[2]: https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/mmc?id=6e1c7d6103fe7031035cec321307c6356809adf4
[3]: https://github.com/hadess/rtl8723bs/blob/master/patches_4.5/0002-mmc-sdhci-get-runtime-pm-when-sdio-irq-is-enabled.patch
[4]: https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=15e82076a0edbebedbe12652b4ad8f1d93bcb7fe
[5]: https://github.com/hadess/rtl8723bs/issues/76#issuecomment-227532284

Comment 744 jbMacAZ 2017-03-01 03:46:17 UTC

re: Avoid tweaking evaluation thresholds on Baytrail v3

Quick test of 4.10.1 w/v3 ran 12 hours without freezing (Asus T100CHI, Z3775).  The CHI would consistently freeze within several hours without a cstate patch, script or kernel arg.  I'll restart the 4.10 test and try to let it run several days.

Unfortunately, while the new v3 patch does apply to 4.2.8, it still froze in 45 minutes, which is comparable to unpatched.  I rebuilt 4.2.8 with the original patch to retest and it also froze, in 1:45.  I guess there were too many other problems not yet fixed in that old kernel series.

Comment 745 Len Brown 2017-03-01 04:57:14 UTC

An i915 GFXMHz observation...

Linux-4.8 out of box (no patches or cmdline workarounds)
was easy to hang in under 20 minutes on my dell n3540 laptop
and acer j1900 desktop.  Turbostat showed i915 GFXMHz of 875.

Linux-4.9 became significantly harder to hang -- surviving
the same stress test for over 24 hours.
turbostat showed i915 GFXMHz of 187.

Linux-4.10 seems even more difficult to hang.  The j1900
took almost 3 days, and the n3540 is still running the
stress test after 4-days.  i915 GFXMHz is 187.

Comment 746 Jochen Hein 2017-03-01 19:31:14 UTC

Re: Avoid tweaking evaluation thresholds on Baytrail v3

I'm running 4.10.1 with the patch from #742 and didn't have a hang since yesterday. You may use my
Tested-by: Jochen Hein <jochen@jochen.org>

Comment 747 Len Brown 2017-03-02 16:00:22 UTC

My comment #745 regarding GFX MHz is erroneous.

Thanks to Yaroslav Isakov for reporting that
turbostat is re-reading a constant value from a file in sysfs
and presenting the unchanging value in the GFXMHz column.  I'll
post an updated turbostat to handle this shortly.  Re-opening
the file works around the problem, just as you'd see if you did this:

$ watch -d cat /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

More interesting is that while testing the workaround for that bug,
I logged into my dell-n3540 laptop -- which had been running my
standard byt.test stress test for 7-days on stock 4.10
without a hang.  I fired up a few additional copies of glxgears to
make sure I could see GFXMHz wiggle; and the machine hanged in 5 minutes.
Hopefully a tweak of that stress test to make the graphics run
faster will bring down time-to-failure on an un-patched kernel here.

Comment 748 Len Brown 2017-03-05 23:46:24 UTC

Created attachment 255091 [details]
latest turbostat (17.03.04) utility for baytrail

Attached is the latest turbostat utility, please stop using
the older version, and let me know if you have any troubles
with the latest.

This is version 17.03.04 -- slightly newer than the 17.02.24
that was just checked into the Linux-4.11-rc1 source tree.
Compared with that one, this version fixes the GFXMHz column issue.

Note that turbostat prints more columns than it used to,
and so capturing the output in a file is prudent.
A 10 second snapshot can be gathered in a file "ts.out", this way:

$ sudo ./turbostat -o ts.out sleep 10

Comment 749 Len Brown 2017-03-06 00:10:03 UTC

Re: GFXMHz and glxgears load in comment #747

Adding 5 copies of glxgears to byt.test allows my dell-n3540 to fail on Linux-3.10
in 15 minutes.   Without this additional load, the same hardware and OS
would not fail in 3-days.  That was quite a mystery, as previously
a single copy of glxgears -fullscreen in byt.test would
reliably kill my test systems on Linux-4.8 within 15 minutes.

It appears this is related to Graphics P-states.

You can poll GFXMHz 10 times per second this way:

watch -n .1 -d cat /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

I found that if GFXMHz was pegged to minimum or maximum,
there were no failures.  Or if the load was high enough
or low enough that the frequency stayed near max or min
with few changes -- no failure.

However, if the number of glxgears is tuned to a "sweet spot"
such that GFXMHz is different virtually every time it is polled,
the time to failure is shortest.

Curiously, the number of copies of glxgears to hit the "sweet spot"
is quite different on different machines.  Also the -fullscreen parameter
makes a big difference.

On my T100-CHI, it takes only 1 copy of glxgears -fullscreen
to kill the machine.  On that machine, additional copies of glxgears make
the GFXMHz reach and stay at maximum, and the failure is not seen.
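
To put a number on how often GFXMHz changes while tuning the glxgears load toward the "sweet spot", the polling can be scripted instead of watched by eye. This is only a sketch under stated assumptions: the function name count_freq_changes is invented here, it reads the same gt_cur_freq_mhz file as the watch command above, and the fractional sleep interval assumes a sleep that accepts fractions of a second (as GNU coreutils does).

```shell
# Hypothetical helper: poll a frequency file and report how many samples
# differed from the previous one.
# Arguments: file, number of samples (default 100), interval in seconds
# (default 0.1).
count_freq_changes() {
    freq_file="$1"
    samples="${2:-100}"
    interval="${3:-0.1}"
    changes=0
    prev=""
    i=0
    while [ "$i" -lt "$samples" ]; do
        # cat re-opens the file on every read, so a fresh value is returned.
        cur=$(cat "$freq_file")
        if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
            changes=$((changes + 1))
        fi
        prev="$cur"
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$changes changes in $samples samples"
}
```

Example: count_freq_changes /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz 100 0.1
A count close to the number of samples would correspond to the "different virtually every time it is polled" case described above; a count near zero means the frequency is pegged.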

Comment 750 Len Brown 2017-03-06 00:17:16 UTC

Re: Avoid tweaking evaluation thresholds on Baytrail v3

Running this patch on 4.10, I've not yet seen a failure on
Dell-n3540, Acer-J1900, Asus-T100-CHI-Z3775 -- while I was able
to fail all of those machines in under 15-minutes without the patch.

However, based on the output of

watch -n .1 -d cat /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

This patch seems to make the i915 get pegged to maximum
GFXMHz as soon as there is any GFX workload.  When the workload
is terminated, GFXMHz returns to minimum.  This appears to
put the i915 in "race to halt" mode.  Unclear if that was
the intention of the patch.

Comment 751 Vincent Gerris 2017-03-07 22:24:12 UTC

Hi Laszlo,

Thank you for your elaborate reply.
I will try that and the recently posted patch on a recent kernel again and report (may take a while).

@Len Brown since my problem does not seem to be fixed by the patch, but goes away when the max_cstate parameter is used, will it be further pursued in this bug report, especially considering what Laszlo wrote?

I have a really good way to trigger the bug and it would be great if we can fix this properly for everyone.

Thank you and regards,
Vincent

Comment 752 frr 2017-03-08 12:03:53 UTC

Apologies for going slightly off topic here, at this software-side forum:
I cannot help but wonder where the gremlins are possibly hiding :-) and I can't exclude that "someplace in the hardware" is the correct answer.

I haven't looked into details, but the proposed patches seem to modify the rate and aggressiveness of GPU clock frequency changes. Also take into account the reported "sweet spot" consisting of a particular number of GLXgears instances, a figure that is HW-specific. As if, with every change in clock frequency, there's a tiny "window of opportunity" for something to clash under the hood, in silicon. The "window of opportunity" (unhandled critical section in the HW?) may be different on different motherboard models, or in different flavours of the BayTrail SoC.

I'm wondering if this is a "window of opportunity" in time (albeit very short, around the "event of clock change") where something in the amazing "silicon clockwork" can clash, or if this is really some power rail instability = lack of proper blocking, at board level, SoC package level, chip level, or chip subsystem level. In my practice, in different situations, I've seen both - I've seen an unhandled "critical section" in an FPGA design doing some very basic counter with current value latching (for bus access from host CPU), and obviously numerous power blocking goofups. None of it was so subtle and so close to the CPU though.

It would potentially be nice to know if the freeze happens when the clock gets bumped up vs. relaxed down (or, irrespective).

I'm a troubleshooter / application geek helping customers integrate industrial/embedded PC's and motherboards with operating systems in their very diverse setups. I got to know about this thread when one particular HW model was curiously misbehaving. More precisely, as I already mentioned in this thread, I had two models of hardware on my lab desk, several pieces of each, with a pretty much identical motherboard, different only in the SoC soldered on the board: one was a dual-core Celeron N2807 (running like a cheetah no matter what), the other was a quad-core Celeron N2930 - freezing reliably under the well known test conditions. Tested on several pieces of either hardware. It makes you wonder: two motherboards, likely identical PCB layout, don't know about possible differences in the BOM (the set of power blocking components) - but the VRM and SMD MLCC's around the socket seemed identical to the naked eye. If they *are* in fact identical, I'm not sure if this means the design is correct or flawed :-)

Unfortunately Intel's reference board designs and detailed power design guidelines are NDA'ed for the recent generations of Intel CPU's and SoC's, so I don't have access to them. All I have is the basic datasheet:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-e3800-family-datasheet.pdf
containing some rather coarse notes about what power wells the BayTrail SoC has and how they behave. See chapter 9 "Electrical specifications". There's the CORE_VCC and UNCORE_VNN (that's the GPU?), these two are dynamic and apparently steered by a "serial VID" bus (that has replaced the traditional parallel Voltage ID pins?). Apart from them, there are maybe 2 or 3 static rails around 1 Volt, likely for some less demanding interface/glue logic.

I can see them all on my motherboard, the two dynamic rails (VCC/VNN) have something around 0.85 V if memory serves... but my oscilloscope is probably much too slow to show anything interesting on the power rails (at 300-500 MHz analog bandwidth and 1 GSps true sampling rate). Actually I'm a bit puzzled about this. The Intel datasheet says that (quote) "The voltage rails should be measured with a bandwidth limited oscilloscope with a roll-off of 3 dB/decade above 20 MHz". Makes me wonder how I'm supposed to judge power blocking for a ~2GHz CPU with a 20MHz 'scope :-) The VRM's producing VCC/VNN on "my" BayTrail motherboards are "all solid" = all the caps are MLCC's. Speaking of MLCC's, their frequency response is illustrated e.g. here:
http://www.avx.com/docs/techinfo/CeramicCapacitors/parasitc.pdf
The modern bulk multi-dozen-uF MLCC's have a lowest impedance at their self-resonant minimum, which is a couple dozen MHz. Better than solid polymer, but probably not enough (not alone) for perfect decoupling of modern 22/14nm silicon, specified to have a maximum consumption of about 10 A per the VCC/VNN rail - and that's peak 10 Amps each, observed by a 20 MHz 'scope (= effectively an "average" over some timespan in dozens of nanoseconds). The nominal core clock rate is about 2 GHz, but individual gates comprising the CPU core must actually be much faster, with switching and propagation times at least a decimal order faster. Imagine that the CPU can step up its consumption by a couple Amps within a nanosecond. The dI/dt is enormous. Also note that even if you manage to build a decent RF blocking capacity out of smaller / higher-resonating ceramic caps, a lambda/4 transmission line can flip the impedance "inside out" (turn a short into high Z), and lambda/4 at 2 GHz is about 37 mm in open space, maybe under 30 mm in a PCB. Considering how fast the gates really are, I would say that an optimum distance to place blocking capacitors is within a few mm from the chip, i.e. on the interposer. Yes there are some caps on the interposer, and they seem so tiny... If this is really a power blocking problem, it may well be a problem on part of Intel, rather than on part of the board maker. The beefy MLCC's on the motherboard (alone) won't cut it, and the VRM's response time to a consumption peak (filtered by the MLCC's) should not be a problem.
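
As a rough check of the lambda/4 figure above: a quarter wavelength is c / (4 * f * sqrt(er)), where er is the effective relative permittivity of the medium (1 in open space). The sketch below just does that arithmetic with awk. The function name quarter_wave_mm is made up for illustration, and the value er = 4 used in the second example is only an assumed, FR-4-like figure, not something taken from a board datasheet.

```shell
# Hypothetical helper: quarter wavelength in millimetres.
# Arguments: frequency in Hz, effective relative permittivity (default 1,
# i.e. open space).
quarter_wave_mm() {
    awk -v f="$1" -v er="${2:-1}" 'BEGIN {
        c = 299792458    # speed of light in m/s
        printf "%.1f\n", c / (4 * f * sqrt(er)) * 1000
    }'
}

# quarter_wave_mm 2000000000      prints 37.5  (open space, 2 GHz)
# quarter_wave_mm 2000000000 4    prints 18.7  (assumed er = 4)
```

The open-space result matches the "about 37 mm" quoted above, and with the assumed permittivity the in-PCB distance indeed comes out well under 30 mm.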

In terms of thermals, the board/computer makers often regard the BayTrail as something that almost doesn't need a heatsink (I don't agree, but that's a different story, and I have the thermals right in this case) but it's making me wonder to what extent both the motherboard makers and Intel are possibly soothed by the relatively low consumption at the electrical level. As if "it doesn't draw too much power on average, so it doesn't need that much rail blocking, right?" Oops, that attitude would be a problem. The silicon is admittedly not power hungry, but it's got some screaming fast, latest generation gates, and can ramp its own consumption quite aggressively in very short time quanta...

The step in power consumption due to a change in clock rate alone is likely not *that* abrupt - it's making me wonder if there's possibly some "synchronized gate switch" on a massive scale, brought about by the clock change event. Something that would produce a tall power glitch, lasting for a few picoseconds, that's otherwise not likely to occur in such a perfect synch.

Does Linux fiddle with *voltage* during those GPU power management events? Does it ever tweak the VID? I don't recall anyone mentioning it here... if Linux did fiddle with the VID, the VRM would possibly need some time to ramp up, before it would be safe to increase the clock rate. Again this seems unlikely to me, such a scheme would not be very swift and efficient.

Also, many users say (including my customer) that the hardware doesn't hang in Windows. Which makes me wonder in what way are windows "different", in their handling of the GPU power management.

It's also the outside behavior that's making me wonder. The problem can be suppressed by preventing clock rate changes in the GPU (IGP). Yet apparently it's not just the GPU that hangs, it's the whole computer that hangs. Unfortunately it doesn't mean that a CPU core has frozen - it means that something along the path between the RAM, MCH, cache and the CPU cores has frozen. And it always freezes (in my case) with the picture stuck on the screen, i.e. no random chance of the kernel reporting a panic. Interestingly, dual-core chips generally don't have the problem - it's typically a problem of the quad-core SoC's. Now where does the core count have a cross-section with the GPU clock modulation? Through power consumption? But if the GPU runs off the VNN "power well", it doesn't even share a power well with the CPU cores!

Makes me wonder if someone (Intel R&D ?) has on-chip ICE/ICD capabilities, and
would be able to reproduce the problem and take a closer look at what happened.
But I suspect that with this many pieces sold, they would keep the results
to themselves anyway :-/

Hmm... in the kernel code, what does the actual "set GPU clock" or "set GPU power mode" look like? Just a single MSR write? I'm wondering if it takes the GPU hardware some time to actually carry out that order. And, what would happen if another "clock change" instruction came too early... Could that be our "unhandled critical section" ?

I should probably put my crystal ball to rest and get off the hallucinogens.

Thank you guys with an Intel e-mail (and on Intel payroll?) for keeping up the fight for our benefit, even if you don't have full access internally. You're doing a marvellous job. And I think Intel should receive some thanks too - for paying you, and for making the hardware in the first place :-)

Comment 753 Paul Mansfield 2017-03-08 12:36:04 UTC

I'm with Frantisek.Rysanek here that there are deep-rooted problems in the Baytrail SoC.
I'm pretty sure the only way to make it stable with Linux is to constrain the chip to run its clocks at constant frequencies and not try and switch into different sleep states.

I and six others all have the same convertible tablet, the Toshiba Click Mini with the Z3735F, from different batches. Mine is terribly unstable and locks up at the slightest provocation, yet others have been able to run Linux with some success with the same kernel.
As far as we can tell, there's only ever been one stepping level of the Z3735F as we all have the same SKU, and there's no microcode loader for this chip.

The biggest cause of instability comes when using SDIO, e.g. an SDIO wifi adaptor which is common in many baytrail devices. Quite often a device will run acceptably well with a USB wifi adaptor, and lock solid within minutes or seconds of loading the SDIO driver.

Ars Technica has a block diagram of the chip, and my gut feel is that the block labelled "Storage Hub" craps out under certain conditions.
https://cdn.arstechnica.net/wp-content/uploads/2013/09/Screen-Shot-2013-09-13-at-6.32.07-PM.jpg

Intel produced a reference board called Sharks Cove using the Z3735F, but a lot of the documentation has disappeared.. however, if you get lucky it's possible to find third parties carrying the documentation, such as this:
http://www.mouser.com/ds/2/456/Sharks-Cove-Technical-Specifications-587828.pdf
(I recommend people grab a copy before it too disappears)

Comment 754 jbMacAZ 2017-03-08 18:11:01 UTC

I've had 2 freezes (less than an hour of casual use) of the linuxium(Budgie-Ubuntu17.04) w/4.10.0 kernel which includes the v3 patch.  When I rebuilt the linuxium kernel removing the v3 (comment #742) and adding back Mika's original patch (comment #683), I'm freeze free.  The revised patch does not seem to be as effective.  My other kernels run days with either patch version (Mint 18.1, Manjaro).  (T100CHI, Z3775)

Comment 755 Juha Sievi-Korte 2017-03-12 18:22:48 UTC

My results on N3540 running 4.10.1 (g8c10701) with and without v3 patch.

Only 1 run each, so I don't know about repeatability. I did as Len instructed (watching the gpu frequency). For me even one fullscreen glxgears was enough to cap the frequency to maximum, but running several small windows made the frequency change all the time.

The unpatched kernel froze in less than two hours. With the v3 patch applied, the freeze happened after about 8 hours of running the same test set.

Comment 756 Alejandro Morales Lepe 2017-03-16 04:38:10 UTC

Created attachment 255281 [details]
attachment-16106-0.html

I have been running Fedora 25 on my Dell Inspiron 15 3000 Series with an Intel
Pentium N3540, and kernel 4.9 has been stable for around 3 weeks now,
completely vanilla. Has somebody in the Fedora Project/Red Hat tweaked
something? More people should try it too. It makes no difference whether I use
Wayland or Xorg.


Comment 757 Vincent Gerris 2017-03-23 21:39:04 UTC

Hi,

I was able to test with wireless:
https://launchpad.net/ubuntu/zesty/amd64/bcmwl-kernel-source/6.30.223.271+bdcom-0ubuntu2

Some interesting results:
 - the laptop was on power when the wireless driver was installed. When doing the copy and unloading and loading the btusb driver twice, all kept working
 - after a reboot, without the power on, same action made an instant hang happen.

So it seems that:
 - starting the laptop without AC power affects this (noticed that earlier)

Since @Len Brown mentioned that this worked on his Asus :
intel_idle.max_cstate=2

I tried that too. That actually works for my situation as well.

Does that mean the C6 is the cause in combination with wireless?
Is anyone at intel able to patch this up too :)?

I would be very happy to see this resolved still. 
The machine I have is still dedicated to testing.

Thank you

Comment 758 t.sarawinski 2017-03-25 05:42:26 UTC

Arch users can test linux-baytrail410 & linux-baytrail411 (rc3).

including patches from here and more.

My Tablet seems to run much smoother (tested on Gnome 3 - very laggy before)


On stock kernel it often freezes after the Login. This problem is gone for me.


feel free to test and give some feedback.


After some idle time the screen goes black. Sometimes it comes back after a longer time by randomly pushing all the buttons.

Hibernate and screen suspending are set to disabled.

Anyone have a suggestion?

Comment 759 AndyMG 2017-03-26 21:22:10 UTC

Hi,

Just a quick observation for you guys ;]

I'm running a Ubuntu 16.04 (GalliumOS 2.1 with 4.8.17 kernel) on an Acer CB3-111 Chromebook with Bay Trail CPU. I have been using it a lot for about a month now without any problems or freezes and suddenly today it started to freeze after a couple of minutes (about 15-20) or so and only power button could help (hard reset). I was looking for what might be the reason and I googled to here. After browsing through this thread I got an idea:

I noticed that Bluetooth was on in Xfce. I never use Bluetooth on this device, so I just turned it off in the GUI (Xfce), and the freezes seem to be gone. I have been working for a couple of hours now. I will see in a couple of days, but it seems like the issue is fixed.

What suggested the solution to me was a post here about Bluetooth (btmon, I think). The issue is so annoying that I decided to share in case this simple solution also works for someone else. I would not like to change the C-state, as this would mean additional battery drain.

Stay safe ;]

Comment 760 jbMacAZ 2017-03-27 19:56:31 UTC

The good news is that 4.11-rc4 has the v3 patch built in.  The bad news for me is that my build hard froze within 5 minutes on Mint 18.1.  The original patch can't be used anymore because some of the code it modified was rewritten. "Delightfully unstable" T100CHI - Z3775.
Update: 4.11-rc4 with intel_idle.max_cstate=2 froze in about 20 minutes.  I'll retest when -rc5 comes out.

Comment 761 Mark_H 2017-03-28 08:02:51 UTC

As AndyMG reported that Bluetooth may have an influence, I switched it off yesterday and had no freeze during 2 hours.
Even intel_idle.max_cstate=0 did not always help with Bluetooth on.

Have a nice day

Comment 762 Travis Hall 2017-04-02 23:36:43 UTC

I've been having really good stability with the ck patchset kernel on Arch Linux (linux-ck-silvermont-4.10 available at https://mirror.archlinux.no/repo-ck/os/x86_64/) on my N2940 Lenovo 11e

Been running youtube videos, vlc and other general use for about 16 hours so far

No idea why this would be the case, but it's interesting

Comment 763 jbMacAZ 2017-04-03 17:11:28 UTC

4.11-rc5 is stable so far without any cstate argument on my CHI w/v3 patch.  My rc4 was stable with cstate=1, so I'm beginning to suspect a build error (comment #760) 
Except for WiFi noise (brcm) in dmesg, 4.11-rc5 looks quite good overall for my system (T100CHI - Z3775).

Comment 764 Len Brown 2017-04-07 00:52:03 UTC

Re: comment #750 - Avoid tweaking evaluation thresholds on Baytrail v3

> Running this patch on 4.10, I've not yet seen a failure on
> Dell-n3540, Acer-J1900, ASus-T100-CHI-Z3775 -- while I was able
> to fail all of those machines in under 15-minutes without the patch.

At 6 weeks + 1 hour of running my torture test,
my Dell-N3540 hanged.  (Acer J1900 still running.)

@ Juha Sievi-Korte 

Thanks for testing!
Your n3540/Mika-v3 failure after 8-hours was much more prompt
than my 6-week result!

Comment 765 Len Brown 2017-04-07 01:07:48 UTC

@ Vincent Gerris

Re: intel_idle.max_cstate=2

Yes, that allows C1 (state1) and C6-no-shrink (state2),
but disables C6-Shrink, C7 and C7-Shrink.  Here "shrink"
refers to forcing the cache to be flushed on 1st entry
into that state -- an action that is good for power,
as the cache can be powered-off, but bad for performance,
assuming you plan to access again the cache state that was flushed.

You'll be able to observe this with turbostat.
The result is that you'll have Core C6 residency,
but since the shared module cache will not be flushed, you'll
not often have module (pair of cores) residency, or package C6 residency.

Package C6 is where the external voltage to the package will be changed.
To enter package C6, the graphics must also be in Render-C6 (RC6).

@frr

Regarding voltage changes: Linux does control them, but indirectly.
Higher voltage is used for high frequency, lower voltage for lower
frequency.  On the CPU, we write the PERF_CTRL MSR with a cookie
that includes the speed, and the hardware translates that into
what it should do with the voltage.

Note that I generally run with a fixed frequency -- the highest --
in an attempt to cause worst-case voltage swings.  The voltage doesn't
stay high -- the most frequent cause of voltage changes is the fact
that when all cores go idle, the hardware automatically lowers the
voltage, and then ramps it back up on exit from idle.

GFX has its own P-states, and they are under direct control of the
i915 driver.  "Mika v3" and other patches are tweaking how and when
the GFX P-state is changed -- and this area appears to be very close
to the most common (but not universal) pain point on these systems.

Comment 766 Len Brown 2017-04-07 01:16:37 UTC

@ Mark_H 

Re: bluetooth

> Even intel_idle.max_cstate=0 did not always help with Bluetooth on.

When that parameter is used, intel_idle is disabled and you run acpi_idle.
What acpi_idle does varies from machine to machine.
(See what states it offers with: grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*)

So the question is if intel_idle.max_cstate=1 does not make a difference,
but disabling bluetooth does.  If that is the case, then that may be
a different bug.  If that is NOT the case, then BT may simply be very
good at helping us take interrupts and pop in and out of idle to make
the failure happen sooner.
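
The grep above prints every attribute of every idle state, which can be a lot to read. As a more compact alternative, here is a minimal sketch that lists only each state's name and its disable flag. The function name list_idle_states is invented for illustration; it assumes the usual cpuidle sysfs layout with a "name" file, and a "disable" file where the kernel provides one, in each stateN directory.

```shell
# Hypothetical helper: print "stateN NAME disable=FLAG" for every idle
# state of one CPU. Argument: cpuidle directory (default: cpu0).
list_idle_states() {
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$cpuidle_dir"/state*; do
        # Skip the unexpanded pattern if no state directories exist.
        [ -d "$state" ] || continue
        name=$(cat "$state/name")
        # Print "?" when the disable attribute is missing or unreadable.
        disabled=$(cat "$state/disable" 2>/dev/null || echo "?")
        echo "$(basename "$state") $name disable=$disabled"
    done
}
```

Running list_idle_states once with intel_idle.max_cstate=0 and once without it should show whether acpi_idle or intel_idle is supplying the states on a given machine, since the names they report generally differ.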

Comment 767 b.peguero 2017-04-08 17:28:38 UTC

I see the status as "NEEDINFO" but no obvious indication of what information is needed. This bug is pushing ~800 comments, most of them "me too" and related. What information needs to be provided from users and is there a matrix of affected/non-affected vs context (like the BT one mentioned a couple of comments above)?

Comment 768 zgrabe 2017-04-10 07:51:35 UTC

@b.peguero 

See here for user reports: https://docs.google.com/spreadsheets/d/1oajcMYL9oSt0O6VTpaIj0osGJxKGKSPSYtLnqr3UHNk/edit#gid=0

Comment 769 Mark_H 2017-04-10 08:53:53 UTC

@zgrabe

As having bluetooth enabled seems to have an impact, maybe there should be an additional column (e.g. bluetooth enabled? yes/no)

Comment 770 frr 2017-04-11 06:17:29 UTC

Dear gents,

I've previously wallpapered enough of this forum in posts 728 and 752.
Just to add a bit of recent experience: two other industrial machines (models) have passed through my lab. Both had a Celeron J1900, and both passed the torture chamber just fine (same test environment). In addition to the looped video playback (same file as before, same OS setup and kernel), I've also tried GLXgears (while the video playback test was turned off), gradually stepping up the number of GLXgears instances from 1 to 5; I added another one after a day of flawless operation. No matter what I did, those two machines were stable. They were a Nexcom IPPC-1840P (very similar hardware to the APPC-xx40 series, which also seems to work well) and an AAEON OMNI-5175 (engineering sample, apparently). I kept teasing them for a straight week before I had to ship them to a customer.

=> makes me feel as if the BayTrail is merely susceptible to some kinds of motherboard design and testing deficiency, that either Intel did not properly warn the mobo makers about, or the mobo makers did not take Intel's guidelines seriously enough and there are no tests in their QC benches for this particular "corner case"...

Comment 771 Paul Mansfield 2017-04-11 10:57:47 UTC

I don't know if it's the done thing, but I would like to propose this bug be closed, and a new one created referencing this one, and only current information be put in the new bug.

This is because it's very hard to determine what the current situation is regarding the cstates that work in combination with which patches, and any specific hardware issues such as SDIO and video, which seem to trigger problems more quickly.

Comment 772 gutosoni 2017-04-11 12:56:42 UTC

I'm using the 4.4.0-72 kernel, it looks like they fixed the problem. The freezes are over, I urge you to test this kernel.

Hardware: Intel Celeron Bay Trail 4x CPU N2930 @ 1,83GHz

Comment 773 Vincent Gerris 2017-04-11 13:22:05 UTC

@gutosoni : your comment is less than helpful if you do not specify which exact kernel you mean, where you got it, etc. Please share some concrete links and tell which one did not work, and preferably what change you think fixed this.

@Len Brown
Thank you for your elaborate explanation.
Can you say if the issue I have (N3520, Yoga 2 11) will be fixed in this bug report, or do I need to file a new bug report?

I also noticed this comment that I missed before (from Mika):
Another long shot to try is to see if:

'intel_reg write 0xa168 0x0'

has any effect on occurrence.

I will see if that does anything.
As reported, I still have consistent freezes when doing a file transfer over wifi, on 4.11 with the latest patch (sometimes even without touching bluetooth).

Kind regards,
Vincent

Comment 774 Mika Kuoppala 2017-04-11 13:38:01 UTC

(In reply to Vincent Gerris from comment #773)
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.
> 

Very likely a waste of time. That change won't last, as we rewrite the pmintrmask
often. You would need to change the mask in the kernel and recompile (there was a patch a long way back).

One interesting triaging point is not limiting the cstate but rather
limiting the number of active cpus.

Please check whether 'maxcpus=2' makes a difference.

Comment 775 gutosoni 2017-04-11 14:21:34 UTC

(In reply to Vincent Gerris from comment #773)
> @gutosoni : your comment is less than helpful if you do not specify which
> exact kernel you mean, where you got it, etc. Please share some concrete
> links and tell which one did not work, and preferably what change you think
> fixed this.
> 
> @Len Brown
> Thank you for you elaborate explanation.
> Can you say if the issue I have (N3520, Yoga 2 11) will be fixed in this bug
> report, or do I need to file a new bug report?
> 
> I also noticed this comment that I missed before (from Mika):
> Another long shot to try is to see if:
> 
> 'intel_reg write 0xa168 0x0'
> 
> has any effect on occurrence.
> 
> I will see if that does anything.
> As reported, I still have consistent freezes when doing a file transfer over
> wifi, on 4.11 with the latest patch (sometimes even without touching
> bluetooth).
> 
> Kind regards,
> Vincent

http://packages.ubuntu.com/xenial/linux-image-4.4.0-72-generic

I have been using it for over a week, so far everything is fine, no problems.

Comment 776 Hal 2017-04-11 16:48:56 UTC

(In reply to gutosoni from comment #775)
> http://packages.ubuntu.com/xenial/linux-image-4.4.0-72-generic
> I have been using it for over a week, so far everything is fine, no problems.

My office desktop (Zotac ZBOX-CI320NANO with Intel Celeron N2930) has that same exact version provided through Linux Mint 18.1 updates. 
Without intel_idle.max_cstate=1 it freezes within the hour. 

Same machine with 4.10.9-041009 (from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10.9/linux-image-4.10.9-041009-generic_4.10.9-041009.201704080516_amd64.deb) runs a bit longer - about 3-4 hours before freezing solid!

When loading the kernel with intel_idle.max_cstate=1 the machine performs flawlessly for months with no crash.

Hal
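
Editorial sketch: the thread repeatedly refers to booting with intel_idle.max_cstate=1 but never shows where the parameter goes. On Ubuntu it is normally appended to the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub`, followed by `sudo update-grub` and a reboot; that procedure is an assumption about a GRUB-booted system and is not stated in the thread. The hypothetical helper below only rewrites the text of such a line and does not touch any file.

```python
def add_kernel_param(grub_line, param):
    """Return grub_line with param set inside the quoted option list.

    grub_line is expected to look like KEY="opt1 opt2". An earlier
    setting of the same parameter is dropped so the new value wins.
    """
    key, sep, value = grub_line.strip().partition("=")
    if sep != "=":
        raise ValueError('expected a KEY="value" line')
    # Keep the quote character the line already uses; default to a double quote.
    quote = value[:1] if value[:1] in ('"', "'") else '"'
    options = value.strip("\"'").split()
    name = param.split("=", 1)[0]
    options = [opt for opt in options if opt.split("=", 1)[0] != name]
    options.append(param)
    joined = " ".join(options)
    return f"{key}={quote}{joined}{quote}"


if __name__ == "__main__":
    line = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'
    print(add_kernel_param(line, "intel_idle.max_cstate=1"))
```

The same helper works for the 'maxcpus=2' experiment from comment 774; only the parameter string changes.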

Comment 777 Fred 2017-04-11 17:18:41 UTC

Looking at a 4.1.12 preempt-rt kernel with a Bay Trail 3805 (headless), meaning there is no graphics engine, so the patch from comment #683 https://bugzilla.kernel.org/show_bug.cgi?id=109051#c683 does not apply.

What they are seeing is an occasional 5–7 ms of additional latency around some of our pthread_cond_timedwait() calls. For example, if they tell pthread_cond_timedwait() to wait up to 50 ms, it actually doesn't return for 56 ms. If they use intel_idle.max_cstate=1 on the kernel command line, the problem goes away.

So I am wondering if we are looking for the issue in the wrong place.   OR should this be listed as a separate/ new bug?
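
Editorial sketch: the overshoot described in comment 777 (a 50 ms timed wait returning after 56 ms) can be measured with a simple loop. The version below uses Python's `threading.Condition.wait()` as a stand-in; it is not the same as calling pthread_cond_timedwait() from C, and the interpreter adds its own overhead, so treat the numbers as a rough indication only. The function name `measure_overshoot_ms` is hypothetical.

```python
import threading
import time


def measure_overshoot_ms(timeout_ms=50.0, samples=20):
    """Time repeated condition waits that always time out.

    Returns a list with, for each sample, how many milliseconds the
    wait took beyond the requested timeout.
    """
    cond = threading.Condition()
    overshoots = []
    for _ in range(samples):
        with cond:
            start = time.monotonic()
            # Nobody ever notifies this condition, so the wait runs
            # until the timeout expires.
            cond.wait(timeout_ms / 1000.0)
            elapsed_ms = (time.monotonic() - start) * 1000.0
        overshoots.append(elapsed_ms - timeout_ms)
    return overshoots


if __name__ == "__main__":
    results = measure_overshoot_ms()
    print("worst overshoot: %.2f ms" % max(results))
    print("mean overshoot:  %.2f ms" % (sum(results) / len(results)))
```

Running it once with the default command line and once with intel_idle.max_cstate=1 gives a before/after comparison of the kind Fred describes.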

Comment 778 Mika Kuoppala 2017-04-12 07:31:29 UTC

(In reply to Fred from comment #777)
> Looking at a 4.1.12 prempted-rt kernel with a bay trail 3805 ( headless )
> meaning there is no graphics engine.  So the patch from comment# 683
> https://bugzilla.kernel.org/show_bug.cgi?id=109051#c683 does not apply.
> 
> What they are seeing is an occasional 5 – 7 mS of additional latency around
> some of our pthread_cond_timedwait() calls.   For example, if they tell
> pthread_cond_timedwait() to wait up to 50ms, it actually isn’t returning for
> 56 ms.   If they use intel_idle.max_cstate=1 on the kernel command line, the
> problem goes away.
> 
> So I am wondering if we are looking for the issue in the wrong place.   OR
> should this be listed as a separate/ new bug?

Fred, please take a look at:
https://bugzilla.kernel.org/show_bug.cgi?id=195255

Comment 779 Fred 2017-04-12 11:57:12 UTC

(In reply to Mika Kuoppala from comment #778)
> (In reply to Fred from comment #777)
> > Looking at a 4.1.12 prempted-rt kernel with a bay trail 3805 ( headless )
> > meaning there is no graphics engine.  So the patch from comment# 683
> > https://bugzilla.kernel.org/show_bug.cgi?id=109051#c683 does not apply.
> > 
> > What they are seeing is an occasional 5 – 7 mS of additional latency around
> > some of our pthread_cond_timedwait() calls.   For example, if they tell
> > pthread_cond_timedwait() to wait up to 50ms, it actually isn’t returning
> for
> > 56 ms.   If they use intel_idle.max_cstate=1 on the kernel command line,
> the
> > problem goes away.
> > 
> > So I am wondering if we are looking for the issue in the wrong place.   OR
> > should this be listed as a separate/ new bug?
> 
> Fred, please take a look at:
> https://bugzilla.kernel.org/show_bug.cgi?id=195255

Thanks Mika!

Comment 780 Vincent Gerris 2017-04-17 16:51:11 UTC

(In reply to Mika Kuoppala from comment #774)
> (In reply to Vincent Gerris from comment #773)
> > 
> > 'intel_reg write 0xa168 0x0'
> > 
> > has any effect on occurrence.
> > 
> 
> Very likely a waste of time. That change wont last as we rewrite the
> pmintrmask
> often. You would need to change the mask in the kernel and recompile (there
> was a patch long way back).
> 
> One intresting triaging point is not limiting the cstate but rather
> limiting the number of active cpus.
> 
> Please try if 'maxcpus=2' will make a difference.

Hi Mika,

Thanks, maxcpus=2 makes it stable for me.
How would you like me to proceed?

Comment 781 jbMacAZ 2017-04-19 08:37:26 UTC

I've built the new 4.10.11 with the v3 patch backport. It froze twice, once within 10 minutes and then again within an hour. For third try, I added intel_idle.max_cstate=2 as previously recommended. That resulted in a "soft" freeze within 3.5 hours (apparently a wifi issue - dmesg was spammed with brcmfmac error -110s.) Technically, that wasn't a cstate hard freeze, but either way, all three tests ended with the system unusable.

4.11-rc5, rc6 & rc7 all run fine without any cstate arguments, scripts or patches (v3 patch was mainstreamed in rc4.) 4.10.10 is rock solid with Mika's original patch. Did something else need to be backported to avoid freezing in 4.10.11? (Also see comment 754)

For test #4 I'm using maxcpus=2. 9+ hours of run time so far. I've had good results with maxcpus before (comments 191, 197), prior to settling on intel_idle.max_cstate=1. Curiously, the wifi dmesg spam is conspicuously absent so far in this test. I prefer the cstate=1 workaround, since maxcpus cuts system performance - video streaming shows more stuttering. I can test any new patches if/when available.

Thanks to all at Intel (and elsewhere) for applying some real horsepower to this cluster of baytrail freeze problems. For my system - T100CHI, Z3775, 4.11 looks great. 4.10 has more issues muddying the waters. Baytrail sound was still evolving and (broadcom) wifi has had chronic issues since early December backported to several kernel series. 4.10[EOL] can't happen soon enough! YMMV.

Comment 782 jbMacAZ 2017-04-21 08:05:23 UTC

(In reply to jbMacAZ from comment #781)
> I've built the new 4.10.11 with the v3 patch backport.

I repeated test 3 with wifi disabled and got a typical hard freeze in about 4 hours, better than the earlier test, but not good.  Setting intel_idle.max_cstate=1 appears to restore stability for my system with 4.10.11.  I also got my first freeze today on 4.11-rc7 after 2 weeks of successful 4.11 (rc5, rc6 & rc7) testing.  It's time to throw in the towel on this clunker.  Even other T100 models (e.g. T100TA...) are far less prone to freezing...

Comment 783 Andrey 2017-04-22 15:13:51 UTC

I have a very similar problem on a LENOVO V510-15IKB with an i7 7500U (Kaby Lake).
With the default kernel (4.4) on Ubuntu 16.04, freezes were very frequent (after anywhere from 5 minutes to at most 2 hours of work).

Now I've updated the kernel to 4.10.10 and set intel_idle.max_cstate=1; freezes still happen, but after 3-12 hours of work.

Is anybody else using Kaby Lake?
How can I diagnose whether this is the same c-state bug?

Comment 784 jbMacAZ 2017-04-22 19:52:13 UTC

(In reply to Andrey from comment #783)
> I have very similar problem, on LENOVO V510-15IKB with i7 7500U (Kaby Lake).
> With default kernel (4.4) on Ubuntu 16.04 freezes were very often (from 5
> minutes up to maximum 2 hours of work). 
> 
> Now I've update kernel to 4.10.10 and set intel_idle.max_cstate=1, freezes
> still happens but in 3-12 hours of work.
> 
> Somebody using Kaby Lake?
> How to diagnose that is it same c-state bug?

You probably have a different issue, because the hallmark of this bug is that setting intel_idle.max_cstate=1 virtually stops the freezes.  To diagnose, it is best to change things one at a time.  If your default kernel still freezes in under 2 hours with intel_idle.max_cstate=1 set (at least 2 attempts), then cstate is probably unrelated to your issue.

BTW my Dell kabylake (i7-7500U) hasn't frozen on me yet, so that processor can run without freezing (about 2 months).
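
Editorial sketch: when changing one thing at a time, it helps to confirm that the workaround parameter really reached the running kernel before starting the clock on a test. The running kernel's command line is available in `/proc/cmdline`; the hypothetical helpers below parse it and report the two parameters discussed in this thread. This is a minimal sketch and does not check anything beyond the command-line text.

```python
from pathlib import Path


def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def describe_workarounds(cmdline):
    """Report the values of the workaround parameters from this thread."""
    params = kernel_params(cmdline)
    return {
        "intel_idle.max_cstate": params.get("intel_idle.max_cstate"),
        "maxcpus": params.get("maxcpus"),
    }


if __name__ == "__main__":
    try:
        text = Path("/proc/cmdline").read_text()
    except OSError:
        text = ""  # not on Linux, or /proc not mounted
    print(describe_workarounds(text))
```

A value of None for both entries means the machine is running without either workaround, which is the baseline case for a freeze test.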

Comment 785 slumbergod 2017-04-25 11:53:11 UTC

I have a laptop with a 2nd generation Intel i3 CPU (Ivy Bridge i3-3110) and ever since I installed Ubuntu 16.04 I have been having random freezes as well. But the thread suggests that with my CPU it is *not* the cstate bug. 

Can anyone suggest which bug thread it is for the people with the *same* random hangs as the cstate bug but for CPUs other than the Bay Fail?

I've tried Xubuntu 16.04, 16.04.1, and 16.04.2 and all the kernels available, including the latest mainline ones. Same result. If I leave my machine running at some point it will freeze and nothing but a hard power off will solve it. Unfortunately, doing that has corrupted the file system twice and required reinstallation because fsck wasn't able to resolve it.

Comment 786 Hal 2017-04-25 13:10:20 UTC

(In reply to slumbergod from comment #785)
> I have a laptop with a 2nd generation Intel i3 CPU (Ivy Bridge i3-3110) and
> ever since I installed Ubuntu 16.04 I have been have random freezes as well.
> But the thread suggests that with my CPU it is *not* the cstate bug. 
> 
> Can anyone suggest which bug thread it is for the people with the *same*
> random hangs as the cstate bug but for CPUs other than the Bay Fail?
> 
> I've tried Xubuntu 16.04, 16.04.1, and 16.04.2 and all the kernels
> available, including the latest mainline ones. Same result. If I leave my
> machine running at some point it will freeze and nothing but a hard power
> off will solve it. Unfortunately, doing that has corrupted the file system
> twice and required reinstallation because fsck wasn't able to resolve it.

Just for the heck of it why don't you try to load the kernel with intel_idle.max_cstate=1 and see if it helps to avoid freezing while you gather more info about i3-3110 specific issues. You might be onto something.
The typical issues that I personally experienced with Ivy Bridge CPUs (not exactly your model) were graphics and USB related.

Comment 787 slumbergod 2017-04-25 21:18:24 UTC

I have a laptop with a 2nd generation Intel i3 CPU (Ivy Bridge i3-3110) and ever since I installed Ubuntu 16.04 I have been having random freezes as well. But the thread suggests that with my CPU it is *not* the cstate bug. 

Can anyone suggest which bug thread it is for the people with the *same* random hangs as the cstate bug but for CPUs other than the Bay Fail?

I've tried Xubuntu 16.04, 16.04.1, and 16.04.2 and all the kernels available, including the latest mainline ones. Same result. If I leave my machine running at some point it will freeze and nothing but a hard power off will solve it. Unfortunately, doing that has corrupted the file system twice and required reinstallation because fsck wasn't able to resolve it.

Comment 788 slumbergod 2017-04-25 21:24:15 UTC

(there is no way to remove the repost that somehow happened?)

@Hal, thanks. Yes, I am testing the cstate=1 solution but like the Bay Fail bug, whatever it is that affects the i3-3110 CPU is also *very random*. I could get 24 hours before a freeze or a couple of days. I suspect it is an Intel graphics driver bug, as you suggested. If I have no luck with the cstate=1 solution I will try rolling back to a pre-Ubuntu 16.04 kernel. I look back fondly and remember when my machine could run for weeks or months without a restart! Then came 16.04

Comment 789 luke 2017-04-27 04:45:45 UTC

slumbergod, (In reply to slumbergod from comment #788)

As Andred posted above:

> However, I fear (and has already been mentioned in earlier comments) this bug 
> report has long since lost any usefulness it might have once had and has just 
> turned into a dumping ground for random comments and updates and now reads
> like some web forum thread


This is not a support forum. Please have some respect for others and stop spamming us with your unrelated issues.

Comment 790 Hanno Zulla 2017-04-27 08:40:53 UTC

Hi.

It is very difficult to keep up.

Could someone please summarize and clarify the current status of this bug?

Please correct me if the following observations are wrong:

- the symptom is known, but not the root cause.

- for some reason, the bug does not affect Windows 10, but it affects Linux.

- the bug affects 4-core Bay Trail CPUs, but not 2-core Bay Trail CPUs.

- there is a workaround setting (the original subject of this bug) which is detrimental to battery runtime.

- there is a workaround patch (by Mika), some users of the patch report that it makes things better, others still report crashes.

- all in all, the bug is still unresolved.

Thanks for clarifying.

Thanks to everyone for their hard work on this bug, it is very appreciated. (I can't wait to use the cute little Bay Trail machine I have lying around here for my kids.)

Comment 791 slumbergod 2017-04-27 11:54:43 UTC

Hi Luke, here's a big huge FUCK YOU for being A FUCKING ASSHOLE!!
Go spam yourself you social rejects.

Comment 792 Hanno Zulla 2017-05-03 08:08:32 UTC

Sorry for asking again, but a clarification on the current status of this bug would be very much appreciated. See comment 790 on https://bugzilla.kernel.org/show_bug.cgi?id=109051#c790

Thank you.

Comment 793 jbMacAZ 2017-05-03 19:22:23 UTC

(In reply to Hanno Zulla from comment #790)
> 
> - there is a workaround patch (by Mika), some users of the patch report that
> it makes things better, others still report crashes.
> 
The v3 patch was mainstreamed into 4.11-rc4 and has been back-ported into 4.9 and 4.10.  

For my system, the original patch is effective in 4.9.25 and 4.10.13, but v3 (mainstreamed) patch is not.  Neither patch works for me in 4.11-rc8+, only setting .._cstate=1 will stop freezing.  (Asus T100-CHI Z3775)

The lack of any other reports of continued freezing (that can still be fixed with .._cstate=1) suggests that the v3 patch might be sufficient for most users.

Comment 794 Vincent Gerris 2017-05-04 20:07:33 UTC

Hi,

Thanks for the update, it was unclear to me and I guess others that the patch landed there.

I wonder if it fixed most people's issues, but any is a plus I would say.

I am also wondering about status here because I still have the issue.
Mika, can you let us know if you want to continue investigating the maxcpus path for the people affected?

I am happy to contribute, but I would like to know if this will continue.
My issue like some others, seems to be related to wireless/bluetooth and power management.

I was hoping that these issues can also be fixed with current feedback and as posted before, I am happy to test!

Comment 795 Hal 2017-05-07 00:38:44 UTC

I've been testing 4.11.0 (Ubuntu's compile) on my Zotac (ZBOX-CI320NANO with Intel Celeron N2930) as I gathered that some patches were applied to it.
Without cstate=1 it freezes within 4 to 4.5 hours, no matter the workload. 
So, not out of the woods yet...
Hal

Comment 796 jbMacAZ 2017-05-07 22:32:31 UTC

Correction: The original c-state patch DOES stop freezing in 4.11.0.  There were other changes made about the same time as the v3 patch that interfere with applying the original patch.  Those other changes are probably the problem rather than the v3 patch itself, but I leave that to the experts to ponder.

Comment 797 jerameel 2017-05-11 07:11:30 UTC

Using kernel 4.10.14 from ubuntu on xubuntu 16.04.2 asus x453m laptop with intel baytrail, I'm running fine for several days now without any patch or workaround from cstate. I have tried both powersave and performance governor and still working fine on full load conditions.

TL;DR: I assume this has already been fixed with 4.10.14 (ubuntu)

Comment 798 Mika Kuoppala 2017-05-11 08:53:35 UTC

Fix is an overstatement. As the commit message notes, we have only a workaround
that only helps in some cases.

One interesting datapoint is that with my J1900, using kernel param 'nohz=off' hangs the system in a very short time.
