Problems:

For a iGPU and Nvidia GPU system with Ubuntu 17.04 Desktop, cuda 8.0.

If use prime-select nvidia, then nvidia GPU are both used as display and computing.

If use prime-select intel, then nvidia-smi and deviceQuery will not found the Nvidia GPU in system. and propmts that libnvidia-ml.so can not be found.

Way to solve:

    sudo echo "/usr/lib/nvidia-375" >> /etc/ld.so.conf.d/nvidia.conf

note that the path is which contains the libnvidia-ml.so file in your system.

Ads and Cons of use Intel for display

Normally the CPU fan will have a little noise when in idle mode than GPU in idle mode, if you do not use your system for heavy tasks. The good thing is you will have more GPU memory for you cuda computing with out the display thing consume you nvidia GPU mem. This is better if you want to use you nvidia GPU for some machine learning tasks, which are normally more memory consuming if you want to exersize on read world data.

Follow up on Ubuntuu 18.04, nvidia-driver 410.78

When I installed new Ubuntu 18.04, and use the run files from nvidia driver website https://tw.download.nvidia.com/XFree86/Linux-x86_64/410.78/NVIDIA-Linux-x86_64-410.78.run And run the install with NVIDIA-Linux-x86_64-410.48.run --no-opengl-files, then nvidia driver will not install openGL libraries to my system. So the thoritically, my system only can use the intel iGPU for OpenGL, and it should use the iGPU for any hardware video/graphics acceleration. But it doesn’t.

Issues

To be more concret, I can observe serveal issues in my system.

glmark2 and glxinfo shows the OpenGL driver is Vmware provided, just like what’s been done in the virtual machine, which has basiacally very suck visual acceleration.

litao@deep: ~ $ glmark2                                                                                                                             
** GLX does not support GLX_EXT_swap_control or GLX_MESA_swap_control!
** Failed to set swap interval. Results may be bounded above by refresh rate.
=======================================================
  glmark2 2014.03+git20150611.fa71af2d
=======================================================
  OpenGL Information
  GL_VENDOR:     VMware, Inc.
  GL_RENDERER:   llvmpipe (LLVM 6.0, 256 bits)
  GL_VERSION:    3.0 Mesa 18.0.5
=======================================================
...

glxinfo show the following

litao@deep: ~ $ glxinfo | grep OpenGL | grep string                                                                                                                                                 [1:13:48]
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: llvmpipe (LLVM 6.0, 256 bits)
OpenGL core profile version string: 3.3 (Core Profile) Mesa 18.0.5
OpenGL core profile shading language version string: 3.30
OpenGL version string: 3.0 Mesa 18.0.5
OpenGL shading language version string: 1.30
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 18.0.5
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00

intel_gpu_top shows very GPU activity when I use the chrome to play videos and use any video players.

sudo intel_gpu_top

Gnome graphics animation sucks, almost every thing is very slow.
nvidia-smi shows there are x-server process running on nvidia gpu, which consumes when precious GPU memory when I want to use them as pure cuda compute device in linux. The out are like the following.

Sat Mar 16 01:12:53 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
| 29%   31C    P8     9W / 120W |    164MiB /  6078MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     13305      G   /usr/lib/xorg/Xorg                            32MiB |
|    0     13700      G   /usr/lib/xorg/Xorg                           129MiB |
+-----------------------------------------------------------------------------+

Solution

There should be serveral solutions, and they should have the same functionality, while everyone can choose any one of this by their own need.

Step 1

Blacklist nvidia-drm module

The first solution is what I have achived based on my system, and I don’t need to re-install the nvidia-driver. Only add the following to /etc/modprobe.d/blacklist-nvidia.conf

blacklist nvidia-drm
alias nvidia-drm off

Then optionally (I don’t know why it’s needed or not, but both works) run the command update-initramfs, to re-genetate the initramfs.

I guess this would disable the nvidia-drm module, which used for the X-display related things, but not leave the other nvidia driver module to be enabled. Because, in last section, we can see that prime-select intel commnad also generate a black list to disable all nvidia and nvidia-drm and nvidia-modset module. To make the cuda program run, we need to keep the nvidia and nvidia-modeset by commenting them out of the blacklist.

#blacklist nvidia
blacklist nvidia-drm
#blacklist nvidia-modeset
#alias nvidia off
alias nvidia-drm off
#alias nvidia-modeset off

Use –no-drm option when install nvidia driver from a runfile.

I didn’t test it, but it should works like a charm, since blacklist the module works. The option explanation can be found by nvidia-installer -A commmand if you already had nvidia driver installed.

litao@deep: ~ $ nvidia-installer -A  | grep drm                       [1:21:06]
  --no-drm
      Do not install the nvidia-drm kernel module. This kernel module provides
      that run independently of X11. The '--no-drm' option should only be used
      to work around failures to build or install the nvidia-drm kernel module

Install the nvidia-headless-XXX driver package provided by apt

I didn’t test it, but it should work, the following is that the package said by apt show. And it should have the same functionality as –no-drm option when you use the runfile. Please google it before you try this method.

litao@deep: ~ $ apt show nvidia-headless-390                                                                                         [1:21:11]
Package: nvidia-headless-390
#.....ignore serveral un-useful lines.
Description: NVIDIA headless metapackage
 This metapackage installs the NVIDIA driver and the libraries that enable
 parallel general purpose computation through CUDA and
 OpenCL.
 .
 Install this package if you do not need X11 or Wayland support, which is
 provided by the nvidia-driver-390 metapackage.

Step 2, /etc/X11/xorg.conf

If you have blacklisted the nvidia-drm module, or just didn’t install it. Your xorg.conf should not contain nvidia dvice, and consider it as an intel only, like the following one.

Section "ServerLayout"
    Identifier     "Layout0"
    Screen         "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "intel"
    VendorName     "Intel"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
EndSection

If you have already blacklisted the nvidia-drm module, but has the nvidia on Device section of the xorg.conf, like the following. Then your blacked list will not have affect. You can see that nvidia-smi/glmark2/glxinfo still shows the info like the first section.

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    Screen      1  "Screen1"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "intel"
    VendorName     "Intel"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

And also, you can only see the screen output to the moniter connectting to nvidia gpu. Alternatively, you could see that lsmod | grep nvidia shows the following after you have nvidia in xorg.conf

nvidia_modeset       1040384  2
nvidia              16588800  125 nvidia_modeset
ipmi_msghandler        53248  2 ipmi_devintf,nvidia

If you addtionally blacklist the nvidia_modeset in the blacklist, and keep nvidia in xorg.conf. Then you will get a black screen in ubuntu login, and any monitor will not work (both the one connecting to nvidia gpu and the one connecting to motherboard).

To summary.

You need to blacklist nvidia-modeset and nvidia-drm, and make a intel-only xorg.conf The other alternatives are the following:

1 If you only blacklist the nvidia-drm, and use intel-only xorg.conf, that works.
2 If you only blacklist the nvidia-drm, and use nvidia-intel xorg.conf, that will not work. (has all the issues listed in #Issues section).
3 If you blacklist both nvidia-modeset and nvidia-drm and use the nvidia-intel xorg.conf, you will get a black screen. But nvidia-smi can work, no xorg process on gpu, cuda will also work.
4 If you don’t blacklist any nvidia (nvidia nvidia-drm nvidia-modeset), and use intel-only xorg.conf, you will get a black screen. But nvidia-smi can work, no xorg process on gpu, and cuda will also work.
5 If you don’t blacklist any nvidia (nvidia nvidia-drm nvidia-modeset), and use nvidia-intel xorg.conf, this is exactly same as option 2), use both nvidia for xorg and computing, opengl will not work, intel iGPU will not work.
6.If you blacklist wll nvidia (nvidia, nvidia-drm nvidia-modeset), the cuda will never work since no driver can be found.

Tips need to be considerd when install cuda

After you installed the nvidia-driver, you may also want to install a cuda toolkit used to develop cuda programm.

And when you do that, sometimes cuda installation tool will overwrite your driver install with the one attached with cuda installer. Please pay attention on that. This may overwrite. what you have already done I suggest to install cuda toolkit by running file, and the install guide can be found here. https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-advanced

I use this command, sudo cuda-linux.10.0.130-24817639.run --toolkit --samples --no-opengl-libs

How to test if I have successfully solve the problem?

glmark2 and glxinfo are showing right info about intel OpenGL like the following

litao@deep: ~ $ glmark2                                                                                                                                                                            [23:55:17]
=======================================================
 glmark2 2014.03+git20150611.fa71af2d
=======================================================
 OpenGL Information
 GL_VENDOR:     Intel Open Source Technology Center
 GL_RENDERER:   Mesa DRI Intel(R) HD Graphics 630 (Kaby Lake GT2) 
 GL_VERSION:    3.0 Mesa 18.0.5
=======================================================

litao@deep: ~ $ glxinfo | grep OpenGL | grep string                                                                                                          [0:11:11]
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 630 (Kaby Lake GT2) 
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.0.5
OpenGL core profile shading language version string: 4.50
OpenGL version string: 3.0 Mesa 18.0.5
OpenGL shading language version string: 1.30
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 18.0.5
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

nvidia-smi shows no xserver process and ‘no running process’ when you don’t run any cuda program.

litao@deep: ~ $ nvidia-smi                                                                                                                                                                         [23:56:38]
Fri Mar 15 23:56:50 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| 37%   31C    P0    26W / 120W |      0MiB /  6078MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

intel_gpu_top shows the following when you playing an chrome youtube video, it should using intel open gl to do haraware accelerate. sudo intel_gpu_top
lsmod to show loaded nvidia and intel mod driver.

litao@deep: ~ $ lsmod | grep intel                                                                                                                           [0:26:23]
intel_rapl             20480  0
intel_powerclamp       16384  0
kvm_intel             212992  0
kvm                   598016  1 kvm_intel
ghash_clmulni_intel    16384  0
aesni_intel           188416  3
aes_x86_64             20480  1 aesni_intel
crypto_simd            16384  1 aesni_intel
glue_helper            16384  1 aesni_intel
cryptd                 24576  3 crypto_simd,ghash_clmulni_intel,aesni_intel
snd_hda_intel          40960  8
intel_cstate           20480  0
intel_rapl_perf        16384  0
snd_hda_codec         126976  4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core           81920  5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
snd_pcm                98304  4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
btintel                16384  1 btusb
bluetooth             548864  33 btrtl,btintel,btbcm,bnep,btusb,rfcomm
snd                    81920  27 snd_hda_codec_generic,snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm,snd_rawmidi

Only nvidia was loaded, not any nvidia-drm thing.

itao@deep: ~ $ lsmod | grep nvidia                                                                                                                          [0:26:51]
nvidia              16588800  0
ipmi_msghandler        53248  2 ipmi_devintf,nvidia

Things does work, or work but in the wrong way.

prime-select command provided by nvidia-prime package. When use sudo prime-select intel do switch to the iGPU only, and disabls the nvidia gpu. I can observe that openGL part is correct, and I can use intel iGPU to accelerate chrome/video stuff. But when I run nvidia-smi, it gives an error like the following， and it errors by “can not find any cuda devices”, when I run deviceQuery sample or any cuda program.

litao@deep: ~ $ nvidia-smi                                                                                                                          
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

FAIL: 9

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
FAIL: 1

I tried to follow tthe first section which I found useful in old ubuntu (although I don’t remember why and how.) To add /usr/lib/x86_64-linux-gpu (where the libnvidia-ml.so was, found by locate libnvidia-ml.so), to /etc/ld.so.conf.d/nvidia.conf , and re-run the sudo ldconfig. I still got the same error when runing nvidia-smi.

The reason why prime-select intel works for OpenGL, and make the nvidia gpu totally lost, is because it just completely disbaled any nvidia driver (includeing nouveau and nvidia private driver).

The command will write the following content to /etc/modprobe.d/blacklist-nvidia.conf file,

blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
alias nvidia off
alias nvidia-drm off
alias nvidia-modeset off

then the commnad will re-generate the initfs by update-initramfs.

Addtionally, the command will add the kernel starting command line parameters nouveau.runpm=0 to /boot/grub/grub.cfg, like the following

<linux	/boot/vmlinuz-4.15.0-46-generic root=UUID=785c9aa3-fff4-4c78-b4b8-390619ea4184 ro  quiet splash $vt_handoff
===
>linux	/boot/vmlinuz-4.15.0-46-generic root=UUID=785c9aa3-fff4-4c78-b4b8-390619ea4184 ro  quiet splash nouveau.runpm=0 $vt_handoff

About the meaning, please google it, basically it means to disable the ‘nouveau’ driver(open source version of nvidia gpu driver).

Addtionally, the prime-select intel command will install a service to the system, which basically do the following during some phase in the startup stage (I don’t remember whem, but that’s what it does.). About the meaning of this, please see Ubuntu HybridGraphics
```
 echo OFF > /sys/kernel/debug/vgaswitcheroo/switch
```

Ref:

Nvidia-smi problem solved

Use intel for display nvidia for computing

OUTLINES

FEATURED TAGS