运维开发网

centos6: How to use kdump/crash to investigate an OOM problem?

运维开发网 https://www.qedev.com 2020-05-07 08:22 Source: Internet. Compiled by: 运维开发网
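Problem

A server crashed after a burst of "Out of memory" messages, and I am trying to find the culprit: which process if it is in userland, which kernel module if it is in the kernel.

Details

I am trying to figure out how to use the crash utility to investigate what triggered the OOM on the server.

As part of installing a pair of new servers, I started initializing a 14TB DRBD device. Around that time, while I was playing with the DRBD syncer rate configuration and taking some of the bonded network interfaces up and down, one of the servers crashed. Within 30 seconds it produced 39 "Out of memory: Kill process ####" messages, and then it panicked:

Kernel panic - not syncing: Out of memory and no killable processes...

The panic triggered kdump. I now have a nice vmcore.flat file that should be directly usable for investigating the problem, but I am having a hard time finding where all the memory went.

The only resource I know of is Dedoimedo's site, which has nice instructions and the Kernel Crash Book. These also happen to be the only resources suggested in answers, so I assume crash is the only way to investigate.

If there is another way to do a post-mortem of the event, I am open to it; crash is simply the only tool I know of. All I have is the vmcore.flat file, and what I need to know is which component took all the memory. I suspect a kernel module, more specifically one of: the bonding module (because the crash was triggered when I brought an interface down), the DRBD module (version 8.3.15, built out of tree on CentOS 6.3), or one of the 10G Ethernet modules (mlnx_en, built out of tree, which drives the interface I shut down, or the in-tree bnx2x, which drives the interface that stayed active). What I need to know is whether there is a way to confirm these suspicions.

So far I have only managed to extract the following information using the crash utility: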

Checking how much memory is used

$ crash /usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux vmcore.flat
....
crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  16482587      62.9 GB         ----
      FREE    54610     213.3 MB    0% of TOTAL MEM
      USED  16427977      62.7 GB   99% of TOTAL MEM
    SHARED     4683      18.3 MB    0% of TOTAL MEM
   BUFFERS      118       472 KB    0% of TOTAL MEM
    CACHED       82       328 KB    0% of TOTAL MEM
      SLAB    46635     182.2 MB    0% of TOTAL MEM

TOTAL SWAP        0            0         ----
 SWAP USED        0            0  100% of TOTAL SWAP
 SWAP FREE        0            0    0% of TOTAL SWAP

Clearly the machine ran out of memory. All 64G is gone... but where?
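One way to frame the question is to subtract every category that `kmem -i` does account for from the USED figure. The sketch below does that arithmetic with the page counts printed above. It assumes 4 KiB pages and that BUFFERS, CACHED and SLAB do not overlap, which is a simplification.

```python
# Page counts copied from the `kmem -i` output above.
PAGE_SIZE = 4096  # bytes; assumption: 4 KiB pages on x86_64

kmem_i = {
    "TOTAL MEM": 16482587,
    "FREE": 54610,
    "USED": 16427977,
    "BUFFERS": 118,
    "CACHED": 82,
    "SLAB": 46635,
}


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def unaccounted_pages(stats):
    """USED pages that are not explained by buffers, page cache, or slab."""
    return stats["USED"] - stats["BUFFERS"] - stats["CACHED"] - stats["SLAB"]


if __name__ == "__main__":
    missing = unaccounted_pages(kmem_i)
    print(f"unaccounted: {missing} pages = {pages_to_gib(missing):.1f} GiB")
```

With these numbers nearly all of USED (about 62.5 GiB) falls outside the categories that `kmem -i` breaks out, which is what makes a leak in kernel or driver allocations a plausible suspect.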

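Trying to see whether any process leaked memory

The only command that seems relevant is ps (crash's own ps subcommand). It does not show anything unusual, but it also does not show kernel threads.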

crash> ps
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
>     0      0   1  ffff88102c456040  RU   0.0       0      0  [swapper]
>     0      0   2  ffff88082c772aa0  RU   0.0       0      0  [swapper]
>     0      0   3  ffff88102c456aa0  RU   0.0       0      0  [swapper]
      0      0   4  ffff88082c7b8ae0  RU   0.0       0      0  [swapper]
>     0      0   5  ffff88102c457500  RU   0.0       0      0  [swapper]
>     0      0   6  ffff88082c7d6aa0  RU   0.0       0      0  [swapper]
>     0      0   7  ffff88102c506080  RU   0.0       0      0  [swapper]
>     0      0   8  ffff88082c016ae0  RU   0.0       0      0  [swapper]
>     0      0   9  ffff88102c506ae0  RU   0.0       0      0  [swapper]
>     0      0  10  ffff88082c05caa0  RU   0.0       0      0  [swapper]
>     0      0  11  ffff88102c507540  RU   0.0       0      0  [swapper]
>     0      0  12  ffff88082c09cae0  RU   0.0       0      0  [swapper]
.....
   4926      1   5  ffff880828a38ae0  ??   0.0       0      0  mingetty
   4928      1   1  ffff88102a4e8040  ??   0.0       0      0  mingetty
   4930      1  19  ffff880827af4080  ??   0.0       0      0  mingetty
   4932      1   2  ffff88100f122040  ??   0.0       0      0  mingetty
   4934      1  18  ffff8810296ea080  ??   0.0       0      0  mingetty
   4936   1047   4  ffff880ff342d540  IN   0.0   11184    948  udevd
   4937   1047   5  ffff88082a240080  IN   0.0   11184    948  udevd
   5060   3772   2  ffff88082881d540  ??   0.0       0      0  sshd
   5078      1   1  ffff88100f060ae0  ??   0.0       0      0  sshd
   5079      1   1  ffff88082b882ae0  ??   0.0       0      0  bash

If I filter out the kernel threads (which show zero %MEM anyway), we can see that I had almost nothing running at the end:

crash> ps -u
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff88082c41b500  ??   0.0   19348    348  init
   1047      1   2  ffff881029524040  IN   0.0   11188    948  udevd
   3171      1   3  ffff880826ccaaa0  IN   0.0   27636    240  auditd
   3172      1  17  ffff881029d1b500  IN   0.0   27636    240  auditd
>  3772      1   0  ffff88102b257500  RU   0.0   64072    668  sshd
   4800      1   0  ffff88100f061540  ??   0.0       0      0  dsm_om_shrsvcd
   4842      1  16  ffff88100f012ae0  ??   0.0       0      0  cmcld
   4854      1  17  ffff88082a241540  ??   0.0       0      0  cmlogd
   4855      1   3  ffff88082796cae0  ??   0.0       0      0  cmfileassistd
   4856      1  18  ffff88082809d500  ??   0.0       0      0  cmnetd
   4860      1   0  ffff88082705aae0  ??   0.0       0      0  cmresourced
   4924      1   9  ffff88102a4e8aa0  ??   0.0       0      0  mingetty
   4926      1   5  ffff880828a38ae0  ??   0.0       0      0  mingetty
   4928      1   1  ffff88102a4e8040  ??   0.0       0      0  mingetty
   4930      1  19  ffff880827af4080  ??   0.0       0      0  mingetty
   4932      1   2  ffff88100f122040  ??   0.0       0      0  mingetty
   4934      1  18  ffff8810296ea080  ??   0.0       0      0  mingetty
   4936   1047   4  ffff880ff342d540  IN   0.0   11184    948  udevd
   4937   1047   5  ffff88082a240080  IN   0.0   11184    948  udevd
   5060   3772   2  ffff88082881d540  ??   0.0       0      0  sshd
   5078      1   1  ffff88100f060ae0  ??   0.0       0      0  sshd
   5079      1   1  ffff88082b882ae0  ??   0.0       0      0  bash
   5257      1   1  ffff8808279e6aa0  ??   0.0       0      0  jnx_mlnxsnmp_da

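Update:

Including more output, as suggested by Soham. Unfortunately, I cannot draw any further conclusions from it. The best I can do is suspect that the kernel leaked the memory, since almost all user processes were already dead.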

The (almost full) output of log -m is here

crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}'
total 0.00391006GB

Note that at this point almost all userspace processes had already been killed, so low usage is expected.
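The same sum can be done outside crash once the `ps` output has been saved to a file. This is a minimal sketch, assuming the nine-column layout shown above (RSS in KiB in the eighth column) and a leading `>` on active tasks; `total_rss_kib` is a name made up for this example.

```python
def total_rss_kib(ps_output):
    """Sum the RSS column (KiB) of crash `ps` output, skipping header lines."""
    total = 0
    for line in ps_output.splitlines():
        # Active tasks are prefixed with '>', so drop it before splitting.
        fields = line.lstrip().lstrip(">").split()
        # Expected columns: PID PPID CPU TASK ST %MEM VSZ RSS COMM
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        total += int(fields[7])
    return total


# Three rows copied from the `ps -u` output above.
sample = """\
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff88082c41b500  ??   0.0   19348    348  init
   1047      1   2  ffff881029524040  IN   0.0   11188    948  udevd
>  3772      1   0  ffff88102b257500  RU   0.0   64072    668  sshd
"""

if __name__ == "__main__":
    kib = total_rss_kib(sample)
    print(f"total RSS: {kib} KiB = {kib / 1048576:.8f} GiB")
```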

Out of memory messages:

As I mentioned above, there were 39 "Out of memory:" messages. Here they are:

crash> log -m | grep Out
<3>[  223.556616] Out of memory: Kill process 3189 (portreserve) score 1 or sacrifice child
<3>[  223.787234] Out of memory: Kill process 3196 (rsyslogd) score 1 or sacrifice child
<3>[  224.237119] Out of memory: Kill process 3728 (dbus-daemon) score 1 or sacrifice child
<3>[  228.771770] Out of memory: Kill process 3758 (snmpd) score 1 or sacrifice child
<3>[  229.033466] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.257710] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.484321] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.711169] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.934955] Out of memory: Kill process 3801 (cmproxyd) score 1 or sacrifice child
<3>[  230.159542] Out of memory: Kill process 3812 (ntpd) score 1 or sacrifice child
<3>[  230.382083] Out of memory: Kill process 3953 (master) score 1 or sacrifice child
<3>[  230.606613] Out of memory: Kill process 3953 (master) score 1 or sacrifice child
<3>[  230.829515] Out of memory: Kill process 3953 (master) score 1 or sacrifice child
<3>[  230.832105] Out of memory: Kill process 3961 (crond) score 1 or sacrifice child
<3>[  236.749746] Out of memory: Kill process 3974 (atd) score 1 or sacrifice child
<3>[  236.969421] Out of memory: Kill process 4272 (dsm_sa_datamgrd) score 1 or sacrifice child
<3>[  237.192102] Out of memory: Kill process 4492 (dsm_sa_datamgrd) score 1 or sacrifice child
<3>[  237.746301] Out of memory: Kill process 4552 (dsm_sa_eventmgr) score 1 or sacrifice child
<3>[  237.968308] Out of memory: Kill process 4613 (dsm_sa_snmpd) score 1 or sacrifice child
<3>[  238.190550] Out of memory: Kill process 4614 (dsm_sa_snmpd) score 1 or sacrifice child
<3>[  238.644020] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child
<3>[  238.865658] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child
<3>[  251.285450] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child
<3>[  251.506601] Out of memory: Kill process 4800 (dsm_om_shrsvcd) score 1 or sacrifice child
<3>[  251.727570] Out of memory: Kill process 4842 (cmcld) score 1 or sacrifice child
<3>[  251.947085] Out of memory: Kill process 4842 (cmcld) score 1 or sacrifice child
<3>[  252.167096] Out of memory: Kill process 4854 (cmlogd) score 1 or sacrifice child
<3>[  252.384090] Out of memory: Kill process 4855 (cmfileassistd) score 1 or sacrifice child
<3>[  252.603324] Out of memory: Kill process 4924 (mingetty) score 1 or sacrifice child
<3>[  252.820757] Out of memory: Kill process 4926 (mingetty) score 1 or sacrifice child
<3>[  253.037558] Out of memory: Kill process 4928 (mingetty) score 1 or sacrifice child
<3>[  253.254908] Out of memory: Kill process 4930 (mingetty) score 1 or sacrifice child
<3>[  253.257391] Out of memory: Kill process 4932 (mingetty) score 1 or sacrifice child
<3>[  253.259357] Out of memory: Kill process 4934 (mingetty) score 1 or sacrifice child
<3>[  253.261353] Out of memory: Kill process 5060 (sshd) score 1 or sacrifice child
<3>[  253.263365] Out of memory: Kill process 5060 (sshd) score 1 or sacrifice child
<3>[  253.264392] Out of memory: Kill process 5079 (bash) score 1 or sacrifice child
<3>[  253.266352] Out of memory: Kill process 5257 (jnx_mlnxsnmp_da) score 1 or sacrifice child
<0>[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes...
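
To get a quick summary of a log like this (how many kills, over what time span, which commands were hit repeatedly), the lines can be parsed with a regular expression. This is a sketch that assumes the exact message format shown above; `summarize_oom` is a hypothetical helper name.

```python
import re
from collections import Counter

OOM_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] Out of memory: Kill process "
    r"(?P<pid>\d+) \((?P<comm>[^)]*)\)"
)


def summarize_oom(log_text):
    """Return (message count, seconds from first to last kill, Counter by command)."""
    hits = [m for m in map(OOM_RE.search, log_text.splitlines()) if m]
    if not hits:
        return 0, 0.0, Counter()
    stamps = [float(m.group("ts")) for m in hits]
    by_comm = Counter(m.group("comm") for m in hits)
    return len(hits), stamps[-1] - stamps[0], by_comm


# A few lines copied from the log above.
sample_log = """\
<3>[  223.556616] Out of memory: Kill process 3189 (portreserve) score 1 or sacrifice child
<3>[  229.033466] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<3>[  229.257710] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child
<0>[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes...
"""

if __name__ == "__main__":
    count, span, by_comm = summarize_oom(sample_log)
    print(count, f"{span:.3f}s", by_comm.most_common(3))
```

Commands that appear several times, such as xinetd in the log above, are cases where the OOM killer selected the same PID again before it had actually exited.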

sys output:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
    DUMPFILE: pcdata03.vmcore.flat  [PARTIAL DUMP]
        CPUS: 32
        DATE: Wed Feb  6 02:11:52 2013
      UPTIME: 00:04:12
LOAD AVERAGE: 3.03, 0.95, 0.34
       TASKS: 578
    NODENAME: ....
     RELEASE: 2.6.32-279.5.2.el6.x86_64
     VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 64 GB
       PANIC: "[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."

kmem -z

crash> kmem -z
NODE: 0  ZONE: 0  ADDR: ffff88000000a0c0  NAME: "DMA"
  SIZE: 4095  PRESENT: 3839  MIN/LOW/HIGH: 5/6/7
  VM_STAT:
          NR_FREE_PAGES: 3936
       NR_INACTIVE_ANON: 0
         NR_ACTIVE_ANON: 0
       NR_INACTIVE_FILE: 0
         NR_ACTIVE_FILE: 0
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 0
         NR_FILE_MAPPED: 0
          NR_FILE_PAGES: 0
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 0
  NR_SLAB_UNRECLAIMABLE: 0
           NR_PAGETABLE: 0
        NR_KERNEL_STACK: 0
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 0
               NUMA_HIT: 0
              NUMA_MISS: 0
           NUMA_FOREIGN: 0
    NUMA_INTERLEAVE_HIT: 0
             NUMA_LOCAL: 0
             NUMA_OTHER: 0
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 1  ADDR: ffff880000012780  NAME: "DMA32"
  SIZE: 1044480  PRESENT: 756520  MIN/LOW/HIGH: 1030/1287/1545
  VM_STAT:
          NR_FREE_PAGES: 30117
       NR_INACTIVE_ANON: 0
         NR_ACTIVE_ANON: 0
       NR_INACTIVE_FILE: 1
         NR_ACTIVE_FILE: 0
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 0
         NR_FILE_MAPPED: 0
          NR_FILE_PAGES: 1
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 4
  NR_SLAB_UNRECLAIMABLE: 4150
           NR_PAGETABLE: 0
        NR_KERNEL_STACK: 0
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 0
               NUMA_HIT: 575606
              NUMA_MISS: 3
           NUMA_FOREIGN: 0
    NUMA_INTERLEAVE_HIT: 0
             NUMA_LOCAL: 575598
             NUMA_OTHER: 11
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 2  ADDR: ffff88000001ae40  NAME: "Normal"
  SIZE: 7602176  PRESENT: 7498240  MIN/LOW/HIGH: 10217/12771/15325
  VM_STAT:
          NR_FREE_PAGES: 10443
       NR_INACTIVE_ANON: 134
         NR_ACTIVE_ANON: 197
       NR_INACTIVE_FILE: -47
         NR_ACTIVE_FILE: 42
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 219
         NR_FILE_MAPPED: 115
          NR_FILE_PAGES: 45
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 908
  NR_SLAB_UNRECLAIMABLE: 18771
           NR_PAGETABLE: 91
        NR_KERNEL_STACK: 556
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 0
               NR_SHMEM: 34
               NUMA_HIT: 8243991
              NUMA_MISS: 648
           NUMA_FOREIGN: 4593726
    NUMA_INTERLEAVE_HIT: 20066
             NUMA_LOCAL: 8243829
             NUMA_OTHER: 810
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 0  ZONE: 3  ADDR: ffff880000023500  NAME: "Movable"
  [unpopulated]

NODE: 1  ZONE: 0  ADDR: ffff880840000040  NAME: "DMA"
  [unpopulated]

NODE: 1  ZONE: 1  ADDR: ffff880840008700  NAME: "DMA32"
  [unpopulated]

NODE: 1  ZONE: 2  ADDR: ffff880840010dc0  NAME: "Normal"
  SIZE: 8388608  PRESENT: 8273920  MIN/LOW/HIGH: 11274/14092/16911
  VM_STAT:
          NR_FREE_PAGES: 10114
       NR_INACTIVE_ANON: 417
         NR_ACTIVE_ANON: 83
       NR_INACTIVE_FILE: 47
         NR_ACTIVE_FILE: 32
         NR_UNEVICTABLE: 0
               NR_MLOCK: 0
          NR_ANON_PAGES: 436
         NR_FILE_MAPPED: 22
          NR_FILE_PAGES: 154
          NR_FILE_DIRTY: 0
           NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 863
  NR_SLAB_UNRECLAIMABLE: 21939
           NR_PAGETABLE: 134
        NR_KERNEL_STACK: 27
        NR_UNSTABLE_NFS: 0
              NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 3
    NR_VMSCAN_IMMEDIATE: 5
      NR_WRITEBACK_TEMP: 0
       NR_ISOLATED_ANON: 0
       NR_ISOLATED_FILE: 23
               NR_SHMEM: 20
               NUMA_HIT: 4332488
              NUMA_MISS: 4593726
           NUMA_FOREIGN: 665
    NUMA_INTERLEAVE_HIT: 20007
             NUMA_LOCAL: 4309300
             NUMA_OTHER: 4616914
NR_ANON_TRANSPARENT_HUGEPAGES: 0

NODE: 1  ZONE: 3  ADDR: ffff880840019480  NAME: "Movable"
  [unpopulated]
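
The per-zone counters can be compared against each zone's PRESENT page count to see how much of the zone the VM statistics explain. The sketch below parses saved `kmem -z` output; the choice of which counters count as "accounted" is my own assumption, and NR_KERNEL_STACK is left out because it does not appear to be a page count (556 stacks correspond to 4448 kB in the kernel log).

```python
import re

ZONE_RE = re.compile(r'NODE:\s*(\d+)\s+ZONE:\s*(\d+).*NAME:\s*"([^"]+)"')
PRESENT_RE = re.compile(r"PRESENT:\s*(\d+)")
STAT_RE = re.compile(r"^\s*(NR_[A-Z_]+):\s*(-?\d+)\s*$")

# Assumption: these counters are in pages and do not overlap.
ACCOUNTED = (
    "NR_FREE_PAGES", "NR_ANON_PAGES", "NR_FILE_PAGES",
    "NR_SLAB_RECLAIMABLE", "NR_SLAB_UNRECLAIMABLE", "NR_PAGETABLE",
)


def parse_zones(text):
    """Parse `kmem -z` output into a list of zone dictionaries."""
    zones, current = [], None
    for line in text.splitlines():
        m = ZONE_RE.search(line)
        if m:
            current = {"node": int(m.group(1)), "name": m.group(3),
                       "present": 0, "stats": {}}
            zones.append(current)
            continue
        if current is None:
            continue
        m = PRESENT_RE.search(line)
        if m:
            current["present"] = int(m.group(1))
            continue
        m = STAT_RE.match(line)
        if m:
            current["stats"][m.group(1)] = int(m.group(2))
    return zones


def unexplained(zone):
    """Present pages not covered by the ACCOUNTED counters."""
    return zone["present"] - sum(zone["stats"].get(k, 0) for k in ACCOUNTED)


# Trimmed excerpt of the Node 0 output above.
sample = '''\
NODE: 0  ZONE: 2  ADDR: ffff88000001ae40  NAME: "Normal"
  SIZE: 7602176  PRESENT: 7498240  MIN/LOW/HIGH: 10217/12771/15325
  VM_STAT:
          NR_FREE_PAGES: 10443
          NR_ANON_PAGES: 219
          NR_FILE_PAGES: 45
    NR_SLAB_RECLAIMABLE: 908
  NR_SLAB_UNRECLAIMABLE: 18771
           NR_PAGETABLE: 91
NODE: 0  ZONE: 3  ADDR: ffff880000023500  NAME: "Movable"
  [unpopulated]
'''

if __name__ == "__main__":
    for z in parse_zones(sample):
        print(z["node"], z["name"], z["present"], unexplained(z))
```

For Node 0 Normal this leaves about 7.47 million of the 7.50 million present pages (roughly 28 GiB) unexplained by those counters, which matches the picture from `kmem -i`.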

kmem -f

crash> kmem -f
NODE
  0
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  0   DMA         4095    3936  ffffea0000000038      1000          0     
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff880000012128       2      2
  0       4k  ffff880000012138       0      0
  0       4k  ffff880000012148       0      0
  0       4k  ffff880000012158       0      0
  0       4k  ffff880000012168       0      0
  1       8k  ffff880000012180       1      2
  1       8k  ffff880000012190       0      0
  1       8k  ffff8800000121a0       0      0
  1       8k  ffff8800000121b0       0      0
  1       8k  ffff8800000121c0       0      0
  2      16k  ffff8800000121d8       1      4
  2      16k  ffff8800000121e8       0      0
  2      16k  ffff8800000121f8       0      0
  2      16k  ffff880000012208       0      0
  2      16k  ffff880000012218       0      0
  3      32k  ffff880000012230       1      8
  3      32k  ffff880000012240       0      0
  3      32k  ffff880000012250       0      0
  3      32k  ffff880000012260       0      0
  3      32k  ffff880000012270       0      0
  4      64k  ffff880000012288       1     16
  4      64k  ffff880000012298       0      0
  4      64k  ffff8800000122a8       0      0
  4      64k  ffff8800000122b8       0      0
  4      64k  ffff8800000122c8       0      0
  5     128k  ffff8800000122e0       0      0
  5     128k  ffff8800000122f0       0      0
  5     128k  ffff880000012300       0      0
  5     128k  ffff880000012310       0      0
  5     128k  ffff880000012320       0      0
  6     256k  ffff880000012338       1     64
  6     256k  ffff880000012348       0      0
  6     256k  ffff880000012358       0      0
  6     256k  ffff880000012368       0      0
  6     256k  ffff880000012378       0      0
  7     512k  ffff880000012390       0      0
  7     512k  ffff8800000123a0       0      0
  7     512k  ffff8800000123b0       0      0
  7     512k  ffff8800000123c0       0      0
  7     512k  ffff8800000123d0       0      0
  8    1024k  ffff8800000123e8       1    256
  8    1024k  ffff8800000123f8       0      0
  8    1024k  ffff880000012408       0      0
  8    1024k  ffff880000012418       0      0
  8    1024k  ffff880000012428       0      0
  9    2048k  ffff880000012440       0      0
  9    2048k  ffff880000012450       0      0
  9    2048k  ffff880000012460       0      0
  9    2048k  ffff880000012470       1    512
  9    2048k  ffff880000012480       0      0
 10    4096k  ffff880000012498       0      0
 10    4096k  ffff8800000124a8       0      0
 10    4096k  ffff8800000124b8       3   3072
 10    4096k  ffff8800000124c8       0      0
 10    4096k  ffff8800000124d8       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  1   DMA32     1044480   30117  ffffea0000038000    1000000        4095   
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff88000001a7e8      24     24
  0       4k  ffff88000001a7f8       4      4
  0       4k  ffff88000001a808      13     13
  0       4k  ffff88000001a818       0      0
  0       4k  ffff88000001a828       0      0
  1       8k  ffff88000001a840       2      4
  1       8k  ffff88000001a850       2      4
  1       8k  ffff88000001a860       4      8
  1       8k  ffff88000001a870       0      0
  1       8k  ffff88000001a880       0      0
  2      16k  ffff88000001a898       0      0
  2      16k  ffff88000001a8a8       3     12
  2      16k  ffff88000001a8b8       4     16
  2      16k  ffff88000001a8c8       0      0
  2      16k  ffff88000001a8d8       0      0
  3      32k  ffff88000001a8f0       0      0
  3      32k  ffff88000001a900       3     24
  3      32k  ffff88000001a910       3     24
  3      32k  ffff88000001a920       0      0
  3      32k  ffff88000001a930       0      0
  4      64k  ffff88000001a948       1     16
  4      64k  ffff88000001a958       3     48
  4      64k  ffff88000001a968       6     96
  4      64k  ffff88000001a978       0      0
  4      64k  ffff88000001a988       0      0
  5     128k  ffff88000001a9a0       0      0
  5     128k  ffff88000001a9b0       3     96
  5     128k  ffff88000001a9c0       7    224
  5     128k  ffff88000001a9d0       0      0
  5     128k  ffff88000001a9e0       0      0
  6     256k  ffff88000001a9f8       0      0
  6     256k  ffff88000001aa08       1     64
  6     256k  ffff88000001aa18       6    384
  6     256k  ffff88000001aa28       0      0
  6     256k  ffff88000001aa38       0      0
  7     512k  ffff88000001aa50       1    128
  7     512k  ffff88000001aa60       0      0
  7     512k  ffff88000001aa70       8   1024
  7     512k  ffff88000001aa80       0      0
  7     512k  ffff88000001aa90       0      0
  8    1024k  ffff88000001aaa8       1    256
  8    1024k  ffff88000001aab8       1    256
  8    1024k  ffff88000001aac8       5   1280
  8    1024k  ffff88000001aad8       0      0
  8    1024k  ffff88000001aae8       0      0
  9    2048k  ffff88000001ab00       0      0
  9    2048k  ffff88000001ab10       1    512
  9    2048k  ffff88000001ab20       3   1536
  9    2048k  ffff88000001ab30       1    512
  9    2048k  ffff88000001ab40       0      0
 10    4096k  ffff88000001ab58       0      0
 10    4096k  ffff88000001ab68       0      0
 10    4096k  ffff88000001ab78      22  22528
 10    4096k  ffff88000001ab88       1   1024
 10    4096k  ffff88000001ab98       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  2   Normal    7602176   10443  ffffea0003800000   100000000     1048575  
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff880000022ea8     365    365
  0       4k  ffff880000022eb8     274    274
  0       4k  ffff880000022ec8     274    274
  0       4k  ffff880000022ed8       0      0
  0       4k  ffff880000022ee8       0      0
  1       8k  ffff880000022f00      99    198
  1       8k  ffff880000022f10      94    188
  1       8k  ffff880000022f20     360    720
  1       8k  ffff880000022f30       0      0
  1       8k  ffff880000022f40       0      0
  2      16k  ffff880000022f58      30    120
  2      16k  ffff880000022f68      41    164
  2      16k  ffff880000022f78     204    816
  2      16k  ffff880000022f88       0      0
  2      16k  ffff880000022f98       0      0
  3      32k  ffff880000022fb0       9     72
  3      32k  ffff880000022fc0      19    152
  3      32k  ffff880000022fd0     138   1104
  3      32k  ffff880000022fe0       0      0
  3      32k  ffff880000022ff0       0      0
  4      64k  ffff880000023008       7    112
  4      64k  ffff880000023018       4     64
  4      64k  ffff880000023028      77   1232
  4      64k  ffff880000023038       0      0
  4      64k  ffff880000023048       0      0
  5     128k  ffff880000023060       3     96
  5     128k  ffff880000023070       3     96
  5     128k  ffff880000023080      43   1376
  5     128k  ffff880000023090       0      0
  5     128k  ffff8800000230a0       0      0
  6     256k  ffff8800000230b8       0      0
  6     256k  ffff8800000230c8       0      0
  6     256k  ffff8800000230d8      13    832
  6     256k  ffff8800000230e8       0      0
  6     256k  ffff8800000230f8       0      0
  7     512k  ffff880000023110       0      0
  7     512k  ffff880000023120       0      0
  7     512k  ffff880000023130       5    640
  7     512k  ffff880000023140       0      0
  7     512k  ffff880000023150       0      0
  8    1024k  ffff880000023168       0      0
  8    1024k  ffff880000023178       0      0
  8    1024k  ffff880000023188       0      0
  8    1024k  ffff880000023198       0      0
  8    1024k  ffff8800000231a8       0      0
  9    2048k  ffff8800000231c0       0      0
  9    2048k  ffff8800000231d0       0      0
  9    2048k  ffff8800000231e0       1    512
  9    2048k  ffff8800000231f0       0      0
  9    2048k  ffff880000023200       0      0
 10    4096k  ffff880000023218       0      0
 10    4096k  ffff880000023228       0      0
 10    4096k  ffff880000023238       0      0
 10    4096k  ffff880000023248       1   1024
 10    4096k  ffff880000023258       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  3   Movable        0       0         0               0            0     

--------------------------------------------------------------------------

NODE
  1
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  0   DMA            0       0         0               0            0     

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  1   DMA32          0       0         0               0            0     

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  2   Normal    8388608   10114  ffffea001ce00000   840000000        0     
AREA    SIZE  FREE_AREA_STRUCT  BLOCKS  PAGES
  0       4k  ffff880840018e28     405    405
  0       4k  ffff880840018e38     162    162
  0       4k  ffff880840018e48     317    317
  0       4k  ffff880840018e58       0      0
  0       4k  ffff880840018e68       0      0
  1       8k  ffff880840018e80     106    212
  1       8k  ffff880840018e90      70    140
  1       8k  ffff880840018ea0     269    538
  1       8k  ffff880840018eb0       0      0
  1       8k  ffff880840018ec0       0      0
  2      16k  ffff880840018ed8      24     96
  2      16k  ffff880840018ee8      18     72
  2      16k  ffff880840018ef8     207    828
  2      16k  ffff880840018f08       0      0
  2      16k  ffff880840018f18       0      0
  3      32k  ffff880840018f30      20    160
  3      32k  ffff880840018f40       4     32
  3      32k  ffff880840018f50     148   1184
  3      32k  ffff880840018f60       0      0
  3      32k  ffff880840018f70       0      0
  4      64k  ffff880840018f88      17    272
  4      64k  ffff880840018f98       2     32
  4      64k  ffff880840018fa8      95   1520
  4      64k  ffff880840018fb8       0      0
  4      64k  ffff880840018fc8       0      0
  5     128k  ffff880840018fe0       4    128
  5     128k  ffff880840018ff0       1     32
  5     128k  ffff880840019000      37   1184
  5     128k  ffff880840019010       0      0
  5     128k  ffff880840019020       0      0
  6     256k  ffff880840019038       0      0
  6     256k  ffff880840019048       0      0
  6     256k  ffff880840019058       8    512
  6     256k  ffff880840019068       0      0
  6     256k  ffff880840019078       0      0
  7     512k  ffff880840019090       0      0
  7     512k  ffff8808400190a0       0      0
  7     512k  ffff8808400190b0       1    128
  7     512k  ffff8808400190c0       0      0
  7     512k  ffff8808400190d0       0      0
  8    1024k  ffff8808400190e8       0      0
  8    1024k  ffff8808400190f8       0      0
  8    1024k  ffff880840019108       1    256
  8    1024k  ffff880840019118       0      0
  8    1024k  ffff880840019128       0      0
  9    2048k  ffff880840019140       0      0
  9    2048k  ffff880840019150       0      0
  9    2048k  ffff880840019160       1    512
  9    2048k  ffff880840019170       1    512
  9    2048k  ffff880840019180       0      0
 10    4096k  ffff880840019198       0      0
 10    4096k  ffff8808400191a8       0      0
 10    4096k  ffff8808400191b8       0      0
 10    4096k  ffff8808400191c8       1   1024
 10    4096k  ffff8808400191d8       0      0

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  3   Movable        0       0         0               0            0     

nr_free_pages: 54610  (found 54742)

A little exercise with foreach bt

crash> foreach bt | awk '$1 == "#0" { $2 = ""; print }' | sort | uniq -c
     31 #0  crash_nmi_callback at ffffffff81029df6
      1 #0  machine_kexec at ffffffff8103281b
    546 #0  schedule at ffffffff814fda62

So effectively, the tasks had either already crashed or were waiting for memory (or I am not reading this correctly).
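
The awk pipeline above can also be written as a small script that runs on saved `foreach bt` output. This is a sketch under the assumption that each frame line has the form `#N [stack address] function at address`, which is what the awk script implies by blanking the second field. The stack addresses and the `#1` frame in the sample are placeholders, not values from the dump.

```python
import re
from collections import Counter

FRAME0_RE = re.compile(r"^\s*#0\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")


def top_frames(bt_output):
    """Count how many tasks have each function as their innermost (#0) frame."""
    counts = Counter()
    for line in bt_output.splitlines():
        m = FRAME0_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Illustrative input: function names and text addresses are taken from the
# summary above; stack addresses and the #1 frame are made-up placeholders.
sample = """\
PID: 1      TASK: ffff88082c41b500  CPU: 1   COMMAND: "init"
 #0 [ffff880000000000] schedule at ffffffff814fda62
 #1 [ffff880000000010] example_caller at ffffffff00000000
PID: 1047   TASK: ffff881029524040  CPU: 2   COMMAND: "udevd"
 #0 [ffff880000000020] schedule at ffffffff814fda62
PID: 3772   TASK: ffff88102b257500  CPU: 0   COMMAND: "sshd"
 #0 [ffff880000000030] machine_kexec at ffffffff8103281b
"""

if __name__ == "__main__":
    for func, n in top_frames(sample).most_common():
        print(f"{n:7d} {func}")
```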

Check the 20 largest consumers of physical memory (resident set size).

crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20

Check the number of huge pages.

crash> p -d nr_huge_pages

Update:

A) The crash dump was captured from the following kernel version.

$ crash --osrelease vmcore.flat
2.6.32-279.5.2.el6.x86_64

B) Let's extract the vmlinux file from the kernel-debug-debuginfo package.

$ rpm2cpio kernel-debug-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm | \
  cpio -idv ./usr/lib/debug/lib/modules/*/vmlinux

C) Open the vmcore file with the crash utility.

$ bunzip2 vmcore.flat.bz2
$ crash vmcore.flat ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux

D) System information.

crash> sys
      KERNEL: ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
    DUMPFILE: vmcore.flat  [PARTIAL DUMP]
        CPUS: 32
        DATE: Tue Feb  5 12:11:52 2013
      UPTIME: 00:04:12
LOAD AVERAGE: 3.03, 0.95, 0.34
       TASKS: 578
    NODENAME: ...
     RELEASE: 2.6.32-279.5.2.el6.x86_64
     VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012
     MACHINE: x86_64  (2700 Mhz)
      MEMORY: 64 GB
       PANIC: "[  253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."

a) The panic occurred because the system ran out of memory, even though the "panic_on_oom" parameter was disabled on the system.

crash> p -d sysctl_panic_on_oom
sysctl_panic_on_oom = $6 = 0

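This parameter enables or disables the panic-on-out-of-memory feature. If it is set to 0, the kernel invokes the oom_killer, which kills some rogue process. Usually the oom_killer can kill the rogue process and the system survives. If it is set to 1, the kernel panics when an out-of-memory condition occurs.

b) So how did we end up capturing a vmcore for an OOM event?

Well, let's check the mm/oom_kill.c source code. It says that if there is nothing left on the system to kill, the kernel either hangs or panics.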

++++++
        /* Found nothing?!?! Either we hang forever, or we panic. */
        if (!p) {
            read_unlock(&tasklist_lock);
            cpuset_unlock();
            panic("Out of memory and no killable processes...\n");  /* <<<------ */
        }
++++++

So we reached the panic path, and because the kdump service was enabled on this system, a vmcore was captured.

E) Let's check the kernel ring buffer,

crash> log
[..]
[  253.351427] Node 0 DMA free:15744kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15356kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  253.352234] lowmem_reserve[]: 0 2955 32245 32245
[  253.352812] Node 0 DMA32 free:120436kB min:4120kB low:5148kB high:6180kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3026080kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:16600kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1 all_unreclaimable? no
[  253.353637] lowmem_reserve[]: 0 0 29290 29290
[  253.354216] Node 0 Normal free:40580kB min:40868kB low:51084kB high:61300kB active_anon:956kB inactive_anon:536kB active_file:260kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29992960kB mlocked:0kB dirty:0kB writeback:0kB mapped:460kB shmem:136kB slab_reclaimable:3640kB slab_unreclaimable:75128kB kernel_stack:4448kB pagetables:428kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  253.355047] lowmem_reserve[]: 0 0 0 0
[  253.355624] Node 1 Normal free:39896kB min:45096kB low:56368kB high:67644kB active_anon:412kB inactive_anon:1668kB active_file:288kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):220kB present:33095680kB mlocked:0kB dirty:0kB writeback:0kB mapped:92kB shmem:80kB slab_reclaimable:3496kB slab_unreclaimable:87864kB kernel_stack:216kB pagetables:564kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  253.356457] lowmem_reserve[]: 0 0 0 0
[  253.357034] Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15744kB
[  253.358351] Node 0 DMA32: 41*4kB 8*8kB 7*16kB 6*32kB 10*64kB 10*128kB 7*256kB 9*512kB 7*1024kB 5*2048kB 23*4096kB = 120468kB
[  253.359674] Node 0 Normal: 718*4kB 558*8kB 278*16kB 169*32kB 88*64kB 47*128kB 13*256kB 5*512kB 0*1024kB 1*2048kB 1*4096kB = 40872kB
[  253.360995] Node 1 Normal: 876*4kB 447*8kB 249*16kB 174*32kB 116*64kB 40*128kB 8*256kB 1*512kB 1*1024kB 2*2048kB 1*4096kB = 40952kB
[  253.362319] 154 total pagecache pages
[  253.362502] 0 pages in swap cache
[  253.362684] Swap cache stats: add 0, delete 0, find 0/0
[  253.362869] Free swap  = 0kB
[  253.363050] Total swap = 0kB
[  253.526814] 16777215 pages RAM
[  253.526999] 294628 pages reserved
[  253.527190] 114911 pages shared
[  253.527372] 16392561 pages non-shared
[..]
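
The buddy allocator lines in this log (the ones ending in `= ...kB`) are easy to sanity-check: each term is a block count multiplied by a block size. The sketch below recomputes the total for such a line and returns it next to the value the kernel printed; `check_buddy_line` is a hypothetical helper name.

```python
import re

BUDDY_RE = re.compile(r"(Node \d+ \w+): ((?:\d+\*\d+kB ?)+)= (\d+)kB")
TERM_RE = re.compile(r"(\d+)\*(\d+)kB")


def check_buddy_line(line):
    """Return (zone, computed_kB, reported_kB), or None if the line does not match."""
    m = BUDDY_RE.search(line)
    if not m:
        return None
    zone, terms, reported = m.group(1), m.group(2), int(m.group(3))
    computed = sum(int(n) * int(size) for n, size in TERM_RE.findall(terms))
    return zone, computed, reported


# Two lines copied from the log above.
dma = ("[  253.357034] Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 1*64kB 0*128kB "
       "1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15744kB")
dma32 = ("[  253.358351] Node 0 DMA32: 41*4kB 8*8kB 7*16kB 6*32kB 10*64kB "
         "10*128kB 7*256kB 9*512kB 7*1024kB 5*2048kB 23*4096kB = 120468kB")

if __name__ == "__main__":
    for line in (dma, dma32):
        print(check_buddy_line(line))
```

Note also that in the zone summary lines above, both Normal zones report free memory below their min watermark (40580kB against 40868kB on node 0, 39896kB against 45096kB on node 1).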

F) Let's check the memory status of the system at the time of the crash.

crash> kmem -i
              PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  16482587      62.9 GB         ----        -------------------------------+
      FREE    54610     213.3 MB    0% of TOTAL MEM                                  |
      USED  16427977      62.7 GB   99% of TOTAL MEM                                 |
    SHARED     4683      18.3 MB    0% of TOTAL MEM                                  |
   BUFFERS      118       472 KB    0% of TOTAL MEM                                  |
    CACHED       82       328 KB    0% of TOTAL MEM                                  |
      SLAB    46635     182.2 MB    0% of TOTAL MEM                                  |
                                                                                     |
TOTAL SWAP        0            0         ----         ----------------------+        |
 SWAP USED        0            0  100% of TOTAL SWAP                        |        |
 SWAP FREE        0            0    0% of TOTAL SWAP                        |        |
                                                                            |        | 
                                                                            |        |
crash> p -d totalram_pages                                                  |        |
totalram_pages = $5 = 16482587                                              |        |
                                                                            |        |
crash> !echo "scale=5;(16482587*4096)/2^30"|bc -q                           |        |
62.87607                  <<<-----[ Total physical memory is 62.9 GB ] <<<--|--------+
                                                                            |
crash> p -d total_swap_pages                                                |
total_swap_pages = $6 = 0 <<<------[ No Swap on the system ]  <<<-----------+

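> We have ~63 GiB of physical memory in total.

> No swap partition or swap file was created, so this server has no swap.

> Very little memory is used for the page cache (328 KB) and for buffers (472 KB).

> Memory used by slab is also very low, only 182.2 MB.

G) Total memory allocated to processes is 0.00391006 GiB.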

crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}'
total 0.00391006GB

H) Application processes were not using the memory on the system.

crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20
965 2 21 ffff8808292f1500 IN 0.0 0 0 MB [ext4-dio-unwrit]
966 2 22 ffff8808292d4080 IN 0.0 0 0 MB [ext4-dio-unwrit]
967 2 23 ffff8808292ce040 IN 0.0 0 0 MB [ext4-dio-unwrit]
968 2 24 ffff8808299b5540 IN 0.0 0 0 MB [ext4-dio-unwrit]
969 2 25 ffff880829aa6040 IN 0.0 0 0 MB [ext4-dio-unwrit]
970 2 26 ffff880827367500 IN 0.0 0 0 MB [ext4-dio-unwrit]
971 2 27 ffff880827366aa0 IN 0.0 0 0 MB [ext4-dio-unwrit]
972 2 28 ffff880827366040 IN 0.0 0 0 MB [ext4-dio-unwrit]
97 2 23 ffff88082c1ac080 IN 0.0 0 0 MB [ksoftirqd/23]
973 2 29 ffff880827371540 IN 0.0 0 0 MB [ext4-dio-unwrit]
974 2 30 ffff880827370ae0 IN 0.0 0 0 MB [ext4-dio-unwrit]
975 2 31 ffff880827370080 IN 0.0 0 0 MB [ext4-dio-unwrit]
98 2 23 ffff88082c1bb500 IN 0.0 0 0 MB [watchdog/23]
99 2 24 ffff88082c1baaa0 IN 0.0 0 0 MB [migration/24]
3171 1 3 ffff880826ccaaa0 IN 0.0 27636 0.234375 MB auditd
1 0 1 ffff88082c41b500 UN 0.0 19348 0.339844 MB init
3772 1 0 ffff88102b257500 RU 0.0 64072 0.652344 MB sshd
1047 1 2 ffff881029524040 IN 0.0 11188 0.925781 MB udevd
4936 1047 4 ffff880ff342d540 IN 0.0 11184 0.925781 MB udevd
4937 1047 5 ffff88082a240080 IN 0.0 11184 0.925781 MB udevd

I) Let's verify the memory tuning parameters on the system.

crash> p -d sysctl_overcommit_memory
sysctl_overcommit_memory = $7 = 0

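This value contains a flag that controls memory overcommit. When the flag is 0, the kernel tries to estimate how much free memory is left whenever userspace requests more memory.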

crash> p -d sysctl_overcommit_ratio
sysctl_overcommit_ratio = $8 = 50

When overcommit_memory is set to 2, the committed address space is not allowed to exceed swap plus this percentage of physical RAM.
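
For reference, the limit that sentence describes can be computed from values already read out of the dump. This is a sketch of the commit limit as I understand it for strict overcommit (mode 2): swap plus overcommit_ratio percent of RAM, excluding hugetlb pages. The hugetlb page count is not known for this dump, so 0 is assumed.

```python
PAGE_SIZE = 4096  # bytes; assumption: 4 KiB pages


def commit_limit_pages(total_ram_pages, swap_pages, overcommit_ratio,
                       hugetlb_pages=0):
    """Commit limit in pages for overcommit_memory = 2 (strict mode)."""
    return (total_ram_pages - hugetlb_pages) * overcommit_ratio // 100 + swap_pages


if __name__ == "__main__":
    # totalram_pages, total_swap_pages and sysctl_overcommit_ratio from the dump.
    limit = commit_limit_pages(16482587, 0, 50)
    print(f"{limit} pages = {limit * PAGE_SIZE / 2**30:.1f} GiB")
```

Since sysctl_overcommit_memory is 0 here, this limit was not being enforced; the number only shows what mode 2 would have allowed.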

crash> p -d zone_reclaim_mode 
zone_reclaim_mode = $4 = 0

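zone_reclaim_mode lets you choose a more or less aggressive approach to reclaiming memory when a zone runs out of memory. If it is set to zero, no zone reclaim occurs.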

crash> p -d min_free_kbytes
min_free_kbytes = $3 = 90112  <<<--------[ 88 MB ]

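The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark for each low memory zone, and each zone is then assigned a number of reserved free pages proportional to its size. Setting this parameter either too low or too high can cause damage.

In other words, setting min_free_kbytes too low prevents the system from reclaiming memory, which can lead to system hangs and to the OOM killer terminating multiple processes. Setting it too high (5-10% of total system memory) will cause the system to run out of memory immediately. Linux is designed to use all available RAM to cache file system data, and a high min_free_kbytes value makes the system spend too much time reclaiming memory.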
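
The proportional split described above can be checked against the MIN/LOW/HIGH values that `kmem -z` printed. The sketch below is modeled on my recollection of how 2.6.32-era kernels derive per-zone watermarks (min proportional to the zone's present pages, low = min + min/4, high = min + min/2), so treat the formula as an assumption. It ignores highmem zones, which do not exist on x86_64.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages

# PRESENT page counts of the populated zones, from `kmem -z`.
present = {
    "Node 0 DMA": 3839,
    "Node 0 DMA32": 756520,
    "Node 0 Normal": 7498240,
    "Node 1 Normal": 8273920,
}


def zone_watermarks(min_free_kbytes, present_by_zone):
    """Return {zone: (min, low, high)} watermarks in pages."""
    pages_min = min_free_kbytes // PAGE_KIB
    lowmem_pages = sum(present_by_zone.values())
    marks = {}
    for name, zone_pages in present_by_zone.items():
        wmin = pages_min * zone_pages // lowmem_pages
        marks[name] = (wmin, wmin + wmin // 4, wmin + wmin // 2)
    return marks


if __name__ == "__main__":
    for name, (wmin, low, high) in zone_watermarks(90112, present).items():
        print(f"{name}: MIN/LOW/HIGH {wmin}/{low}/{high}")
```

With min_free_kbytes = 90112 this reproduces the triples in the `kmem -z` output (for example 10217/12771/15325 for Node 0 Normal), which suggests the watermarks themselves were set up normally.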

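The values of the above parameters look fine, so where did the memory go?

Assumptions:

> The main culprit is not in user space. In my experience, unaccounted memory is caused by the Mellanox and DRBD modules, but I am not sure about your case.
> Most pages were dropped from the vmcore file to reduce its size (core_collector makedumpfile -d 31 -c), so I cannot check the huge page size.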
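
To make the `-d 31` setting concrete: the makedumpfile dump level is a bitmask of page classes to leave out of the dump. The sketch below decodes a level using the bit meanings as I recall them from the makedumpfile manual page; the wording of each label is my own.

```python
# Bit meanings of the makedumpfile dump level, per its manual page.
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "cache pages without private flag",
    4: "cache pages with private flag",
    8: "user process data pages",
    16: "free pages",
}


def excluded_page_types(level):
    """List the page classes that `makedumpfile -d <level>` filters out."""
    if not 0 <= level <= 31:
        raise ValueError("dump level must be between 0 and 31")
    return [name for bit, name in DUMP_LEVEL_BITS.items() if level & bit]


if __name__ == "__main__":
    for name in excluded_page_types(31):
        print("excluded:", name)
```

At level 31 all five classes are filtered out, which is consistent with the [PARTIAL DUMP] marker in the sys output.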
