诊断需求
当MogDB数据库由于某种原因占用了较大的系统负载,比如CPU占用接近100%,那么如何知道到底是数据库里的哪个会话或者哪些会话占用了资源?
在Oracle数据库中,这样的问题诊断,通常都会关联 vsession, vprocess, 以及操作系统top命令或者ps命令中查到的操作系统进程ID。
但是MogDB本身是线程模型,在操作系统上只能看到一个进程号,那么该如何定位问题?
因为MogDB使用的是线程模型,与PostgreSQL的进程模型不同,在操作系统级别如果用lsof命令查看网络端口的接入,虽然是会看到有多个用户会话接入,但是在PID列只会显示进程ID,无法对应到线程。
$ lsof -i 4 -a -p `pgrep -u omm3 mogdb` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME mogdb 12027 omm3 8u IPv4 20313752 0t0 TCP *:biimenu (LISTEN) mogdb 12027 omm3 9u IPv4 20313753 0t0 TCP *:18001 (LISTEN) mogdb 12027 omm3 325u IPv4 28320946 0t0 TCP mogdb-kernel-0004:biimenu->softbank060132034173.bbtec.net:45206 (ESTABLISHED) mogdb 12027 omm3 330u IPv4 28316174 0t0 TCP mogdb-kernel-0004:biimenu->softbank060132034173.bbtec.net:45208 (ESTABLISHED) mogdb 12027 omm3 336u IPv4 28302815 0t0 TCP mogdb-kernel-0004:biimenu->softbank060132034173.bbtec.net:45210 (ESTABLISHED) mogdb 12027 omm3 340u IPv4 28323140 0t0 TCP mogdb-kernel-0004:biimenu->softbank060132034173.bbtec.net:45212 (ESTABLISHED) mogdb 12027 omm3 360u IPv4 28323141 0t0 TCP mogdb-kernel-0004:biimenu->softbank060132034173.bbtec.net:45214 (ESTABLISHED) mogdb 12027 omm3 375u IPv4 28305050 0t0 TCP mogdb-kernel-0004:biimenu->softbank060132034173.bbtec.net:45216 (ESTABLISHED)
如何获取线程ID
可以使用htop,打开htop界面以后,按F5进行进程树形显示,第一个PID是进程号,而树形结构下的每一行的PID都是该进程中对应的线程号。
也可以直接使用ps命令。-L参数表示显示线程,-o参数可以指定显示感兴趣的列。
# ps -Lp `pgrep -u omm3 mogdb` -o %cpu,tid,pid,ppid,cmd,comm %CPU TID PID PPID CMD COMMAND 0.0 17847 17847 1 /opt/mogdb3/app/bin/mogdb - mogdb 0.0 17848 17847 1 /opt/mogdb3/app/bin/mogdb - jemalloc_bg_thd 0.0 17854 17847 1 /opt/mogdb3/app/bin/mogdb - mogdb 0.0 17855 17847 1 /opt/mogdb3/app/bin/mogdb - syslogger 0.0 17856 17847 1 /opt/mogdb3/app/bin/mogdb - reaper 0.0 17857 17847 1 /opt/mogdb3/app/bin/mogdb - jemalloc_bg_thd 0.0 17858 17847 1 /opt/mogdb3/app/bin/mogdb - jemalloc_bg_thd 0.0 17860 17847 1 /opt/mogdb3/app/bin/mogdb - jemalloc_bg_thd 0.0 17884 17847 1 /opt/mogdb3/app/bin/mogdb - checkpointer 0.0 17885 17847 1 /opt/mogdb3/app/bin/mogdb - Spbgwriter 0.1 17886 17847 1 /opt/mogdb3/app/bin/mogdb - pagewriter 0.0 17887 17847 1 /opt/mogdb3/app/bin/mogdb - pagewriter 0.0 17888 17847 1 /opt/mogdb3/app/bin/mogdb - pagewriter 0.0 17889 17847 1 /opt/mogdb3/app/bin/mogdb - pagewriter 0.0 17890 17847 1 /opt/mogdb3/app/bin/mogdb - pagewriter 0.8 17891 17847 1 /opt/mogdb3/app/bin/mogdb - WALwriter 0.0 17892 17847 1 /opt/mogdb3/app/bin/mogdb - WALwriteraux 0.0 17893 17847 1 /opt/mogdb3/app/bin/mogdb - AVClauncher 0.0 17894 17847 1 /opt/mogdb3/app/bin/mogdb - Jobscheduler 0.0 17895 17847 1 /opt/mogdb3/app/bin/mogdb - asyncundolaunch 0.0 17896 17847 1 /opt/mogdb3/app/bin/mogdb - globalstats 0.0 17897 17847 1 /opt/mogdb3/app/bin/mogdb - applylauncher 0.0 17898 17847 1 /opt/mogdb3/app/bin/mogdb - statscollector 0.0 17899 17847 1 /opt/mogdb3/app/bin/mogdb - snapshotworker 0.1 17900 17847 1 /opt/mogdb3/app/bin/mogdb - TrackStmtWorker 0.0 17901 17847 1 /opt/mogdb3/app/bin/mogdb - 2pccleaner 0.0 17902 17847 1 /opt/mogdb3/app/bin/mogdb - faultmonitor 0.0 17904 17847 1 /opt/mogdb3/app/bin/mogdb - undorecycler 0.0 18372 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18373 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18374 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18375 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18376 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18377 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18378 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18379 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18380 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18381 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18382 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18454 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 19475 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 19480 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 29529 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 30999 17847 1 /opt/mogdb3/app/bin/mogdb - worker
comm列中显示为woker的都是用户会话的数据库端后台进程,通常占用CPU较高的会是用户会话,因此可以用grep命令来过滤只显示用户会话。
# ps -Lp `pgrep -u omm3 mogdb` -o %cpu,tid,pid,ppid,cmd,comm | grep -w worker 0.0 18372 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18373 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18374 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18375 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18376 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18377 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18378 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18379 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18380 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18381 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18382 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 18454 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 19475 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 19480 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 29529 17847 1 /opt/mogdb3/app/bin/mogdb - worker 0.0 30999 17847 1 /opt/mogdb3/app/bin/mogdb - worker
如何将操作系统线程ID跟数据库会话对应
假设在这台服务器上,线程ID=18372的线程占用了大量CPU。
在MogDB里查询pg_os_threads视图,可以获得该线程在数据库中对应的会话ID 。
MogDB=# select * from pg_os_threads where lwpid=18372; node_name | pid | lwpid | thread_name | creation_time -----------+-----------------+-------+-------------+------------------------------- dn_6001 | 140545137571584 | 18372 | dn_6001 | 2022-05-30 19:54:42.459129+08 (1 row)
pg_os_threads视图里记录了轻量级线程号和会话ID的关系,其中lwpid是操作系统线程ID,pid列是数据库会话ID。具体信息可以参考MogDB文档-PG_OS_THREADS。
如果具备monadmin权限,也可以查询dbe_perf模式下的os_threads视图,信息是一样的。
找到数据库会话ID之后,就可以为所欲为了,比如可以查询dbe_perf.session_stat_activity视图来获取该会话的应用名,发起的客户端IP地址,还有该会话正在执行的SQL。
MogDB=# select application_name,client_addr,query from dbe_perf.session_stat_activity where pid=140545137571584; application_name | client_addr | query ------------------+--------------+--------------------------------------------------------- dn_6001 | 172.16.0.176 | SELECT cfg_value FROM bmsql_config WHERE cfg_name = $1 (1 row)
也可以查询dbe_perf.thread_wait_status视图来获取会话的当前等待事件。BTW,实际上该视图中有lwtid字段,可以直接对应到线程ID。
MogDB=# select lwtid,wait_status,wait_event from dbe_perf.thread_wait_status where sessionid=140545137571584; lwtid | wait_status | wait_event -------+-------------+------------ 18372 | wait cmd | wait cmd (1 row)