CLOSE_WAIT again

CLOSE_WAIT is a truly maddening state: often a few thousand of them are enough to take a server down. And sure enough it happened again. An alert fired about too many CLOSE_WAIT connections, and a quick check showed they were all on the public-facing NIC.

From the TCP four-way teardown we know where CLOSE_WAIT comes from: when the client sends a FIN, the client moves to FIN_WAIT_1; the server ACKs the FIN and puts itself into CLOSE_WAIT, and it stays there until the application closes the socket, at which point the server sends its own FIN and moves to LAST_ACK. In other words, a pile-up of CLOSE_WAIT means the application is not getting around to closing connections the peer has already closed, and the underlying reasons for that can vary widely.

I also looked at the overall network counters with netstat -s. Since these are cumulative since boot, they have to be compared against a snapshot from about five minutes earlier, but nothing obviously abnormal stood out:

266813 invalid SYN cookies received
7511 resets received for embryonic SYN_RECV sockets
3096 packets pruned from receive queue because of socket buffer overrun
11 ICMP packets dropped because they were out-of-window
603754 TCP sockets finished time wait in fast timer
12 time wait sockets recycled by time stamp
7082 packets rejects in established connections because of timestamp
890380 delayed acks sent
615 delayed acks further delayed because of locked socket
Quick ack mode was activated 161020 times
10129 times the listen queue of a socket overflowed
10129 SYNs to LISTEN sockets ignored
74494 packets directly queued to recvmsg prequeue.
29399 packets directly received from prequeue
20966540 packets header predicted
6813794 acknowledgments not containing data received
12672614 predicted acknowledgments
1 times recovered from packet loss due to fast retransmit
4237 times recovered from packet loss due to SACK data
42 bad SACKs received
Detected reordering 311 times using FACK
Detected reordering 56228 times using SACK
Detected reordering 35401 times using time stamp
683 congestion windows fully recovered
56047 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 19806
24593 congestion windows recovered after partial ack
5911 TCP data loss events
TCPLostRetransmit: 569
2 timeouts after reno fast retransmit
3830 timeouts after SACK recovery
1556 timeouts in loss state
60642 fast retransmits
6040 forward retransmits
17845 retransmits in slow start
79714 other TCP timeouts
493 sack retransmits failed
15031 packets collapsed in receive queue due to low socket buffer
160518 DSACKs sent for old packets
43 DSACKs sent for out of order packets
87404 DSACKs received
754 DSACKs for out of order packets received
3733 connections reset due to unexpected data
242941 connections reset due to early user close
10649 connections aborted due to timeout
TCPSACKDiscard: 1094
TCPDSACKIgnoredOld: 1384
TCPDSACKIgnoredNoUndo: 35778
TCPSpuriousRTOs: 103
TCPSackShifted: 21957
TCPSackMerged: 27204
TCPSackShiftFallback: 207287
TCPBacklogDrop: 1
TCPOFOQueue: 30173
TCPOFOMerge: 45
TCPChallengeACK: 21202
TCPSYNChallenge: 21096
TCPFromZeroWindowAdv: 117
TCPToZeroWindowAdv: 119
TCPWantZeroWindowAdv: 8861
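
For reference, a minimal sketch of the commands behind this kind of check: counting connections per TCP state, listing the CLOSE_WAIT sockets on the public-facing address, and diffing two netstat -s snapshots taken five minutes apart. The IP 1.2.3.4 and the /tmp paths are placeholders, not from the original setup.

    # Connections per TCP state; CLOSE_WAIT should stand out here
    netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn

    # Only the CLOSE_WAIT sockets bound to the public-facing address (placeholder IP)
    ss -nt state close-wait 'src 1.2.3.4'

    # netstat -s counters are cumulative since boot, so diff two snapshots
    netstat -s > /tmp/netstat.before
    sleep 300
    netstat -s > /tmp/netstat.after
    diff /tmp/netstat.before /tmp/netstat.after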

The next theory was that the receive queues were backing up, because netstat showed many sockets with a non-zero Recv-Q. (Strictly speaking, Recv-Q is the socket receive queue, i.e. data the kernel has accepted but the application has not read yet, rather than the NIC queue, but either way it pointed at queueing.) So the first move was to tune the TCP queue-related kernel parameters:

net.ipv4.tcp_max_syn_backlog = 2000
net.core.netdev_max_backlog = 2000
net.ipv4.ip_local_port_range = 5000 65535
net.ipv4.tcp_max_tw_buckets = 5000
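
A minimal sketch of applying and sanity-checking these values, assuming they live in /etc/sysctl.conf; the one-off sysctl -w lines and the ss checks are additions for illustration, not part of the original setup.

    # Reload everything from /etc/sysctl.conf without rebooting
    sysctl -p

    # Or flip individual values for a quick test
    sysctl -w net.ipv4.tcp_max_syn_backlog=2000
    sysctl -w net.core.netdev_max_backlog=2000

    # For LISTEN sockets, Recv-Q is the current accept-queue depth and
    # Send-Q is the configured backlog limit
    ss -lnt

    # For established sockets, Recv-Q is bytes received but not yet read
    # by the application -- exactly what piles up alongside CLOSE_WAIT
    ss -nt | awk 'NR > 1 && $2 > 0'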

Watching for a while, this seemed to ease the symptoms but clearly did not fix the root cause. So I checked the concurrency: only around 20,000 connections. Back in the day a single machine could handle well over 200,000 concurrent connections, so why would nginx struggle now? I was baffled.

Looking at the logs again, all of these requests turned out to be HTTPS, so suspicion shifted to the TLS handshake. Checking CPU usage at the time confirmed it: utilization was at 96-97%, which means nginx simply could not keep up. The nginx error log was also full of "stack traceback" messages, which until then we had blamed on bugs in our Lua code.

Since HTTPS was the culprit, the next step was to tune nginx's TLS settings. On the server, the handshake cost is dominated by the asymmetric key-exchange work, so the cipher list below sticks to ECDHE-based suites with AES-GCM or ChaCha20-Poly1305 and drops legacy suites built on RC4 and MD5, which are broken and gain nothing in return. As for record size: nginx's default ssl_buffer_size is 16k, so data is coalesced into large TLS records that the client can only decrypt once each record has arrived in full, which hurts time to first byte; lowering it to 4k pushes the first bytes out much sooner.

    ssl_ciphers          ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256;

    ssl_prefer_server_ciphers  on;
    ssl_protocols        TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_buffer_size 4k;

    ssl_session_tickets      on;
    ssl_stapling      on;
    ssl_stapling_verify      on;
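
To double-check the TLS side from a client, openssl s_client can show which protocol and cipher get negotiated, whether sessions are reused, and whether OCSP stapling is actually served; example.com stands in for the real virtual host.

    # Negotiated protocol and cipher suite
    openssl s_client -connect example.com:443 -servername example.com < /dev/null 2>/dev/null \
        | grep -E 'Protocol|Cipher'

    # Session resumption: -reconnect reconnects several times; look for "Reused"
    openssl s_client -connect example.com:443 -servername example.com -reconnect < /dev/null 2>/dev/null \
        | grep -E '^(New|Reused)'

    # OCSP stapling: a stapled response shows up as "OCSP Response Data"
    openssl s_client -connect example.com:443 -servername example.com -status < /dev/null 2>/dev/null \
        | grep -i 'ocsp response'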