linux-kernelorg-stable/net/core
Eric Dumazet ecfea98b7d tcp: add net.ipv4.tcp_rcvbuf_low_rtt
This is a follow up of commit aa251c8463 ("tcp: fix too slow
tcp_rcvbuf_grow() action") which brought again the issue that I tried
to fix in commit 65c5287892 ("tcp: fix sk_rcvbuf overshoot")

We also recently increased tcp_rmem[2] to 32 MB in commit 572be9bf9d
("tcp: increase tcp_rmem[2] to 32 MB")

Idea of this patch is to not let tcp_rcvbuf_grow() grow sk->sk_rcvbuf
too fast for small RTT flows. If sk->sk_rcvbuf is too big, this can
force NIC driver to not recycle pages from their page pool, and also
can cause cache evictions for DDIO enabled cpus/NIC, as receivers
are usually slower than senders.

Add net.ipv4.tcp_rcvbuf_low_rtt sysctl, set by default to 1000 usec (1 ms)

If RTT if smaller than the sysctl value, use the RTT/tcp_rcvbuf_low_rtt
ratio to control sk_rcvbuf inflation.

Tested:

Pair of hosts with a 200Gbit IDPF NIC. Using netperf/netserver

Client initiates 8 TCP bulk flows, asking netserver to use CPU #10 only.

super_netperf 8 -H server -T,10 -l 30

On server, use perf -e tcp:tcp_rcvbuf_grow while test is running.

Before:

sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1
perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230
 1153.051201: tcp:tcp_rcvbuf_grow: time=398 rtt_us=382 copied=6905856 inq=180224 space=6115328 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
 1153.138752: tcp:tcp_rcvbuf_grow: time=446 rtt_us=413 copied=5529600 inq=180224 space=4505600 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil
 1153.361484: tcp:tcp_rcvbuf_grow: time=415 rtt_us=380 copied=7061504 inq=204800 space=6725632 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
 1153.457642: tcp:tcp_rcvbuf_grow: time=483 rtt_us=421 copied=5885952 inq=720896 space=4407296 ooo=0 scaling_ratio=240 rcvbuf=23763511 rcv_ssthresh=22223271 window_clamp=22278291 rcv_wnd=21430272 famil
 1153.466002: tcp:tcp_rcvbuf_grow: time=308 rtt_us=281 copied=3244032 inq=180224 space=2883584 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41713664 famil
 1153.747792: tcp:tcp_rcvbuf_grow: time=394 rtt_us=332 copied=4460544 inq=585728 space=3063808 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41992059 window_clamp=42050919 rcv_wnd=41373696 famil
 1154.260747: tcp:tcp_rcvbuf_grow: time=652 rtt_us=226 copied=10977280 inq=737280 space=9486336 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29197743 window_clamp=29217691 rcv_wnd=28368896 fami
 1154.375019: tcp:tcp_rcvbuf_grow: time=461 rtt_us=443 copied=7573504 inq=507904 space=6856704 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25288704 famil
 1154.463072: tcp:tcp_rcvbuf_grow: time=494 rtt_us=408 copied=7983104 inq=200704 space=7065600 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25579520 famil
 1154.474658: tcp:tcp_rcvbuf_grow: time=507 rtt_us=459 copied=5586944 inq=540672 space=4718592 ooo=0 scaling_ratio=240 rcvbuf=17852266 rcv_ssthresh=16692999 window_clamp=16736499 rcv_wnd=16056320 famil
 1154.584657: tcp:tcp_rcvbuf_grow: time=494 rtt_us=427 copied=8126464 inq=204800 space=7782400 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25878235 window_clamp=25937095 rcv_wnd=25600000 famil
 1154.702117: tcp:tcp_rcvbuf_grow: time=480 rtt_us=406 copied=5734400 inq=180224 space=5349376 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=21286912 famil
 1155.941595: tcp:tcp_rcvbuf_grow: time=717 rtt_us=670 copied=11042816 inq=3784704 space=7159808 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=14614528 fam
 1156.384735: tcp:tcp_rcvbuf_grow: time=529 rtt_us=473 copied=9011200 inq=180224 space=7258112 ooo=0 scaling_ratio=240 rcvbuf=19581357 rcv_ssthresh=18333222 window_clamp=18357522 rcv_wnd=18018304 famil
 1157.821676: tcp:tcp_rcvbuf_grow: time=529 rtt_us=272 copied=8224768 inq=602112 space=6545408 ooo=0 scaling_ratio=240 rcvbuf=67000000 rcv_ssthresh=62793576 window_clamp=62812500 rcv_wnd=62115840 famil
 1158.906379: tcp:tcp_rcvbuf_grow: time=710 rtt_us=445 copied=11845632 inq=540672 space=10240000 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29205935 window_clamp=29217691 rcv_wnd=28536832 fam
 1164.600160: tcp:tcp_rcvbuf_grow: time=841 rtt_us=430 copied=12976128 inq=1290240 space=11304960 ooo=0 scaling_ratio=240 rcvbuf=31165538 rcv_ssthresh=29212591 window_clamp=29217691 rcv_wnd=27856896 fa
 1165.163572: tcp:tcp_rcvbuf_grow: time=845 rtt_us=800 copied=12632064 inq=540672 space=7921664 ooo=0 scaling_ratio=240 rcvbuf=27666235 rcv_ssthresh=25912795 window_clamp=25937095 rcv_wnd=25260032 fami
 1165.653464: tcp:tcp_rcvbuf_grow: time=388 rtt_us=309 copied=4493312 inq=180224 space=3874816 ooo=0 scaling_ratio=240 rcvbuf=44854314 rcv_ssthresh=41995899 window_clamp=42050919 rcv_wnd=41713664 famil
 1166.651211: tcp:tcp_rcvbuf_grow: time=556 rtt_us=553 copied=6328320 inq=540672 space=5554176 ooo=0 scaling_ratio=240 rcvbuf=23068672 rcv_ssthresh=21571860 window_clamp=21626880 rcv_wnd=20946944 famil

After:

sysctl -w net.ipv4.tcp_rcvbuf_low_rtt=1000
perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script|tail -20|cut -c30-230
 1457.053149: tcp:tcp_rcvbuf_grow: time=128 rtt_us=24 copied=1441792 inq=40960 space=1269760 ooo=0 scaling_ratio=240 rcvbuf=2960741 rcv_ssthresh=2605474 window_clamp=2775694 rcv_wnd=2568192 family=AF_I
 1458.000778: tcp:tcp_rcvbuf_grow: time=128 rtt_us=31 copied=1441792 inq=24576 space=1400832 ooo=0 scaling_ratio=240 rcvbuf=3060163 rcv_ssthresh=2810042 window_clamp=2868902 rcv_wnd=2674688 family=AF_I
 1458.088059: tcp:tcp_rcvbuf_grow: time=190 rtt_us=110 copied=3227648 inq=385024 space=2781184 ooo=0 scaling_ratio=240 rcvbuf=6728240 rcv_ssthresh=6252705 window_clamp=6307725 rcv_wnd=5799936 family=AF
 1458.148549: tcp:tcp_rcvbuf_grow: time=232 rtt_us=129 copied=3956736 inq=237568 space=2842624 ooo=0 scaling_ratio=240 rcvbuf=6731333 rcv_ssthresh=6252705 window_clamp=6310624 rcv_wnd=5918720 family=AF
 1458.466861: tcp:tcp_rcvbuf_grow: time=193 rtt_us=83 copied=2949120 inq=180224 space=2457600 ooo=0 scaling_ratio=240 rcvbuf=5751438 rcv_ssthresh=5357689 window_clamp=5391973 rcv_wnd=5054464 family=AF_
 1458.775476: tcp:tcp_rcvbuf_grow: time=257 rtt_us=127 copied=4304896 inq=352256 space=3346432 ooo=0 scaling_ratio=240 rcvbuf=8067131 rcv_ssthresh=7523275 window_clamp=7562935 rcv_wnd=7061504 family=AF
 1458.776631: tcp:tcp_rcvbuf_grow: time=200 rtt_us=96 copied=3260416 inq=143360 space=2768896 ooo=0 scaling_ratio=240 rcvbuf=6397256 rcv_ssthresh=5938567 window_clamp=5997427 rcv_wnd=5828608 family=AF_
 1459.707973: tcp:tcp_rcvbuf_grow: time=215 rtt_us=96 copied=2506752 inq=163840 space=1388544 ooo=0 scaling_ratio=240 rcvbuf=3068867 rcv_ssthresh=2768282 window_clamp=2877062 rcv_wnd=2555904 family=AF_
 1460.246494: tcp:tcp_rcvbuf_grow: time=231 rtt_us=80 copied=3756032 inq=204800 space=3117056 ooo=0 scaling_ratio=240 rcvbuf=7288091 rcv_ssthresh=6773725 window_clamp=6832585 rcv_wnd=6471680 family=AF_
 1460.714596: tcp:tcp_rcvbuf_grow: time=270 rtt_us=110 copied=4714496 inq=311296 space=3719168 ooo=0 scaling_ratio=240 rcvbuf=8957739 rcv_ssthresh=8339020 window_clamp=8397880 rcv_wnd=7933952 family=AF
 1462.029977: tcp:tcp_rcvbuf_grow: time=101 rtt_us=19 copied=1105920 inq=40960 space=1036288 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=1986560 family=AF_I
 1462.802385: tcp:tcp_rcvbuf_grow: time=89 rtt_us=45 copied=1069056 inq=0 space=1064960 ooo=0 scaling_ratio=240 rcvbuf=2338970 rcv_ssthresh=2091684 window_clamp=2192784 rcv_wnd=2035712 family=AF_INET6
 1462.918648: tcp:tcp_rcvbuf_grow: time=105 rtt_us=33 copied=1441792 inq=180224 space=1069056 ooo=0 scaling_ratio=240 rcvbuf=2383282 rcv_ssthresh=2091684 window_clamp=2234326 rcv_wnd=1896448 family=AF_
 1463.222533: tcp:tcp_rcvbuf_grow: time=273 rtt_us=144 copied=4603904 inq=385024 space=3469312 ooo=0 scaling_ratio=240 rcvbuf=8422564 rcv_ssthresh=7891053 window_clamp=7896153 rcv_wnd=7409664 family=AF
 1466.519312: tcp:tcp_rcvbuf_grow: time=130 rtt_us=23 copied=1343488 inq=0 space=1261568 ooo=0 scaling_ratio=240 rcvbuf=2780158 rcv_ssthresh=2493778 window_clamp=2606398 rcv_wnd=2494464 family=AF_INET6
 1466.681003: tcp:tcp_rcvbuf_grow: time=128 rtt_us=21 copied=1441792 inq=12288 space=1343488 ooo=0 scaling_ratio=240 rcvbuf=2932027 rcv_ssthresh=2578555 window_clamp=2748775 rcv_wnd=2568192 family=AF_I
 1470.689959: tcp:tcp_rcvbuf_grow: time=255 rtt_us=122 copied=3932160 inq=204800 space=3551232 ooo=0 scaling_ratio=240 rcvbuf=8182038 rcv_ssthresh=7647384 window_clamp=7670660 rcv_wnd=7442432 family=AF
 1471.754154: tcp:tcp_rcvbuf_grow: time=188 rtt_us=95 copied=2138112 inq=577536 space=1429504 ooo=0 scaling_ratio=240 rcvbuf=3113650 rcv_ssthresh=2806426 window_clamp=2919046 rcv_wnd=2248704 family=AF_
 1476.813542: tcp:tcp_rcvbuf_grow: time=269 rtt_us=99 copied=3088384 inq=180224 space=2564096 ooo=0 scaling_ratio=240 rcvbuf=6219470 rcv_ssthresh=5771893 window_clamp=5830753 rcv_wnd=5509120 family=AF_
 1477.738309: tcp:tcp_rcvbuf_grow: time=166 rtt_us=54 copied=1777664 inq=180224 space=1417216 ooo=0 scaling_ratio=240 rcvbuf=3117118 rcv_ssthresh=2874958 window_clamp=2922298 rcv_wnd=2613248 family=AF_

We can see sk_rcvbuf values are much smaller, and that rtt_us (estimation of rtt
from a receiver point of view) is kept small, instead of being bloated.

No difference in throughput.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20251119084813.3684576-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-20 17:44:23 -08:00
..
Makefile queue_api: add support for fetching per queue DMA dev 2025-08-28 16:05:31 -07:00
bpf_sk_storage.c
datagram.c net: datagram: introduce datagram_poll_queue for custom receive queues 2025-10-23 15:46:04 +02:00
dev.c net: add prefetch() in skb_defer_free_flush() 2025-11-07 19:00:16 -08:00
dev.h net: Extend NAPI threaded polling to allow kthread based busy polling 2025-11-03 18:11:40 -08:00
dev_addr_lists.c net: s/dev_pre_changeaddr_notify/netif_pre_changeaddr_notify/ 2025-07-18 17:27:47 -07:00
dev_addr_lists_test.c
dev_api.c net: define an enum for the napi threaded state 2025-07-24 18:34:55 -07:00
dev_ioctl.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-11-20 09:13:26 -08:00
devmem.c net: devmem: refresh devmem TX dst in case of route invalidation 2025-10-29 19:23:21 -07:00
devmem.h net: devmem: Remove unused declaration net_devmem_bind_tx_release() 2025-11-04 17:29:19 -08:00
drop_monitor.c
dst.c net: dst: introduce dst->dev_rcu 2025-08-29 19:36:31 -07:00
dst_cache.c
failover.c
fib_notifier.c
fib_rules.c
filter.c bpf: Make bpf_skb_change_head helper metadata-safe 2025-11-10 10:52:32 -08:00
flow_dissector.c
flow_offload.c
gen_estimator.c net_sched: gen_estimator: fix est_timer() vs CONFIG_PREEMPT_RT=y 2025-08-29 19:04:20 -07:00
gen_stats.c
gro.c net: gro: clear skb_shinfo(skb)->hwtstamps in napi_reuse_skb() 2025-10-16 15:42:49 -07:00
gro_cells.c net: gro_cells: Reduce lock scope in gro_cell_poll 2025-11-05 17:41:29 -08:00
gso.c
hotdata.c net: increase skb_defer_max default to 128 2025-11-07 19:02:40 -08:00
hwbm.c
ieee8021q_helpers.c
link_watch.c net: replace use of system_unbound_wq with system_dfl_wq 2025-09-22 17:40:30 -07:00
lock_debug.c
lwt_bpf.c ipv4: Convert ->flowi4_tos to dscp_t. 2025-08-26 17:34:31 -07:00
lwtunnel.c
mp_dmabuf_devmem.h
neighbour.c neighbour: Convert rwlock of struct neigh_table to spinlock. 2025-10-24 17:57:20 -07:00
net-procfs.c net: use NUMA drop counters for softnet_data.dropped 2025-09-14 11:35:17 -07:00
net-sysfs.c Networking changes for 6.18. 2025-10-02 15:17:01 -07:00
net-sysfs.h
net-traces.c
net_namespace.c tcp: add net.ipv4.tcp_rcvbuf_low_rtt 2025-11-20 17:44:23 -08:00
net_test.c
netclassid_cgroup.c
netdev-genl-gen.c net: Extend NAPI threaded polling to allow kthread based busy polling 2025-11-03 18:11:40 -08:00
netdev-genl-gen.h
netdev-genl.c net: devmem: allow binding on rx queues with same DMA devices 2025-08-28 16:05:32 -07:00
netdev_queues.c queue_api: add support for fetching per queue DMA dev 2025-08-28 16:05:31 -07:00
netdev_rx_queue.c net: add helper to pre-check if PP for an Rx queue will be unreadable 2025-09-04 10:19:17 +02:00
netevent.c
netmem_priv.h netmem: replace __netmem_clear_lsb() with netmem_to_nmdesc() 2025-10-14 13:37:26 +02:00
netpoll.c net: netpoll: fix incorrect refcount handling causing incorrect cleanup 2025-11-10 18:34:44 -08:00
netprio_cgroup.c
of_net.c
page_pool.c page_pool: Fix PP_MAGIC_MASK to avoid crashing on some 32-bit arches 2025-10-06 12:14:04 -07:00
page_pool_priv.h
page_pool_user.c
pktgen.c net: pktgen: Use min()/min_t() to improve pktgen_finalize_skb() 2025-08-21 10:12:11 +02:00
ptp_classifier.c
request_sock.c tcp: Remove stale locking comment for TFO. 2025-09-23 18:21:36 -07:00
rtnetlink.c rtnetlink: honor RTEXT_FILTER_SKIP_STATS in IFLA_STATS 2025-11-04 16:07:37 -08:00
scm.c scm: use masked_user_access_begin() in put_cmsg() 2025-09-29 18:03:42 -07:00
secure_seq.c
selftests.c net: selftests: export packet creation helpers for driver use 2025-11-06 13:38:11 +01:00
skb_fault_injection.c
skbuff.c net: prefetch the next skb in napi_skb_cache_get() 2025-11-19 20:29:25 -08:00
skmsg.c net: replace use of system_wq with system_percpu_wq 2025-09-22 17:40:30 -07:00
sock.c net: Convert proto callbacks from sockaddr to sockaddr_unsized 2025-11-04 19:10:33 -08:00
sock_destructor.h
sock_diag.c net: WQ_PERCPU added to alloc_workqueue users 2025-09-22 17:40:30 -07:00
sock_map.c
sock_reuseport.c
stream.c net: stream: add description for sk_stream_write_space() 2025-07-18 16:57:21 -07:00
sysctl_net_core.c net: Introduce net.core.bypass_prot_mem sysctl. 2025-10-16 12:04:47 -07:00
timestamping.c
tso.c
utils.c
xdp.c net: xdp: pass full flags to xdp_update_skb_shared_info() 2025-09-11 12:00:20 +02:00