Commit Graph

79 Commits

Author SHA1 Message Date
Jakub Kicinski c26f1b98a6 tls: hw: rx: use return value of tls_device_decrypted() to carry status
CVE-2024-26583

[ Upstream commit 71471ca32505afa7c3f7f6a8268716e1ddb81cd4 ]

Instead of tls_device poking into internals of the message
return 1 from tls_device_decrypted() if the device handled
the decryption.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: f7fa16d49837 ("tls: decrement decrypt_pending if no async completion will be called")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(backported from commit 1abd49fa1ffb43ef31369cfcfdc9d0409db4ea58 linux-5.15.y)
[juergh: Adjusted context due to missing commit:
 8538d29cea ("net/tls: add tracing for device/offload events")]
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Jakub Kicinski 78bf9e8c79 tls: rx: don't store the decryption status in socket context
CVE-2024-26583

[ Upstream commit 7dc59c33d62c4520a119051d4486c214ef5caa23 ]

Similar justification to previous change, the information
about decryption status belongs in the skb.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: f7fa16d49837 ("tls: decrement decrypt_pending if no async completion will be called")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(backported from commit de0970d258efa793fd1236a362f0838c9c9d2384 linux-5.15.y)
[juergh: Minor adjustments due to missing commits:
 bc76e5bb12 ("net/tls: store decrypted on a single bit")
 5c5458ec9d ("net/tls: store async_capable on a single bit")]
 8538d29cea ("net/tls: add tracing for device/offload events")]
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Jakub Kicinski 4bd59b2628 net: tls: avoid discarding data on record close
CVE-2024-26583

[ Upstream commit 6b47808f223c70ff564f9b363446d2a5fa1e05b2 ]

TLS records end with a 16B tag. For TLS device offload we only
need to make space for this tag in the stream, the device will
generate and replace it with the actual calculated tag.

Long time ago the code would just re-reference the head frag
which mostly worked but was suboptimal because it prevented TCP
from combining the record into a single skb frag. I'm not sure
if it was correct as the first frag may be shorter than the tag.

The commit under fixes tried to replace that with using the page
frag and if the allocation failed rolling back the data, if record
was long enough. It achieves better fragment coalescing but is
also buggy.

We don't roll back the iterator, so unless we're at the end of
send we'll skip the data we designated as tag and start the
next record as if the rollback never happened.
There's also the possibility that the record was constructed
with MSG_MORE and the data came from a different syscall and
we already told the user space that we "got it".

Allocate a single dummy page and use it as fallback.

Found by code inspection, and proven by forcing allocation
failures.

Fixes: e7b159a48b ("net/tls: remove the record tail optimization")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(backported from commit e2d10f1de1fac24e6e41bed71301d7a95aea43c6 linux-5.15.y)
[juergh: Adjusted context due to missing commit:
 6942a284fb ("net/tls: make inline helpers protocol-aware")]
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Tariq Toukan 2af1a6e550 net/tls: Multi-threaded calls to TX tls_dev_del
CVE-2024-26583

[ Upstream commit 7adc91e0c93901a0eeeea10665d0feb48ffde2d4 ]

Multiple TLS device-offloaded contexts can be added in parallel via
concurrent calls to .tls_dev_add, while calls to .tls_dev_del are
sequential in tls_device_gc_task.

This is not a sustainable behavior. This creates a rate gap between add
and del operations (addition rate outperforms the deletion rate).  When
running for enough time, the TLS device resources could get exhausted,
failing to offload new connections.

Replace the single-threaded garbage collector work with a per-context
alternative, so they can be handled on several cores in parallel. Use
a new dedicated destruct workqueue for this.

Tested with mlx5 device:
Before: 22141 add/sec,   103 del/sec
After:  11684 add/sec, 11684 del/sec

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 6b47808f223c ("net: tls: avoid discarding data on record close")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 9a15ca893909e1b87d975d0726b79caf5b2f8830 linux-5.15.y)
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Tariq Toukan cf15f60d75 net/tls: Perform immediate device ctx cleanup when possible
CVE-2024-26583

[ Upstream commit 113671b255ee3b9f5585a6d496ef0e675e698698 ]

TLS context destructor can be run in atomic context. Cleanup operations
for device-offloaded contexts could require access and interaction with
the device callbacks, which might sleep. Hence, the cleanup of such
contexts must be deferred and completed inside an async work.

For all others, this is not necessary, as cleanup is atomic. Invoke
cleanup immediately for them, avoiding queueing redundant gc work.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 6b47808f223c ("net: tls: avoid discarding data on record close")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 2d93157b7e2dac9acf3ee4e52189b4c79d1ac77c linux-5.15.y)
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Jakub Kicinski 5e68421cd8 net/tls: pass context to tls_device_decrypted()
CVE-2024-26583

Avoid unnecessary pointer chasing and calculations, callers already
have most of the state tls_device_decrypted() needs.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit 4de30a8d58)
[juergh: Adjusted context.]
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Maxim Mikityanskiy 8f560c7185 net/tls: Remove the context from the list in tls_device_down
CVE-2024-26583

tls_device_down takes a reference on all contexts it's going to move to
the degraded state (software fallback). If sk_destruct runs afterwards,
it can reduce the reference counter back to 1 and return early without
destroying the context. Then tls_device_down will release the reference
it took and call tls_device_free_ctx. However, the context will still
stay in tls_device_down_list forever. The list will contain an item,
memory for which is released, making a memory corruption possible.

Fix the above bug by properly removing the context from all lists before
any call to tls_device_free_ctx.

Fixes: 3740651bf7e2 ("tls: Fix context leak on tls_device_down")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f6336724a4d4220c89a4ec38bca84b03b178b1a3)
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Tariq Toukan 726e00ea41 net/tls: Check for errors in tls_device_init
CVE-2024-26583

Add missing error checks in tls_device_init.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220714070754.1428-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(backported from commit 3d8c51b25a235e283e37750943bbf356ef187230)
[juergh: Declare err, drop unregister_pernet_subsys() due to missing commit:
 d26b698dd3 ("net/tls: add skeleton of MIB statistics").
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Maxim Mikityanskiy 525ac5b33c tls: Fix context leak on tls_device_down
CVE-2024-26583

The commit cited below claims to fix a use-after-free condition after
tls_device_down. Apparently, the description wasn't fully accurate. The
context stayed alive, but ctx->netdev became NULL, and the offload was
torn down without a proper fallback, so a bug was present, but a
different kind of bug.

Due to misunderstanding of the issue, the original patch dropped the
refcount_dec_and_test line for the context to avoid the alleged
premature deallocation. That line has to be restored, because it matches
the refcount_inc_not_zero from the same function, otherwise the contexts
that survived tls_device_down are leaked.

This patch fixes the described issue by restoring refcount_dec_and_test.
After this change, there is no leak anymore, and the fallback to
software kTLS still works.

Fixes: c55dcdd435 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 3740651bf7e200109dd42d5b2fb22226b26f960a)
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Maxim Mikityanskiy 1f2290853a net/tls: Fix use-after-free after the TLS device goes down and up
CVE-2024-26583

When a netdev with active TLS offload goes down, tls_device_down is
called to stop the offload and tear down the TLS context. However, the
socket stays alive, and it still points to the TLS context, which is now
deallocated. If a netdev goes up, while the connection is still active,
and the data flow resumes after a number of TCP retransmissions, it will
lead to a use-after-free of the TLS context.

This commit addresses this bug by keeping the context alive until its
normal destruction, and implements the necessary fallbacks, so that the
connection can resume in software (non-offloaded) kTLS mode.

On the TX side tls_sw_fallback is used to encrypt all packets. The RX
side already has all the necessary fallbacks, because receiving
non-decrypted packets is supported. The thing needed on the RX side is
to block resync requests, which are normally produced after receiving
non-decrypted packets.

The necessary synchronization is implemented for a graceful teardown:
first the fallbacks are deployed, then the driver resources are released
(it used to be possible to have a tls_dev_resync after tls_dev_del).

A new flag called TLS_RX_DEV_DEGRADED is added to indicate the fallback
mode. It's used to skip the RX resync logic completely, as it becomes
useless, and some objects may be released (for example, resync_async,
which is allocated and freed by the driver).

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit c55dcdd435)
[juergh: Adjusted context due to missing commit:
 d5bee7374b ("net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE")]
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Maxim Mikityanskiy 997f98f9e0 net/tls: Replace TLS_RX_SYNC_RUNNING with RCU
CVE-2024-26583

RCU synchronization is guaranteed to finish in finite time, unlike a
busy loop that polls a flag. This patch is a preparation for the bugfix
in the next patch, where the same synchronize_net() call will also be
used to sync with the TX datapath.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit 05fc8b6cbd)
[juergh: Adjusted context due to missing commit:
 8538d29cea ("net/tls: add tracing for device/offload events")]
Signed-off-by: Juerg Haefliger <juerg.haefliger@canonical.com>
Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
Acked-by: Philip Cox <philip.cox@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2024-07-05 10:51:49 +02:00
Tariq Toukan 60aefa3f3a net/tls: Fix race in TLS device down flow
BugLink: https://bugs.launchpad.net/bugs/1988225

[ Upstream commit f08d8c1bb97c48f24a82afaa2fd8c140f8d3da8b ]

Socket destruction flow and tls_device_down function sync against each
other using tls_device_lock and the context refcount, to guarantee the
device resources are freed via tls_dev_del() by the end of
tls_device_down.

In the following unfortunate flow, this won't happen:
- refcount is decreased to zero in tls_device_sk_destruct.
- tls_device_down starts, skips the context as refcount is zero, going
  all the way until it flushes the gc work, and returns without freeing
  the device resources.
- only then, tls_device_queue_ctx_destruction is called, queues the gc
  work and frees the context's device resources.

Solve it by decreasing the refcount in the socket's destruction flow
under the tls_device_lock, for perfect synchronization.  This does not
slow down the common likely destructor flow, in which both the refcount
is decreased and the spinlock is acquired, anyway.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2022-09-16 10:59:45 +02:00
Maxim Mikityanskiy f6554d063b tls: Skip tls_append_frag on zero copy size
BugLink: https://bugs.launchpad.net/bugs/1979014

[ Upstream commit a0df71948e9548de819a6f1da68f5f1742258a52 ]

Calling tls_append_frag when max_open_record_len == record->len might
add an empty fragment to the TLS record if the call happens to be on the
page boundary. Normally tls_append_frag coalesces the zero-sized
fragment to the previous one, but not if it's on page boundary.

If a resync happens then, the mlx5 driver posts dump WQEs in
tx_post_resync_dump, and the empty fragment may become a data segment
with byte_count == 0, which will confuse the NIC and lead to a CQE
error.

This commit fixes the described issue by skipping tls_append_frag on
zero size to avoid adding empty fragments. The fix is not in the driver,
because an empty fragment is hardly the desired behavior.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220426154949.159055-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
2022-06-22 14:51:13 +02:00
Maxim Mikityanskiy adc65774e7 net/tls: Protect from calling tls_dev_del for TLS RX twice
BugLink: https://bugs.launchpad.net/bugs/1908564

[ Upstream commit 025cc2fb6a ]

tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLX TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a
time frame when tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.

This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del the second time.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
2021-01-20 14:24:50 +01:00
Rohit Maheshwari 4119bf7c10 net/tls: sendfile fails with ktls offload
BugLink: https://bugs.launchpad.net/bugs/1902115

[ Upstream commit ea1dd3e9d0 ]

At first when sendpage gets called, if there is more data, 'more' in
tls_push_data() gets set which later sets pending_open_record_frags, but
when there is no more data in file left, and last time tls_push_data()
gets called, pending_open_record_frags doesn't get reset. And later when
2 bytes of encrypted alert comes as sendmsg, it first checks for
pending_open_record_frags, and since this is set, it creates a record with
0 data bytes to encrypt, meaning record length is prepend_size + tag_size
only, which causes problem.
 We should set/reset pending_open_record_frags based on more bit.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
2020-11-09 14:48:30 +01:00
Ira Weiny 146ea3db7e net/tls: Fix kmap usage
BugLink: https://bugs.launchpad.net/bugs/1892417

[ Upstream commit b06c19d9f8 ]

When MSG_OOB is specified to tls_device_sendpage() the mapped page is
never unmapped.

Hold off mapping the page until after the flags are checked and the page
is actually needed.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Kelsey Skunberg <kelsey.skunberg@canonical.com>
2020-09-04 16:29:26 -03:00
Rohit Maheshwari a568ecadb3 net/tls: Fix to avoid gettig invalid tls record
BugLink: https://bugs.launchpad.net/bugs/1866333

[ Upstream commit 06f5201c63 ]

Current code doesn't check if tcp sequence number is starting from (/after)
1st record's start sequnce number. It only checks if seq number is before
1st record's end sequnce number. This problem will always be a possibility
in re-transmit case. If a record which belongs to a requested seq number is
already deleted, tls_get_record will start looking into list and as per the
check it will look if seq number is before the end seq of 1st record, which
will always be true and will return 1st record always, it should in fact
return NULL.
As part of the fix, start looking each record only if the sequence number
lies in the list else return NULL.
There is one more check added, driver look for the start marker record to
handle tcp packets which are before the tls offload start sequence number,
hence return 1st record if the record is tls start marker and seq number is
before the 1st record's starting sequence number.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
2020-03-06 12:38:05 +01:00
Valentin Vidic a8088d610d net/tls: Fix return values to avoid ENOTSUPP
BugLink: https://bugs.launchpad.net/bugs/1858426

[ Upstream commit 4a5cdc604b ]

ENOTSUPP is not available in userspace, for example:

  setsockopt failed, 524, Unknown error 524

Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
2020-01-06 07:42:34 -06:00
Jakub Kicinski 79ffe6087e net/tls: add a TX lock
TLS TX needs to release and re-acquire the socket lock if send buffer
fills up.

TLS SW TX path currently depends on only allowing one thread to enter
the function by the abuse of sk_write_pending. If another writer is
already waiting for memory no new ones are allowed in.

This has two problems:
 - writers don't wake other threads up when they leave the kernel;
   meaning that this scheme works for single extra thread (second
   application thread or delayed work) because memory becoming
   available will send a wake up request, but as Mallesham and
   Pooja report with larger number of threads it leads to threads
   being put to sleep indefinitely;
 - the delayed work does not get _scheduled_ but it may _run_ when
   other writers are present leading to crashes as writers don't
   expect state to change under their feet (same records get pushed
   and freed multiple times); it's hard to reliably bail from the
   work, however, because the mere presence of a writer does not
   guarantee that the writer will push pending records before exiting.

Ensuring wakeups always happen will make the code basically open
code a mutex. Just use a mutex.

The TLS HW TX path does not have any locking (not even the
sk_write_pending hack), yet it uses a per-socket sg_tx_data
array to push records.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Mallesham  Jatharakonda <mallesh537@gmail.com>
Reported-by: Pooja Trivedi <poojatrivedi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:33:32 -08:00
Jakub Kicinski 02b1fa07bb net/tls: don't pay attention to sk_write_pending when pushing partial records
sk_write_pending being not zero does not guarantee that partial
record will be pushed. If the thread waiting for memory times out
the pending record may get stuck.

In case of tls_device there is no path where parial record is
set and writer present in the first place. Partial record is
set only in tls_push_sg() and tls_push_sg() will return an
error immediately. All tls_device callers of tls_push_sg()
will return (and not wait for memory) if it failed.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:33:32 -08:00
Jakub Kicinski e681cc603a net/tls: align non temporal copy to cache lines
Unlike normal TCP code TLS has to touch the cache lines
it copies into to fill header info. On memory-heavy workloads
having non temporal stores and normal accesses targeting
the same cache line leads to significant overhead.

Measured 3% overhead running 3600 round robin connections
with additional memory heavy workload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 18:10:34 +02:00
Jakub Kicinski e7b159a48b net/tls: remove the record tail optimization
For TLS device offload the tag/message authentication code are
filled in by the device. The kernel merely reserves space for
them. Because device overwrites it, the contents of the tag make
do no matter. Current code tries to save space by reusing the
header as the tag. This, however, leads to an additional frag
being created and defeats buffer coalescing (which trickles
all the way down to the drivers).

Remove this optimization, and try to allocate the space for
the tag in the usual way, leave the memory uninitialized.
If memory allocation fails rewind the record pointer so that
we use the already copied user data as tag.

Note that the optimization was actually buggy, as the tag
for TLS 1.2 is 16 bytes, but header is just 13, so the reuse
may had looked past the end of the page..

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 18:10:34 +02:00
Jakub Kicinski d4774ac0d4 net/tls: use RCU for the adder to the offload record list
All modifications to TLS record list happen under the socket
lock. Since records form an ordered queue readers are only
concerned about elements being removed, additions can happen
concurrently.

Use RCU primitives to ensure the correct access types
(READ_ONCE/WRITE_ONCE).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 18:10:34 +02:00
Jakub Kicinski 7ccd451912 net/tls: unref frags in order
It's generally more cache friendly to walk arrays in order,
especially those which are likely not in cache.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 18:10:34 +02:00
Jakub Kicinski 6e3d02b670 net/tls: dedup the record cleanup
If retransmit record hint fall into the cleanup window we will
free it by just walking the list. No need to duplicate the code.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:49:49 +02:00
Jakub Kicinski 3544c98acd net/tls: narrow down the critical area of device_offload_lock
On setsockopt path we need to hold device_offload_lock from
the moment we check netdev is up until the context is fully
ready to be added to the tls_device_list.

No need to hold it around the get_netdev_for_sock().
Change the code and remove the confusing comment.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:49:49 +02:00
Jakub Kicinski 90962b4894 net/tls: don't jump to return
Reusing parts of error path for normal exit will make
next commit harder to read, untangle the two.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:49:49 +02:00
Jakub Kicinski 15a7dea750 net/tls: use RCU protection on icsk->icsk_ulp_data
We need to make sure context does not get freed while diag
code is interrogating it. Free struct tls_context with
kfree_rcu().

We add the __rcu annotation directly in icsk, and cast it
away in the datapath accessor. Presumably all ULPs will
do a similar thing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31 23:44:28 -07:00
David S. Miller 446bf64b61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge conflict of mlx5 resolved using instructions in merge
commit 9566e650bf.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 11:54:03 -07:00
Jakub Kicinski 414776621d net/tls: prevent skb_orphan() from leaking TLS plain text with offload
sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.

Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).

See commit 0608c69c9a ("bpf: sk_msg, sock{map|hash} redirect
through ULP") for justification why the internal flag is safe.
The only location which could leak the flag in is tcp_bpf_sendmsg(),
which is taken care of by clearing the previously unused bit.

v2:
 - remove superfluous decrypted mark copy (Willem);
 - remove the stale doc entry (Boris);
 - rely entirely on EOR marking to prevent coalescing (Boris);
 - use an internal sendpages flag instead of marking the socket
   (Boris).
v3 (Willem):
 - reorganize the can_skb_orphan_partial() condition;
 - fix the flag leak-in through tcp_bpf_sendmsg.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 22:39:35 -07:00
Jonathan Lemon b54c9d5bd6 net: Use skb_frag_off accessors
Use accessor functions for skb fragment's page_offset instead
of direct references, in preparation for bvec conversion.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30 14:21:32 -07:00
Matthew Wilcox (Oracle) d8e18a516f net: Use skb accessors in network core
In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22 20:47:56 -07:00
Jakub Kicinski ab232e61e7 net/tls: add missing prot info init
Turns out TLS_TX in HW offload mode does not initialize tls_prot_info.
Since commit 9cd81988cc ("net/tls: use version from prot") we actually
use this field on the datapath.  Luckily we always compare it to TLS 1.3,
and assume 1.2 otherwise. So since zero is not equal to 1.3, everything
worked fine.

Fixes: 9cd81988cc ("net/tls: use version from prot")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:21:09 -07:00
Dirk van der Merwe b5d9a834f4 net/tls: don't clear TX resync flag on error
Introduce a return code for the tls_dev_resync callback.

When the driver TX resync fails, kernel can retry the resync again
until it succeeds.  This prevents drivers from attempting to offload
TLS packets if the connection is known to be out of sync.

We don't worry about the RX resync since they will be retried naturally
as more encrypted records get received.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:21:09 -07:00
David S. Miller af144a9834 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two cases of overlapping changes, nothing fancy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:48:57 -07:00
Jakub Kicinski acd3e96d53 net/tls: make sure offload also gets the keys wiped
Commit 86029d10af ("tls: zero the crypto information from tls_context
before freeing") added memzero_explicit() calls to clear the key material
before freeing struct tls_context, but it missed tls_device.c has its
own way of freeing this structure. Replace the missing free.

Fixes: 86029d10af ("tls: zero the crypto information from tls_context before freeing")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 19:22:36 -07:00
Jakub Kicinski 618bac4593 net/tls: reject offload of TLS 1.3
Neither drivers nor the tls offload code currently supports TLS
version 1.3. Check the TLS version when installing connection
state. TLS 1.3 will just fallback to the kernel crypto for now.

Fixes: 130b392c6c ("net: tls: Add tls 1.3 support")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 19:21:31 -07:00
Jakub Kicinski 5018007409 net/tls: add kernel-driven resync mechanism for TX
TLS offload drivers keep track of TCP seq numbers to make sure
the packets are fed into the HW in order.

When packets get dropped on the way through the stack, the driver
will get out of sync and have to use fallback encryption, but unless
TCP seq number is resynced it will never match the packets correctly
(or even worse - use incorrect record sequence number after TCP seq
wraps).

Existing drivers (mlx5) feed the entire record on every out-of-order
event, allowing FW/HW to always be in sync.

This patch adds an alternative, more akin to the RX resync.  When
driver sees a frame which is past its expected sequence number the
stream must have gotten out of order (if the sequence number is
smaller than expected its likely a retransmission which doesn't
require resync).  Driver will ask the stack to perform TX sync
before it submits the next full record, and fall back to software
crypto until stack has performed the sync.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:27 -07:00
Jakub Kicinski eeb2efaf36 net/tls: generalize the resync callback
Currently only RX direction is ever resynced, however, TX may
also get out of sequence if packets get dropped on the way to
the driver.  Rename the resync callback and add a direction
parameter.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:27 -07:00
Jakub Kicinski f953d33ba1 net/tls: add kernel-driven TLS RX resync
TLS offload device may lose sync with the TCP stream if packets
arrive out of order.  Drivers can currently request a resync at
a specific TCP sequence number.  When a record is found starting
at that sequence number kernel will inform the device of the
corresponding record number.

This requires the device to constantly scan the stream for a
known pattern (constant bytes of the header) after sync is lost.

This patch adds an alternative approach which is entirely under
the control of the kernel.  Kernel tracks records it had to fully
decrypt, even though TLS socket is in TLS_HW mode.  If multiple
records did not have any decrypted parts - it's a pretty strong
indication that the device is out of sync.

We choose the min number of fully encrypted records to be 2,
which should hopefully be more than will get retransmitted at
a time.

After kernel decides the device is out of sync it schedules a
resync request.  If the TCP socket is empty the resync gets
performed immediately.  If socket is not empty we leave the
record parser to resync when next record comes.

Before resync in message parser we peek at the TCP socket and
don't attempt the sync if the socket already has some of the
next record queued.

On resync failure (encrypted data continues to flow in) we
retry with exponential backoff, up to once every 128 records
(with a 16k record thats at most once every 2M of data).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski fe58a5a02c net/tls: rename handle_device_resync()
handle_device_resync() doesn't describe the function very well.
The function checks if resync should be issued upon parsing of
a new record.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski 89fec474fa net/tls: pass record number as a byte array
TLS offload code casts record number to a u64.  The buffer
should be aligned to 8 bytes, but its actually a __be64, and
the rest of the TLS code treats it as big int.  Make the
offload callbacks take a byte array, drivers can make the
choice to do the ugly cast if they want to.

Prepare for copying the record number onto the stack by
defining a constant for max size of the byte array.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski 4967373959 net/tls: simplify seq calculation in handle_device_resync()
We subtract "TLS_HEADER_SIZE - 1" from req_seq, then if they
match we add the same constant to seq.  Just add it to seq,
and we don't have to touch req_seq.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
David S. Miller a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
Jakub Kicinski fb0f886fa2 net/tls: don't pass version to tls_advance_record_sn()
All callers pass prot->version as the last parameter
of tls_advance_record_sn(), yet tls_advance_record_sn()
itself needs a pointer to prot.  Pass prot from callers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski 9cd81988cc net/tls: use version from prot
ctx->prot holds the same information as per-direction contexts.
Almost all code gets TLS version from this structure, convert
the last two stragglers, this way we can improve the cache
utilization by moving the per-direction data into cold cache lines.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski 1fe275d434 net/tls: don't re-check msg decrypted status in tls_device_decrypted()
tls_device_decrypted() is only called from decrypt_skb_update(),
when ctx->decrypted == false, there is no need to re-check the bit.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski aeb11ff0dc net/tls: check return values from skb_copy_bits() and skb_store_bits()
In light of recent bugs, we should make a better effort of
checking return values.  In theory none of the functions should
fail today.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski e52972c11d net/tls: replace the sleeping lock around RX resync with a bit lock
Commit 38030d7cb7 ("net/tls: avoid NULL-deref on resync during device removal")
tried to fix a potential NULL-dereference by taking the
context rwsem.  Unfortunately the RX resync may get called
from soft IRQ, so we can't use the rwsem to protect from
the device disappearing.  Because we are guaranteed there
can be only one resync at a time (it's called from strparser)
use a bit to indicate resync is busy and make device
removal wait for the bit to get cleared.

Note that there is a leftover "flags" field in struct
tls_context already.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 13:34:37 -07:00
Jakub Kicinski 27393f8c6e Revert "net/tls: avoid NULL-deref on resync during device removal"
This reverts commit 38030d7cb7.
Unfortunately the RX resync may get called from soft IRQ,
so we can't take the rwsem to protect from the device
disappearing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 13:34:37 -07:00