summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* wireguard: allowedips: don't corrupt stack when detecting overflowjd/new-archsJason A. Donenfeld2022-07-302-6/+9
| | | | | | | | | | | | | | | | | | | | In case push_rcu() and related functions are buggy, there's a WARN_ON(len >= 128), which the selftest tries to hit by being tricky. In case it is hit, we shouldn't corrupt the kernel's stack, though; otherwise it may be hard to even receive the report that it's buggy. So conditionalize the stack write based on that WARN_ON()'s return value. Note that this never *actually* happens anyway. The WARN_ON() in the first place is bounded by IS_ENABLED(DEBUG), and isn't expected to ever actually hit. This is just a debugging sanity check. Additionally, hoist the constant 128 into a named enum, MAX_ALLOWEDIPS_BITS, so that it's clear why this value is chosen. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/CAHk-=wjJZGA6w_DxA+k7Ejbqsq+uGK==koPai3sqdsfJqemvag@mail.gmail.com/ Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* wireguard: ratelimiter: use hrtimer in selftestJason A. Donenfeld2022-07-301-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using msleep() is problematic because it's compared against ratelimiter.c's ktime_get_coarse_boottime_ns(), which means on systems with slow jiffies (such as UML's forced HZ=100), the result is inaccurate. So switch to using schedule_hrtimeout(). However, hrtimer gives us access only to the traditional posix timers, and none of the _COARSE variants. So now, rather than being too imprecise like jiffies, it's too precise. One solution would be to give it a large "range" value, but this will still fire early on a loaded system. A better solution is to align the timeout to the actual coarse timer, and then round up to the nearest tick, plus change. So add the timeout to the current coarse time, and then schedule_hrtimer() until the absolute computed time. This should hopefully reduce flakes in CI as well. Note that we keep the retry loop in case the entire function is running behind, because the test could still be scheduled out, by either the kernel or by the hypervisor's kernel, in which case restarting the test and hoping to not be scheduled out still helps. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Merge branch 'wireguard-patches-for-5-18-rc4'Jakub Kicinski2022-04-221-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard patches for 5.18-rc4 Here are two small wireguard fixes for 5.18-rc4: 1) We enable ACPI in the QEMU test harness, so that multiple CPUs are actually used on x86 for testing for races. 2) Sending skbs with metadata dsts attached resulted in a null pointer dereference, triggerable from executing eBPF programs. The fix is a oneliner, changing a skb_dst() null check into a skb_valid_dst() boolean check. ==================== Link: https://lore.kernel.org/r/20220421134805.279118-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: device: check for metadata_dst with skb_valid_dst()Nikolay Aleksandrov2022-04-221-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we try to transmit an skb with md_dst attached through wireguard we hit a null pointer dereference in wg_xmit() due to the use of dst_mtu() which calls into dst_blackhole_mtu() which in turn tries to dereference dst->dev. Since wireguard doesn't use md_dsts we should use skb_valid_dst(), which checks for DST_METADATA flag, and if it's set, then falls back to wireguard's device mtu. That gives us the best chance of transmitting the packet; otherwise if the blackhole netdev is used we'd get ETH_MIN_MTU. [ 263.693506] BUG: kernel NULL pointer dereference, address: 00000000000000e0 [ 263.693908] #PF: supervisor read access in kernel mode [ 263.694174] #PF: error_code(0x0000) - not-present page [ 263.694424] PGD 0 P4D 0 [ 263.694653] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 263.694876] CPU: 5 PID: 951 Comm: mausezahn Kdump: loaded Not tainted 5.18.0-rc1+ #522 [ 263.695190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [ 263.695529] RIP: 0010:dst_blackhole_mtu+0x17/0x20 [ 263.695770] Code: 00 00 00 0f 1f 44 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47 10 48 83 e0 fc 8b 40 04 85 c0 75 09 48 8b 07 <8b> 80 e0 00 00 00 c3 66 90 0f 1f 44 00 00 48 89 d7 be 01 00 00 00 [ 263.696339] RSP: 0018:ffffa4a4422fbb28 EFLAGS: 00010246 [ 263.696600] RAX: 0000000000000000 RBX: ffff8ac9c3553000 RCX: 0000000000000000 [ 263.696891] RDX: 0000000000000401 RSI: 00000000fffffe01 RDI: ffffc4a43fb48900 [ 263.697178] RBP: ffffa4a4422fbb90 R08: ffffffff9622635e R09: 0000000000000002 [ 263.697469] R10: ffffffff9b69a6c0 R11: ffffa4a4422fbd0c R12: ffff8ac9d18b1a00 [ 263.697766] R13: ffff8ac9d0ce1840 R14: ffff8ac9d18b1a00 R15: ffff8ac9c3553000 [ 263.698054] FS: 00007f3704c337c0(0000) GS:ffff8acaebf40000(0000) knlGS:0000000000000000 [ 263.698470] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 263.698826] CR2: 00000000000000e0 CR3: 0000000117a5c000 CR4: 00000000000006e0 [ 263.699214] Call Trace: [ 263.699505] <TASK> [ 263.699759] wg_xmit+0x411/0x450 [ 263.700059] ? bpf_skb_set_tunnel_key+0x46/0x2d0 [ 263.700382] ? dev_queue_xmit_nit+0x31/0x2b0 [ 263.700719] dev_hard_start_xmit+0xd9/0x220 [ 263.701047] __dev_queue_xmit+0x8b9/0xd30 [ 263.701344] __bpf_redirect+0x1a4/0x380 [ 263.701664] __dev_queue_xmit+0x83b/0xd30 [ 263.701961] ? packet_parse_headers+0xb4/0xf0 [ 263.702275] packet_sendmsg+0x9a8/0x16a0 [ 263.702596] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 263.702933] sock_sendmsg+0x5e/0x60 [ 263.703239] __sys_sendto+0xf0/0x160 [ 263.703549] __x64_sys_sendto+0x20/0x30 [ 263.703853] do_syscall_64+0x3b/0x90 [ 263.704162] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 263.704494] RIP: 0033:0x7f3704d50506 [ 263.704789] Code: 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89 [ 263.705652] RSP: 002b:00007ffe954b0b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 263.706141] RAX: ffffffffffffffda RBX: 0000558bb259b490 RCX: 00007f3704d50506 [ 263.706544] RDX: 000000000000004a RSI: 0000558bb259b7b2 RDI: 0000000000000003 [ 263.706952] RBP: 0000000000000000 R08: 00007ffe954b0b90 R09: 0000000000000014 [ 263.707339] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe954b0b90 [ 263.707735] R13: 000000000000004a R14: 0000558bb259b7b2 R15: 0000000000000001 [ 263.708132] </TASK> [ 263.708398] Modules linked in: bridge netconsole bonding [last unloaded: bridge] [ 263.708942] CR2: 00000000000000e0 Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Link: https://github.com/cilium/cilium/issues/19428 Reported-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge branch 'wireguard-patches-for-5-18-rc1'Jakub Kicinski2022-03-302-3/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard patches for 5.18-rc1 Here's a small set of fixes for the next net push: 1) Pipacs reported a CFI violation in a cleanup routine, which he triggered using grsec's RAP. I haven't seen reports of this yet from the Android/CFI world yet, but it's only a matter of time there. 2) A small rng cleanup to the self test harness to make it initialize faster on 5.18. 3) Wang reported and fixed a skb leak for CONFIG_IPV6=n. 4) After Wang's fix for the direct leak, I investigated how that code path even could be hit, and found that the netlink layer still handles IPv6 endpoints, when it probably shouldn't. ==================== Link: https://lore.kernel.org/r/20220330013127.426620-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: socket: ignore v6 endpoints when ipv6 is disabledJason A. Donenfeld2022-03-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit fixed a memory leak on the send path in the event that IPv6 is disabled at compile time, but how did a packet even arrive there to begin with? It turns out we have previously allowed IPv6 endpoints even when IPv6 support is disabled at compile time. This is awkward and inconsistent. Instead, let's just ignore all things IPv6, the same way we do other malformed endpoints, in the case where IPv6 is disabled. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: socket: free skb in send6 when ipv6 is disabledWang Hai2022-03-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got a memory leak report: unreferenced object 0xffff8881191fc040 (size 232): comm "kworker/u17:0", pid 23193, jiffies 4295238848 (age 3464.870s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff814c3ef4>] slab_post_alloc_hook+0x84/0x3b0 [<ffffffff814c8977>] kmem_cache_alloc_node+0x167/0x340 [<ffffffff832974fb>] __alloc_skb+0x1db/0x200 [<ffffffff82612b5d>] wg_socket_send_buffer_to_peer+0x3d/0xc0 [<ffffffff8260e94a>] wg_packet_send_handshake_initiation+0xfa/0x110 [<ffffffff8260ec81>] wg_packet_handshake_send_worker+0x21/0x30 [<ffffffff8119c558>] process_one_work+0x2e8/0x770 [<ffffffff8119ca2a>] worker_thread+0x4a/0x4b0 [<ffffffff811a88e0>] kthread+0x120/0x160 [<ffffffff8100242f>] ret_from_fork+0x1f/0x30 In function wg_socket_send_buffer_as_reply_to_skb() or wg_socket_send_ buffer_to_peer(), the semantics of send6() is required to free skb. But when CONFIG_IPV6 is disable, kfree_skb() is missing. This patch adds it to fix this bug. Signed-off-by: Wang Hai <wanghai38@huawei.com> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: queueing: use CFI-safe ptr_ring cleanup functionJason A. Donenfeld2022-03-301-1/+2
|/ | | | | | | | | | | | | | We make too nuanced use of ptr_ring to entirely move to the skb_array wrappers, but we at least should avoid the naughty function pointer cast when cleaning up skbs. Otherwise RAP/CFI will honk at us. This patch uses the __skb_array_destroy_skb wrapper for the cleanup, rather than directly providing kfree_skb, which is what other drivers in the same situation do too. Reported-by: PaX Team <pageexec@freemail.hu> Fixes: 0188f76c8d48 ("wireguard: receive: use ring buffer for incoming handshakes") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* wireguard: device: clear keys on VM forkJason A. Donenfeld2022-03-121-11/+27
| | | | | | | | | | | | | | | | | | | When a virtual machine forks, it's important that WireGuard clear existing sessions so that different plaintexts are not transmitted using the same key+nonce, which can result in catastrophic cryptographic failure. To accomplish this, we simply hook into the newly added vmfork notifier. As a bonus, it turns out that, like the vmfork registration function, the PM registration function is stubbed out when CONFIG_PM_SLEEP is not set, so we can actually just remove the maze of ifdefs, which makes it really quite clean to support both notifiers at once. Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* lib/crypto: blake2s: move hmac construction into wireguardJason A. Donenfeld2022-01-181-6/+39
| | | | | | | | | | | | | | | Basically nobody should use blake2s in an HMAC construction; it already has a keyed variant. But unfortunately for historical reasons, Noise, used by WireGuard, uses HKDF quite strictly, which means we have to use this. Because this really shouldn't be used by others, this commit moves it into wireguard's noise.c locally, so that kernels that aren't using WireGuard don't get this superfluous code baked in. On m68k systems, this shaves off ~314 bytes. Cc: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Merge tag '5.17-net-next' of ↵Linus Torvalds2022-01-101-3/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Defer freeing TCP skbs to the BH handler, whenever possible, or at least perform the freeing outside of the socket lock section to decrease cross-CPU allocator work and improve latency. - Add netdevice refcount tracking to locate sources of netdevice and net namespace refcount leaks. - Make Tx watchdog less intrusive - avoid pausing Tx and restarting all queues from a single CPU removing latency spikes. - Various small optimizations throughout the stack from Eric Dumazet. - Make netdev->dev_addr[] constant, force modifications to go via appropriate helpers to allow us to keep addresses in ordered data structures. - Replace unix_table_lock with per-hash locks, improving performance of bind() calls. - Extend skb drop tracepoint with a drop reason. - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW. BPF --- - New helpers: - bpf_find_vma(), find and inspect VMAs for profiling use cases - bpf_loop(), runtime-bounded loop helper trading some execution time for much faster (if at all converging) verification - bpf_strncmp(), improve performance, avoid compiler flakiness - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt() for tracing programs, all inlined by the verifier - Support BPF relocations (CO-RE) in the kernel loader. - Further the support for BTF_TYPE_TAG annotations. - Allow access to local storage in sleepable helpers. - Convert verifier argument types to a composable form with different attributes which can be shared across types (ro, maybe-null). - Prepare libbpf for upcoming v1.0 release by cleaning up APIs, creating new, extensible ones where missing and deprecating those to be removed. Protocols --------- - WiFi (mac80211/cfg80211): - notify user space about long "come back in N" AP responses, allow it to react to such temporary rejections - allow non-standard VHT MCS 10/11 rates - use coarse time in airtime fairness code to save CPU cycles - Bluetooth: - rework of HCI command execution serialization to use a common queue and work struct, and improve handling errors reported in the middle of a batch of commands - rework HCI event handling to use skb_pull_data, avoiding packet parsing pitfalls - support AOSP Bluetooth Quality Report - SMC: - support net namespaces, following the RDMA model - improve connection establishment latency by pre-clearing buffers - introduce TCP ULP for automatic redirection to SMC - Multi-Path TCP: - support ioctls: SIOCINQ, OUTQ, and OUTQNSD - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT, IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY - support cmsgs: TCP_INQ - improvements in the data scheduler (assigning data to subflows) - support fastclose option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP) - MCTP (Management Component Transport) over serial, as defined by DMTF spec DSP0253 - "MCTP Serial Transport Binding". Driver API ---------- - Support timestamping on bond interfaces in active/passive mode. - Introduce generic phylink link mode validation for drivers which don't have any quirks and where MAC capability bits fully express what's supported. Allow PCS layer to participate in the validation. Convert a number of drivers. - Add support to set/get size of buffers on the Rx rings and size of the tx copybreak buffer via ethtool. - Support offloading TC actions as first-class citizens rather than only as attributes of filters, improve sharing and device resource utilization. - WiFi (mac80211/cfg80211): - support forwarding offload (ndo_fill_forward_path) - support for background radar detection hardware - SA Query Procedures offload on the AP side New hardware / drivers ---------------------- - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with real-time requirements for isochronous communication with protocols like OPC UA Pub/Sub. - Qualcomm BAM-DMUX WWAN - driver for data channels of modems integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974 (qcom_bam_dmux). - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver with support for bridging, VLANs and multicast forwarding (lan966x). - iwlmei driver for co-operating between Intel's WiFi driver and Intel's Active Management Technology (AMT) devices. - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips - Bluetooth: - MediaTek MT7921 SDIO devices - Foxconn MT7922A - Realtek RTL8852AE Drivers ------- - Significantly improve performance in the datapaths of: lan78xx, ax88179_178a, lantiq_xrx200, bnxt. - Intel Ethernet NICs: - igb: support PTP/time PEROUT and EXTTS SDP functions on 82580/i354/i350 adapters - ixgbevf: new PF -> VF mailbox API which avoids the risk of mailbox corruption with ESXi - iavf: support configuration of VLAN features of finer granularity, stacked tags and filtering - ice: PTP support for new E822 devices with sub-ns precision - ice: support firmware activation without reboot - Mellanox Ethernet NICs (mlx5): - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - support TC forwarding when tunnel encap and decap happen between two ports of the same NIC - dynamically size and allow disabling various features to save resources for running in embedded / SmartNIC scenarios - Broadcom Ethernet NICs (bnxt): - use page frag allocator to improve Rx performance - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - Other Ethernet NICs: - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support - Microsoft cloud/virtual NIC (mana): - add XDP support (PASS, DROP, TX) - Mellanox Ethernet switches (mlxsw): - initial support for Spectrum-4 ASICs - VxLAN with IPv6 underlay - Marvell Ethernet switches (prestera): - support flower flow templates - add basic IP forwarding support - NXP embedded Ethernet switches (ocelot & felix): - support Per-Stream Filtering and Policing (PSFP) - enable cut-through forwarding between ports by default - support FDMA to improve packet Rx/Tx to CPU - Other embedded switches: - hellcreek: improve trapping management (STP and PTP) packets - qca8k: support link aggregation and port mirroring - Qualcomm 802.11ax WiFi (ath11k): - qca6390, wcn6855: enable 802.11 power save mode in station mode - BSS color change support - WCN6855 hw2.1 support - 11d scan offload support - scan MAC address randomization support - full monitor mode, only supported on QCN9074 - qca6390/wcn6855: report signal and tx bitrate - qca6390: rfkill support - qca6390/wcn6855: regdb.bin support - Intel WiFi (iwlwifi): - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS) in cooperation with the BIOS - support for Optimized Connectivity Experience (OCE) scan - support firmware API version 68 - lots of preparatory work for the upcoming Bz device family - MediaTek WiFi (mt76): - Specific Absorption Rate (SAR) support - mt7921: 160 MHz channel support - RealTek WiFi (rtw88): - Specific Absorption Rate (SAR) support - scan offload - Other WiFi NICs - ath10k: support fetching (pre-)calibration data from nvmem - brcmfmac: configure keep-alive packet on suspend - wcn36xx: beacon filter support" * tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits) tcp: tcp_send_challenge_ack delete useless param `skb` net/qla3xxx: Remove useless DMA-32 fallback configuration rocker: Remove useless DMA-32 fallback configuration hinic: Remove useless DMA-32 fallback configuration lan743x: Remove useless DMA-32 fallback configuration net: enetc: Remove useless DMA-32 fallback configuration cxgb4vf: Remove useless DMA-32 fallback configuration cxgb4: Remove useless DMA-32 fallback configuration cxgb3: Remove useless DMA-32 fallback configuration bnx2x: Remove useless DMA-32 fallback configuration et131x: Remove useless DMA-32 fallback configuration be2net: Remove useless DMA-32 fallback configuration vmxnet3: Remove useless DMA-32 fallback configuration bna: Simplify DMA setting net: alteon: Simplify DMA setting myri10ge: Simplify DMA setting qlcnic: Simplify DMA setting net: allwinner: Fix print format page_pool: remove spinlock in page_pool_refill_alloc_cache() amt: fix wrong return type of amt_send_membership_update() ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-12-029-52/+59
| |\ | | | | | | | | | Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * \ Merge branch 'skbuff-struct-group'David S. Miller2021-11-221-3/+1
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kees Cook says: ==================== skbuff: Switch structure bounds to struct_group() This is a pair of patches to add struct_group() to struct sk_buff. The first is needed to work around sparse-specific complaints, and is new for v2. The second patch is the same as originally sent as v1. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | skbuff: Switch structure bounds to struct_group()Kees Cook2021-11-221-3/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Replace the existing empty member position markers "headers_start" and "headers_end" with a struct_group(). This will allow memcpy() and sizeof() to more easily reason about sizes, and improve readability. "pahole" shows no size nor member offset changes to struct sk_buff. "objdump -d" shows no object code changes (outside of WARNs affected by source line number changes). Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> # drivers/net/wireguard/* Link: https://lore.kernel.org/lkml/20210728035006.GD35706@embeddedor Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge drm/drm-next into drm-misc-next-fixesThomas Zimmermann2021-12-179-52/+59
|\ \ \ | |/ / |/| / | |/ | | | | | | Backmerging to bring drm-misc-next-fixes up to the latest state for the current release cycle. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
| * Merge branch 'wireguard-siphash-patches-for-5-16-rc6'Jakub Kicinski2021-11-299-52/+59
|/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard/siphash patches for 5.16-rc Here's quite a largeish set of stable patches I've had queued up and testing for a number of months now: - Patch (1) squelches a sparse warning by fixing an annotation. - Patches (2), (3), and (5) are minor improvements and fixes to the test suite. - Patch (4) is part of a tree-wide cleanup to have module-specific init and exit functions. - Patch (6) fixes a an issue with dangling dst references, by having a function to release references immediately rather than deferring, and adds an associated test case to prevent this from regressing. - Patches (7) and (8) help mitigate somewhat a potential DoS on the ingress path due to the use of skb_list's locking hitting contention on multiple cores by switching to using a ring buffer and dropping packets on contention rather than locking up another core spinning. - Patch (9) switches kvzalloc to kvcalloc for better form. - Patch (10) fixes alignment traps in siphash with clang-13 (and maybe other compilers) on armv6, by switching to using the unaligned functions by default instead of the aligned functions by default. ==================== Link: https://lore.kernel.org/r/20211129153929.3457-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: ratelimiter: use kvcalloc() instead of kvzalloc()Gustavo A. R. Silva2021-11-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Use 2-factor argument form kvcalloc() instead of kvzalloc(). Link: https://github.com/KSPP/linux/issues/162 Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> [Jason: Gustavo's link above is for KSPP, but this isn't actually a security fix, as table_size is bounded to 8192 anyway, and gcc realizes this, so the codegen comes out to be about the same.] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: receive: drop handshakes if queue lock is contendedJason A. Donenfeld2021-11-291-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're being delivered packets from multiple CPUs so quickly that the ring lock is contended for CPU tries, then it's safe to assume that the queue is near capacity anyway, so just drop the packet rather than spinning. This helps deal with multicore DoS that can interfere with data path performance. It _still_ does not completely fix the issue, but it again chips away at it. Reported-by: Streun Fabio <fstreun@student.ethz.ch> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: receive: use ring buffer for incoming handshakesJason A. Donenfeld2021-11-295-43/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apparently the spinlock on incoming_handshake's skb_queue is highly contended, and a torrent of handshake or cookie packets can bring the data plane to its knees, simply by virtue of enqueueing the handshake packets to be processed asynchronously. So, we try switching this to a ring buffer to hopefully have less lock contention. This alleviates the problem somewhat, though it still isn't perfect, so future patches will have to improve this further. However, it at least doesn't completely diminish the data plane. Reported-by: Streun Fabio <fstreun@student.ethz.ch> Reported-by: Joel Wanner <joel.wanner@inf.ethz.ch> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: device: reset peer src endpoint when netns exitsJason A. Donenfeld2021-11-292-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Paolo Abeni <pabeni@redhat.com> Fixes: be0d977e56c7 ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: main: rename 'mod_init' & 'mod_exit' functions to be module-specificRandy Dunlap2021-11-291-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename module_init & module_exit functions that are named "mod_init" and "mod_exit" so that they are unique in both the System.map file and in initcall_debug output instead of showing up as almost anonymous "mod_init". This is helpful for debugging and in determining how long certain module_init calls take to execute. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: allowedips: add missing __rcu annotation to satisfy sparseJason A. Donenfeld2021-11-291-1/+1
|/ | | | | | | | | A __rcu annotation got lost during refactoring, which caused sparse to become enraged. Fixes: aa6fe5251561 ("wireguard: allowedips: free empty intermediate nodes when removing single node") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge branch 'wireguard-fixes'David S. Miller2021-06-048-194/+226
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard fixes for 5.13-rc5 Here are bug fixes to WireGuard for 5.13-rc5: 1-2,6) These are small, trivial tweaks to our test harness. 3) Linus thinks -O3 is still dangerous to enable. The code gen wasn't so much different with -O2 either. 4) We were accidentally calling synchronize_rcu instead of synchronize_net while holding the rtnl_lock, resulting in some rather large stalls that hit production machines. 5) Peer allocation was wasting literally hundreds of megabytes on real world deployments, due to oddly sized large objects not fitting nicely into a kmalloc slab. 7-9) We move from an insanely expensive O(n) algorithm to a fast O(1) algorithm, and cleanup a massive memory leak in the process, in which allowed ips churn would leave danging nodes hanging around without cleanup until the interface was removed. The O(1) algorithm eliminates packet stalls and high latency issues, in addition to bringing operations that took as much as 10 minutes down to less than a second. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: allowedips: free empty intermediate nodes when removing single nodeJason A. Donenfeld2021-06-043-131/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When removing single nodes, it's possible that that node's parent is an empty intermediate node, in which case, it too should be removed. Otherwise the trie fills up and never is fully emptied, leading to gradual memory leaks over time for tries that are modified often. There was originally code to do this, but was removed during refactoring in 2016 and never reworked. Now that we have proper parent pointers from the previous commits, we can implement this properly. In order to reduce branching and expensive comparisons, we want to keep the double pointer for parent assignment (which lets us easily chain up to the root), but we still need to actually get the parent's base address. So encode the bit number into the last two bits of the pointer, and pack and unpack it as needed. This is a little bit clumsy but is the fastest and less memory wasteful of the compromises. Note that we align the root struct here to a minimum of 4, because it's embedded into a larger struct, and we're relying on having the bottom two bits for our flag, which would only be 16-bit aligned on m68k. The existing macro-based helpers were a bit unwieldy for adding the bit packing to, so this commit replaces them with safer and clearer ordinary functions. We add a test to the randomized/fuzzer part of the selftests, to free the randomized tries by-peer, refuzz it, and repeat, until it's supposed to be empty, and then then see if that actually resulted in the whole thing being emptied. That combined with kmemcheck should hopefully make sure this commit is doing what it should. Along the way this resulted in various other cleanups of the tests and fixes for recent graphviz. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: allowedips: allocate nodes in kmem_cacheJason A. Donenfeld2021-06-043-8/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit moved from O(n) to O(1) for removal, but in the process introduced an additional pointer member to a struct that increased the size from 60 to 68 bytes, putting nodes in the 128-byte slab. With deployed systems having as many as 2 million nodes, this represents a significant doubling in memory usage (128 MiB -> 256 MiB). Fix this by using our own kmem_cache, that's sized exactly right. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: allowedips: remove nodes in O(1)Jason A. Donenfeld2021-06-042-84/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, deleting peers would require traversing the entire trie in order to rebalance nodes and safely free them. This meant that removing 1000 peers from a trie with a half million nodes would take an extremely long time, during which we're holding the rtnl lock. Large-scale users were reporting 200ms latencies added to the networking stack as a whole every time their userspace software would queue up significant removals. That's a serious situation. This commit fixes that by maintaining a double pointer to the parent's bit pointer for each node, and then using the already existing node list belonging to each peer to go directly to the node, fix up its pointers, and free it with RCU. This means removal is O(1) instead of O(n), and we don't use gobs of stack. The removal algorithm has the same downside as the code that it fixes: it won't collapse needlessly long runs of fillers. We can enhance that in the future if it ever becomes a problem. This commit documents that limitation with a TODO comment in code, a small but meaningful improvement over the prior situation. Currently the biggest flaw, which the next commit addresses, is that because this increases the node size on 64-bit machines from 60 bytes to 68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up using twice as much memory per node, because of power-of-two allocations, which is a big bummer. We'll need to figure something out there. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: allowedips: initialize list head in selftestJason A. Donenfeld2021-06-041-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The randomized trie tests weren't initializing the dummy peer list head, resulting in a NULL pointer dereference when used. Fix this by initializing it in the randomized trie test, just like we do for the static unit test. While we're at it, all of the other strings like this have the word "self-test", so add it to the missing place here. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: peer: allocate in kmem_cacheJason A. Donenfeld2021-06-043-4/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With deployments having upwards of 600k peers now, this somewhat heavy structure could benefit from more fine-grained allocations. Specifically, instead of using a 2048-byte slab for a 1544-byte object, we can now use 1544-byte objects directly, thus saving almost 25% per-peer, or with 600k peers, that's a savings of 303 MiB. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Fixes: b955554b47af ("wireguard: queueing: get rid of per-peer ring buffers") Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: use synchronize_net rather than synchronize_rcuJason A. Donenfeld2021-06-042-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wireguard: do not use -O3Jason A. Donenfeld2021-06-041-2/+1
|/ | | | | | | | | | | | | | | Apparently, various versions of gcc have O3-related miscompiles. Looking at the difference between -O2 and -O3 for gcc 11 doesn't indicate miscompiles, but the difference also doesn't seem so significant for performance that it's worth risking. Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/ Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2021-04-300-0/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID updates from Jiri Kosina: - Surface Aggregator Module support from Maximilian Luz - Apple Magic Mouse 2 support from John Chen - Support for newer Quad/BT 2.0 Logitech receivers in HID proxy mode from Hans de Goede - Thinkpad X1 Tablet keyboard support from Hans de Goede - Support for FTDI FT260 I2C host adapter from Michael Zaidman - other various small device-specific quirks, fixes and cleanups * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (46 commits) HID: wacom: Setup pen input capabilities to the targeted tools HID: hid-sensor-hub: Move 'hsdev' description to correct struct definition HID: hid-sensor-hub: Remove unused struct member 'quirks' HID: wacom_sys: Demote kernel-doc abuse HID: hid-sensor-custom: Remove unused variable 'ret' HID: hid-uclogic-params: Ensure function names are present and correct in kernel-doc headers HID: hid-uclogic-rdesc: Kernel-doc is for functions and structs HID: hid-logitech-hidpp: Fix conformant kernel-doc header and demote abuses HID: hid-picolcd_core: Remove unused variable 'ret' HID: hid-kye: Fix incorrect function name for kye_tablet_enable() HID: hid-core: Fix incorrect function name in header HID: hid-alps: Correct struct misnaming HID: usbhid: hid-pidff: Demote a couple kernel-doc abuses HID: usbhid: Repair a formatting issue in a struct description HID: hid-thrustmaster: Demote a bunch of kernel-doc abuses HID: input: map battery capacity (00850065) HID: magicmouse: fix reconnection of Magic Mouse 2 HID: magicmouse: fix 3 button emulation of Mouse 2 HID: magicmouse: add Apple Magic Mouse 2 support HID: lenovo: Add support for Thinkpad X1 Tablet Thin keyboard ...
| * Merge tag 'platform-drivers-x86-surface-aggregator-v5.13-1' of ↵Jiri Kosina2021-03-309-103/+155
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 into for-5.13/surface-system-aggregator-intergration Pull immutable integration branch from Hans de Goede to have a stable base for SSAM (Surface System Aggregator Module) HID transport subsystem merge. ===== Signed tag for the immutable platform-surface-aggregator-registry branch for merging into other sub-systems. Note this is based on v5.12-rc2. =====
* | \ Merge tag 'irqchip-fixes-5.12-1' of ↵Thomas Gleixner2021-03-149-103/+155
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip fixes from Marc Zyngier: - More compatible strings for the Ingenic irqchip (introducing the JZ4760B SoC) - Select GENERIC_IRQ_MULTI_HANDLER on the ARM ep93xx platform - Drop all GENERIC_IRQ_MULTI_HANDLER selections from the irqchip Kconfig, now relying on the architecture to get it right - Drop the debugfs_file field from struct irq_domain, now that debugfs can track things on its own
| * | Merge tag 'net-5.12-rc1' of ↵Linus Torvalds2021-02-259-103/+155
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Rather small batch this time. Current release - regressions: - bcm63xx_enet: fix sporadic kernel panic due to queue length mis-accounting Current release - new code bugs: - bcm4908_enet: fix RX path possible mem leak - bcm4908_enet: fix NAPI poll returned value - stmmac: fix missing spin_lock_init in visconti_eth_dwmac_probe() - sched: cls_flower: validate ct_state for invalid and reply flags Previous releases - regressions: - net: introduce CAN specific pointer in the struct net_device to prevent mis-interpreting memory - phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081 - psample: fix netlink skb length with tunnel info Previous releases - always broken: - icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending - wireguard: device: do not generate ICMP for non-IP packets - mptcp: provide subflow aware release function to avoid a mem leak - hsr: add support for EntryForgetTime - r8169: fix jumbo packet handling on RTL8168e - octeontx2-af: fix an off by one in rvu_dbg_qsize_write() - i40e: fix flow for IPv6 next header (extension header) - phy: icplus: call phy_restore_page() when phy_select_page() fails - dpaa_eth: fix the access method for the dpaa_napi_portal" * tag 'net-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits) r8169: fix jumbo packet handling on RTL8168e net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081 net: psample: Fix netlink skb length with tunnel info net: broadcom: bcm4908_enet: fix NAPI poll returned value net: broadcom: bcm4908_enet: fix RX path possible mem leak net: hsr: add support for EntryForgetTime net: dsa: sja1105: Remove unneeded cast in sja1105_crc32() ibmvnic: fix a race between open and reset net: stmmac: Fix missing spin_lock_init in visconti_eth_dwmac_probe() net: introduce CAN specific pointer in the struct net_device net: usb: qmi_wwan: support ZTE P685M modem wireguard: kconfig: use arm chacha even with no neon wireguard: queueing: get rid of per-peer ring buffers wireguard: device: do not generate ICMP for non-IP packets wireguard: peer: put frequently used members above cache lines wireguard: selftests: test multiple parallel streams wireguard: socket: remove bogus __be32 annotation wireguard: avoid double unlikely() notation when using IS_ERR() net: qrtr: Fix memory leak in qrtr_tun_open vxlan: move debug check after netdev unregister ...
| | * Merge branch 'wireguard-fixes-for-5-12-rc1'Jakub Kicinski2021-02-239-103/+155
| |/| |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason Donenfeld says: ==================== wireguard fixes for 5.12-rc1 This series has a collection of fixes that have piled up for a little while now, that I unfortunately didn't get a chance to send out earlier. 1) Removes unlikely() from IS_ERR(), since it's already implied. 2) Remove a bogus sparse annotation that hasn't been needed for years. 3) Addition test in the test suite for stressing parallel ndo_start_xmit. 4) Slight struct reordering in preparation for subsequent fix. 5) If skb->protocol is bogus, we no longer attempt to send icmp messages. 6) Massive memory usage fix, hit by larger deployments. 7) Fix typo in kconfig dependency logic. (1) and (2) are tiny cleanups, and (3) is just a test, so if you're trying to reduce churn, you could not backport these. But (4), (5), (6), and (7) fix problems and should be applied to stable. IMO, it's probably easiest to just apply them all to stable. ==================== Link: https://lore.kernel.org/r/20210222162549.3252778-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: queueing: get rid of per-peer ring buffersJason A. Donenfeld2021-02-238-93/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: device: do not generate ICMP for non-IP packetsJason A. Donenfeld2021-02-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If skb->protocol doesn't match the actual skb->data header, it's probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is expecting to reply to a valid IP packet. So this commit has that early mismatch case jump to a later error label. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: peer: put frequently used members above cache linesJason A. Donenfeld2021-02-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The is_dead boolean is checked for every single packet, while the internal_id member is used basically only for pr_debug messages. So it makes sense to hoist up is_dead into some space formerly unused by a struct hole, while demoting internal_api to below the lowest struct cache line. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: socket: remove bogus __be32 annotationJann Horn2021-02-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The endpoint->src_if4 has nothing to do with fixed-endian numbers; remove the bogus annotation. This was introduced in https://git.zx2c4.com/wireguard-monolithic-historical/commit?id=14e7d0a499a676ec55176c0de2f9fcbd34074a82 in the historical WireGuard repo because the old code used to zero-initialize multiple members as follows: endpoint->src4.s_addr = endpoint->src_if4 = fl.saddr = 0; Because fl.saddr is fixed-endian and an assignment returns a value with the type of its left operand, this meant that sparse detected an assignment between values of different endianness. Since then, this assignment was already split up into separate statements; just the cast survived. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: avoid double unlikely() notation when using IS_ERR()Antonio Quartulli2021-02-232-3/+3
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * Merge tag 'amba-make-remove-return-void' of ↵Russell King2021-02-022-3/+3
| |\ | |/ |/| | | | | | | https://git.pengutronix.de/git/ukl/linux into devel-stable Tag for adaptions to struct amba_driver::remove changing prototype
* | Merge tag 'selinux-pr-20201214' of ↵Linus Torvalds2020-12-161-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: "While we have a small number of SELinux patches for v5.11, there are a few changes worth highlighting: - Change the LSM network hooks to pass flowi_common structs instead of the parent flowi struct as the LSMs do not currently need the full flowi struct and they do not have enough information to use it safely (missing information on the address family). This patch was discussed both with Herbert Xu (representing team netdev) and James Morris (representing team LSMs-other-than-SELinux). - Fix how we handle errors in inode_doinit_with_dentry() so that we attempt to properly label the inode on following lookups instead of continuing to treat it as unlabeled. - Tweak the kernel logic around allowx, auditallowx, and dontauditx SELinux policy statements such that the auditx/dontauditx are effective even without the allowx statement. Everything passes our test suite" * tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: lsm,selinux: pass flowi_common instead of flowi to the LSM hooks selinux: Fix fall-through warnings for Clang selinux: drop super_block backpointer from superblock_security_struct selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling selinux: allow dontauditx and auditallowx rules to take effect without allowx selinux: fix error initialization in inode_doinit_with_dentry()
| * | lsm,selinux: pass flowi_common instead of flowi to the LSM hooksPaul Moore2020-11-231-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | As pointed out by Herbert in a recent related patch, the LSM hooks do not have the necessary address family information to use the flowi struct safely. As none of the LSMs currently use any of the protocol specific flowi information, replace the flowi pointers with pointers to the address family independent flowi_common struct. Reported-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
* | Merge branch 'net-add-and-use-dev_get_tstats64'Jakub Kicinski2020-11-091-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Heiner Kallweit says: ==================== net: add and use dev_get_tstats64 It's a frequent pattern to use netdev->stats for the less frequently accessed counters and per-cpu counters for the frequently accessed counters (rx/tx bytes/packets). Add a default ndo_get_stats64() implementation for this use case. Subsequently switch more drivers to use this pattern. v2: - add patches for replacing ip_tunnel_get_stats64 Requested additional migrations will come in a separate series. v3: - add atomic_long_t member rx_frame_errors in patch 3 for making counter updates atomic ==================== Link: https://lore.kernel.org/r/99273e2f-c218-cd19-916e-9161d8ad8c56@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: switch to dev_get_tstats64Heiner Kallweit2020-11-091-1/+1
|/ | | | | | | | | Replace ip_tunnel_get_stats64() with the new identical core function dev_get_tstats64(). Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-09-222-7/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'wireguard-fixes'David S. Miller2020-09-092-7/+9
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard fixes for 5.9-rc5 Yesterday, Eric reported a race condition found by syzbot. This series contains two commits, one that fixes the direct issue, and another that addresses the more general issue, as a defense in depth. 1) The basic problem syzbot unearthed was that one particular mutation of handshake->entry was not protected by the handshake mutex like the other cases, so this patch basically just reorders a line to make sure the mutex is actually taken at the right point. Most of the work here went into making sure the race was fully understood and making a reproducer (which syzbot was unable to do itself, due to the rarity of the race). 2) Eric's initial suggestion for fixing this was taking a spinlock around the hash table replace function where the null ptr deref was happening. This doesn't address the main problem in the most precise possible way like (1) does, but it is a good suggestion for defense-in-depth, in case related issues come up in the future, and basically costs nothing from a performance perspective. I thought it aided in implementing a good general rule: all mutators of that hash table take the table lock. So that's part of this series as a companion. Both of these contain Fixes: tags and are good candidates for stable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * wireguard: peerlookup: take lock before checking hash in replace operationJason A. Donenfeld2020-09-091-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric's suggested fix for the previous commit's mentioned race condition was to simply take the table->lock in wg_index_hashtable_replace(). The table->lock of the hash table is supposed to protect the bucket heads, not the entires, but actually, since all the mutator functions are already taking it, it makes sense to take it too for the test to hlist_unhashed, as a defense in depth measure, so that it no longer races with deletions, regardless of what other locks are protecting individual entries. This is sensible from a performance perspective because, as Eric pointed out, the case of being unhashed is already the unlikely case, so this won't add common contention. And comparing instructions, this basically doesn't make much of a difference other than pushing and popping %r13, used by the new `bool ret`. More generally, I like the idea of locking consistency across table mutator functions, and this might let me rest slightly easier at night. Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/wireguard/20200908145911.4090480-1-edumazet@google.com/ Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * wireguard: noise: take lock when removing handshake entry from tableJason A. Donenfeld2020-09-091-4/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric reported that syzkaller found a race of this variety: CPU 1 CPU 2 -------------------------------------------|--------------------------------------- wg_index_hashtable_replace(old, ...) | if (hlist_unhashed(&old->index_hash)) | | wg_index_hashtable_remove(old) | hlist_del_init_rcu(&old->index_hash) | old->index_hash.pprev = NULL hlist_replace_rcu(&old->index_hash, ...) | *old->index_hash.pprev | Syzbot wasn't actually able to reproduce this more than once or create a reproducer, because the race window between checking "hlist_unhashed" and calling "hlist_replace_rcu" is just so small. Adding an mdelay(5) or similar there helps make this demonstrable using this simple script: #!/bin/bash set -ex trap 'kill $pid1; kill $pid2; ip link del wg0; ip link del wg1' EXIT ip link add wg0 type wireguard ip link add wg1 type wireguard wg set wg0 private-key <(wg genkey) listen-port 9999 wg set wg1 private-key <(wg genkey) peer $(wg show wg0 public-key) endpoint 127.0.0.1:9999 persistent-keepalive 1 wg set wg0 peer $(wg show wg1 public-key) ip link set wg0 up yes link set wg1 up | ip -force -batch - & pid1=$! yes link set wg1 down | ip -force -batch - & pid2=$! wait The fundumental underlying problem is that we permit calls to wg_index_ hashtable_remove(handshake.entry) without requiring the caller to take the handshake mutex that is intended to protect members of handshake during mutations. This is consistently the case with calls to wg_index_ hashtable_insert(handshake.entry) and wg_index_hashtable_replace( handshake.entry), but it's missing from a pertinent callsite of wg_ index_hashtable_remove(handshake.entry). So, this patch makes sure that mutex is taken. The original code was a little bit funky though, in the form of: remove(handshake.entry) lock(), memzero(handshake.some_members), unlock() remove(handshake.entry) The original intention of that double removal pattern outside the lock appears to be some attempt to prevent insertions that might happen while locks are dropped during expensive crypto operations, but actually, all callers of wg_index_hashtable_insert(handshake.entry) take the write lock and then explicitly check handshake.state, as they should, which the aforementioned memzero clears, which means an insertion should already be impossible. And regardless, the original intention was necessarily racy, since it wasn't guaranteed that something else would run after the unlock() instead of after the remove(). So, from a soundness perspective, it seems positive to remove what looks like a hack at best. The crash from both syzbot and from the script above is as follows: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 7395 Comm: kworker/0:3 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: wg-kex-wg1 wg_packet_handshake_receive_worker RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: wg_noise_handshake_begin_session+0x752/0xc9a drivers/net/wireguard/noise.c:820 wg_receive_handshake_packet drivers/net/wireguard/receive.c:183 [inline] wg_packet_handshake_receive_worker+0x33b/0x730 drivers/net/wireguard/receive.c:220 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/wireguard/20200908145911.4090480-1-edumazet@google.com/ Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'netlink-allow-NLA_BINARY-length-range-validation'David S. Miller2020-08-181-7/+7
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Johannes Berg says: ==================== netlink: allow NLA_BINARY length range validation In quite a few places (perhaps particularly in wireless) we need to validation an NLA_BINARY attribute with both a minimum and a maximum length. Currently, we can do either of the two, but not both, given that we have NLA_MIN_LEN (minimum length) and NLA_BINARY (maximum). Extend the range mechanisms that we use for integer validation to apply to NLA_BINARY as well. After converting everything to use NLA_POLICY_MIN_LEN() we can thus get rid of the NLA_MIN_LEN type since that's now a special case of NLA_BINARY with a minimum length validation. Similarly, NLA_EXACT_LEN can be specified using NLA_POLICY_EXACT_LEN() and also maps to the new NLA_BINARY validation (min == max == desired length). Finally, NLA_POLICY_EXACT_LEN_WARN() also gets to be a somewhat special case of this. I haven't included the patch here now that converts nl82011 to use this because it doesn't apply without another cleanup patch, but we can remove a number of hand-coded min/max length checks and get better error messages from the general validation code while doing that. As I had originally built the netlink policy export to userspace in a way that has min/max length for NLA_BINARY (for the types that we used to call NLA_MIN_LEN, NLA_BINARY and NLA_EXACT_LEN) anyway, it doesn't really change anything there except that now there's a chance that userspace sees min length < max length, which previously wasn't possible. v2: * fix the min<max comment to correctly say min<=max ==================== Signed-off-by: David S. Miller <davem@davemloft.net>