summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAgeFilesLines
* wireguard: allowedips: batch process peer removalsjd/deferred-aip-removalJason A. Donenfeld2021-05-196-101/+128
| | | | | | | | | | | | | | | | | | | Deleting peers requires traversing the entire trie in order to rebalance nodes and safely free them so that we can use RCU in the critical path and never block. But for a structure filled with half million nodes, removing a few thousand of them can take an extremely long time, during which we're holding the rtnl lock. Large-scale users were reporting 200ms latencies added to the networking stack as a whole every time their userspace software would queue up significant removals. This commit works around the problem by marking nodes as dead, and then scheduling a deferred cleanup routine a second later to do one sweep of the entire structure, in order to amortize removals to just a single traversal. Not only should this remove the added latencies to the stack, but it should also make update operations that include peer removal or allowedips changes much faster. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* wireguard: do not use -O3Jason A. Donenfeld2021-05-191-2/+1
| | | | | | | | | | | | | Apparently, various versions of gcc have O3-related miscompiles. Looking at the difference between -O2 and -O3 for gcc 11 doesn't indicate miscompiles, but the difference also doesn't seem so significant for performance that it's worth risking. Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/ Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* wireguard: use synchronize_net rather than synchronize_rcuJason A. Donenfeld2021-05-192-4/+4
| | | | | | | | | | | | | | Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* wireguard: allowedips: initialize list head in selftestJason A. Donenfeld2021-05-191-1/+2
| | | | | | | | | | | | | The randomized trie tests weren't initializing the dummy peer list head, resulting in a NULL pointer dereference when used. Fix this by initializing it in the randomized trie test, just like we do for the static unit test. While we're at it, all of the other strings like this have the word "self-test", so add it to the missing place here. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2021-04-300-0/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID updates from Jiri Kosina: - Surface Aggregator Module support from Maximilian Luz - Apple Magic Mouse 2 support from John Chen - Support for newer Quad/BT 2.0 Logitech receivers in HID proxy mode from Hans de Goede - Thinkpad X1 Tablet keyboard support from Hans de Goede - Support for FTDI FT260 I2C host adapter from Michael Zaidman - other various small device-specific quirks, fixes and cleanups * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (46 commits) HID: wacom: Setup pen input capabilities to the targeted tools HID: hid-sensor-hub: Move 'hsdev' description to correct struct definition HID: hid-sensor-hub: Remove unused struct member 'quirks' HID: wacom_sys: Demote kernel-doc abuse HID: hid-sensor-custom: Remove unused variable 'ret' HID: hid-uclogic-params: Ensure function names are present and correct in kernel-doc headers HID: hid-uclogic-rdesc: Kernel-doc is for functions and structs HID: hid-logitech-hidpp: Fix conformant kernel-doc header and demote abuses HID: hid-picolcd_core: Remove unused variable 'ret' HID: hid-kye: Fix incorrect function name for kye_tablet_enable() HID: hid-core: Fix incorrect function name in header HID: hid-alps: Correct struct misnaming HID: usbhid: hid-pidff: Demote a couple kernel-doc abuses HID: usbhid: Repair a formatting issue in a struct description HID: hid-thrustmaster: Demote a bunch of kernel-doc abuses HID: input: map battery capacity (00850065) HID: magicmouse: fix reconnection of Magic Mouse 2 HID: magicmouse: fix 3 button emulation of Mouse 2 HID: magicmouse: add Apple Magic Mouse 2 support HID: lenovo: Add support for Thinkpad X1 Tablet Thin keyboard ...
| * Merge tag 'platform-drivers-x86-surface-aggregator-v5.13-1' of ↵Jiri Kosina2021-03-309-103/+155
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 into for-5.13/surface-system-aggregator-intergration Pull immutable integration branch from Hans de Goede to have a stable base for SSAM (Surface System Aggregator Module) HID transport subsystem merge. ===== Signed tag for the immutable platform-surface-aggregator-registry branch for merging into other sub-systems. Note this is based on v5.12-rc2. =====
* | \ Merge tag 'irqchip-fixes-5.12-1' of ↵Thomas Gleixner2021-03-149-103/+155
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip fixes from Marc Zyngier: - More compatible strings for the Ingenic irqchip (introducing the JZ4760B SoC) - Select GENERIC_IRQ_MULTI_HANDLER on the ARM ep93xx platform - Drop all GENERIC_IRQ_MULTI_HANDLER selections from the irqchip Kconfig, now relying on the architecture to get it right - Drop the debugfs_file field from struct irq_domain, now that debugfs can track things on its own
| * | Merge tag 'net-5.12-rc1' of ↵Linus Torvalds2021-02-259-103/+155
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Rather small batch this time. Current release - regressions: - bcm63xx_enet: fix sporadic kernel panic due to queue length mis-accounting Current release - new code bugs: - bcm4908_enet: fix RX path possible mem leak - bcm4908_enet: fix NAPI poll returned value - stmmac: fix missing spin_lock_init in visconti_eth_dwmac_probe() - sched: cls_flower: validate ct_state for invalid and reply flags Previous releases - regressions: - net: introduce CAN specific pointer in the struct net_device to prevent mis-interpreting memory - phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081 - psample: fix netlink skb length with tunnel info Previous releases - always broken: - icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending - wireguard: device: do not generate ICMP for non-IP packets - mptcp: provide subflow aware release function to avoid a mem leak - hsr: add support for EntryForgetTime - r8169: fix jumbo packet handling on RTL8168e - octeontx2-af: fix an off by one in rvu_dbg_qsize_write() - i40e: fix flow for IPv6 next header (extension header) - phy: icplus: call phy_restore_page() when phy_select_page() fails - dpaa_eth: fix the access method for the dpaa_napi_portal" * tag 'net-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits) r8169: fix jumbo packet handling on RTL8168e net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8081 net: psample: Fix netlink skb length with tunnel info net: broadcom: bcm4908_enet: fix NAPI poll returned value net: broadcom: bcm4908_enet: fix RX path possible mem leak net: hsr: add support for EntryForgetTime net: dsa: sja1105: Remove unneeded cast in sja1105_crc32() ibmvnic: fix a race between open and reset net: stmmac: Fix missing spin_lock_init in visconti_eth_dwmac_probe() net: introduce CAN specific pointer in the struct net_device net: usb: qmi_wwan: support ZTE P685M modem wireguard: kconfig: use arm chacha even with no neon wireguard: queueing: get rid of per-peer ring buffers wireguard: device: do not generate ICMP for non-IP packets wireguard: peer: put frequently used members above cache lines wireguard: selftests: test multiple parallel streams wireguard: socket: remove bogus __be32 annotation wireguard: avoid double unlikely() notation when using IS_ERR() net: qrtr: Fix memory leak in qrtr_tun_open vxlan: move debug check after netdev unregister ...
| | * Merge branch 'wireguard-fixes-for-5-12-rc1'Jakub Kicinski2021-02-239-103/+155
| |/| |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason Donenfeld says: ==================== wireguard fixes for 5.12-rc1 This series has a collection of fixes that have piled up for a little while now, that I unfortunately didn't get a chance to send out earlier. 1) Removes unlikely() from IS_ERR(), since it's already implied. 2) Remove a bogus sparse annotation that hasn't been needed for years. 3) Addition test in the test suite for stressing parallel ndo_start_xmit. 4) Slight struct reordering in preparation for subsequent fix. 5) If skb->protocol is bogus, we no longer attempt to send icmp messages. 6) Massive memory usage fix, hit by larger deployments. 7) Fix typo in kconfig dependency logic. (1) and (2) are tiny cleanups, and (3) is just a test, so if you're trying to reduce churn, you could not backport these. But (4), (5), (6), and (7) fix problems and should be applied to stable. IMO, it's probably easiest to just apply them all to stable. ==================== Link: https://lore.kernel.org/r/20210222162549.3252778-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: queueing: get rid of per-peer ring buffersJason A. Donenfeld2021-02-238-93/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: device: do not generate ICMP for non-IP packetsJason A. Donenfeld2021-02-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If skb->protocol doesn't match the actual skb->data header, it's probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is expecting to reply to a valid IP packet. So this commit has that early mismatch case jump to a later error label. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: peer: put frequently used members above cache linesJason A. Donenfeld2021-02-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The is_dead boolean is checked for every single packet, while the internal_id member is used basically only for pr_debug messages. So it makes sense to hoist up is_dead into some space formerly unused by a struct hole, while demoting internal_api to below the lowest struct cache line. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: socket: remove bogus __be32 annotationJann Horn2021-02-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The endpoint->src_if4 has nothing to do with fixed-endian numbers; remove the bogus annotation. This was introduced in https://git.zx2c4.com/wireguard-monolithic-historical/commit?id=14e7d0a499a676ec55176c0de2f9fcbd34074a82 in the historical WireGuard repo because the old code used to zero-initialize multiple members as follows: endpoint->src4.s_addr = endpoint->src_if4 = fl.saddr = 0; Because fl.saddr is fixed-endian and an assignment returns a value with the type of its left operand, this meant that sparse detected an assignment between values of different endianness. Since then, this assignment was already split up into separate statements; just the cast survived. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * wireguard: avoid double unlikely() notation when using IS_ERR()Antonio Quartulli2021-02-232-3/+3
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * Merge tag 'amba-make-remove-return-void' of ↵Russell King2021-02-022-3/+3
| |\ | |/ |/| | | | | | | https://git.pengutronix.de/git/ukl/linux into devel-stable Tag for adaptions to struct amba_driver::remove changing prototype
* | Merge tag 'selinux-pr-20201214' of ↵Linus Torvalds2020-12-161-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: "While we have a small number of SELinux patches for v5.11, there are a few changes worth highlighting: - Change the LSM network hooks to pass flowi_common structs instead of the parent flowi struct as the LSMs do not currently need the full flowi struct and they do not have enough information to use it safely (missing information on the address family). This patch was discussed both with Herbert Xu (representing team netdev) and James Morris (representing team LSMs-other-than-SELinux). - Fix how we handle errors in inode_doinit_with_dentry() so that we attempt to properly label the inode on following lookups instead of continuing to treat it as unlabeled. - Tweak the kernel logic around allowx, auditallowx, and dontauditx SELinux policy statements such that the auditx/dontauditx are effective even without the allowx statement. Everything passes our test suite" * tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: lsm,selinux: pass flowi_common instead of flowi to the LSM hooks selinux: Fix fall-through warnings for Clang selinux: drop super_block backpointer from superblock_security_struct selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling selinux: allow dontauditx and auditallowx rules to take effect without allowx selinux: fix error initialization in inode_doinit_with_dentry()
| * | lsm,selinux: pass flowi_common instead of flowi to the LSM hooksPaul Moore2020-11-231-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | As pointed out by Herbert in a recent related patch, the LSM hooks do not have the necessary address family information to use the flowi struct safely. As none of the LSMs currently use any of the protocol specific flowi information, replace the flowi pointers with pointers to the address family independent flowi_common struct. Reported-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
* | Merge branch 'net-add-and-use-dev_get_tstats64'Jakub Kicinski2020-11-091-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Heiner Kallweit says: ==================== net: add and use dev_get_tstats64 It's a frequent pattern to use netdev->stats for the less frequently accessed counters and per-cpu counters for the frequently accessed counters (rx/tx bytes/packets). Add a default ndo_get_stats64() implementation for this use case. Subsequently switch more drivers to use this pattern. v2: - add patches for replacing ip_tunnel_get_stats64 Requested additional migrations will come in a separate series. v3: - add atomic_long_t member rx_frame_errors in patch 3 for making counter updates atomic ==================== Link: https://lore.kernel.org/r/99273e2f-c218-cd19-916e-9161d8ad8c56@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * wireguard: switch to dev_get_tstats64Heiner Kallweit2020-11-091-1/+1
|/ | | | | | | | | Replace ip_tunnel_get_stats64() with the new identical core function dev_get_tstats64(). Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-09-222-7/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'wireguard-fixes'David S. Miller2020-09-092-7/+9
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard fixes for 5.9-rc5 Yesterday, Eric reported a race condition found by syzbot. This series contains two commits, one that fixes the direct issue, and another that addresses the more general issue, as a defense in depth. 1) The basic problem syzbot unearthed was that one particular mutation of handshake->entry was not protected by the handshake mutex like the other cases, so this patch basically just reorders a line to make sure the mutex is actually taken at the right point. Most of the work here went into making sure the race was fully understood and making a reproducer (which syzbot was unable to do itself, due to the rarity of the race). 2) Eric's initial suggestion for fixing this was taking a spinlock around the hash table replace function where the null ptr deref was happening. This doesn't address the main problem in the most precise possible way like (1) does, but it is a good suggestion for defense-in-depth, in case related issues come up in the future, and basically costs nothing from a performance perspective. I thought it aided in implementing a good general rule: all mutators of that hash table take the table lock. So that's part of this series as a companion. Both of these contain Fixes: tags and are good candidates for stable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * wireguard: peerlookup: take lock before checking hash in replace operationJason A. Donenfeld2020-09-091-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric's suggested fix for the previous commit's mentioned race condition was to simply take the table->lock in wg_index_hashtable_replace(). The table->lock of the hash table is supposed to protect the bucket heads, not the entires, but actually, since all the mutator functions are already taking it, it makes sense to take it too for the test to hlist_unhashed, as a defense in depth measure, so that it no longer races with deletions, regardless of what other locks are protecting individual entries. This is sensible from a performance perspective because, as Eric pointed out, the case of being unhashed is already the unlikely case, so this won't add common contention. And comparing instructions, this basically doesn't make much of a difference other than pushing and popping %r13, used by the new `bool ret`. More generally, I like the idea of locking consistency across table mutator functions, and this might let me rest slightly easier at night. Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/wireguard/20200908145911.4090480-1-edumazet@google.com/ Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * wireguard: noise: take lock when removing handshake entry from tableJason A. Donenfeld2020-09-091-4/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eric reported that syzkaller found a race of this variety: CPU 1 CPU 2 -------------------------------------------|--------------------------------------- wg_index_hashtable_replace(old, ...) | if (hlist_unhashed(&old->index_hash)) | | wg_index_hashtable_remove(old) | hlist_del_init_rcu(&old->index_hash) | old->index_hash.pprev = NULL hlist_replace_rcu(&old->index_hash, ...) | *old->index_hash.pprev | Syzbot wasn't actually able to reproduce this more than once or create a reproducer, because the race window between checking "hlist_unhashed" and calling "hlist_replace_rcu" is just so small. Adding an mdelay(5) or similar there helps make this demonstrable using this simple script: #!/bin/bash set -ex trap 'kill $pid1; kill $pid2; ip link del wg0; ip link del wg1' EXIT ip link add wg0 type wireguard ip link add wg1 type wireguard wg set wg0 private-key <(wg genkey) listen-port 9999 wg set wg1 private-key <(wg genkey) peer $(wg show wg0 public-key) endpoint 127.0.0.1:9999 persistent-keepalive 1 wg set wg0 peer $(wg show wg1 public-key) ip link set wg0 up yes link set wg1 up | ip -force -batch - & pid1=$! yes link set wg1 down | ip -force -batch - & pid2=$! wait The fundumental underlying problem is that we permit calls to wg_index_ hashtable_remove(handshake.entry) without requiring the caller to take the handshake mutex that is intended to protect members of handshake during mutations. This is consistently the case with calls to wg_index_ hashtable_insert(handshake.entry) and wg_index_hashtable_replace( handshake.entry), but it's missing from a pertinent callsite of wg_ index_hashtable_remove(handshake.entry). So, this patch makes sure that mutex is taken. The original code was a little bit funky though, in the form of: remove(handshake.entry) lock(), memzero(handshake.some_members), unlock() remove(handshake.entry) The original intention of that double removal pattern outside the lock appears to be some attempt to prevent insertions that might happen while locks are dropped during expensive crypto operations, but actually, all callers of wg_index_hashtable_insert(handshake.entry) take the write lock and then explicitly check handshake.state, as they should, which the aforementioned memzero clears, which means an insertion should already be impossible. And regardless, the original intention was necessarily racy, since it wasn't guaranteed that something else would run after the unlock() instead of after the remove(). So, from a soundness perspective, it seems positive to remove what looks like a hack at best. The crash from both syzbot and from the script above is as follows: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 7395 Comm: kworker/0:3 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: wg-kex-wg1 wg_packet_handshake_receive_worker RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: wg_noise_handshake_begin_session+0x752/0xc9a drivers/net/wireguard/noise.c:820 wg_receive_handshake_packet drivers/net/wireguard/receive.c:183 [inline] wg_packet_handshake_receive_worker+0x33b/0x730 drivers/net/wireguard/receive.c:220 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/wireguard/20200908145911.4090480-1-edumazet@google.com/ Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'netlink-allow-NLA_BINARY-length-range-validation'David S. Miller2020-08-181-7/+7
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Johannes Berg says: ==================== netlink: allow NLA_BINARY length range validation In quite a few places (perhaps particularly in wireless) we need to validation an NLA_BINARY attribute with both a minimum and a maximum length. Currently, we can do either of the two, but not both, given that we have NLA_MIN_LEN (minimum length) and NLA_BINARY (maximum). Extend the range mechanisms that we use for integer validation to apply to NLA_BINARY as well. After converting everything to use NLA_POLICY_MIN_LEN() we can thus get rid of the NLA_MIN_LEN type since that's now a special case of NLA_BINARY with a minimum length validation. Similarly, NLA_EXACT_LEN can be specified using NLA_POLICY_EXACT_LEN() and also maps to the new NLA_BINARY validation (min == max == desired length). Finally, NLA_POLICY_EXACT_LEN_WARN() also gets to be a somewhat special case of this. I haven't included the patch here now that converts nl82011 to use this because it doesn't apply without another cleanup patch, but we can remove a number of hand-coded min/max length checks and get better error messages from the general validation code while doing that. As I had originally built the netlink policy export to userspace in a way that has min/max length for NLA_BINARY (for the types that we used to call NLA_MIN_LEN, NLA_BINARY and NLA_EXACT_LEN) anyway, it doesn't really change anything there except that now there's a chance that userspace sees min length < max length, which previously wasn't possible. v2: * fix the min<max comment to correctly say min<=max ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netlink: consistently use NLA_POLICY_MIN_LEN()Johannes Berg2020-08-181-2/+2
| | | | | | | | | | | | | | | | | | Change places that open-code NLA_POLICY_MIN_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netlink: consistently use NLA_POLICY_EXACT_LEN()Johannes Berg2020-08-181-5/+5
|/ | | | | | | | | | Change places that open-code NLA_POLICY_EXACT_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus' of ↵jd/xdp-l3Linus Torvalds2020-08-100-0/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - an update to Elan touchpad controller driver supporting newer ICs with enhanced precision reports and a new firmware update process - an update to EXC3000 touch controller supporting additional parts - assorted driver fixups * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (27 commits) Input: exc3000 - add support to query model and fw_version Input: exc3000 - add reset gpio support Input: exc3000 - add EXC80H60 and EXC80H84 support dt-bindings: touchscreen: Convert EETI EXC3000 touchscreen to json-schema Input: sentelic - fix error return when fsp_reg_write fails Input: alps - remove redundant assignment to variable ret Input: ims-pcu - return error code rather than -ENOMEM Input: elan_i2c - add ic type 0x15 Input: atmel_mxt_ts - only read messages in mxt_acquire_irq() when necessary Input: uinput - fix typo in function name documentation Input: ati_remote2 - add missing newlines when printing module parameters Input: psmouse - add a newline when printing 'proto' by sysfs Input: synaptics-rmi4 - drop a duplicated word Input: elan_i2c - add support for high resolution reports Input: elan_i2c - do not constantly re-query pattern ID Input: elan_i2c - add firmware update info for ICs 0x11, 0x13, 0x14 Input: elan_i2c - handle firmware updated on newer ICs Input: elan_i2c - add support for different firmware page sizes Input: elan_i2c - fix detecting IAP version on older controllers Input: elan_i2c - handle devices with patterns above 1 ...
| * Merge branch 'next' into for-linusDmitry Torokhov2020-08-0710-93/+94
| |\ | | | | | | | | | Prepare input updates for 5.9 merge window.
* | | mm, treewide: rename kzfree() to kfree_sensitive()Waiman Long2020-08-072-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/kzfree/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and adding a kzfree backward compatibility macro in slab.h. [akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h] [akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more] Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: James Morris <jmorris@namei.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2020-07-103-18/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking BPF programs, from Maciej Żenczykowski. 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu Mariappan. 3) Slay memory leak in nl80211 bss color attribute parsing code, from Luca Coelho. 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin. 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric Dumazet. 6) xsk code dips too deeply into DMA mapping implementation internals. Add dma_need_sync and use it. From Christoph Hellwig 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko. 8) Check for disallowed attributes when loading flow dissector BPF programs. From Lorenz Bauer. 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from Jason A. Donenfeld. 10) Don't advertise checksum offload on ipa devices that don't support it. From Alex Elder. 11) Resolve several issues in TCP MD5 signature support. Missing memory barriers, bogus options emitted when using syncookies, and failure to allow md5 key changes in established states. All from Eric Dumazet. 12) Fix interface leak in hsr code, from Taehee Yoo. 13) VF reset fixes in hns3 driver, from Huazhong Tan. 14) Make loopback work again with ipv6 anycast, from David Ahern. 15) Fix TX starvation under high load in fec driver, from Tobias Waldekranz. 16) MLD2 payload lengths not checked properly in bridge multicast code, from Linus Lüssing. 17) Packet scheduler code that wants to find the inner protocol currently only works for one level of VLAN encapsulation. Allow Q-in-Q situations to work properly here, from Toke Høiland-Jørgensen. 18) Fix route leak in l2tp, from Xin Long. 19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport support and various protocols. From Martin KaFai Lau. 20) Fix socket cgroup v2 reference counting in some situations, from Cong Wang. 21) Cure memory leak in mlx5 connection tracking offload support, from Eli Britstein. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) mlxsw: pci: Fix use-after-free in case of failed devlink reload mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() net: macb: fix call to pm_runtime in the suspend/resume functions net: macb: fix macb_suspend() by removing call to netif_carrier_off() net: macb: fix macb_get/set_wol() when moving to phylink net: macb: mark device wake capable when "magic-packet" property present net: macb: fix wakeup test in runtime suspend/resume routines bnxt_en: fix NULL dereference in case SR-IOV configuration fails libbpf: Fix libbpf hashmap on (I)LP32 architectures net/mlx5e: CT: Fix memory leak in cleanup net/mlx5e: Fix port buffers cell size value net/mlx5e: Fix 50G per lane indication net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash net/mlx5e: Fix VXLAN configuration restore after function reload net/mlx5e: Fix usage of rcu-protected pointer net/mxl5e: Verify that rpriv is not NULL net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode net/mlx5: Fix eeprom support for SFP module cgroup: Fix sock_cgroup_data on big-endian. selftests: bpf: Fix detach from sockmap tests ...
| * \ \ Merge branch 'support-AF_PACKET-for-layer-3-devices'David S. Miller2020-06-303-18/+4
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== support AF_PACKET for layer 3 devices Hans reported that packets injected by a correct-looking and trivial libpcap-based program were not being accepted by wireguard. In investigating that, I noticed that a few devices weren't properly handling AF_PACKET-injected packets, and so this series introduces a bit of shared infrastructure to support that. The basic problem begins with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)) sockets. When sendto is called, AF_PACKET examines the headers of the packet with this logic: static void packet_parse_headers(struct sk_buff *skb, struct socket *sock) { if ((!skb->protocol || skb->protocol == htons(ETH_P_ALL)) && sock->type == SOCK_RAW) { skb_reset_mac_header(skb); skb->protocol = dev_parse_header_protocol(skb); } skb_probe_transport_header(skb); } The middle condition there triggers, and we jump to dev_parse_header_protocol. Note that this is the only caller of dev_parse_header_protocol in the kernel, and I assume it was designed for this purpose: static inline __be16 dev_parse_header_protocol(const struct sk_buff *skb) { const struct net_device *dev = skb->dev; if (!dev->header_ops || !dev->header_ops->parse_protocol) return 0; return dev->header_ops->parse_protocol(skb); } Since AF_PACKET already knows which netdev the packet is going to, the dev_parse_header_protocol function can see if that netdev has a way it prefers to figure out the protocol from the header. This, again, is the only use of parse_protocol in the kernel. At the moment, it's only used with ethernet devices, via eth_header_parse_protocol. This makes sense, as mostly people are used to AF_PACKET-injecting ethernet frames rather than layer 3 frames. But with nothing in place for layer 3 netdevs, this function winds up returning 0, and skb->protocol then is set to 0, and then by the time it hits the netdev's ndo_start_xmit, the driver doesn't know what to do with it. This is a problem because drivers very much rely on skb->protocol being correct, and routinely reject packets where it's incorrect. That's why having this parsing happen for injected packets is quite important. In wireguard, ipip, and ipip6, for example, packets from AF_PACKET are just dropped entirely. For tun devices, it's sort of uglier, with the tun "packet information" header being passed to userspace containing a bogus protocol value. Some userspace programs are ill-equipped to deal with that. (But of course, that doesn't happen with tap devices, which benefit from the similar shared infrastructure for layer 2 netdevs, further motiviating this patchset for layer 3 netdevs.) This patchset addresses the issue by first adding a layer 3 header parse function, much akin to the existing one for layer 2 packets, and then adds a shared header_ops structure that, also much akin to the existing one for layer 2 packets. Then it wires it up to a few immediate places that stuck out as requiring it, and does a bit of cleanup. This patchset seems like it's fixing real bugs, so it might be appropriate for stable. But they're also very old bugs, so if you'd rather not backport to stable, that'd make sense to me too. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | wireguard: queueing: make use of ip_tunnel_parse_protocolJason A. Donenfeld2020-06-302-18/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that wg_examine_packet_protocol has been added for general consumption as ip_tunnel_parse_protocol, it's possible to remove wg_examine_packet_protocol and simply use the new ip_tunnel_parse_protocol function directly. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | wireguard: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2020-06-301-0/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | WireGuard uses skb->protocol to determine packet type, and bails out if it's not set or set to something it's not expecting. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and wireguard then rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge tag 'irq-urgent-2020-07-05' of ↵Linus Torvalds2020-07-050-0/+0
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of interrupt chip driver fixes: - Ensure the atomicity of affinity updates in the GIC driver - Don't try to sleep in atomic context when waiting for the GICv4.1 to respond. Use polling instead. - Typo fixes in Kconfig and warnings" * tag 'irq-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic: Atomically update affinity irqchip/riscv-intc: Fix a typo in a pr_warn() irqchip/gic-v4.1: Use readx_poll_timeout_atomic() to fix sleep in atomic irqchip/loongson-pci-msi: Fix a typo in Kconfig
| * | | Merge tag 'irqchip-fixes-5.8-1' of ↵Thomas Gleixner2020-06-3010-93/+94
| |\ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip fixes from Marc Zyngier: - Fix atomicity of affinity update in the GIC driver - Don't sleep in atomic when waiting for a GICv4.1 RD to respond - Fix a couple of typos in user-visible messages
* | | | Merge branch 'napi_gro_receive-caller-return-value-cleanups'David S. Miller2020-06-251-8/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== napi_gro_receive caller return value cleanups In 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in napi_gro_receive()"), the GRO_NORMAL case stopped calling netif_receive_skb_internal, checking its return value, and returning GRO_DROP in case it failed. Instead, it calls into netif_receive_skb_list_internal (after a bit of indirection), which doesn't return any error. Therefore, napi_gro_receive will never return GRO_DROP, making handling GRO_DROP dead code. I emailed the author of 6570bc79c0df on netdev [1] to see if this change was intentional, but the dlink.ru email address has been disconnected, and looking a bit further myself, it seems somewhat infeasible to start propagating return values backwards from the internal machinations of netif_receive_skb_list_internal. Taking a look at all the callers of napi_gro_receive, it appears that three are checking the return value for the purpose of comparing it to the now never-happening GRO_DROP, and one just casts it to (void), a likely historical leftover. Every other of the 120 callers does not bother checking the return value. And it seems like these remaining 116 callers are doing the right thing: after calling napi_gro_receive, the packet is now in the hands of the upper layers of the newtworking, and the device driver itself has no business now making decisions based on what the upper layers choose to do. Incrementing stats counters on GRO_DROP seems like a mistake, made by these three drivers, but not by the remaining 117. It would seem, therefore, that after rectifying these four callers of napi_gro_receive, that I should go ahead and just remove returning the value from napi_gro_receive all together. However, napi_gro_receive has a function event tracer, and being able to introspect into the networking stack to see how often napi_gro_receive is returning whatever interesting GRO status (aside from _DROP) remains an interesting data point worth keeping for debugging. So, this series simply gets rid of the return value checking for the four useless places where that check never evaluates to anything meaningful. [1] https://lore.kernel.org/netdev/20200624210606.GA1362687@zx2c4.com/ ==================== Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | wireguard: receive: account for napi_gro_receive never returning GRO_DROPJason A. Donenfeld2020-06-251-8/+2
|/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The napi_gro_receive function no longer returns GRO_DROP ever, making handling GRO_DROP dead code. This commit removes that dead code. Further, it's not even clear that device drivers have any business in taking action after passing off received packets; that's arguably out of their hands. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Fixes: 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in napi_gro_receive()") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'wg-fixes'David S. Miller2020-06-235-47/+57
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard fixes for 5.8-rc3 This series contains two fixes, one cosmetic and one quite important: 1) Avoid the `if ((x = f()) == y)` pattern, from Frank Werner-Krippendorf. 2) Mitigate a potential memory leak by creating circular netns references, while also making the netns semantics a bit more robust. Patch (2) has a "Fixes:" line and should be backported to stable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | wireguard: device: avoid circular netns referencesJason A. Donenfeld2020-06-234-45/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before, we took a reference to the creating netns if the new netns was different. This caused issues with circular references, with two wireguard interfaces swapping namespaces. The solution is to rather not take any extra references at all, but instead simply invalidate the creating netns pointer when that netns is deleted. In order to prevent this from happening again, this commit improves the rough object leak tracking by allowing it to account for created and destroyed interfaces, aside from just peers and keys. That then makes it possible to check for the object leak when having two interfaces take a reference to each others' namespaces. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | wireguard: noise: do not assign initiation time in if conditionFrank Werner-Krippendorf2020-06-231-2/+2
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes an error condition reported by checkpatch.pl which caused by assigning a variable in an if condition in wg_noise_handshake_consume_ initiation(). Signed-off-by: Frank Werner-Krippendorf <mail@hb9fxq.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge tag 'efi-core-2020-06-01' of ↵Linus Torvalds2020-06-010-0/+0
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The EFI changes for this cycle are: - preliminary changes for RISC-V - Add support for setting the resolution on the EFI framebuffer - Simplify kernel image loading for arm64 - Move .bss into .data via the linker script instead of relying on symbol annotations. - Get rid of __pure getters to access global variables - Clean up the config table matching arrays - Rename pr_efi/pr_efi_err to efi_info/efi_err, and use them consistently - Simplify and unify initrd loading - Parse the builtin command line on x86 (if provided) - Implement printk() support, including support for wide character strings - Simplify GDT handling in early mixed mode thunking code - Some other minor fixes and cleanups" * tag 'efi-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits) efi/x86: Don't blow away existing initrd efi/x86: Drop the special GDT for the EFI thunk efi/libstub: Add missing prototype for PE/COFF entry point efi/efivars: Add missing kobject_put() in sysfs entry creation error path efi/libstub: Use pool allocation for the command line efi/libstub: Don't parse overlong command lines efi/libstub: Use snprintf with %ls to convert the command line efi/libstub: Get the exact UTF-8 length efi/libstub: Use %ls for filename efi/libstub: Add UTF-8 decoding to efi_puts efi/printf: Add support for wchar_t (UTF-16) efi/gop: Add an option to list out the available GOP modes efi/libstub: Add definitions for console input and events efi/libstub: Implement printk-style logging efi/printf: Turn vsprintf into vsnprintf efi/printf: Abort on invalid format efi/printf: Refactor code to consolidate padding and output efi/printf: Handle null string input efi/printf: Factor out integer argument retrieval efi/printf: Factor out width/precision parsing ...
| * \ \ Merge tag 'v5.7-rc7' into efi/core, to refresh the branch and pick up fixesIngo Molnar2020-05-2510-93/+94
| |\ \ \ | | | |/ | | |/| | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | Merge tag 'efi-next' of ↵Ingo Molnar2020-04-257-50/+51
| |\ \ \ | | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core Pull EFI changes for v5.8 from Ard Biesheuvel: "- preliminary changes for RISC-V - add support for setting the resolution on the EFI framebuffer - simplify kernel image loading for arm64 - Move .bss into .data via the linker script instead of relying on symbol annotations. - Get rid of __pure getters to access global variables - Clean up the config table matching arrays" Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge tag 'core-rcu-2020-06-01' of ↵Linus Torvalds2020-06-010-0/+0
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The RCU updates for this cycle were: - RCU-tasks update, including addition of RCU Tasks Trace for BPF use and TASKS_RUDE_RCU - kfree_rcu() updates. - Remove scheduler locking restriction - RCU CPU stall warning updates. - Torture-test updates. - Miscellaneous fixes and other updates" * tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits) rcu: Allow for smp_call_function() running callbacks from idle rcu: Provide rcu_irq_exit_check_preempt() rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter() rcu: Provide __rcu_is_watching() rcu: Provide rcu_irq_exit_preempt() rcu: Make RCU IRQ enter/exit functions rely on in_nmi() rcu/tree: Mark the idle relevant functions noinstr x86: Replace ist_enter() with nmi_enter() x86/mce: Send #MC singal from task work x86/entry: Get rid of ist_begin/end_non_atomic() sched,rcu,tracing: Avoid tracing before in_nmi() is correct sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception lockdep: Always inline lockdep_{off,on}() hardirq/nmi: Allow nested nmi_enter() arm64: Prepare arch_nmi_enter() for recursion printk: Disallow instrumenting print_nmi_enter() printk: Prepare for nested printk_nmi_enter() rcutorture: Convert ULONG_CMP_LT() to time_before() torture: Add a --kasan argument torture: Save a few lines by using config_override_param initially ...
| * | | Merge tag 'noinstr-lds-2020-05-19' into core/rcuThomas Gleixner2020-05-195-36/+25
| |\ \ \ | | | | | | | | | | | | | | | Get the noinstr section and annotation markers to base the RCU parts on.
| * \ \ \ Merge branch 'for-mingo' of ↵Thomas Gleixner2020-05-117-50/+51
| |\ \ \ \ | | |_|/ / | |/| | / | | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul McKenney: 1. Miscellaneous fixes. 2. kfree_rcu() updates. 3. Remove scheduler locking restriction 4. RCU-tasks update, including addition of RCU Tasks Trace for BPF use and RCU Tasks Rude. (This branch is on top of #3 due to overlap of changed code.) 5. RCU CPU stall warning updates. 6. Torture-test updates.
* | | | Merge branch 'wireguard-fixes'David S. Miller2020-05-207-58/+70
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason A. Donenfeld says: ==================== wireguard fixes for 5.7-rc7 Hopefully these are the last fixes for 5.7: 1) A trivial bump in the selftest harness to support gcc-10. build.wireguard.com is still on gcc-9 but I'll probably switch to gcc-10 in the coming weeks. 2) A concurrency fix regarding userspace modifying the pre-shared key at the same time as packets are being processed, reported by Matt Dunwoodie. 3) We were previously clearing skb->hash on egress, which broke fq_codel, cake, and other things that actually make use of the flow hash for queueing, reported by Dave Taht and Toke Høiland-Jørgensen. 4) A fix for the increased memory usage caused by (3). This can be thought of as part of patch (3), but because of the separate reasoning and breadth of it I thought made it a bit cleaner to put in a standalone commit. Fixes (2), (3), and (4) are -stable material. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | wireguard: noise: separate receive counter from send counterJason A. Donenfeld2020-05-205-53/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In "wireguard: queueing: preserve flow hash across packet scrubbing", we were required to slightly increase the size of the receive replay counter to something still fairly small, but an increase nonetheless. It turns out that we can recoup some of the additional memory overhead by splitting up the prior union type into two distinct types. Before, we used the same "noise_counter" union for both sending and receiving, with sending just using a simple atomic64_t, while receiving used the full replay counter checker. This meant that most of the memory being allocated for the sending counter was being wasted. Since the old "noise_counter" type increased in size in the prior commit, now is a good time to split up that union type into a distinct "noise_replay_ counter" for receiving and a boring atomic64_t for sending, each using neither more nor less memory than required. Also, since sometimes the replay counter is accessed without necessitating additional accesses to the bitmap, we can reduce cache misses by hoisting the always-necessary lock above the bitmap in the struct layout. We also change a "noise_replay_counter" stack allocation to kmalloc in a -DDEBUG selftest so that KASAN doesn't trigger a stack frame warning. All and all, removing a bit of abstraction in this commit makes the code simpler and smaller, in addition to the motivating memory usage recuperation. For example, passing around raw "noise_symmetric_key" structs is something that really only makes sense within noise.c, in the one place where the sending and receiving keys can safely be thought of as the same type of object; subsequent to that, it's important that we uniformly access these through keypair->{sending,receiving}, where their distinct roles are always made explicit. So this patch allows us to draw that distinction clearly as well. Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | wireguard: queueing: preserve flow hash across packet scrubbingJason A. Donenfeld2020-05-204-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's important that we clear most header fields during encapsulation and decapsulation, because the packet is substantially changed, and we don't want any info leak or logic bug due to an accidental correlation. But, for encapsulation, it's wrong to clear skb->hash, since it's used by fq_codel and flow dissection in general. Without it, classification does not proceed as usual. This change might make it easier to estimate the number of innerflows by examining clustering of out of order packets, but this shouldn't open up anything that can't already be inferred otherwise (e.g. syn packet size inference), and fq_codel can be disabled anyway. Furthermore, it might be the case that the hash isn't used or queried at all until after wireguard transmits the encrypted UDP packet, which means skb->hash might still be zero at this point, and thus no hash taken over the inner packet data. In order to address this situation, we force a calculation of skb->hash before encrypting packet data. Of course this means that fq_codel might transmit packets slightly more out of order than usual. Toke did some testing on beefy machines with high quantities of parallel flows and found that increasing the reply-attack counter to 8192 takes care of the most pathological cases pretty well. Reported-by: Dave Taht <dave.taht@gmail.com> Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@toke.dk> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | wireguard: noise: read preshared key while taking lockJason A. Donenfeld2020-05-201-1/+5
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior we read the preshared key after dropping the handshake lock, which isn't an actual crypto issue if it races, but it's still not quite correct. So copy that part of the state into a temporary like we do with the rest of the handshake state variables. Then we can release the lock, operate on the temporary, and zero it out at the end of the function. In performance tests, the impact of this was entirely unnoticable, probably because those bytes are coming from the same cacheline as other things that are being copied out in the same manner. Reported-by: Matt Dunwoodie <ncon@noconroy.net> Fixes: a8f1bc7bdea3 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>