Planet Debian

Subscribe to Planet Debian feed
Planet Debian - https://planet.debian.org/
Updated: 2 hours 33 min ago

François Marier: Deleting non-decryptable restic snapshots

13 April, 2021 - 10:19

Due to what I suspect is disk corruption error due to a faulty RAM module or network interface on my GnuBee, my restic backup failed with the following error:

$ restic check
using temporary cache in /var/tmp/restic-tmp/restic-check-cache-854484247
repository b0b0516c opened successfully, password is correct
created new cache in /var/tmp/restic-tmp/restic-check-cache-854484247
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
error for tree 4645312b:
  decrypting blob 4645312b443338d57295550f2f4c135c34bda7b17865c4153c9b99d634ae641c failed: ciphertext verification failed
error for tree 2c3248ce:
  decrypting blob 2c3248ce5dc7a4bc77f03f7475936041b6b03e0202439154a249cd28ef4018b6 failed: ciphertext verification failed
Fatal: repository contains errors

I started by locating the snapshots which make use of these corrupt trees:

$ restic find --tree 4645312b
repository b0b0516c opened successfully, password is correct
Found tree 4645312b443338d57295550f2f4c135c34bda7b17865c4153c9b99d634ae641c
 ... path /usr/include/boost/spirit/home/support/auxiliary
 ... in snapshot 41e138c8 (2021-01-31 08:35:16)
Found tree 4645312b443338d57295550f2f4c135c34bda7b17865c4153c9b99d634ae641c
 ... path /usr/include/boost/spirit/home/support/auxiliary
 ... in snapshot e75876ed (2021-02-28 08:35:29)

$ restic find --tree 2c3248ce
repository b0b0516c opened successfully, password is correct
Found tree 2c3248ce5dc7a4bc77f03f7475936041b6b03e0202439154a249cd28ef4018b6
 ... path /usr/include/boost/spirit/home/support/char_encoding
 ... in snapshot 41e138c8 (2021-01-31 08:35:16)
Found tree 2c3248ce5dc7a4bc77f03f7475936041b6b03e0202439154a249cd28ef4018b6
 ... path /usr/include/boost/spirit/home/support/char_encoding
 ... in snapshot e75876ed (2021-02-28 08:35:29)

and then deleted them:

$ restic forget 41e138c8 e75876ed
repository b0b0516c opened successfully, password is correct
[0:00] 100.00%  2 / 2 files deleted

$ restic prune 
repository b0b0516c opened successfully, password is correct
counting files in repo
building new index for repo
[13:23] 100.00%  58964 / 58964 packs
repository contains 58964 packs (1417910 blobs) with 278.913 GiB
processed 1417910 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 20 snapshots
[1:15] 100.00%  20 / 20 snapshots
found 1364852 of 1417910 data blobs still in use, removing 53058 blobs
will remove 0 invalid files
will delete 942 packs and rewrite 1358 packs, this frees 6.741 GiB
[10:50] 31.96%  434 / 1358 packs rewritten
hash does not match id: want 9ec955794534be06356655cfee6abe73cb181f88bb86b0cd769cf8699f9f9e57, got 95d90aa48ffb18e6d149731a8542acd6eb0e4c26449a4d4c8266009697fd1904
github.com/restic/restic/internal/repository.Repack
    github.com/restic/restic/internal/repository/repack.go:37
main.pruneRepository
    github.com/restic/restic/cmd/restic/cmd_prune.go:242
main.runPrune
    github.com/restic/restic/cmd/restic/cmd_prune.go:62
main.glob..func19
    github.com/restic/restic/cmd/restic/cmd_prune.go:27
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra/command.go:960
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra/command.go:897
main.main
    github.com/restic/restic/cmd/restic/main.go:98
runtime.main
    runtime/proc.go:204
runtime.goexit
    runtime/asm_amd64.s:1374

As you can see above, the prune command failed due to a corrupt pack and so I followed the process I previously wrote about and identified the affected snapshots using:

$ restic find --pack 9ec955794534be06356655cfee6abe73cb181f88bb86b0cd769cf8699f9f9e57

before deleting them with:

$ restic forget 031ab8f1 1672a9e1 1f23fb5b 2c58ea3a 331c7231 5e0e1936 735c6744 94f74bdb b11df023 dfa17ba8 e3f78133 eefbd0b0 fe88aeb5 
repository b0b0516c opened successfully, password is correct
[0:00] 100.00%  13 / 13 files deleted

$ restic prune
repository b0b0516c opened successfully, password is correct
counting files in repo
building new index for repo
[13:37] 100.00%  60020 / 60020 packs
repository contains 60020 packs (1548315 blobs) with 283.466 GiB
processed 1548315 blobs: 129812 duplicate blobs, 4.331 GiB duplicate
load all snapshots
find data that is still in use for 8 snapshots
[0:53] 100.00%  8 / 8 snapshots
found 1219895 of 1548315 data blobs still in use, removing 328420 blobs
will remove 0 invalid files
will delete 6232 packs and rewrite 1275 packs, this frees 36.302 GiB
[23:37] 100.00%  1275 / 1275 packs rewritten
counting files in repo
[11:45] 100.00%  52822 / 52822 packs
finding old index files
saved new indexes as [a31b0fc3 9f5aa9b5 db19be6f 4fd9f1d8 941e710b 528489d9 fb46b04a 6662cd78 4b3f5aad 0f6f3e07 26ae96b2 2de7b89f 78222bea 47e1a063 5abf5c2d d4b1d1c3 f8616415 3b0ebbaa]
remove 23 old index files
[0:00] 100.00%  23 / 23 files deleted
remove 7507 old packs
[0:08] 100.00%  7507 / 7507 files deleted
done

And with 13 of my 21 snapshots deleted, the checks now pass:

$ restic check
using temporary cache in /var/tmp/restic-tmp/restic-check-cache-407999210
repository b0b0516c opened successfully, password is correct
created new cache in /var/tmp/restic-tmp/restic-check-cache-407999210
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
no errors were found

This represents a significant amount of lost backup history, but at least it's not all of it.

Shirish Agarwal: what to write

13 April, 2021 - 10:14

First up, I am alive and well. I have been receiving calls from sometime but now that I have become deaf, it is a pain and the hearing aids aren’t all that useful. But moreover, where we have been finding ourselves each and every day sinking lower and lower feels absurd as to what to write and not write about India. Thankfully, I ran across this piece which does tell in far more detail than I ever could. The only interesting and somewhat positive news I had is from south of India otherwise especially for the poor. The saddest story is that this time Covid has reached alarming proportions in India and surprise, surprise this time the villain for many is my state of Maharashtra even though it hasn’t received its share of GST proceeds for last two years and this was Kerala’s perspective, different state, different party, different political ideology altogether.

Kerala Finance Minister Thomas Issac views on GST, October 22, 2020 Indian Express.

I briefly also share the death of somewhat liberal Film censorship in India unlike Italy which abolished film censorship altogether. I don’t really want spend too much on how we have become No. 2 in Covid cases in the world and perhaps death also. Many people still believe in herd immunity but don’t really know what it means. So without taking too much time and effort, bid adieu. May post when I’m hopefully emotionally feeling better, stronger

Russell Coker: Yama

13 April, 2021 - 06:00

I’ve just setup the Yama LSM module on some of my Linux systems. Yama controls ptrace which is the debugging and tracing API for Unix systems. The aim is to prevent a compromised process from using ptrace to compromise other processes and cause more damage. In most cases a process which can ptrace another process which usually means having capability SYS_PTRACE (IE being root) or having the same UID as the target process can interfere with that process in other ways such as modifying it’s configuration and data files. But even so I think it has the potential for making things more difficult for attackers without making the system more difficult to use.

If you put “kernel.yama.ptrace_scope = 1” in sysctl.conf (or write “1” to /proc/sys/kernel/yama/ptrace_scope) then a user process can only trace it’s child processes. This means that “strace -p” and “gdb -p” will fail when run as non-root but apart from that everything else will work. Generally “strace -p” (tracing the system calls of another process) is of most use to the sysadmin who can do it as root. The command “gdb -p” and variants of it are commonly used by developers so yama wouldn’t be a good thing on a system that is primarily used for software development.

Another option is “kernel.yama.ptrace_scope = 3” which means no-one can trace and it can’t be disabled without a reboot. This could be a good option for production servers that have no need for software development. It wouldn’t work well for a small server where the sysadmin needs to debug everything, but when dozens or hundreds of servers have their configuration rolled out via a provisioning tool this would be a good setting to include.

See Documentation/admin-guide/LSM/Yama.rst in the kernel source for the details.

When running with capability SYS_PTRACE (IE root shell) you can ptrace anything else and if necessary disable Yama by writing “0” to /proc/sys/kernel/yama/ptrace_scope .

I am enabling mode 1 on all my systems because I think it will make things harder for attackers while not making things more difficult for me.

Also note that SE Linux restricts SYS_PTRACE and also restricts cross-domain ptrace access, so the combination with Yama makes things extra difficult for an attacker.

Yama is enabled in the Debian kernels by default so it’s very easy to setup for Debian users, just edit /etc/sysctl.d/whatever.conf and it will be enabled on boot.

Related posts:

  1. Defense in Depth and Sudo My blog post about logging in as root and whether...
  2. SE Linux and Decrypted Data There is currently a discussion on the Debian-security mailing list...
  3. Secure Boot and Protecting Against Root There has been a lot of discussion recently about the...

Steinar H. Gunderson: Squirrel!

13 April, 2021 - 05:30

“All comments on this article will now be moderated. The bar to pass moderation will be high, it's really time to think about something else. Did you all see that we have an exciting article on spinlocks?” Poor LWN <3

SPINLOCK!

Clint Adams: Unrivaled technical acumen

13 April, 2021 - 02:53

Who are the voting members of the Free Software Foundation?

Posted on 2021-04-12 Tags: barks

Russell Coker: Riverdale

12 April, 2021 - 19:35

I’ve been watching the show Riverdale on Netflix recently. It’s an interesting modern take on the Archie comics. Having watched Josie and the Pussycats in Outer Space when I was younger I was anticipating something aimed towards a similar audience. As solving mysteries and crimes was apparently a major theme of the show I anticipated something along similar lines to Scooby Doo, some suspense and some spooky things, but then a happy ending where criminals get arrested and no-one gets hurt or killed while the vast majority of people are nice. Instead the first episode has a teen being murdered and Ms Grundy being obsessed with 15yo boys and sleeping with Archie (who’s supposed to be 15 but played by a 20yo actor).

Everyone in the show has some dark secret. The filming has a dark theme, the sky is usually overcast and it’s generally gloomy. This is a significant contrast to Veronica Mars which has some similarities in having a young cast, a sassy female sleuth, and some similar plot elements. Veronica Mars has a bright theme and a significant comedy element in spite of dealing with some dark issues (murder, rape, child sex abuse, and more). But Riverdale is just dark. Anyone who watches this with their kids expecting something like Scooby Doo is in for a big surprise.

There are lots of interesting stylistic elements in the show. Lots of clothing and uniform designs that seem to date from the 1940’s. It seems like some alternate universe where kids have smartphones and laptops while dressing in the style of the 1940s. One thing that annoyed me was construction workers using tools like sledge-hammers instead of excavators. A society that has smart phones but no earth-moving equipment isn’t plausible.

On the upside there is a racial mix in the show that more accurately reflects American society than the original Archie comics and homophobia is much less common than in most parts of our society. For both race issues and gay/lesbian issues the show treats them in an accurate way (portraying some bigotry) while the main characters aren’t racist or homophobic.

I think it’s generally an OK show and recommend it to people who want a dark show. It’s a good show to watch while doing something on a laptop so you can check Wikipedia for the references to 1940s stuff (like when Bikinis were invented). I’m half way through season 3 which isn’t as good as the first 2, I don’t know if it will get better later in the season or whether I should have stopped after season 2.

I don’t usually review fiction, but the interesting aesthetics of the show made it deserve a review.

Related posts:

  1. Android Screen Saving Just over a year ago I bought a Samsung Galaxy...

Russell Coker: Storage Trends 2021

12 April, 2021 - 17:01
The Viability of Small Disks

Less than a year ago I wrote a blog post about storage trends [1]. My main point in that post was that disks smaller than 2TB weren’t viable then and 2TB disks wouldn’t be economically viable in the near future.

Now MSY has 2TB disks for $72 and 2TB SSD for $245, saving $173 if you get a hard drive (compared to saving $240 10 months ago). Given the difference in performance and noise 2TB hard drives won’t be worth using for most applications nowadays.

NVMe vs SSD

Last year NVMe prices were very comparable for SSD prices, I was hoping that trend would continue and SSDs would go away. Now for sizes 1TB and smaller NVMe and SSD prices are very similar, but for 2TB the NVMe prices are twice that of SSD – presumably partly due to poor demand for 2TB NVMe. There are also no NVMe devices larger than 2TB on sale at MSY (a store which caters to home stuff not special server equipment) but SSDs go up to 8TB.

It seems that NVMe is only really suitable for workstation storage and for cache etc on a server. So SATA SSDs will be around for a while.

Small Servers

There are a range of low end servers which support a limited number of disks. Dell has 2 disk servers and 4 disk servers. If one of those had 8TB SSDs you could have 8TB of RAID-1 or 24TB of RAID-Z storage in a low end server. That covers the vast majority of servers (small business or workgroup servers tend to have less than 8TB of storage).

Larger Servers

Anandtech has an article on Seagates roadmap to 120TB disks [2]. They currently sell 20TB disks using HAMR technology

Currently the biggest disks that MSY sells are 10TB for $395, which was also the biggest disk they were selling last year. Last year MSY only sold SSDs up to 2TB in size (larger ones were available from other companies at much higher prices), now they sell 8TB SSDs for $949 (4* capacity increase in less than a year). Seagate is planning 30TB disks for 2023, if SSDs continue to increase in capacity by 4* per year we could have 128TB SSDs in 2023. If you needed a server with 100TB of storage then having 2 or 3 SSDs in a RAID array would be much easier to manage and faster than 4*30TB disks in an array.

When you have a server with many disks you can expect to have more disk failures due to vibration. One time I built a server with 18 disks and took disks from 2 smaller servers that had 4 and 5 disks. The 9 disks which had been working reliably for years started having problems within weeks of running in the bigger server. This is one of the many reasons for paying extra for SSD storage.

Seagate is apparently planning 50TB disks for 2026 and 100TB disks for 2030. If that’s the best they can do then SSD vendors should be able to sell larger products sooner at prices that are competitive. Matching hard drive prices is not required, getting to less than 4* the price should be enough for most customers.

The Anandtech article is worth reading, it mentions some interesting features that Seagate are developing such as having 2 actuators (which they call Mach.2) so the drive can access 2 different tracks at the same time. That can double the performance of a disk, but that doesn’t change things much when SSDs are more than 100* faster. Presumably the Mach.2 disks will be SAS and incredibly expensive while providing significantly less performance than affordable SATA SSDs.

Computer Cases

In my last post I speculated on the appearance of smaller cases designed to not have DVD drives or 3.5″ hard drives. Such cases still haven’t appeared apart from special purpose machines like the NUC that were available last year.

It would be nice if we could get a new industry standard for smaller power supplies. Currently power supplies are expected to be almost 5 inches wide (due to the expectation of a 5.25″ DVD drive mounted horizontally). We need some industry standards for smaller PCs that aren’t like the NUC, the NUC is very nice, but most people who build their own PC need more space than that. I still think that planning on USB DVD drives is the right way to go. I’ve got 4PCs in my home that are regularly used and CDs and DVDs are used so rarely that sharing a single DVD drive among all 4 wouldn’t be a problem.

Conclusion

I’m tempted to get a couple of 4TB SSDs for my home server which cost $487 each, it currently has 2*500G SSDs and 3*4TB disks. I would have to remove some unused files but that’s probably not too hard to do as I have lots of old backups etc on there. Another possibility is to use 2*4TB SSDs for most stuff and 2*4TB disks for backups.

I’m recommending that all my clients only use SSDs for their storage. I only have one client with enough storage that disks are the only option (100TB of storage) but they moved all the functions of that server to AWS and use S3 for the storage. Now I don’t have any clients doing anything with storage that can’t be done in a better way on SSD for a price difference that’s easy for them to afford.

Affordable SSD also makes RAID-1 in workstations more viable. 2 disks in a PC is noisy if you have an office full of them and produces enough waste heat to be a reliability issue (most people don’t cool their offices adequately on weekends). 2 SSDs in a PC is no problem at all. As 500G SSDs are available for $73 it’s not a significant cost to install 2 of them in every PC in the office (more cost for my time than hardware). I generally won’t recommend that hard drives be replaced with SSDs in systems that are working well. But if a machine runs out of space then replacing it with SSDs in a RAID-1 is a good choice.

Moore’s law might cover SSDs, but it definitely doesn’t cover hard drives. Hard drives have fallen way behind developments of most other parts of computers over the last 30 years, hopefully they will go away soon.

Related posts:

  1. Storage Trends In considering storage trends for the consumer side I’m looking...
  2. New Storage Developments Eweek has an article on a new 1TB Seagate drive....
  3. Finding Storage Performance Problems Here are some basic things to do when debugging storage...

Jelmer Vernooij: The upstream ontologist

12 April, 2021 - 05:40

The Debian Janitor is an automated system that commits fixes for (minor) issues in Debian packages that can be fixed by software. It gradually started proposing merges in early December. The first set of changes sent out ran lintian-brush on sid packages maintained in Git. This post is part of a series about the progress of the Janitor.

The upstream ontologist is a project that extracts metadata about upstream projects in a consistent format. It does this with a combination of heuristics and reading ecosystem-specific metadata files, such as Python’s setup.py, rust’s Cargo.toml as well as e.g. scanning README files.

Supported Data Sources

It will extract information from a wide variety of sources, including:

Supported Fields

Fields that it currently provides include:

  • Homepage: homepage URL
  • Name: name of the upstream project
  • Contact: contact address of some sort of the upstream (e-mail, mailing list URL)
  • Repository: VCS URL
  • Repository-Browse: Web URL for viewing the VCS
  • Bug-Database: Bug database URL (for web viewing, generally)
  • Bug-Submit: URL to use to submit new bugs (either on the web or an e-mail address)
  • Screenshots: List of URLs with screenshots
  • Archive: Archive used - e.g. SourceForge
  • Security-Contact: e-mail or URL with instructions for reporting security issues
  • Documentation: Link to documentation on the web:
  • Wiki: Wiki URL
  • Summary: one-line description of the project
  • Description: longer description of the project
  • License: Single line license description (e.g. "GPL 2.0") as declared in the metadata[1]
  • Copyright: List of copyright holders
  • Version: Current upstream version
  • Security-MD: URL to markdown file with security policy

All data fields have a “certainty” associated with them (“certain”, “confident”, “likely” or “possible”), which gets set depending on how the data was derived or where it was found. If multiple possible values were found for a specific field, then the value with the highest certainty is taken.

Interface

The ontologist provides a high-level Python API as well as two command-line tools that can write output in two different formats:

For example, running guess-upstream-metadata on dulwich:

 % guess-upstream-metadata
 <string>:2: (INFO/1) Duplicate implicit target name: "contributing".
 Name: dulwich
 Repository: https://www.dulwich.io/code/
 X-Security-MD: https://github.com/dulwich/dulwich/tree/HEAD/SECURITY.md
 X-Version: 0.20.21
 Bug-Database: https://github.com/dulwich/dulwich/issues
 X-Summary: Python Git Library
 X-Description: |
   This is the Dulwich project.
   It aims to provide an interface to git repos (both local and remote) that
   doesn't call out to git directly but instead uses pure Python.
 X-License: Apache License, version 2 or GNU General Public License, version 2 or later.
 Bug-Submit: https://github.com/dulwich/dulwich/issues/new
Lintian-Brush

lintian-brush can update DEP-12-style debian/upstream/metadata files that hold information about the upstream project that is packaged as well as the Homepage in the debian/control file based on information provided by the upstream ontologist. By default, it only imports data with the highest certainty - you can override this by specifying the --uncertain command-line flag.

[1]Obviously this won't be able to describe the full licensing situation for many projects. Projects like scancode-toolkit are more appropriate for that.

Vishal Gupta: Sikkim 101 for Backpackers

11 April, 2021 - 21:04

Host to Kanchenjunga, the world’s third-highest mountain peak and the endangered Red Panda, Sikkim is a state in northeastern India. Nestled between Nepal, Tibet (China), Bhutan and West Bengal (India), the state offers a smorgasbord of cultures and cuisines. That said, it’s hardly surprising that the old spice route meanders through western Sikkim, connecting Lhasa with the ports of Bengal. Although the latter could also be attributed to cardamom (kali elaichi), a perennial herb native to Sikkim, which the state is the second-largest producer of, globally. Lastly, having been to and lived in India, all my life, I can confidently say Sikkim is one of the cleanest & safest regions in India, making it ideal for first-time backpackers.

Brief History
  • 17th century: The Kingdom of Sikkim is founded by the Namgyal dynasty and ruled by Buddhist priest-kings known as the Chogyal.
  • 1890: Sikkim becomes a princely state of British India.
  • 1947: Sikkim continues its protectorate status with the Union of India, post-Indian-independence.
  • 1973: Anti-royalist riots take place in front of the Chogyal's palace, by Nepalis seeking greater representation.
  • 1975: Referendum leads to the deposition of the monarchy and Sikkim joins India as its 22nd state.
Languages

  • Official: English, Nepali, Sikkimese/Bhotia and Lepcha
  • Though Hindi and Nepali share the same script (Devanagari), they are not mutually intelligible. Yet, most people in Sikkim can understand and speak Hindi.

Ethnicity

  • Nepalis: Migrated in large numbers (from Nepal) and soon became the dominant community
  • Bhutias: People of Tibetan origin. Major inhabitants in Northern Sikkim.
  • Lepchas: Original inhabitants of Sikkim

Food
  • Tibetan/Nepali dishes (mostly consumed during winter)
    • Thukpa: Noodle soup, rich in spices and vegetables. Usually contains some form of meat. Common variations: Thenthuk and Gyathuk
    • Momos: Steamed or fried dumplings, usually with a meat filling.
    • Saadheko: Spicy marinated chicken salad.
    • Gundruk Soup: A soup made from Gundruk, a fermented leafy green vegetable.
    • Sinki : A fermented radish tap-root product, traditionally consumed as a base for soup and as a pickle. Eerily similar to Kimchi.
  • While pork and beef are pretty common, finding vegetarian dishes is equally easy.
  • Staple: Dal-Bhat with Subzi. Rice is a lot more common than wheat (rice) possibly due to greater carb content and proximity to West Bengal, India’s largest producer of Rice.
  • Good places to eat in Gangtok
    • Hamro Bhansa Ghar, Nimtho (Nepali)
    • Taste of Tibet
    • Dragon Wok (Chinese & Japanese)

Buddhism in Sikkim
  • Bayul Demojong (Sikkim), is the most sacred Land in the Himalayas as per the belief of the Northern Buddhists and various religious texts.
  • Sikkim was blessed by Guru Padmasambhava, the great Buddhist saint who visited Sikkim in the 8th century and consecrated the land.
  • However, Buddhism is said to have reached Sikkim only in the 17th century with the arrival of three Tibetan monks viz. Rigdzin Goedki Demthruchen, Mon Kathok Sonam Gyaltshen & Rigdzin Legden Je at Yuksom. Together, they established a Buddhist monastery.
  • In 1642 they crowned Phuntsog Namgyal as the first monarch of Sikkim and gave him the title of Chogyal, or Dharma Raja.
  • The faith became popular through its royal patronage and soon many villages had their own monastery.
  • Today Sikkim has over 200 monasteries.
Major monasteries
  • Rumtek Monastery, 20Km from Gangtok
  • Lingdum/Ranka Monastery, 17Km from Gangtok
  • Phodong Monastery, 28Km from Gangtok
  • Ralang Monastery, 10Km from Ravangla
  • Tsuklakhang Monastery, Royal Palace, Gangtok
  • Enchey Monastery, Gangtok
  • Tashiding Monastery, 35Km from Ravangla

Reaching Sikkim
  • Gangtok, being the capital, is easiest to reach amongst other regions, by public transport and shared cabs.
  • By Air:
    • Pakyong (PYG) :
      • Nearest airport from Gangtok (about 1 hour away)
      • Tabletop airport
      • Reserved cabs cost around INR 1200.
      • As of Apr 2021, the only flights to PYG are from IGI (Delhi) and CCU (Kolkata).
    • Bagdogra (IXB) :
      • About 20 minutes from Siliguri and 4 hours from Gangtok.
      • Larger airport with flights to most major Indian cities.
      • Reserved cabs cost about INR 3000. Shared cabs cost about INR 350.
  • By Train:
    • New Jalpaiguri (NJP) :
      • About 20 minutes from Siliguri and 4 hours from Gangtok.
      • Reserved cabs cost about INR 3000. Shared cabs from INR 350.
  • By Road:
    • NH10 connects Siliguri to Gangtok
    • If you can’t find buses plying to Gangtok directly, reach Siliguri and then take a cab to Gangtok.
  • Sikkim Nationalised Transport Div. also runs hourly buses between Siliguri and Gangtok and daily buses on other common routes. They’re cheaper than shared cabs.
  • Wizzride also operates shared cabs between Siliguri/Bagdogra/NJP, Gangtok and Darjeeling. They cost about the same as shared cabs but pack in half as many people in “luxury cars” (Innova, Xylo, etc.) and are hence more comfortable.

Gangtok
  • Time needed: 1D/1N
  • Places to visit:
    • Hanuman Tok
    • Ganesh Tok
    • Tashi View Point [6,800ft]
    • MG Marg
    • Sikkim Zoo
    • Gangtok Ropeway
    • Enchey Monastery
    • Tsuklakhang Palace & Monastery
  • Hostels: Tagalong Backpackers (would strongly recommend), Zostel Gangtok
  • Places to chill: Travel Cafe, Café Live & Loud and Gangtok Groove
  • Places to shop: Lal Market and MG Marg
Getting Around
  • Taxis operate on a reserved or shared basis. In case of the latter, you can pool with other commuters your taxis will pick up and drop en-route.
  • Naturally shared taxis only operate on popular routes. The easiest way to get around Gangtok is to catch a shared cab from MG Marg.
  • Reserved taxis for Gangtok sightseeing cost around INR 1000-1500, depending upon the spots you’d like to see
  • Key taxi/bus stands :
    • Deorali stand: For Darjeeling, Siliguri, Kalimpong
    • Vajra stand: For North & East Sikkim (Tsomgo Lake & Nathula)
    • Rumtek taxi: For Ravangla, Pelling, Namchi, Geyzing, Jorethang and Singtam.

Exploring Gangtok on an MTB

North Sikkim
  • The easiest & most economical way to explore North Sikkim is the 3D/2N package offered by shared-cab drivers.
  • This includes food, permits, cab rides and accommodation (1N in Lachen and 1N in Lachung)
  • The accommodation on both nights are at homestays with bare necessities, so keep your hopes low.
  • In the spirit of sustainable tourism, you’ll be asked to discard single-use plastic bottles, so please carry a bottle that you can refill along the way.
  • Zero Point and Gurdongmer Lake are snow-capped throughout the year

3D/2N Shared-cab Package Itinerary
  • Day 1
    • Gangtok (10am) - Chungthang - Lachung (stay)
  • Day 2
    • Pre-lunch : Lachung (6am) - Yumthang Valley [12,139ft] - Zero Point - Lachung [15,300ft]
    • Post-lunch : Lachung - Chungthang - Lachen (stay)
  • Day 3
    • Pre-lunch : Lachen (5am) - Kala Patthar - Gurdongmer Lake [16,910ft] - Lachen
    • Post-lunch : Lachen - Chungthang - Gangtok (7pm)
  • This itinerary is idealistic and depends on the level of snowfall.
  • Some drivers might switch up Day 2 and 3 itineraries by visiting Lachen and then Lachung, depending upon the weather.
  • Areas beyond Lachen & Lachung are heavily militarized since the Indo-China border is only a few miles away.

East Sikkim Zuluk and Silk Route
  • Time needed: 2D/1N
  • Zuluk [9,400ft] is a small hamlet with an excellent view of the eastern Himalayan range including the Kanchenjunga.
  • Was once a transit point to the historic Silk Route from Tibet (Lhasa) to India (West Bengal).
  • The drive from Gangtok to Zuluk takes at least four hours. Hence, it makes sense to spend the night at a homestay and space out your trip to Zuluk

Tsomgo Lake and Nathula
  • Time Needed : 1D
  • A Protected Area Permit is required to visit these places, due to their proximity to the Chinese border
  • Tsomgo/Chhangu Lake [12,313ft]
    • Glacial lake, 40 km from Gangtok.
    • Remains frozen during the winter season.
    • You can also ride on the back of a Yak for INR 300
  • Baba Mandir
    • An old temple dedicated to Baba Harbhajan Singh, a Sepoy in the 23rd Regiment, who died in 1962 near the Nathu La during Indo – China war.
  • Nathula Pass [14,450ft]
    • Located on the Indo-Tibetan border crossing of the Old Silk Route, it is one of the three open trading posts between India and China.
    • Plays a key role in the Sino-Indian Trade and also serves as an official Border Personnel Meeting(BPM) Point.
    • May get cordoned off by the Indian Army in event of heavy snowfall or for other security reasons.

West Sikkim
  • Time needed: 3N/1N
  • Hostels at Pelling : Mochilerro Ostillo
Itinerary Day 1: Gangtok - Ravangla - Pelling
  • Leave Gangtok early, for Ravangla through the Temi Tea Estate route.
  • Spend some time at the tea garden and then visit Buddha Park at Ravangla
  • Head to Pelling from Ravangla
Day 2: Pelling sightseeing
  • Hire a cab and visit Skywalk, Pemayangtse Monastery, Rabdentse Ruins, Kecheopalri Lake, Kanchenjunga Falls.
Day 3: Pelling - Gangtok/Siliguri
  • Wake up early to catch a glimpse of Kanchenjunga at the Pelling Helipad around sunrise
  • Head back to Gangtok on a shared-cab
  • You could take a bus/taxi back to Siliguri if Pelling is your last stop.

Darjeeling
  • In my opinion, Darjeeling is lovely for a two-day detour on your way back to Bagdogra/Siliguri and not any longer (unless you’re a Bengali couple on a honeymoon)
  • Once a part of Sikkim, Darjeeling was ceded to the East India Company after a series of wars, with Sikkim briefly receiving a grant from EIC for “gifting” Darjeeling to the latter
  • Post-independence, Darjeeling was merged with the state of West Bengal.
Itinerary Day 1 :
  • Take a cab from Gangtok to Darjeeling (shared-cabs cost INR 300 per seat)
  • Reach Darjeeling by noon and check in to your Hostel. I stayed at Hideout.
  • Spend the evening visiting either a monastery (or the Batasia Loop), Nehru Road and Mall Road.
  • Grab dinner at Glenary whilst listening to live music.

Day 2:
  • Wake up early to catch the sunrise and a glimpse of Kanchenjunga at Tiger Hill. Since Tiger Hill is 10km from Darjeeling and requires a permit, book your taxi in advance.
  • Alternatively, if you don’t want to get up at 4am or shell out INR1500 on the cab to Tiger Hill, walk to the Kanchenjunga View Point down Mall Road
  • Next, queue up outside Keventers for breakfast with a view in a century-old cafe
  • Get a cab at Gandhi Road and visit a tea garden (Happy Valley is the closest) and the Ropeway. I was lucky to meet 6 other backpackers at my hostel and we ended up pooling the cab at INR 200 per person, with INR 1400 being on the expensive side, but you could bargain.
  • Get lunch, buy some tea at Golden Tips, pack your bags and hop on a shared-cab back to Siliguri. It took us about 4hrs to reach Siliguri, with an hour to spare before my train.
  • If you’ve still got time on your hands, then check out the Peace Pagoda and the Darjeeling Himalayan Railway (Toy Train). At INR 1500, I found the latter to be too expensive and skipped it.

Tips and hacks
  • Download offline maps, especially when you’re exploring Northern Sikkim.
  • Food and booze are the cheapest in Gangtok. Stash up before heading to other regions.
  • Keep your Aadhar/Passport handy since you need permits to travel to North & East Sikkim.
  • In rural areas and some cafes, you may get to try Rhododendron Wine, made from Rhododendron arboreum a.k.a Gurans. Its production is a little hush-hush since the flower is considered holy and is also the National Flower of Nepal.
  • If you don’t want to invest in a new jacket, boots or a pair of gloves, you can always rent them at nominal rates from your hotel or little stores around tourist sites.
  • Check the weather of a region before heading there. Low visibility and precipitation can quite literally dampen your experience.
  • Keep your itinerary flexible to accommodate for rest and impromptu plans.
  • Shops and restaurants close by 8pm in Sikkim and Darjeeling. Plan for the same.
Carry…
  • a couple of extra pairs of socks (woollen, if possible)
  • a pair of slippers to wear indoors
  • a reusable water bottle
  • an umbrella
  • a power bank
  • a couple of tablets of Diamox. Helps deal with altitude sickness
  • extra clothes and wet bags since you may not get a chance to wash/dry your clothes
  • a few passport size photographs
Shared-cab hacks
  • Intercity rides can be exhausting. If you can afford it, pay for an additional seat.
  • Call shotgun on the drives beyond Lachen and Lachung. The views are breathtaking.
  • Return cabs tend to be cheaper (WB cabs travelling from SK and vice-versa)
Cost
  • My median daily expenditure (back when I went to Sikkim in early March 2021) was INR 1350.
  • This includes stay (bunk bed), food, wine and transit (shared cabs)
  • In my defence, I splurged on food, wine and extra seats in shared cabs, but if you’re on a budget, you could easily get by on INR 1 - 1.2k per day.
  • For a 9-day trip, I ended up shelling out nearly INR 15k, including 2AC trains to & from Kolkata
  • Note : Summer (March to May) and Autumn (October to December) are peak seasons, and thereby more expensive to travel around.
Souvenirs and things you should buy

Buddhist souvenirs :
  • Colourful Prayer Flags (great for tying on bikes or behind car windshields)
  • Miniature Prayer/Mani Wheels
  • Lucky Charms, Pendants and Key Chains
  • Cham Dance masks and robes
  • Singing Bowls
  • Common symbols: Om mani padme hum, Ashtamangala, Zodiac signs
Handicrafts & Handlooms
  • Tibetan Yak Wool shawls, scarfs and carpets
  • Sikkimese Ceramic cups
  • Thangka Paintings
Edibles
  • Darjeeling Tea (usually brewed and not boiled)
  • Wine (Arucha Peach & Rhododendron)
  • Dalle Khursani (Chilli) Paste and Pickle

Header Icon made by Freepik from www.flaticon.com is licensed by CC 3.0 BY

Jonathan Dowland: 2020 in short fiction

11 April, 2021 - 17:53

Following on from 2020 in Fiction: In 2020 I read a couple of collections of short fiction from some of my favourite authors.

I started the year with Christopher Priest's Episodes. The stories within are collected from throughout his long career, and vary in style and tone. Priest wrote new little prologues and epilogues for each of the stories, explaining the context in which they were written. I really enjoyed this additional view into their construction.

By contrast, Adam Robert's Adam Robots presents the stories on their own terms. Each of the stories is written in a different mode: one as golden-age SF, another as a kind of Cyberpunk, for example, although they all blend or confound sub-genres to some degree. I'm not clever enough to have decoded all their secrets on a first read, and I would have appreciated some "Cliff's Notes” on any deeper meaning or intent.

Ted Chiang's Exhalation was up to the fantastic standard of his earlier collection and had some extremely thoughtful explorations of philosophical ideas. All the stories are strong but one stuck in my mind the longest: Omphalos)…

With my daughter I finished three of Terry Pratchett's short story collections aimed at children: Dragon at Crumbling Castle; The Witch's Vacuum Cleaner and The Time-Travelling Caveman. If you are a Pratchett fan and you've overlooked these because they're aimed at children, take another look. The quality varies, but there are some true gems in these. Several stories take place in common settings, either the town of Blackbury, in Gritshire (and the adjacent Even Moor), or the Welsh border-town of Llandanffwnfafegettupagogo. The sad thing was knowing that once I'd finished them (and the fourth, Father Christmas's Fake Beard) that was it: there will be no more.

8/31 of the "books" I read in 2020 were issues of Interzone. Counting them as "books" for my annual reading goal has encouraged me to read full issues, whereas before I would likely have only read a couple of stories from each issue. Reading full issues has rekindled the enjoyment I got out of it when I first discovered the magazine at the turn of the Century. I am starting to recognise stories by authors that have written stories in other issues, as well as common themes from the current era weaving their way into the work (Trump, Brexit, etc.) No doubt the Pandemic will leave its mark on 2021's stories.

Junichi Uekawa: Wrote a timezone checker page.

11 April, 2021 - 09:06
Wrote a timezone checker page. timezone. Shows the current time in blue line. I haven't made anything configurable but will think about it later.

Charles Plessy: Debian Bullseye: more open

11 April, 2021 - 05:21

Debian Bullseye will provide the command /usr/bin/open for your greatest comfort at the command line. On a system with a graphical desktop environment, the command should have a similar result as when opening a document from a mouse-and-click file browser.

Technically, /usr/bin/open is a symbolic link managed by update-alternatives to point towards xdg-open if available and otherwise run-mailcap.

Kentaro Hayashi: Grow your ideas for Debian Project

10 April, 2021 - 20:13

There may be some "If it could be ..." ideas for Debian Project. If idea is concreate and worth to make things forward, it should make a proposal for Project Funding.

salsa.debian.org

But it is a just an idea, or no afford to act as an executor role, that idea will not be achieved.

I thought that It needs an incubator - complemental project.

salsa.debian.org

I've salvaged an idea from closed MR Add proposal about "Formalize reimbursement process" (!5) · Merge Requests · Freexian SARL / Project Funding · GitLab

I'm not confident whether mechanism works, but Debian needs change.

Michael Prokop: A Ceph war story

9 April, 2021 - 19:39

It all started with the big bang! We nearly lost 33 of 36 disks on a Proxmox/Ceph Cluster; this is the story of how we recovered them.

At the end of 2020, we eventually had a long outstanding maintenance window for taking care of system upgrades at a customer. During this maintenance window, which involved reboots of server systems, the involved Ceph cluster unexpectedly went into a critical state. What was planned to be a few hours of checklist work in the early evening turned out to be an emergency case; let’s call it a nightmare (not only because it included a big part of the night). Since we have learned a few things from our post mortem and RCA, it’s worth sharing those with others. But first things first, let’s step back and clarify what we had to deal with.

The system and its upgrade

One part of the upgrade included 3 Debian servers (we’re calling them server1, server2 and server3 here), running on Proxmox v5 + Debian/stretch with 12 Ceph OSDs each (65.45TB in total), a so-called Proxmox Hyper-Converged Ceph Cluster.

First, we went for upgrading the Proxmox v5/stretch system to Proxmox v6/buster, before updating Ceph Luminous v12.2.13 to the latest v14.2 release, supported by Proxmox v6/buster. The Proxmox upgrade included updating corosync from v2 to v3. As part of this upgrade, we had to apply some configuration changes, like adjust ring0 + ring1 address settings and add a mon_host configuration to the Ceph configuration.

During the first two servers’ reboots, we noticed configuration glitches. After fixing those, we went for a reboot of the third server as well. Then we noticed that several Ceph OSDs were unexpectedly down. The NTP service wasn’t working as expected after the upgrade. The underlying issue is a race condition of ntp with systemd-timesyncd (see #889290). As a result, we had clock skew problems with Ceph, indicating that the Ceph monitors’ clocks aren’t running in sync (which is essential for proper Ceph operation). We initially assumed that our Ceph OSD failure derived from this clock skew problem, so we took care of it. After yet another round of reboots, to ensure the systems are running all with identical and sane configurations and services, we noticed lots of failing OSDs. This time all but three OSDs (19, 21 and 22) were down:

% sudo ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       65.44138 root default
-2       21.81310     host server1
 0   hdd  1.08989         osd.0    down  1.00000 1.00000
 1   hdd  1.08989         osd.1    down  1.00000 1.00000
 2   hdd  1.63539         osd.2    down  1.00000 1.00000
 3   hdd  1.63539         osd.3    down  1.00000 1.00000
 4   hdd  1.63539         osd.4    down  1.00000 1.00000
 5   hdd  1.63539         osd.5    down  1.00000 1.00000
18   hdd  2.18279         osd.18   down  1.00000 1.00000
20   hdd  2.18179         osd.20   down  1.00000 1.00000
28   hdd  2.18179         osd.28   down  1.00000 1.00000
29   hdd  2.18179         osd.29   down  1.00000 1.00000
30   hdd  2.18179         osd.30   down  1.00000 1.00000
31   hdd  2.18179         osd.31   down  1.00000 1.00000
-4       21.81409     host server2
 6   hdd  1.08989         osd.6    down  1.00000 1.00000
 7   hdd  1.08989         osd.7    down  1.00000 1.00000
 8   hdd  1.63539         osd.8    down  1.00000 1.00000
 9   hdd  1.63539         osd.9    down  1.00000 1.00000
10   hdd  1.63539         osd.10   down  1.00000 1.00000
11   hdd  1.63539         osd.11   down  1.00000 1.00000
19   hdd  2.18179         osd.19     up  1.00000 1.00000
21   hdd  2.18279         osd.21     up  1.00000 1.00000
22   hdd  2.18279         osd.22     up  1.00000 1.00000
32   hdd  2.18179         osd.32   down  1.00000 1.00000
33   hdd  2.18179         osd.33   down  1.00000 1.00000
34   hdd  2.18179         osd.34   down  1.00000 1.00000
-3       21.81419     host server3
12   hdd  1.08989         osd.12   down  1.00000 1.00000
13   hdd  1.08989         osd.13   down  1.00000 1.00000
14   hdd  1.63539         osd.14   down  1.00000 1.00000
15   hdd  1.63539         osd.15   down  1.00000 1.00000
16   hdd  1.63539         osd.16   down  1.00000 1.00000
17   hdd  1.63539         osd.17   down  1.00000 1.00000
23   hdd  2.18190         osd.23   down  1.00000 1.00000
24   hdd  2.18279         osd.24   down  1.00000 1.00000
25   hdd  2.18279         osd.25   down  1.00000 1.00000
35   hdd  2.18179         osd.35   down  1.00000 1.00000
36   hdd  2.18179         osd.36   down  1.00000 1.00000
37   hdd  2.18179         osd.37   down  1.00000 1.00000

Our blood pressure increased slightly! Did we just lose all of our cluster? What happened, and how can we get all the other OSDs back?

We stumbled upon this beauty in our logs:

kernel: [   73.697957] XFS (sdl1): SB stripe unit sanity check failed
kernel: [   73.698002] XFS (sdl1): Metadata corruption detected at xfs_sb_read_verify+0x10e/0x180 [xfs], xfs_sb block 0xffffffffffffffff
kernel: [   73.698799] XFS (sdl1): Unmount and run xfs_repair
kernel: [   73.699199] XFS (sdl1): First 128 bytes of corrupted metadata buffer:
kernel: [   73.699677] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
kernel: [   73.700205] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
kernel: [   73.700836] 00000020: 62 44 2b c0 e6 22 40 d7 84 3d e1 cc 65 88 e9 d8  bD+.."@..=..e...
kernel: [   73.701347] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
kernel: [   73.701770] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
ceph-disk[4240]: mount: /var/lib/ceph/tmp/mnt.jw367Y: mount(2) system call failed: Structure needs cleaning.
ceph-disk[4240]: ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', u'xfs', '-o', 'noatime,inode64', '--', '/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.cdda39ed-5
ceph/tmp/mnt.jw367Y']' returned non-zero exit status 32
kernel: [   73.702162] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
kernel: [   73.702550] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
kernel: [   73.702975] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
kernel: [   73.703373] XFS (sdl1): SB validate failed with error -117.

The same issue was present for the other failing OSDs. We hoped, that the data itself was still there, and only the mounting of the XFS partitions failed. The Ceph cluster was initially installed in 2017 with Ceph jewel/10.2 with the OSDs on filestore (nowadays being a legacy approach to storing objects in Ceph). However, we migrated the disks to bluestore since then (with ceph-disk and not yet via ceph-volume what’s being used nowadays). Using ceph-disk introduces these 100MB XFS partitions containing basic metadata for the OSD.

Given that we had three working OSDs left, we decided to investigate how to rebuild the failing ones. Some folks on #ceph (thanks T1, ormandj + peetaur!) were kind enough to share how working XFS partitions looked like for them. After creating a backup (via dd), we tried to re-create such an XFS partition on server1. We noticed that even mounting a freshly created XFS partition failed:

synpromika@server1 ~ % sudo mkfs.xfs -f -i size=2048 -m uuid="4568c300-ad83-4288-963e-badcd99bf54f" /dev/sdc1
meta-data=/dev/sdc1              isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
synpromika@server1 ~ % sudo mount /dev/sdc1 /mnt/ceph-recovery
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
cache_node_purge: refcount was 1, not zero (node=0x1d3c400)
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x18800/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x18800/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x24c00/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x24c00/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0xc400/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0xc400/0x1000
releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!found dirty buffer (bulk) on free list!bad magic number
bad magic number
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
releasing dirty buffer (bulk) to free list!mount: /mnt/ceph-recovery: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error.

Ouch. This very much looked related to the actual issue we’re seeing. So we tried to execute mkfs.xfs with a bunch of different sunit/swidth settings. Using ‘-d sunit=512 -d swidth=512‘ at least worked then, so we decided to force its usage in the creation of our OSD XFS partition. This brought us a working XFS partition. Please note, sunit must not be larger than swidth (more on that later!).

Then we reconstructed how to restore all the metadata for the OSD (activate.monmap, active, block_uuid, bluefs, ceph_fsid, fsid, keyring, kv_backend, magic, mkfs_done, ready, require_osd_release, systemd, type, whoami). To identify the UUID, we can read the data from ‘ceph --format json osd dump‘, like this for all our OSDs (Zsh syntax ftw!):

synpromika@server1 ~ % for f in {0..37} ; printf "osd-$f: %s\n" "$(sudo ceph --format json osd dump | jq -r ".osds[] | select(.osd==$f) | .uuid")"
osd-0: 4568c300-ad83-4288-963e-badcd99bf54f
osd-1: e573a17a-ccde-4719-bdf8-eef66903ca4f
osd-2: 0e1b2626-f248-4e7d-9950-f1a46644754e
osd-3: 1ac6a0a2-20ee-4ed8-9f76-d24e900c800c
[...]

Identifying the corresponding raw device for each OSD UUID is possible via:

synpromika@server1 ~ % UUID="4568c300-ad83-4288-963e-badcd99bf54f"
synpromika@server1 ~ % readlink -f /dev/disk/by-partuuid/"${UUID}"
/dev/sdc1

The OSD’s key ID can be retrieved via:

synpromika@server1 ~ % OSD_ID=0
synpromika@server1 ~ % sudo ceph auth get osd."${OSD_ID}" -f json 2>/dev/null | jq -r '.[] | .key'
AQCKFpZdm0We[...]

Now we also need to identify the underlying block device:

synpromika@server1 ~ % OSD_ID=0
synpromika@server1 ~ % sudo ceph osd metadata osd."${OSD_ID}" -f json | jq -r '.bluestore_bdev_partition_path'    
/dev/sdc2

With all of this, we reconstructed the keyring, fsid, whoami, block + block_uuid files. All the other files inside the XFS metadata partition are identical on each OSD. So after placing and adjusting the corresponding metadata on the XFS partition for Ceph usage, we got a working OSD – hurray! Since we had to fix yet another 32 OSDs, we decided to automate this XFS partitioning and metadata recovery procedure.

We had a network share available on /srv/backup for storing backups of existing partition data. On each server, we tested the procedure with one single OSD before iterating over the list of remaining failing OSDs. We started with a shell script on server1, then adjusted the script for server2 and server3. This is the script, as we executed it on the 3rd server.

Thanks to this, we managed to get the Ceph cluster up and running again. We didn’t want to continue with the Ceph upgrade itself during the night though, as we wanted to know exactly what was going on and why the system behaved like that. Time for RCA!

Root Cause Analysis

So all but three OSDs on server2 failed, and the problem seems to be related to XFS. Therefore, our starting point for the RCA was, to identify what was different on server2, as compared to server1 + server3. My initial assumption was that this was related to some firmware issues with the involved controller (and as it turned out later, I was right!). The disks were attached as JBOD devices to a ServeRAID M5210 controller (with a stripe size of 512). Firmware state:

synpromika@server1 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.16.0-0092
Firmware Version = 4.660.00-8156

synpromika@server2 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.21.0-0112
Firmware Version = 4.680.00-8489

synpromika@server3 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.16.0-0092
Firmware Version = 4.660.00-8156

This looked very promising, as server2 indeed runs with a different firmware version on the controller. But how so? Well, the motherboard of server2 got replaced by a Lenovo/IBM technician in January 2020, as we had a failing memory slot during a memory upgrade. As part of this procedure, the Lenovo/IBM technician installed the latest firmware versions. According to our documentation, some OSDs were rebuilt (due to the filestore->bluestore migration) in March and April 2020. It turned out that precisely those OSDs were the ones that survived the upgrade. So the surviving drives were created with a different firmware version running on the involved controller. All the other OSDs were created with an older controller firmware. But what difference does this make?

Now let’s check firmware changelogs. For the 24.21.0-0097 release we found this:

- Cannot create or mount xfs filesystem using xfsprogs 4.19.x kernel 4.20(SCGCQ02027889)
- xfs_info command run on an XFS file system created on a VD of strip size 1M shows sunit and swidth as 0(SCGCQ02056038)

Our XFS problem certainly was related to the controller’s firmware. We also recalled that our monitoring system reported different sunit settings for the OSDs that were rebuilt in March and April. For example, OSD 21 was recreated and got different sunit settings:

WARN  server2.example.org  Mount options of /var/lib/ceph/osd/ceph-21      WARN - Missing: sunit=1024, Exceeding: sunit=512

We compared the new OSD 21 with an existing one (OSD 25 on server3):

synpromika@server2 ~ % systemctl show var-lib-ceph-osd-ceph\\x2d21.mount | grep sunit
Options=rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota
synpromika@server3 ~ % systemctl show var-lib-ceph-osd-ceph\\x2d25.mount | grep sunit
Options=rw,noatime,attr2,inode64,sunit=1024,swidth=512,noquota

Thanks to our documentation, we could compare execution logs of their creation:

% diff -u ceph-disk-osd-25.log ceph-disk-osd-21.log
-synpromika@server2 ~ % sudo ceph-disk -v prepare --bluestore /dev/sdj --osd-id 25
+synpromika@server3 ~ % sudo ceph-disk -v prepare --bluestore /dev/sdi --osd-id 21
[...]
-command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdj1
-meta-data=/dev/sdj1              isize=2048   agcount=4, agsize=6272 blks
[...]
+command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdi1
+meta-data=/dev/sdi1              isize=2048   agcount=4, agsize=6336 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
-data     =                       bsize=4096   blocks=25088, imaxpct=25
-         =                       sunit=128    swidth=64 blks
+data     =                       bsize=4096   blocks=25344, imaxpct=25
+         =                       sunit=64     swidth=64 blks
 naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
 log      =internal log           bsize=4096   blocks=1608, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
[...]

So back then, we even tried to track this down but couldn’t make sense of it yet. But now this sounds very much like it is related to the problem we saw with this Ceph/XFS failure. We follow Occam’s razor, assuming the simplest explanation is usually the right one, so let’s check the disk properties and see what differs:

synpromika@server1 ~ % sudo blockdev --getsz --getsize64 --getss --getpbsz --getiomin --getioopt /dev/sdk
4685545472
2398999281664
512
4096
524288
262144

synpromika@server2 ~ % sudo blockdev --getsz --getsize64 --getss --getpbsz --getiomin --getioopt /dev/sdk
4685545472
2398999281664
512
4096
262144
262144

See the difference between server1 and server2 for identical disks? The getiomin option now reports something different for them:

synpromika@server1 ~ % sudo blockdev --getiomin /dev/sdk            
524288
synpromika@server1 ~ % cat /sys/block/sdk/queue/minimum_io_size
524288

synpromika@server2 ~ % sudo blockdev --getiomin /dev/sdk 
262144
synpromika@server2 ~ % cat /sys/block/sdk/queue/minimum_io_size
262144

It doesn’t make sense that the minimum I/O size (iomin, AKA BLKIOMIN) is bigger than the optimal I/O size (ioopt, AKA BLKIOOPT). This leads us to Bug 202127 – cannot mount or create xfs on a 597T device, which matches our findings here. But why did this XFS partition work in the past and fails now with the newer kernel version?

The XFS behaviour change

Now given that we have backups of all the XFS partition, we wanted to track down, a) when this XFS behaviour was introduced, and b) whether, and if so how it would be possible to reuse the XFS partition without having to rebuild it from scratch (e.g. if you would have no working Ceph OSD or backups left).

Let’s look at such a failing XFS partition with the Grml live system:

root@grml ~ # grml-version
grml64-full 2020.06 Release Codename Ausgehfuahangl [2020-06-24]
root@grml ~ # uname -a
Linux grml 5.6.0-2-amd64 #1 SMP Debian 5.6.14-2 (2020-06-09) x86_64 GNU/Linux
root@grml ~ # grml-hostname grml-2020-06
Setting hostname to grml-2020-06: done
root@grml ~ # exec zsh
root@grml-2020-06 ~ # dpkg -l xfsprogs util-linux
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=========================================
ii  util-linux     2.35.2-4     amd64        miscellaneous system utilities
ii  xfsprogs       5.6.0-1+b2   amd64        Utilities for managing the XFS filesystem

There it’s failing, no matter which mount option we try:

root@grml-2020-06 ~ # mount ./sdd1.dd /mnt
mount: /mnt: mount(2) system call failed: Structure needs cleaning.
root@grml-2020-06 ~ # dmesg | tail -30
[...]
[   64.788640] XFS (loop1): SB stripe unit sanity check failed
[   64.788671] XFS (loop1): Metadata corruption detected at xfs_sb_read_verify+0x102/0x170 [xfs], xfs_sb block 0xffffffffffffffff
[   64.788671] XFS (loop1): Unmount and run xfs_repair
[   64.788672] XFS (loop1): First 128 bytes of corrupted metadata buffer:
[   64.788673] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
[   64.788674] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   64.788675] 00000020: 32 b6 dc 35 53 b7 44 96 9d 63 30 ab b3 2b 68 36  2..5S.D..c0..+h6
[   64.788675] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
[   64.788675] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
[   64.788676] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
[   64.788677] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
[   64.788677] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
[   64.788679] XFS (loop1): SB validate failed with error -117.
root@grml-2020-06 ~ # mount -t xfs -o rw,relatime,attr2,inode64,sunit=1024,swidth=512,noquota ./sdd1.dd /mnt/
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error.
32 root@grml-2020-06 ~ # dmesg | tail -1
[   66.342976] XFS (loop1): stripe width (512) must be a multiple of the stripe unit (1024)
root@grml-2020-06 ~ # mount -t xfs -o rw,relatime,attr2,inode64,sunit=512,swidth=512,noquota ./sdd1.dd /mnt/
mount: /mnt: mount(2) system call failed: Structure needs cleaning.
32 root@grml-2020-06 ~ # dmesg | tail -14
[   66.342976] XFS (loop1): stripe width (512) must be a multiple of the stripe unit (1024)
[   80.751277] XFS (loop1): SB stripe unit sanity check failed
[   80.751323] XFS (loop1): Metadata corruption detected at xfs_sb_read_verify+0x102/0x170 [xfs], xfs_sb block 0xffffffffffffffff 
[   80.751324] XFS (loop1): Unmount and run xfs_repair
[   80.751325] XFS (loop1): First 128 bytes of corrupted metadata buffer:
[   80.751327] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
[   80.751328] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   80.751330] 00000020: 32 b6 dc 35 53 b7 44 96 9d 63 30 ab b3 2b 68 36  2..5S.D..c0..+h6
[   80.751331] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
[   80.751331] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
[   80.751332] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
[   80.751333] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
[   80.751334] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
[   80.751338] XFS (loop1): SB validate failed with error -117.

Also xfs_repair doesn’t help either:

root@grml-2020-06 ~ # xfs_info ./sdd1.dd
meta-data=./sdd1.dd              isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

root@grml-2020-06 ~ # xfs_repair ./sdd1.dd
Phase 1 - find and verify superblock...
bad primary superblock - bad stripe width in superblock !!!

attempting to find secondary superblock...
..............................................................................................Sorry, could not find valid secondary superblock
Exiting now.

With the “SB stripe unit sanity check failed” message, we could easily track this down to the following commit fa4ca9c:

% git show fa4ca9c5574605d1e48b7e617705230a0640b6da | cat
commit fa4ca9c5574605d1e48b7e617705230a0640b6da
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jun 5 10:06:16 2018 -0700
    
    xfs: catch bad stripe alignment configurations
    
    When stripe alignments are invalid, data alignment algorithms in the
    allocator may not work correctly. Ensure we catch superblocks with
    invalid stripe alignment setups at mount time. These data alignment
    mismatches are now detected at mount time like this:
    
    XFS (loop0): SB stripe unit sanity check failed
    XFS (loop0): Metadata corruption detected at xfs_sb_read_verify+0xab/0x110, xfs_sb block 0xffffffffffffffff
    XFS (loop0): Unmount and run xfs_repair
    XFS (loop0): First 128 bytes of corrupted metadata buffer:
    0000000091c2de02: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 10 00  XFSB............
    0000000023bff869: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000000cdd8c893: 17 32 37 15 ff ca 46 3d 9a 17 d3 33 04 b5 f1 a2  .27...F=...3....
    000000009fd2844f: 00 00 00 00 00 00 00 04 00 00 00 00 00 00 06 d0  ................
    0000000088e9b0bb: 00 00 00 00 00 00 06 d1 00 00 00 00 00 00 06 d2  ................
    00000000ff233a20: 00 00 00 01 00 00 10 00 00 00 00 01 00 00 00 00  ................
    000000009db0ac8b: 00 00 03 60 e1 34 02 00 08 00 00 02 00 00 00 00  ...`.4..........
    00000000f7022460: 00 00 00 00 00 00 00 00 0c 09 0b 01 0c 00 00 19  ................
    XFS (loop0): SB validate failed with error -117.
    
    And the mount fails.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

diff --git fs/xfs/libxfs/xfs_sb.c fs/xfs/libxfs/xfs_sb.c
index b5dca3c8c84d..c06b6fc92966 100644
--- fs/xfs/libxfs/xfs_sb.c
+++ fs/xfs/libxfs/xfs_sb.c
@@ -278,6 +278,22 @@ xfs_mount_validate_sb(
                return -EFSCORRUPTED;
        }
        
+       if (sbp->sb_unit) {
+               if (!xfs_sb_version_hasdalign(sbp) ||
+                   sbp->sb_unit > sbp->sb_width ||
+                   (sbp->sb_width % sbp->sb_unit) != 0) {
+                       xfs_notice(mp, "SB stripe unit sanity check failed");
+                       return -EFSCORRUPTED;
+               } 
+       } else if (xfs_sb_version_hasdalign(sbp)) { 
+               xfs_notice(mp, "SB stripe alignment sanity check failed");
+               return -EFSCORRUPTED;
+       } else if (sbp->sb_width) {
+               xfs_notice(mp, "SB stripe width sanity check failed");
+               return -EFSCORRUPTED;
+       }
+
+       
        if (xfs_sb_version_hascrc(&mp->m_sb) &&
            sbp->sb_blocksize < XFS_MIN_CRC_BLOCKSIZE) {
                xfs_notice(mp, "v5 SB sanity check failed");

This change is included in kernel versions 4.18-rc1 and newer:

% git describe --contains fa4ca9c5574605d1e48
v4.18-rc1~37^2~14

Now let’s try with an older kernel version (4.9.0), using old Grml 2017.05 release:

root@grml ~ # grml-version
grml64-small 2017.05 Release Codename Freedatensuppe [2017-05-31]
root@grml ~ # uname -a
Linux grml 4.9.0-1-grml-amd64 #1 SMP Debian 4.9.29-1+grml.1 (2017-05-24) x86_64 GNU/Linux
root@grml ~ # lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.0 (stretch)
Release:        9.0
Codename:       stretch
root@grml ~ # grml-hostname grml-2017-05
Setting hostname to grml-2017-05: done
root@grml ~ # exec zsh
root@grml-2017-05 ~ #

root@grml-2017-05 ~ # xfs_info ./sdd1.dd
xfs_info: ./sdd1.dd is not a mounted XFS filesystem
1 root@grml-2017-05 ~ # xfs_repair ./sdd1.dd
Phase 1 - find and verify superblock...
bad primary superblock - bad stripe width in superblock !!!

attempting to find secondary superblock...
..............................................................................................Sorry, could not find valid secondary superblock
Exiting now.
1 root@grml-2017-05 ~ # mount ./sdd1.dd /mnt
root@grml-2017-05 ~ # mount -t xfs
/root/sdd1.dd on /mnt type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=512,noquota)
root@grml-2017-05 ~ # ls /mnt
activate.monmap  active  block  block_uuid  bluefs  ceph_fsid  fsid  keyring  kv_backend  magic  mkfs_done  ready  require_osd_release  systemd  type  whoami
root@grml-2017-05 ~ # xfs_info /mnt
meta-data=/dev/loop1             isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mounting there indeed works! Now, if we mount the filesystem with new and proper sunit/swidth settings using the older kernel, it should rewrite them on disk:

root@grml-2017-05 ~ # mount -t xfs -o sunit=512,swidth=512 ./sdd1.dd /mnt/
root@grml-2017-05 ~ # umount /mnt/

And indeed, mounting this rewritten filesystem then also works with newer kernels:

root@grml-2020-06 ~ # mount ./sdd1.rewritten /mnt/
root@grml-2020-06 ~ # xfs_info /root/sdd1.rewritten
meta-data=/dev/loop1             isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=64    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@grml-2020-06 ~ # mount -t xfs                
/root/sdd1.rewritten on /mnt type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota)

FTR: The ‘sunit=512,swidth=512‘ from the xfs mount option is identical to xfs_info’s output ‘sunit=64,swidth=64‘ (because mount.xfs’s sunit value is given in 512-byte block units, see man 5 xfs, and the xfs_info output reported here is in blocks with a block size (bsize) of 4096, so ‘sunit = 512*512 := 64*4096‘).

mkfs uses minimum and optimal sizes for stripe unit and stripe width; you can check this e.g. via (note that server2 with fixed firmware version reports proper values, whereas server3 with broken controller firmware reports non-sense):

synpromika@server2 ~ % for i in /sys/block/sd*/queue/ ; do printf "%s: %s %s\n" "$i" "$(cat "$i"/minimum_io_size)" "$(cat "$i"/optimal_io_size)" ; done
[...]
/sys/block/sdc/queue/: 262144 262144
/sys/block/sdd/queue/: 262144 262144
/sys/block/sde/queue/: 262144 262144
/sys/block/sdf/queue/: 262144 262144
/sys/block/sdg/queue/: 262144 262144
/sys/block/sdh/queue/: 262144 262144
/sys/block/sdi/queue/: 262144 262144
/sys/block/sdj/queue/: 262144 262144
/sys/block/sdk/queue/: 262144 262144
/sys/block/sdl/queue/: 262144 262144
/sys/block/sdm/queue/: 262144 262144
/sys/block/sdn/queue/: 262144 262144
[...]

synpromika@server3 ~ % for i in /sys/block/sd*/queue/ ; do printf "%s: %s %s\n" "$i" "$(cat "$i"/minimum_io_size)" "$(cat "$i"/optimal_io_size)" ; done
[...]
/sys/block/sdc/queue/: 524288 262144
/sys/block/sdd/queue/: 524288 262144
/sys/block/sde/queue/: 524288 262144
/sys/block/sdf/queue/: 524288 262144
/sys/block/sdg/queue/: 524288 262144
/sys/block/sdh/queue/: 524288 262144
/sys/block/sdi/queue/: 524288 262144
/sys/block/sdj/queue/: 524288 262144
/sys/block/sdk/queue/: 524288 262144
/sys/block/sdl/queue/: 524288 262144
/sys/block/sdm/queue/: 524288 262144
/sys/block/sdn/queue/: 524288 262144
[...]

This is the underlying reason why the initially created XFS partitions were created with incorrect sunit/swidth settings. The broken firmware of server1 and server3 was the cause of the incorrect settings – they were ignored by old(er) xfs/kernel versions, but treated as an error by new ones.

Make sure to also read the XFS FAQ regarding “How to calculate the correct sunit,swidth values for optimal performance”. We also stumbled upon two interesting reads in RedHat’s knowledge base: 5075561 + 2150101 (requires an active subscription, though) and #1835947.

Am I affected? How to work around it?

To check whether your XFS mount points are affected by this issue, the following command line should be useful:

awk '$3 == "xfs"{print $2}' /proc/self/mounts | while read mount ; do echo -n "$mount " ; xfs_info $mount | awk '$0 ~ "swidth"{gsub(/.*=/,"",$2); gsub(/.*=/,"",$3); print $2,$3}' | awk '{ if ($1 > $2) print "impacted"; else print "OK"}' ; done

If you run into the above situation, the only known solution to get your original XFS partition working again, is to boot into an older kernel version again (4.17 or older), mount the XFS partition with correct sunit/swidth settings and then boot back into your new system (kernel version wise).

Lessons learned
  • document everything and ensure to have all relevant information available (including actual times of changes, used kernel/package/firmware/… versions. The thorough documentation was our most significant asset in this case, because we had all the data and information we needed during the emergency handling as well as for the post mortem/RCA)
  • if something changes unexpectedly, dig deeper
  • know who to ask, a network of experts pays off
  • including timestamps in your shell makes reconstruction easier (the more people and documentation involved, the harder it gets to wade through it)
  • keep an eye on changelogs/release notes
  • apply regular updates and don’t forget invisible layers (e.g. BIOS, controller/disk firmware, IPMI/OOB (ILO/RAC/IMM/…) firmware)
  • apply regular reboots, to avoid a possible delta becoming bigger (which makes debugging harder)

Thanks: Darshaka Pathirana, Chris Hofstaedtler and Michael Hanscho.

Looking for help with your IT infrastructure? Let us know!

Sean Whitton: consfigurator-live-build

9 April, 2021 - 06:35

One of my goals for Consfigurator is to make it capable of installing Debian to my laptop, so that I can stop booting to GRML and manually partitioning and debootstrapping a basic system, only to then turn to configuration management to set everything else up. My configuration management should be able to handle the partitioning and debootstrapping, too.

The first stage was to make Consfigurator capable of debootstrapping a basic system, chrooting into it, and applying other arbitrary configuration, such as installing packages. That’s been in place for some weeks now. It’s sophisticated enough to avoid starting up newly installed services, but I still need to add some bind mounting.

Another significant piece is teaching Consfigurator how to partition block devices. That’s quite tricky to do in a sufficiently general way – I want to cleanly support various combinations of LUKS, LVM and regular partitions, including populating /etc/crypttab and /etc/fstab. I have some ideas about how to do it, but it’ll probably take a few tries to get the abstractions right.

Let’s imagine that code is all in place, such that Consfigurator can be pointed at a block device and it will install a bootable Debian system to it. Then to install Debian to my laptop I’d just need to take my laptop’s disk drive out and plug it into another system, and run Consfigurator on that system, as root, pointed at the block device representing my laptop’s disk drive. For virtual machines, it would be easy to write code which loop-mounts an empty disk image, and then Consfigurator could be pointed at the loop-mounted block device, thereby making the disk image file bootable.

This is adequate for virtual machines, or small single-board computers with tiny storage devices (not that I actually use any of those, but I want Consfigurator to be able to make disk images for them!). But it’s not much good for my laptop. I casually referred to taking out my laptop’s disk drive and connecting it to another computer, but this would void my laptop’s warranty. And Consfigurator would not be able to update my laptop’s NVRAM, as is needed on UEFI systems.

What’s wanted here is a live system which can run Consfigurator directly on the laptop, pointed at the block device representing its physical disk drive. Ideally this live system comes with a chroot with the root filesystem for the new Debian install already built, so that network access is not required, and all Consfigurator has to do is partition the drive and copy in the contents of the chroot. The live system could be set up to automatically start doing that upon boot, but another option is to just make Consfigurator itself available to be used interactively. The user boots the live system, starts up Emacs, starts up Lisp, and executes a Consfigurator deployment, supplying the block device representing the laptop’s disk drive as an argument to the deployment. Consfigurator goes off and partitions that drive, copies in the contents of the chroot, and executes grub-install to make the laptop bootable. This is also much easier to debug than a live system which tries to start partitioning upon boot. It would look something like this:

    ;; melete.silentflame.com is a Consfigurator host object representing the
    ;; laptop, including information about the partitions it should have
    (deploy-these :local ...
      (chroot:partitioned-and-installed
        melete.silentflame.com "/srv/chroot/melete" "/dev/nvme0n1"))

Now, building live systems is a fair bit more involved than installing Debian to a disk drive and making it bootable, it turns out. While I want Consfigurator to be able to completely replace the Debian Installer, I decided that it is not worth trying to reimplement the relevant parts of the Debian Live tool suite, because I do not need to make arbitrary customisations to any live systems. I just need to have some packages installed and some files in place. Nevertheless, it is worth teaching Consfigurator how to invoke Debian Live, so that the customisation of the chroot which isn’t just a matter of passing options to lb_config(1) can be done with Consfigurator. This is what I’ve ended up with – in Consfigurator’s source code:

(defpropspec image-built :lisp (config dir properties)
  "Build an image under DIR using live-build(7), where the resulting live
system has PROPERTIES, which should contain, at a minimum, a property from
CONSFIGURATOR.PROPERTY.OS setting the Debian suite and architecture.  CONFIG
is a list of arguments to pass to lb_config(1), not including the '-a' and
'-d' options, which Consfigurator will supply based on PROPERTIES.

This property runs the lb_config(1), lb_bootstrap(1), lb_chroot(1) and
lb_binary(1) commands to build or rebuild the image.  Rebuilding occurs only
when changes to CONFIG or PROPERTIES mean that the image is potentially
out-of-date; e.g. if you just add some new items to PROPERTIES then in most
cases only lb_chroot(1) and lb_binary(1) will be re-run.

Note that lb_chroot(1) and lb_binary(1) both run after applying PROPERTIES,
and might undo some of their effects.  For example, to configure
/etc/apt/sources.list, you will need to use CONFIG not PROPERTIES."
  (:desc (declare (ignore config properties))
         #?"Debian Live image built in ${dir}")
  (let* (...)
    ;; ...
    `(eseqprops
      ;; ...
      (on-change
          (eseqprops
           (on-change
               (file:has-content ,auto/config ,(auto/config config) :mode #o755)
             (file:does-not-exist ,@clean)
             (%lbconfig ,dir)
             (%lbbootstrap t ,dir))
           (%lbbootstrap nil ,dir)
           (deploys ((:chroot :into ,chroot)) ,host))
        (%lbchroot ,dir)
        (%lbbinary ,dir)))))

Here, %lbconfig is a property running lb_config(1), %lbbootstrap one which runs lb_bootstrap(1), etc. Those properties all just change directory to the right place and run the command, essentially, with a little extra code to handle failed debootstraps and the like.

The ON-CHANGE and ESEQPROPS combinators work together to sequence the interaction of the Debian Live suite and Consfigurator.

  • In the innermost ON-CHANGE expression: create the file auto/config and populate it with the call to lb_config(1) that we need to make, as described in the Debian Live manual, chapter 6.

    • If doing so resulted in a change to the auto/config file – e.g. the user added some more options – ensure that lb_config(1) and lb_bootstrap(1) both get rerun.
  • Now in the inner ESEQPROPS expression, use DEPLOYS to configure the chroot, essentially by forking into the chroot and recursively reinvoking Consfigurator.

  • Finally, if any of the above resulted in a change being made, call lb_chroot(1) and lb_binary(1).

This way, we only rebuild the chroot if the configuration changed, and we only rebuild the image if the chroot changed.

Now over in my personal consfig:

(try-register-data-source
 :git-snapshot :name "consfig" :repo #P"src/cl/consfig/" ...)

(defproplist hybrid-live-iso-built :lisp ()
  "Build a Debian Live system in /srv/live/spw.

Typically this property is not applied in a DEFHOST form, but rather run as
needed at the REPL.  The reason for this is that otherwise the whole image will
get rebuilt each time a commit is made to my dotfiles repo or to my consfig."
  (:desc "Sean's Debian Live system image built")
  (live-build:image-built.
      '("--archive-areas" "main contrib non-free" ...)
      "/srv/live/spw"
    (os:debian-stable "buster" :amd64)
    (basic-props)
    (apt:installed "whatever" "you" "want")

    (git:snapshot-extracted "/etc/skel/src" "dotfiles")
    (file:is-copy-of "/etc/skel/.bashrc" "/etc/skel/src/dotfiles/.bashrc")

    (git:snapshot-extracted "/root/src/cl" "consfig")))

The first argument to LIVE-BUILD:IMAGE-BUILT. is additional arguments to lb_config(1). The third argument onwards are the properties for the live system. The cool thing is GIT:SNAPSHOT-EXTRACTED – the calls to this ensure that a copy of my Emacs configuration and my consfig end up in the live image, ready to be used interactively to install Debian, as described above. I’ll need to add something like (chroot:host-chroot-bootstrapped melete.silentflame.com "/srv/chroot/melete") too.

As with everything Consfigurator-related, Joey Hess’s Propellor is the giant upon whose shoulders I’m standing.

Thorsten Alteholz: My Debian Activities in March 2021

8 April, 2021 - 17:11

FTP master

Things never turn out the way you expect, so this month I was only able to accept 38 packages and rejected none. Due to the freeze, the overall number of packages that got accepted was 88.

Debian LTS

This was my eighty-first month that I did some work for the Debian LTS initiative, started by Raphael Hertzog at Freexian.

This month my all in all workload has been 30h. During that time I did LTS and normal security uploads of:

  • [DLA 2606-1] lxml security update for one CVE
  • [DSA 4880-1] lxml security update for one CVE
  • [DLA 2611-1] ldb security update for two CVEs
  • [DLA 2612-1] leptonlib security update for four CVEs

I also prepared debdiffs for unstable and/or buster for leptonlib and libebml, which for one reason or another did not result in an upload yet.

Last but not least I did some days of frontdesk duties.

Debian ELTS

This month was the thirty-third ELTS month.

During my allocated time I uploaded:

  • ELA-388-1 for zeromq3
  • ELA-390-1 for lxml
  • ELA-391-1 for jasper
  • ELA-393-1 for ldb
  • ELA-394-1 for leptonlib

Last but not least I did some days of frontdesk duties.

Other stuff

On my neverending golang challenge I uploaded (or sponsored for thola dependencies):
golang-github-tombuildsstuff-giovanni, golang-github-apparentlymart-go-userdirs, golang-github-apparentlymart-go-shquot, golang-github-likexian-gokit, olang-gopkg-mail.v2, golang-gopkg-redis.v5, golang-github-facette-natsort, golang-github-opentracing-contrib-go-grpc, golang-github-felixge-fgprof, golang-ithub-gogo-status, golang-github-leanovate-gopter, golang-github-opentracing-basictracer-go, golang-github-lightstep-lightstep-tracer-common, golang-github-o-sourcemap-sourcemap, golang-github-igm-pubsub, golang-github-igm-sockjs-go, golang-github-centrifugal-protocol, golang-github-mna-redisc, golang-github-fzambia-eagle, golang-github-centrifugal-centrifuge, golang-github-chromedp-sysutil, golang-github-client9-misspell, golang-github-knq-snaker, cdproto-gen, golang-github-mattermost-xml-roundtrip-validator, golang-github-crewjam-saml, ssllabs-scan, golang-uber-automaxprocs, golang-uber-goleak, golang-github-k0kubun-go-ansi, golang-github-schollz-progressbar, golang-github-komkom-toml, golang-github-labstack-echo, golang-github-inexio-go-monitoringplugin

Ryan Kavanagh: Writing BASIC-8 on the TSS/8

8 April, 2021 - 07:08

I recently discovered SDF’s PiDP-8. You can access it over SSH and watch the blinkenlights over its twitch stream. It runs TSS/8, a time-sharing operating system written in 1967 by Adrian van de Goor while a grad student here at CMU. I’ve been having fun tinkering with it, and I just wrote my first BASIC program1 since high school. It plots the graph of some user-specified univariate function. I don’t claim that it’s elegant or well-engineered, but it works!

10  DEF FNC(X) = 19 * COS(X/2)
20  FOR Y = 20 TO -20 STEP -1
30     FOR X = -25 TO 24
40     LET V = FNC(X)
50     GOSUB 90
60  NEXT X
70  PRINT ""
80  NEXT Y
85  STOP
90  REM SUBROUTINE PRINTS AXES AND PLOT
100 IF X = 0 THEN 150
110 IF Y = 0 THEN 150
120 REM X != 0 AND Y != 0 SO IN QUADRANT
130 GOSUB 290
140 RETURN
150 GOSUB 170
160 RETURN
170 REM SUBROUTINE PRINTS AXES (X = 0 OR Y = 0)
180 IF X + Y = 0 THEN 230
190 IF X = 0 THEN 250
200 IF Y = 0 THEN 270
210 PRINT "AXES INVARIANT VIOLATED"
220 STOP
230 PRINT "+";
240 GOTO 280
250 PRINT "I";
260 GOTO 280
270 PRINT "-";
280 RETURN
290 REM SUBROUTINE PRINTS FUNCTION GRAPH (X != 0 AND Y != 0)
300 IF 0 <= Y THEN 350
310 REM Y < 0
320 IF V <= Y THEN 410
330 REM Y < 0 AND Y < V SO OUTSIDE OF PLOT AREA
340 GOTO 390
350 REM 0 <= Y
360 IF Y <= V THEN 410
370 REM 0 <= Y  AND V < Y SO OUTSIDE OF PLOT AREA
380 GOTO 390
390 PRINT " ";
400 RETURN
410 PRINT "*";
420 RETURN
430 REM COPYRIGHT 2021 RYAN KAVANAGH RAK AT RAK.AC
440 END

It produces the following output:

                         I
                         I
*           **           I           **
*           **           I           **
**          **          *I*          **          *
**          **          *I*          **          *
**         ***          *I*          ***         *
**         ****         *I*         ****         *
**         ****         *I*         ****         *
**         ****         *I*         ****         *
**         ****        **I**        ****         *
***        ****        **I**        ****        **
***        ****        **I**        ****        **
***        ****        **I**        ****        **
***       *****        **I**        *****       **
***       ******       **I**       ******       **
***       ******       **I**       ******       **
***       ******       **I**       ******       **
***       ******       **I**       ******       **
***       ******      ***I***      ******       **
-------------------------+------------------------
    ******      ******   I   ******      ******
    ******      ******   I   ******      ******
    *****       ******   I   ******       *****
    *****       ******   I   ******       *****
    *****        *****   I   *****        *****
    *****        *****   I   *****        *****
    *****        *****   I   *****        *****
    *****        ****    I    ****        *****
    *****        ****    I    ****        *****
     ****        ****    I    ****        ****
     ****        ****    I    ****        ****
     ***         ****    I    ****         ***
     ***          ***    I    ***          ***
     ***          ***    I    ***          ***
     ***          ***    I    ***          ***
      **          **     I     **          **
      **          **     I     **          **
      *            *     I     *            *
                         I
                         I

Next up, I am going to try my hand at writing some FORTRAN or some FOCAL69. If you like tinkering with old systems, then you should give the TSS/8 a try.

  1. It’s written in the BASIC-8 dialect. ↩︎

Norbert Preining: Debian KDE/Plasma and Digikam Status 2021-04-07

7 April, 2021 - 08:16

Two months have passed since the last status update, but not much has changed since Debian is more or less frozen for the release of Bullseye, and only critical bugfixes are allowed. As reported before Debian/bullseye will have Plasma 5.20.5, Frameworks 5.78, Apps 20.12. Debian/experimental already carries Plasma 5.21.4 and Frameworks 5.80, and that is also the level at the OSC builds.

Debian Bullseye

We are in hard freeze now, and only targeted fixes are allowed, but Bullseye is carrying a good mixture consisting of the KDE Frameworks 5.78, including several backports of fixes from 5.79 to get smooth operation. Plasma 5.20.5, again with several cherry picks for bugs will be in Bullseye, too. The KDE/Apps are mostly at 20.12 level, and the KDE PIM group packages (akonadi, kmail, etc) are at 20.08.

Debian experimental

Frameworks 5.80 (and soon 5.81) and Plasma 5.21.4 are in Debian/experimental.

OBS packages

(short reminder: you need to import my OBS gpg key to make these repos work!)

The OBS packages as usual follow the latest release, and currently ship KDE Frameworks 5.80, KDE Apps 20.12.3, and Plasma 5.21.4. The package sources are as usual (note the different path for the Plasma packages and the App packages, containing the release version!), for Debian/unstable:

deb https://download.opensuse.org/repositories/home:/npreining:/debian-kde:/frameworks/Debian_Unstable/ ./
deb https://download.opensuse.org/repositories/home:/npreining:/debian-kde:/plasma521/Debian_Unstable/ ./
deb https://download.opensuse.org/repositories/home:/npreining:/debian-kde:/apps2012/Debian_Unstable/ ./
deb https://download.opensuse.org/repositories/home:/npreining:/debian-kde:/other/Debian_Unstable/ ./

and the same with Testing instead of Unstable for Debian/testing.

Digikam

Digikam has seen a new release 7.2.0, and packages are available in my OBS archives:

deb https://download.opensuse.org/repositories/home:/npreining:/debian-kde:/other/Debian_Unstable/ ./

and again, same with Testing instead of Unstable for Debian/testing.

Jelmer Vernooij: Automatic Fixing of Debian Build Dependencies

7 April, 2021 - 04:46

In my last blogpost, I introduced the buildlog consultant - a tool that can identify many reasons why a Debian build failed.

For example, here’s a fragment of a build log where the Build-Depends lack python3-setuptools:

849
850
851
852
853
854
855
856
857
858
 dpkg-buildpackage: info: host architecture amd64
  fakeroot debian/rules clean
 dh clean --with python3,sphinxdoc --buildsystem=pybuild
    dh_auto_clean -O--buildsystem=pybuild
 I: pybuild base:232: python3.9 setup.py clean
 Traceback (most recent call last):
   File "/<<PKGBUILDDIR>>/setup.py", line 2, in <module>
     from setuptools import setup
 ModuleNotFoundError: No module named 'setuptools'
 E: pybuild pybuild:353: clean: plugin distutils failed with: exit code=1: python3.9 setup.py clean

The buildlog consultant can identify the line in bold as being key, and interprets it:

 % analyse-sbuild-log --json ~/build.log

 {"stage": "build", "section": "Build", "lineno": 857, "kind": "missing-python-module", "details": {"module": "setuptools", "python_version": 3, "minimum_version": null}}
Automatically acting on buildlog problems

A common reason why Debian builds fail is missing dependencies or incorrect versions of dependencies declared in the package build depends.

Based on the output of the buildlog consultant, it is possible in many cases to determine what dependency needs to be added to Build-Depends. In the example given above, we can use apt-file to look for the package that contains the path /usr/lib/python3/dist-packages/setuptools/__init__.py - and voila, we find python3-setuptools:

 % apt-file search /usr/lib/python3/dist-packages/setuptools/__init__.py
 python3-setuptools: /usr/lib/python3/dist-packages/setuptools/__init__.py

The deb-fix-build command automates these steps:

  1. It builds the package using sbuild; if the package successfully builds then it just exits successfully
  2. It tries to identify the problem by looking through the build log; if it can't or if it's a problem it has seen before (but apparently failed to resolve), then it exits with a non-zero exit code
  3. It tries to find a dependency that can address the problem
  4. It updates Build-Depends in debian/control or Depends in debian/tests/control
  5. Go to step 1

This takes away the tedious manual process of building a package, discovering that a dependency is missing, updating Build-Depends and trying again.

For example, when I ran deb-fix-build while packaging saneyaml, the output looks something like this:

 % deb-fix-build
 Using output directory /tmp/tmpyz0nkgqq
 Using sbuild chroot unstable-amd64-sbuild
 Using fixers: …
 Building debian packages, running 'sbuild --no-clean-source -A -s -v'.
 Attempting to use fixer upstream requirement fixer(apt) to address MissingPythonDistribution('setuptools_scm', python_version=3, minimum_version='4')
 Using apt-file to search apt contents
 Adding build dependency: python3-setuptools-scm (>= 4)
 Building debian packages, running 'sbuild --no-clean-source -A -s -v'.
 Attempting to use fixer upstream requirement fixer(apt) to address MissingPythonDistribution('toml', python_version=3, minimum_version=None)
 Adding build dependency: python3-toml
 Building debian packages, running 'sbuild --no-clean-source -A -s -v'.
 Built 0.5.2-1- changes files at [‘saneyaml_0.5.2-1_amd64.changes’].

And in our Git repository, we see these changes as well:

% git log -p
 commit 5a1715f4c7273b042818fc75702f2284034c7277 (HEAD -> master)
 Author: Jelmer Vernooij <jelmer@jelmer.uk>
 Date:   Sun Apr 4 02:35:56 2021 +0100

     Add missing build dependency on python3-toml.

 diff --git a/debian/control b/debian/control
 index 5b854dc..3b27b73 100644
 --- a/debian/control
 +++ b/debian/control
 @@ -1,6 +1,6 @@
  Rules-Requires-Root: no
  Standards-Version: 4.5.1
 -Build-Depends: debhelper-compat (= 12), dh-sequence-python3, python3-all, python3-setuptools (>= 50), python3-wheel, python3-setuptools-scm (>= 4)
 +Build-Depends: debhelper-compat (= 12), dh-sequence-python3, python3-all, python3-setuptools (>= 50), python3-wheel, python3-setuptools-scm (>= 4), python3-toml
  Testsuite: autopkgtest-pkg-python
  Source: python-saneyaml
  Priority: optional

 commit f03047da80fcd8468ee231fbc4cf8488d7a0acd1
 Author: Jelmer Vernooij <jelmer@jelmer.uk>
 Date:   Sun Apr 4 02:35:34 2021 +0100

     Add missing build dependency on python3-setuptools-scm (>= 4).

 diff --git a/debian/control b/debian/control
 index a476cc2..5b854dc 100644
 --- a/debian/control
 +++ b/debian/control
 @@ -1,6 +1,6 @@
  Rules-Requires-Root: no
  Standards-Version: 4.5.1
 -Build-Depends: debhelper-compat (= 12), dh-sequence-python3, python3-all, python3-setuptools (>= 50), python3-wheel
 +Build-Depends: debhelper-compat (= 12), dh-sequence-python3, python3-all, python3-setuptools (>= 50), python3-wheel, python3-setuptools-scm (>= 4)
  Testsuite: autopkgtest-pkg-python
  Source: python-saneyaml
  Priority: optional
Using deb-fix-build

You can run deb-fix-build by installing the ognibuild package from unstable. The only requirements for using it are that:

  • The package is maintained in Git
  • A sbuild schroot is available for use
Caveats

deb-fix-build is fairly easy to understand, and if it doesn't work then you're no worse off than you were without it - you'll have to add your own Build-Depends.

That said, there are a couple of things to keep in mind:

  • At the moment, it doesn't distinguish between general, Arch or Indep Build-Depends.
  • It can only add dependencies for things that are actually in the archive
  • Sometimes there are multiple packages that can provide a file, command or python package - it tries to find the right one with heuristics but doesn't always get it right

Kees Cook: security things in Linux v5.9

6 April, 2021 - 06:24

Previously: v5.8

Linux v5.9 was released in October, 2020. Here’s my summary of various security things that I found interesting:

seccomp user_notif file descriptor injection
Sargun Dhillon added the ability for SECCOMP_RET_USER_NOTIF filters to inject file descriptors into the target process using SECCOMP_IOCTL_NOTIF_ADDFD. This lets container managers fully emulate syscalls like open() and connect(), where an actual file descriptor is expected to be available after a successful syscall. In the process I fixed a couple bugs and refactored the file descriptor receiving code.

zero-initialize stack variables with Clang
When Alexander Potapenko landed support for Clang’s automatic variable initialization, it did so with a pattern designed to really stand out in kernel crashes. Now he’s added support for doing zero initialization via CONFIG_INIT_STACK_ALL_ZERO, which besides actually be faster, has a few behavior benefits as well. “Unlike pattern initialization, which has a higher chance of triggering existing bugs, zero initialization provides safe defaults for strings, pointers, indexes, and sizes.” Like the pattern initialization, this feature stops entire classes of uninitialized stack variable flaws.

common syscall entry/exit routines
Thomas Gleixner created architecture-independent code to do syscall entry, since much of the kernel’s work during a syscall entry and exit is the same. There was no need to repeat this in each architecture, and having it implemented separately meant bugs (or features) might only get fixed (or implemented) in a handful of architectures. It means that features like seccomp become much easier to build since it wouldn’t need per-architecture implementations any more. Presently only x86 has switched over to the common routines.

SLAB kfree() hardening
To reach CONFIG_SLAB_FREELIST_HARDENED feature-parity with the SLUB heap allocator, I added naive double-free detection and the ability to detect cross-cache freeing in the SLAB allocator. This should keep a class of type-confusion bugs from biting kernels using SLAB. (Most distro kernels use SLUB, but some smaller devices prefer the slightly more compact SLAB, so this hardening is mostly aimed at those systems.)

new CAP_CHECKPOINT_RESTORE capability
Adrian Reber added the new CAP_CHECKPOINT_RESTORE capability, splitting this functionality off of CAP_SYS_ADMIN. The needs for the kernel to correctly checkpoint and restore a process (e.g. used to move processes between containers) continues to grow, and it became clear that the security implications were lower than those of CAP_SYS_ADMIN yet distinct from other capabilities. Using this capability is now the preferred method for doing things like changing /proc/self/exe.

debugfs boot-time visibility restriction
Peter Enderborg added the debugfs boot parameter to control the visibility of the kernel’s debug filesystem. The contents of debugfs continue to be a common area of sensitive information being exposed to attackers. While this was effectively possible by unsetting CONFIG_DEBUG_FS, that wasn’t a great approach for system builders needing a single set of kernel configs (e.g. a distro kernel), so now it can be disabled at boot time.

more seccomp architecture support
Michael Karcher implemented the SuperH seccomp hooks, Guo Ren implemented the C-SKY seccomp hooks, and Max Filippov implemented the xtensa seccomp hooks. Each of these included the ever-important updates to the seccomp regression testing suite in the kernel selftests.

stack protector support for RISC-V
Guo Ren implemented -fstack-protector (and -fstack-protector-strong) support for RISC-V. This is the initial global-canary support while the patches to GCC to support per-task canaries is getting finished (similar to the per-task canaries done for arm64). This will mean nearly all stack frame write overflows are no longer useful to attackers on this architecture. It’s nice to see this finally land for RISC-V, which is quickly approaching architecture feature parity with the other major architectures in the kernel.

new tasklet API
Romain Perier and Allen Pais introduced a new tasklet API to make their use safer. Much like the timer_list refactoring work done earlier, the tasklet API is also a potential source of simple function-pointer-and-first-argument controlled exploits via linear heap overwrites. It’s a smaller attack surface since it’s used much less in the kernel, but it is the same weak design, making it a sensible thing to replace. While the use of the tasklet API is considered deprecated (replaced by threaded IRQs), it’s not always a simple mechanical refactoring, so the old API still needs refactoring (since that CAN be done mechanically is most cases).

x86 FSGSBASE implementation
Sasha Levin, Andy Lutomirski, Chang S. Bae, Andi Kleen, Tony Luck, Thomas Gleixner, and others landed the long-awaited FSGSBASE series. This provides task switching performance improvements while keeping the kernel safe from modules accidentally (or maliciously) trying to use the features directly (which exposed an unprivileged direct kernel access hole).

filter x86 MSR writes
While it’s been long understood that writing to CPU Model-Specific Registers (MSRs) from userspace was a bad idea, it has been left enabled for things like MSR_IA32_ENERGY_PERF_BIAS. Boris Petkov has decided enough is enough and has now enabled logging and kernel tainting (TAINT_CPU_OUT_OF_SPEC) by default and a way to disable MSR writes at runtime. (However, since this is controlled by a normal module parameter and the root user can just turn writes back on, I continue to recommend that people build with CONFIG_X86_MSR=n.) The expectation is that userspace MSR writes will be entirely removed in future kernels.

uninitialized_var() macro removed
I made treewide changes to remove the uninitialized_var() macro, which had been used to silence compiler warnings. The rationale for this macro was weak to begin with (“the compiler is reporting an uninitialized variable that is clearly initialized”) since it was mainly papering over compiler bugs. However, it creates a much more fragile situation in the kernel since now such uses can actually disable automatic stack variable initialization, as well as mask legitimate “unused variable” warnings. The proper solution is to just initialize variables the compiler warns about.

function pointer cast removals
Oscar Carter has started removing function pointer casts from the kernel, in an effort to allow the kernel to build with -Wcast-function-type. The future use of Control Flow Integrity checking (which does validation of function prototypes matching between the caller and the target) tends not to work well with function casts, so it’d be nice to get rid of these before CFI lands.

flexible array conversions
As part of Gustavo A. R. Silva’s on-going work to replace zero-length and one-element arrays with flexible arrays, he has documented the details of the flexible array conversions, and the various helpers to be used in kernel code. Every commit gets the kernel closer to building with -Warray-bounds, which catches a lot of potential buffer overflows at compile time.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.10.

© 2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

Pages

Creative Commons License ลิขสิทธิ์ของบทความเป็นของเจ้าของบทความแต่ละชิ้น
ผลงานนี้ ใช้สัญญาอนุญาตของครีเอทีฟคอมมอนส์แบบ แสดงที่มา-อนุญาตแบบเดียวกัน 3.0 ที่ยังไม่ได้ปรับแก้