Planet Debian

Planet Debian - http://planet.debian.org/

Dariusz Dwornikowski: getting real "done date" of a bug from Debian BTS

19 September, 2014 - 15:17

As I wrote in my last post, currently neither the SOAP interface nor the Ultimate Debian Database provides the date when a given bug was closed (its done date). It is quite hard to calculate statistics for a bug tracker when you do not know when a bug was closed (!!).

The done date of a bug can be found in its log. The log itself can be downloaded with the SOAP method get_bug_log, but processing it is quite complicated. The same goes for scraping the BTS's web interface. Fortunately, the web interface makes it possible to download a bug's log in mbox format.

Below is a script that extracts the done date of a bug from its log in mbox format. It uses requests to download the mbox and caches the result in ~/.cache/rfs_bugs, which you need to create yourself. It performs the following checks:

  1. Check for the existence of a header such as Received: (at 657783-done) by bugs.debian.org; 29 Jan 2012 13:27:42 +0000
  2. Check for a CC: header containing NUMBER-close or NUMBER-done
  3. Check for a To: header containing NUMBER-close or NUMBER-done
  4. Check for Close: NUMBER in the body.

The code is below:

import requests
from datetime import datetime
import mailbox
import re
import os
import tempfile

def get_done_date(bug_num):

    CACHE_DIR = os.path.expanduser("~") + "/.cache/rfs_bugs/"

    def get_from_cache():
        # Return a previously cached done date, if any.
        if os.path.exists("{}{}".format(CACHE_DIR, bug_num)):
            with open("{}{}".format(CACHE_DIR, bug_num)) as f:
                return datetime.strptime(f.readlines()[0].rstrip(), "%Y-%m-%d").date()
        else:
            return None

    def try_header(text):
        # Check 1: a "Received: (at NNNNNN-done) by bugs.debian.org; ..." header.
        reg = r"Received:\s\(at\s\d+-(close|done)\)\s+by.+"
        try:
            line = re.search(reg, text).group(0)
            reg2 = r"\d{1,2}\s\w\w\w\s\d\d\d\d"
            return datetime.strptime(re.search(reg2, line).group(0), "%d %b %Y")
        except Exception:
            return None

    def try_cc(text):
        # Checks 2 and 3: a CC: or To: header mentioning NUMBER-done.
        reg = r"\(at\s.+\)\s+by\sbugs\.debian\.org;\s(\d{1,2}\s\w\w\w\s\d\d\d\d)"
        handle, name = tempfile.mkstemp()
        os.close(handle)
        with open(name, "w") as f:
            f.write(text.encode('latin-1'))
        mbox = mailbox.mbox(name)
        for i in mbox.items():
            if ('CC' in i[1] and "done" in i[1]['CC']) or ('To' in i[1] and "done" in i[1]['To']):
                try:
                    result = re.search(reg, i[1]['Received'])
                    return datetime.strptime(result.group(1), "%d %b %Y")
                except Exception:
                    return None
        return None

    def try_body(text):
        # Check 4: a close/done marker in the message body.
        reg = r"\(at\s.+\)\s+by\sbugs\.debian\.org;\s(\d{1,2}\s\w\w\w\s\d\d\d\d)"
        handle, name = tempfile.mkstemp()
        os.close(handle)
        with open(name, "w") as f:
            f.write(text.encode('latin-1'))
        mbox = mailbox.mbox(name)
        for i in mbox.items():
            if i[1].is_multipart():
                for m in i[1].get_payload():
                    if "close" in str(m) or "done" in str(m):
                        try:
                            result = re.search(reg, i[1]['Received'])
                            return datetime.strptime(result.group(1), "%d %b %Y")
                        except Exception:
                            return None
            else:
                if "close" in i[1].get_payload() or "done" in i[1].get_payload():
                    try:
                        result = re.search(reg, i[1]['Received'])
                        return datetime.strptime(result.group(1), "%d %b %Y")
                    except Exception:
                        return None
        return None

    done_date = get_from_cache()
    if done_date is not None:
        return done_date

    r = requests.get("https://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug={};mboxstatus=yes".format(bug_num))
    d = try_header(r.text)
    if d is None:
        d = try_cc(r.text)
    if d is None:
        d = try_body(r.text)
    if d is None:
        return None
    with open("{}{}".format(CACHE_DIR, bug_num), "w") as f:
        f.write("{}".format(d.date()))
    return d.date()

if __name__ == "__main__":
    print(get_done_date(752210))

PS: I hope that the script will not be needed in the near future, as Don Armstrong plans a new BTS database; a DebConf14 video about it is here.

Daniel Pocock: reSIProcate migration from SVN to Git completed

19 September, 2014 - 14:47

This week, the reSIProcate project completed the move from SVN to Git.

With many people using the SIP stack in both open source and commercial projects, the migration was carefully planned and tested over an extended period of time. Hopefully some of the experience from this migration can help other projects too.

Previous SVN committers were tracked down using my script for matching emails to Github accounts. This also allowed us to see their recent commits on other projects and see how they want their name and email address represented when their previous commits in SVN were mapped to Git commits.

For about a year, the sync2git script had been run hourly from cron to maintain an official mirror of the project in Github. This allowed people to test it and it also allowed us to start using some Github features like travis-CI.org before officially moving to Git.

At the cut-over, the SVN directories were made read-only, sync2git was run one last time and then people were advised they could commit in Git.

Documentation has also been created to help people get started quickly sharing patches as Github pull requests if they haven't used this facility before.

Paul Tagliamonte: Docker PostgreSQL Foreign Data Wrapper

19 September, 2014 - 09:49

For the tl;dr: Docker FDW is a thing. Star it, hack it, try it out. File bugs, be happy. If you want to see what it's like to read, there's some example SQL down below.

The first question is: what the heck is a PostgreSQL Foreign Data Wrapper? PostgreSQL Foreign Data Wrappers are plugins that allow C libraries to provide an adaptor for PostgreSQL to talk to an external database.

Some folks have used this to wrap stuff like MongoDB, which I always found to be hilarious (and an epic hack).

Enter Multicorn

During my time at PyGotham, I saw a talk from Wes Chow about something called Multicorn. He was showing off some really neat plugins, such as the git revision history of CPython, and parsed logfiles from some stuff over at Chartbeat. This basically blew my mind.

If you're interested in some of these, there are a bunch in the Multicorn VCS repo, such as the gitfdw example.
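
To give a feel for what these wrappers look like on the Python side, here is a minimal, hypothetical read-only wrapper using Multicorn's ForeignDataWrapper base class; the class and its hard-coded rows are purely illustrative and are not taken from any of the wrappers mentioned above:

from multicorn import ForeignDataWrapper


class StaticFdw(ForeignDataWrapper):
    """Toy wrapper exposing a hard-coded list of rows as a foreign table."""

    ROWS = [
        {"name": "debian", "version": "jessie"},
        {"name": "ubuntu", "version": "trusty"},
    ]

    def __init__(self, options, columns):
        super(StaticFdw, self).__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        # Multicorn calls execute() for every scan of the foreign table;
        # yielding one dict per row, keyed by column name, is all a
        # read-only wrapper needs to do.
        for row in self.ROWS:
            yield dict((col, row.get(col)) for col in self.columns)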

All throughout the talk I was coming up with all sorts of things that I wanted to do -- this whole library is basically exactly what I've been dreaming about for years. I've always wanted to provide a SQL-like interface into querying API data, joining data cross-API using common crosswalks, such as using Capitol Words to query for Legislators, and use the bioguide ids to JOIN against the congress api to get their Twitter account names.

My first shot was to Multicorn the new Open Civic Data API I was working on; I chuckled and put it aside as a really awesome hack.

Enter Docker

It wasn't until tianon connected the dots for me and suggested a Docker FDW that I got really excited. Cue a few hours of hacking, and I'm proud to say -- here's Docker FDW.

Currently it only implements reading from the API, but extending this to allow for SQL DELETE operations isn't out of the question, and it is likely to be implemented soon. This lets us ask all sorts of really interesting questions of the API, and might even help folks writing webapps avoid adding too much Docker-aware logic.

Setting it up

The only stumbling block you might find (at least on Debian and Ubuntu) is that you'll need a Multicorn `.deb`. It's currently undergoing an official Debianization from the Postgres team, but in the meantime I put the source and binary up on my people.debian.org. Feel free to use that while the Debian PostgreSQL team prepares the upload to unstable.

I'm going to assume you have a working Multicorn, PostgreSQL and Docker setup (including adding the postgres user to the docker group).

So, now let's pop open a psql session. Create a database (I called mine dockerfdw, but it can be anything), and let's create some tables.

Before we create the tables, we need to let PostgreSQL know where our objects are. This takes a name for the server, and the Python importable path to our FDW.

CREATE SERVER docker_containers FOREIGN DATA WRAPPER multicorn options (
    wrapper 'dockerfdw.wrappers.containers.ContainerFdw');

CREATE SERVER docker_image FOREIGN DATA WRAPPER multicorn options (
    wrapper 'dockerfdw.wrappers.images.ImageFdw');

Now that we have the server in place, we can tell PostgreSQL to create a table backed by the FDW by creating a foreign table. I won't go too much into the syntax here, but you might also note that we pass in some options - these are passed to the constructor of the FDW, letting us set stuff like the Docker host.

CREATE foreign table docker_containers (
    "id"          TEXT,
    "image"       TEXT,
    "name"        TEXT,
    "names"       TEXT[],
    "privileged"  BOOLEAN,
    "ip"          TEXT,
    "bridge"      TEXT,
    "running"     BOOLEAN,
    "pid"         INT,
    "exit_code"   INT,
    "command"     TEXT[]
) server docker_containers options (
    host 'unix:///run/docker.sock'
);


CREATE foreign table docker_images (
    "id"              TEXT,
    "architecture"    TEXT,
    "author"          TEXT,
    "comment"         TEXT,
    "parent"          TEXT,
    "tags"            TEXT[]
) server docker_image options (
    host 'unix:///run/docker.sock'
);

And, now that we have tables in place, we can try to learn something about the Docker containers. Let's start with something fun - a join from containers to images, showing all image tag names, the container names and the ip of the container (if it has one!).

SELECT docker_containers.ip, docker_containers.names, docker_images.tags
  FROM docker_containers
  RIGHT JOIN docker_images
  ON docker_containers.image=docker_images.id;
     ip      |            names            |                  tags                   
-------------+-----------------------------+-----------------------------------------
             |                             | {ruby:latest}
             |                             | {paultag/vcs-mirror:latest}
             | {/de-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/ny-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/ar-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
 172.17.0.47 | {/ms-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
 172.17.0.46 | {/nc-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/ia-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/az-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/oh-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/va-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
 172.17.0.41 | {/wa-openstates-to-ocd}     | {sunlightlabs/scrapers-us-state:latest}
             | {/jovial_poincare}          | {<none>:<none>}
             | {/jolly_goldstine}          | {<none>:<none>}
             | {/cranky_torvalds}          | {<none>:<none>}
             | {/backstabbing_wilson}      | {<none>:<none>}
             | {/desperate_hoover}         | {<none>:<none>}
             | {/backstabbing_ardinghelli} | {<none>:<none>}
             | {/cocky_feynman}            | {<none>:<none>}
             |                             | {paultag/postgres:latest}
             |                             | {debian:testing}
             |                             | {paultag/crank:latest}
             |                             | {<none>:<none>}
             |                             | {<none>:<none>}
             | {/stupefied_fermat}         | {hackerschool/doorbot:latest}
             | {/focused_euclid}           | {debian:unstable}
             | {/focused_babbage}          | {debian:unstable}
             | {/clever_torvalds}          | {debian:unstable}
             | {/stoic_tesla}              | {debian:unstable}
             | {/evil_torvalds}            | {debian:unstable}
             | {/foo}                      | {debian:unstable}
(31 rows)

Success! This is just a taste of what's to come, so please feel free to hack on Docker FDW, tweet me @paultag, file bugs / feature requests. It's currently a bit of a hack, and it's something that I think has long-term potential after some work goes into making sure that this is a rock solid interface to the Docker API.

Jaldhar Vyas: Scotland: Vote A DINNAE KEN

19 September, 2014 - 07:39

From the crack journalists at CNN.

Interesting fact: anyone who wore a kilt at debconf is allowed to vote in the referendum.

Jonathan McDowell: Automatic inline signing for mutt with RT

18 September, 2014 - 18:00

I spend a surprising amount of my time as part of keyring-maint telling people their requests are badly formed and asking them to fix them up so I can actually process them. The one that's hardest to fault anyone on is that we require requests to be inline PGP signed (i.e. the same sort of output as you get with "gpg --clearsign"). That's because RT does various pieces of unpacking[0] of MIME messages that mean that PGP/MIME signatures that have passed through it are no longer verifiable. Daniel has pointed out that inline PGP is a bad idea and got as far as filing a request that RT handle PGP/MIME correctly (you need a login for that but there's a generic read-only one that's easy to figure out), but until that happens the requirement stands when dealing with Debian's RT instance. So today I finally added the following lines to my .muttrc rather than having to remember to switch Mutt to inline signing for this one special case:

send-hook . "unset pgp_autoinline; unset pgp_autosign"
send-hook rt.debian.org "set pgp_autosign; set pgp_autoinline"

i.e. by default turn off auto inlined PGP signatures, but when emailing anything at rt.debian.org turn them on.

(Most of the other things I tell people to fix are covered by the replacing keys page; I advise anyone requesting a key replacement to read that page. There's even a helpful example request template at the bottom.)

[0] RT sticks a header on the plain text portion of the mail, rather than adding a new plain text part for the header if there are multiple parts (this is something Mailman handles better). It will also re-encode received mail into UTF-8 which I can understand, but Mutt will by default try to find an 8 bit encoding that can handle the mail, because that's more efficient, which tends to mean it picks latin1.

Dariusz Dwornikowski: RFS health in Debian

18 September, 2014 - 16:50

I am working on a small project to create WNPP-like statistics for open RFS bugs. I think this could improve the effectiveness of sponsoring new packages a little by giving insight into bugs that are on their way to being starved (i.e. never sponsored, or rotting in a queue).

The script attached to this post is written in Python and uses the Debbugs SOAP interface to get the currently open RFS bugs and calculate their dust and age.

The dust factor is calculated as the absolute value of the difference between the bug's age and its log_modified date.

Later I would like to create full-blown stats for the RFS queue, taking into account the whole history (i.e. 2012-01-01 until now), check its health, and calculate the MTTGS (mean time to get sponsored).
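
For what it's worth, once a sponsored (or done) date is available for each bug, the MTTGS calculation itself is trivial. Here is a rough sketch, with made-up (opened, sponsored) date pairs for illustration:

from datetime import date

def mttgs(bugs):
    # bugs is an iterable of (opened, sponsored) date pairs; bugs that
    # were never sponsored can be passed with sponsored=None and are skipped.
    deltas = [(sponsored - opened).days for opened, sponsored in bugs if sponsored]
    return sum(deltas) / float(len(deltas)) if deltas else None

print(mttgs([(date(2014, 1, 1), date(2014, 1, 11)),
             (date(2014, 2, 1), date(2014, 2, 21))]))  # -> 15.0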

The list looks more or less like this:

Age  Dust Number  Title
37   0    757966  RFS: lutris/0.3.5-1 [ITP]
1    0    762015  RFS: s3fs-fuse/1.78-1 [ITP #601789] -- FUSE-based file system backed by Amazon S3
81   0    753110  RFS: mrrescue/1.02c-1 [ITP]
456  0    712787  RFS: distkeys/1.0-1 [ITP] -- distribute SSH keys
120  1    748878  RFS: mwc/1.7.2-1 [ITP] -- Powerful website-tracking tool
1    1    762012  RFS: fadecut/0.1.4-1
3    1    761687  RFS: abraca/0.8.0+dfsg-1 -- Simple and powerful graphical client for XMMS2
35   2    758163  RFS: kcm-ufw/0.4.3-1 ITP
3    2    761636  RFS: raceintospace/1.1+dfsg1-1 [ITP]
....
....

The script rfs_health.py can be found below; it uses SOAPpy (which is Python 2 only, unfortunately).

#!/usr/bin/python

import SOAPpy
import time
from datetime import date, timedelta, datetime

url = 'http://bugs.debian.org/cgi-bin/soap.cgi'
namespace = 'Debbugs/SOAP'
server = SOAPpy.SOAPProxy(url, namespace)

class RFS(object):

    def __init__(self, obj):
        self._obj = obj
        self._last_modified = date.fromtimestamp(obj.log_modified)
        self._date = date.fromtimestamp(obj.date)
        if self._obj.pending != 'done':
            self._pending = "pending"
            self._dust = abs(date.today() - self._last_modified).days
        else:
            self._pending = "done"
            self._dust = abs(self._date - self._last_modified).days
        today = date.today()
        self._age = abs(today - self._date).days

    @property
    def status(self):
        return self._pending

    @property
    def date(self):
        return self._date

    @property
    def last_modified(self):
        return self._last_modified

    @property
    def subject(self):
        return self._obj.subject

    @property
    def bug_number(self):
        return self._obj.bug_num

    @property
    def age(self):
        return self._age

    @property
    def dust(self):
        return self._dust

    def __str__(self):
        return "{} subject: {} age:{} dust:{}".format(self._obj.bug_num, self._obj.subject, self._age, self._dust)


if __name__ == "__main__":

    bugi = server.get_bugs("package", "sponsorship-requests", "status", "open")
    buglist = [RFS(b.value) for b in server.get_status(bugi).item]
    buglist_sorted_by_dust = sorted(buglist, key=lambda x: x.dust, reverse=False)
    print("Age  Dust Number  Title")
    for i in buglist_sorted_by_dust:
        print("{:<4} {:<4} {:<7} {}".format(i.age, i.dust, i.bug_number, i.subject))

Jaldhar Vyas: Scotland: Vote NO

18 September, 2014 - 12:21
        _  __<;
      </_/ _/__   
     /> >  7   )  
     ~;</7    /   
     /> /   _*<---- Perth    
     ~ </7  7~\_  
        </7     \ 
         /_ _ _ | 

If you don't, the UK will have to rename itself the K. And that's just silly.

Also vote yes on whether Alex Trebek should keep his mustache.

Steve Kemp: If this goes well I have a new blog engine

18 September, 2014 - 01:23

Assuming this post shows up then I'll have successfully migrated from Chronicle to a temporary replacement.

Chronicle is awesome, and despite a lack of activity recently it is not dead. (No activity because it continued to do everything I needed for my blog.)

Unfortunately, though, there is a problem with chronicle: it suffers from a bit of a performance problem which has gradually become more and more vexing as the number of entries I have has grown.

When chronicle runs:

  • It reads each post into a complex data-structure.
  • Then it walks this multiple times.
  • Finally it outputs a whole bunch of posts.

In the general case you rebuild a blog because you've made a new entry, or received a new comment. There is some code which tries to use memcached for caching, but in general chronicle just isn't fast and it is certainly memory-bound if you have a couple of thousand entries.

Currently my test data-set contains 2000 entries and to rebuild that from a clean start takes around 4 minutes, which is pretty horrific.

So what is the alternative? What if you could parse each post once, add it to an SQLite database, and then use that for writing your output pages? Instead of the complex data-structure in RAM and the need to parse a zillion files, you'd have a standard/simple SQL structure you could use to build a tag-cloud, an archive, etc. If you store the contents of the parsed blog along with the mtime of the source file, you can update it if the entry is changed in the future, as I sometimes make typos which I only spot once I've run make steve on my blog sources.
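
As an illustration of the idea (a Python/sqlite3 sketch with a made-up parser, not chronicle's actual Perl code), the mtime check could look something like this:

import os
import sqlite3

db = sqlite3.connect("blog.db")
db.execute("""CREATE TABLE IF NOT EXISTS posts (
                path TEXT PRIMARY KEY, mtime REAL, title TEXT, body TEXT)""")

def parse(path):
    # Stand-in for the real entry parser: first line is the title.
    with open(path) as f:
        title, _, body = f.read().partition("\n")
    return title, body

def refresh(path):
    # Only re-parse the entry if the source file changed since last time.
    mtime = os.path.getmtime(path)
    row = db.execute("SELECT mtime FROM posts WHERE path = ?", (path,)).fetchone()
    if row is None or row[0] < mtime:
        title, body = parse(path)
        db.execute("INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
                   (path, mtime, title, body))
        db.commit()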

Not surprisingly the newer code is significantly faster if you have 2000+ posts. If you've imported the posts into SQLite the most recent entries are updated in 3 seconds. If you're starting cold, parsing each entry, inserting it into SQLite, and then generating the blog from scratch the build time is still less than 10 seconds.

The downside is that I've removed features, obviously nothing that I use myself. Most notably the calendar view is gone, as is the ability to use date-based URLs. Less seriously there is only a single theme, which is what is used upon this site.

In conclusion, I wrote something last night which is a stepping stone between the current chronicle and chronicle2, which will appear in due course.

PS. This entry was written in markdown, just because I wanted to be sure it worked.

NOKUBI Takatsugu: Met with a debian developer from Germany

17 September, 2014 - 16:22

Last weekend, I (knok), Hideki (henrich) and Yutaka (gniibe) met with John Paul Adrian Glaubitz (glaubitz).

In the past, I met another German developer, Jens Schmalzing (jensen), in Japan. He was a good guy, but unfortunately he passed away in 2005.

I have an old OpenPGP key with his signature on it. It is a record of his activity, but the key is weak nowadays (1024D), so I have stopped using it, though I have not issued a revocation.

Anyway, glaubitz is also a good guy, and he loves old videogame consoles. gniibe gave him five Dreamcast consoles. I took him to SUPER POTATO, an old videogame shop, and he bought some software for the Virtual Boy.

DebConf 2015 will be held in Germany; I want to go if I can.

 

Matthew Garrett: ACPI, kernels and contracts with firmware

17 September, 2014 - 06:51
ACPI is a complicated specification - the latest version is 980 pages long. But that's because it's trying to define something complicated: an entire interface for abstracting away hardware details and making it easier for an unmodified OS to boot diverse platforms.

Inevitably, though, it can't define the full behaviour of an ACPI system. It doesn't explicitly state what should happen if you violate the spec, for instance. Obviously, in a just and fair world, no systems would violate the spec. But in the grim meathook future that we actually inhabit, systems do. We lack the technology to go back in time and retroactively prevent this, and so we're forced to deal with making these systems work.

This ends up being a pain in the neck in the x86 world, but it could be much worse. Way back in 2008 I wrote something about why the Linux kernel reports itself to firmware as "Windows" but refuses to identify itself as Linux. The short version is that "Linux" doesn't actually identify the behaviour of the kernel in a meaningful way. "Linux" doesn't tell you whether the kernel can deal with buffers being passed when the spec says it should be a package. "Linux" doesn't tell you whether the OS knows how to deal with an HPET. "Linux" doesn't tell you whether the OS can reinitialise graphics hardware.

Back then I was writing from the perspective of the firmware changing its behaviour in response to the OS, but it turns out that it's also relevant from the perspective of the OS changing its behaviour in response to the firmware. Windows 8 handles backlights differently to older versions. Firmware that's intended to support Windows 8 may expect this behaviour. If the OS tells the firmware that it's compatible with Windows 8, the OS has to behave compatibly with Windows 8.

In essence, if the firmware asks for Windows 8 support and the OS says yes, the OS is forming a contract with the firmware that it will behave in a specific way. If Windows 8 allows certain spec violations, the OS must permit those violations. If Windows 8 makes certain ACPI calls in a certain order, the OS must make those calls in the same order. Any firmware bug that is triggered by the OS not behaving identically to Windows 8 must be dealt with by modifying the OS to behave like Windows 8.

This sounds horrifying, but it's actually important. The existence of well-defined[1] OS behaviours means that the industry has something to target. Vendors test their hardware against Windows, and because Windows has consistent behaviour within a version[2] the vendors know that their machines won't suddenly stop working after an update. Linux benefits from this because we know that we can make hardware work as long as we're compatible with the Windows behaviour.

That's fine for x86. But remember when I said it could be worse? What if there were a platform that Microsoft weren't targeting? A platform where Linux was the dominant OS? A platform where vendors all test their hardware against Linux and expect it to have a consistent ACPI implementation?

Our even grimmer meathook future welcomes ARM to the ACPI world.

Software development is hard, and firmware development is software development with worse compilers. Firmware is inevitably going to rely on undefined behaviour. It's going to make assumptions about ordering. It's going to mishandle some cases. And it's the operating system's job to handle that. On x86 we know that systems are tested against Windows, and so we simply implement that behaviour. On ARM, we don't have that convenient reference. We are the reference. And that means that systems will end up accidentally depending on Linux-specific behaviour. Which means that if we ever change that behaviour, those systems will break.

So far we've resisted calls for Linux to provide a contract to the firmware in the way that Windows does, simply because there's been no need to - we can just implement the same contract as Windows. How are we going to manage this on ARM? The worst case scenario is that a system is tested against, say, Linux 3.19 and works fine. We make a change in 3.21 that breaks this system, but nobody notices at the time. Another system is tested against 3.21 and works fine. A few months later somebody finally notices that 3.21 broke their system and the change gets reverted, but oh no! Reverting it breaks the other system. What do we do now? The systems aren't telling us which behaviour they expect, so we're left with the prospect of adding machine-specific quirks. This isn't scalable.

Supporting ACPI on ARM means developing a sense of discipline around ACPI development that we simply haven't had so far. If we want to avoid breaking systems we have two options:

1) Commit to never modifying the ACPI behaviour of Linux.
2) Expose an interface that indicates which well-defined ACPI behaviour a specific kernel implements, and bump that whenever an incompatible change is made. Backward compatibility paths will be required if firmware only supports an older interface.

(1) is unlikely to be practical, but (2) isn't a great deal easier. Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing.

[1] Defined by implementation, not defined by specification
[2] Windows may change behaviour between versions, but always adds a new _OSI string when it does so. It can then modify its behaviour depending on whether the firmware knows about later versions of Windows.


Steinar H. Gunderson: The virtues of std::unique_ptr

17 September, 2014 - 06:30

Among all the changes in C++11, there's one that I don't feel has received enough attention: std::unique_ptr (or just unique_ptr; I'll drop the std:: from here on). The motivation is simple; assume a function like this:

Foo *func() {
        Foo *foo = new Foo;
        if (something_complicated()) {
                // Oops, something wrong happened
                return NULL;
        }
        foo->baz();
        return foo;
}

The memory leak is obvious; if something_complicated() returns false, we presumably leak foo. The classical fix is:

Foo *func() {
        Foo *foo = new Foo;
        if (something_complicated()) {
                delete foo;
                return NULL;
        }
        foo->baz();
        return foo;
}

But this is cumbersome and easy to get wrong. Tools like valgrind have made this a lot easier to detect, but that's a poor substitute; what we want is a coding style where it's deliberately hard to make mistakes. Enter unique_ptr:

Foo *func() {
        unique_ptr<Foo> foo(new Foo);
        if (something_complicated()) {
                // unique_ptr<Foo> destructor deletes foo for us!
                return NULL;
        }
        foo->baz();
        return foo.release();
}

So we have introduced a notion of ownership; the function (or, more precisely, scope) now owns the Foo object. The only way we can leave the function and not have it destroyed is through an explicit call to release() (which returns the raw pointer and clears the unique_ptr). We have smart pointer semantics, so we can use -> just as if we had a regular pointer. In any case, the runtime overhead over a regular pointer is exactly zero.

Ownership does, of course, extend just fine to classes:

class Bar {
public:
        Bar() : foo(new Foo) {}

private:
        unique_ptr<Foo> foo;
};

In this case, the Bar object owns the Foo object, and will destroy it when it goes out of scope without having to do a manual delete in the destructor, operator= and so on; not to mention that it will make your object non-copy-constructible, so you won't get that wrong by mistake. (In this case, you could do the same just by “Foo foo;” instead of using unique_ptr, of course, modulo the copy constructor behavior and heap behavior.)

So far, we could do all of this in C++03. But C++11 includes a very helpful extra piece of the puzzle, namely move semantics. These allow us to transfer the ownership safely:

class Bar {
public:
        Bar(unique_ptr<Foo> arg_foo) : foo(move(arg_foo)) {}

private:
        unique_ptr<Foo> foo;
};

void func() {
        unique_ptr<Foo> foo(new Foo);
        // Do something with foo.
        Bar bar(move(foo));
        // ...
}

Below the Bar constructor line, foo is empty, and bar owns the Foo object! And at no point was the object without an owner; if there's no more code in the function, bar will be destroyed immediately, and the Foo object with it (since it has ownership). It also deals just fine with exception safety.

If you program with unique_ptr, it is genuinely very hard to get memory leaks. And it's much better than Java-style garbage collection; you don't get the RAM overhead GC needs, your objects are destroyed at predictable times, and destructors are run, so you can get reliable behavior on things like file handles, sockets and the like, without having to resort to manual cleanup in a finally block. (In a sense, it's like a refcount that can only ever be 0 or 1.)

It sounds so innocuous on paper, but all great ideas are simple. So, go forth and unique_ptr!

Steve Kemp: Applications updating & phoning home

17 September, 2014 - 03:42

Personally I believe that any application packaged for Debian should neither phone home, attempt to download plugins over HTTP at run-time, or update itself.

On that basis I've filed #761828.

As a project we have guidelines for what constitutes a "serious" bug, which generally boil down to a package containing a security issue, causing data-loss, or being unusable.

I'd like to propose that these kinds of tracking "things" are equally bad. If consensus could be reached, that would be a good thing for the freedom of our users.

(Ooops I slipped into "us", "our user", I'm just an outsider looking in. Mostly.)

Petter Reinholdtsen: Speeding up the Debian installer using eatmydata and dpkg-divert

16 September, 2014 - 20:00

The Debian installer could be a lot quicker. When we install more than 2000 packages in Skolelinux / Debian Edu using tasksel in the installer, unpacking the binary packages takes forever. A part of the slow I/O issue was discussed in bug #613428, about too much file system sync-ing done by dpkg, which is the package responsible for unpacking the binary packages. Other parts (like code executed by postinst scripts) might also sync to disk during installation. All this sync-ing to disk does not really make sense to me. If the machine crashes half-way through, I start over; I do not try to salvage the half-installed system. So the failure that sync-ing is supposed to protect against, a hardware or system crash, is not really relevant while the installer is running.

A few days ago, I thought of a way to get rid of all the file system sync()-ing in a fairly non-intrusive way, without the need to change the code in several packages. The idea is not new, but I have not heard anyone propose the approach using dpkg-divert before. It depends on the small and clever package eatmydata, which uses LD_PRELOAD to replace the system functions for syncing data to disk with functions doing nothing, thus allowing programs to live dangerously while speeding up disk I/O significantly. Instead of modifying the implementation of dpkg, apt and tasksel (which are the packages responsible for selecting, fetching and installing packages), it occurred to me that we could just divert the programs away and replace them with a simple shell wrapper calling "eatmydata $program $@" to get the same effect. Two days ago I decided to test the idea, and wrapped up a simple implementation for the Debian Edu udeb.

The effect was stunning. In my first test it reduced the running time of the pkgsel step (installing tasks) from 64 minutes to less than 44 minutes (20 minutes shaved off the installation) on an old Dell Latitude D505 machine. I am not quite sure what the optimised time would have been, as I messed up the testing a bit, causing the debconf priority to get low enough for two questions to pop up during installation. As soon as I saw the questions I moved the installation along, but I do not know how long the questions were holding up the installation. I did some more measurements using Debian Edu Jessie, and got the results below. The time measured is the time stamp in /var/log/syslog between the "pkgsel: starting tasksel" and the "pkgsel: finishing up" lines, if you want to do the same measurement yourself. In Debian Edu, the tasksel dialog does not show up, and the timing thus does not depend on how quickly the user handles the tasksel dialog.

Machine/setup                  Original tasksel       Optimised tasksel      Reduction
Latitude D505 Main+LTSP LXDE   64 min (07:46-08:50)   <44 min (11:27-12:11)  >20 min 18%
Latitude D505 Roaming LXDE     57 min (08:48-09:45)   34 min (07:43-08:17)   23 min 40%
Latitude D505 Minimal          22 min (10:37-10:59)   11 min (11:16-11:27)   11 min 50%
Thinkpad X200 Minimal          6 min (08:19-08:25)    4 min (08:04-08:08)    2 min 33%
Thinkpad X200 Roaming KDE      19 min (09:21-09:40)   15 min (10:25-10:40)   4 min 21%

The test is done using a netinst ISO on a USB stick, so some of the time is spent downloading packages. The connection to the Internet was 100Mbit/s during testing, so downloading should not be a significant factor in the measurement. Download typically took a few seconds to a few minutes, depending on the amount of packages being installed.

The speedup is implemented by using two hooks in Debian Installer, the pre-pkgsel.d hook to set up the diverts, and the finish-install.d hook to remove the divert at the end of the installation. I picked the pre-pkgsel.d hook instead of the post-base-installer.d hook because I test using an ISO without the eatmydata package included, and the post-base-installer.d hook in Debian Edu can only operate on packages included in the ISO. The negative effect of this is that I am unable to activate this optimization for the kernel installation step in d-i. If the code is moved to the post-base-installer.d hook, the speedup would be larger for the entire installation.

I've implemented this in the debian-edu-install git repository, and plan to provide the optimization as part of the Debian Edu installation. If you want to test this yourself, you can create two files in the installer (or in an udeb). One shell script needs to go into /usr/lib/pre-pkgsel.d/, with content like this:

#!/bin/sh
set -e
. /usr/share/debconf/confmodule
info() {
    logger -t my-pkgsel "info: $*"
}
error() {
    logger -t my-pkgsel "error: $*"
}
override_install() {
    apt-install eatmydata || true
    if [ -x /target/usr/bin/eatmydata ] ; then
        for bin in dpkg apt-get aptitude tasksel ; do
            file=/usr/bin/$bin
            # Test that the file exists and has not been diverted already.
            if [ -f /target$file ] ; then
                info "diverting $file using eatmydata"
                printf "#!/bin/sh\neatmydata $bin.distrib \"\$@\"\n" \
                    > /target$file.edu
                chmod 755 /target$file.edu
                in-target dpkg-divert --package debian-edu-config \
                    --rename --quiet --add $file
                ln -sf ./$bin.edu /target$file
            else
                error "unable to divert $file, as it is missing."
            fi
        done
    else
        error "unable to find /usr/bin/eatmydata after installing the eatmydata package"
    fi
}

override_install

To clean up, another shell script should go into /usr/lib/finish-install.d/ with code like this:

#! /bin/sh -e
. /usr/share/debconf/confmodule
error() {
    logger -t my-finish-install "error: $@"
}
remove_install_override() {
    for bin in dpkg apt-get aptitude tasksel ; do
        file=/usr/bin/$bin
        if [ -x /target$file.edu ] ; then
            rm /target$file
            in-target dpkg-divert --package debian-edu-config \
                --rename --quiet --remove $file
            rm /target$file.edu
        else
            error "Missing divert for $file."
        fi
    done
    sync # Flush file buffers before continuing
}

remove_install_override

In Debian Edu, I placed both code fragments in a separate script edu-eatmydata-install and call it from the pre-pkgsel.d and finish-install.d scripts.

By now you might ask if this change should get into the normal Debian installer too. I suspect it should, but I am not sure the current debian-installer coordinators find it useful enough. It also depends on the side effects of the change. I'm not aware of any, but I guess we will see if the change is safe after some more testing. Perhaps there is some package in Debian depending on sync() and fsync() having an effect? Perhaps it should go into its own udeb, to allow those of us wanting to enable it to do so without affecting everyone.

Hideki Yamane: Intel 910 SSD 400GB - $420

16 September, 2014 - 16:10

Intel SSD 910 (400GB, SSDPEDOX400G301) is cheaper than ever in Japan - only $420 (and its spec sheet says "Recommended Customer Price BULK: $1929.00", wow).

Ritesh Raj Sarraf: apt-offline 1.5

16 September, 2014 - 02:17

I am very pleased to announce the release of apt-offline, version 1.5.

In version 1.4, the offline bug report functionality had to be dropped. In version 1.5, it is back again. apt-offline now uses the new Debian native BTS library. Thanks to its developers, this library is much slimmer and neater. The only catch is that it depends on the SOAPpy library, which is currently not part of the Python standard library. If you run apt-offline on Debian, you may not have to worry, as I will add a Recommends on that package. For users using it on Microsoft Windows, please ensure that you have the SOAPpy library installed. It is available on pypi.

The old bundled magic library has been replaced with the python-magic library that Debian ships. This library is derived from the file package and is portable to almost all Unixes. For Debian users, there will be a Recommends on it too.

There were also a bunch of old, outstanding, and annoying bugs that have been fixed in this release. For a full list of changes, please refer to the git logs.

With this release, apt-offline should be in good shape for the Jessie release.

apt-offline is available on Alioth @ https://alioth.debian.org/projects/apt-offline/


Keith Packard: xserver-pending-fixes

16 September, 2014 - 01:14
A Forest of X Server Changes

We’ve got about another month left in the X server merge window for 1.17 and I’ve written a small set of fixes which haven’t been reviewed yet for merging. I thought I’d advertise them a bit and see if I couldn’t encourage a few of you to take a look and see if they’re useful, correct and complete.

All of these are in my personal X server repository:

git://people.freedesktop.org/~keithp/xserver.git
Cleaning up the X Registry

Branch: registry-fixes

I’ll bet most of you don’t even know about this code. It serves as a database mapping various X enumerations to strings to aid in diagnostics. For the security extensions, SECURITY and XSELinux, it holds names for all of the request, event and errors in the core protocol and all registered extensions. For X-Resource, it has the names of the registered resource types.

The X registry gets the request, event and error data from a file, “protocol.txt”, which is installed in /usr/lib/xorg/protocol.txt on my machine. It gets the resource names as a part of resource type allocation.

So, what’s wrong with this? Three basic things:

  1. A simple bug — protocol.txt is left open while the server runs. This consumes a file descriptor for no good reason.

  2. protocol.txt is read and parsed even if the security extensions aren’t available. This wastes time and memory.

  3. The resource names are kept even if X-Resource isn’t in use.

The fixes remove the configure options for including the registry code; these functions are only used by the above extensions, so we can tell whether to include the code based solely on whether the extensions are being built.

Getting rid of the TCP listener by default

Branch: listen-fixes

We’ve had the ‘-nolisten’ option for a while now to disable inbound TCP connections. It’s useful for security reasons, but we’ve never enabled this by default. This patch sequence provides configure options for each of the listen sockets (tcp, unix and local), leaves unix and local enabled by default and disables tcp by default.

A new option, ‘-listen’, is added which allows the user to override the -nolisten defaults in case they actually want to use TCP connections to X.

Glamor bug fixes

branch: glamor-fixes

This branch fixes two bugs:

  1. Scale a large pixmap down to a small pixmap. This happens when you display enormous images in a web page. Iceweasel sends the whole huge image to X and uses Render to scale it to the screen. If the image is larger than a single texture, the X server splits it up into tiles, but the code which tries to perform the merged scale is just broken. Five patches fix this.

  2. Shader-based trapezoids. This code uses area coverage to compute trapezoids. That violates the Render spec, which requires point sampling. Further, the performance of these trapezoids is lower than software (by a lot). This one patch removes the code.

Present bug fixes

branch: present-fixes

A selection of small bug fixes:

  1. Clear pending flips at CloseScreen. This removes a reference to any pending flip pixmap, allowing it to be freed. Otherwise, we’ll leak memory across server reset.

  2. Add support for PresentOptionCopy. This has been in the protocol spec for a while, and was completely trivial to implement. However, it never got done. One tiny little patch.

  3. Expose the Present API to drivers via sdksyms.sh. Until now, the present extension APIs have only been available inside the X server. This exposes them to drivers. This took a few cleanup patches first.

Use Present for Glamor XV

branch: glamor-present-xv

Painting XV to the screen should be done at vblank time to avoid tearing. Present offers vblank synchronized operations. Hooking those two together required a few new present APIs to expose the vblank functionality outside of the present code, then a bit of glamor code to hook up that new API to the XV bits.

Switching Glamor to a GL core profile context

branch: glamor-core-profile

This patch set is still in progress, but demonstrates how close we are. We’ll be requiring OpenGL 3.3 for this so that we get texture swizzling, which is required for our single channel objects.

The changes present on the branch are:

  1. Switch single channel surfaces from GL_ALPHA to GL_RED.

  2. Use vertex array objects.

  3. Switch ephyr over to using a core 3.3 profile.

Still left to do is

  1. Switch Render code to VBOs

The core code uses VBOs everywhere, but the Render code doesn’t. This means that all Render drawing fails, which makes the resulting server not very useful.

My main objective for getting this done is to reduce memory usage by about 16MB, which is the space allocated for software rendering in Mesa in case someone does something which the hardware doesn't handle, and that can only happen with some legacy OpenGL APIs.

Please help out!

All of these friendly little patches are looking for a bit of review so that they can get merged before the 1.17 window closes.

Thomas Goirand: Backporting libjs-angularjs and libjs-d3 to Wheezy

16 September, 2014 - 01:13

If you didn’t notice, Javascript isn’t as simple as it used to be… Want to backport the 2 simple javascript libs? No problem. You then “just” need to backport a bunch of other packages which are build-dependencies… (and file #761670, #761672, and #761674 on the way when rebuilding…). Here’s the short list:

gyp
node-abbrev
node-ansi
node-ansi-color-table
node-archy
node-async
node-block-stream
node-combined-stream
node-contextify
node-cookie-jar
node-cssom
node-delayed-stream
node-diff
node-eyes
node-forever-agent
node-form-data
node-fstream
node-fstream-ignore
node-github-url-from-git
node-glob
node-graceful-fs
node-gyp
node-htmlparser
node-inherits
node-ini
node-jake
node-jsdom
node-json-stringify-safe
node-lockfile
node-lru-cache
node-marked
node-mime
node-minimatch
node-mkdirp
node-mute-stream
node-node-uuid
node-nopt
node-normalize-package-data
node-npmlog
node-once
node-optimist
node-osenv
node-qs
node-queue-async
node-read
node-read-package-json
node-request
node-retry
node-rimraf
node-semver
node-sha
node-sigmund
node-slide
node-smash
node-tar
node-tunnel-agent
node-uglify
node-underscore
node-utilities
node-vows
node-whic
node-which
node-wordwrap
nodejs
npm
ruby-ronn

Yes, that’s 66 packages above… And of course, backporting some ruby stuff makes sense… :)

Vincent Sanders: NetSurf 3.2

15 September, 2014 - 23:08
We recently released a new version of NetSurf. This was largely to address numerous small bugs, but it also included the persistent caching implementation I have written about previously. A release used to require the release manager (usually me) to perform a lot of manual processes, and while we had a checklist it was far too easy to miss things.

The Continuous Integration (CI) system, combined with signed release tags in git, has resulted in a greatly simplified process; indeed, it has become almost completely automated. The majority of the manual work is now confined to the tasks that require actual decision making and checking that we are releasing what was intended.

By having the CI system build the release binaries, the project now has a much clearer and, importantly, traceable process. I can recommend such a system to any project that produces releases, especially if they release binaries for any of their targets.

I have also managed to package and upload this version of NetSurf ready for the Debian Jessie release. I would like to thank Jonathan Wiltshire for his assistance in ensuring this was a good quality package.

The release incorporates the successfully merged work of Rupinder Singh, who was our GSoC 2014 student. Rupinder mainly made improvements to our core DOM implementation and was very responsive and enthusiastic throughout his time, despite the mentor team sometimes not being available.

This work goes towards improving NetSurf in the future by ensuring the underlying features are present in our core libraries. The GSoC mentors and all project developers are pleased with the results of this year's GSoC participation and would like to thank everyone involved in making our participation possible.

Along with the good news comes a little bad:
PowerPC Mac OS X
Despite repeated calls for assistance with new hardware and Java builds, none has been forthcoming, meaning that from this release we are no longer able to ship PowerPC builds for Mac OS X.

The main issue is that the last version of Mac OS X that runs on PPC is Leopard, and there is no viable Java 1.6 port, which is necessary for our CI system to run. Additionally, the fully loaded PPC Mac mini (kindly donated to us by Mythic Beasts) had become far too slow to keep up with our builds and was causing long delays.
Bugs
We have a lot of bugs; in fact, just during this release cycle we had 30 more bugs reported than we closed. So while the new bug reporting system has been a success and our users are reporting issues when they find them, the development team is not keeping up.

The failure to keep up stems from the underlying issue of lack of manpower. We have relatively few active developers, which is especially problematic when there are many users for a platform, such as RISC OS, but the maintainer is unable to commit enough time to fixing issues.

If you would like to help making NetSurf a better browser we are always happy to work with new contributors.

Junichi Uekawa: ARM assembly.

15 September, 2014 - 19:51
ARM assembly. I was reading up on some docs about the Unified Assembly Language (UAL), and got confused. I don't seem to be able to find a comprehensive doc about what works and what doesn't. Heh.

Julien Danjou: Python bad practice, a concrete case

15 September, 2014 - 19:09

A lot of people read up on good Python practice, and there's plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker's Guide to Python. Today I'd like to show a concrete case of code that I don't consider being the state of the art.

In my last article, where I talked about my new project Gnocchi, I wrote about how I tested, hacked on and then ditched whisper. Here I'm going to explain part of my thought process and a few things that raised my eyebrows when hacking on this code.

Before I start, please don't get the spirit of this article wrong. It's in no way a personal attack on the authors and contributors (whom I don't know). Furthermore, whisper is a piece of code that is in production in thousands of installations, storing metrics for years. While I can argue that I consider the code not to be following best practice, it definitely works well enough and is valuable to a lot of people.

Tests

The first thing that I noticed when trying to hack on whisper is the lack of tests. There's only one file containing tests, named test_whisper.py, and the coverage it provides is pretty low. One can check that using the coverage tool.

$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s
 
OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%


While one would think that 61% is "not so bad", taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that they, for example, use the library to store values into a database, but they never check if the results can be fetched and if the fetched results are accurate. Here's a good reason one should never blindly trust the test coverage percentage as a quality metric.

When I tried to modify whisper, because the tests do not check the entire cycle of the values fed into the database, I ended up making wrong changes but still had the tests pass.
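
For the record, the kind of round-trip check that is missing would only take a few lines. Something along these lines, using whisper's public create() and fetch() functions as shown in the excerpts below, plus update(), whose signature I am assuming from the project documentation:

import os
import tempfile
import time
import unittest

import whisper

class TestRoundTrip(unittest.TestCase):
    def test_update_then_fetch(self):
        # Store one value, fetch it back, and check it survived intact.
        path = os.path.join(tempfile.mkdtemp(), "test.wsp")
        whisper.create(path, [(1, 60)])       # 1-second resolution, 60 points
        now = int(time.time())
        whisper.update(path, 42.0, now)
        (start, end, step), values = whisper.fetch(path, now - 10, now + 1)
        self.assertIn(42.0, values)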

No PEP 8, no Python 3

The code doesn't respect PEP 8. A run of flake8 + hacking shows 732 errors… While this does not impact the code itself, it's more painful to hack on it than it is on most Python projects.

The hacking tool also shows that the code is not Python 3 ready as there is usage of Python 2 only syntax.

A good way to fix that would be to set up tox and add a few targets for PEP 8 checks and Python 3 tests. Even if the test suite is not complete, starting by having flake8 run without errors and the few unit tests working with Python 3 should put the project in a better light.

Not using idiomatic Python

A lot of the code could be simplified by using idiomatic Python. Let's take a simple example:

def fetch(path,fromTime,untilTime=None,now=None):
  fh = None
  try:
    fh = open(path,'rb')
    return file_fetch(fh, fromTime, untilTime, now)
  finally:
    if fh:
      fh.close()


That piece of code could be easily rewritten as:

def fetch(path,fromTime,untilTime=None,now=None):
  with open(path, 'rb') as fh:
    return file_fetch(fh, fromTime, untilTime, now)


This way, the function looks so simple that one can even wonder why it should exist at all. But why not.

Usage of loops could also be made more Pythonic:

for i,archive in enumerate(archiveList):
  if i == len(archiveList) - 1:
    break


could be actually:

for i, archive in enumerate(itertools.islice(archiveList, len(archiveList) - 1)):


That reduces the code size and makes it easier to read.

Wrong abstraction level

Also, one thing that I noticed in whisper, is that it abstracts its features at the wrong level.

Take the create() function, it's pretty obvious:

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  # Set default params
  if xFilesFactor is None:
    xFilesFactor = 0.5
  if aggregationMethod is None:
    aggregationMethod = 'average'

  #Validate archive configurations...
  validateArchiveList(archiveList)

  #Looks good, now we create the file and write the header
  if os.path.exists(path):
    raise InvalidConfiguration("File %s already exists!" % path)
  fh = None
  try:
    fh = open(path,'wb')
    if LOCK:
      fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

    aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
    oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
    maxRetention = struct.pack( longFormat, oldest )
    xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
    archiveCount = struct.pack(longFormat, len(archiveList))
    packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
    fh.write(packedMetadata)
    headerSize = metadataSize + (archiveInfoSize * len(archiveList))
    archiveOffsetPointer = headerSize

    for secondsPerPoint,points in archiveList:
      archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
      fh.write(archiveInfo)
      archiveOffsetPointer += (points * pointSize)

    #If configured to use fallocate and capable of fallocate use that, else
    #attempt sparse if configure or zero pre-allocate if sparse isn't configured.
    if CAN_FALLOCATE and useFallocate:
      remaining = archiveOffsetPointer - headerSize
      fallocate(fh, headerSize, remaining)
    elif sparse:
      fh.seek(archiveOffsetPointer - 1)
      fh.write('\x00')
    else:
      remaining = archiveOffsetPointer - headerSize
      chunksize = 16384
      zeroes = '\x00' * chunksize
      while remaining > chunksize:
        fh.write(zeroes)
        remaining -= chunksize
      fh.write(zeroes[:remaining])

    if AUTOFLUSH:
      fh.flush()
      os.fsync(fh.fileno())
  finally:
    if fh:
      fh.close()


The function is doing everything: checking if the file doesn't exist already, opening it, building the structured data, writing this, building more structure, then writing that, etc.

That means that the caller has to give a file path, even if it just wants a whisper data structure to store itself elsewhere. StringIO() could be used to fake a file handler, but it will fail if the call to fcntl.flock() is not disabled (and it is inefficient anyway).

There's a lot of other functions in the code, such as setAggregationMethod(), that mix the handling of the files (even doing things like os.fsync()) with the manipulation of structured data. This is definitely not a good design, especially for a library, as it turns out reusing these functions in a different context is near impossible; a sketch of a cleaner split is shown below.
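
As a sketch of what a better split could look like, one pure function could build the packed header while the caller decides where the bytes go. The format constants below are illustrative stand-ins for whisper's own metadataFormat, archiveInfoFormat and friends, not their actual values:

import struct

# Illustrative stand-ins for whisper's module-level format constants;
# the real values live in whisper.py and may differ.
METADATA_FORMAT = "!2LfL"
ARCHIVE_INFO_FORMAT = "!3L"
POINT_SIZE = struct.calcsize("!Ld")

def pack_header(archive_list, x_files_factor, aggregation_type):
    # Build the on-disk header as bytes, without touching any file.
    oldest = max(spp * points for spp, points in archive_list)
    header = struct.pack(METADATA_FORMAT, aggregation_type, oldest,
                         x_files_factor, len(archive_list))
    offset = (struct.calcsize(METADATA_FORMAT)
              + struct.calcsize(ARCHIVE_INFO_FORMAT) * len(archive_list))
    for spp, points in archive_list:
        header += struct.pack(ARCHIVE_INFO_FORMAT, offset, spp, points)
        offset += points * POINT_SIZE
    return header

def write_header(fh, header):
    # All the I/O concerns (locking, fsync, ...) stay here, so that
    # pack_header() remains reusable and easy to test in isolation.
    fh.write(header)

With that kind of split, a caller that wants to keep the structure somewhere other than a local file no longer has to fake a file handle at all.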

Race conditions

There are race conditions, for example in create() (see added comment):

if os.path.exists(path):
  raise InvalidConfiguration("File %s already exists!" % path)
fh = None
try:
  # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
  # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
  fh = open(path,'wb')


That code should be:

try:
  fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), 'wb')
except OSError as e:
  if e.errno == errno.EEXIST:
    raise InvalidConfiguration("File %s already exists!" % path)


to avoid any race condition.

Unwanted optimization

We saw earlier the fetch() function that is barely useful, so let's take a look at the file_fetch() function that it's calling.

def file_fetch(fh, fromTime, untilTime, now = None):
  header = __readHeader(fh)
  [...]


The first thing the function does is to read the header from the file handler. Let's take a look at that function:

def __readHeader(fh):
  info = __headerCache.get(fh.name)
  if info:
    return info

  originalOffset = fh.tell()
  fh.seek(0)
  packedMetadata = fh.read(metadataSize)

  try:
    (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
  except:
    raise CorruptWhisperFile("Unable to read header", fh.name)
  [...]


The first thing the function does is to look into a cache. Why is there a cache?

It actually caches the header with an index based on the file path (fh.name). Except that if one decides, for example, not to use a file and to cheat using StringIO, the handle does not have any name attribute, so this code path will raise an AttributeError.

One has to set a fake name manually on the StringIO instance, and it must be unique so nobody messes with the cache:

import StringIO
 
packedMetadata = <some source>
fh = StringIO.StringIO(packedMetadata)
fh.name = "myfakename"
header = __readHeader(fh)


The cache may actually be useful when accessing files, but it's definitely useless when not using files. And it's not necessarily true that the complexity (even if small) that the cache adds is worth it. I doubt most whisper-based tools are long-running processes, so the cache that is really used when accessing the files is the one handled by the operating system kernel, and that one is going to be much more efficient anyway, and shared between processes. There's also no expiry of that cache, which could end up with tons of memory used and wasted.

Docstrings

None of the docstrings are written in a parsable syntax like Sphinx. This means you cannot generate documentation in a nice format that a developer using the library could read easily.

The documentation is also not up to date:

def fetch(path,fromTime,untilTime=None,now=None):
  """fetch(path,fromTime,untilTime=None)
  [...]
  """

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  """create(path,archiveList,xFilesFactor=0.5,aggregationMethod='average')
  [...]
  """


This is something that could be avoided if a proper format was picked for the docstrings. A tool could then be used to notice when there's a divergence between the actual function signature and the documented one, like a missing argument.
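
For instance, a Sphinx-style docstring for fetch() could look like the following; the parameter descriptions are my guesses, for illustration only:

def fetch(path, fromTime, untilTime=None, now=None):
    """Fetch data points from a whisper file.

    :param str path: path to the whisper file
    :param int fromTime: epoch time to fetch data from
    :param int untilTime: epoch time to fetch data until (defaults to now)
    :param int now: override the notion of "now", mainly useful for tests
    :return: a ``(timeInfo, values)`` tuple
    """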

Duplicated code

Last but not least, there's a lot of code that is duplicated across the scripts provided by whisper in its bin directory. These scripts should be very lightweight and should use the console_scripts facility of setuptools, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the whisper.py library, which is against DRY.

Conclusion

There are a few more things that made me stop considering whisper, but these are part of the whisper features, not necessarily code quality. One can also point out that the code is very condensed and hard to read, and that's a more general problem about how it is organized and abstracted.

A lot of these defects are actually points that made me start writing The Hacker's Guide to Python a year ago. Running into this kind of code makes me think it was a really good idea to write a book on advice to write better Python code!

The Hacker's Guide to Python

A book I wrote talking about designing Python applications, state of the art, advice to apply when building your application, various Python tips, etc. Interested? Check it out.


Creative Commons License: the copyright of each article belongs to its respective author.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported license.