Hacker News
a day ago by notacoward

Back in the mid-2000s I worked at a company that made their own (MIPS-based) chips. NSA was one of our customers - supposedly the "defense" who could be considered the good side of NSA compared to the 10x larger "offense" but still. As we were planning for our second generation, they offered quite a bit of money if we'd implement a "sheep and goats" instruction. It would take two operands: an input and a mask. The masked-in bits of the input (the "sheep") would be packed toward the MSB of the output, while the masked-out bits (the "goats") would be packed toward the LSB. We had a lot of people on staff with serious chops in all sorts of math including cryptography, but none of them could identify an algorithm that would benefit from having such an instruction (as distinct from more conventional range-based bitfield instructions). Since the company went under shortly afterward, it remained a mystery. I still wonder about it.
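For anyone trying to picture the semantics, here is a software sketch of the described operation (my own illustration of the behavior as stated above, not the proposed hardware):

```cpp
#include <cstdint>

// Illustrative model of the "sheep and goats" op: masked-in bits of x
// (the "sheep") are packed toward the MSB of the result, masked-out
// bits (the "goats") toward the LSB, each group keeping its original
// relative order.
uint32_t sheep_and_goats(uint32_t x, uint32_t mask) {
    uint32_t sheep = 0, goats = 0;
    int ns = 0, ng = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bit = (x >> i) & 1;
        if ((mask >> i) & 1)
            sheep |= bit << ns++;   // collect masked-in bits, LSB-first
        else
            goats |= bit << ng++;   // collect masked-out bits, LSB-first
    }
    // Place the sheep above the goats (64-bit intermediate avoids
    // undefined behavior when ng == 32).
    return (uint32_t)(((uint64_t)sheep << ng) | goats);
}
```

On x86 with BMI2, the same result can be assembled from two PEXT operations, one with the mask and one with its complement.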

a day ago by less_less

Half of this instruction is present in AMD64's BMI2 extension as PEXT, and the reverse operation as PDEP. Unlike "sheep and goats", PEXT just extracts the sheep into the LSB and ignores the goats.

If I recall the Knuth lecture correctly, given a "sheep and goats" instruction where one of the sets is packed in reverse order, you can implement any n-bit permutation in something like log2(n) instructions. I don't remember if this is true if they're both packed in forward order. But it would be nice for some hardware crypto designs, like DES or more recently GIFT.

PEXT has at least two additional use cases I know of: manipulating bit indices for databases, and binary (GF2) matrix manipulation. I've used it in a (non-crypto) project to select a subset of columns from a binary matrix, to convert it to systematic form. This subroutine also used popcount.
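The column-selection use looks roughly like this (a portable sketch with PEXT emulated in software; on BMI2 hardware the helper is just `_pext_u64`, and the function names here are my own):

```cpp
#include <cstdint>
#include <vector>

// Portable stand-in for _pext_u64 (BMI2): gather the bits of x
// selected by mask into the low end of the result.
uint64_t pext64(uint64_t x, uint64_t mask) {
    uint64_t res = 0;
    for (uint64_t bb = 1; mask; bb <<= 1) {
        if (x & mask & -mask)   // is the lowest remaining mask bit set in x?
            res |= bb;
        mask &= mask - 1;       // consume that mask bit
    }
    return res;
}

// Select a subset of columns from every row of a GF(2) matrix stored
// as one uint64_t per row -- the use case described above.
std::vector<uint64_t> select_columns(const std::vector<uint64_t>& rows,
                                     uint64_t colmask) {
    std::vector<uint64_t> out;
    for (uint64_t r : rows)
        out.push_back(pext64(r, colmask));
    return out;
}
```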

What I really wanted in that project was another "NSA instruction": bit matrix multiply. Cray supercomputers can multiply two 64x64 binary matrices in one instruction, though I have no idea how many cycles it takes. With AVX2, the best I could do is 6 instructions plus precomputation for 8x8 x 8x32, which is 1/128'th the work.

a day ago by thomasmg

Succinct (space-saving) data structures often need "rank" and "select" operations. Rank(n) is the number of 1 bits up to position n. Select(n) is the reverse: at which position is the n-th 1 bit.

For "rank", the "popcount" instruction can be used. Interestingly, for "select", the "PDEP" instruction can be used: you can put the data array in the PDEP mask, and 1 << n in the value; basically flip the operands. I found this quite fascinating. For details, there is a short paper on this: "A Fast x86 Implementation of Select".

I wonder if those succinct data structures are in any way related to what NSA is doing. I think not, but who knows.

16 hours ago by yvdriess

I've seen it used for large-scale genomics. Saving a few bits if you're dealing with billions of a thing is very useful. They're also vital for being able to pack as much of a datastructure (e.g. a graph) on a single node. Some graph algorithms, e.g. random walks, are latency bound and scale really badly in a distributed system.

a day ago by Bayart

The paper, if anyone wants to save some clicks : https://arxiv.org/abs/1706.00990

a day ago by someguydave

indeed, Cray famously said "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"

Unfortunately we only have 1024 chickens in modern computers.

a day ago by CalChris

Yes, but those chickens now are as powerful as Cray's oxen were then.

a day ago by lowbloodsugar

If you had to digest a million grains, which would you rather use?

a day ago by Enginerrrd

It's out of my depth, but my guess is on something DES-related. Here's a link to some possibly relevant discussion about it.


a day ago by Sniffnoy

Heh, so it's an instruction for INTERCAL's "select" (~) operator...

a day ago by pbsd

That sounds like a decent primitive to accelerate arbitrary bit permutations in software. It's known as GRP in, e.g., [1].

[1] http://palms.ee.princeton.edu/PALMSopen/shi00bit.pdf

a day ago by notacoward

Very interesting. This was published in 2000, and the people we worked with were near Princeton, so this result - the specific utility of such an instruction, if not the semantics themselves - might very well be something that was known to some relevant people but not yet widely enough for any of our people to recognize it. Thanks!

a day ago by dboreham

I have also been around in a company that made CPUs that initially had no bit count instruction. Then at some point the instruction was added. At the time I heard that "men in black with mirrored sunglasses" had shown up and demanded that the instruction be added. Whether or not this was an accurate description of events, you can see the note on page 74 in this document (section 8.2): http://www.transputer.net/iset/pdf/tis-acwg.pdf recording the instruction having been added.

Edit, I see Roger Shepherd (one of the people in the know at above mentioned company) commented in the comp.arch thread (which I vaguely remember reading at the time) but no mention of MIB...

a day ago by mjevans

The 'and goats' part leads me to conceptualizing the instruction more like:

Bit Scrambler / Chutes - shuffle bits around in a way that divides a stream.

This might also be useful in pre-filters for compression (entropy reduction) if you knew the content of the message. E.g. for ASCII text, the upper 2-3 bits of each letter could be packed off to the side for better compression and a smaller message.
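A toy version of that pre-filter idea (my own illustration, assuming plain ASCII input): split each byte into its top three bits, which are nearly constant for letters, and its low five bits, emitting two streams that a later entropy coder could handle separately.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Split ASCII text into a high-bit stream and a low-bit stream.
std::pair<std::vector<uint8_t>, std::vector<uint8_t>>
split_streams(const std::string& text) {
    std::vector<uint8_t> hi, lo;
    for (unsigned char c : text) {
        hi.push_back(c >> 5);    // top 3 bits: low entropy for letters
        lo.push_back(c & 0x1F);  // low 5 bits: most of the variation
    }
    return {hi, lo};
}
```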

As others have pointed out, modern CPUs ended up with 'half' that instruction, so I wonder if there were any other reasons for the full instruction.

2 days ago by st_goliath

Some years back, I got myself a copy of Andrew Hodges "Alan Turing - The Enigma", a biography and IMO generally a good read, but also with some gems regarding very early computing history in it.

Specifically, after WWII, Turing worked on the ACE 1 (later reduced to Pilot ACE) project to build an electronic computer, which didn't really progress due to management and bureaucracy overhead. He eventually went to Manchester, once they got their Manchester Mark 1 off the ground, which they tried to commercialize as "Ferranti Mark 1" (https://en.wikipedia.org/wiki/Ferranti_Mark_1).

While employed by the University, Turing IIRC continued to work on the side as an external consultant for whatever became of G.C. & C.S. According to the book, he convinced them to buy such a machine (presumably for cryptanalysis?) and, on the Manchester side of things, insisted on some modifications, including a "horizontal adder", so it could count the number of bits set in a word with a single instruction, i.e. a popcount instruction. This would pre-date the IBM Stretch mentioned in the article.

a day ago by tptacek

The consensus on the 1992 thread (including a really great comment from 'Animats) seems to be that `popcount` was generally not added to architectures at NSA's request --- that people familiar with those archs knew the actual reason `popcount` wound up in the ISA, and it preceded NSA purchases.


a day ago by Animats

The striking thing is that the IBM System/360 didn't have it. Nor does the System/370. Those were the standard mainframes for a generation.

IBM Z-series machines do have population count, finally.

a day ago by drichel

Counting bits was the bottleneck in the genomic scan I co-authored (Kanoungi et al. 2020). popcnt resulted in insane performance gains compared to all other methods.

However, we re-discovered the fact that some Intel CPUs, including the Nehalem mentioned in the article, have a bug that severely affects popcnt's performance; see for example here: https://github.com/komrad36/LATCH/issues/3#issuecomment-2671...

a day ago by adrian_b

It is possible that the "population count" instruction has been included in the instruction sets of most American supercomputers at the request of NSA, which was an important customer for them.

Nevertheless, the first computer having this instruction was a British computer, the Ferranti Mark I (February 1951).

The name used by Ferranti Mark I for this instruction was "sideways add".

Also notable was that Ferranti Mark I had the equivalent of LZCNT (count leading zeroes) too.

Both instructions are very useful and they are standard now for modern instruction sets, but they were omitted in most computers after Ferranti Mark I, except in expensive supercomputers.

a day ago by adrian_b

Moreover, Ferranti Mark I included a hardware random number generator, another feature useful for cryptography, which was reintroduced only recently in modern CPUs.

a day ago by iib

Hardware random number generators do have some security issues, though. Linux devs were opposed to relying solely on them, because they can be compromised by the vendor [1]. So they are at best used in algorithms that they cannot compromise (also in [1], further down, in the comments).

[1] https://web.archive.org/web/20180611180213/https://plus.goog...

a day ago by adrian_b

The security issues are not with hardware random number generators in general, but with those that are included inside complex devices like monolithic CPUs or TPMs, so that the owners of those devices cannot verify that the RNG's really do what they are claimed to do.

Discrete hardware RNG's, like that of the Ferranti Mark I, are perfectly secure.

For a modern device, the best way to implement a hardware RNG is to just include an ADC (analog-digital converter) input. Then you may connect externally on the PCB some noisy analog amplifier, e.g. one which has a noisy resistor or diode at its input. Digitizing the noise with the ADC will provide the random numbers and the ADC input can be isolated and tested separately at any time, so the user can verify that there is no hidden functionality.

Most microcontrollers have ADC inputs, so it is easy to add a secure hardware RNG for them. The same could be done for a personal computer by making a noisy amplifier that can be plugged in the microphone input, or by making a USB device with a microcontroller.

a day ago by ncmncm

Indeed, AMD has more than once shipped CPUs in which the random-number instruction would always yield the same value, that had to be monkey-patched to yield apparently random numbers. A valuable hint.

a day ago by st_goliath

I commented on that earlier, including that it probably also has a cryptanalysis background: https://news.ycombinator.com/item?id=27472900

But yes, it definitely pre-dates the 1961 IBM machine in the article.

a day ago by dwheeler

Obviously using a dedicated instruction is fastest in normal cases.

But if you need to implement popcount or many other bit manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr, 2003.

"Hacker's Delight", pages 65 and onward, discusses "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.

One approach is to set each 2-bit field to the count of 2 1-bit fields, then each 4-bit field to the count of 2 2-bit fields, etc., like this:

    x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
    x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
    x = (x & 0x0000ffff) + (x >> 16);
assuming x is 32 bits.

I think this approach is a classic divide-and-conquer solution.

a day ago by amalcon

A neat one for large, sparse integers is:

    int i;
    for (i = 0; x != 0; ++i) {
        x = (x - 1) & x;  // clear the rightmost set bit
    }
    return i;
This runs only a number of iterations equal to the number of 1 bits. It works because (x-1) flips the rightmost 1 bit and all zeroes to its right, and the & then zeroes all of those.

It's not that fast unless your integer is really sparse (since it has a branch), but I've always liked the bit hack.

a day ago by pbsd

When the integer is expected to be dense, you have the corresponding trick

    size_t count = sizeof(x) * 8;
    while (x != -1) {
        x |= x + 1;  // set the rightmost zero bit
        count--;
    }
    return count;
a day ago by throw5away

This is essentially equivalent to feeding the input through bitwise-NOT first. Unfortunately, there are far more integers that are neither sparse nor dense than integers that are sparse or dense.

a day ago by dlemire

You can also go faster using SIMD instructions if you need to compute wider population counts (beyond 64 bits):

Faster Population Counts Using AVX2 Instructions, Computer Journal, Volume 61, Issue 1, 2018 https://arxiv.org/abs/1611.07612

a day ago by dragontamer

> But if you need to implement popcount or many other bit manipulation algorithms in software

Power9, ARM, x86 BMI, Nvidia PTX, AMD GCN, and AMD RDNA all have a popcount instruction.

Yeah, all mainstream CPUs and GPUs made in the past decade...

Unfortunately, there's no system I can think of where you'd need the software solution anymore... Maybe if you wanted popcount on an Arduino??

a day ago by ncmncm

Yet, practically all software running on 64-bit x86 machines is compiled without, because the original amd64 released in 2003 lacked it, and distributions still target that. Likewise, MSVC. There would be good reasons for Apple XCode not to, but that doesn't mean they don't.

If you tell MSVC to issue a popcount instruction with "__popcnt64()" (etc.), it will. If you ask Gcc to issue a popcount instruction with "__builtin_popcount()", it will only do it if you have also told it to target an ISA that has one; otherwise it emulates.

The only portable way to get a popcount instruction, thus far, is to use C++'s std::bitset::count() in circumstances where the compiler believes the instruction would work. Pleasingly, Gcc and Clang are both happy to hold an integer type and its std::bitset representation in the same register at the same time, so there is no runtime penalty for a round-trip through std::bitset.

MSVC's standard library implementation of std::bitset does not use the popcount instruction.
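For reference, the portable round-trip described above is short:

```cpp
#include <bitset>
#include <cstdint>

// std::bitset<64>::count() compiles down to a single popcnt when the
// target ISA has one, and to the generic fallback otherwise.
inline int popcount64(uint64_t x) {
    return static_cast<int>(std::bitset<64>(x).count());
}
```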

12 hours ago by jart

Try the Cosmopolitan Libc implementation of popcnt(). It uses CPUID checks for compatibility which get hoisted out of a tight loop by the optimizer so there's no performance loss when building for -march=k8. If you build for -march=native then they get DCE'd entirely. See https://github.com/jart/cosmopolitan/blob/master/libc/bits/p...

2 days ago by dragontamer

GPU-programmers use popcount-based programming all the time these days, but the abstractions are built on top and are hardware accelerated.

CUDA's __activemask(); returns the 32-bit value of your current 32-wide EXEC mask. That is to say, if your current warp is:

    int foo = 0;
    if (threadIdx.x % 2 == 0) {
        foo = __activemask();
    }
foo will be "0b...01010101", i.e. 0x55555555, in the threads that took the branch. This __activemask() has a number of useful properties should you use __popc with it.

popcount(__activemask()); returns the number of threads executing.

__lanemask_lt() returns a mask of all lanes below the current one: 0 for the 0th lane, 0b1 for the 1st lane, 0b11 for the 2nd lane... and 0b0111...111 (31 ones) for the last, 31st lane.

popcount(__activemask() & lanemask_lt()); returns the "active lane count". All together now, we can make a parallel SIMD-stack that can push/pop together in parallel.

    int head = 0;
    char buffer[0x1000];

    while (fooBar()) { // Dynamic! We don't know who is, or is not, active anymore
        int localPrefix = __popc(__activemask() & __lanemask_lt());
        int totalWarpActive = __popc(__activemask());
        buffer[head + localPrefix] = generateValueThisThread();
        if (localPrefix == 0) {
            // Only one thread should move the head forward,
            // much like a "push" operation in single-threaded land
            head += totalWarpActive;
        }
        __syncthreads(); // Thread barrier: make sure everyone waits here before continuing
    }

As such, you can dynamically load-balance between GPU threads (!!!) from a shared stack with minimal overheads.

If you want to extend this larger than one 32-wide CUDA-warp, you'll need to use __shared__ memory to share the prefix with the rest of the block.

It is a bad idea (too much overhead) to extend this much larger than a block, as there's no quick way to communicate outside of your block. Still though, having chunks of up to 1024 threads synchronized through a shared data-structure that only has nanoseconds of overhead is a nifty trick.


EDIT: Oh right, and this concept is now replicated very, very quickly in the dedicated __ballot_sync(...) function (which compiles down to just a few assembly instructions).

Playing with the "Exec-mask" is a hugely efficient way to synchronously, and dynamically gather information across your warp. So lots of little tricks have been built around this.

15 hours ago by jiggawatts

Occasionally I am humbled to realise that even in IT there are vast fields of knowledge that are so far removed from my ordinary knowledge as to appear almost like magic.

This is one of those moments!

PS: I once wrote a 3D engine, but clearly my knowledge is now so out of date as to be practically stone age compared to this kind of thing...

2 days ago by 4gotunameagain

Another interesting application of popcount is in computer vision, namely in matching keypoints that use binary descriptors for 3D reconstruction in SLAM/TRN etc

2 days ago by jonatron

Yep, I've used __builtin_popcountll for ORB from OpenCV (256 bit binary descriptors).

2 days ago by 4gotunameagain

Looks like we've done similar things :)

Horror story: I was once developing a TRN system for a spacecraft instrument built around an ancient x86 processor that does not have popcnt; I ended up using a look-up table instead...

a day ago by solarexplorer

Did you know about HAKMEM 169? I guess it was/is not widely known since many people mention lookup tables as the only fast alternative to the popcnt instruction.
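For the curious, HAKMEM item 169 counts bits with a handful of arithmetic operations and a final modulus, no table needed (this is the classic 32-bit version):

```cpp
#include <cstdint>

// HAKMEM 169: compute the bit count of each octal digit in parallel
// (the two subtractions), fold adjacent digits into 6-bit sums, then
// add up the base-64 digits with a modulus by 63.
int popcount_hakmem(uint32_t n) {
    uint32_t tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
```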

