I Tried Bend

Introduction
Quick background
Goal
Attempt 1: Lists
Attempt 2: Maps
Attempt 3: Maps, merge-sort style
fold is basically just match, but why use fold?
Attempt 4: No maps, custom trees
In general, implementations must be tailored for Bend
Too much matching
Speed test
Conclusion

Introduction

I decided to try the new Bend programming language.

Bend has two main goals:

In Bend, if your code can run in parallel, it will run in parallel
Bend is as easy to write as Python

This post is my attempt at a case study with my thoughts about the language mixed in.

Disclaimer: this post almost certainly has errors and misunderstandings about Bend. My background is as a university math and CS student with some functional programming knowledge.

Code is in this repo.

Quick background

Bend uses two keywords not common in mainstream languages: fold and bend. The idea of fold is to essentially do a Ctrl+F find and replace in a recursive data structure: if I wanted to replace all the leaf nodes (with values) in a tree with their doubles, I would write the following function:

# This is the default type Tree
type Tree:
  Node { ~left, ~right }
  Leaf { value }

def doubleTree(t):
	fold t:
		case Tree/Node:
			return Tree/Node(t.left, t.right)
		case Tree/Leaf:
			return Tree/Leaf(t.value * 2)

When we have our Tree t, we treat it like a linked list where we “go to” its left and right nodes whenever it’s not a leaf, and we double whenever we are at the leaf.

The unseen bit of logic here is that whenever we write t.left, or t.right, since those are defined to be recursive (as denoted by the ~ in the type definition), they are Trees themselves. Thus, the fold folds back and runs the whole chunk of logic on that object too. The Leaf case terminates because value is not a recursive entry.

This is essentially a Breadth First Search on the tree, where each branch can be computed in parallel. Opportunities for parallel computing like this are how Bend is able to multithread programs.

What we wrote is (mostly) equivalent to this C program:

void doubleTree(Tree* t) {
    // No tree
    if (!t) return;
    
    // Leaf
    if (!t->left && !t->right) {
        t->value *= 2;
        return;
    }
    
    // Non-leaf
    doubleTree(t->left);
    doubleTree(t->right);
}

The bend operation is somewhat similar to fold, but it’s used to build up a structure rather than operate on an existing one. The following bend creates a tree of depth d:

def makeTree(d):
	bend d:
		when d > 0:
			t = Tree/Node(fork(d-1), fork(d-1))
		else:
			t = Tree/Leaf(1)
	return t

Here, fork is used to build another tree with a different argument to bend; each fork loops back to the beginning of bend. Fork is just an explicit version of the thing fold does where it finds and recurses on the recursive arguments.

Goal

I’m going to try to implement a function which multiplies two matrices.

Attempt 1: Lists

The first thing I tried was to naturally represent my vectors and matrices as lists and lists of lists respectively. This has a big problem. In Bend, Lists are linked lists. Each list is just a head (the first number) and a reference to the rest of the list. To compute the dot product of two equal length lists, we’d compute:

def dot(v1, v2):
	match v1:
		case List/Nil:
			return 0
		case List/Cons:
			match v2:
				case List/Nil:
					return 0 #should never get here
				case List/Cons:
					return v1.head*v2.head+dot(v1.tail,v2.tail)

(Note: each List is either a List/Nil or a List/Cons, where List/Nil is an empty list and List/Cons is not. When our tail is empty, it will be a List/Nil, but until then it will be a List/Cons.)

This computation is sequential because if the entry-wise products were 1, 2, 3, 4, 5, 6, 7, we’d be computing 1+ (2+ (3+ (4+ (5+ (6 + 7))))), which must be computed in consecutive order. Bend has no way to optimize a computation like this, as each + depends on another + to complete first.

It seems like any computation involving lists will take O(n) time - for this reason, I struggle to think of a situation where the List datatype would ever be useful in a language like Bend, assuming A) it is not used simply as an O(1) sized tuple, and B) the programmer who chose Bend would like to take full advantage of Bend’s main feature. I’ll mention a caveat later.

Attempt 2: Maps

Bend also has a map datatype. This map works similar to normal maps or dictionaries in other languages, and in theory can be used to simulate an indexable list which avoids the sequentiality of Lists.

The underlying structure of a Map is interesting; Maps represent the data as a binary tree where each node is either a Node or a Leaf. Leaves are just blank end caps in this case, and don’t hold data. Each Node either does or doesn’t contain a value, as well as a reference to left and right nodes. Nodes with no value contain a * instead.

This tree is designed such that every key (that we initialized the map with) corresponds with exactly one node, where for each key k (in binary), the corresponding node is the one we get to by reading the binary little-endian (least to most significant bit) and stepping left for 0, right for 1. For example, all odd numbers end up to the right of the root since they have last digit 1. Keys must be representable as nonnegative integers to make this scheme work.

For this approach, I directly use the map structure as a vector (so m[i] is the i’th index of the vector) and compute dot product recursively on the tree. The issue I ran into is that while I can “match” the difference between a Leaf and a Node, I couldn’t find a way to match the difference between a * and an actual value, so I couldn’t find a way to compute dot product recursively as v1.value * v2.value + dot(v1.left, v2.left) + dot(v1.right, v2.right) without the * valued nodes messing up the computation.

Attempt 3: Maps, merge-sort style

Alternatively, I can use the map indexing syntax. My function should take the dot product of the left half and right half of the first map with the left half and right half of the second map respectively.

def dotProduct(v1, v2, l, r):
  if l == r:
    return v1[l] * v2[r]
  else:
    return dotProduct(v1, v2, l, (l+r)/2) + 
	       dotProduct(v1, v2, (l+r)/2+1, r)

This turns out to work, but it’s a bit less efficient than approach 2 would have been - while approach 2 would traverse the tree just once, this approach has to fetch every element individually starting from the top - a slowdown of log n.

We also have to deal with merging our results. When we multiply two vectors, we end up with lots of single numbers that need to go in a new vector, but no straightforward way to merge them all together and keep the correct indexing.

The best (and only), but still problematic way I found to do this was to merge two vectors (and later, two matrices) by pushing one element of the second vector into the first vector, then recursing until the first vector has all of the second vector’s elements. This requires that the second vector has nonstandard map indices starting at whatever the first vector ends at, or else indices will be messed up once the two vectors are mashed together.

This is definitely not computationally optimal because we are doing a ridiculous amount of extra work; Bend’s variables are immutable, so each time we move one element over, we have to copy the first vector or matrix (this is probably optimized under the hood, like in Haskell). We’re also moving each element to a different map O(log n) times instead of O(1) times if we just moved everything into one map by looping through the results. This was the best I could figure out under the restrictions of Bend, however.

Code for this approach is in map.bend.

fold is basically just match, but why use fold?

Quick tangent: Bend has a match keyword which takes a structure and does something with it based on what its type is. It turns out (as mentioned in the docs) that fold is syntactic sugar for match:

fold t:
	case Tree/Node:
		return Tree/Node(t.left, t.right)
	case Tree/Leaf:
		return Tree/Leaf(t.value * 2)

#Same computation using match wrapped in a function
def doubleTree(t):
	match t:
		case Tree/Node:
			return Tree/Node(doubleTree(t.left), 
			                 doubleTree(t.right))
		case Tree/Leaf:
			return Tree/Leaf(t.value * 2)

This also works if we pass arguments to fold (which requires a bit of extra syntax), since functions can take arguments too.

The issue with fold as a replacement of loops is that fold works best when the recursive calls only require themselves and some bookkeeping arguments. When we want to use two structures simultaneously, as in my case with dot products, the fold syntax becomes very long and awkward, as we have to pass the second structure as a bookkeeping argument and then match it to its type when we want to use it - we might as well use two nested match statements instead of a match inside a fold.

I also think using fold instead of match is rather confusing in general, as the way fold manages recursion is not obvious (to me) unless you’ve (I’ve) read and understood the Bend manual. I think match is pretty intuitive in contrast.

It can be seen that bend is also syntactic sugar for an if/else statement wrapped in a function (though bend uses the when else syntax; I don’t know if there’s a difference.)

Attempt 4: No maps, custom trees

Here, we trim the desire for baked-in indexability and use a barebones tree structure to represent matrices and vectors.

We create a MatrixTree datatype with four variations: Leaf (a single value), Vector (a tree where each left and right subtree is either a vector or a leaf), Matrix (a tree where each left and right subtree is either a matrix or a vector), and Nil.

type MatrixTree:
	#Single value in a vector
	Leaf { value }
	#Vector of values divided in two
	Vector {~left, ~right}
	#Matrix of vectors divided in two
	Matrix {~topHalf, ~bottomHalf}
	Nil

From here, our computations are very natural. To multiply two matrices, we need to repeatedly split the first matrix in half until it’s all rows, then multiply each row with the second matrix which we split the same way. Once we’ve recursed down to the level of two vectors, we can take the dot product by summing the dot products of their respective left and right sides.

Out of laziness, I made this approach work only good enough for matrices of size a power of 2, but I don’t think it’s hard to generalize it to any size with a couple more type matching cases.

Code for this approach is in tree.bend.

In general, implementations must be tailored for Bend

While “if your code can run in parallel, it will run in parallel”, what I learned from designing these two matrix data structures is that it turns out making sure your code can run in parallel is not so easy. You will almost certainly have to write or redesign your data to work with Bend in a way that you yourself know will be highly parallelizable - in other words, Bend isn’t magic.

This reduces the value proposition of Bend, at least for me, as I want to get my parallelism “for free” - I don’t want to put in on the order of the same amount of work as I would have if I started in python and then added threads myself. Sure, I don’t have to deal with actual parallelization code, like mutexes and condition variables, but it’s still quite a bit of wrangling to make Bend compute what I want in a way Bend can actually optimize.

One caveat is that we don’t have to make every operation parallel - in our case, we can choose to only parallelize the operation of initializing each dot product, and keep the dot products themselves as slow O(n) lists - bringing the real world computer into the picture, this is probably fine or even better than the fully parallelized alternative. In exchange we will get a somewhat simpler implementation.

Intuitively, I would expect that as soon as we’ve parallelized enough to utilize all of the separate processing units on our computer or GPU, further parallelization isn’t helpful - and this threshold can be reached pretty quickly.

Too much matching

This is a bit of an annoyance I had: every time I want to use a variable, I have to first match its type before I can reference its contents. My code has tall towers of match es and case s when I have any kind of nesting in my types.

Philosophically, it seems like Bend should either be strongly typed or not have this matching requirement. What Bend has is certainly not as ergonomic as Python, which does not care and just throws errors when variables aren’t a compatible type.

Speed test

To test the two implementations, I encoded the multiplication of two identity matrices. I’m sure more interesting matrices can be multiplied, but it’s quite a bit of effort to encode a particular matrix in either of the two bend schemes. To factor out the time it takes to initialize the matrix, I allow each program to run the computation either once or twice, then subtract. Note that Bend can run in both single-threaded and multithreaded mode, using the run and run-c arguments respectively.

I also wrote a naive python implementation which goes about the computation in a fairly standard dot-product-by-dot-product way, as well as a similar C implementation. I had ChatGPT generate a multithreaded version of the python code (don’t be surprised if it’s not good code). Finally, I threw in a Numpy version.

All tests were run on a GCP instance with 8 cores and 60 GB of memory except the C one, which I ran on my computer because the binary wasn’t compatible. I tried Bend’s CUDA execution on the GCP instance’s T4 GPU, but unfortunately kept running into CUDA errors I couldn’t troubleshoot.

In the table, we multiply two identity matrices of size 2^n by 2^n. Times are in seconds. MT means multithreaded. First four result columns are Bend.

n	MT Map	Map	Tree	MT Tree	Py	MT Py	C	Numpy
3	.31	.12
4	2.9	1.7
5	43	29	.48
6			3.9	1.1
7			33	3.3	.36	.27
8				20	2.7	2.2
9					24	19	.32
10							2.4
11							53
12								.37
13								3.0
14								22

Some notes:

Some of the fast bend trials were extremely noisy, sometimes taking longer than the next order of n; I’m not too interested in statistically sound results so I reran those ones.
Sometimes the larger bend trials ran out of memory.
Though I couldn’t get Bend’s CUDA version to work (not a fault of Bend), the docs report a 5x speedup while using CUDA over the regular multithreaded version - this would put the MT Tree on the same order as the (sequential) Python program. (Though this 5x speedup was computed on a small, sub 1s timescale, and I’d expect CUDA to have high multithreading overhead.)

There’s some interesting data here. First, we find that multithreading the map implementation actually slows it down, so it has a very high overhead for multithreading. It’s also about two orders slower than the tree version; this I expected because the tree version has a lot fewer bookkeeping and overall unnecessary computations involved.

Multithreading the tree version boosts it by a little over one order, putting it about one order behind the naive python program. (One order here is about a factor of 8 in raw computation, since the algorithm we compute is O(x^3) for matrices of size x). The multithreaded Python program only has a slight boost over its normal counterpart. From there, the C and Numpy programs are 2 and 5 orders ahead respectively (Numpy might take advantage of the identity-ness of the matrices).

Anecdotally, I also think there’s some role being played by memory and garbage collection; the first trial I run in Bend on a fresh restart seems less likely to hit OOM errors than later ones.

Conclusion

I think what I kind of wanted from Bend going into this is the ability to look at a piece of code like:

int x = 0;
for (int i = 0; i < 100; i++) {
	x += i;
}

and for Bend to read this and (through some math-y functional programming magic) say “Oh look! Each iteration of the for loop doesn’t actually depend on what happened in any of the other iterations. I can parallelize this.” and compile/transpile/run something like

void add_to_sum(int value, std::atomic<int>& result) {
    result += value;
}

atomic<int> x(0);
vector<thread> threads;

for (int i = 0; i < 100; i++) {
	threads.push_back(add_to_sum, i, ref(x));
}

for (auto& t : threads) {
	t.join();
}

(Under the assumption that addition is a really long and tedious computation. The relative ease of computing dot products is perhaps a major flaw in my case study, as it likely fails to play to Bend’s strength.)

Instead, it feels like I have to spell it out for Bend in my implementation. In other words, it feels like I myself would be able to write an interpreter for Bend code that can be maximally multithreaded - just spawn new threads whenever I hit a piece of logic with function calls I can handle separately. It would be really hard and definitely above my pay grade, but the point is I can see how it might be done.

I think Bend’s dream is interesting and cool - our computers have lots of cores and threads, but we almost never write our programs in a way that utilizes them on the first pass. In theory, Bend automates all the heavy lifting of creating threads, mutexes, and so on. In practice, Bend requires a ton of logic fine-tuning just to make a program compatible with Bend’s scheme of computation - and this is ignoring the current lack of raw performance, which I’m sure will improve rapidly over time. I’m interested to see how Bend progresses in development and where it will be a year or two from now.

n	MT Map	Map	Tree	MT Tree	Py	MT Py	C	Numpy
3	.31	.12
4	2.9	1.7
5	43	29	.48
6			3.9	1.1
7			33	3.3	.36	.27
8				20	2.7	2.2
9					24	19	.32
10							2.4
11							53
12								.37
13								3.0
14								22

n	MT Map	Map	Tree	MT Tree	Py	MT Py	C	Numpy
3	.31	.12
4	2.9	1.7
5	43	29	.48
6			3.9	1.1
7			33	3.3	.36	.27
8				20	2.7	2.2
9					24	19	.32
10							2.4
11							53
12								.37
13								3.0
14								22