Tying Random Ends (Adventures in Optimization)

Simulating, curve-fitting, optimizing, and ultimately solving a probability puzzle about tying random string ends together.

In an effort to stave off the absolute existential dread that accompanies me with both a) doing my Master's in AI and b) being a software engineer in 2026, I find it soothing to go back to my roots and solve interesting problems for the hell of it.

This 3blue1brown YouTube short recently popped into my feed:

In short, the problem is this:

Suppose you drop nn random strings of yarn into a box. Then you randomly pick two ends of the strings and connect them. Sometimes, you will make a loop of string. Sometimes you will not. You repeat this process until there are no more ends left. For 50 strings, what is the expected number of loops that you will end up with?

It's short problems like these that I really enjoy solving. I like to call these kinds of questions "intellectual junk food" - small, tight, neat little problems that are made to be solved, and usually have a small "eureka moment" to give you that hit of dopamine your brain desperately craves. (Though I guess a more common term would be "puzzles" now that I think about it...) These questions give you a nice break from the messy real world, where solutions aren't so clear-cut.

When solving these types of problems, especially when dealing with probability, I like to do a computer simulation at first. However, something clicked this time when I watched this video. I realized that I could also come up with various functions, give them parameters, and fit the data to it using PyTorch. Cue all the actuaries giving me a disapproving look for just now realizing this, but hey, we all get there eventually, right?

Creating the Simulation

First up was the simulation logic. This is pretty self-explanatory. Since each string is fully defined by its ends, that's the only thing I have to keep track of. I number the ends from 0 up to 2n-1, put them in a list, and keep track of each end's corresponding other end on the string.

Now, note that when you connect two ends, they get removed from consideration. They either form a loop, or they get attached to another string. Either way, they aren't "ends" anymore.

Thus, we can remove the ends from the list. If the ends already correspond to each other, then we have a loop! Otherwise, we made a string and need to make the ends of this new string correspond to each other.

def perform(n_strings: int) -> int:
  all_ends: list[int] = list(range(2*n_strings))
  end_map: dict[int, int] = {}
  for a, b in batched(all_ends, 2):
    end_map[a] = b
    end_map[b] = a

  result = 0

  while len(all_ends) != 0:
    # Remove a random end from the list
    a = all_ends.pop(random.randrange(len(all_ends)))
    b = all_ends.pop(random.randrange(len(all_ends)))

    # Find the other corresponding end
    a_corresp = end_map[a]

    # If they belong to the same string, then increment result
    if a_corresp == b:
      result += 1
      continue

    b_corresp = end_map[b] # else, merge

    end_map[a_corresp] = b_corresp
    end_map[b_corresp] = a_corresp

  return result

I then sample this for a bunch of strings and trials for each string length.

def run_trials_python(
  max_n_strings: int,
  n_trials: int,
) -> pd.DataFrame:
  rows = []
  for n_strings in tqdm(range(1, max_n_strings + 1), desc="Gathering"):
    for _ in range(n_trials):
      rows.append({'x': n_strings, 'y': perform(n_strings)})

  return pd.DataFrame(rows)

Next up is to code some PyTorch models with tunable parameters. I have a feeling that this is going to increase sharply at first, and then slowly as the number of strings increase. I come up with 4 functions: one based on the harmonic series, a logarithm, a power function, and a polynomial function.

class HarmonicModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.a = torch.nn.Parameter(torch.tensor(1.0))
    self.b = torch.nn.Parameter(torch.tensor(0.0))
    self._H_cache: torch.Tensor | None = None  # Expensive!

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    if self._H_cache is None or self._H_cache.shape != x.shape:
      self._H_cache = torch.tensor([
        sum(1.0/k for k in range(1, int(n)+1)) for n in x.numpy()
      ])
    return self.a * self._H_cache + self.b

  def pretty(self) -> str:
    return f"{self.a.item():.4f} * H(n) + {self.b.item():.4f}"

class LogModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.a = torch.nn.Parameter(torch.tensor(1.0))
    self.b = torch.nn.Parameter(torch.tensor(0.0))

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    return self.a * torch.log(x) + self.b

  def pretty(self) -> str:
    return f"{self.a.item():.4f} * ln(n) + {self.b.item():.4f}"

class PowerModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.a = torch.nn.Parameter(torch.tensor(1.0))
    self.b = torch.nn.Parameter(torch.tensor(0.5))

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    return self.a * x ** self.b

  def pretty(self) -> str:
    return f"{self.a.item():.4f} * n^{self.b.item():.4f}"

class PolynomialModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.a = torch.nn.Parameter(torch.tensor(0.01))
    self.b = torch.nn.Parameter(torch.tensor(0.1))
    self.c = torch.nn.Parameter(torch.tensor(0.0))

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    return self.a * x**2 + self.b * x + self.c

  def pretty(self) -> str:
    return f"{self.a.item():.4f}n² + {self.b.item():.4f}n + {self.c.item():.4f}"

Now, fundamentally what we are doing is taking a large number of samples from model defined in perform, all independent and identically distributed. Because of the central limit theorem, we know that the sample mean becomes a good approximation of the true mean of the underlying distribution. In this case, that's the expected number of created loops which is exactly what we are trying to find!

Thus, I can fit the hypotheses using Adam and MSELoss, using the mean as the data. I also log the 95% confidence interval and r-squared for good measure.

def fit_hypotheses(
    trials_df: pd.DataFrame,
    models: dict[str, torch.nn.Module],
    lr: float = 0.01,
    epochs: int = 5000,
) -> tuple[pd.DataFrame, pd.DataFrame]:
  plot_df = trials_df.groupby('x')['y'].agg(
    mean='mean', std='std', count='count').reset_index()
  plot_df['ci95'] = 1.96 * plot_df['std'] / np.sqrt(plot_df['count'])

  x = torch.tensor(plot_df['x'].values, dtype=torch.float32)
  y = torch.tensor(plot_df['mean'].values, dtype=torch.float32)

  meta_rows = []

  for name, model in tqdm(models.items(), desc="Fitting", total=len(models)):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()

    for _ in range(epochs):
      optimizer.zero_grad()
      loss = loss_fn(model(x), y)
      loss.backward()
      optimizer.step()

    with torch.no_grad():
      fitted = model(x)
      ss_res = torch.sum((fitted - y) ** 2).item()
      ss_tot = torch.sum((y - y.mean()) ** 2).item()
      r2 = 1 - ss_res / ss_tot
      plot_df[name] = fitted.numpy()

    pretty = model.pretty()
    print(f"  {name}: {pretty} (R²={r2:.4f})")
    meta_rows.append({'name': name, 'pretty': pretty, 'r2': r2})

  meta_df = pd.DataFrame(meta_rows)
  return plot_df, meta_df

Then plot:

def plot(
    plot_df: pd.DataFrame,
    meta_df: pd.DataFrame,
    filename: str,
) -> None:
  plt.figure(figsize=(10, 6))
  plt.plot(plot_df['x'], plot_df['mean'], label='mean', linewidth=2)
  plt.plot(plot_df['x'], plot_df['mean'] + plot_df['ci95'],
           linestyle='--', color='gray', alpha=0.5, label='95% CI')
  plt.plot(plot_df['x'], plot_df['mean'] - plot_df['ci95'],
           linestyle='--', color='gray', alpha=0.5)

  for _, row in meta_df.iterrows():
    label = f"{row['name']}: {row['pretty']} (R²={row['r2']:.4f})"
    plt.plot(plot_df['x'], plot_df[row['name']], label=label)

  plt.xlim(left=0)
  plt.ylim(bottom=0)
  plt.xlabel('n (number of strings)')
  plt.ylabel('E(n) (expected loops)')
  plt.legend()
  plt.tight_layout()
  plt.savefig(filename, dpi=150)
  plt.close()

And the main function:

def main() -> None:
  print("--- Phase 1: Gather data ---")
  start_time = time.time()
  trials_df = run_trials_python(100, 100)
  print(f"  Took {time.time() - start_time:.2f} seconds.")

  print("--- Phase 2: Fit hypotheses ---")
  start_time = time.time()
  models: dict[str, torch.nn.Module] = {
    'harmonic': HarmonicModel(),
    'log': LogModel(),
    'power': PowerModel(),
    'polynomial': PolynomialModel(),
  }
  plot_df, meta_df = fit_hypotheses(trials_df, models)
  print(f"  Took {time.time() - start_time:.2f} seconds.")

  print("--- Phase 3: Plot ---")
  plot(plot_df, meta_df, 'random-ends-v1.png')
  print("  Saved to random-ends-v1.png")

And, in less than a second, out popped an image!

100 string lengths, 100 trials each.

Good data! However, I'd like to have more trials per string length to tighten that CI. Additionally, I'd like to extend the graph out further to the right. Let's go to run_trials_python(1000, 1000) and see.

I ran the program and...

--- Phase 1: Gather data ---
  Took 484.36 seconds.
--- Phase 2: Fit hypotheses ---
harmonic: 0.5031 * H(n) + 0.6705 (=0.9896)
log: 0.4982 * ln(n) + 0.9913 (=0.9898)
power: 1.3690 * n^0.1752 (=0.9278)
polynomial: -0.0000n² + 0.0155n + -0.0904 (=-2.8770)
  Took 5.96 seconds.
--- Phase 3: Plot ---
  Saved to random-ends-v1.png

484.36 seconds. The speed was bugging me. 484.36 seconds? Surely it can be faster. Wait a minute... A main function that calls a perform function? Making things efficient? Perhaps I can use... C++? Like a sleeper agent, all of my competitive programming experience came rushing back to me.

Longing. Rusted. Seventeen. Daybreak. Furnace. Nine. Benign. Homecoming. One. Freight Car.

Making it faster

A few insights:

First, this problem is embarrassingly parallel. Each trial is independent of one another. We could totally take advantage of more cores and get a pretty nice speedup.

Next, we don't actually have to remove the ends from the all_ends list. Once we're done with them, we never look at those ends again. Another equivalent way to get this functionality is to simply generate a list of all ends, shuffle them, then go through the list linearly. This is great because, say it with me, heap allocations are slow!

Furthermore, I'm reminded of this famous quote by Robert Galanakis:

The fastest code is the code that does not run.

Robert Galanakis

We are definitely performing too many trials on the smaller string lengths and not enough on the larger ones. Recall the central limit theorem. We can take kk samples from an arbitrary distribution. We know that the mean of these samples is approximately normally distributed with mean μ\mu and variance σ2/k\sigma^2/k. The standard deviation of this distribution, σ/k\sigma/\sqrt{k}, is called the standard error. We don't know the true σ\sigma, but we can approximate it using Bessel's correction: s2=i=1k(xixˉ)2/(k1)s^2 = \sum_{i=1}^{k} (x_i - \bar{x})^2 / (k - 1).

So instead of doing a fixed number of trials, we can keep running trials until the standard error drops below a desired threshold.

Finally, we could definitely use memory more efficiently. Why are we using a dictionary when an array would do just fine? Why not allocate all the memory we could possibly need at the beginning?

Putting it all together, the core simulation function looks like this. Each thread grabs the next unprocessed string count from an atomic counter, runs trials until the standard error is tight enough, and moves on:

void perform_adaptive(
  vector<pair<ul, ul>>& results, // Elements are { # strings, # loops made }
  atomic<ul>& next_n_strings,    // Atomic to grab next string size to calc
  const ul max_n_strings,        // Largest number of strings to calc
  const ul min_trials,           // Min trials before kicking in standard error
  const double epsilon           // Standard error to target
) {
  mt19937 rng(random_device{}());
  const ul max_n_ends = 2*max_n_strings;

  auto end_seq = vector<ul>(max_n_ends); // Pre-allocate based on maximum ends
  auto end_map = vector<ul>(max_n_ends); // we could possibly need.

  for (ul i = 0; i < max_n_ends; i++) // n_strings only increases, so we set
    end_seq[i] = i;                   // end_seq only once.

  ul n_strings;
  while ((n_strings = next_n_strings.fetch_add(1)) <= max_n_strings) {
    const ul n_ends = 2*n_strings;

    ul sum = 0, sum_sq = 0, n = 0; // Stats for standard error stopping
    while (true) {
      // While we don't need to reset end_seq, we do need to reset end_map
      for (ul i = 0; i < n_ends; i += 2) {
        end_map[i]   = i+1; // Could do `end_map[i] = i^1;` with `i++`, but the
        end_map[i+1] = i;   // compiler optimizes it the same way
      }

      shuffle(end_seq.begin(), end_seq.begin() + n_ends, rng);

      ul result = 1; // Always at least 1 loop, so no need to calc last string
      for (ul i = 0; i < n_ends - 2; i += 2) {
        const ul a = end_seq[i];
        const ul b = end_seq[i+1];

        const ul a_other = end_map[a];
        if (a_other == b) {
          result++;
          continue;
        }

        const ul b_other = end_map[b];

        end_map[a_other] = b_other;
        end_map[b_other] = a_other;
      }

      results.push_back({n_strings, result});

      sum    += result;
      sum_sq += result * result;
      n++;
      if (n >= min_trials) {
        double var = ((double)sum_sq - (double)sum * sum / n) / (n - 1);
        double se = sqrt(var / n);
        if (se <= epsilon / 1.96) break; // 95% CI
      }
    }
  }
}

Note the result = 1 on line 39 - the last remaining string must form a loop, so we skip that iteration entirely. It's a tiny optimization, but it made me feel clever.

The main function just spins up threads and collects results. Each thread gets its own pre-allocated result vector to avoid contention:

int main(int argc, char** argv) {
  ios_base::sync_with_stdio(false); // Competitive programming stuff to make
  cin.tie(nullptr);                 // cout faster

  const ul max_n_strings    = stoul(argv[1]);
  const ul min_trials       = stoul(argv[2]);
  const double epsilon      = stod(argv[3]);
  const ul pre_alloc_trials = stoul(argv[4]);

  const ul n_threads = thread::hardware_concurrency() ?: 4;

  atomic<ul> next_n_strings(1);
  vector<thread> threads;

  vector<vector<pair<ul, ul>>> thread_results(n_threads);
  for (auto& v : thread_results)                   // Pre-alloc result arrays
      v.reserve(max_n_strings * pre_alloc_trials); // for performance

  for (ul t = 0; t < n_threads; t++)
    threads.emplace_back(
      perform_adaptive, ref(thread_results[t]), ref(next_n_strings),
      max_n_strings, min_trials, epsilon);

  for (auto& t : threads) t.join();

  cout << "x,y\n"; // Output CSV to stdout
  for (auto& tr : thread_results)
    for (auto& [x, y] : tr)
      cout << x << "," << y << "\n";

  return 0;
}

There are many forms of inter-process communication, but I decided to just simply spit the results out to stdout as a csv file. Taking everything into account, need we update the Python script to shell out to the compiled binary and read the CSV output back in:

def run_trials_cpp(
  max_n_strings: int,
  min_trials: int = 10,
  epsilon: float = 0.1,
  pre_alloc_trials: int = 1000,
) -> pd.DataFrame:
  result = subprocess.run(
    ['./random-ends-v2',
     str(max_n_strings), str(min_trials),
     str(epsilon), str(pre_alloc_trials)],
    stdout=subprocess.PIPE, stderr=None, text=True, check=True,
  )
  return pd.read_csv(io.StringIO(result.stdout))

Moment of truth:

--- Phase 1: Gather data ---
  Took 0.94 seconds.
--- Phase 2: Fit hypotheses ---
harmonic: 0.5055 * H(n) + 0.6530 (=0.9895)
log: 0.5006 * ln(n) + 0.9753 (=0.9897)
power: 1.3651 * n^0.1756 (=0.9286)
polynomial: -0.0000n² + 0.0155n + -0.0904 (=-2.8137)
  Took 6.12 seconds.
--- Phase 3: Plot ---
  Saved to random-ends-v2.png

484 seconds down to 0.94. A 515x speedup. I'm not going to pretend I didn't fist pump at my desk.

10,000 string lengths with adaptive trials, completed in under a second.

So obviously the polynomial approximation is right out. The power model also seems to drift. That leaves the harmonic model and the natural log model, both with suspiciously similar R2R^2 values. Suspicious enough to make you wonder if they're secretly the same thing...

Doing the math

Alright, time to put on the big boy pants and actually solve this thing analytically.

Define a random variable XiX_i that equals 1 if the ii-th pairing creates a loop, and 0 otherwise. The total number of loops is just X1+X2++XnX_1 + X_2 + \cdots + X_n. By linearity of expectation (the single most overpowered tool in all of probability):

E[total loops]=E[X1]+E[X2]++E[Xn]\mathbb{E}[\text{total loops}] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \cdots + \mathbb{E}[X_n]

Since XiX_i is an indicator variable, E[Xi]=P(loop at step i)\mathbb{E}[X_i] = P(\text{loop at step } i). So what's the probability of forming a loop at step ii?

At step ii, there are n(i1)=ni+1n - (i - 1) = n - i + 1 strings remaining, and therefore 2(ni+1)2(n - i + 1) ends. We pick one end. Now there are 2(ni+1)1=2n2i+12(n - i + 1) - 1 = 2n - 2i + 1 ends left. Exactly one of those belongs to the same string as the end we just picked. So:

E[Xi]=P(loop at step i)=12n2i+1=12(ni)+1\mathbb{E}[X_i] = P(\text{loop at step } i) = \frac{1}{2n - 2i + 1} = \frac{1}{2(n - i) + 1}

Summing it all up:

E[total loops]=i=1n12(ni)+1\mathbb{E}[\text{total loops}] = \sum_{i=1}^{n} \frac{1}{2(n-i)+1}

Let k=nik = n - i, so i=nki = n - k. When i=1i = 1, k=n1k = n - 1; when i=ni = n, k=0k = 0. Substituting:

12(ni)+1=12(n(nk))+1=12k+1\frac{1}{2(n - i) + 1} = \frac{1}{2(n - (n - k)) + 1} = \frac{1}{2k + 1}

So the sum becomes:

k=0n112k+1=11+13+15++12n1\sum_{k=0}^{n-1} \frac{1}{2k+1} = \frac{1}{1} + \frac{1}{3} + \frac{1}{5} + \cdots + \frac{1}{2n-1}

It's the sum of odd reciprocals! And that's why the harmonic model fit so well. The nn-th harmonic number is Hn=k=1n1kH_n = \sum_{k=1}^{n} \frac{1}{k}, which can be split into even and odd terms. Our sum is the odd half-harmonic. In fact, you can show that:

k=1n12k1=H2n12Hn\sum_{k=1}^{n} \frac{1}{2k-1} = H_{2n} - \frac{1}{2}H_n

And since Hnln(n)+γH_n \approx \ln(n) + \gamma (where γ0.5772\gamma \approx 0.5772 is the Euler-Mascheroni constant), we get:

H2n12Hnln(2n)+γ12(ln(n)+γ)=12ln(n)+ln(2)+12γH_{2n} - \frac{1}{2}H_n \approx \ln(2n) + \gamma - \frac{1}{2}(\ln(n) + \gamma) = \frac{1}{2}\ln(n) + \ln(2) + \frac{1}{2}\gamma

So asymptotically, the expected number of loops grows like 12ln(n)\frac{1}{2}\ln(n). That's why the log model and the harmonic model had nearly identical R2R^2 values - they're basically the same thing up to constants! The harmonic series and ln\ln are joined at the hip, and always have been.

For the n=50n = 50 case, we can use Python's Fraction class to get the exact answer:

from fractions import Fraction
n = 50
result = sum(Fraction(1, 2*k-1) for k in range(1, n+1))
print(f'~ {float(result):.10f}')
print(f'{result}')
~ 2.9377748485
3200355699626285671281379375916142064964/1089380862964257455695840764614254743075

Just shy of 3 loops.

Conclusion

So why did I spend a weekend writing a C++ program with multithreading, adaptive sampling, and PyTorch curve fitting to solve a problem that has a two-line closed-form solution? I genuinely don't know. But I had a blast doing it.

I think there's something to be said for the "brute force it first, think later" approach. The simulation gave me intuition. The curve fitting told me what shape the answer had. And only then did the math click into place. Would I have seen the odd reciprocal sum staring at a blank page? Maybe. But probably not as fast as I did staring at a graph that was screaming "I'm a log, you idiot."

And hey, I got to write some C++ for the first time in a while. That alone was worth it.


Source files: random-ends-v1.py, random-ends-v2.py, random-ends-v2.cpp