Python 3.14 finally introduces a free-threaded build. Here's how I tested it with real multi-core workloads — and what it means for Python's future.
The Ghost of the GIL
For as long as I've written Python, there's been one phrase that inevitably shows up in every discussion about performance — the GIL. The Global Interpreter Lock, or GIL, has been both a salvation and a curse for Python developers. It made memory management simple and safe, but it also tied Python's hands. Because of it, only one thread could run Python bytecode at a time. No matter how powerful your machine was, Python never truly used more than one core.
That limitation quietly shaped how we wrote Python for decades. Whenever we needed parallel performance, we reached for workarounds: the multiprocessing module, offloading to C extensions, or even rewriting critical paths in Rust or C++. It worked, but it was never elegant. It always felt like Python was jogging while the hardware beneath it was sprinting.
This year, that changed.
With Python 3.14, released in October 2025, CPython's free-threaded build became officially supported: a version of the interpreter in which the GIL can be disabled entirely. (It first shipped as an experimental option in Python 3.13.) For the first time, threads can truly run in parallel across multiple cores. It's one of the biggest changes to Python's runtime in the language's three-decade history.
Putting it to the Test
I wanted to see what this really meant in practice, so I set up two environments on my MacBook using pyenv. One was Python 3.12, the standard build with the GIL. The other was Python 3.14t, the free-threaded variant (pyenv exposes it under the 3.14t name; building CPython from source requires the --disable-gil configure option).
The default 3.14 installers still ship with the GIL enabled, so developers need to opt into the free-threaded build explicitly.
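If you want to confirm which build you're actually running, the standard library can tell you. A quick sketch (sys._is_gil_enabled() only exists on Python 3.13 and later, so it's guarded for older versions):

```python
import sys
import sysconfig

def gil_status():
    """Return (built_free_threaded, gil_currently_enabled)."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0/None otherwise.
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists only on 3.13+; assume the GIL is on elsewhere.
    gil_enabled = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
    return built_free_threaded, gil_enabled

print(gil_status())  # e.g. (False, True) on a standard build
```

On a 3.14t build with the GIL disabled, this prints (True, False).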
To keep things simple, I wrote a small script that creates eight threads — one for each core — and uses each of them to calculate the sum of squares up to forty million. Pure Python math. Heavy on CPU, no I/O, no tricks.
Here's a rough look at what it runs:
import threading
import time
import multiprocessing


def num_of_squares(n):
    """Helper function to calculate the sum of squares from 0 to n-1"""
    result = sum(i**2 for i in range(n))
    return result


def worker_thread(n):
    """Worker function for the multi-threaded sum of squares benchmark"""
    name = threading.current_thread().name
    print(f"Worker {name}: starting")
    start_time = time.time()
    result = num_of_squares(n)
    end_time = time.time()
    thread_time = end_time - start_time
    print(f"Worker {name}: calculated sum of squares for {n:,} numbers in {thread_time:.2f} seconds")
    return result


def main():
    run_cores = multiprocessing.cpu_count()
    start_time = time.time()
    # One CPU task per thread — raise the range for a more demanding benchmark
    threads = []
    for i in range(run_cores):
        thread = threading.Thread(
            target=worker_thread,
            args=(40_000_000,),
            name=str(i),
        )
        threads.append(thread)
    for thread in threads:
        thread.start()
    # Wait for all threads to complete
    for thread in threads:
        thread.join()
    end_time = time.time()
    total_time = end_time - start_time
    print(f"All workers completed in {total_time:.2f} seconds")


if __name__ == "__main__":
    main()

Each thread does the same job, crunching numbers as fast as it can. The only variable was the Python version.
The Results: Eight Cores, One Revelation
Let's see how long it takes to run using regular Python 3.12:
$ python run_of_squares.py
Worker 0: starting
Worker 1: starting
Worker 2: starting
Worker 3: starting
Worker 4: starting
Worker 5: starting
Worker 6: starting
Worker 7: starting
Worker 0: calculated sum of squares for 40,000,000 numbers in 4.48 seconds
Worker 1: calculated sum of squares for 40,000,000 numbers in 4.51 seconds
Worker 2: calculated sum of squares for 40,000,000 numbers in 4.54 seconds
Worker 3: calculated sum of squares for 40,000,000 numbers in 4.57 seconds
Worker 4: calculated sum of squares for 40,000,000 numbers in 4.59 seconds
Worker 5: calculated sum of squares for 40,000,000 numbers in 4.62 seconds
Worker 6: calculated sum of squares for 40,000,000 numbers in 4.65 seconds
Worker 7: calculated sum of squares for 40,000,000 numbers in 4.68 seconds
All workers completed in 36.05 seconds

Now, with the GIL-free version:
$ python run_of_squares.py
Worker 0: starting
Worker 1: starting
Worker 2: starting
Worker 3: starting
Worker 4: starting
Worker 5: starting
Worker 6: starting
Worker 7: starting
Worker 0: calculated sum of squares for 40,000,000 numbers in 1.38 seconds
Worker 1: calculated sum of squares for 40,000,000 numbers in 1.39 seconds
Worker 2: calculated sum of squares for 40,000,000 numbers in 1.42 seconds
Worker 3: calculated sum of squares for 40,000,000 numbers in 1.41 seconds
Worker 4: calculated sum of squares for 40,000,000 numbers in 1.43 seconds
Worker 5: calculated sum of squares for 40,000,000 numbers in 1.45 seconds
Worker 6: calculated sum of squares for 40,000,000 numbers in 1.46 seconds
Worker 7: calculated sum of squares for 40,000,000 numbers in 1.47 seconds
All workers completed in 11.17 seconds

On Python 3.12, the result was exactly what we've all come to expect. Every worker ran independently, but never truly together, and the program took 36.05 seconds. Then I switched to the free-threaded Python 3.14t build and ran the same code, same logic, and the computation finished in just 11.17 seconds. That's more than three times faster, purely because the interpreter no longer forced everything through one thread at a time.
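The speedup follows directly from the two wall-clock numbers above (just the arithmetic on the measured timings):

```python
gil_time = 36.05            # total seconds on Python 3.12 (with GIL)
free_threaded_time = 11.17  # total seconds on Python 3.14t (GIL disabled)
speedup = gil_time / free_threaded_time
print(f"{speedup:.2f}x")    # → 3.23x
```

On an eight-core machine a perfectly parallel workload could approach 8x; the remaining gap reflects per-thread overhead and shared resources such as memory bandwidth.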
Why This Changes Everything
It's hard to exaggerate how transformative this is. For years, Python's threading library has existed mostly for I/O-bound tasks, such as reading files, handling web requests, and waiting on network responses; it was never useful for real CPU-bound work. With the GIL gone, you no longer have to juggle process pools or shared-memory queues just to take advantage of all your cores. You can write ordinary threaded Python and have data pipelines, embedding computation, deep-learning transformations, or agent reasoning scale with the number of cores.
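To illustrate, here's a minimal sketch of CPU-bound work split across a thread pool using the stdlib's concurrent.futures (the chunking scheme here is my own, not taken from the benchmark above). On a free-threaded build the chunks genuinely run in parallel; on a GIL build the same code still works, just serialized:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_of_squares(n, workers=8):
    """Sum i*i for i in range(n), split into one contiguous chunk per thread."""
    # Chunk boundaries partition [0, n) exactly, even when workers doesn't divide n.
    bounds = [(k * n // workers, (k + 1) * n // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: sum(i * i for i in range(*b)), bounds)
    return sum(parts)

print(parallel_sum_of_squares(1_000_000) == sum(i * i for i in range(1_000_000)))  # True
```

The nice part is that nothing here is new API: the exact threading code we've written for years simply starts scaling.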
It also changes how you think about Python servers. Workloads that rely on async I/O and threading can now be far more performant. And since threads share memory, the overhead stays much lower than in multi-process setups.
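Because threads share one address space, workers can write results straight into a plain Python dict guarded by a lock, with no pickling and no IPC queues as in multiprocessing. A small sketch (the names are illustrative):

```python
import threading

results = {}
lock = threading.Lock()

def worker(key, n):
    value = sum(i * i for i in range(n))  # CPU work, parallel on 3.14t
    with lock:                            # an ordinary lock guards the shared dict
        results[key] = value

threads = [threading.Thread(target=worker, args=(k, 100_000)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3]
```

Note that explicit locking becomes more important, not less, once threads truly run concurrently.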
A New Chapter for Python
Of course, there are caveats. Single-threaded programs can run somewhat slower on the free-threaded build, often cited in the range of 5 to 10 percent, because object operations need thread-safe reference counting. Some C extensions will need updates to adapt to the new model. But those are transition costs. The real story is that Python can finally grow into the multi-core world we've been living in for years.
For me, watching those eight threads blaze through their tasks in parallel wasn't just a performance win. It felt symbolic: it broke one of Python's oldest limitations. For decades, the GIL has been the punchline in every performance debate about the language. Now, with Python 3.14, that punchline is beginning to fade.
This isn't just an optimization. It's liberation. Python 3.14 doesn't merely make your code faster — it lets the language finally use all the power your machine has been offering all along.
Written by CosX AI Engineering. Published October 15, 2025.