There is overhead, and it can be significant in rare cases (for example, in micro-benchmarks), regardless of the optimizations that are in place (and there are many). The normal case, though, is optimized for uncontended manipulation of an object's reference count.
So the question is: if reference counting is so bad for threading, how does Objective-C do it?
There are multiple locks in play and, effectively, a retain/release on any given object picks a lock pseudo-randomly (but always the same lock for that object). This reduces lock contention without requiring one lock per object.
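Conceptually this is plain lock striping. The sketch below is not the actual runtime code; the table size, the bit-mixing hash, and the sketch_retain function are invented for illustration, but they show how hashing an object's address into a fixed table of locks gives each object a stable lock while keeping the table small:

```objc
#include <os/lock.h>
#include <stdint.h>
#include <stddef.h>

#define LOCK_COUNT 64   // hypothetical table size, chosen for illustration

// A zero-initialized os_unfair_lock is a valid unlocked lock, so a
// static array starts out ready to use.
static os_unfair_lock ref_locks[LOCK_COUNT];

// Hash the object's address to a slot. The same address always maps to
// the same slot, so a given object is always protected by the same lock,
// while unrelated objects usually land on different locks.
static os_unfair_lock *lock_for_object(const void *obj) {
    uintptr_t h = (uintptr_t)obj;
    h ^= h >> 12;                    // mix bits so nearby addresses spread out
    return &ref_locks[h % LOCK_COUNT];
}

// Hypothetical retain path: take the object's lock, bump a count kept in
// some side table, unlock.
static void sketch_retain(const void *obj, size_t *side_table_count) {
    os_unfair_lock *lock = lock_for_object(obj);
    os_unfair_lock_lock(lock);
    (*side_table_count)++;
    os_unfair_lock_unlock(lock);
}
```

Two threads retaining different objects almost never hash to the same slot, so the common case is an uncontended lock.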
(And, as Catfish_man said, some classes implement their own reference-counting scheme so they can use class-specific locking primitives to avoid contention and/or optimize for their specific needs.)
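For example, a class might bypass the shared machinery entirely and keep its own count in an instance variable with a single atomic. The class below is purely my illustration (it is not from any actual framework), and it only works under manual reference counting, since ARC forbids overriding retain/release:

```objc
// Compile with -fno-objc-arc: a hypothetical class that keeps an inline,
// atomic reference count instead of using the runtime's locked side tables.
#import <Foundation/Foundation.h>
#import <stdatomic.h>

@interface FastNode : NSObject {
    atomic_int_fast32_t _refs;
}
@end

@implementation FastNode

- (instancetype)init {
    if ((self = [super init])) {
        atomic_init(&_refs, 1);   // the reference returned by alloc/init
    }
    return self;
}

- (instancetype)retain {
    atomic_fetch_add_explicit(&_refs, 1, memory_order_relaxed);
    return self;
}

- (oneway void)release {
    // acq_rel ordering so the deallocating thread sees all writes made by
    // threads that released their references earlier.
    if (atomic_fetch_sub_explicit(&_refs, 1, memory_order_acq_rel) == 1) {
        [self dealloc];
    }
}

- (NSUInteger)retainCount {
    return (NSUInteger)atomic_load_explicit(&_refs, memory_order_relaxed);
}

@end
```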
The actual implementation details are more complex.
Is Objective-C reference counting actually technically unsafe with threads?
No; it is safe with regard to threads.
In reality, typical code calls retain and release quite infrequently compared to other operations. Thus, even if there were significant overhead on those code paths, it would be amortized across all the other operations in the app (where, say, pushing pixels to the screen is really expensive by comparison).
If an object is shared across threads (generally a bad idea), then the locking overhead needed to protect access to and manipulation of its data will usually be vastly greater than the retain/release overhead, simply because retains and releases happen so infrequently.
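To make that concrete, here is a hypothetical shared object (the class and names are mine, not from the original discussion): every access to its data has to take the lock, while retain/release traffic only occurs when a reference is stored or handed to another owner.

```objc
#import <Foundation/Foundation.h>

// Hypothetical thread-shared object: the lock is paid on every access.
@interface SharedCounter : NSObject {
    NSLock    *_lock;
    NSInteger  _value;
}
- (void)increment;
@end

@implementation SharedCounter
- (instancetype)init {
    if ((self = [super init])) {
        _lock = [[NSLock alloc] init];
    }
    return self;
}
- (void)increment {
    [_lock lock];        // taken on every single call, from any thread
    _value += 1;
    [_lock unlock];
}
@end
```

If several threads each call -increment a million times, that is millions of lock/unlock pairs, versus a handful of retains when each thread first stores its reference to the counter.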
As for the overhead of Python's GIL, I would wager that it has more to do with how often the reference count is incremented and decremented as part of normal interpreter operation.