I recently published an article about Python’s pathlib module and how I think everyone should be using it.
I won some pathlib converts, but some folks also brought up concerns.
Some folks noted that I seemed to be comparing pathlib to os.path in a disingenuous way.
Some people were also concerned that pathlib will take a very long time to be widely adopted because os.path is so entrenched in the Python community.
And there were also concerns expressed about performance.
In this article I’d like to acknowledge and address these concerns.
This will be both a defense of pathlib and a sort of love letter to PEP 519.
Comparing pathlib and os.path the right way
In my last article I compared this code, which uses os and os.path:
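As a minimal sketch of the os and os.path style being discussed (the function name prepare_with_os and its root parameter are my assumptions, added so the example can run in any directory):

```python
import os
import os.path


def prepare_with_os(root):
    """Create src/__pypackages__ under root and move .editorconfig into src."""
    # os.path.join inserts the correct separator for the current platform.
    os.makedirs(os.path.join(root, 'src', '__pypackages__'), exist_ok=True)
    os.rename(
        os.path.join(root, '.editorconfig'),
        os.path.join(root, 'src', '.editorconfig'),
    )
```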
To this code, which uses pathlib.Path:
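A minimal sketch of the pathlib.Path style being discussed (the function name prepare_with_pathlib and its root parameter are my assumptions, added so the example can run in any directory):

```python
from pathlib import Path


def prepare_with_pathlib(root):
    """Create src/__pypackages__ under root and move .editorconfig into src."""
    root = Path(root)
    # Forward slashes inside the strings are fine: Path handles separators.
    (root / 'src/__pypackages__').mkdir(parents=True, exist_ok=True)
    (root / '.editorconfig').rename(root / 'src/.editorconfig')
```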
This might seem like an unfair comparison because I used os.path.join in the first example to ensure the correct path separator is used on all platforms, but I didn’t do that in the second example.
But this is in fact a fair comparison because the Path class normalizes path separators automatically.
We can prove this by looking at the string representation of this Path object on Windows:
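You can check this without a Windows machine. This sketch uses PureWindowsPath, which applies Windows path rules on any operating system; the path 'src/__pypackages__' is just an illustrative value:

```python
from pathlib import PureWindowsPath

# PureWindowsPath follows Windows rules even when run on Linux or macOS.
path = PureWindowsPath('src/__pypackages__')
print(str(path))  # prints src\__pypackages__
```

On an actual Windows machine, Path('src/__pypackages__') produces the same string.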
No matter whether we use the joinpath method, a / in a path string, the / operator (which is a neat feature of Path objects), or separate arguments to the Path constructor, we get the same representation in all cases:
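Here’s a small sketch of those four spellings, using PureWindowsPath so the Windows behavior is visible on any platform (the file name is an illustrative choice):

```python
from pathlib import PureWindowsPath

candidates = [
    PureWindowsPath('src').joinpath('.editorconfig'),  # joinpath method
    PureWindowsPath('src') / '.editorconfig',          # the / operator
    PureWindowsPath('src', '.editorconfig'),           # separate arguments
    PureWindowsPath('src/.editorconfig'),              # a / inside the string
]

for candidate in candidates:
    # Each line prints: PureWindowsPath('src/.editorconfig') src\.editorconfig
    print(repr(candidate), str(candidate))
```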
That last expression caused some confusion from folks who assumed pathlib wouldn’t be smart enough to convert that / into a \ in the path string.
Fortunately, it is!
With Path objects, you never have to worry about backslashes vs forward slashes again: specify all paths using forward slashes and you’ll get what you’d expect on all platforms.
Normalizing file paths shouldn’t be your concern
If you’re developing on Linux or Mac, it’s very easy to add bugs to your code that only affect Windows users.
Unless you’re careful to use os.path.join to build your paths up or os.path.normcase to convert forward slashes to backslashes as appropriate, you may be writing code that breaks on Windows.
This is a Windows bug waiting to happen (we’ll get mixed backslashes and forward slashes here):
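To see the problem on any platform, this sketch uses ntpath, the Windows implementation behind os.path. The directory and file names are hypothetical:

```python
import ntpath  # what os.path resolves to on Windows

directory = 'C:\\projects'  # hypothetical Windows directory
# join adds a backslash between the parts but leaves the slash inside
# the second argument untouched.
new_file = ntpath.join(directory, 'new_package/__init__.py')
print(new_file)  # prints C:\projects\new_package/__init__.py
```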
This just works on all systems:
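A sketch of the pathlib equivalent, again with hypothetical names and with PureWindowsPath standing in for Path so the Windows result shows up on any platform:

```python
from pathlib import PureWindowsPath

directory = PureWindowsPath('C:\\projects')  # hypothetical Windows directory
# The / operator joins the parts and normalizes every separator.
new_file = directory / 'new_package/__init__.py'
print(new_file)  # prints C:\projects\new_package\__init__.py
```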
It used to be the responsibility of you the Python programmer to carefully join and normalize your paths, just as it used to be your responsibility in Python 2 land to use unicode whenever it was more appropriate than bytes.
This is the case no more.
The pathlib.Path class is careful to fix path separator issues before they even occur.
I don’t use Windows. I don’t own a Windows machine. But a ton of the developers who use my code likely use Windows and I don’t want my code to break on their machines.
If there’s a chance that your Python code will ever run on a Windows machine, you really need pathlib.
Don’t stress about path normalization: just use pathlib.Path whenever you need to represent a file path.
pathlib seems great, but I depend on code that doesn’t use it!
You have lots of code that works with path strings.
Why would you switch to using pathlib when it means you’d need to rewrite all this code?
Let’s say you have a function like this:
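A minimal sketch of what such a string-based function might look like, assuming a make_editorconfig that builds its path with os.path.join and returns the resulting filename:

```python
import os
import os.path


def make_editorconfig(dir_path):
    """Create .editorconfig file in given directory and return filename."""
    filename = os.path.join(dir_path, '.editorconfig')
    if not os.path.exists(filename):
        os.makedirs(dir_path, exist_ok=True)
        # Create an empty file; the with block closes it right away.
        with open(filename, mode='wt') as config_file:
            config_file.write('')
    return filename
```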
This function accepts a directory to create a .editorconfig file in, like this:
But our code also works with a Path object:
But… how??
Well, os.path.join accepts Path objects (as of Python 3.6).
And os.makedirs accepts Path objects too.
In fact, the built-in open function accepts Path objects, shutil does too, and anything in the standard library that previously accepted a path string is now expected to work with both Path objects and path strings.
This is all thanks to PEP 519, which called for an os.PathLike abstract base class and declared that Python utilities that work with file paths should now accept either path strings or path-like objects.
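Here’s a small sketch that passes Path objects to a few of these standard library functions. It works inside a temporary directory, and the file and directory names are made up for illustration:

```python
import os
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # os.path.join accepts a Path and returns a plain string.
    joined_type = type(os.path.join(base, 'project'))
    os.makedirs(base / 'project', exist_ok=True)
    with open(base / 'project' / 'notes.txt', mode='wt') as notes:
        notes.write('hello')
    shutil.copy(base / 'project' / 'notes.txt', base / 'copy.txt')
    copied_text = (base / 'copy.txt').read_text()

# Path objects are registered as path-like objects.
print(isinstance(base, os.PathLike))  # prints True
```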
But my favorite third-party library X has a better Path object!
You might already be using a third-party library that has a Path object which works differently than pathlib’s Path objects.
Maybe you even like it better.
For example django-environ, path.py, plumbum, and visidata all have their own custom Path objects that represent file paths.
Some of these pathlib alternatives predate pathlib and chose to inherit from str so they could be passed to functions that expected path strings.
Thanks to PEP 519 both pathlib and its third-party alternatives can play nicely without needing to resort to the hack of inheriting from str.
Let’s say you don’t like pathlib because Path objects are immutable and you very much prefer using mutable Path objects.
Well, thanks to PEP 519, you can create your own even-better-because-it-is-mutable Path that also has a __fspath__ method.
You don’t need to use pathlib to benefit from it.
Any homegrown Path object you make or find in a third-party library now has the ability to work natively with the Python built-ins and standard library modules that expect file paths.
Even if you don’t like pathlib, its existence is a big win for third-party Path objects as well.
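As a rough sketch of that idea, here’s a hypothetical MutablePath class (not from pathlib or any library) whose only link to the standard library is its __fspath__ method:

```python
import os
import tempfile


class MutablePath:
    """Hypothetical mutable path type, for illustration only."""

    def __init__(self, path):
        self.path = os.fspath(path)

    def append(self, part):
        """Extend this path in place."""
        self.path = os.path.join(self.path, os.fspath(part))

    def __fspath__(self):
        # PEP 519 hook: return the string form of the path.
        return self.path


with tempfile.TemporaryDirectory() as tmp:
    config = MutablePath(tmp)
    config.append('settings.ini')
    with open(config, mode='wt') as config_file:  # open() calls __fspath__
        config_file.write('[main]\n')
    created = os.path.exists(config)
```

Because open and os.path.exists only ask for __fspath__, the class needs neither pathlib nor str as a base class.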
But Path objects and path strings don’t mix, do they?
You might be thinking: this is really wonderful, but won’t this sometimes-a-string and sometimes-a-path-object situation add confusion to my code?
The answer is yes, somewhat. But I’ve found that it’s pretty easy to work around.
PEP 519 added a couple other things along with path-like objects: one is a way to convert all path-like objects to path strings and the other is a way to convert all path-like objects to Path objects.
Given either a path string or a Path object (or anything with a __fspath__ method):
The os.fspath function will now normalize both of these types of paths to strings:
And the Path class will now accept both of these types of paths and convert them to Path objects:
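Both conversions can be sketched in a few lines. The names p1 and p2 and the path src/my_package are illustrative assumptions:

```python
import os
from pathlib import Path

p1 = os.path.join('src', 'my_package')  # a path string
p2 = Path('src/my_package')             # a Path object

# os.fspath turns either kind of path into a string.
as_strings = [os.fspath(p1), os.fspath(p2)]

# The Path constructor turns either kind of path into a Path object.
as_paths = [Path(p1), Path(p2)]
```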
That means you could convert the output of the make_editorconfig function back into a Path object if you wanted to:
Though of course a better long-term approach would be to rewrite the make_editorconfig function to use pathlib instead.
pathlib is too slow
I’ve heard this concern come up a few times: pathlib is just too slow.
It’s true that pathlib can be slow.
Creating thousands of Path objects can make a noticeable impact on your code.
I decided to test the performance difference between pathlib and the alternative on my own machine using two different programs that both look for all .py files below the current directory.
Here’s the os.walk version:
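A sketch of what an os.walk-based counter could look like. Wrapping it in a function named count_files_with_walk is my assumption, made so it can be pointed at any directory:

```python
import os


def count_files_with_walk(root, extension='.py'):
    """Count files in and below root whose names end with extension."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(extension):
                count += 1
    return count
```

Calling count_files_with_walk(os.getcwd()) counts the .py files below the current directory.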
Here’s the Path.rglob version:
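A sketch of what a Path.rglob-based counter could look like. The function name count_files_with_rglob is my assumption, chosen so it can be pointed at any directory:

```python
from pathlib import Path


def count_files_with_rglob(root, extension='.py'):
    """Count entries in and below root whose names end with extension."""
    # rglob yields a new Path object for every match, which is where
    # the extra cost comes from. It also matches directories if their
    # names happen to fit the pattern.
    return sum(1 for path in Path(root).rglob(f'*{extension}'))
```

Calling count_files_with_rglob(Path.cwd()) counts the .py entries below the current directory.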
Testing runtimes for programs that rely on filesystem accesses is tricky because runtimes vary greatly, so I reran each script 10 times and compared the best runtime of each.
Both scripts found 97,507 Python files in the directory I ran them in. The first one finished in 1.914 seconds (best out of 10 runs). The second one finished in 3.430 seconds (best out of 10 runs).
When I set extension = '' these find about 600,000 files and the gap widens.
The first runs in 1.888 seconds and the second in 7.485 seconds.
So the pathlib version of this program ran nearly twice as slow for .py files and about four times as slow for every file in my home directory.
The pathlib code was indeed slower, much slower percentage-wise.
But in my case, this speed difference doesn’t matter much. I searched for every file in my home directory and lost 6 seconds to the slower version of my code. If I needed to scale this code to search 10 million files, I’d probably want to rewrite it. But that’s a problem I can get to if I experience it.
If you have a tight loop that could use some optimizing and pathlib.Path is one of the bottlenecks that’s slowing that loop down, abandon pathlib in that part of your code.
But don’t optimize parts of your code that aren’t bottlenecks: it’s a waste of time and often results in less readable code for little gain.
Improving readability with pathlib
I’d like to wrap up these thoughts by ending with some pathlib refactorings.
I’ve taken a couple small examples of code that work with files and refactored these examples to use pathlib instead.
I’ll mostly leave these code blocks without comment and let you be the judge of which versions you like best.
Here’s the make_editorconfig function we saw earlier:
And here’s the same function using pathlib.Path instead:
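A sketch of how a pathlib-based make_editorconfig might look, assuming it should create the directory if needed, create an empty .editorconfig file, and return the file’s path:

```python
from pathlib import Path


def make_editorconfig(dir_path):
    """Create .editorconfig file in given directory and return filepath."""
    path = Path(dir_path, '.editorconfig')
    if not path.exists():
        path.parent.mkdir(exist_ok=True, parents=True)
        path.touch()
    return path
```

This version returns a Path object rather than a string, so callers that need a string can pass the result to os.fspath.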
Here’s a command-line program that accepts a string representing a directory and prints the contents of the .gitignore file in that directory if one exists:
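A sketch of such a program in the os.path style. Splitting the logic into a print_gitignore function is my assumption, made to keep the example easy to test:

```python
import os.path
import sys


def print_gitignore(directory):
    """Print the .gitignore file in directory, if there is one."""
    ignore_filename = os.path.join(directory, '.gitignore')
    if os.path.isfile(ignore_filename):
        with open(ignore_filename, mode='rt') as ignore_file:
            print(ignore_file.read(), end='')


if __name__ == '__main__' and len(sys.argv) > 1:
    print_gitignore(sys.argv[1])
```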
This is the same code using pathlib.Path:
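A sketch of a pathlib-style program that prints a directory’s .gitignore file if one exists. The print_gitignore function name is my assumption:

```python
from pathlib import Path
import sys


def print_gitignore(directory):
    """Print the .gitignore file in directory, if there is one."""
    ignore_path = Path(directory) / '.gitignore'
    if ignore_path.is_file():
        print(ignore_path.read_text(), end='')


if __name__ == '__main__' and len(sys.argv) > 1:
    print_gitignore(sys.argv[1])
```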
And here’s some code that prints all groups of files in and below the current directory which are duplicates:
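A sketch of a duplicate finder written with os.walk and os.path. The function names and the choice of SHA-256 as the content hash are my assumptions; any hash of the file contents would do:

```python
from collections import defaultdict
from hashlib import sha256
import os
import os.path


def find_files(directory):
    """Yield the path string of every file in and below directory."""
    for root, directories, filenames in os.walk(directory):
        for filename in filenames:
            yield os.path.join(root, filename)


def find_duplicates(directory):
    """Return groups of path strings whose files have identical contents."""
    file_hashes = defaultdict(list)
    for path in find_files(directory):
        with open(path, mode='rb') as my_file:
            file_hash = sha256(my_file.read()).hexdigest()
        file_hashes[file_hash].append(path)
    return [paths for paths in file_hashes.values() if len(paths) > 1]


def print_duplicates(directory):
    """Print each group of duplicate files found below directory."""
    for paths in find_duplicates(directory):
        print("Duplicate files found:")
        print(*paths, sep='\n')
```

Calling print_duplicates(os.getcwd()) reports duplicates in and below the current directory.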
This is the same code using pathlib.Path instead:
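A sketch of a duplicate finder written with pathlib. The function names and the choice of SHA-256 as the content hash are my assumptions; any hash of the file contents would do:

```python
from collections import defaultdict
from hashlib import sha256
from pathlib import Path


def find_files(directory):
    """Yield a Path for every file in and below directory."""
    for path in Path(directory).rglob('*'):
        if path.is_file():
            yield path


def find_duplicates(directory):
    """Return groups of Path objects whose files have identical contents."""
    file_hashes = defaultdict(list)
    for path in find_files(directory):
        file_hash = sha256(path.read_bytes()).hexdigest()
        file_hashes[file_hash].append(path)
    return [paths for paths in file_hashes.values() if len(paths) > 1]


def print_duplicates(directory):
    """Print each group of duplicate files found below directory."""
    for paths in find_duplicates(directory):
        print("Duplicate files found:")
        print(*paths, sep='\n')
```

Calling print_duplicates(Path.cwd()) reports duplicates in and below the current directory.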
The changes here are subtle, but I think they add up.
I prefer this pathlib-refactored version.
Start using pathlib.Path objects
Let’s recap.
The / separators in pathlib.Path strings are automatically converted to the correct path separator based on the operating system you’re on.
This is a huge feature that can make for code that is more readable and more certain to be free of path-related bugs.
The Python standard library and built-ins (like open) also accept pathlib.Path objects now.
This means you can start using pathlib, even if your dependencies don’t!
And if you don’t like pathlib, you can use a third-party library that provides the same path-like interface.
This is great because even if you’re not a fan of pathlib you’ll still benefit from the new changes detailed in PEP 519.
While pathlib is sometimes slower than the alternative(s), the cases where this matters are somewhat rare (in my experience at least) and you can always jump back to using path strings for parts of your code that are particularly performance sensitive.
And in general, pathlib makes for more readable code.
Here’s a succinct and descriptive Python script to demonstrate my point:
The pathlib module is lovely: start using it!