Categories
Posts

Guessing multiple extensions from dotted filenames in Python

Post at-a-glance: This post is about…

  • How to extract suffixes from filenames with Python
  • Defining a guess_extensions(filename) utility function that extracts extensions from a filename

Recently I was dealing with filenames of the following form: 3a1d3614.2020-09-27.json.gz. That is, the filenames consisted of four parts: an alphanumeric ID, a date, and two extensions separated by dots.

Python offers at least two built-in ways to extract extensions from filenames. The os.path.splitext(path) function splits a path or filename into a pair (root, ext):

>>> from os.path import splitext
>>> filename = "3a1d3614.2020-09-27.json.gz"
>>> splitext(filename)
('3a1d3614.2020-09-27.json', '.gz')

As you can see, this gives us only one of the two extensions. You could proceed to recursively split the root part and collect the extensions, but you wouldn’t know when to stop. This is because splitext doesn’t check whether an extracted extension represents an actual file type extension such as .jpg:

>>> splitext(splitext(splitext(filename)[0])[0])
('3a1d3614', '.2020-09-27')

An essentially equivalent behavior can be achieved by using the pathlib.Path class (available in Python 3.4 and newer):

>>> from pathlib import Path
>>> Path(filename).suffixes
['.2020-09-27', '.json', '.gz']

Although more succinct than splitext, Path.suffixes still includes the date part of the filename. In order to exclude this part and other non-extensions from the suffixes list, we can filter it with the help of the mimetypes module. This module features a types_map containing all file extensions known to Python:

>>> mimetypes.types_map
{'.js': 'application/javascript', '.mjs': 'application/javascript', '.json': 'application/json', …}

We can use this map to determine whether Python considers a suffix a file type extension or not. Thus, we arrive at a nice & simple utility function that gives us what we were looking for:

def guess_extensions(filename):
    return [s for s in Path(filename).suffixes if s in mimetypes.types_map]

>>> guess_extensions(filename)
['.json', '.gz']