How to Get the File Type of a URL in Python?

When working with URLs in Python, it's often useful to know the file type or MIME type of the resource being requested. Knowing the type lets your code handle different kinds of resources appropriately.

In this comprehensive guide, we'll explore the various methods available in Python to detect the file type or MIME type from a URL.

Why Get File Type of URLs?

Here are some common use cases where getting the file type of a URL is helpful:

  • Downloading and saving files locally: When downloading a file from a URL, you may want to add the correct filename extension based on the file type.
  • Crawling and scraping websites: When scraping pages from a website, you want to focus on HTML pages and avoid non-HTML resources like images, PDFs, etc. Knowing the file type allows you to filter URLs.
  • API clients: When working with REST APIs, you often need to handle the response differently based on the Content-Type header. Detecting file types allows you to parse responses appropriately.
  • Security: Getting the file type can help detect suspicious URLs that don't match the expected file type. This can prevent attacks like XSS by blocking unexpected file types.
  • Rendering: To display a file correctly, you need to know its type so that it can be rendered properly. PDFs, images, and videos require different handling.

In short, file type detection matters whenever your Python code handles URLs and network resources. Let's now look at the methods available.

Method 1: Check File Extension

Many URLs contain the file extension explicitly as part of the path or filename. For example:

"https://example.com/files/document.pdf"
"https://example.com/imgs/logo.jpg"

For these URLs, we can use the os.path.splitext() function from Python's standard library to split the URL path into the filename and extension:

import os
from urllib.parse import urlparse

url = "https://example.com/files/document.pdf"

path = urlparse(url).path
filename = os.path.basename(path)
name, ext = os.path.splitext(filename)

print(ext) 
# .pdf

We extract the path from the parsed URL, get the filename, and then split it into name and extension. The mimetypes module in Python can then help us map extensions to MIME types:

import mimetypes
mimetypes.types_map[".pdf"]
# 'application/pdf'

Putting it together:

from urllib.parse import urlparse
import os
import mimetypes

def get_file_type_by_extension(url):
    path = urlparse(url).path
    # Lowercase the extension: the types_map keys are lowercase (".pdf", ".jpg", ...)
    ext = os.path.splitext(os.path.basename(path))[-1].lower()
    if ext in mimetypes.types_map:
        return mimetypes.types_map[ext]
    return None
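
For instance (outputs assume the default extension map that ships with mimetypes):

print(get_file_type_by_extension("https://example.com/files/document.pdf"))
# application/pdf
print(get_file_type_by_extension("https://example.com/imgs/logo.jpg"))
# image/jpeg
print(get_file_type_by_extension("https://example.com/api/resource"))
# None (no extension to inspect)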

This will detect the file type for URLs containing an extension. But what if the URL doesn't have a file extension?

Method 2: HTTP HEAD Request

For URLs without an extension, we can make an HTTP HEAD request to get the MIME type from the Content-Type header in the response. The HEAD method retrieves the headers for a URL without downloading the entire contents. This allows us to efficiently check the file type.

Here's an example using the requests library:

import requests

url = "https://example.com/files/document" 

response = requests.head(url)
print(response.headers["Content-Type"])
# application/pdf

We can wrap this in a function to return the file type for any URL:

import requests

def get_file_type_from_head(url):
    # requests.head() does not follow redirects by default, so opt in;
    # many download URLs answer with a redirect first
    response = requests.head(url, allow_redirects=True, timeout=10)
    content_type = response.headers.get("Content-Type")
    if content_type:
        # Strip parameters such as "; charset=utf-8" to keep just the MIME type
        return content_type.split(";")[0].strip()
    return None

This makes a HEAD request, follows any redirects, and returns the bare MIME type with parameters like the charset stripped off.
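
Note that a few servers don't implement HEAD at all and answer 405 Method Not Allowed; in that case you can fall back to a streamed GET and close it after reading the headers. A quick example of the helper in action (the output depends entirely on the server):

print(get_file_type_from_head("https://example.com/"))
# text/html  (the raw header was "text/html; charset=UTF-8")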

Method 3: mimetypes.guess_type()

Python's mimetypes module also provides the guess_type() function, which maps a URL straight to a MIME type in a single call. A common misconception is that it sniffs the file's "magic numbers", the distinct byte signatures inside the content that identify a format (JPEG images, for example, always begin with the bytes FF D8 FF). It does not: guess_type() never downloads anything and works purely from the extension in the URL, making it essentially a one-call version of Method 1.

Let's try it:

import mimetypes

url = "https://example.com/files/document.pdf"

mimetypes.guess_type(url)
# ('application/pdf', None)

The function returns a (type, encoding) tuple; the encoding slot reports compression wrappers, so a .tar.gz URL comes back as ('application/x-tar', 'gzip'). Because the lookup is extension-based it needs no network access at all, but it returns (None, None) for URLs without an extension.

We can wrap it in a helper function:

import mimetypes

def get_file_type_by_guess(url):
    return mimetypes.guess_type(url)[0]

For genuine content-based detection, when there is no extension or you don't trust it, you need the file's leading bytes. Method 4 below shows how to check them by hand; a library with a prebuilt signature database can also do it, as sketched next.
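
A minimal sketch using the third-party python-magic package, assuming it is installed (pip install python-magic) along with the libmagic system library it binds to:

import magic  # python-magic, a binding to libmagic (assumed installed)
import requests

url = "https://example.com/files/document"  # placeholder URL

resp = requests.get(url, stream=True)
head = resp.raw.read(2048)  # a couple of KB is plenty for libmagic
resp.close()

print(magic.from_buffer(head, mime=True))
# e.g. 'application/pdf'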

Method 4: File Signatures

Magic-number sniffing works, but a library like libmagic is an extra dependency, and we only ever need a handful of leading bytes. We can do the check ourselves using file signatures: short, distinct byte patterns at the start of a file that reliably identify its format. For example, JPEG/JFIF images begin with FF D8 FF E0 in hex, and PDFs start with the ASCII text %PDF-.

We can manually check the first bytes of a file against a list of known signatures to detect its type. Here's an example fetching just the first 1024 bytes:

import requests

# Leading-byte signatures for a couple of common formats
common_signatures = {
    b"\xFF\xD8\xFF\xE0": "image/jpeg",
    b"%PDF-": "application/pdf",
}

def get_file_type_by_signature(url):
    # stream=True defers the body download; we then read only the first chunk.
    # iter_content() also undoes any gzip/deflate Content-Encoding for us.
    response = requests.get(url, stream=True)
    file_start = next(response.iter_content(1024), b"")
    response.close()
    for sig, mime_type in common_signatures.items():
        if file_start.startswith(sig):
            return mime_type
    return None

This only downloads the first kilobyte instead of the entire file. We can build up a database of signatures covering many file types, and there are also libraries like filetype that ship prebuilt signature databases, so we don't have to hand-code them; see the sketch below.
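
For example, a minimal sketch using filetype (assumed installed with pip install filetype; its matchers need at most the first 261 bytes of a file):

import filetype  # assumed installed: pip install filetype
import requests

url = "https://example.com/files/document"  # placeholder URL

resp = requests.get(url, stream=True)
head = next(resp.iter_content(261), b"")  # filetype needs <= 261 leading bytes
resp.close()

kind = filetype.guess(head)
print(kind.mime if kind else None)
# e.g. 'application/pdf'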

Method 5: Browser Detection

An alternative approach is to automate a headless browser to access the URL, and then extract the Content-Type from the response headers. This has the advantage of executing any redirects and JavaScript on the page like a normal browser would.

Here's an example using Selenium and Chrome:

from selenium import webdriver

options = webdriver.ChromeOptions() 
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

url = "https://example.com/files/document"
driver.get(url)

content_type = driver.execute_script("return document.contentType;")

print(content_type)
# application/pdf

driver.quit()

We disable the Chrome GUI with headless mode, fetch the page, and use JavaScript to read the document.contentType property, which holds the MIME type. This method is slower, but it behaves exactly like a real browser and can also render or screenshot the content if needed.

Comparing the Methods

Let's recap the key differences between the techniques:

  • File extension – Fast and simple, but only works if the URL contains an extension.
  • HTTP HEAD – Checks headers without downloading the body, but relies on an accurate Content-Type header.
  • guess_type() – One-call, extension-based lookup from the standard library; no network access, but no help without an extension.
  • Signatures – Content sniffing against known "magic numbers", downloading only the first bytes of the file.
  • Browser – Slowest, but behaves like a real browser, handling redirects and JavaScript.

As a general guideline:

  • Try file extension or HTTP HEAD first for efficiency.
  • Fall back to signature sniffing if those fail.
  • Only use a browser when executing JavaScript is required.

The right method depends on your specific use case and performance requirements. In practice you can chain them, as in the sketch below.
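
As a minimal sketch, assuming the helper functions from the earlier methods are in scope, the guideline translates into a short fallback chain:

def detect_file_type(url):
    # 1. Cheapest check: the extension in the URL itself
    mime = get_file_type_by_extension(url)
    if mime:
        return mime
    # 2. One lightweight HEAD request for the server's declared type
    mime = get_file_type_from_head(url)
    if mime:
        return mime
    # 3. Last resort: download the first bytes and sniff signatures
    return get_file_type_by_signature(url)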

Handling Binary Data

When working with binary files like images, videos, and executables, we need to handle the HTTP response body as raw bytes via the content attribute:

import requests

resp = requests.get(url)
data = resp.content  # raw bytes; resp.text would decode to str

with open("image.jpg", "wb") as f:
    f.write(data)

The content attribute always returns the body as bytes (text decodes it to a string), and opening the file in "wb" mode writes those bytes to disk unchanged.
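
For large files it's better not to hold the whole body in memory. A sketch using stream=True with iter_content (the chunk size and output filename are arbitrary choices here):

import requests

resp = requests.get(url, stream=True)  # defer downloading the body
with open("download.bin", "wb") as f:  # hypothetical output filename
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        f.write(chunk)
resp.close()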

Handling Compression

Some file types like JSON and XML are often served compressed with gzip or deflate. The MIME type still lives in the Content-Type header; a separate Content-Encoding header names the compression algorithm applied on top of it.

Note that requests transparently decompresses gzip and deflate when you read resp.content or resp.text. Manual decompression is only needed if you read the undecoded wire bytes, for example through resp.raw:

import gzip
import zlib
import requests

resp = requests.get(url, stream=True)
raw = resp.raw.read()  # undecoded bytes, exactly as sent over the wire

encoding = resp.headers.get("Content-Encoding")
if encoding == "gzip":
    data = gzip.decompress(raw)
elif encoding == "deflate":
    data = zlib.decompress(raw)  # deflate is zlib-wrapped per the HTTP spec
else:
    data = raw

print(data)  # decompressed content

Checking the Content-Encoding header tells you whether, and how, the body needs decompressing before you inspect it.

Client-Side Content Sniffing

Note that the server MIME type from HTTP headers is not always reliable. Some misconfigured servers report incorrect Content-Type values like application/octet-stream for all files. As a fallback, browsers like Chrome will perform client-side content sniffing. This ignores the server MIME type and tries to determine the file type based on its magic numbers after downloading it.

We can replicate client-side content sniffing in Python by checking the magic numbers as a fallback whenever the server-provided MIME type is missing, suspicious, or generic, as in the sketch below. This handles misconfigured servers that don't report accurate file types in their headers.
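
A minimal sketch of that fallback, reusing the helpers defined earlier and treating the catch-all octet-stream types as suspicious:

GENERIC_TYPES = {None, "application/octet-stream", "binary/octet-stream"}

def get_file_type_sniffing(url):
    mime = get_file_type_from_head(url)
    if mime in GENERIC_TYPES:
        # The server's type is missing or generic: trust the bytes instead
        return get_file_type_by_signature(url)
    return mime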

Conclusion

Determining the file type of a URL is a common need in Python applications that deal with web resources, and the best method depends on the situation. Checking the file extension and the HTTP headers is fast and usually accurate, while content sniffing against known signatures is a reliable fallback when servers are misconfigured or URLs carry no extension.

This guide aims to provide a thorough understanding of how to identify file types from URLs in Python, ensuring you are well-equipped to manage web resources effectively.

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator focused on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to developers of all levels, from beginners getting started with web scraping to experienced programmers looking to sharpen their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching and simplifying complex concepts to make them accessible to a wider audience. In addition to my YouTube channel, I maintain a personal website where I share my coding projects and other related content.
