Common Python security pitfalls, and how to avoid them
Python is undoubtedly a popular language. It consistently ranks among the most popular and most loved languages year after year. That's not hard to explain, considering how fluent and expressive it is. Its pseudocode-like syntax makes it extremely easy for beginners to pick up as their first language, while its vast library of packages (including giants like Django and TensorFlow) ensures that it scales up for any task required of it.
Being such a widely used language makes Python a very attractive target for malicious hackers. Let's see a few simple ways to secure your Python apps and keep the black-hats at bay.
Problems and solutions
Python places a lot of importance on zen, or developer happiness. The clearest evidence of that lies in the fact that the guiding principles of Python are summarized in a poem. Try import this in a Python shell to read it. Here are some security concerns that might disturb your zen, along with solutions to restore it to a state of calm.
The OWASP Top Ten, a basic checklist for web security, mentions unsafe deserialization as one of the ten most common security flaws. While it's common knowledge that executing anything coming from the user is a terrible idea, serializing and deserializing user input does not seem equally serious. After all, no code is being run, right? Wrong.
PyYAML is the de facto standard for YAML serialization and deserialization in Python. The library supports serializing custom data types to YAML and deserializing them back to Python objects. See this serialization code here and the YAML produced by it.
Deserializing this YAML gives back the original data type.
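A minimal sketch of this round trip, assuming PyYAML is installed and using a hypothetical Person class with name and age fields (the class and the helper function names are illustrative, not taken from any library):

```python
import yaml  # third-party package: pip install pyyaml


class Person:
    """Hypothetical custom type used only for this illustration."""

    def __init__(self, name, age):
        self.name = name
        self.age = age


def serialize(person):
    # yaml.dump tags the document with the import path of the object's class
    return yaml.dump(person)


def deserialize_unsafely(document):
    # yaml.Loader re-instantiates whatever Python object the tag names
    return yaml.load(document, Loader=yaml.Loader)


if __name__ == "__main__":
    document = serialize(Person("Ada", 36))
    print(document)  # starts with a !!python/object:... tag naming Person
    restored = deserialize_unsafely(document)
    print(type(restored).__name__, restored.name, restored.age)
```

The tag on the first line of the dumped document is what tells the loader which class to rebuild.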
As you can see, the line !!python/object:__main__.Person in the YAML describes how to re-instantiate objects from their text representations. But this opens up a slew of attack vectors that can escalate to remote code execution (RCE) whenever this instantiation can execute code.
The solution, as trivial as it may seem, is to use safe-loading by swapping out the loader yaml.Loader in favor of the yaml.SafeLoader loader. This loader is safer because it completely blocks loading of custom classes.
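As a hedged illustration of the swap, the sketch below feeds a hand-written document carrying a Python object tag to yaml.SafeLoader and checks that it is rejected, while a plain document still loads. The document strings and the load_safely helper are assumptions made for this example.

```python
import yaml  # third-party package: pip install pyyaml

# A document that asks the loader to build a custom Python object
TAGGED = "!!python/object:__main__.Person\nname: Ada\nage: 36\n"

# A document that uses only standard YAML types
PLAIN = "name: Ada\nlanguages:\n  - Python\n  - YAML\n"


def load_safely(document):
    # SafeLoader only constructs standard types and refuses Python object tags
    return yaml.load(document, Loader=yaml.SafeLoader)


try:
    load_safely(TAGGED)
    rejected = False
except yaml.YAMLError as exc:
    rejected = True
    print(f"rejected: {exc}")

print(rejected)            # True
print(load_safely(PLAIN))  # {'name': 'Ada', 'languages': ['Python', 'YAML']}
```

yaml.safe_load(document) is a shorthand for the same call.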
Standard types like dictionaries and lists (hashes and arrays, in YAML terms) can still be serialized to and deserialized from YAML documents just like before. Most people, probably including you, won't even notice the difference.
Python has a pair of very dangerous functions: exec and eval. Both are very similar in terms of what they do: process the strings passed to them as Python code. exec expects the string to be a statement, which it will execute and not return a value. eval expects the string to be an expression and will return the computed value of the expression.
Here is an example of both these functions in action.
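A small sketch of the difference, using a throwaway namespace dictionary so that nothing leaks into the surrounding module (the variable names are illustrative):

```python
# exec runs statements; it always returns None
namespace = {}
exec_result = exec("x = 6 * 7", namespace)
print(exec_result)     # None
print(namespace["x"])  # 42

# eval evaluates a single expression and returns its value
eval_result = eval("6 * 7")
print(eval_result)     # 42
```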
You could, in theory, pass an expression with side effects (such as a function call) to eval and get an effect similar to exec, because in Python returning None is virtually the same as not returning anything at all. A true statement, such as an assignment, raises a SyntaxError in eval unless it is first compiled in exec mode.
The danger of these functions lies in their ability to execute virtually any code in the same Python process. Passing them any input that you cannot be 100% certain about is akin to handing your server keys to malicious hackers on a plate. This is the very definition of RCE.
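One nuance is worth checking here: eval only accepts expressions. A call with side effects works and returns None, but an assignment is rejected unless it is compiled in exec mode first. A short sketch, with illustrative variable names:

```python
# A function call is an expression, so eval runs it and returns its value (None)
call_result = eval("print('hello from eval')")
print(call_result)  # None

# An assignment is a statement, so eval refuses it outright
try:
    eval("x = 1")
    assignment_rejected = False
except SyntaxError:
    assignment_rejected = True
print(assignment_rejected)  # True

# A code object compiled in "exec" mode can hold statements; eval then returns None
scope = {}
compiled_result = eval(compile("x = 1", "<string>", "exec"), scope)
print(compiled_result, scope["x"])  # None 1
```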
There are ways to mitigate the access that eval has. You can restrict access to globals and locals by passing dictionaries as the second and third arguments to eval respectively. Remember that locals take priority over globals in case of conflict.
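A sketch of that restriction, assuming a module-level variable named secret that the expression should not be able to read (all names here are made up for the example):

```python
secret = "hunter2"

# With no namespaces given, the expression sees the caller's variables
leaked = eval("secret")
print(leaked)  # hunter2

# Explicit globals and locals limit which names the expression can see
allowed_globals = {"x": 1, "y": 2}
allowed_locals = {"x": 10}

total = eval("x + y", allowed_globals, allowed_locals)
print(total)  # 12, because the local x shadows the global x

try:
    eval("secret", allowed_globals, allowed_locals)
    hidden = False
except NameError:
    hidden = True
print(hidden)  # True
```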
That makes the code safer, yes. At least it somewhat prevents the data in your variables from being leaked.
But it still doesn't prevent the string from accessing any built-ins like pow, or more dangerously, __import__. To counter that, you need to override __builtins__.
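The sketch below shows the effect of overriding __builtins__ with an empty dictionary. The is_blocked helper is hypothetical and exists only for this example. Note that this trick is known to be bypassable through attribute access on ordinary objects, so treat it as damage limitation rather than a sandbox.

```python
# With an ordinary globals dict, Python silently adds the real builtins
print(eval("pow(2, 10)", {}))  # 1024

# An explicit, empty __builtins__ removes pow, __import__ and the rest
NO_BUILTINS = {"__builtins__": {}}


def is_blocked(expression):
    """Return True if the expression fails on a missing built-in name."""
    try:
        eval(expression, dict(NO_BUILTINS))
    except NameError:
        return True
    return False


print(is_blocked("pow(2, 10)"))                 # True
print(is_blocked("__import__('os').getcwd()"))  # True
print(is_blocked("1 + 1"))                      # False, plain arithmetic still works
```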
Safe to expose now? Not quite. Regardless of how secure you make eval, its job is to evaluate an expression, and nothing can stop that expression from taking too long and freezing the server.
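To make that concrete, here is a sketch in which a thirteen-character expression, evaluated with no built-ins at all, produces an integer more than a million bits long. The expression was chosen so that it still finishes quickly; a slightly taller tower such as 9 ** 9 ** 9 can be expected to keep the interpreter busy far longer and use far more memory, so it is deliberately not run here.

```python
import time

NO_BUILTINS = {"__builtins__": {}}

start = time.perf_counter()
# ** is right-associative, so this is 9 ** (9 ** 6) = 9 ** 531441
huge = eval("9 ** 9 ** 6", NO_BUILTINS)
elapsed = time.perf_counter() - start

print(f"{huge.bit_length()} bits computed in {elapsed:.3f}s")
```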
As we Python developers like to say, "eval is evil."
Python's popularity draws the attention of white-hat security researchers just as much as it does that of hackers with malicious intent. As a result, new security vulnerabilities are constantly discovered, disclosed, and patched. To keep the malicious hackers at bay, your software needs to keep all its dependencies up to date.
A common technique for pinning packages in Python is the ubiquitous requirements.txt file, a simple file that lists all the dependencies and exact versions needed by your project.
Let's say you install Django. As of writing, Django depends on three more packages. If you freeze your dependencies, you end up with the following requirements. Note that only one of these dependencies was installed by you; the other three are sub-dependencies.
pip freeze does not place dependencies in levels, and that's a problem. For smaller projects with a few dependencies that you can keep track of mentally, this is not a big deal, but as your projects grow, so will your top-level dependencies. Conflicts arise when sub-dependencies overlap. Updating individual dependencies is a mess too, because the graph relationship of these dependencies is not clear from the plain text file.
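The missing levels can be partly recovered from the metadata of the installed packages. The sketch below is an approximation built on the standard library's importlib.metadata: it treats a package as top-level when no other installed package declares it as a requirement. The function names are hypothetical, the requirement parsing is deliberately crude, and optional extras are skipped.

```python
import re
from importlib import metadata


def normalize(name):
    # Normalize so that "Django", "django" and "typing_extensions" compare reliably
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    # Take the project name off the front of a requirement string
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else None


def top_level_packages():
    """Installed packages that no other installed package requires."""
    installed = set()
    required_by_others = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if not name:
            continue  # skip distributions with broken metadata
        installed.add(normalize(name))
        for requirement in dist.requires or []:
            if re.search(r"extra\s*==", requirement):
                continue  # optional extras are not hard dependencies
            required_by_others.add(requirement_name(requirement))
    return installed - required_by_others


if __name__ == "__main__":
    for package in sorted(top_level_packages()):
        print(package)
```

In an environment where only Django was installed by hand, Django should appear in this list while its sub-dependencies should not, although tools such as pip itself will also show up.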
Pipenv and Poetry are two tools that help you manage dependencies better. I prefer Pipenv but Poetry is equally good. Both package managers build on top of pip. Fun fact: DeepSource is compatible with both Pipenv and Poetry as package managers.
Pipenv, for example, tracks your top-level dependencies in a Pipfile and then does the hard work of locking down dependencies in a lockfile named Pipfile.lock, similar to how npm manages Node.js packages. Here's an example Pipfile.
With this, you can get a clear picture of the top-level dependencies of your app. Updating dependencies is also much easier because you just need to update the top-level packages and the locking algorithm will figure out the most compatible and up-to-date versions of all the sub-dependencies.
Here is the same example with Django. Notice how Pipenv can identify your top-level dependencies, their dependencies, and so on.
If your code is hosted on GitHub, make sure you turn on and configure Dependabot as well. It's a nifty little bot that alerts you if any of your dependencies has gone out of date or if a vulnerability has been identified in the pinned version of a dependency. Dependabot will also make PRs to your repo, automatically updating your packages. Very handy indeed!
Python has a special assert keyword for guarding against unexpected situations. The purpose of assert is simple: verify a condition and raise an error if the condition is not fulfilled. In essence, assert evaluates the given expression and
- if it evaluates to a truthy value, moves along
- if it evaluates to a falsey value, raises an AssertionError with the given message
Consider this example.
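A minimal sketch of such a guard, assuming a hypothetical User class and a delete_records function invented for this illustration:

```python
class User:
    """Hypothetical user model used only for this illustration."""

    def __init__(self, name, is_admin):
        self.name = name
        self.is_admin = is_admin

    def has_permissions(self):
        return self.is_admin


def delete_records(user):
    # The assert is the only thing standing between the user and the action
    assert user.has_permissions(), "user lacks the required permissions"
    return f"records deleted by {user.name}"


print(delete_records(User("root", is_admin=True)))  # records deleted by root
```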
This is a very simple example where we check if the user has the permissions to perform an action and subsequently perform the given action. In this case, if user.has_permissions() returns False, our assertion would cause an AssertionError and the execution would be halted. Seems pretty safe, right?
No. Assertions are tools for developers during the development and debugging phases, and should not be used to guard critical functionality. When Python is run with the -O or -OO flag (or with the PYTHONOPTIMIZE environment variable set), the built-in constant __debug__ is False and assert statements are removed from the compiled code to optimise it for performance. Removing assert statements from the compiled code leaves the function unguarded.
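The effect is easy to observe by running the same one-line program in a child interpreter with and without the -O flag. This sketch assumes PYTHONOPTIMIZE is not set in the environment; the SNIPPET string and run helper are made up for the example.

```python
import subprocess
import sys

# The assert always fails, so the print should be unreachable
SNIPPET = "assert False, 'blocked'; print('reached')"


def run(flags):
    # Run SNIPPET in a fresh interpreter with the given command-line flags
    return subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )


normal = run([])
optimized = run(["-O"])

print(normal.returncode != 0)    # True: the AssertionError stopped the program
print(optimized.stdout.strip())  # reached: the assert was stripped out
```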
Another reason to avoid asserts is that assertion errors are not helpful to debuggers as they provide no information other than the fact that an assertion did not hold up. Defining apt exception classes and then raising their instances is a much more solid solution.
For an alternative approach, go back to the basics. Here's the same program, this time using if/else and raising a PermissionError when the required condition is unfulfilled. (Python 3 ships a built-in PermissionError intended for operating-system errors, so you may prefer to define your own exception class.)
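A sketch of that approach, using a hypothetical application-level exception named AppPermissionError so that Python 3's built-in PermissionError is not shadowed. The User class and delete_records function are likewise invented for the example.

```python
class AppPermissionError(Exception):
    """Hypothetical error raised when a user may not perform an action."""


class User:
    """Hypothetical user model used only for this illustration."""

    def __init__(self, name, is_admin):
        self.name = name
        self.is_admin = is_admin

    def has_permissions(self):
        return self.is_admin


def delete_records(user):
    # An explicit check survives -O and raises a descriptive, catchable error
    if not user.has_permissions():
        raise AppPermissionError(f"{user.name} may not delete records")
    return f"records deleted by {user.name}"


try:
    delete_records(User("guest", is_admin=False))
except AppPermissionError as exc:
    print(f"denied: {exc}")  # denied: guest may not delete records
```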
This code uses straightforward Python constructs, works the same with __debug__ set to True or False and raises clear exceptions that can be handled with much more clarity.
The key takeaway from all of these examples is to never trust your users. Input provided by users should not be serialized and deserialized, evaluated, executed or rendered. To be safe, you must be careful about what you write and thoroughly audit the code after it's been written.
Do you know what's better than scanning for vulnerabilities in code after it's been written? Getting vulnerabilities highlighted as soon as you write code with them.
Static Code Analysis tools such as linters and vulnerability scanners can help you find a lot of issues before they get exploited in the wild. An excellent tool for finding security vulnerabilities in Python is Bandit. Bandit goes through each file, generates an abstract syntax tree (AST) for it and then runs a whole slew of tests on this AST. Bandit can detect a whole bunch of vulnerabilities out-of-the-box and can also be extended for specific scenarios and compatibility with frameworks via plugins. As a matter of fact, Bandit is capable of detecting all of the aforementioned security shortcomings.
If you're a Python developer, I cannot recommend Bandit enough. I could write a whole article extolling its virtues. I might even do that.
You should also consider automating this entire audit and review process using code review automation tools like DeepSource, which scans your code on every commit and for every PR through its linters and security analyzers, and can automatically fix a multitude of issues. DeepSource also has its own custom-built analyzers for most languages that are constantly improved and kept up to date. And it’s incredibly easy to set up!
Who knew it could be so simple?
Experience the zen of Python, and be careful not to let black-hats disturb your peace!