Ruby's Marshal library is not quite as blatantly insecure as pickle (it won't do any string interpolation on load), but you shouldn't trust any of these facilities: you're essentially passing data to a very weird variant of eval().
But [edit, should have said this to begin with] pickle isn't an interchange format. It's not supposed to be secure. Python already offers a myriad of good interchange formats. Interchange isn't pickle's job, and if you use it for that, you've made a serious design error.
Ruby unfortunately blurs the line here by using Marshal as an interchange format in some cases. None of those cases are insecure by design (they all allow code execution by design), but the usage does create a confusing precedent.
You're better off with ASN.1/BER than you are with Pickle or Marshal as a file or protocol format; that's how inappropriate Pickle is to the task.
No; Marshal and pickle are very different (and I confused things by talking in Ruby terms and referring to Python). Ruby Marshal isn't a virtual machine. Pickle is more like Flash or Postscript than RTF, which is what Marshal is like.
Regardless of how insecure pickle is, it is the perfect module to learn how easy basic data-persistence can be in Python through serialization. I don't get articles like this, maybe it's cause I'm still learning a lot, but I don't see the value of posting some inference a module or library already explicitly states in the manual or documentation.
Regardless of my opinion, pickle is a great module to get you going in Python and even better for scripting and storing basic data-sets on your local machine.
Proving termination for programs written in models less powerful than a Turing machine does not require you to solve the halting problem. Programs using primitive recursion, finite automata, and regular expressions, for example, can all be proved to terminate, and can express a number of useful computations.
The problem is that the pickle module is far too permissive. In particular, the REDUCE operation invokes a Python callable with an argument tuple on the pickle stack, which means that 'pickles' are at least as powerful (in the Turing-general, halting problem sense) as Python.
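To make the REDUCE point concrete, here is a minimal sketch. The `Demo` class is hypothetical and the callable is deliberately harmless (`sorted`); an attacker would pick something like `os.system` instead. The `pickletools` scan just confirms that the stream really contains a REDUCE opcode.

```python
import pickle
import pickletools


class Demo:
    """Hypothetical class: its pickle tells the loader to call a function."""

    def __reduce__(self):
        # (callable, args): on load, pickle calls callable(*args)
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Demo())

# Scan the opcode stream for REDUCE without executing anything.
uses_reduce = any(op.name == "REDUCE" for op, arg, pos in pickletools.genops(payload))

# Loading does not rebuild a Demo; it runs sorted([3, 1, 2]).
result = pickle.loads(payload)
print(uses_reduce, result)  # True [1, 2, 3]
```

Nothing in the stream restricts which importable callable gets named, which is the whole problem.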
Pickle isn't a data structure parsing system, though. It's a system for dumping out and restoring (parts of) the state of the Python interpreter. If pyasn1 allowed code execution or method invocation, you could say it was broken by design. You can't say that about Pickle.
Since the pickle module doesn't marshal many types of stateful objects (filehandles, functions, etc.), I'm not sure it's fair to describe it as a means of saving the "state of the Python interpreter". It really just serves to save a compact representation of data on disk, without explicitly managing packing and unpacking from the binary serialization.
In that respect, it intuitively seems like it should behave more like the ASN1 or YAML formats w.r.t. the safety of data loaded from it. I may understand the risks, and you may understand the risks, but I think that a new Python programmer could be forgiven for simply scanning the standard library documentation and thinking that the pickle format would be safe for transport across untrusted channels.
No, they are saying that Pickle as it is currently designed is too powerful. Depending on the goals of Pickle, it may be possible to reduce its power by redesigning it to do no more than its goals.
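Short of redesigning the format, one way to approximate a less powerful pickle is to subclass `pickle.Unpickler` and override `find_class`, the hook the standard library documents for restricting globals. The sketch below refuses every global lookup, so only plain containers and scalars load; the names `DataOnlyUnpickler`, `data_only_loads`, and `Evil` are made up for illustration. It limits what a pickle can call, and should not be read as a guarantee against every attack (resource exhaustion, for instance, is still possible).

```python
import io
import pickle


class DataOnlyUnpickler(pickle.Unpickler):
    """Hypothetical loader that rejects every class or function reference."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def data_only_loads(data: bytes):
    return DataOnlyUnpickler(io.BytesIO(data)).load()


# Plain data needs no globals, so it still round-trips.
safe = data_only_loads(pickle.dumps({"a": [1, 2, 3], "b": ("x", 2.5)}))


class Evil:
    def __reduce__(self):
        # Would call print("side effect") under the normal loader.
        return (print, ("side effect",))


try:
    data_only_loads(pickle.dumps(Evil()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(safe, blocked)
```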
You'd have to define exactly what you mean by "completely secure". It's secure now - as in: it will not pickle any user input in a way that it will be executed during unpickling. It's also insecure now in a way that's been described.
So what's news? Unpickling is safe only if you are very sure the pickled data was created by yourself, with the same version of Python. It has always been like this.
Typically you pickle when you wish to offload some objects to persistent storage. It's similarly typical to compute the message digest of the pickle data with some salt padding and store that in a place that you consider relatively safe with respect to your security needs. Then you don't need to worry about unpickling malicious data.
I don't know what "the digest of pickle data with some salt padding" means (when you say "salt", you trip the "talking about crypto using words only lay programmers use" sensor), but it sure doesn't sound crypto-safe. There are a number of easy-to-make errors with digest schemes that allow attackers to make constrained modifications to documents without breaking the digest.
Long story short, don't do this; instead, PGP/GPG encrypt and sign the pickled file. GPG is strictly better across all axes than hand-hacking your own protection scheme.
Or just hash it and keep a whitelist if you don't need to get too heavy-duty. Though I would personally just go for signing the file -- don't necessarily need to encrypt it unless you have sensitive data in there -- then you just have to keep track of the key instead of maintaining a whitelist of good hash values.
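A rough sketch of the "sign it and keep track of the key" idea, using only the standard library. This assumes a keyed MAC (HMAC-SHA256) is an acceptable stand-in for a signature; it is not GPG, and the helper names `sign`, `dumps_signed`, and `loads_signed` are made up for illustration. The tag is checked before `pickle.loads` ever sees the bytes.

```python
import hashlib
import hmac
import pickle
import secrets

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def sign(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()


def dumps_signed(obj, key: bytes) -> bytes:
    data = pickle.dumps(obj)
    return sign(key, data) + data  # tag first, then the pickle


def loads_signed(blob: bytes, key: bytes):
    tag, data = blob[:TAG_LEN], blob[TAG_LEN:]
    # Constant-time comparison; refuse to unpickle on any mismatch.
    if not hmac.compare_digest(tag, sign(key, data)):
        raise ValueError("signature mismatch, refusing to unpickle")
    return pickle.loads(data)


key = secrets.token_bytes(32)  # this is the thing you have to keep safe
blob = dumps_signed({"n": 1}, key)
restored = loads_signed(blob, key)
```

Anyone who holds the key can still forge a malicious pickle, so this only moves the trust to wherever the key lives.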
From what I've read, salt is a commonly used term in cryptology, and refers to certain schemes for key derivation for hashing/encrypting. Salt can either be public (to make brute-force attacks infeasible) or private (for better security).
Almost always when pickling we're only interested in one aspect of security that is integrity: what we put in is what we get out. Encryption isn't needed and signing doesn't really offer much more with regard to this case. Instead, cross-checking against a message digest to make it hard to modify pickles (or any runtime data offloaded to disk) seems to be almost idiomatic. YMMV.
Not that I wouldn't want to write a fancy GPG-based persistent storage, but generally it would be overkill. And overkill, in my experience, is good at blinding developers to other threats. YMMV.
There are many good, proven message digest algorithms that are useful for implementing a simple salted hashing scheme. In practice, an application developer must eventually take algorithms for granted. We consider MD5 demonstrably weak but SHA, especially variations with longer digests, we consider strong enough for most purposes. So given the assumption that we can trust SHA, I'm sure you're familiar with something like the following:
- take the pickle output from pickle.dumps()
- create some random data and use it as salt
- run the pickle output + salt through hashlib.sha512() or whichever you prefer to obtain the message digest
- store the pickle output somewhere, even in public
- store the message digest somewhere, even in public
- store the salt some place safe
- recompute and verify before calling pickle.loads()
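The steps above can be sketched roughly as follows. This follows the list literally (SHA-512 over pickle output plus a secret salt); the sample object and the `verified_loads` helper are made up for illustration. Note that hand-rolled "hash plus secret" constructions have known pitfalls, and the standard library's hmac module is the usual keyed construction, so treat this as an illustration of the steps rather than a vetted scheme.

```python
import hashlib
import os
import pickle

obj = {"user": "alice", "visits": 3}  # sample data, purely illustrative

pickled = pickle.dumps(obj)                          # take the pickle output
salt = os.urandom(32)                                # random salt, stored somewhere safe
digest = hashlib.sha512(pickled + salt).hexdigest()  # digest of pickle output + salt


def verified_loads(pickled: bytes, digest: str, salt: bytes):
    """Recompute the digest and compare before calling pickle.loads()."""
    if hashlib.sha512(pickled + salt).hexdigest() != digest:
        raise ValueError("digest mismatch, refusing to unpickle")
    return pickle.loads(pickled)


restored = verified_loads(pickled, digest, salt)
```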
You still have to have trust in some storage that you consider safe. Computing the salt dynamically from a set of fixed and/or runtime values could be done by anyone, and is merely security by obscurity. However, exactly the same applies to GPG: you would have to store the private and public keys somewhere safe, and go from there.
And finally, as for me, I'd probably create more security holes in implementing the GPG integration than just sticking with the standard Python hashlib.
There are a number of easy-to-make errors with digest schemes that allow attackers to make constrained modifications to documents without breaking the digest.
Makes me fear that data corruption in my pickles will crash the system. Maybe someone should implement an interface to 'pickle' that maintains a hash of the string.
Why are they any more likely to be corrupted than any other file (including your python sources or .pyc files (which use another serialisation scheme, marshal))?
I'm guessing because it's possible for the data to be corrupted in just the right way so as to construct some system-crashing (or critical-data corrupting) system() or eval() call. Though that's a pretty extreme paranoia.
As you state, the Python files themselves could also be corrupted in such a manner and then run in the interpreter, or a compiled program could get corrupted in just the right way to execute 'rm -rf /', though it's not likely.
What's up with YAML? Doesn't Google use it extensively? Is it a better alternative than JSON even if you are sending serialized data from the Javascript in the browser to the web server and back?
YAML is strictly a superset of JSON in its expressiveness, since it allows tagging of serialized objects with a type name. JSON reduces everything to maps, arrays, and scalars, so any type information has to be encoded in (or inferred from) the structures themselves. I also find it to be a bit easier to read, and so prefer it for configuration files or console dumps of data structures.
However, YAML isn't quite as well-supported in the standard libraries of various programming languages, most notably Javascript. Which one to use depends largely on how heavily you expect browsers to consume your service output.
If you are passing around data from Python to Javascript to even Flash, JSON is the best: minimal parsing, a smaller subset of YAML, and the KISS methodology. It doesn't have the verbosity problems that XML has, but it isn't as compact as binary data. JSON converts to Javascript, ActionScript, and Python objects with almost no effort. It's actually not bad as a socket messaging format compared to XML or binary, either.
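For comparison with pickle, a minimal round trip through the standard json module looks like this. The message contents are made up. The point is that `json.loads` can only ever hand back dicts, lists, strings, numbers, booleans, and `None`; there is no opcode that names a callable.

```python
import json

# Hypothetical message of the kind you might send to a browser.
message = {"type": "chat", "user": "alice", "tags": ["a", "b"], "count": 3, "ok": True, "extra": None}

wire = json.dumps(message)   # plain text, safe to hand to Javascript
received = json.loads(wire)  # back to dict/list/str/int/bool/None only

# Malformed input is rejected with an exception rather than interpreted.
try:
    json.loads("{not valid json")
    rejected = False
except json.JSONDecodeError:
    rejected = True
```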