File Carving With Machine Learning

In the world of digital forensics, “file carving” is the act of recovering deleted files from fragments, whether those fragments have been left behind on a fragmented hard drive or are the by-products of a partial overwrite of a disk sector, in the absence of the filesystem metadata that would ordinarily make the file available.

Cryptography Fundamentals

In this post I’ll take a very brief and broad look at some of the core principles and fundamental aspects of cryptography. As a practice it has been around for almost as long as humanity itself, but as a science its history is far more recent, spanning a few decades at most. It will come as no surprise that the growth of the Internet, or perhaps more accurately of data communications, which predates the Internet by a decade or three, has been one of the main drivers of the growing need for cryptography. The prevalence and penetration of the Internet has now reached such a level that there is barely a business in the developed world that does not rely on effective encryption for its survival.

So, this post is about security and the role played by cryptographic technology in data security. It’s a bare introduction, but I hope to add links to blog posts which examine some of the more important areas more deeply, so look out for them. The ability to secure data against unauthorised, compromising access while it is in storage or in transit is now a critical function of information technology. Indeed, all forms of e-commerce, such as credit card processing, equities trading or general banking data processing, would, if compromised, cost the unfortunate organisations billions of dollars/pounds/whatever, not to mention the devastating loss of confidence going forward.

So let’s look at a few high-level topics to get going.

Information Theory

The fundamentals of information theory were famously defined by none other than the father of the information age, Claude Shannon, in his seminal 1948 work, A Mathematical Theory of Communication. In this paper he defines the problem to which information theory purports to be the solution, and proves many mathematical results besides. More importantly, his work acted as a foundation upon which, time after time, the foremost minds of our science have built, giving us the mature and usable communications mathematics we have today. Shannon’s paper did not come a moment too soon: this was an age in which the atomic bomb had just been developed, we were still a decade from the polio vaccine and Sputnik, and, most pertinently, transistors were only just beginning to replace vacuum tubes in the design of digital machines (a development that itself owes much to Shannon). Information theory gave engineers, mathematicians and scientists the tools they needed to analyse how well their machines were transmitting data to and from one another.

Figure 1 above shows Shannon’s communication model. The information source produces a message. The transmitter creates the signal to be transmitted through a channel; the channel is the medium carrying the signal. The receiver accepts the signal from the channel and transforms it back into the original message. The destination is the intended recipient of the message. The noise source introduces errors during transmission: it interferes with the signal, distorting it and thereby corrupting the transmitted message. Cryptography uses the same model. Shannon also discussed two other basic concepts in his paper, confusion and diffusion, and we’ll look at these in more detail below.

In his paper, Shannon also explained that the statistical frequencies of repetition in the cipher-text (the encrypted text) and the key (usually a string of numbers or letters, stored in a file, which, when processed through a cryptographic algorithm, can encrypt or decrypt data) should be as low as possible. In other words, they should appear as random as possible, exhibiting no discernible patterns. This is what he meant by confusion. Looking at the key again, if a small part of the key is changed, this should have a widespread knock-on effect, changing the cipher-text throughout its length. This is what he meant by diffusion. Without sufficient confusion and diffusion, it may be possible to deduce the key by analysing the original plaintext alongside the corresponding cipher-text.

Entropy

Entropy is defined in the dictionary as a lack of order or predictability, or as a gradual decline into disorder. In cryptography it represents the amount of randomness in a transmission. Any cryptographic algorithm should produce cipher-text with as much entropy as possible in order to hide the original plaintext from anybody who might examine the corresponding cipher-text. Plaintext transmissions contain order that can be discerned by an observer even if they do not understand the language being used, and it is this order that leads to an ultimate understanding, whether it is a language to be decoded or an encrypted stream of text. Discernible order is thus the enemy of encryption, and entropy is therefore to be encouraged.
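To make the idea concrete, here is a small Python sketch (my own illustration, not part of any standard) that estimates the empirical entropy of a byte string. Well-encrypted or truly random data should score close to the maximum of 8 bits per byte, while natural-language plaintext scores considerably lower.

```python
# Illustrative sketch: estimating Shannon entropy of a byte string, in bits
# per byte. English plaintext typically scores well below 8, while output
# from a good cipher should sit very close to 8.
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the empirical entropy of `data` in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(b"to be or not to be, that is the question" * 10))
print(shannon_entropy(os.urandom(400)))  # should be close to 8.0
```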

Random number generation

Randomness is essential for effective encryption, and you may be surprised to learn that a computer cannot easily generate it. Everything a computer does is under the instruction of an underlying process or algorithm, and it is for this reason that generating randomness is a significant challenge. This may come as no surprise to the lay reader, since a computer may seem to be the ultimate expression of determinism, but it is genuinely difficult to see how one might mechanistically program a function or algorithm for generating randomness. Of specific relevance to this post, cryptographic processes and algorithms require randomness to be secure: random numbers are needed for key distribution, session key generation, generating keys for cryptographic algorithms, and producing bit streams and initialisation vectors (IVs).

For computers to generate random numbers, they must capitalise on sources of randomness and unpredictability. Shannon posited that randomness requires two components: a uniform distribution and independence. In a uniform distribution, zeros and ones occur with equal frequency. Independence means that no bit can be inferred from the others. Unpredictability is also required: each number must be statistically independent of the other numbers in the sequence. The random numbers a computer produces are either true random numbers or pseudorandom numbers, generated respectively by a true random number generator (TRNG), also known simply as a random number generator (RNG), or by a pseudorandom number generator (PRNG).

As stated previously, deterministic computers cannot generate random numbers, at least not without external assistance. A TRNG must therefore use an external, non-deterministic source of entropy together with some form of function that takes the externally provided randomness as input and produces a suitable random number (an entropy distillation process). The input is typically a source of entropy from the physical environment, such as keystroke timing patterns, disk electrical activity or mouse movements; this source is combined with the processing function to generate random output in the required form.
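As a quick illustration of the difference, the snippet below contrasts Python’s deterministic, seedable PRNG with the operating system’s entropy-backed generator exposed through os.urandom and the secrets module, which draw on exactly the kind of environmental entropy pool described above.

```python
# Illustration: a seeded PRNG is fully deterministic and reproducible, whereas
# os.urandom / secrets draw on the operating system's entropy pool (hardware
# events, timing jitter, etc.) and are suitable for keys, nonces and IVs.
import os
import random
import secrets

# Pseudorandom: the same seed always yields the same sequence - never use for keys.
prng = random.Random(42)
print([prng.randint(0, 255) for _ in range(4)])

prng_again = random.Random(42)
print([prng_again.randint(0, 255) for _ in range(4)])  # identical to the above

# Cryptographically strong randomness from the OS entropy pool.
print(os.urandom(16).hex())        # 128-bit value, different on every run
print(secrets.token_hex(16))       # convenience wrapper for the same purpose
```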

Freshness

In order to understand the concept of freshness properly, it helps first to examine the replay attack. In a replay attack, an attacker records the network traffic involved in logging in to a system and then plays the recording back, effectively resending what had been sent before, to gain access for themselves, even though the recorded traffic they resend may have been hashed or obfuscated.

The notion of freshness is something of an abstraction, which tends to make it less than intuitive, but it effectively means ensuring that each time we transact with a system in order to access it, the interaction traffic is never the same as it was before. In a world where we don’t want to change our passwords every time we log off, that sounds like a challenge, but it turns out not to be too difficult.

The way it is achieved is for the server to generate a one-time pseudorandom number, a number used only once (abbreviated to nonce), which is sent to the client at the start of the authentication exchange. The nonce is typically concatenated with the password before hashing and transmission to the server, so that it becomes impossible to anticipate the required transmission and successfully complete a replay attack.
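As a purely illustrative toy example (real protocols use salted password verifiers and HMAC rather than a bare hash, and the function names here are my own), the sketch below shows a nonce being concatenated with the password before hashing, so that each login attempt produces a different response.

```python
# Toy sketch of nonce-based freshness - illustrative only.
import hashlib
import secrets

def server_issue_nonce() -> bytes:
    # A fresh, unpredictable number used once per authentication attempt.
    return secrets.token_bytes(16)

def client_response(password: str, nonce: bytes) -> bytes:
    # The nonce is concatenated with the password before hashing, so the
    # transmitted value is different on every login and cannot be replayed.
    return hashlib.sha256(nonce + password.encode()).digest()

def server_verify(expected_password: str, nonce: bytes, response: bytes) -> bool:
    expected = hashlib.sha256(nonce + expected_password.encode()).digest()
    return secrets.compare_digest(expected, response)

nonce = server_issue_nonce()
resp = client_response("correct horse battery staple", nonce)
print(server_verify("correct horse battery staple", nonce, resp))  # True
```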

One-time pad (OTP)

One-time pads are a mainly theoretical encryption device which, in theory, provides the strongest possible cipher. This carries some caveats, however, in that the key must be provided and used properly within a strict set of rules. The theory behind the one-time pad is that the key must be at least as long as the plaintext message and must be truly random. The key and the plaintext are then combined using the most fundamental of encryption devices, the modulo-2 adder, otherwise known as an exclusive OR (XOR) gate. The result, given a secure key which has not been compromised, is cipher-text which has no direct relation to the original plaintext. To decrypt, the same key is used and the operation is reversed. For this to be completely secure, even in theory, the following rules must be observed without exception:

  • The OTP (key) MUST be truly random
  • The OTP must be at least as long as the plaintext original
  • Only two copies of the OTP should exist
  • The OTP must be used only once
  • Both copies of the OTP must be destroyed immediately after use

The OTP process is absolutely safe if, and only if, the preceding rules are strictly obeyed. Before computers this was a time-consuming and error-prone task, which rendered it all but impractical unless carried out by machines such as dedicated telegraphy encryption/decryption devices; nowadays, however, it can readily be automated by a computer.
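A minimal sketch of the scheme in Python is shown below; os.urandom stands in for a true random source, which is an approximation of the first rule rather than a guarantee of it, and the message is of course illustrative.

```python
# Minimal one-time pad sketch: the pad is drawn from the OS entropy source,
# is exactly as long as the message, and XOR (modulo-2 addition) both
# encrypts and decrypts.
import os

def otp_xor(data: bytes, pad: bytes) -> bytes:
    if len(pad) < len(data):
        raise ValueError("the pad must be at least as long as the message")
    return bytes(d ^ k for d, k in zip(data, pad))

message = b"ATTACK AT DAWN"
pad = os.urandom(len(message))          # truly random (as far as the OS allows), same length
ciphertext = otp_xor(message, pad)
recovered = otp_xor(ciphertext, pad)    # the same operation reverses itself
print(ciphertext.hex())
print(recovered)                        # b'ATTACK AT DAWN'
# The remaining rules (two copies only, used once, destroyed after use) are
# procedural and cannot be enforced in code.
```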

Perhaps surprisingly, manual OTP ciphers are still used today for sending secret messages to agents (spies) via what are known as numbers stations, or one-way voice links (OWVL), both of which typically use HF transmission and can routinely be heard on the short-wave (HF) radio bands.

Avalanche effect

When it comes to encryption, one of the most attractive properties of a cryptographic algorithm is known as the avalanche effect: a small change to the input, such as flipping a single bit of the plaintext or of the key, produces a drastically different cipher-text. As previously discussed, this is closely related to Shannon’s confusion and diffusion (remember, Shannon defined these as desirable). We can therefore measure or compare the efficacy of two different encryption algorithms with reference to the avalanche effect they bring about. The plaintext and the encryption key are represented in binary before the encryption process. The avalanche effect is measured by changing one bit in the plaintext while keeping the key constant and, separately, by changing one bit in the encryption key while keeping the plaintext constant, then counting how many cipher-text bits change in each case. Empirical results show that the most secure algorithms exhibit the strongest avalanche effect.
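To make this concrete, here is a rough sketch of such a measurement, assuming the third-party cryptography package is installed; aes_encrypt_block and differing_bits are my own illustrative helper names, and ECB mode is used only to isolate a single block for the comparison.

```python
# Illustrative avalanche measurement: flip one plaintext bit while holding the
# key constant; a strong cipher such as AES should change roughly half of the
# 128 ciphertext bits. Requires `pip install cryptography`.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_encrypt_block(key: bytes, block: bytes) -> bytes:
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return encryptor.update(block) + encryptor.finalize()

def differing_bits(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

key = os.urandom(16)
plaintext = bytearray(os.urandom(16))
ct1 = aes_encrypt_block(key, bytes(plaintext))

plaintext[0] ^= 0x01                      # flip a single plaintext bit
ct2 = aes_encrypt_block(key, bytes(plaintext))

print(f"{differing_bits(ct1, ct2)} of 128 ciphertext bits changed")
```

The same experiment can be repeated by flipping one bit of the key instead of the plaintext, as described above.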

Kerckhoffs’ principle

Auguste Kerckhoffs was a linguist and military cryptographer of the late 19th century who published many essays on the cryptography of his day. A statement attributed to him from 1883 reads:

‘Military cipher systems should not require secrecy, and it should not be a problem if they fall into enemy hands; a cryptographic system should be secure even if everything about the system, except the key, is public knowledge’

Auguste Kerckhoffs (1883)

(‘Kerckhoffs’s principle’, 2015). In other words, the key must be kept secret, not the algorithm.

XOR function

The XOR function has a unique place in cryptography, and you may well ask: why this function? Why not the AND gate or the OR gate? The answer is strikingly simple: the XOR operation is reversible. If a string of binary data is passed through the XOR function along with a key, then the output, when passed through the XOR function again with the same key, reproduces the original input.

The XOR is a binary operation usually described as a logic gate, in a way that facilitates analysis with Karnaugh maps and the rest of the digital logic canon. Taken out of the realm of digital logic circuits, however, it is simply a way of expressing modulo-2 addition without a carry. We will investigate the utility of XOR/modulo-2 adders in more detail in future posts, and I will add links to them below as they are published.
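A few lines of Python illustrate the point; the values are arbitrary.

```python
# Why XOR, unlike AND or OR, suits encryption: applying the same key twice
# returns the original value, so no information is lost.
a, k = 0b10110010, 0b01101100

xored = a ^ k
print((xored ^ k) == a)   # True: (a XOR k) XOR k == a

# AND and OR destroy information, so the original cannot be recovered:
print(bin(a & k))   # some bits are forced to 0 regardless of a
print(bin(a | k))   # some bits are forced to 1 regardless of a
```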

Confidentiality, Integrity and Availability

Confidentiality, Integrity and Availability, collectively known in the cyber security world as the CIA triad, represent the three main elements of a fundamental model of the properties of a secure system. Consider, for example, a banking application whose security must be beyond doubt: it must exhibit confidentiality (it must prevent any unauthorised access), integrity (it must be a true reflection of the reality of the bank accounts and safe from unauthorised modification) and availability (it must be usable whenever it is needed). In cryptography we are not overly concerned with availability, as it falls largely outside the field, yet a secure system will most certainly lose its availability if it loses its confidentiality or suffers a compromise of its integrity.

Again, we will look at these principles in more detail in other posts, so it is sufficient for now to state them in this context.

Non-repudiation

Were it not for the fact that CIA makes such a handy mnemonic, non-repudiation would very likely be listed alongside confidentiality, integrity and availability as the fourth principle of a secure system. As it is, it is an addendum to the axiom, but an essential one nonetheless.

In a secure system we can think of non-repudiation as the application of an audit trail, provided perhaps by logging. By logging every transaction, interaction and modification of a system, and, more importantly, the identity of the entity responsible for initiating each of those actions, we ensure that no party can later deny their involvement.

By providing mechanisms which ensure non-repudiation, we provide an important means of ensuring the survivability of a system through a compromise and beyond, hopefully restored to a secure state once again.

For readers familiar with the reactive/proactive bow-tie model of system assurance, non-repudiation sits very firmly on the reactive side, but it is essential to a survivable, robust, secure environment.

Data origin and entity authentication

The subject of the authentication of data origin and entities is a complex and detailed one. Essentially, it is the affirmation that the entity believed to be the source of some data in an interaction really is who or what it is believed to be. Ultimately, the receiver needs confidence that the message has not been intercepted and/or modified in transit, and this is accomplished by verifying the identity of the source of the message.

So how do you prevent an attacker from manipulating messages in transit between a transaction source and its destination? The major considerations are as follows:

  • The recipient must confirm that messages from a source have not been modified along the way.
  • The recipient must confirm that messages from a source are indeed from that source.

Data origin authentication is the solution to these problems. The recipient, by verifying that messages have not been tampered with in transit (data integrity) and that they originate from the expected sender (data authenticity), confirms both of the considerations bulleted above. Entity authentication assures that entities are currently and actively involved in a communication session; in cryptography this can be achieved using freshness, as discussed earlier.
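A minimal sketch of the idea follows, using an HMAC over the message with a shared key; the message text and variable names are purely illustrative.

```python
# Illustrative data origin authentication with HMAC-SHA-256: only a holder of
# the shared key can produce a tag that verifies, giving the recipient both
# data integrity and data authenticity.
import hmac
import hashlib
import secrets

shared_key = secrets.token_bytes(32)          # known only to sender and recipient
message = b"transfer 100 GBP to account 12345678"

tag = hmac.new(shared_key, message, hashlib.sha256).digest()      # computed by the sender

# The recipient recomputes the tag and compares in constant time.
expected = hmac.new(shared_key, message, hashlib.sha256).digest()
print(hmac.compare_digest(tag, expected))     # True: unmodified and authentic

tampered = b"transfer 999 GBP to account 87654321"
tampered_tag = hmac.new(shared_key, tampered, hashlib.sha256).digest()
print(hmac.compare_digest(tag, tampered_tag)) # False: modification detected
```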

The three states of data

In distributed computer systems such as computer networks of any size, the data held, shared and processed by the computational hosts is defined as being in one of three states: at rest, in transit or in use. Ultimately it is the job of our cryptographic algorithms, indeed their raison d’être, to protect this data, and in each of the three states that job takes on a slightly different form.

Data-at-rest

This is data stored on media such as magnetic disk drives, optical discs or tape; all data at rest must be physically stored on some form of device. Where it requires confidentiality and integrity protection, it must be encrypted. Any discussion of this topic must wrestle with architectural considerations such as whether to encrypt the whole device or just the files themselves; even the additional environmental cost of the energy consumed by large-scale encryption of data at rest falls within this discussion.
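As a simplified illustration of file-level (rather than whole-device) protection, the sketch below uses the third-party cryptography package’s Fernet construction, which provides authenticated encryption; the file name and contents are hypothetical.

```python
# Sketch of protecting data at rest with authenticated encryption (Fernet),
# which covers both confidentiality and integrity of the stored bytes.
# Requires `pip install cryptography`.
from cryptography.fernet import Fernet

key = Fernet.generate_key()     # in practice kept in a key manager, never beside the data
f = Fernet(key)

records = b"account,balance\n12345678,1000.00\n"   # stand-in for a file's contents
token = f.encrypt(records)                          # what would actually be written to disk

with open("records.csv.enc", "wb") as fh:           # hypothetical path
    fh.write(token)

# Decryption fails loudly (InvalidToken) if the stored cipher-text is tampered with.
print(f.decrypt(token) == records)                  # True
```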

Data-in-transit (motion)

Data in transit is data on the move. Moving can mean across space from a satellite to the earth, across oceans through an undersea fibre-optic cable, across free space on a town centre public Wi-Fi system, across an office building through the corporate network, or even across a computer’s architecture from the data bus to the CPU. As you likely suspect, each of these cases has its own nuances, but there are also common, overarching considerations that apply to them all. This is an enormous field of activity, so for now let it suffice to say that the treatment of data in transit must be considered carefully and dealt with appropriately.

Data-in-use (processing)

As you might expect, data in use is data currently being processed by a CPU, or indeed by an end user as it is displayed on a screen. This classification can sometimes seem a little anomalous, and it is certainly the most difficult of the three in which to assure ourselves that the data is protected in all the ways we wish it to be. It is almost a given that data must be decrypted in order to be used, and therefore encryption offers minimal protection to data in this state.

Conclusion

Thank you for reading this far in a post which could perhaps have seemed a little like the contents page of a book. The analogy is, I hope, a sound one, as this post contains some of the most fundamental elements of the cryptographic landscape. I’ve mentioned here and there that I intend to add links from each section of this document to other relevant resources as I post them, and if enough time has passed between my writing this and your landing here to read it, that should be evident above. Check back in the future and look at the links here, or at the categories and tags for the blog, as it is my firm intention to make this a living document.

It’s a blogger’s cliché, but I would really appreciate your comments, good and bad. I’m committed to displaying them all here, and that I will do; whatever the motivation, it’s good to get involved in dialogue, whether here on the blog, on Twitter or elsewhere.

Could ants power Web 3.0 to new heights? OSPF vs ANTS

Having recently completed my latest study on the subject of “Natural and Artificial Intelligence”, I became aware of advances made over the past decade towards a new paradigm of network traffic engineering. This new model turns its back on traditional destination-based solutions (OSPF, EIGRP, MPLS) to the combinatorial problem of decision making in network routing, favouring instead a constructive greedy heuristic which uses stochastic combinatorial optimisation. Put in more accessible terms, it leverages the emergent ability of systems comprised of quite basic autonomous elements, working together, to perform a variety of complicated tasks with great reliability and consistency.

In 1986, the computer scientist Craig Reynolds set out to investigate this phenomenon through computer simulation. The mystery and beauty of a flock or swarm is perhaps best described in the opening words of his classic 1986 paper on the subject:

The motion of a flock of birds is one of nature’s delights. Flocks and related synchronized group behaviors such as schools of fish or herds of land animals are both beautiful to watch and intriguing to contemplate. A flock … is made up of discrete birds yet overall motion seems fluid; it is simple in concept yet is so visually complex, it seems randomly arrayed and yet is magnificently synchronized. Perhaps most puzzling is the strong impression of intentional, centralized control. Yet all evidence indicates that flock motion must be merely the aggregate result of the actions of individual animals, each acting solely on the basis of its own local perception of the world.

An analogy with the way ant colonies function suggests that their emergent ability to reliably and consistently optimise paths could be leveraged to improve how the combinatorial optimisation problem of complex network path selection is solved.

The fundamental difference between modelling a complex telecommunications network and more commonplace problems of combinatorial optimisation, such as the travelling salesman problem (TSP), is the dynamic nature of a network such as the Internet at any given moment. In the TSP, the towns, the routes between them and the associated distances don’t change. Network routing, however, is a dynamic problem. It is dynamic in space, because the shape of the network, its topology, may change: switches and nodes may break down and new ones may come online. It is also dynamic in time, and quite unpredictably so: the amount of network traffic varies constantly, some switches may become overloaded, there may be local bursts of activity that make parts of the network very slow, and so on. Network routing is therefore a very difficult problem of dynamic optimisation, and finding fast, efficient and intelligent routing algorithms is a major headache for telecommunications engineers.

So how, you may ask, could ants help here? Individual ants are behaviourally very unsophisticated insects. They have very limited memory and exhibit individual behaviour that appears to have a large random component. Acting as a collective, however, ants manage to perform a variety of complicated tasks with great reliability and consistency, for example finding the shortest routes from their nest to a food source.



These behaviours emerge from the interactions between large numbers of individual ants and their environment. In many cases, the principle of stigmergy is used. Stigmergy is a form of indirect communication through the environment. Like other insects, ants typically produce specific actions in response to specific local environmental stimuli, rather than as part of the execution of some central plan. If an ant’s action changes the local environment in a way that affects one of these stimuli, this will influence the subsequent actions of ants at that location.

The environmental change may take either of two distinct forms. In the first, the physical characteristics of the environment are changed as a result of carrying out some task-related action, such as digging a hole or adding a ball of mud to a growing structure. The subsequent perception of the changed environment may cause the next ant to enlarge the hole, or to deposit its ball of mud on top of the previous one. In this type of stigmergy, the cumulative effects of these local, task-related changes can guide the growth of a complex structure. This type of influence has been called sematectonic.

In the second form, the environment is changed by depositing something which makes no direct contribution to the task but is used solely to influence subsequent task-related behaviour. This sign-based stigmergy has been highly developed by ants and other exclusively social insects, which use a variety of highly specific volatile hormones, or pheromones, to provide a sophisticated signalling system. It is primarily this second mechanism, sign-based stigmergy, that has been successfully simulated in computer models and applied to network traffic engineering.

In the traditional network model, packets move around the network completely deterministically. A packet arriving at a given node is routed by the device, which simply consults its routing table and takes the optimum path based on the packet’s destination. There is no element of probability, because the values in the routing table represent not probabilities but the relative desirability of moving to other nodes.

In the ant colony optimisation model, virtual ants also move around the network, their task being to constantly adjust the routing tables according to the latest information about network conditions. For an ant, the values in the table are probabilities that its next move will be to a certain node. The progress of an ant around the network is governed by the following informal rules:

  • Ants start at random nodes.
  • They move around the network from node to node, using the routing table at each node as a guide to which link to cross next.
  • As it explores, an ant ages, the age of each individual being related to the length of time elapsed since it set out from its source. However, an ant that finds itself at a congested node is delayed, and thus made to age faster than ants moving through less choked areas.
  • As an ant crosses a link between two nodes, it deposits pheromone; however, it leaves it not on the link itself but on the entry for that link in the routing table of the node it has just left. The other ‘pheromone’ values in that column of the node’s routing table are decreased, in a process analogous to pheromone decay (a toy sketch of this update follows the list).
  • When an ant reaches its final destination it is presumed to have died and is deleted from the system. R.I.P.
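As promised, here is a toy Python sketch of the kind of routing table update described above. It is my own simplification rather than the AntNet algorithm itself, and the deposit and decay parameters are arbitrary illustrative values.

```python
# Toy sketch of an ant's visit updating one node's probabilistic routing
# column: every entry decays slightly, the entry for the link the ant used is
# reinforced, and the column is renormalised so the values remain probabilities.
def update_routing_column(probs: dict, reinforced_link: str,
                          deposit: float = 0.2, decay: float = 0.05) -> dict:
    updated = {link: p * (1.0 - decay) for link, p in probs.items()}  # pheromone decay
    updated[reinforced_link] += deposit                               # pheromone deposit
    total = sum(updated.values())
    return {link: p / total for link, p in updated.items()}           # renormalise

# Probabilities at node A for reaching some destination via neighbours B, C and D.
column = {"B": 0.33, "C": 0.33, "D": 0.34}
for _ in range(5):                      # five ants in a row favour the link to C
    column = update_routing_column(column, "C")
print(column)                           # C's probability grows at the expense of B and D
```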



Testing the ant colony optimisation system and measuring its performance against a number of other well-known routing techniques produced good results, and the system outperformed all of the established mechanisms. There are, however, potential problems of the kind that plague all dynamic optimisation algorithms. The most significant is that, after a long period of stability and equilibrium, the ants become locked into their accustomed routes and are unable to break out of these patterns to explore new routes better suited to the conditions that arise if the network changes suddenly. This can be mitigated, however, in the same way that evolutionary computation introduces mutation to explore new possibilities: by introducing an element of purely random behaviour to the ant.

‘Ant net’ routing has been tested on models of US and Japanese communications networks, using a variety of different possible traffic patterns. The algorithm worked at least as well as, and in some cases much better than, four of the best-performing conventional routing algorithms. Its results were even comparable to those of an idealised ‘daemon’ algorithm, with instantaneous and complete knowledge of the current state of the network.

It would seem we have not heard the last of these routing antics.