Computer science researchers have developed an all-purpose system that makes it easy to crunch through common kinds of big datasets up to 100 times faster than conventional methods. The new approach can also reduce file sizes by up to nine orders magnitude.
The system was presented by a team led by scientists from the Massachusetts Institute of Technology to the Association for Computing Machinery’s Conference on Systems, Programming, Languages and Applications in Vancouver, Canada, in late October. It automatically produces computer code that is optimised for what programmers call “sparse data”.
A typical example of sparse data would be a table (a matrix, in the jargon) containing a list of a shop’s customers, say, against a list of its products. Each entry in the matrix contains a 1 if the customer in that row has bought the product in that column, and a 0 otherwise. The 1s are the interesting part of the data – that is, who bought what – but, since most customers have not bought most products, the great majority of the matrix will be full of 0s.
If the matrix is expanded to include more kinds of information – customers’ birthdays, say, or whether they shop online or in person – extra dimensions are added and it becomes what mathematicians call a tensor. The way a computer analyses the data is by performing mathematical operations – multiplication, addition, and so on – on these tensors.
Working with large tensors takes a lot of computer operations, and a lot of them are wasted when the data is sparse. Unless otherwise instructed, a computer will go to the trouble of multiplying every zero by every other zero, for instance, rather than realising it is unnecessary.
That’s where the new system – dubbed Taco, for “tensor algebra compiler” – comes in. While methods for efficiently carrying out many specific kinds of operations on sparse data already exist, they must be painstakingly adapted by hand to meet the needs of each individual situation.
“Nobody knew how to generate code for [sparse data] automatically,” according to MIT’s Saman Amarasinghe, the senior author of the paper describing the Taco system.
“The biggest contribution we make is the ability to generate code for any tensor-algebra expression when the matrices are sparse.”
Taco works by looking at the tensors to be operated on, then building a map of only the non-zero entries. It then throws away all the zeros and keeps the interesting information.
In addition to faster processing – up to 100 times faster, according the researchers – Taco offers the benefit of much smaller file sizes. In one test on a giant trove of customer data from Amazon, the original 107 exabytes – about as much data as is transmitted across the entire internet in a month – was reduced by more than a billion times to a mere 13 gigabytes, which would fit easily on most smartphones.