We live in the era of Big Data. The volume and variety of data have far outstripped the capacity of manual analysis and, in some cases, have exceeded the capacity of conventional databases, demanding ever more processing power. At the same time, computers have become far more powerful, networking is ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses. As a result, companies are turning to Data Science and its seemingly unlimited potential.

Machine Learning and Artificial Intelligence

Machine Learning and Artificial Intelligence have become everyday terms in our working lives. A 2020 Deloitte survey found that 67% of companies were already using Machine Learning, and 97% were using it or planning to use it within the next year.

In 1959, Arthur Samuel defined Machine Learning as the subfield of Artificial Intelligence that “gives computers the ability to learn without being explicitly programmed”. Over the last quarter of a century, Machine Learning has become one of the most important parts of the IT revolution impacting our lives.

Although ML dates from the early days of Artificial Intelligence in the late 1950s, it underwent its first resurgence when the concept of data mining began to take off approximately 20 years ago. Data mining algorithms look for patterns in information. Machine Learning does the same thing but goes one step further: the program changes its behaviour based on what it learns.

Only as good as the data they learn from

Machine Learning starts with data — numbers, photos, text, you name it. Every type of data imaginable is collected from a variety of sources and prepared for use as training data – the information the ML model will be trained on. The more diverse the training data is, the better the Machine Learning algorithm will perform.
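A standard part of preparing training data is holding some of it back so the model can later be judged on examples it never saw. As a rough illustration (not from the article), here is a minimal sketch of such a train/test split in Python with NumPy; the dataset here is entirely hypothetical random data used only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 100 samples with 3 features each, plus binary labels.
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Shuffle the sample indices, then hold out 20% as "unseen" test data
# that the model is never trained on.
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

Libraries such as scikit-learn provide ready-made utilities for this, but the idea is the same: performance must be measured on data the model has not seen during training.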

But although Machine Learning algorithms can help a company leverage its data assets for better results and better products, they will only ever be as good as the data they learn from. If that data is not diverse enough, or has not been cleaned and processed, the resulting models are prone to overfitting: learning the detail and noise in the training data to the extent that performance on new data suffers. In simpler terms, when the data is not diverse and its quality is lower than it could be, Machine Learning models will produce extremely good results on the data they were trained on, but they will perform poorly on new, unseen data. Data quality and diversity have therefore become essential pillars of any Data Science activity, especially in cybersecurity, where there is no room for even small mistakes.
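The overfitting effect described above can be demonstrated in a few lines. The sketch below (my own illustration, not from the article, using hypothetical toy data) fits two polynomials to a small, noisy sample of a sine signal: a modest degree-3 model and a degree-9 model with enough parameters to memorise every training point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a small, noisy sample of an underlying sine signal,
# standing in for a narrow, imperfect training set.
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, x_train.shape)

# "Unseen" points drawn from the same underlying process, without noise.
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)

def mse(degree):
    """Fit a polynomial of the given degree; return (train, test) mean squared error."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = mse(3)    # modest capacity
complex_train, complex_test = mse(9)  # can pass through all 10 training points

# The degree-9 fit drives training error to nearly zero by chasing the noise;
# on unseen data it typically generalises worse than the simpler model.
print(f"degree 3: train MSE {simple_train:.4f}, test MSE {simple_test:.4f}")
print(f"degree 9: train MSE {complex_train:.4f}, test MSE {complex_test:.4f}")
```

The near-zero training error of the high-capacity model is exactly the "extremely good results on the training data" the article warns about; only the held-out test error reveals how well it actually learned the underlying signal.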

Written by João Luís Milheiro, Data Scientist Team Leader at Sepio