A Conversation About Data Science
This story touches on a mix of topics like the divide between data science and data engineering, model complexity in machine learning, the transition from academia to data science, and hybrid AI-human solutions.
In a recent interview for the Data for Future podcast, we chatted about a range of topics related to all things data science. So I decided to write this post to cover the main takeaway messages of our conversation.
Should your company hire data scientists or data engineers?
Data scientists can be even more valuable if they have data engineers by their side.
Companies often expect data scientists to do magic with their data. But the company may not have good data architectures and robust ETL pipelines in place, resulting in overall poor data quality, or little practical ability to iterate on machine learning models. In such circumstances, data scientists can’t shine, which can lead companies to believe that data science is not worth investing in. If your company is in this situation, maybe what it needs is data engineers. They could improve these processes, allowing the data to be better “scienced out” at a later stage by data scientists.
By no means am I implying that data scientists are less necessary than engineers. On the contrary, data scientists can be even more valuable if they have data engineers by their side. The engineer will ensure that the data pipelines are scalable, computationally efficient, and robust enough for a production environment. And meanwhile the scientist will have the freedom to iterate models and experiment with new approaches to extract more value from the data.
In the end, the line between data science and data engineering is very thin, and it’s possible that both titles will merge in the near future (together with that other hybrid called “machine learning engineer”). But it is the right balance of the two that creates the perfect cocktail for a data-driven company.
The good and bad of working with ugly data
Data science is much more than fancy deep learning.
“I thought data science was all about training neural networks, but in practice…” I have heard many starting data scientists make comments like this. This happens because we hear much more often about the fancy, cutting-edge applications of data science, and not so much about its less exciting aspects. But in reality, data science is much more than fancy deep learning. And actually, those other “boring” things often occupy the majority of a data scientist’s time.
This is particularly true in sectors like the one I’ve been working in for the past couple of years: the energy sector. From a data scientist’s perspective, this sector is a big mess (at least in my country). Not only are the electricity and gas installations old, but so is the system in place, which was largely monopolised by a few powerful actors until very recently. Although things are rapidly improving (which is good for data scientists and also for the environment), at the moment all these ingredients frequently result in data chaos. As a result, you may not get many chances to try that amazing deep learning model you read about in a blog.
But working with messy data (in the energy sector or elsewhere) also has a positive side. Learning how to extract value from poor data is a powerful skill, often neglected in data science courses and blogs. At the end of the day, many companies, startups, NGOs and government institutions may still not have the infrastructure or the need to enter the realm of big data.
Simplicity over complexity
Make your model as simple as possible, but no simpler.
A simple machine learning model that performs well is often preferable to a complex model that performs just a little better. Let me give a couple of reasons why. Firstly, simpler models tend to take less time (and computational power) to develop and train, which allows for more flexibility to adapt or tune them if necessary. Secondly, simpler models tend to be easier to interpret. Model interpretability not only allows you to discover unexpected misbehaviour and biases of your model, but it can also reveal valuable insights from the data. Therefore, you should always try to make your model as simple as possible, but no simpler.
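One way to turn this rule of thumb into something concrete is to pick the simplest candidate whose validation score is within a small tolerance of the best one. The sketch below is only an illustration of that idea: the `Candidate` class, the `complexity` proxy, the `tolerance` value, and all the scores are hypothetical placeholders, not results from a real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    complexity: int  # hypothetical proxy, e.g. number of parameters
    score: float     # validation score, higher is better


def pick_simplest_good_enough(candidates, tolerance=0.01):
    """Return the least complex candidate within `tolerance` of the best score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best_score = max(c.score for c in candidates)
    good_enough = [c for c in candidates if c.score >= best_score - tolerance]
    # Among the good-enough models, prefer low complexity, then high score.
    return min(good_enough, key=lambda c: (c.complexity, -c.score))


# Made-up numbers, for illustration only.
candidates = [
    Candidate("linear regression", complexity=10, score=0.86),
    Candidate("gradient boosting", complexity=5_000, score=0.87),
    Candidate("deep neural network", complexity=1_000_000, score=0.875),
]

chosen = pick_simplest_good_enough(candidates, tolerance=0.02)
print(chosen.name)
```

The tolerance encodes how much performance you are willing to trade for simplicity. Setting it to zero always returns the top scorer, which is the "no simpler" end of the rule.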
What makes a company data driven?
Taking decisions based on data is what makes a company truly data driven.
Every business decision is based on a prediction. For example, you predict that sales will increase if you offer a certain product. Or you predict that churn will decrease if you reduce the price of a certain service. To make those predictions, you could simply rely on your intuition. However, if you happen to have some relevant data, you can also back up those predictions with an analysis, which can increase the accuracy of those predictions, and therefore the chances of success.
At the end of the day, a data-driven company is not necessarily one that uses artificial intelligence or deep learning. Instead, taking decisions based on data is what makes a company truly data driven.
Some advice to academics becoming data scientists
Instead of aiming for the optimal result, try to get baby results early on.
If I could travel back in time, there are plenty of tips I’d give my younger self. And while it would be hard to decide which ones would be the most important, I think the following two would definitely be close to the top:
First, learn how to write good code. In academia, you may not need to write good code. After all, your code may only be used by you, and possibly executed just once (say, to generate the results of a specific paper). But if you end up switching to industry, being able to write good code gives you an excellent head start.
Second, try different methodologies and workflows to make your work more efficient. For example, imagine that you want to publish a paper (which is not hard to imagine if you are an academic!). You could write all the tasks required to achieve that goal on post-it notes and stick them on a Kanban board. And if you get stuck on a problem, don’t brute force the solution out of your head. Instead, split the problem into small steps and tackle the quick wins first. Instead of aiming for the optimal result, try to get baby results early on. That will generate a constant sense of accomplishment, and possibly ease the way to solving that complicated problem. The earlier you start working with this kind of mindset, the easier your transition to industry will be.
Hybrid AI & human solutions
Let the AI reduce tedious human workload, while letting humans take the harder decisions.
Finally, we talked about my work as a technical mentor at Data Science for Social Good. And I mentioned one of the main lessons I learned there: Sometimes the best solution is neither fully human, nor fully AI-driven, but a hybrid.
As an example, I mentioned one of the projects, which consisted of classifying medical papers into different categories. Up until then, Cochrane volunteers were doing that work manually, which was very time consuming. We could have proposed a fully automated solution using machine learning. However, this would have led to a certain number of errors, which would have had a negative impact on the workload of those volunteers. Potentially, it would also have had consequences at a much larger scale: relevant medical research could be neglected.
So we proposed a hybrid AI & human solution: Papers that were definitely relevant to a category were automatically classified. Papers that were definitely irrelevant were automatically discarded. And all papers in between, for which the machine learning model was uncertain, were passed on to humans to decide.
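The routing logic described above can be sketched with two thresholds on the model's predicted probability. This is a minimal illustration, not the code used in the actual project: the function name `triage`, the threshold values `low` and `high`, and the example scores are all assumptions made up for this post.

```python
def triage(probability, low=0.1, high=0.9):
    """Route one paper based on the model's probability that it is relevant.

    `low` and `high` are hypothetical thresholds; in practice they would be
    tuned to keep the error rate of the automated decisions acceptably small.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high:
        return "auto_accept"    # model is confident the paper is relevant
    if probability <= low:
        return "auto_discard"   # model is confident the paper is irrelevant
    return "human_review"       # model is uncertain, so a person decides


# Made-up scores for three hypothetical papers.
scores = {"paper_a": 0.97, "paper_b": 0.03, "paper_c": 0.55}
decisions = {paper: triage(p) for paper, p in scores.items()}
print(decisions)
```

Moving the two thresholds closer together sends fewer papers to humans but lets more automated errors through. Moving them apart does the opposite.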
When the potential impact of errors is large, hybrid solutions may be safer than fully automated ones. Let the AI reduce tedious human workload, while letting humans take the harder decisions.
Originally published at https://pablorosado.com on April 7, 2021.