This is the Data Science Conversations podcast with Damien Deighan and Dr. Philipp Diesinger. We feature cutting-edge data science and AI research from the world’s leading academic minds and industry practitioners, so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the Data Science Recruitment Experts.
Our guest today is Michael Haft, founder of xplain Data based in Munich. By way of background, Michael has many years of experience in both research and industry, spanning business intelligence, data analytics, data mining, and machine learning. He has experience in a diverse range of business sectors with a particular focus on database management of patients in healthcare and energy analytics in the utility sector.
Michael: Hello, Philipp. Thank you very much for the invitation. I’m very excited to be here.
Philipp: Great. So if we just start at the top with the big picture, Michael, could you give us an overview of causal discovery?
Michael: Yes. A very brief overview: What does causal discovery mean? The mission of causal discovery is understanding cause and effect relationships based on observational data. That sounds a little technical, but what it basically means is understanding why something happens. Compared to a predictive model, which predicts what is likely going to happen, causal discovery wants to understand why it happens. That’s very often the much more important part, and it’s very much a part of artificial intelligence: only if you know the cause and effect relationships in a domain are you able to intervene in an intelligent way to arrive at your goals. You need to know what, in a cause and effect sense, drives your goals in order to behave intelligently to achieve them. That’s the mission of causal discovery: just observing a system—no experiments, just observational data—and, based on those observations, understanding cause and effect relationships in that domain.
Philipp: Michael, for our listeners who have a background in statistics or machine learning but not necessarily in causal inference, can you explain a little how you can actually arrive from data at causal relationships? Not just looking at covariance matrices or correlations, but actually pursuing the understanding of the “why” question.
Michael: Right. So it’s certainly a challenge. Typically, cause and effect relationships are proven simply by doing an experiment. Every engineer does experiments: they take a parameter, play with it, and keep all the other parameters constant to see the effect of only the parameter they’re interested in. Experiments are the typical way cause and effect relationships are proven. A randomized controlled trial in medicine proving the causal effect of a drug is also a controlled experiment where environmental conditions are kept constant, at least on a statistical average. But that’s not what we want to do. As you said, Philipp, here it’s about just observational data.
And here comes the problem. If you have just observational data, you’re observing lots of correlations, and most of them are meaningless. Here’s a very simple example of a correlation which might serve for prediction, but from a cause and effect relationship perspective, it’s not relevant. For example, gray hair and glasses are correlated. I can see that with you—you have quite a number of gray hairs already and you’re wearing glasses. Me too, by the way. So if we do a little study and observe all the visitors here around us and do some statistics, we will see that gray hair and glasses are pretty much correlated. From a predictive modeling point of view, that’s perfect.
Michael: So from the fact that someone is wearing glasses, I can already make a prediction that they might have gray hair. In a web shop, for example, if you’re selling glasses and you’re selling products for coloring hair, and you know that some people are wearing glasses, you might offer them a product for coloring their hair. From a predictive modeling point of view, a correlation is sufficient. But from a cause and effect point of view, putting aside your glasses wouldn’t have the effect that your hair won’t turn gray.
From a cause and effect point of view, there is no causal relationship between gray hair and glasses. The reason is simply that there is a so-called confounder sitting behind both, and that’s age. The older people get, the more often they need glasses, and many also get gray hair. That’s why those two things are correlated, even though there is no cause and effect relationship between them. The true cause and effect relationship is that age causes you to need glasses, and age causes gray hair.
So that means if you want to understand cause and effect relationships, you need information about age. For a predictive model, you don’t need the age: it might not be perfect, but from wearing glasses, you might predict that someone has gray hair. For causal discovery, however, you need this so-called confounder, this common cause of both. And that is what makes it more challenging: you need a very comprehensive picture of the object in the focus of the analysis. Because as soon as I know the age, I can detect that the true cause is age and not the glasses, and I can rule out the glasses as a cause of gray hair.
That means for understanding cause and effect relationships, we need a very comprehensive picture of the object in the focus of the analysis—of the visitors of the shop, for example. We need to know their age. And once we have very comprehensive information about patients, about industrial parts, or whatever, and in this comprehensive information we can’t find a confounder explaining a correlation, then this correlation is a very likely candidate for a cause and effect relationship.
For example, only if we have information about age and lots of other information about the visitors, and—taking all these factors into account—there is still a correlation between gray hair and glasses, would we conclude that there is a cause and effect relationship. This explains the challenge: we need very comprehensive information to search for confounders and for other potential causal factors. We need very comprehensive information to understand cause and effect. For predictive modeling, just a few correlated dimensions are sufficient to make a prediction. And that’s one of the big differences.
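To make the confounder effect concrete, here is a minimal simulation sketch in plain NumPy (hypothetical numbers, not xplain Data’s technology): age drives both traits, so they correlate marginally, but within narrow age bands the correlation almost vanishes.

```python
# Minimal confounder sketch on simulated data: age causes both
# "glasses" and "gray hair"; there is no edge between the two traits.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.uniform(20, 80, n)

# Both traits become more likely with age, independently of each other.
p = (age - 20) / 60
glasses = rng.random(n) < p
gray = rng.random(n) < p

print("marginal corr:", np.corrcoef(glasses, gray)[0, 1])  # clearly > 0

# Condition on the confounder: within narrow age bands,
# the correlation (the apparent "effect") almost disappears.
for lo in range(20, 80, 10):
    band = (age >= lo) & (age < lo + 10)
    r = np.corrcoef(glasses[band], gray[band])[0, 1]
    print(f"age {lo}-{lo + 10}: corr ~ {r:.3f}")  # close to 0
```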
Philipp: Yeah, makes sense. So you mentioned predictive modeling a couple of times now. I think if I understood correctly, you are also combining causal inference or causal discovery with AI. Can you talk about that a little bit?
Michael: Basically, causal discovery or causal inference—only recently Gartner coined the term “causal AI”—and the term has some justification. Making a prediction is one thing, but if you start to intervene in a system, if you want to achieve a goal, then your interventions need to stand in a cause and effect relationship to the target. That means behaving intelligently to arrive at your goals requires knowledge about cause and effect. Intelligent behavior requires knowledge about cause and effect.
Therefore, this is one thing I can predict already today: in the coming years, causality will become very important in artificial intelligence, because we want to implement truly intelligent systems. A predictive model, a large language model, is just predicting the next word. But as soon as it comes to planning interventions, a prediction alone isn’t sufficient. You need to know something about cause and effect. And that’s exactly what we do with causal discovery.
Philipp: How does that work on a technical level, like the integration of the causal inference methods that you’re using with AI?
Michael: It comes in two steps. The first step is understanding these cause and effect relationships.
Michael: It’s a kind of post-mortem analytics. It generates the evidence, the knowledge, which is then important to integrate into a business process. So the first step is the causal discovery part: scanning the data, massaging the data, doing this deep search for confounders, and finally arriving at the very likely candidates for cause and effect relationships, indeed in the form of an entire graph.
That’s also the difference between predictive modeling and causal discovery. For a predictive model, you only need the direct causes—ideally, just the direct causes. In causal discovery, we also want the indirect effects, entire causal pathways. And once we have generated that knowledge, we can start to bring it back into an operational process. For example, in industry—we have a couple of projects in industry—we mine the data for causal relationships, and then we bring this knowledge back into the process to optimize it and plan optimal interventions.
Philipp: Data is crucial, as anybody would expect. In your case, it’s data completeness, but also having the right architecture behind the data, choosing complex data objects. Can you elaborate on that a little bit?
Michael: Right. So our causal discovery algorithms are kind of the icing on the cake. I hope it’s intuitive to understand that this is important. But there is a technology sitting behind that, and we call it object analytics. This object analytics technology in itself is an asset. We hope that this is the next paradigm shift in data science and machine learning. And I’m going to tell you why.
As I said, understanding cause and effect requires a very comprehensive picture. Take our super simple example with age: if you don’t have the age in your data, you would draw very wrong conclusions about cause and effect. In general, that means we need comprehensive information. In healthcare, for example, where we started, we need very comprehensive information about patients. What does that mean? You very quickly get to the term “electronic health record.” A typical implementation of an electronic health record has 150 to 200 tables: diagnoses, prescriptions, procedures, lab values, microbiology values, and so on. That’s where you find comprehensive information about patients.
And that means there is a huge gap between what today’s data science or machine learning algorithms can digest and what comprehensive information means. You can’t throw 150 tables at typical machine learning algorithms today. What a machine learning algorithm—or most data science algorithms—needs is a flat table, where a patient is a row in that table with a hundred or maybe a thousand dimensions, each dimension is a column, the final column is the target, and then you can build a predictive model. That’s the typical setting.
But a patient isn’t a flat table. No matter how many dimensions you put into that table, it will never be a comprehensive picture of a patient. A comprehensive picture of a patient spans 150 different tables with billions of events: diagnoses, prescriptions, and so on. That’s what our object analytics technology can handle. We take the data, typically from a complex relational data schema, and turn it into our object analytics representation. It’s called an object analytics database and indeed deserves the name “database”: it’s an analytical database that lets you perform analytical operations efficiently. And the causal discovery algorithms take advantage of that. They use this comprehensive, holistic picture of patients, in that example, to scan the data from thousands or millions of different perspectives.
You don’t need to do some kind of feature engineering, build features, and put them into a flat table. Our causal discovery algorithms are essentially X-raying these objects from a million different perspectives, evaluating millions of potential causal factors and millions of confounders. Only once we have found a feature and, despite this intense search across millions of other factors, cannot explain the correlation via confounders do we have the justification to assume that it might be a direct cause and effect relationship. And this is how those things come together.
You may claim that something is a cause and effect relationship only if you have excluded all confounders. And excluding all confounders means, in our case, a super deep search based on our object analytics technology.
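To illustrate the contrast, here is a sketch of the two representations side by side. The field names and medical codes are purely illustrative and are not xplain Data’s actual object model:

```python
# The flat-table view most ML algorithms expect: one patient per row,
# the history squeezed into a fixed set of hand-engineered columns.
flat_row = {
    "patient_id": 17, "age": 63, "n_diagnoses": 2, "had_diabetes": True,
}

# A holistic patient object: nested sub-objects preserve every event.
patient_object = {
    "patient_id": 17,
    "age": 63,
    "diagnoses": [
        {"code": "E11", "date": "2019-03-02"},   # type 2 diabetes (ICD-10)
        {"code": "I10", "date": "2021-07-15"},   # hypertension (ICD-10)
    ],
    "prescriptions": [
        {"atc": "A10BA02", "date": "2019-03-09"},  # metformin (ATC)
    ],
    "hospital_cases": [
        {"admit": "2022-01-04", "discharge": "2022-01-09",
         "lab_values": [{"name": "HbA1c", "value": 7.8}]},
    ],
}
```

However many columns the flat row gains, it can never recover the event-level structure that the nested object keeps intact.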
[Narrator]: I would like to take a brief moment to tell you about our quarterly industry magazine called The Data Scientist and how you can get a complimentary subscription. My co-host on the podcast, Philipp Diesinger, is a regular contributor, and the magazine is packed full of features from some of the industry’s leading data and AI practitioners. We have articles spanning deep technical topics from across the data science and machine learning spectrum, plus there’s career advice and industry case studies from many of the world’s leading companies. Go to datasciencetalent.co.uk/media to get your complimentary magazine subscription. And now we head back to the conversation.
Philipp: So I think we’ve framed it a little bit. You talked about causal inference versus correlation and so on, so we’re on the statistical side. You’ve been on the journey of putting all of this together into products. Is that correct?
Michael: Yeah, that’s right. xplain Data is a fully self-funded startup company. I don’t know whether we still deserve the name “startup company” because xplain Data has existed for quite a while. Our mission is not a service business but a product business. We generate revenue in terms of license fees, and we offer that object analytics and causal discovery technology.
The object analytics database is one independent, important part of it. Indeed, a number of our customers are using our object analytics technology without the causal discovery algorithms. One example is AOK NordWest. They already run presumably one of the fastest relational databases, SAP HANA, an in-memory database. Nevertheless, they are using our object analytics database because it’s much easier to explore the patient holistically—diagnoses in relation to prescriptions and things like that.
Philipp: AOK being a healthcare provider. It’s a public sickness fund, right?
Michael: They have approximately 3 million patients, and all the prescriptions and diagnoses and that kind of thing—so-called secondary observational data. In total, I guess it might be on the order of three to five billion events stored in this object analytics model for 3 million patients. So that’s the order of magnitude and the comprehensiveness of the data which we have loaded there.
And the object analytics database in itself is an asset because it allows you, on a purely descriptive level, to explore patient journeys: how the journey goes from one diagnosis to the next one, what kind of medications happen in between—no predictive modeling, no causal discovery, just being able to explore the patient journey in such a complex relational data schema. That is an asset which the object analytics database brings to the table.
And then, on top of it, the icing on the cake is the causal discovery algorithms, which use this holistic picture for a deep search for causal factors and, simultaneously, a deep search for confounders. The justification to claim that some effects we discover are cause and effect relationships mainly comes from that deep search, because a deep search means we haven’t overlooked something. That is something SAP HANA cannot deliver to them, but xplain Data can.
Causal discovery is such a use case, obviously. With any relational database and typical machine learning algorithms, the first step is to boil down these multiple tables into a flat table with manually defined features, and you very often overlook features. That is hard to avoid with any relational database, and the object analytics in itself addresses it.
Certainly, causal discovery is something which we want to offer to the data science community because it’s an important element of artificial intelligence. But indeed, we have a vision beyond that. We hope that soon the whole data science community is no longer operating on flat tables but on holistic objects. A row in a table is replaced by a whole object, and future data science algorithms are going to operate on holistic objects, not on flat tables.
Philipp: I want to come back to the use case question from a business perspective, or from a business question perspective. What are typical questions that causal methods can deliver, or where can they outperform traditional analytics?
Michael: Yeah, let’s jump into a few examples. Here’s a nice and very easy-to-understand one: a project which we had recently in industry, understanding causes for failures at the end of a long manufacturing process.
Michael: In this case, it was the manufacturing of cylinder heads. The cylinder head goes through a number of steps. The raw part comes into the manufacturing line, then holes are drilled into it, parts are mounted onto it, and then it goes into the washing machine. From the washing machine, it goes into the leakage test. And quite a number of those cylinder heads—close to 10%—failed the final leakage test. The question is: what are the causes along the entire manufacturing line which finally result in a leaky cylinder head?
You can imagine that along the entire manufacturing process, each machine today is recording data: all the parameters, the forces with which holes are drilled, the temperatures, and so on. In each of those steps, data are emerging. That means at the end, you have a number of tables which describe the entire manufacturing process. It’s not 150 tables as in the patient case, but 20 to 30 tables along roughly 30 manufacturing steps. And then you need to understand the true causes of what, at the end, made the cylinder head leaky.
An obvious correlation was [unclear]. And indeed, the true cause and effect relationship was that the test device for the leakage test delivered false negative results if the part temperature was low. So if the cylinder head temperature was low, the leakage test didn’t operate properly and delivered false negative results. This was particularly annoying because they had been throwing away lots of cylinder heads for no good reason.
The problem, however, was that they couldn’t exchange this leakage test device because it was expensive. But the causal discovery algorithms, as I said, don’t just find the direct causes—they detect entire causal pathways. One of the pathways showed that an indirect factor, which in turn led to low part temperatures, was a long waiting time between the washing machine and the leakage test. In the washing machine, the parts are cleaned of all particles with hot water, so the parts come out hot. And if there’s a long waiting time between the washing machine and the leakage test, the parts cool down, and then the leakage test delivers false negative results.
That was a causal pathway identified by the causal discovery algorithms, and it finally provided the intelligence for an intelligent intervention. What they did then—because they couldn’t fix the leakage test device—was simply program the washing machine such that once parts are queued on the conveyor belt towards the leakage test, the washing machine stops washing. Only if the path was [unclear] does the washing machine start washing again, so the parts go into the leakage test at the defined temperature, and then they don’t get false negative results.
So you see, planning intelligent interventions requires knowledge about cause and effect—and not just the direct causes, but the entire causal pathways. In that case, the causal pathway ran from the washing machine via the long waiting time and the cooling down to, finally, the false negative results. That’s a very nice and intuitive-to-understand project in industry.
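As a side note, the pathway described here can be written down as a tiny directed graph. A minimal sketch (illustrative node names, not the output of any real discovery run) that enumerates it:

```python
# The causal pathway from the cylinder-head example,
# encoded as adjacency lists of a small DAG.
edges = {
    "long_wait_after_washing": ["low_part_temperature"],
    "low_part_temperature": ["false_negative_leak_test"],
    "false_negative_leak_test": ["part_scrapped"],
}

def pathways(node, path=()):
    """Enumerate all causal pathways starting at `node`."""
    path = path + (node,)
    children = edges.get(node, [])
    if not children:        # leaf node: a complete pathway
        yield path
    for child in children:
        yield from pathways(child, path)

for p in pathways("long_wait_after_washing"):
    print(" -> ".join(p))
# prints (one line): long_wait_after_washing -> low_part_temperature
#                    -> false_negative_leak_test -> part_scrapped
```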
Other projects are in manufacturing of printed circuit boards, where we offer a pre-configured solution for printed circuit board assembly lines. So those are two examples from industry.
In healthcare, there is the PhysioNet community with the MIMIC data—hospital data covering around 500,000 hospital cases for 300,000 patients—and there we’re currently analyzing causes for so-called pressure injuries. You might know what a pressure injury is: typically, older patients who lie on their back for very long periods get wounds, so-called pressure ulcers or pressure injury wounds, on their back or their legs or their heels.
Michael: We’re looking particularly carefully at intelligent interventions to avoid that. We’re going to publish a few results here soon, and the results in that project are indeed very exciting, because they have the opportunity to improve patient care on the one hand and save a lot of money on the other.
Philipp: Makes a lot of sense. Very impressive use cases also. If we think back a little bit, when did you personally switch from this predictive mindset, regression mindset, to a causal mindset? And what triggered that in your journey?
Michael: Yeah, that’s an interesting question. I don’t really know what triggered me to go in this direction. I believe it’s simply curiosity. It’s a certain type of mindset which always wants to know why, asking the question “why” again and again. Why do things happen? And whenever you ask the question “why,” you soon realize that predictive models and all the stuff which exists today are not sufficient. We need more than that. We need to understand cause and effect based on observational data.
Most of the data which we have today—I think 99% of the data which we have today—are observational data. Any data in a CRM system, in a shop, or in industry or healthcare—all the data which are collected in a non-controlled way are implicitly observational data. So 99% of all data are observational. Only a small part is collected under controlled experimental conditions. A randomized controlled trial is a super expensive thing. If we can draw intelligence and understand cause and effect without running super expensive randomized controlled trials, but simply by observing a system, that’s definitely a big step forward. This curiosity and the opportunity to do that were my personal motivation to go in that direction.
Philipp: You explained a lot about causal methods, causal discovery, and so on. Are there also areas where you say traditional methods still have a place and would work better, or maybe would work equally well but consume fewer resources?
Michael: Predictive modeling definitely has its justification. We can’t replace all the predictive modeling. Certainly, a causal model—a directed acyclic graph—implicitly comes with a probabilistic prediction, so it also serves as a predictive model, but that’s not the primary focus.
Predictive models definitely have their justification, and we can’t replace that. In particular, in many domains—I said that causal discovery requires very comprehensive information—this comprehensive information very often isn’t available, which means that causal discovery becomes difficult. At least you need to be very careful in interpreting the results. That’s different with predictive modeling. Predictive modeling, from a certain point of view, is much easier to do. It requires less comprehensive data, less data overall. Any information correlated with the target is good enough. So that means we will not replace predictive modeling. These two things need to come together. Predictive modeling very often is the first step. But then the next step is understanding the causes behind the predictions and the causes behind the dependencies.
Philipp: Yeah, that makes sense. You talked earlier about the AOK use case and other healthcare use cases, and also the industry use case. From a client perspective, in your experience, what does it take from a company to become causally mature—to embrace these methods? What roles need to be involved, like data science, product engineering, research, leadership maybe also? What can companies do to make progress on that?
Michael: Yeah, the first thing certainly is that there needs to be a little more education. That’s why I’m very happy to be here. It’s not trivial to understand the difference between a predictive model and causal discovery. Certainly, if I explain it here in this podcast, I hope I can make it very plausible and explain why this is important. But this is a kind of journey. The whole data science world is so focused on predictive modeling that for many data scientists, it still sounds strange that this is not sufficient. The difference between predictive modeling and causal discovery is still not really perceived everywhere. Generating this perception is a way to go, and there is certainly a little education necessary for that: more education in that domain and more visibility for the importance of causal discovery—that predicting “what” is only the first step, and understanding “why” needs to be the second step. That is one thing.
We certainly still struggle on this maturity journey. Another important element is the availability of data. As I explained, we need very comprehensive data. Recording all this data and making it available is also very often still a way to go in typical companies.
Philipp: Makes sense. Who are the stakeholders that you would engage with typically in a company? Is it the data science department or data analytics, or is it maybe also the business owner?
Michael: It’s both, but typically it starts with the business owners because the business owners have the problem. In a manufacturing line, the one who is responsible for quality has the responsibility to have high-quality products or a low failure rate. So typically it starts with the business user because he is the one who has the problem. He has to deliver high-quality products with low failure rates. So typically it starts with the business side.
But I hope sometimes it also starts with the data science department. Data science skills are increasingly available in many companies, but there’s also still a way to go. I hope that sooner or later we have more perception in the data science community. Maybe I’m going to talk in a few minutes about our community edition, which we are about to release. Our goal is to penetrate the data science community so that these new technologies are not just driven forward by the business side but also by the data science community and data science departments.
If you ask a typical data scientist, I think only 50% of them have heard about causal AI and causal discovery, and only 20% of them have in-depth experience and know what it indeed is.
Michael: So there is still a way to go on the data science side to generate awareness of that topic and the understanding of what the difference is and why it’s important. This is something we’re trying to push forward, in particular by means of our community edition, which we are about to release.
Philipp: You mentioned a couple of times that you’re advocating for this shift away from predictive models, the regression model mindset, towards other methods including causal inference, and that it could have an impact on current-gen AI methods. In your mind, what would be the potential that causal discovery has for causal AI—this term from Gartner that you brought up—in the next three, four, five years?
Michael: I hope that in the next three, four, five years, causal AI is going to become one of the super important topics. If you ask ChatGPT “what should I do so as not to get gray hair?” you’re going to get some answers. But it’s more or less just rephrasing what it has read on the web. It’s not really understanding cause and effect. But I imagine a future version of artificial intelligence which draws on all the observational data—in particular in healthcare. In healthcare in Germany, we have electronic health record data for 60 million patients. Germany is a difficult place because of the data protection regulations.
[unclear] of 50 million patients—implicitly asking those data, or asking an algorithm which has access to the data, about cause and effect relationships. What should I do not to get gray hair? That’s one thing, but also: what are drivers for breast cancer? What are drivers for pressure ulcers? What should I do with this patient so as to avoid pressure ulcers? And you get a response in terms of cause and effect. That is the vision of how causal knowledge ultimately drives our artificial intelligence capabilities, in particular in healthcare: to treat patients effectively, to have early alerts if patients are at risk of certain diseases, and to know what to do to avoid these risks. And certainly also to discover new cause and effect relationships.
I mean, in Germany, there exist approximately 10,000 different pharmaceutical products. For many of them, we don’t know the causal side effects, and for many of them we don’t know whether they might have beneficial effects for other diseases as well. Discovering all this, and not just discovering this empirical knowledge but bringing it back into our healthcare processes—that’s the vision behind causal AI. And I very much hope that in the next years, this topic of causal AI is going to become very big.
Philipp: When you engage with people on causal AI, on causal inference methods, when you explain what xplain Data is doing and so on, what are assumptions or maybe claims in people’s minds that you disagree with? Anything that you typically have to correct when you introduce these methods?
Michael: Difficult to answer. What are the obstacles? It typically takes two or three iterations. Many of the data scientists in a company already have some predictive models running, and getting them away from that mindset—explaining why a predictive model isn’t sufficient and why they need more than that—typically takes two or three iterations before they really understand the difference between a predictive model and causal discovery. Typically, you understand it only once you start using the technology. That’s why we want to offer a community edition to the data science community, because only then do you see the difference.
On the one hand, there is the difference between working on a flat table and working with holistic objects—that we need such holistic objects. That’s one thing. The other thing is that a predictive model doesn’t explain whole causal pathways. In the cylinder head example, the important step was understanding the whole causal pathway from the washing machine to a false negative result in the leakage test. That is also something predictive models wouldn’t deliver. You’re going to understand the meaning of that only once you do it. So it typically takes two or three iterations until a data scientist who is deeply into predictive modeling sees the additional benefit and how to work with holistic objects.
We have some cases where people were very skeptical at the very beginning, started to work with it, and now don’t want to give it away anymore. There’s a project at [unclear] where the data scientist probably wouldn’t give it away again, because once you’re working with holistic objects, you don’t need to flatten your world into a flat table—you can work with the real-world data as they are. That is something you won’t want to miss once you start using it.
Philipp: Sounds very promising. So if listeners are intrigued now about these methods, do you have a good paper or read that they could start engaging with this topic? Anything that comes to your mind?
Michael: There is one thing to read very soon: we are publishing a paper on the pressure ulcer case at ISPOR, probably one of the biggest healthcare conferences. It’s going to be published there very soon. There is certainly some educational material on our website, and we are, as I said, going to release the community edition. The community edition certainly comes with a lot of documentation. There’s a “getting started” area on our webpage which explains all those things on a very intuitive level, and there are explainer videos available.
Normally, certainly, you need to pay license fees for our technology—as a startup company, we cannot afford to give everything away for free—but we are going to release a community edition. It has a web-based interface: a browser-based user interface where you can start exploring objects statistically in your browser. And there is also a Python interface, so anything you do there, you can do via Python code.
Michael: So what we offer to the MIMIC community is a ready-to-go object analytics model for these MIMIC data, for those hospital data. You don’t need to do all the data loading and all the building of these objects. The entry hurdle is very low: just start your browser and start using the MIMIC data. That’s where we start to offer this to the data science community—in that case, to the healthcare, MIMIC, or PhysioNet community. We will see the results and the feedback we get, and we’ll certainly need to improve things here and there.
Then comes the second step. Defining an object model requires some skills—how to map a relational data schema onto an object analytics model. That can sometimes be a little difficult. But in healthcare, the object model is more or less always the same. The root object is the patient, and the patient has sub-objects such as prescriptions, diagnoses, and hospital cases. A hospital case, in turn, has multiple sub-objects: lab values, procedures in the hospital. The structure in healthcare is typically very comparable all around the world. So we can pre-configure an object analytics model for the healthcare community. That’s the second thing in the community edition: we will release a pre-configured object analytics model into which other companies can load their patient data, so they don’t need to think about the structure of the object analytics model.
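As a rough illustration, such a pre-configured healthcare object model could be sketched with nested types along these lines (illustrative field names, not xplain Data’s actual schema or API):

```python
# Sketch of the recurring healthcare object model: the patient as root
# object, with event lists and nested hospital cases as sub-objects.
from dataclasses import dataclass, field

@dataclass
class LabValue:
    name: str
    value: float

@dataclass
class HospitalCase:
    admit_date: str
    lab_values: list[LabValue] = field(default_factory=list)
    procedures: list[str] = field(default_factory=list)

@dataclass
class Patient:                          # root object
    patient_id: int
    diagnoses: list[str] = field(default_factory=list)      # ICD codes
    prescriptions: list[str] = field(default_factory=list)  # ATC codes
    hospital_cases: list[HospitalCase] = field(default_factory=list)

# Loading data then means populating these objects, not flattening them:
p = Patient(patient_id=17, diagnoses=["E11"],
            hospital_cases=[HospitalCase("2022-01-04",
                                         [LabValue("HbA1c", 7.8)])])
```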
And then, towards the end of the year, hopefully we release the full community edition. On a non-commercial basis, for evaluation purposes, you can use it. You can load your data into this object analytics environment. Universities can use it for free, but certainly if you start using it on a commercial level, you need to buy a license.
Philipp: That makes sense. That’s fair. And the website that you mentioned is xplaindata.com. Is that correct?
Michael: Yes. “Explain” being spelled with an X in the beginning. It has a lot to do with explainability. There is a data science branch which is named explainable data science. And those two things come very close together. That’s also the root of the name: explaining data in the most intuitive and simple way.
Philipp: Sounds very good. And what’s the timeline for this launch of your community edition?
Michael: I hope we can launch the community edition for the MIMIC data within the next two months, specifically for the PhysioNet or MIMIC community. There are approximately 10,000 to 30,000 users of these MIMIC data worldwide, so this is the first target group. We hope that we can release it to the MIMIC community within the next two months; we certainly need to discuss with the PhysioNet community first how we can release it. Then, if that was successful, I hope that towards the middle or second half of the year, we can release the full community edition to the whole data science community.
Philipp: So causal discovery and causal inference, Michael—are those two terms completely interchangeable, or is there a subtle distinction we should define?
Michael: They are very often used interchangeably, but from my point of view, there is a little difference. What you typically want to discover is a causal graph, a DAG—a directed acyclic graph—which shows you the causal structure. Discovering that graph is the causal discovery part. Causal inference then means, based on such a graph and the discovered cause and effect relationships, drawing conclusions about, for example, what the effect would be if I made some intervention. Computing the effects of interventions based on a given causal structure—that is what I understand as causal inference. So those two things, in that sense, are a little bit different, but very often they are used interchangeably.
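To make the distinction concrete, here is a minimal sketch on simulated data (standard pandas/NumPy, hypothetical numbers, not xplain Data’s API). Given the graph from earlier in the conversation, age causing both glasses and gray hair, the observational conditional differs from the interventional quantity obtained by back-door adjustment for age:

```python
# Causal inference given a known graph: back-door adjustment for the
# confounder "age" in the graph age -> glasses, age -> gray.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000
age = rng.choice(["young", "old"], n)
p = np.where(age == "old", 0.7, 0.2)      # age drives both traits
df = pd.DataFrame({
    "age": age,
    "glasses": rng.random(n) < p,
    "gray": rng.random(n) < p,            # no direct glasses -> gray edge
})

# Observational: P(gray | glasses=1), inflated by the confounder.
obs = df.loc[df["glasses"], "gray"].mean()

# Interventional: P(gray | do(glasses=1)) = sum_a P(gray|glasses=1,a) P(a)
adj = sum(
    df.loc[(df["age"] == a) & df["glasses"], "gray"].mean()
    * (df["age"] == a).mean()
    for a in ["young", "old"]
)
print(f"P(gray | glasses=1)     ~ {obs:.2f}")  # ~ 0.59
print(f"P(gray | do(glasses=1)) ~ {adj:.2f}")  # ~ 0.45, i.e. just P(gray)
```

The adjusted value collapses to the marginal P(gray), exactly what a graph without a glasses-to-gray edge implies: intervening on glasses has no effect.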
Philipp: Many real-world enterprise data sets come with significant data quality challenges. Is it necessary, for causal discovery or for using object analytics, that your data is very clean and of very high quality? Or how do you deal with that situation?
Michael: Yeah, definitely, the higher the data quality, the better. But that doesn’t mean that you can’t start if the data quality isn’t perfect. What we hear very often is: “We need to improve our data quality before we start doing this kind of analytics.” And I wouldn’t agree with that, simply because you very often see the quality problems in your data only once you analyze sub-objects in relation to each other. A single table might be okay, but if you analyze one table in relation to another—the prescriptions in relation to diagnoses in healthcare, for example—you see that those don’t fit together. So it’s also a perfect means to understand your data quality and where you need to improve. We definitely would not wait to load the data and start working with it until you think you have perfect data quality. It’s an iterative process: you should start loading the data as soon as possible to get that transparency and see the weaknesses, and then improve the data quality in iterative steps.
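For instance, the kind of cross-table consistency check alluded to here, prescriptions in relation to diagnoses, might look like this in plain pandas (hypothetical file and column names):

```python
# Sketch of a cross-table quality check that only becomes visible when
# sub-objects are viewed in relation to each other.
import pandas as pd

diagnoses = pd.read_csv("diagnoses.csv")          # patient_id, code, date
prescriptions = pd.read_csv("prescriptions.csv")  # patient_id, atc, date

# Prescriptions for patients with no diagnosis record at all are a
# red flag worth fixing iteratively, not a reason to delay loading data.
orphaned = ~prescriptions["patient_id"].isin(diagnoses["patient_id"])
print(f"{orphaned.mean():.1%} of prescriptions lack any diagnosis record")
```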
Philipp: In terms of the future development of the field, what do you see as the role of causal AI in developing more explainable AI systems?
Michael: As I said, explainability and causality are two things which are very close together. If you want to explain something, you certainly want to explain it in terms of cause and effect. So that’s why those two things definitely come very close together. As I said, I believe that very soon—and I can see this already today—people more and more don’t just want to know what is going to happen, but why. If you have a black-box prediction, it just throws a prediction at you: “This is your probability that your hair will become gray.” Then your next question typically is “why?” And this question of “why” is going to be asked more and more, in particular as soon as you want to plan intelligent interventions. Then you need to know the why.
That’s what I believe: in future developments in data science and artificial intelligence, the question about the why will become super important. That is one of my predictions, not just because xplain Data is active in that space—I really believe that this is the next thing which is going to be very important. I don’t want to make a lot of advertising here for xplain Data, but I think that we have an important key to that. And it isn’t sufficient to operate on flat tables. We want to offer the means so that this question about the why can be answered at all, because it’s important. That is part of my [unclear]: open interfaces for our object analytics database. I hope that these open interfaces are used to develop totally different algorithms, not just our causal discovery algorithms, which operate on real-world data. And real-world data means it’s not a flat table.
Those are the two predictions which I have for the future of data science. The question of “why” is going to become very important. And along with that, operating on holistic objects rather than flat tables is going to become very important.
Philipp: A kind of causal layer on top of today’s systems—is that something you see?
Michael: Yes, definitely. That’s definitely something I see. There will definitely be a causal layer. And I hope it’s going to develop.
Michael: And at the end, I hope that it’s nicely integrated into what we have today. So today, you can ask ChatGPT for advice for whatever. But I imagine the next version of such an intelligent system where it’s not just rephrasing text from the web, which is what a large language model does. It’s indeed drawing conclusions on cause and effect and answering your questions like “What should I do so as not to get gray hair?” or “What should I do to improve my health status?” And it’s giving you answers, not just for ordinary people but also in medicine.
I imagine that the future of medicine is that each doctor has the empirical knowledge of hundreds of millions of patients at hand in terms of a causal machine which is using this empirical knowledge to support the doctor. This kind of causal AI is certainly a layer which is missing today, and which I hope will develop in the next two to five years. It will take a little time, but particularly in healthcare, I’m absolutely sure that this is a key to much more intelligent and particularly much more cost-efficient healthcare systems.
Philipp: So that concludes today’s episode. Michael, thank you so much for joining us today. It was an absolute pleasure talking to you.
Michael: Thank you, Philipp, and thank you, Damien.
[Narrator]: And thank you also to my co-host, Philipp, and of course, to you for listening. Before we leave you, just a quick mention: we will feature a repurposed version of this in a future Data and AI magazine. That’s our enterprise publication. You can subscribe to the magazine for free at datasciencetalent.co.uk/media, and we look forward to having you with us on the next show.