
Episode 28

Enterprise Data Architecture in The Age of AI – How To Balance Flexibility, Control and Business Value

Nikhil Srinidhi


Description

In this episode, we had the privilege of speaking with Nikhil Srinidhi from Rewire.

Nikhil helps large organizations tackle complex business challenges by building high-performing teams focused on data, AI, and technology. With practical experience in data and software engineering, he drives impactful and lasting change. Before joining Rewire in 2024, Nikhil spent over six years at McKinsey and QuantumBlack, where he led holistic data and AI initiatives, particularly for clients in life sciences and healthcare. Earlier in his career, he worked as a data engineer in Canada, specializing in financial services. Nikhil holds a degree in Electrical Engineering and Economics from McGill University in Montreal, Canada.

Show Notes
Resources

Episode Summary 

  • Data Architecture Fundamentals – Nikhil defines data architecture as essential for organizational alignment, comparing it to building blueprints. He distinguishes between the static aspects (technology components) and dynamic aspects (data flows), emphasizing how architecture becomes critical as organizations scale beyond small teams.

  • Pragmatic Implementation Approach – Good data architecture should be modular, flexible, and designed with human usability in mind. Nikhil cautions against over-abstraction, advocating for pragmatic approaches where organizations commit to specific technologies when appropriate while maintaining flexibility elsewhere.
  • Market Evolution Impact – The data technology market has evolved from end-to-end suites to specialized tools for specific functions, creating new architectural challenges. This has forced enterprises to make strategic decisions about where to place data versus capabilities, increasing the importance of thoughtful architecture.
  • Life Sciences Applications – In life sciences, data architecture helps integrate diverse data types (genomic, clinical trials, real-world evidence) while maintaining compliance with regulatory requirements. Computational governance for sensitive healthcare data presents unique challenges, especially regarding permissible data usage.
  • ROI Measurement Challenges – Measuring the ROI of data architecture investments remains difficult as architecture is fundamentally enabling rather than directly value-generating. Nikhil suggests using proxy metrics and comparing centralized versus decentralized implementation costs to demonstrate value.

Nikhil Srinidhi LinkedIn – https://www.linkedin.com/in/nikhilsrinidhi/

Rewire – https://rewirenow.com/en/

The Data & AI Magazine – Media – Data Science Talent


Transcript

Enterprise Data Architecture in The Age of AI – How To Balance Flexibility, Control, and Business Value

Data Science Conversations Podcast with Damian Deighan and Dr. Philipp Diesinger

This podcast features cutting-edge data science and AI research from the world’s leading academic minds and industry practitioners so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, the Data Science Recruitment Experts.

Speaker 1 (Damian): Welcome to the Data Science Conversations podcast. My name is Damian Deighan and I’m here with my co-host Philipp Diesinger. How’s it going Philipp?

Speaker 2 (Philipp): Good, thanks Damian. So today we’re talking to Nikhil Srinidhi and by way of background, Nikhil helps large organizations tackle complex business challenges by building high-performing teams focused on data, AI and technology. With practical experience in data and software engineering, he drives impactful and lasting change. Career-wise, before he joined his current employer Rewire as a partner in 2024, Nikhil spent over six years at McKinsey and QuantumBlack, where he led holistic data and AI initiatives, particularly in the life sciences and healthcare sector. Early in his career, he worked as a data engineer in Canada, specializing in financial services. And Nikhil also holds a degree in electrical engineering and economics from McGill University in Montreal, Canada. Welcome to the show, Nikhil, it’s a delight to have you here.

Speaker 3 (Nikhil): Hi, Damian, thank you for having me.

Speaker 1 (Damian): To kick us off, Nikhil, maybe just tell us how you went from your engineering degree almost straight into a career as a data engineer.

Speaker 3 (Nikhil): I realized while studying electrical and electronics engineering that something we were always focused on was effectively the signal-to-noise ratio. We did that in hardware, we did that in software. And the more I thought about it, I realized that’s exactly what companies and businesses are also trying to do. There’s a lot of noise out there. And finding the signal essentially means being able to use data, information, and knowledge in the right way. So that got me into my first role as a data engineer, where I helped large financial services organizations make sense of their information, bring it together, see it at the same level, harmonize it, and recognize how exactly it can be used to drive business value. The data engineering aspect really fascinated me: finding the signal in that noise.

Speaker 2 (Philipp): Nikhil, we wanted to discuss some data architecture topics today. Maybe we can start with the big picture, at a very high level. How would you define data architecture, and why is it important for business leaders to care about it?

Speaker 3 (Nikhil): I think I will have to take a little bit more of a philosophical entry into this topic. If we think about a small company, if we have just a small team and we are building a product, we’re using data, we don’t need a data architecture per se because we are working together with each other all the time. So our daily conversations, that is the data architecture, how we start developing the product, our agreed upon design, that is in effect the data architecture. The problem arises when you think about scale. So once you have an organization, multiple teams, you need something that contains that kind of agreement, those kinds of design patterns, the principles that you use to make trade-offs between one technology or the other, one data integration pattern or the other, because you cannot have 100 meetings a week explaining it to the whole organization.

Speaker 3 (Nikhil): So that’s when the whole idea of data architecture came in, software architecture came in, enterprise architecture came in, and I think it is super important for an organization because if you do not get this right, everyone is effectively trying to build something with different blueprints. Imagine a building that’s being built with each person having a different schematic of the building. First and foremost, the building may not have the same artistic output that you are looking for. It may also have real structural and safety issues. So really connecting it to the origin of the word “architecture” as well, I think it is really a way of making sure everyone is working towards the same end goal.

And when it comes to data, it is about how do you work with technology? How do you work with different types of data? How do you process it? How do you deal with structural issues, quality issues in a way that it really moves the needle forward in getting towards that building? And so I would argue that architecture is, in fact, key to getting data or AI at scale done correctly. And so that’s why I think organizations have been investing a lot into it. At the same time, I also see that organizations have not gotten all they’d like out of it. So there’s still a return on investment question around architecture: is it really delivering what they’d like? And so I think that’s a great discussion in general to have: the merits and also the demerits of architecture and how it applies.

Speaker 2 (Philipp): That makes a lot of sense. Maybe as a follow up, you already mentioned that in the last couple of years, things have been changing and these topics are getting more and more visibility. What is something that has changed in the last couple of years that made data architecture more of a boardroom topic and what role does AI play in this?

Speaker 3 (Nikhil): If we think about the situation about 20 years ago when organizations started really making use of data that they had, often a lot of the tools, technologies, everything was within one entire end-to-end suite. So companies that provided the database technologies, for example, IBM, they also provided the ETL applications. They also provided the visualization applications. They provided the full value chain.

And I think over time, as the space got more and more important, you started seeing many small companies playing in very niche parts of the data value chain. So you have companies dedicated to supporting just unique ways of storing information, unique ways of retrieving it, unique ways of processing it. And competitive advantage started emerging at several points along this data value chain.

And so the fact that data became important meant that the entire market flocked towards making solutions for the space, which meant that enterprises suddenly had to step back and think about questions they didn’t have to before. Which combination of technologies should I use? Where should I do what? Because now you had data in one place, you had certain other capabilities in a different place. So do you move the data? Do you move the capability? These types of questions started exponentially growing, essentially, given that market change.

And so I think that that’s really what drove the importance of the topic. And of course, now with AI, generative AI, that requires a lot of unstructured information. You don’t think of just data architecture. You start thinking of knowledge architecture, information architecture, how do you make sure the right information is fed to these models? And so the problem is only growing and it’s growing very fast.

Speaker 2 (Philipp): Makes a lot of sense. You’re already mentioning that there are different components of data architecture. There’s often confusion between infrastructure, data pipelines, governance, and so on. Could you help us make sense of the key layers of a modern data stack? And maybe for each level, what matters the most?

Speaker 3 (Nikhil): To understand this word “architecture” in the context of data simply, I would just break it down into two aspects. The first is the more static aspect, which sometimes is referred to as the data technology architecture. It is what you need to put in place in terms of tools, components, which vendors you use to provide capabilities from data ingestion, storage, processing, curation, eventually serving that data and eventually consuming it as well. So that’s the static aspect—what are the capabilities you provide at rest.

The second aspect of data architecture is really around the data flows. So how does data flow in an enterprise from where the information is created all the way to where it’s consumed, and what are the different steps that happen along the way, ensuring that there is some level of control and clarity in where it should be completely standardized and where it should be allowed to vary depending on the use of the data.

So I would say there’s a static part, there’s a dynamic part, and the important part within each is effectively to provide guidelines. It’s to provide guidance on how these patterns, these technologies can be applied in the organization at scale, which is why beyond providing guidelines, a successful data architecture almost goes into becoming very easily applied by the product team or the individual that’s actually building something in the organization.

Speaker 2 (Philipp): So you already mentioned that one mark of a good data architecture is ease of use. Are there other principles that you would say good modern data architecture should have, or what should it look like?

Speaker 3 (Nikhil): Definitely. Ironically, the word architecture also talks about something that’s very constant, right? When you build a building, you build it to last 20 to 30 years. However, for data architecture, one of the qualities is actually modularity and flexibility. So how do you build different parts of the data pipeline—from acquisition all the way to consumption of the data—in a way that if there’s a disruption in one of those components along the way, if tomorrow we realize there’s an entirely new way of storing information, or perhaps with quantum computing, there’s an entirely new way of actually building processing pipelines, we do not necessarily have to break the entire thing. We can really just switch it. So I would say modularity, flexibility is one.

The other aspect that is often ignored is actually the human angle. So the ease of use, like you said, how do you really make the architecture implementable? Not just PowerPoint slides that describe what to do, but really, how do you encapsulate it as code, as reusable modules, procedures that development teams can easily just pull from a repo and integrate into their code base. I think the more practical and tangible you make it, the better.

And another quality, of course, is that it needs to be observable. And this is already applied at the data level, but I would say even at the architecture level, you should be able to tell at an organizational level which parts of my data architecture are incurring the highest cost, which ones are growing the fastest, and where the leverage is perhaps decreasing when it should be increasing. I think this kind of observability and transparency is also critical.

Speaker 2 (Philipp): And you briefly already touched on the business goals of the organization. How should the data architecture align with the company’s data strategy or business goals? And where do you see typically disconnects there?

Speaker 3 (Nikhil): I think this is definitely an area where there is not enough clarity in an enterprise setting. For simplicity’s sake, and I of course recognize there are exceptions, I would say the business strategy or even the data strategy describes what to do with data. So firstly, why do we need data? What kind of business objective will it help us achieve?

Speaker 3 (Nikhil): What data are we talking about here that we feel is really our competitive advantage for the enterprise? So it’s really answering these types of questions. What kind of data, what to do with it? Who to do it for? How’s the customer or the end user going to benefit? I would expect this to be answered in the data strategy, business strategy, which go hand in hand, of course.

The architecture really focuses on how, and I think here it is really making sure that it’s built in the most effective, efficient way, so that you reach your goals with the optimal amount of resources and the highest quality. And it also provides that perspective of how to make the tradeoffs, because, for example, you have three things: cost, time, and quality. You can’t have all three perfectly done. You can’t have lower cost, higher quality, and speed all at once.

And being able to have a guide on how to make decisions where one thing is traded off, for example where you incur technical debt or you have project delays, this is for me what architecture should provide crystal-clear clarity on. That way an organization is not stumbling into what it builds, but rather it is very consciously, intentionally walking into it.

And I think whoever’s building the architecture needs to be very well-versed with the business strategy and the data strategy. And the moment architecture becomes highly generic, where you can take one organization’s data architecture, switch the name, and it becomes another organization in a different industry’s data architecture, that’s when you know it’s probably not going to work.

Speaker 2 (Philipp): What would you say are the biggest misconceptions about modern data architecture that you see in companies?

Speaker 3 (Nikhil): It’s a good question. I think it also depends, of course, on the industry. But the biggest misconception I have come across is that extreme abstraction is always going to make your architecture better. I think that there’s a tendency for architecture to sometimes get overly theoretical. And we definitely want to ensure we make pragmatic tradeoffs.

So how do you make it pragmatic? There might be a specific part of the architecture, like your storage solution, where it is okay not to have all the flexibility through abstractions, all the modularity, where you do double down on specific technologies, specific types of storage patterns. For example, if you want to store all your data as Iceberg tables or as Parquet files, if that’s a decision you’ve made for now, you can go with it. You don’t have to build it in a way that you are always non-committal about your decisions.

I think that’s for me the big misconception. It is okay to commit to something. It is important to recognize where the commitment actually benefits you and where it could become a cost, and it doesn’t have to be in the entire flow of data—it has to be specific.

So for example, if you take an industry example like in the life sciences sector, especially in R&D, you would like to give a lot more sovereignty and freedom of choice for consumers to be able to actually explore datasets, explore real world data, to explore clinical trials that are available, to explore data that they also acquire in different ways and forms. Different teams may have a different way of looking at it, and so it’s difficult to force them all to use one application.

So diversity is completely fine there, but there’s no point in trying to build the most perfect storage layer, for example, that tries to remain neutral. It is okay to maybe double down on one specific technology and say, “Okay, we’re going to go with Amazon S3 for all our data. It’s going to be stored as files, and we’re going to make sure that’s the common way through.” I think the misconception that the architecture has to be somehow perfectly modular in every angle leads to a lot of unnecessary work.
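To make that kind of commitment concrete, here is a minimal sketch in Python of what a single agreed storage pattern could look like: every team writes Parquet files to S3 through one shared helper. The bucket name, region, and folder layout are illustrative assumptions, not anything prescribed in the conversation.

    # A minimal sketch of "committing" to one storage pattern: Parquet files on S3.
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.fs as pafs

    def write_dataset(table: pa.Table, domain: str, dataset: str) -> str:
        """Write a table to the agreed layout: s3://<bucket>/<domain>/<dataset>/."""
        s3 = pafs.S3FileSystem(region="eu-west-1")           # assumed region
        path = f"example-data-platform/{domain}/{dataset}"   # hypothetical bucket and layout
        pq.write_to_dataset(table, root_path=path, filesystem=s3)
        return f"s3://{path}"

    # Usage: every team writes the same way, so downstream consumers only need to
    # know the convention, not each team's tooling choices.
    orders = pa.table({"order_id": [1, 2], "amount": [10.5, 7.25]})
    write_dataset(orders, domain="commercial", dataset="orders")

The point is not the specific libraries; it is that the decision is encoded once and reused, rather than being re-debated by every product team.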

Speaker 2 (Philipp): Makes sense. You already touched on the life science space. Of course, GenAI is a big hype or a big trend in the life science sector, especially for pharma, especially for R&D. How have you seen in the past one or two years how that influences decisions on data architecture, or what are the requirements to be ready for the future of GenAI?

Speaker 3 (Nikhil): I think the interesting thing about GenAI is that it has achieved a level of visibility at all levels, from board to developer. And in terms of data architecture, the realization is that no matter what techniques you may have at the base of it, if you cannot make use of the proprietary information you have, the information that actually gives you competitive advantage and much more tailored insights for your own business, then the kind of benefit GenAI will give you is the same it’ll give any company.

So if we focus only on what gives you competitive advantage, I would say the biggest area that architecture is currently grappling with is how you provide the right endpoints for data to be accessed and injected into prompts, injected into workflows that interface with LLMs. So how do you build the right context? How do you also use the data models you’ve been building, which have metadata, to help the GenAI application understand your business a little better, because the data model already has entities and relationships set up? And I think this interface between your structured data and the very unstructured context of a prompt, this is one area.
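As a rough illustration of that interface between an existing data model and a prompt, here is a small Python sketch that renders entity and relationship metadata as context for an LLM call. The entities, fields, and prompt wording are invented for the example; the actual model endpoint is deliberately left out.

    # Illustrative data model metadata: entities, fields, and relationships.
    DATA_MODEL = {
        "Patient": {"description": "A person enrolled in a study",
                    "fields": ["patient_id", "birth_year", "primary_condition"],
                    "related_to": ["ClinicalVisit"]},
        "ClinicalVisit": {"description": "A single interaction with a care provider",
                          "fields": ["visit_id", "patient_id", "visit_date", "pulse_rate"],
                          "related_to": ["Patient"]},
    }

    def model_context(entities: dict) -> str:
        """Render the data model as plain text the LLM can condition on."""
        lines = []
        for name, meta in entities.items():
            lines.append(f"{name}: {meta['description']}. "
                         f"Fields: {', '.join(meta['fields'])}. "
                         f"Related to: {', '.join(meta['related_to'])}.")
        return "\n".join(lines)

    question = "What was the average pulse rate per patient over the last year?"
    prompt = (f"You answer questions about our data.\n{model_context(DATA_MODEL)}\n\n"
              f"Question: {question}")
    # 'prompt' would then be sent to whichever LLM endpoint the organization uses.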

And of course, the broader question is also, how do you think about unstructured data that you currently have? Like just information in documents, information in PDFs and PowerPoint slides. How do you make this part of also the data architecture or rather the knowledge architecture going forward? And I think that there is no clear approach to this just yet. It is still something companies are grappling with.

Speaker 1 (Damian): Another topic that many teams are stuck on is the debate over centralization versus decentralization. So we’re talking about data mesh, data warehouses, and so on. How should organizations think about that tradeoff? What’s a healthy way to approach this?

Speaker 3 (Nikhil): I’ll say something controversial, and perhaps it doesn’t make the final cut. With data mesh as a term, while the concept was very attractive and elegant, and was described in a way that just made sense for enterprises that were decentralized and complex, I fear that the word and the term have created more confusion and chaos than good so far.

And I’ll say this because it became a question about decentralization versus centralization. And the fact of the matter is the easy answer is it always depends. There are some things that you’d like centralized, there are some things that you would like decentralized.

So for example, if there is some data that you’re collecting that has very high value, for example customer touchpoints, or sales reps and their interactions and the kind of information they’re collecting, then for the data domain that would house that kind of information, you would want to create as much standardization as possible around it. You would want it to be common across various parts of your business, because at the end of the day, it is the customer touchpoint. So you may want to look at the customer at the center, regardless of what kind of business you have. To do that, centralization may be completely fine, in terms of the data domain, the expertise, and how the information is fed to the whole organization.

Because when it’s centralized, you can also control it. The word centralized also triggers allergic reactions in many organizations because it gets translated to bottleneck. And that’s where I would argue that federation or rather decentralization, it works best when you recognize that there will be a natural bottleneck that gets created.

So for example, if you end up having everyone who works with data in a central team, all your data capabilities, you know that that will create a bottleneck, because a lot of the advantage comes when your data practitioners are deep in the business context. So if someone is working in the R&D space or in the clinical space, the closer they are to the domain knowledge, the better it is, even if they have a background as a data engineer or data scientist. In these situations, when something is centralized, requirements just get thrown back and forth.

So I would argue that when we think about this debate between centralization and decentralization, let’s really focus on how do we want data and knowledge and expertise to also flow.

Speaker 3 (Nikhil): So there’s the benefit of having expertise at the edge, you know, within the grassroots of the organization. But there’s also the other angle, the flip side where you want to control the variability. If you think about it from a lean perspective, how do I reduce variability in a very specific part of our value flow? So definitely a complex question, but it really depends and both should be looked at without any emotion attached to it.

Speaker 2 (Philipp): Makes a lot of sense. Maybe another even more open question. So in the tech space, especially in the data tech space, there’s obviously a lot of noise that we are constantly exposed to. What is your personal approach to separating maybe the signal from the noise, like you said in the beginning, or the signal even from the hype where we talked about GenAI for instance already? I think everybody still has this question on their mind, how much of it is hype? Where is it actually, where is practical utilization already in the topic? So what is your experience or what is your approach on this?

Speaker 3 (Nikhil): The approach I personally take is one that people won’t find unfamiliar. It is firstly understanding, as an organization, what the types of data are. A data map of this in itself is a great starting point. And I would say once you have something that is more or less 70 to 80% correct, or 70-80% representative, the important thing is to then draw a line and say, I’m not going to try to make this 100%.

A few colleagues I’ve worked with have given me an interesting quote, which is, “All models are wrong, but some are useful.” And so I really follow that. And getting something at the 80% mark means you can start enabling the business objective itself. I need to be able to articulate that with business language. I think that’s number one.

Second, what are the technologies and innovations that are in flux within this capability? So for example, in data storage, if I even had to rewind three to four years ago, the innovation would be that storage solutions were moving away from your black box databases to also file format types of storage, which is the Delta Lake idea that also Databricks is pushing. So using Iceberg files, Parquet files, like it has been in the Hadoop universe.

So knowing that things are moving in that direction, I can also keep myself updated on the kinds of technologies coming up that play to these trends. I should understand the trends emerging within that capability area and of course the vendors, and also recognize how to evaluate them proactively rather than necessarily having to do some sort of POC for each and every capability, which takes up a lot of time and resources.

Then the third thing I would do to also figure out the signal from the noise is then what is good enough, right? So another quote, “Perfect is the enemy of good” or “of great.” So what is the right level that you are getting towards? So almost thinking in terms of service levels, if I’d like to call it, for this capability to be offered, what should I choose? And what is okay as technical debt that I incur?

So being aware of it as well. I think half the time organizations pick solutions, pick design patterns, integration patterns, like, “Oh, we’re going to go for an event-driven architecture” with the silver bullet mentality. But being very honest with yourself and saying, “You know what, this is great in 80% of the cases, but here’s the 20% that will not work as well.” And being aware of that, I think also helps to dehype the signal itself and make it a lot more real.

So I would say, just to recap: effectively, it’s know the capability’s connection to business value; ensure you understand the trends and how the market’s moving in the space, so that you identify leapfrog opportunities to solve problems you couldn’t solve earlier; and last but not least, identify the extent to which the capability needs to be implemented or delivered, recognizing that you have limited resources and have to be able to prioritize.

Speaker 1 (Damian): We’ve talked a lot about data and technology now, but of course there is also a very important human component, so architecture, data, analytics, and so on are only as good as the people behind them. What would be a good mindset and capability shift that organizations need to build around data today?

Speaker 3 (Nikhil): Firstly, let’s work backwards from what a good organization would look like with a successful architecture that’s actually implemented.

Speaker 3 (Nikhil): I would say it is product teams that are able to very rapidly reuse design patterns, integration patterns components so that they can focus most of their time solving the problems that really require their expertise. So moving away from what we call the pre-work and actually getting to the actual work.

At the moment, we all know that a data scientist will spend 70-80% of their time on data cleaning and data prep. It’s the same thing for architecture. So if we work backwards, we want everyone to have it so easy that they can pull integration pattern code, templates, and snippets, and integrate them into their pipeline to figure out: how can I pull together diverse sets of data? How can I bring them together and process them, without having to reinvent the wheel every single time?

And I think if we then work backwards and ask what an individual needs to have as a mindset, the first thing I would identify is actually building with the mindset of reusability. So if it takes X amount of effort, say 10 hours, for you to build a module that works for your data, it may take you three more hours to have another conversation and then build your code in a slightly more generalized fashion, recognizing that there is an ROI in doing that at the right moment.
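A small Python sketch of that trade-off, with invented column names: the first function solves only one team’s problem, while the second takes slightly longer to write but can be pulled from a shared repository by anyone.

    import pandas as pd

    # Version 1: works only for this team's file and columns.
    def clean_my_sales(path: str) -> pd.DataFrame:
        df = pd.read_csv(path)
        df["order_date"] = pd.to_datetime(df["order_date"])
        return df.dropna(subset=["customer_id"])

    # Version 2: a few extra hours of work, parameterized so any team can reuse it.
    def clean_table(path: str, date_columns: list[str], required: list[str]) -> pd.DataFrame:
        """Parse the given date columns and drop rows missing required keys."""
        df = pd.read_csv(path)
        for col in date_columns:
            df[col] = pd.to_datetime(df[col])
        return df.dropna(subset=required)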

So it is also knowing when to invest the time to make something more applicable, or rather more adherent to the architecture that the organization has in mind. That kind of awareness is probably the most important thing: knowing whether it is okay to spend a little longer, but do it the way that is recommended by, for example, the reference architecture that’s been set up by the architecture group in the company.

And of course, there’s the flip side, the people who are actually building it, they need to have a customer-facing mindset, if you think about it. If I’m an architect or a group of data engineers, I want to make sure that whatever I’m building, be it guidelines, be it usable, reusable code, it needs to be so seamlessly and easily understood and applied. The more I think of other product teams as internal customers, I think that mentality switch will actually make it adopted, and then you have the flywheel kind of working because the more it’s adopted, the more momentum it also gives you. It’s the awareness and then it’s also the customer-centric thinking.

Speaker 1 (Damian): If we stay on this topic of the human component a little bit and to make it a little bit more concrete also in terms of a structural organizational view, what do successful organizations typically do? How do they typically structure the data architecture capability, like who owns it, who governs it, who drives it, are there some best practices that you have seen over the years, what’s your take on that?

Speaker 3 (Nikhil): I think there are a few different ways of doing this. So, for example, in many organizations you actually have architecture as part of IT, part of the digital organization. And in some organizations there are also separate chapters called architecture, which I have found can be challenging, just because you then have one part of the organization trying to tell a different part of the organization what to do.

So if I’m in the architecture chapter, I might say, please use event-driven architecture and use this specific technology, and this is the standard we use. And the engineers sit somewhere else and don’t necessarily feel like they’re on the same team. It ends up feeling like one person is telling another what to do, and the other thinks the first doesn’t actually know better and that they themselves would know better what to do.

So my personal view, and where I’ve seen it work really well, is when architecture as a capability is developed in individuals as they grow out of an engineering, implementation role.

Speaker 3 (Nikhil): So the most successful architects are those who have really built things, who have been very much involved in the product. They don’t always have to be software engineers; of course it helps, but they’ve been very close to products, actually part of a team, and have seen the impact of that. And then they’ve grown into a role where they broaden their focus: going from one product to two products, two products to four products, or five processes to ten processes. That’s the most successful way of scaling the architectural mindset.

But of course, at the end of the day, that does not work in many organizations, just because of their size and also because of their origin story. So I would imagine that the best way to go about it at the moment is, even though there may be separate chapters, to really intentionally invest in bringing them together in the product teams, and to make the product team, where developers, architects, and business product owners are all collaborating, the first level of identity an employee has.

So if you ask an employee “what do you do,” they say “I’m part of product team X” and not “I am in the architecture chapter.” And this mindset is very difficult to switch, because people often identify with their reporting lines rather than with what they work on. So even in product organizations, or organizations that recently switched to that model, I think making this switch in mindset is probably where investment should go, and that really requires bringing people together. It is a people issue to resolve, so ensuring there’s trust between the different groups, and also recognizing what architecture is at that product team level, I would say is really the way to go.

Speaker 1 (Damian): We already know that data architecture and data topics have found their way into board discussions and executive-level discussions. And of course, everybody’s talking about ROI from data investments there. But how should we really be measuring the impact of data architecture and tech? What’s a smarter or different way to think about the value?

Speaker 3 (Nikhil): There is no clear cut answer simply because when data architecture is invested into, it is so fundamentally enabling that it is difficult to directly attribute value to it. An analogy would be something like, here’s a highway, and can I really figure out what part of my GDP is because of this highway? It is difficult to do that. Of course you have proxies. You can say there are this many cars that go on the highway that goes to this part of the city and you can use proxies to almost estimate what kind of value a change in architecture brings.

But all in all, it is still quite abstract. And so I would say that the most important thing is at least telling the ROI story. Because at the end of the day, if there’s a highway, nobody questions whether that highway was important or not. Everyone knows it’s important. And it’s important to give people that feeling, because the numbers are often a proxy to give people that feeling. But if people are just using it, they know it’s important; nobody questions the ROI of the laptop they’ve been given by IT. So I think we need to really dream about a future where that’s also what it is like for data architecture.

And of course, the way to go about it is to make sure you work value-back. So you have the use case, you have what’s driving value, and you’re able to decompose that into what you need to build. You’re not just building something and expecting that one day what you’re building will be used. So ensure that whatever you build is connected to an initiative that’s connected to a budget, that has a specific business objective and a business case behind it, I would say.

The flip side of that is, of course, that if you asked everybody to invest in some common goods, like public resources, there’s no direct incentive for individual product teams. So you do need centralized teams for this. So I would say one way of doing this is also to ensure that a central team also does have some of the budget to actually provide these base capabilities that are needed that are used by product teams that themselves don’t have an incentive to invest in it.

So it is a bit of an economic approach, but that way there is at least a budget allocated and something delivered against it. One way of benchmarking is also to ask: if you had let a product team build this up completely on their own, using their own AWS or Azure accounts and directly paying for vendor commercial licenses, what would that have cost? And what is it costing your central team to provide? And is there an economies-of-scale argument there as well? I think this kind of thinking could help to build the ROI story.
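As a toy illustration of that benchmark, with entirely made-up figures: compare what twelve product teams would spend building the capability on their own against a central platform plus a small marginal cost per team.

    teams = 12
    per_team_cloud = 40_000      # assumed annual cloud spend per standalone team
    per_team_licenses = 25_000   # assumed annual vendor licenses per standalone team

    central_platform = 300_000   # assumed annual cost of the central platform team
    per_team_usage = 8_000       # assumed marginal cost per team on the shared platform

    decentralized = teams * (per_team_cloud + per_team_licenses)   # 780,000
    centralized = central_platform + teams * per_team_usage        # 396,000
    print(f"Economies-of-scale saving: {decentralized - centralized:,}")  # 384,000

The numbers themselves are illustrative; the structure of the comparison is what carries the ROI story.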

So just to summarize, you start by recognizing that there are some capabilities that individual product teams will never be incentivized to build. You ensure you have a central team for that. You ensure a budget is allocated. You measure proxy KPIs and report them where possible, because at the end of the day, it’s the feeling of value. But at the same time, you can also benchmark against what would happen if there was no central provider, what would it cost individual teams to have done that on their own? And being able to tell that story, I think would probably be a way to justify the ROI and also track it.

Speaker 1 (Damian): Makes a lot of sense. We’ve talked a lot about your experience, which is a bit of a perspective on the past and the present, but maybe let’s shift a little towards the future. If there is a data leader today and you could give them one piece of advice, what would it be?

Speaker 3 (Nikhil): I would say simplify and make data architecture accessible, meaning make the topic accessible. Use simple English. Don’t use jargon. Don’t isolate individuals. Make it a topic that even business users and people in the domain want to know about and want to understand. Just like how Microsoft was able to do that with typing on a keyboard or with their PCs or with Excel and making everyone comfortable to do something with numbers, I think architecture needs to adopt that principle. At the moment, it is still very much seen as a very technology discussion and it doesn’t mean everyone needs to spend their cognitive capacity on it, but it’s helpful if everyone understands its place at least. That doesn’t take too much.

Speaker 1 (Damian): I would like to take a brief moment to tell you about our quarterly industry magazine called The Data & AI Magazine and how you can get a complimentary subscription. My co-host on the podcast Philipp Diesinger is a regular contributor and the magazine is packed full of features from some of the industry’s leading data and AI practitioners. We have articles spanning deep technical topics from across the data science and machine learning spectrum, plus there’s careers advice and industry case studies from many of the world’s leading companies. So go to datasciencetalent.co.uk/media to get your complimentary magazine subscription. And now we head back to the conversation.

So Nikhil, if we go back to your background in life sciences, are there any use cases that you can perhaps take us in a deep dive now that would illustrate some of the philosophy and points you’ve shared?

Speaker 3 (Nikhil): Yeah, so I think the role of data architecture in the life science sector is quite significant and it actually applies across the entire value chain. So if you think about it from R&D, so how do you bring together lots of different types of data and make it usable as quickly as possible to really answer the types of questions that researchers are having with the kind of unstructured information as well. That’s one area.

There’s also a big push on interoperability and standardization. This is huge, especially in life sciences, where there’s a lot of work around data exchange standards like HL7 and FHIR. How do you bring those into your data flows? If you’re designing something internally, why not already use some of these standards and formats to make it easier, rather than having to build an adapter later on? So this is, for example, another area.

And I would also add, besides the commercial space, where we know very well how data is used, there’s also increasing effectiveness in generating evidence and filing for regulatory approval. There’s a lot of value in being able to ensure you have the right audit trails for how data moves in the enterprise, especially as companies start going into the software-as-a-medical-device space. This level of knowing how information and data travel through various layers of processing is what data architecture makes come alive.

And so I think there are lots of areas to do all of the above while maintaining a healthy cost perspective in the life science sector. If we think about just about any life sciences company, one of the biggest areas of competitive advantage is becoming better at research and development. So how do I take ideas to the market? And how do I balance a very academia-driven approach with a data-driven and technology-driven approach?

Speaker 3 (Nikhil): I think this is one area where data architecture specifically can be quite impactful for a life science organization. If you imagine the whole space of developing different types of solutions that require medical data, from patient-facing support systems to clinical decision support systems, in all of these it is highly critical to get it right in terms of how data flows, but also to ensure that the data that is seen and used has a level of authenticity and trust behind it.

So if you take a step back, the kinds of data we’re working with here range from real-world data that you can purchase, or get from, for example, healthcare providers and hospitals, especially EMRs and EHRs. How do you take that information, combine it with very structured types of information, build the right models around it, and then provide it in a way that different teams can use to drive innovation in drastically different spaces? And I think data architecture there is less something that gives you an offensive advantage; rather, it should reduce the resistance and friction so the entire research and development process can flow through.

So how do you, for example, build the right integration patterns to interface with external data APIs? All the data sets that you’re buying that you’re purchasing are probably made accessible to you via APIs that you need to call, and you’re often also bound by contracts that require you to report how often these data sets are used. So if you are using a specific data set 30 times, it corresponds to a certain cost for you. However, if you are not able to report on that, then the entire commercial model that you even can negotiate with the data providers will change. Because then they will naturally have to charge the company something higher because they don’t have a sense of how it’s being used. So they would probably have to be more conservative on the estimate.

So being able to also just acquire data in different forms with the right types of APIs and record usage in itself is a huge step forward. And a good data architecture is needed because across that architecture, you would also apply data observability principles. So how is my data coming in? When is it coming in? How fresh is the information? How is it stored? How big is it? Who is actually consuming this information going forward? What kind of projects do they belong to? How are they then using it? Are they integrating some of these datasets directly into some of our products or into some of our tools?

I think this entire space of what technology to use, do you store them in Snowflake, do you store them just in an AWS account, is where the organizations that tend to be more successful are the ones that go for leaner solutions that really focus on four to five integration patterns. And by integration patterns, I mean: if there’s a specific data source being shared in a specific format with a specific API, you recognize that and you build a pipeline that anyone can integrate with to draw on that specific source.

And you create five or six acceptable integration patterns that you say as an organization, this is how we get external data. If there’s a way that isn’t covered by these five, please talk to the team and we can help you with that. I think this level of control is also required, because the moment you lose that control, it becomes very difficult to start tracing data and having lineage around it.
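A minimal Python sketch of the two ideas in this part of the conversation: a small, fixed registry of approved integration patterns, and every call against a licensed external data set being recorded so usage can be reported back to the provider. The endpoint, dataset names, and the logging sink are assumptions for illustration.

    import datetime as dt
    import json
    import urllib.request

    USAGE_LOG = "usage_log.jsonl"   # hypothetical sink; in practice a table or event stream

    def record_usage(dataset: str, project: str) -> None:
        """Append one usage event so contractual reporting is possible later."""
        event = {"dataset": dataset, "project": project,
                 "timestamp": dt.datetime.utcnow().isoformat()}
        with open(USAGE_LOG, "a") as f:
            f.write(json.dumps(event) + "\n")

    def fetch_json_api(url: str, dataset: str, project: str) -> dict:
        """Approved pattern: pull a JSON payload from a licensed external API."""
        record_usage(dataset, project)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # A handful of approved patterns; anything outside this set is a conversation
    # with the platform team rather than a one-off build.
    INTEGRATION_PATTERNS = {"external_json_api": fetch_json_api}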

Speaker 3 (Nikhil): So I would say a lot of the value comes from acquisition in the pipeline, and the second pool of value comes from how data is consumed. So what kind of tools can you actually provide an organization to look at the patient data?

For example, by having multimodality information, genomic information, their medical history, their diagnostic tests, how do you bring them all together to actually provide that kind of realistic view. And this is also an area where data architecture is very important because this goes a lot more into the actual data itself.

So what are the links between the pieces of information? How do you ensure that you can link different points of data to one object? And what kind of tools can you provide to the end user to explore this information? The classic example is you combine a data set, provide a table with filters, and let whoever wants to filter on the master data set do so. But I think recognizing the kinds of questions your users would have allows you to support those journeys. In these situations, the successful companies have always taken a more tailored approach: identifying personas and then really building up that link between all these different types of data, especially in the R&D space.

Speaker 2 (Philipp): Nikhil, you mentioned different stakeholders before. Did you want to go into that a little more in the life science space?

Speaker 3 (Nikhil): In general, what I’ve seen in the life science space is that there is a need for data architecture innovation: there are dramatically different types of tools and different types of data needed across the organization, which means you also need different types of technologies and integration patterns underneath. But at the same time, technology and IT are still seen as a cost bucket that needs to be managed.

And also, you’re looking at return on investment, because the more technologies you have, it also means data gets siloed very quickly. And so another area that is often important for, especially life science companies to look at is, where do they draw the line in terms of how much variability in the data architecture they have, especially in storage solutions, especially in the data acquisition efforts, and where do they allow this freedom?

I would say that this is the one other area where I’ve mostly spent my time on because it very quickly balloons to a very large bill for the IT organization. And when you combine that with the inability to directly link value to it, it becomes a situation where organizations cut costs in technology because they don’t know what value it has. And it actually does have some sort of value, but they’re not able to talk about it. And it can dramatically affect the company in terms of their capabilities, you know, whether they’re doing commercial excellence, whether they’re trying to improve their pipeline of drug discovery. These things have large impacts, but they’re so far away currently that I think it is critical to bring these two worlds a bit closer.

Speaker 2 (Philipp): Yeah, and in the life science space, of course, we have vastly different types of data sets. If you think about research data, like omics data, you have clinical trial protocols, you have actual experimental data from experiments being conducted on active ingredients, and so on. What kind of role does that play, in your experience, for the data architecture, and how can that complexity be managed?

Speaker 3 (Nikhil): I think especially when you have such diverse, multi-modality data, and dramatically different sizes as well, you end up with different technologies. Here, the biggest advantage is to ensure that you really have good, abstracted data APIs, even for consumption within the company. So if I’m consuming imaging information or clinical trial information, how do I also have the appropriate metadata around it that describes exactly what this data contains, what its limitations are, under what context it was collected, and under what context it can be used, depending on the agreement.

And I think this kind of metadata is key if you want to automate data pipelines, or even if you want to bring about computational governance. So this is a key capability, for example, where you have agreements, you’re dealing with very sensitive information, healthcare information, and often data is collected with a very predefined purpose that, for example, we are collecting this information in order to research this specific disease or this specific condition. And it is then not directly clear, can I also use that information to look at something else in the future?

And these kinds of agreements, whether made in the past or not yet made, also need to get to the granularity where the legal contracts you sign with institutions, individuals, and organizations about the use of data are somewhat translatable and depicted as code, in a way that can automatically influence pipelines downstream where you actually have to implement and enforce that governance.

So for example, if this data set is only allowed to be used for this kind of R&D, then it needs to almost show up at the data architecture level that only someone from this specific part of the organization, because they’re working on this project, can access this information during this period. And the day the project ends, that access is revoked, and all of this is done automatically, which is not the case yet. It is still, I would say, quite hybrid. So this computational governance, because of the multimodality of the information combined with its sensitivity, is I would say the biggest problem a lot of these companies are trying to solve today.
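A minimal sketch of that computational governance idea in Python: the terms of a data use agreement expressed as data, and access checked (and automatically expired) in code. The agreement fields, dataset, and project names are invented for the example.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class DataUseAgreement:
        dataset: str
        allowed_purpose: str        # e.g. the specific condition the data was collected for
        allowed_projects: set[str]
        expires: date

    AGREEMENTS = [
        DataUseAgreement("oncology_ehr_extract", "research_condition_x",
                         {"project_alpha"}, date(2026, 6, 30)),
    ]

    def can_access(dataset: str, project: str, purpose: str, today: date) -> bool:
        """Allow access only within the agreed purpose, project, and time window."""
        for a in AGREEMENTS:
            if (a.dataset == dataset and project in a.allowed_projects
                    and purpose == a.allowed_purpose and today <= a.expires):
                return True
        return False   # access is denied, or revoked, once the agreement or project ends

    assert can_access("oncology_ehr_extract", "project_alpha",
                      "research_condition_x", date(2025, 1, 1))
    assert not can_access("oncology_ehr_extract", "project_alpha",
                          "research_condition_x", date(2027, 1, 1))

In practice the agreements would be derived from the legal contracts described above and enforced inside the pipelines themselves, but the shape of the check is the same.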

Speaker 2 (Philipp): You’re of course describing a very complex scenario for managing the data from a compliance perspective or from an organization perspective. If we look at the other end of the spectrum where we have a data consumer who has to also navigate then not only a very complex data catalog, as you said, different types of data sets, regulatory and compliance topics, and so on.

Could this be an area or a use case where GenAI could help in the future for such stakeholders? Let’s say a researcher maybe at the pharmaceutical company who needs to have access, basically, for the daily work to all of these different complex data sets. Could GenAI be something that helps with this in the future or is there something that you would say is more like a hype at the moment and will not realize in a solution?

Speaker 3 (Nikhil): I think GenAI has immense capability here to actually help the space, because a lot of the issues are around how you process the right types of information in a very complex setting, recognizing there are legal guidelines, ethical guidelines, and contractual guidelines that you want to make sure are respected. And it also interfaces from the legal space to the system space, where the information actually becomes bits and bytes.

Through a set of questions and a conversation, you can at least determine what kind of use the person is thinking about, what kinds of modalities are involved, where those data sets actually sit, and which ones are bound by certain rules. And this is where the ability to deploy agents can also make sense, because when you really want to get this kind of guidance, you need clarity that’s fed into the model as context it can base its analysis upon; or, if it’s a RAG-like retrieval approach, you need to know exactly where to retrieve the guidance from.

So the logic to evaluate is sometimes something that may need to be deterministically encoded somewhere for it to be usable. And that requires individuals to identify, or rather create, what is effectively labeled data for this kind of application.

Speaker 3 (Nikhil): So: if this was the scenario, this was the data, this was the user, and this is what they wanted, here is the kind of guidance the AI should provide. With that level of labeled information you have a bit more certainty. But still huge potential for sure, and an area to definitely explore.

Speaker 2 (Philipp): I want to double click a little bit more into this topic. You already mentioned that there will be an agentic, basically, dimension to leveraging GenAI. Another one is, of course, that all of these organizations have lots and lots of unstructured data that is already available and could be vectorized and embedded to utilize it and to better also navigate it. So to increase the utilization, maybe even significantly, that includes research papers, but it includes also large, you know, 100-page clinical trial protocol documents, for instance, that are rich with insights and with value if you can just extract the right parts of it at the right time.

So given that there is a very specific use case already that is tangible and solvable, how do you see this capability of being able to vectorize and embed such huge amounts of unstructured data? Is that something that you see as part of a data architecture? Does it have a place there? Is it more like a commodity that you get anyway by subscribing to standard technologies? How would you see this evolving in the future?

Speaker 3 (Nikhil): Yes, so I think definitely, like you mentioned, vector databases, chunking, indexing, creating these embeddings in multi-dimensional spaces is the first step. And the moment that’s done, I think the architecture is still limited by how you ensure that your data sources can be accessed via APIs, via programmatic calls and protocols. You still need that so that, for all the different islands of information, there’s a very consistent, standardized way of interfacing with them.
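As a rough sketch of that first step in Python: chunking documents and building an index of embeddings with enough metadata to trace an answer back to its source. The embedding function is a stand-in for whichever model or vector database the organization actually uses.

    from typing import Callable

    def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
        """Split a long document into overlapping character windows."""
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + size])
            start += size - overlap
        return chunks

    def build_index(docs: dict[str, str], embed: Callable[[str], list[float]]) -> list[dict]:
        """Embed every chunk and keep metadata so answers can cite their sources."""
        index = []
        for doc_id, text in docs.items():
            for i, piece in enumerate(chunk(text)):
                index.append({"doc_id": doc_id, "chunk": i,
                              "text": piece, "vector": embed(piece)})
        return index

    # The resulting index would be loaded into a vector store and queried at prompt time.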

So when you build that, let’s say around all the different types of data that you need. So you have research papers, of course, but then you also may want your own R&D data that you’ve been collecting based on different experiments. So your experiment database with all the different values that are seen, which might be more structured. How do you then combine the two worlds?

I think these are really the upgrades that data architectures are currently going through, of course driven by use cases and not necessarily bottom-up. How do you also, in some cases, pre-aggregate information? You have lots of data at very transactional levels, so if you take the example of readings coming through, or just observation points, you may want to pre-aggregate certain amounts of data to cater to LLM-based queries that need access to that information. That may be required because we know that large language models on their own are not necessarily designed to work directly with quantitative data; rather, they need to work via different types of protocols and approaches.

Speaker 1 (Damian): Nikhil just changing gears slightly. How have you seen the role of data architects and data engineers evolve in the last decade in terms of what has changed and what has stayed the same?

Speaker 3 (Nikhil): In the last 10 years or so, the importance of software engineering has been growing for data engineers and data architects. They need to understand how software systems work. In the past, I would say data engineers and architects mostly worked within databases and database paradigms, SQL to some extent; they worked with visual ETL tools that they could drag and drop, and with large applications like Informatica and Talend that are still, of course, relevant.

However, data processing pipelines are now being written in code, and that means knowing how to work with version control, knowing how to write modularized functions, knowing how to comment code. This capability has become paramount. And so this is, I would say, the biggest change, number one.

The second change I would highlight is that the level of business understanding and ambiguity also means that data engineers and architects need to become better at asking questions. In the past, they mostly received requirements.

Speaker 3 (Nikhil): So back then it was called an ETL developer. They received requirements. They said, “I need a table that tells me for each patient, what was their average pulse rate over the last one year, or what was their lipid level measurement over the last one year.” And then you would just write the code, you would aggregate it, deploy it, they have the value, and then either they trust it or they don’t.

Now the issue is the conversation is two-way. The business wants to know what more they can do with data, which means they need to also understand what’s inside it. It’s not just a one-way request fulfillment. So it means it needs to become a dialogue. And I think this is the second change.

What has remained constant, though, and which I think hasn’t kept enough focus in the engineering and architecture community, is the importance of data modeling. When we had Hadoop and data lakes and all of that, like 10 years ago, everyone thought, “Okay, this is the end of data modeling. I don’t need to do data modeling anymore. I can just do schema-on-read. I can just build my schema on the fly.”

I think this has created a lot of problems in companies. There’s immense technical debt and many data engineers and architects have forgotten the importance of data modeling, I feel, and really recognizing that a data model now is a representation of your business and representation of the kinds of questions that give you competitive advantage. And I would also add that a good data model also helps LLMs understand the kind of business you’re running a bit better because they’re not stochastically just working off prior knowledge, but they are also working within the paradigms of your data model/knowledge graph.

And so I think this is one thing that I think has been constant, but we have also not thought of it that way.

Speaker 1 (Damian): Really insightful. In terms of the future, where do you see and how do you see the roles evolving of the data architect and data engineer?

Speaker 3 (Nikhil): I think the successful data architects can still be good data engineers. So when they are coming up with system designs and drawing boxes and lines, they are also able to understand how that translates to code. And I think that confluence of the two roles is something I see happening more and more, because one needs to be able to easily navigate between the design, the pattern, the problem, and the code, especially with coding support tools like Copilot or Cursor and all of these types of things that developers are using.

The divide between architects and engineers will be less relevant, I feel, in the future. Simply because the reason why you have that today is because it was too difficult for someone deep in the code to be at the, let’s say, the six-foot view, to also be at the ten-thousand-foot view. It was just not practical because we didn’t have anything to take us so quickly up and down. We didn’t have that elevator. And I think that elevator is coming and it is becoming more and more important for someone to be able to move between these worlds.

And I think organizations are also realizing that. Some organizations have an explicit policy that they only have architects who code and still code. And I think that also creates a very different type of mindset in terms of being able to walk the talk. So you don’t just draw a line and say, “This is how data should flow,” but rather you have already evaluated: is it actually possible given the limitations of the specific technology? Like, if we use Kafka for this size of messages, does it actually hold? Right now, I think it takes a few ping-pong matches for a design to eventually settle, and in the future that may not need to be the case.

Speaker 1 (Damian): So that concludes today’s episode. Before we leave you, I just want to quickly mention our magazine, the Data and AI Magazine. It’s packed full of insight into what’s happening in the world of enterprise data and AI and you can subscribe for that magazine free at datasciencetalent.co.uk/media. We will be featuring a written version of this conversation in a future issue of the magazine.

So it just remains for me to say Nikhil, thank you so much for joining us today. It was an incredible conversation. We covered a lot of ground.

Speaker 3 (Nikhil): It’s been really great. Thank you so much.

Speaker 1 (Damian): Thank you, and thank you also to my co-host Philipp and of course to you guys for listening. Do check out our other episodes of Data Science Conversations and we look forward to having you with us on the next show.